GDELT: Watching The Entire World Unfold
15th March 2015
I think it's a fairly well-accepted observation that we're in the middle of a cloud computing revolution. The cloud is making businesses more agile, lowering the barrier to entry for start-ups, driving seamless user experiences across devices, and empowering individuals with near-limitless computing resources at their fingertips, completely on-demand. The transformation often feels gradual, though. Costs are coming down steadily but incrementally. New services are introduced, but they are usually subtle improvements on existing services or simply move workloads that are already done elsewhere to the cloud. Still, there are times when I'm reminded just how powerful these cloud computing platforms are, and the extent to which they enable grand ideas to be made reality. Such is the case with the GDELT Project, which is the subject of this post.
The Global Database of Events, Language, and Tone - more commonly known as The GDELT Project - is, according to its creators, "the largest, most comprehensive, and highest resolution open database of human society ever created". GDELT, funded by Google Ideas in collaboration with various universities, media entities, and non-profits, is a vast collection of global events dating all the way back to 1979 (with plans to go back even further). But the really fascinating part is the breadth and scale of the data it continues to collect each day. The project monitors news from all across the globe, in a variety of different formats and in over 100 different languages, and aggregates it all together into a single database. We're not just talking about copying news articles though, as that by itself would not be all that impressive. Instead, the platform takes all that raw data and uses sophisticated natural language and data mining algorithms to create structured data that researchers can work with (close to 400 million geo-tagged records to date). Best of all, the data and tools created by the project are completely free!
The output created by GDELT comes in two primary forms - the events database and the global knowledge graph. The events database records physical activities happening around the world. The data is organized into over 300 different categories. The categories are primarily socio-political in nature - things like riots, military actions, diplomatic exchanges, and so on. For example, I downloaded yesterday's events file (which records over 120,000 events) and skimmed through the top of the listing. Even in the first few dozen entries there was a fascinating variety of records - everything from news related to Tesla's new battery factory to coverage of Julian Assange's trial to the Pope's upcoming visit to Africa. When considering the mind-boggling scale of the data collected just for a single day, and the fact that it's all pre-structured and formatted for computers to process and draw inferences from, it's easy to see how incredibly useful this data could be.
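To give a feel for what "pre-structured" means in practice, here is a minimal Python sketch of tallying events by category. It rests on a few assumptions that are mine rather than anything stated above: that the daily file is tab-delimited with no header row, and that the category code sits at the column position named by the hypothetical constant EVENT_CODE_COLUMN. The sample rows are made up for illustration; check the GDELT documentation for the real layout before pointing this at a downloaded file.

```python
import csv
import io
from collections import Counter

# Assumption: zero-based position of the event category code in each row.
# This is illustrative; confirm it against the GDELT codebook.
EVENT_CODE_COLUMN = 26


def count_event_codes(lines, code_column=EVENT_CODE_COLUMN):
    """Tally how often each event category code appears in the rows."""
    # Assumption: fields are tab-separated and never quoted.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        # Skip blank or short rows instead of failing on them.
        if len(row) > code_column and row[code_column]:
            counts[row[code_column]] += 1
    return counts


def make_row(code, width=30):
    """Build a made-up row with only the category code filled in."""
    row = [""] * width
    row[EVENT_CODE_COLUMN] = code
    return "\t".join(row)


# Made-up sample standing in for a downloaded daily file.
sample = io.StringIO("\n".join(make_row(c) for c in ["145", "042", "145"]))
counts = count_event_codes(sample)
print(counts.most_common(2))
```

For a real file you would pass an open file handle in place of the in-memory sample, for instance `open(path, newline="", encoding="utf-8")`, where `path` is wherever you saved the download.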
But the events database is only one part of what GDELT does. There's another product called the global knowledge graph. The knowledge graph uses named entity recognition to turn all of the raw data that GDELT collects into a list of every person, organization, location, etc. referenced in that data, combined with a list of over 230 themes used to describe the context in which each named entity was mentioned. The end result is a graph of connections that describes not only what is happening around the world, but also how it's all connected.
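One way to picture how a list of entities and themes becomes a "graph of connections" is a simple co-occurrence count: any two entities mentioned in the same record get an edge, and the edge gets heavier each time they appear together. The Python sketch below is my own illustration of that idea, not GDELT's actual method. The record shape (a dict with "entities" and "themes" keys) and the sample names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(records):
    """Return (edge weights, themes seen per edge) for co-mentioned entities."""
    edges = Counter()
    themes_by_edge = defaultdict(set)
    for record in records:
        # Sort and de-duplicate so each pair has one canonical ordering.
        names = sorted(set(record["entities"]))
        for pair in combinations(names, 2):
            edges[pair] += 1
            themes_by_edge[pair].update(record["themes"])
    return edges, themes_by_edge


# Hypothetical records; real knowledge graph rows carry far more detail.
records = [
    {"entities": ["Person A", "Org X", "City Y"], "themes": ["PROTEST"]},
    {"entities": ["Person A", "Org X"], "themes": ["ELECTION"]},
    {"entities": ["City Y"], "themes": ["PROTEST"]},
]

edges, themes_by_edge = build_cooccurrence(records)
for pair, weight in edges.most_common():
    print(pair, weight, sorted(themes_by_edge[pair]))
```

The heaviest edges are the entities that show up together most often, and the themes attached to each edge hint at why they are connected.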
The creation of these two sources of data - the events database and the global knowledge graph - would, by itself, be a pretty big deal. But the project doesn't stop there. Fueled by an acknowledgement that simply exposing the data in raw form still keeps the barrier to entry for a typical user pretty high (there are over 100 GB of data to host, after all), GDELT also provides a variety of options for consuming and analyzing the data. Aside from being downloadable as raw data in CSV form, it's also hosted as a Google BigQuery service in the cloud (the project IS sponsored by Google, remember). But perhaps more interesting is something called the GDELT Analysis Service, which is a collection of tools and services that lets you perform high-level visualization and analysis of the GDELT data. Examples include interactive heatmaps, word clouds, timelines, and network visualizations. Users provide inputs to describe what they want to analyze, and the system emails a link to the results when the analysis is complete. It's unfortunately not true real-time interaction yet, but I wouldn't be surprised to see these services evolve in that direction. For certain types of very common high-level analyses, pre-built reports have been created that can be subscribed to and emailed to interested parties on a daily basis. These include things like the Global Daily Trend Reports and the World Leaders Index.
Finally, there's a GDELT Blog which tracks news coverage and innovative usage examples for the data and services provided by GDELT around the world. There are examples of everything from real-time conflict and protest maps, to networks of "influence" between world leaders, to maps of global reactions to significant policy changes such as the passage of the Affordable Care Act. These particular applications represent a small sampling of what's possible with the data that GDELT provides.
If all of this sounds a bit wild, trust me - I'm right there with you. My initial reaction upon learning about this project went something like "how the hell am I just now finding out that something like this exists?" Considering the magnitude and potential impact, I think there's been a startling lack of media coverage about GDELT (which is somewhat ironic, considering what it does - I wonder if it has a named entity for itself?). Maybe GDELT is a largely academic exercise that doesn't mean a whole lot to anyone outside of the relatively small group of people in the world who might be equipped to really do something with it. Even if that's the case, it still feels like it has the potential to be a game-changer for those who CAN really use this type of data. I'm looking forward to seeing the creative new tools and applications that arise from this, and I may even try exploring some of those applications myself in a future blog post. But until then, it's simply a reminder that grand ideas that may have seemed far-fetched even a decade ago are entirely possible today, enabled in part by the seismic movements in fields such as cloud computing.