Year 2014 in Review as Seen by an Event Detection System

We examine the significant events of 2014 found by the event/trend detection tool Signi-Trend, including Sochi, Ukraine and Russia, Malaysia Airlines, and Islamic State (ISIS).

By Erich Schubert.

The following timeline was generated using the event and trend detection tool Signi-Trend (published at KDD 2014 and covered earlier on KDnuggets) on news articles collected for the year 2014.

Financial news, which is otherwise overrepresented in the data source, was removed, and instead of showing the raw keywords and headlines, we manually described the detected trends. These are the top 50 trends, with the top 10 highlighted in bold; everything is ordered chronologically.

2014-01-29: Obama's state of the union address

2014-02-07: Sochi Olympics gay rights protests
2014-02-08: Sochi Olympics first results
2014-02-19: Violence in Ukraine and Maidan in Kiev
2014-02-20: Wall street reaction to Facebook buying WhatsApp
2014-02-22: Yanukovich leaves Kiev
2014-02-28: Crimea crisis begins

2014-03-01: Crimea crisis escalates further
2014-03-02: NATO meeting on Crimea crisis
2014-03-04: Obama presents U.S. fiscal budget 2015 plan
2014-03-08: Malaysia Airlines MH-370 missing in South China Sea
2014-03-08: MH-370: many Chinese on board of missing airplane
2014-03-15: Crimean status referendum (upcoming)
2014-03-18: Crimea now considered part of Russia by Putin
2014-03-21: Russian stocks fall after U.S. sanctions

2014-04-02: Chile quake and tsunami warning
2014-04-09: False positive? experience + views
2014-04-13: Pro-Russian rebels in Ukraine's Sloviansk
2014-04-17: Russia-Ukraine crisis continues
2014-04-22: French deficit reduction plan pressure
2014-04-28: Soccer World Cup coverage: team lineups

2014-05-14: MERS reports in Florida, U.S.
2014-05-23: Russia feels sanctions impact
2014-05-25: EU elections

2014-06-06: Soccer World Cup coverage
2014-06-13: Islamic State (ISIS) Camp Speicher massacre in Iraq
2014-06-14: Soccer World Cup: Spain surprisingly destroyed by Netherlands

2014-07-05: Soccer World Cup quarter finals
2014-07-17: Malaysia Airlines MH-17 shot down over Ukraine
2014-07-18: Russia blamed for 298 dead in airline downing
2014-07-19: Independent crash site investigation demanded
2014-07-20: Israel shelling Gaza causes 40+ casualties in a day

2014-08-07: Russia bans food imports from EU and U.S.
2014-08-08: Obama orders targeted air strikes in Iraq
2014-08-20: ISIS murders journalist James Foley, air strikes continue
2014-08-30: EU increases sanctions against Russia

2014-09-05: NATO summit with respect to IS and Ukraine conflict
2014-09-11: Scottish referendum upcoming - poll results are close
2014-09-23: U.N. on legality of U.S. air strikes in Syria against ISIS
2014-09-26: Star manager Bill Gross leaves Allianz/PIMCO for Janus

2014-10-22: Ottawa parliament shooting
2014-10-26: EU banking review

2014-11-05: U.S. mid-term elections
2014-11-12: Foreign exchange manipulation investigation results
2014-11-17: Japan recession

2014-12-11: CIA prisoner and U.S. torture centers revealed
2014-12-15: Sydney cafe hostage siege
2014-12-17: U.S. and Cuba relations improve unexpectedly
2014-12-18: Putin criticizes NATO, U.S., Kiev
2014-12-28: AirAsia flight QZ-8501 missing

Similar to the result for 2013, the timeline covers many key geopolitical events of 2014.
There is probably one "false positive" in it: 2014-04-09 has many articles talking about "experience" and "views", but they do not all refer to the same topic (we did not do topic modeling yet).

Some events that we would have liked to see are also missing; many of these narrowly missed the top 50 but do appear in the top 100, such as the Sony cyber-attack (#51) and the Ferguson riots on November 11 (#66).

Significant Trends in Large Data Streams

In this model, a trend is considered significant if the number of articles is 3 standard deviations above the expected value, a classic definition from statistics. To make this work on streaming data, we used exponentially weighted averages and standard deviations. To reduce spurious trends (in particular, first occurrences of terms), we added a simple bias term akin to Laplacian correction that removes such background noise. The main challenge is to scale this up to every term, and every term combination, in the data set: Facebook is mentioned every day, but the combination of Facebook and WhatsApp rarely occurred until Facebook bought WhatsApp. Single terms can also be useful to track, as seen in the chart below: Ukraine trended most when the Malaysia Airlines plane was shot down in July 2014 (bottom chart), although it had more coverage in March 2014.
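
The significance test described above can be sketched in a few lines of Python. This is a minimal illustration, not the Signi-Trend implementation: the class name `EwmaTrendScore`, the smoothing factor `alpha`, the bias value `beta`, and the exact way the bias enters the score are assumptions made for this example.

```python
import math


class EwmaTrendScore:
    """Streaming z-score for one term (illustrative sketch only).

    alpha, beta and the exact form of the bias term are assumptions of
    this example, not values taken from the article.
    """

    def __init__(self, alpha=0.1, beta=1.0, threshold=3.0):
        self.alpha = alpha          # weight given to the newest observation
        self.beta = beta            # bias term damping rare or new terms
        self.threshold = threshold  # "3 standard deviations"
        self.mean = 0.0             # exponentially weighted moving average
        self.var = 0.0              # exponentially weighted moving variance

    def score(self, x):
        # Deviation from the expected value in units of standard deviation.
        # The bias raises the baseline and the scale for terms with little
        # history, so a first occurrence does not divide by zero.
        return (x - max(self.mean, self.beta)) / (math.sqrt(self.var) + self.beta)

    def update(self, x):
        # Standard incremental update of the exponentially weighted
        # mean and variance.
        delta = x - self.mean
        self.mean += self.alpha * delta
        self.var = (1.0 - self.alpha) * (self.var + self.alpha * delta * delta)

    def observe(self, x):
        # Score the new value against past statistics first, then learn it.
        z = self.score(x)
        self.update(x)
        return z, z >= self.threshold
```

Calling `observe` once per time step (for example, once per day with the article count of a term) returns the score and whether it reaches the threshold. Scoring before updating matters: otherwise the spike itself would already inflate the expected value it is compared against.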

[Figure: Ukraine]

To make this approach scale to monitoring every word and word pair mentioned over time, we employ a classic hashing/sketching trick. We accept heavy-hitters-style inaccuracy for rare terms, but by using multiple hash functions we will, with high probability, not miss any frequently mentioned trend: to be missed, a trend needs to collide with a more frequent term in every hash function.
Using a fixed amount of memory for the hash table (we used 256 MB), we can track trends this way without specifying keywords in advance, even on large data sets such as Twitter. With our algorithm, smaller data sets such as this news data set can be processed on a single Raspberry Pi (Model B, 512 MB).
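
To illustrate the hashing idea, here is a small count-min-style sketch in Python. It is a swapped-in, simplified stand-in rather than the data structure used by Signi-Trend: it stores plain counts instead of moving averages, and the names `CountMinSketch` and `pair_key`, the table sizes, and the choice of salted BLAKE2 as hash family are assumptions of this example.

```python
import hashlib


def pair_key(a, b):
    """Order-independent key for a word pair (hypothetical helper)."""
    first, second = sorted((a, b))
    return first + "\x00" + second


class CountMinSketch:
    """Fixed-memory approximate counter using several hash functions."""

    def __init__(self, width=1 << 16, depth=4):
        self.width = width  # buckets per hash function
        self.depth = depth  # number of hash functions
        self.rows = [[0] * width for _ in range(depth)]

    def _buckets(self, key):
        data = key.encode("utf-8")
        for row in range(self.depth):
            # A different salt per row gives one hash function per row.
            digest = hashlib.blake2b(
                data, digest_size=8, salt=row.to_bytes(2, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        for row, col in self._buckets(key):
            self.rows[row][col] += count

    def estimate(self, key):
        # Collisions can only inflate a bucket, so the minimum over all
        # rows is the tightest estimate. It is wrong only if the key
        # collides with other keys in every row.
        return min(self.rows[row][col] for row, col in self._buckets(key))
```

A usage example: `sketch.add("facebook")` and `sketch.add(pair_key("facebook", "whatsapp"))` for every article mentioning both, then `sketch.estimate(...)` to read a count back. Memory stays at `width * depth` counters no matter how many distinct words and pairs appear, which is the property that allows tracking without a keyword list.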

Visualizing Trend Clusters

Visualization is essential to understanding the results. Showing absolute numbers and the significance as interpreted by the algorithm is helpful, but many words and word pairs may be trending at the same time. To visualize this, we also created a semantic word cloud, where the words are not placed randomly (as is common with word clouds) but reflect the association of words with each other. In the following image, you can see July 20, when two major clusters trend: many people were killed in the Israel-Gaza conflict (green cluster on the left), and fighting with pro-Russian rebels in eastern Ukraine also caused many fatalities. Links in this figure indicate terms that trend together; colors indicate a cluster structure obtained from this data.

[Figure: semantic word cloud]

You can also explore the results and visualization online in a snapshot (sorry, we are currently not crawling news sites in real time).

Learning More

Details on the approach are published in the KDD 2014 conference proceedings, and you can even watch the presentation online (presented by Michael Weiler).

Bio: Dr. Erich Schubert is a research and teaching assistant at the Ludwig-Maximilians-Universität München, Germany. He finished his PhD in 2013 on "Generalized and Efficient Outlier Detection for Spatial, Temporal, and High-Dimensional Data Mining" and is one of the lead authors of the open-source ELKI data mining toolkit. He is expanding his research into text mining and big data analysis, and is interested in post-doc and assistant professor opportunities in his research areas.