Associations and Text Mining of World Events

Applying frequent itemset analysis to text may seem daunting, but parallel hardware and two insights open the door to theme extraction.

By Chris Painter, Sept 2014.

As the data deluge continues, it becomes increasingly obvious that we will always need machines to leverage our comprehension of a confused and confusing world. While steady progress is being made on supervised learning techniques for structured data, the same cannot be said for unstructured text analysis, which is primarily restricted to reducing text to structured data so that supervised techniques can be applied. This represents a major challenge to data miners, because unstructured text constitutes 70% of all data, and it is in that 70% that knowledge and ideas can be found.

Luckily, help is at hand from a rather unexpected quarter, namely Association Rules. Just as it is possible to say that "people who bought these things also bought that thing", we can also reason that "if a document contains these words then it is likely to contain this word". Readers familiar with such techniques might quake at the combinatorial problem entailed by natural language, but this can be scaled down by the intelligent reduction of input texts. According to the OED, just 100 words, such as pronouns and prepositions, account for half of the English corpus; they carry little meaning and can be ignored.
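To make the idea concrete, here is a minimal sketch in Python of how a word rule might be scored over a handful of documents once common words are dropped. The stop-word list, the toy documents and the function names are all assumptions made for illustration; they are not taken from the author's system.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical, deliberately tiny stop-word list; a real one would hold
# the hundred or so most common words.
STOP_WORDS = frozenset({"the", "a", "of", "in", "and", "to", "by", "is", "was"})


def to_itemset(text: str) -> FrozenSet[str]:
    """Reduce a document to the set of distinct non-stop words it contains."""
    words = (w.strip(".,;:!?\"'").lower() for w in text.split())
    return frozenset(w for w in words if w and w not in STOP_WORDS)


def support(itemsets: List[FrozenSet[str]], words: Iterable[str]) -> float:
    """Fraction of documents that contain every word in `words`."""
    wanted = frozenset(words)
    return sum(1 for s in itemsets if wanted <= s) / len(itemsets)


def confidence(itemsets: List[FrozenSet[str]],
               antecedent: Iterable[str], consequent: str) -> float:
    """Of the documents containing the antecedent words, the fraction
    that also contain the consequent word."""
    antecedent = frozenset(antecedent)
    ante = support(itemsets, antecedent)
    both = support(itemsets, antecedent | {consequent})
    return both / ante if ante else 0.0


# Invented example documents, not real news text.
DOCS = [
    "The hostage was held by the Islamic State.",
    "A British hostage is in the hands of Islamic State.",
    "Islamic State and the British government.",
    "The state of the economy.",
]
ITEMSETS = [to_itemset(d) for d in DOCS]

if __name__ == "__main__":
    print(support(ITEMSETS, {"islamic", "state"}))
    print(confidence(ITEMSETS, {"british", "islamic", "state"}, "hostage"))
```

On these four toy documents the pair "islamic" and "state" appears in three of the four, and the rule "if British and Islamic and State then Hostage" holds in one of the two documents that contain its antecedent, giving a confidence of 0.5.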

However, even if we reduce the input texts by removing those words, two problems remain: how to make the rules quickly with finite hardware resources, and how to select the best when interestingness measures do not always rank rules in the same order.

Although FP-growth algorithms produce trees at a fantastic pace, they are optimised towards sequential construction; as a consequence, semi-parallel approaches such as the Apriori algorithm compare poorly in a conventional hardware environment. That's unfortunate, because the flow of the Apriori algorithm is ideally suited to generating and scoring Association Rules on the fly, as it carries the context counts that drive all measures of importance. These are also the numbers that can be "Iteratively Proportionally Fitted" so that all measures will rank the rules in the same order. Parallel hardware might therefore solve both the speed and grading problems.
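The article does not spell out how the fitting is done. One common reading, assumed here, is that the 2x2 table of context counts for a rule (both sides present, only the antecedent, only the consequent, neither) is rescaled until every row and column sums to the same value. The sketch below shows that procedure in Python; the function names and the example counts are invented for illustration.

```python
def ipf_standardise(table, target=0.5, tol=1e-12, max_iter=1000):
    """Iterative Proportional Fitting on a 2x2 contingency table.

    `table` is [[f11, f10], [f01, f00]]: counts of documents with both
    sides of the rule, antecedent only, consequent only, and neither.
    Rows and columns are rescaled in turn until every margin is close
    to `target`. Assumes all four cells are strictly positive.
    """
    t = [[float(x) for x in row] for row in table]
    for _ in range(max_iter):
        # Scale each row so that it sums to the target margin.
        for i in range(2):
            row_sum = t[i][0] + t[i][1]
            t[i] = [x * target / row_sum for x in t[i]]
        # Scale each column so that it sums to the target margin.
        for j in range(2):
            col_sum = t[0][j] + t[1][j]
            for i in range(2):
                t[i][j] *= target / col_sum
        # Columns are now exact; stop once the rows agree as well.
        if all(abs(sum(row) - target) < tol for row in t):
            break
    return t


def odds_ratio(t):
    """Cross-product ratio of a 2x2 table; row and column scaling leave it unchanged."""
    return (t[0][0] * t[1][1]) / (t[0][1] * t[1][0])


if __name__ == "__main__":
    raw = [[30, 10], [20, 40]]  # invented counts
    fitted = ipf_standardise(raw)
    print(fitted)
    print(odds_ratio(raw), odds_ratio(fitted))
```

Once all four margins are fixed, a 2x2 table has only one free cell, and that cell is determined by the odds ratio, which the rescaling preserves. Any measure computed from the fitted table is therefore a function of that single number, which is one plausible explanation of why the measures stop disagreeing about rank.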

Step forward the Nvidia GPU, beloved by all gamers. Currently powering the second-fastest computer on the planet, it also comes complete with its own language and libraries to ease the programming task.

Using these offerings, a prototype news analyser has emerged, which extracts the most significant themes from newsfeeds 24/7.

It addresses the memory space problem by progressively reducing the frequency threshold until the top rules altogether contain a given number of words. This is made feasible by the speed of the GPU, a parallel version of the Apriori algorithm, and Iterative Proportional Fitting to select the top rules on the fly. Here are some examples of raw news text:

[Figure: News pattern: Hostage if British and Islamic and State]

The graph generated from top stories:

[Figure: World Top Stories, at 9/25/2014, 4:13PM BST]

and the generated rules:

[Figure: meme-machines rules]
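The threshold-lowering loop described before these examples can be sketched as follows. This is a sequential Python illustration under stated assumptions, not the author's GPU code: the rules are ranked by plain confidence as a stand-in for the fitted scores, and the function names, the `top_n` parameter and the toy word sets are all hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Level-wise Apriori: map each itemset found in at least
    `min_count` transactions to its count."""
    counts = {}
    for t in transactions:
        for word in t:
            key = frozenset([word])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)
    k = 2
    while level:
        # Join: unite pairs of frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a, b in combinations(list(level), 2)
                      if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))}
        level = {}
        for c in candidates:
            n = sum(1 for t in transactions if c <= t)
            if n >= min_count:
                level[c] = n
        frequent.update(level)
        k += 1
    return frequent


def rules_from(frequent):
    """Single-consequent rules as (confidence, count, antecedent, consequent)."""
    rules = []
    for itemset, n in frequent.items():
        if len(itemset) < 2:
            continue
        for word in itemset:
            antecedent = itemset - {word}
            rules.append((n / frequent[antecedent], n, antecedent, word))
    return rules


def mine_until(transactions, target_words, top_n=5):
    """Lower the count threshold until the top rules together mention
    at least `target_words` distinct words, or the threshold reaches 1."""
    rules = []
    min_count = len(transactions)
    while min_count >= 1:
        ranked = sorted(rules_from(apriori(transactions, min_count)),
                        key=lambda r: (-r[0], -r[1], sorted(r[2]), r[3]))
        rules = ranked[:top_n]
        words = set()
        for _, _, antecedent, consequent in rules:
            words |= antecedent | {consequent}
        if len(words) >= target_words:
            return min_count, rules
        min_count -= 1
    return 1, rules


# Invented word sets standing in for stop-word-filtered news items.
NEWS = [
    frozenset({"hostage", "islamic", "state"}),
    frozenset({"british", "hostage", "islamic", "state"}),
    frozenset({"british", "islamic", "state"}),
    frozenset({"state", "economy"}),
]

if __name__ == "__main__":
    threshold, top_rules = mine_until(NEWS, target_words=3)
    print(threshold)
    for conf, n, antecedent, consequent in top_rules:
        print(sorted(antecedent), "->", consequent, conf, n)
```

Starting from the strictest threshold means the cheap, high-frequency passes run first and the search stops as soon as the top rules cover enough vocabulary, which keeps the candidate sets small for as long as possible. Each pass here re-mines from scratch for clarity; a practical version would reuse counts between passes.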

The commercial advantages of a bias-free, comprehensible knowledge extractor are not difficult to see: from News to Patents to Medical Science, the list is as long as human enterprise. The author has a long-standing and productive relationship with Ingo Mierswa at RapidMiner, and is currently working to provide this capability as an extension to that software.

Visitors to the site can check out the Accumulator, a Neo4j database into which all rule observations are aggregated, and which can be interrogated via Prolog. This represents a second strand to the work, and one which hums the Big Data anthem: that noise cancels itself out.

We have a self-improving oracle growing out there, and one that you can question right now. Enjoy!

Chris Painter read Classics at Oxford, majoring in Philosophy and Logic. For the last twenty-five years he has pursued the notion of Meaning, both in economic patterns and in free-form text.