Journalistic Data Mining

A brilliant talk by Jonathan Stray of the AP on investigating millions of documents by visualizing clusters and using other computational tools.


Jonathan Stray of the Associated Press on investigating thousands (or millions) of documents by visualizing clusters: a fantastic talk from the National Institute for Computer-Assisted Reporting. The visualizations were built with Glimmer, a multidimensional scaling algorithm.

Jonathan worked in computer science before he became a reporter, and that background lets him do very interesting computational reporting.

Investigating thousands (or millions) of documents by clustering from Jonathan Stray on Vimeo.

meckdevil writes on Slashdot about this talk:

Associated Press developer-journalist extraordinaire Jonathan Stray gives a brilliant explanation of the use of data-mining strategies to winnow and wring journalistic sense out of massive numbers of documents, using the Iraq and Afghanistan war logs released by Wikileaks as a case in point. The concepts for focusing on certain groups of documents and ignoring others are hardly new; they underlie the algorithms used by the major Web search engines. Their use in a journalistic context is on a cutting edge, though, and it raises a fascinating quandary: By choosing the parameters under which documents will be considered similar enough to pay attention to, journalist-programmers actually choose the frame in which a story will be told. This type of data mining holds great potential for investigative revelation - and great potential for journalistic abuse.
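
To make the point about similarity parameters concrete, here is a minimal sketch in Python. It is not the method from the talk: it swaps in plain TF-IDF weighting with cosine similarity, a standard way to compare documents, and the four sample documents, the function names and the threshold values are all made up for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict of term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed inverse document frequency, so no weight is ever zero.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_pairs(docs, threshold):
    """Index pairs (i, j), i < j, whose similarity is at least `threshold`."""
    vectors = tfidf_vectors(docs)
    return [
        (i, j)
        for i, j in combinations(range(len(docs)), 2)
        if cosine(vectors[i], vectors[j]) >= threshold
    ]


# Hypothetical sample documents, invented for this sketch.
docs = [
    "ied explosion on convoy route",
    "ied explosion near convoy checkpoint",
    "detainee transfer report filed",
    "convoy checkpoint detainee report",
]

# The threshold is the editorial choice: a looser value links more documents.
for threshold in (0.1, 0.3, 0.5):
    print(threshold, similar_pairs(docs, threshold))
```

Running it with different thresholds shows the quandary in miniature: the documents never change, but the set of pairs treated as "related" grows or shrinks with a number the programmer picked.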