Affordable online news archives for academic research

Many researchers need access to multi-year historical repositories of online news articles. We identified three companies that make such access affordable, and spoke with their CEOs.

By Sam Kogan & David Stolin


Every time a website posts a new article – like this one – the world’s set of publicly available information becomes a bit richer. Such richness attracts researchers, but trying to exploit it presents them with a needle-in-a-haystack problem. Yes, Google News used to offer an archive you could interrogate, but this hasn’t been an option for years. So what do you do if your research question requires you to gather an accurate history of relevant online articles?

Of course, there are numerous free and open source crawlers and scrapers being used by data-hungry citizens of the web. If you have any experience with this, you will appreciate the complexity of the task. Scraping a single news website often means navigating complicated JavaScript code, paywalls, IP blockers, and so on. Even for larger organizations, doing so in-house for the entire internet is typically out of the question. Fortunately, several startups have done the work for you, and we have spoken with their CEOs about their offerings.
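To get a feel for why, here is a minimal do-it-yourself scraper in Python. This is a sketch only: the URL is a placeholder, and real news sites will defeat it in exactly the ways just described.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL: a stand-in, not a real target.
url = "https://example-news-site.com/some-article"

# Even a polite, well-formed request like this one routinely fails in the
# wild: JavaScript-rendered pages return skeleton HTML, paywalls withhold
# the body text, and repeated requests from one IP address get blocked.
response = requests.get(url, headers={"User-Agent": "research-bot/0.1"}, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

headline = soup.find("h1")
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
print(headline.get_text(strip=True) if headline else "no headline found")
print(f"{len(paragraphs)} paragraphs extracted")
```

Now multiply that fragility by thousands of sites, each with its own layout, and the appeal of buying the data becomes clear.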

Webhose, led by CEO Ran Geva, scrapes news sites – as well as blogs, forums, product reviews, and now even the dark web – and makes the data retrievable in JSON and XML formats for as little as $0.0002 per article (and at no cost for up to 100,000 articles in the most recent month). Research studies relying on Webhose's data have already started appearing in peer-reviewed journals from respected scientific publishers such as Elsevier, IEEE, and Taylor & Francis. According to Geva, Webhose data are being used by researchers at George Mason University, the University of Missouri, and the University of Pennsylvania, among others. This is despite the fact that these institutions already subscribe to eye-wateringly expensive news databases such as Factiva, LexisNexis, and ProQuest. The likely reason is breadth of coverage: if you want to know what information was out there at any given point in time, you shouldn't limit yourself to the tiny sliver of the total news flow that these incumbent news data providers cover. Webhose.io, by contrast, processes the top 10,000 Alexa-ranked news sites, and then some: over 7 million news items per day, across the world and in dozens of languages.
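For illustration, retrieving archived articles over HTTP might look roughly like the sketch below. We stress that the endpoint name, query syntax, and response fields here are our assumptions for the sketch, not a transcription of Webhose's documentation, which you should consult for the real details.

```python
import requests

API_TOKEN = "YOUR_API_TOKEN"  # placeholder
url = "https://webhose.io/filterWebContent"  # assumed endpoint name
params = {
    "token": API_TOKEN,
    "format": "json",
    "q": '"interest rates" language:english site_type:news',  # assumed query syntax
}

articles = []
while url:
    data = requests.get(url, params=params, timeout=30).json()
    articles.extend(data.get("posts", []))  # assumed response key
    next_path = data.get("next")            # assumed pagination key
    url = "https://webhose.io" + next_path if next_path else None
    params = {}  # an assumed "next" link would already carry the query

print(f"Retrieved {len(articles)} articles")
```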

There are caveats. Retrieving articles from Webhose’s online news archive is subject to a $10-per-archival-month minimum charge. Contemporaneous data collection started only in 2014 (however, the archive of news stories, blogs and forums has been backfilled to 2008). At the end of July 2017, the number of covered sites jumped by 40% (but Geva points out that a researcher can easily correct for this inconsistency). Time stamp accuracy is lower for less popular sites (nevertheless, discrepancies do not exceed 24 hours). Paywall-protected content is excluded (although the non-protected text at the start of such articles often gives a good idea of their content).
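One simple way to make such a correction (our sketch, not Geva's prescription) is to normalize article counts by the number of covered sites, so that the series is comparable on both sides of the break. With synthetic numbers:

```python
import pandas as pd

# Synthetic daily counts of matching articles around the coverage break.
dates = pd.date_range("2017-07-29", periods=6, freq="D")
counts = pd.Series([100, 102, 98, 140, 141, 139], index=dates, name="articles")

# Suppose the number of covered sites jumped ~40% on 2017-08-01 (the exact
# cutover date is an assumption of this sketch; locate it in your own data).
sites = pd.Series([10_000] * 3 + [14_000] * 3, index=dates, name="sites_covered")

# Articles per covered site is comparable across the coverage jump.
rate = counts / sites
print(rate)
```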

A similar solution is available from EventRegistry: rich, multilingual archival online news data going back to 2014 and retrievable in JSON format. While its cost per article is higher than Webhose's, there is no per-archival-month fee, which makes the solution cost-efficient when the number of target articles is relatively small but the search spans multiple months; in addition, academic discounts are available. Gregor Leban, EventRegistry's CEO, explains that the company's academic roots have led to a particular focus on intelligently identifying the underlying events that articles refer to, as explained in a research article he co-authored. According to Leban, this allows the researcher to see how various sources reported on the same event, which is important for studying phenomena such as news bias and information propagation. EventRegistry has already been used by researchers at IBM and at Cambridge, Oxford, and Stanford universities, among others, as well as by several leading financial institutions.
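EventRegistry also maintains an open-source Python client (pip install eventregistry). The short sketch below shows roughly how one might pull articles together with the identifier of the event each was clustered into; treat the class and parameter names as indicative and verify them against the current documentation.

```python
from eventregistry import EventRegistry, QueryArticlesIter

er = EventRegistry(apiKey="YOUR_API_KEY")  # placeholder key

# Each returned article carries the URI of the event it was clustered into,
# which is what enables comparing how different sources covered one event.
q = QueryArticlesIter(keywords="quantitative easing", lang="eng")
for article in q.execQuery(er, sortBy="date", maxItems=50):
    print(article.get("eventUri"), "|", article.get("title"))
```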

Indeed, the finance sector has a particular appetite for ingesting news – after all, news is what moves asset prices – and CityFALCON focuses specifically on financial news. CityFALCON's data coverage starts in 2014, when its news feed first went live, and now includes several thousand sources in multiple languages. A strong emphasis is placed on identifying the financial assets, people, places, and sectors mentioned in an article and on disambiguating their names, so you need not worry about being swamped with articles on agriculture if you are interested in Apple Inc. Historical data can be downloaded in CSV format and includes the article headline and summary (although not the full text), as well as accurate time stamps. These features make the archive particularly suitable for so-called event studies, which examine the impact of news on asset prices. While the cost of data access depends on the specific case, CEO Ruzbeh Bacha tells us that CityFALCON is eager to work with academic researchers, students, and startups, and is open to providing them with data at a heavy discount or even for free.
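To make the event-study use case concrete, here is a minimal sketch joining a downloaded news CSV to daily stock returns. The file names and column names are hypothetical, chosen for illustration rather than taken from CityFALCON's actual export schema.

```python
import pandas as pd

# Hypothetical file and column names; adapt them to the actual export.
news = pd.read_csv("cityfalcon_export.csv", parse_dates=["published_at"])
returns = pd.read_csv("daily_returns.csv", parse_dates=["date"])  # ticker, date, ret

# Map each article to the calendar day of its timestamp, then join on
# (ticker, day) to line up news items with same-day returns.
news["event_day"] = news["published_at"].dt.normalize()
merged = news.merge(
    returns,
    left_on=["asset_ticker", "event_day"],
    right_on=["ticker", "date"],
    how="inner",
)

# Average same-day return per ticker on news days: a crude first pass at
# whether coverage coincides with price moves.
print(merged.groupby("asset_ticker")["ret"].mean().sort_values())
```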

It is important to stress that CityFALCON, EventRegistry, and Webhose differ in many respects and add new features regularly, so potential users should evaluate each solution for themselves. And of course, there are many other sources of structured news data out there. However, the ones profiled above are, as far as we know, the only ones that meet three key criteria for wide usefulness in academic research. First, they include several years of historical data collected contemporaneously (retroactive collection means that articles could have been deleted or redacted in the meantime, undermining research validity). Second, they cover thousands of online news sources. Third, they are affordable, even to a university student working on a thesis. In other words, by cheaply delivering structured online news data free of hindsight bias, these companies are democratizing archival-news-based research.

Of course, these offerings are not perfect. The histories are only a few years long, and the coverage of news sources changes over time. What's more, a researcher would often like to interrogate an archive through complex queries – for example, requiring two search terms to appear in close proximity to one another – but this is not an option at present (though, as the sketch below shows, such filtering can be approximated client-side). These problems would not exist in an ideal world, and we are not arguing that any of the three products described here is a silver bullet for historical news data. But they do take us some way toward the utopia of having the full history of news at our fingertips. If you know of other data providers that serve this purpose, or if you have strong views on the offerings we have reviewed, please leave a comment. Let's make sure that difficulties with accessing archival news don't stop us from pursuing interesting research.
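Here is the client-side proximity filter mentioned above: a minimal sketch that keeps only articles in which two terms occur within a chosen word window of each other. One would first pull a broad keyword match from any of the three providers, then filter locally.

```python
import re

def near(text: str, term_a: str, term_b: str, window: int = 10) -> bool:
    """True if term_a and term_b occur within `window` words of each other."""
    words = re.findall(r"\w+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)

# Toy corpus standing in for article texts retrieved from any provider.
articles = [
    "The merger rumor sent shares sharply higher on Monday.",
    "Analysts dismissed talk of a merger after the rumor was denied.",
]
hits = [t for t in articles if near(t, "merger", "rumor", window=8)]
print(len(hits), "matching articles")
```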

Bio: Sam Kogan is an independent consultant. He is interested in finance, natural language processing, and AI. Sam has studied at Leiden University, worked at NHS Digital, and has published in the journal Applied Economics.

David Stolin is Professor of Finance at Toulouse Business School in France. He holds a Ph.D. from London Business School, and his research interests include investments, corporate governance and fintech.
