Lucid Imagination, hossman, May 7, 2010
One of Solr's lesser-known features (at least from my perspective) is the StatsComponent. Stats is a feature that was added in Solr 1.4 that lets Solr compute various statistics for numeric fields in your documents. It can even compute these statistics per facet constraint from other fields.
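To make that concrete, here is a minimal sketch of what a StatsComponent request might look like. The `stats`, `stats.field`, and `stats.facet` request parameters are the ones the component reads; the host, the `/solr/select` path, and the field names `total_contributions` and `party` are assumptions made up for illustration, not names taken from the FEC data. The snippet only builds the URL and does not contact a server.

```python
from urllib.parse import urlencode


def build_stats_url(base_url, stats_fields, facet_fields=(), query="*:*"):
    """Build a Solr select URL that asks the StatsComponent for statistics.

    stats_fields: numeric fields to compute statistics over.
    facet_fields: optional fields whose values split the statistics.
    """
    params = [
        ("q", query),
        ("rows", "0"),       # we only want the statistics, not the documents
        ("stats", "true"),   # switch the StatsComponent on
    ]
    # One stats.field parameter per numeric field.
    params += [("stats.field", name) for name in stats_fields]
    # One stats.facet parameter per field to break the statistics down by.
    params += [("stats.facet", name) for name in facet_fields]
    return base_url + "?" + urlencode(params)


# Hypothetical host and field names, for illustration only.
stats_url = build_stats_url(
    "http://localhost:8983/solr/select",
    stats_fields=["total_contributions"],
    facet_fields=["party"],
)
print(stats_url)
```

Setting `rows=0` is a design choice rather than a requirement: it keeps the response small when only the numbers matter.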
Since Primary Election season is ramping up (here in the US, anyway) we'll demonstrate Solr's Stats functionality using some data from Data.Gov. Specifically, we'll index the "2009-2010 Candidate Summary File" data from the FEC...
Summary financial information about campaigns for U.S. Senate, U.S. House of Representatives, and President of the United States
That certainly sounds like it should contain some interesting numeric data, and the "Data Dictionary/Variable List" certainly seems to support this. But before we can use Solr to get some statistics from this data, we need to index it. We start by fetching the data, and taking a peek inside....
About 8 seconds later (mostly due to lag from the FEC webserver), we have our data and we can start running some queries. Without even using the StatsComponent, we can already learn some interesting facts from this data using a basic search/sort:
- The candidate with the highest net operating expenditures ($14,404,986) is the Senate campaign for Harry Reid (D) in Nevada
- The Senate candidate who has received the most total loans ($14,000,000) for her campaign is Linda McMahon (R) in Connecticut
- The Texas candidate for the House of Representatives who has reported the highest total contributions ($1,983,946) is Chet Edwards (D) in district #17
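Facts like these come from nothing more exotic than a filter plus a descending sort, asking for a single row. As a rough sketch, the last one could be expressed as a request like the one built below. The `q`, `fq`, `sort`, and `rows` parameters are standard Solr; the host and the field names (`state`, `office`, `total_contributions`) are hypothetical placeholders, since the real names depend on how the FEC columns were mapped into the schema.

```python
from urllib.parse import urlencode


def build_top_candidate_url(base_url, sort_field, filters=(), rows=1):
    """Build a Solr select URL returning the documents with the largest
    values of sort_field, restricted by optional filter queries."""
    params = [
        ("q", "*:*"),                   # match every candidate
        ("sort", sort_field + " desc"),  # largest value first
        ("rows", str(rows)),            # how many top documents to return
    ]
    # Each filter becomes its own fq parameter.
    params += [("fq", clause) for clause in filters]
    return base_url + "?" + urlencode(params)


# Hypothetical host and field names, for illustration only.
top_url = build_top_candidate_url(
    "http://localhost:8983/solr/select",
    sort_field="total_contributions",
    filters=["state:TX", "office:H"],
)
print(top_url)
```

Dropping the filters and changing the sort field gives the other two queries in the same way.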