Computer scientists from Carnegie Mellon University have devised a framework for running large-scale computations for tasks such as social network or Web search analysis efficiently on a single personal computer.
The software could help developers working on many modern tasks: for example, designing a new recommendation engine using social network connections. In order to make effective recommendations-"your friends liked this movie, so here is another movie that you haven't seen yet, but you will probably like"-the software has to be able to analyze the connections between the members of a social network. This type of task is called graph computation, and it is increasingly common. But working with large-scale data sets (such as online social networks) usually requires the processing horsepower of many computers clustered together, such as those offered by Amazon's cloud-based EC2 service.
The new software, called
GraphChi, exploits the capacious hard drives that are becoming ever more common in personal computers. A graph would normally be stored in temporary memory (RAM) for analysis. With GraphChi, the hard drive performs this task instead.
"PCs don't have enough RAM to hold an entire Web graph, but they do have hard drives, which can hold a lot of information," says Carlos Guestrin, codirector of Carnegie Mellon's Select Lab, where GraphChi was developed. But hard drives are slow compared to RAM for reading and writing data, which tends to slow down computation. So Guestrin's student Aapo Kyrola designed a faster, less random method of accessing the hard drive.
According to Guestrin, a Mac Mini running GraphChi can analyze Twitter's social graph from 2010-which contains 40 million users and 1.2 billion connections-in 59 minutes. "The previous published result on this problem took 400 minutes using a cluster of about 1,000 computers," Guestrin says.
From the GraphChi Introduction
GraphChi can run very large graph computations on just a single machine, by using a novel algorithm for processing the graph from disk (SSD or hard drive). Programs for GraphChi are written in the vertex-centric model, proposed by GraphLab and Google's Pregel. GraphChi runs vertex-centric programs asynchronously (i.e changes written to edges are immediately visible to subsequent computation), and in parallel. GraphChi also supports streaming graph updates and removal of edges from the graph.
The source code and documentation of GraphChi for C++ is available at the Google Code project pages: