KDnuggets Home » News » 2015 » Nov » Software » Arabesque Distributed Graph Mining Platform ( 15:n39 )

Arabesque Distributed Graph Mining Platform


Arabesque provides an elegant solution to the difficult problem of graph mining that lets a user easily express graph algorithms and efficiently distribute the computation.



By Georgos Siganos, QCRI .

Distributed data processing platforms such as MapReduce and Pregel (Giraph, GraphLab, Graph-X) have substantially simplified the design and deployment of certain classes of distributed graph analytics algorithms such as computing Pagerank, SSP, CC etc.

gm_cliques

Cliques of size >= 4 for a small graph
 

However, these platforms do not represent a good match for graph mining problems, as for example finding frequent subgraphs(FSM) in a labeled graph or finding Cliques. Given an input graph, these problems require exploring a very large number of subgraphs(embeddings) and finding patterns that match some “interestingness” criteria desired by the user. The number of the embeddings that the system needs to explore grows exponential to the size of the embeddings. For instance even for a tiny graph of ~4000 vertices, that we visualize above, we can reach 1.7 Billion embeddings when we consider embeddings of size 6, as shown in the following figure.

exponential-log-scale

Arabesque: Think like an Embedding paradigm

Our system Arabesque is the first distributed data processing platform focused on solving these graph mining problems. Arabesque automates the process of exploring a very large number of embeddings, and defines a high-level computational model (filter – process) that simplifies the development of scalable graph mining algorithms. In a nutshell, Arabesque explores embeddings and passes them to the user defined code, which must simply decide whether an embedding is of interest and should be further extended (filter function) and whether we should compute outputs for the embeddings that passed the filter function (process function). This allows the end-user to think of the problem in a natural way, Think like an Embedding (TLE), where the embedding is a first class citizen, and not to try to force the problem into other paradigms. Even so, Arabesque runs under the familiar and well supported Apache Giraph system in a Hadoop cluster. Arabesque simply uses Giraph as a generic data processing platform that provides a BSP execution flow.

summary-arabesque

Example Discover Cliques

As an example, consider the code for computing cliques that we discuss in more detail in our website.

@Override
public boolean filter(VertexInducedEmbedding embedding) {
    return isClique(embedding);
}

/**
 ** For efficiency the following isClique implementation works 
 ** in an incremental way. To have an embedding that is a 
 ** clique of size k, we know that it is extended by an 
 ** embedding that is a clique of size k-1.
 **/
private boolean isClique(VertexInducedEmbedding embedding) {
    return embedding.getNumEdgesAddedWithExpansion() 
            == embedding.getNumVertices() - 1;
}

@Override
public void process(VertexInducedEmbedding embedding) {
    if (embedding.getNumWords() == maxsize) {
        output(embedding);
    }
}

Our implementation requires a handful of lines of code, is easy to reason about and straightforward and hides all the complexities of Graph Mining, while distributing the computation to tens or hundreds of machines in your Hadoop/Giraph cluster.

Elegant but above all Efficient

Arabesque is quite unique in that despite the simple and elegant API and the distributed implementation it is extremely efficient. Indeed, in our SOSP 2015 paper, we show that a single threaded Arabesque provides a comparable performance to specialized centralized implementations. This is possible because the user-defined functions are not the computational expensive parts of the Graph Mining Algorithms. The computational expensive functions are handled by Arabesque and deal mostly with the graph exploration, the isomorphism computations and the canonicality checks all inherent to Graph mining algorithms, and thus can be handled as efficient as a centralized implementation.

Additionally, Arabesque also supports the introduction of ad-hoc optimizations to its execution without having to change the user-defined filter and process functions. For instance, when finding cliques or performing triangle counting, the interesting embeddings are completely connected. This property allows us to greatly simplify both expansion and canonicality calculations by only having to deal with strictly monotonically increasing vertex ids. The activation of this clique-specific expansion mechanism results in a 10x improvement in execution time compared to the more generic expansion mechanism which has to explore more paths to handle non-completely-connected embeddings (to handle cases such as the path 1 – 4 – 2).

Open Source: Apache 2.0 license.

You can find more details on how to run Arabesque and the capabilities it provides in our website.

Binary jars can be downloaded here.

The source code can be accessed from github.

Bio: Georgos Siganos is a Senior Scientist at QCRI working on next generation Graph Mining Architectures and Big Data Systems. He is also the tech lead of the Arabesque.io and Grafos.ml open-source projects.

Related:


Sign Up