Do more with Python: Creating a graph application with Python, Neo4j, Gephi, and Linkurious

Here is how to build a neat app with graph visualization of Python and related topics from Packt and StackOverflow, combining Gephi, Linkurious, and Neo4j.

By Greg Roberts, Packt Publishing

I love Python, and to celebrate Packt’s Python Week, I’ve spent some time developing an app using some of my favourite tools. The app is a graph visualisation of Python and related topics, and it also shows where all our content fits in. The topics are all StackOverflow tags, related by their co-occurrence in questions on the site.

The app is available to view online, and in this blog post I’m going to discuss some of the techniques I used to construct the underlying dataset and how I turned it into an online application.

Graphs, not charts


Graphs are an incredibly powerful tool for analysing and visualising complex data. In recent years, many different graph database engines have been developed to make use of this novel manner of representing data. These databases offer many benefits over traditional, relational databases because of how the data is stored and accessed.

Here at Packt, I use a Neo4j graph to store and analyse data about our business. Using the Cypher query language, it’s easy to express complicated relations between different nodes succinctly.

It’s not just the technical aspect of graphs that makes them appealing to work with. Seeing the connections between bits of data drawn explicitly, as in a graph, helps you to see the data in a different light and make connections that you might not have spotted otherwise.

This graph has many uses at Packt, from customer segmentation to product recommendations. In the next section, I describe the process I use to generate recommendations from the database.

Make the connection

For product recommendations, I use what’s known as a hybrid filter. This considers both content-based filtering (products x and y are about the same topic) and collaborative filtering (people who bought x also bought y). Each of these methods has strengths and weaknesses, so combining them into one algorithm provides a more accurate signal.

The collaborative aspect is straightforward to implement in Cypher. For a particular product, we want to find out which other product is most frequently bought alongside it. We have all our products and customers stored as nodes, and purchases are stored as edges. Thus, the Cypher query we want looks like this:

MATCH (n:Product {title:'Learning Cypher'})-[r:purchased*2]-(m:Product)
WHERE m <> n
WITH n, m, count(DISTINCT r) AS shared
RETURN m.title AS suggestion, toFloat(shared)/(n.purchased + m.purchased) AS weight
ORDER BY weight DESC LIMIT 1

and will very efficiently return the most commonly also-purchased product. When calculating the weight, we divide by the total units sold of both titles, so the result is a proportion rather than a raw count. We do this so we don’t just get the titles with the most units; we’re effectively calculating the size of the intersection of the two titles’ audiences relative to their overall audience size.
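
To make the normalisation concrete, here is a minimal Python sketch of the same weight computed over an in-memory list of purchases. The customers, titles, and function names are hypothetical, and I'm assuming one unit per purchase so that "units sold" equals the number of buyers; this is an illustration of the arithmetic, not the code behind the app.

```python
from collections import defaultdict

# Hypothetical toy data: (customer, product) purchase pairs.
purchases = [
    ("alice", "Learning Cypher"),
    ("alice", "Learning Neo4j"),
    ("alice", "Python Data Analysis"),
    ("bob", "Learning Cypher"),
    ("bob", "Learning Neo4j"),
    ("carol", "Learning Cypher"),
    ("carol", "Python Data Analysis"),
    ("dave", "Python Data Analysis"),
    ("erin", "Python Data Analysis"),
    ("frank", "Python Data Analysis"),
]


def buyers_by_product(pairs):
    """Map each product title to the set of customers who bought it."""
    buyers = defaultdict(set)
    for customer, product in pairs:
        buyers[product].add(customer)
    return buyers


def also_bought(title, pairs):
    """Rank other products by shared buyers / combined units sold."""
    buyers = buyers_by_product(pairs)
    target = buyers[title]
    scores = {}
    for other, customers in buyers.items():
        if other == title:
            continue
        shared = len(target & customers)
        # Assumption: one unit per purchase, so units sold == number of buyers.
        scores[other] = shared / (len(target) + len(customers))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranking = also_bought("Learning Cypher", purchases)
print(ranking)
```

In this toy data both other titles share two buyers with the target, so a raw count would tie them. Dividing by combined units ranks the smaller, more focused title first.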

The content side of the algorithm looks very similar:

MATCH (n:Product {title:'Learning Cypher'})-[r:is_about*2]-(m:Product)
WHERE m <> n
WITH n, m, count(DISTINCT r) AS shared
RETURN m.title AS suggestion, toFloat(shared)/(size(n.topics) + size(m.topics)) AS weight
ORDER BY weight DESC
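
As a rough sketch, the content-side weight can be reproduced in plain Python, along with one possible way of blending it with the collaborative weight. The topic sets, the function names, and in particular the linear blend with a mixing parameter `alpha` are my assumptions for illustration; the exact way the two signals are combined is not spelled out here.

```python
# Hypothetical topic sets per title; the names are illustrative only.
topics = {
    "Learning Cypher": {"neo4j", "cypher", "graph-databases"},
    "Learning Neo4j": {"neo4j", "cypher", "java"},
    "Python Data Analysis": {"python", "pandas", "numpy"},
}


def content_weight(a, b, topics):
    """Shared topics divided by the combined topic counts of both titles."""
    shared = len(topics[a] & topics[b])
    return shared / (len(topics[a]) + len(topics[b]))


def hybrid_score(collab, content, alpha=0.5):
    """Blend the two signals; alpha is an assumed, tunable mixing weight."""
    return alpha * collab + (1 - alpha) * content


content = content_weight("Learning Cypher", "Learning Neo4j", topics)
# 0.4 stands in for a collaborative weight computed elsewhere.
score = hybrid_score(collab=0.4, content=content)
print(content, score)
```

A linear blend is only the simplest option. Its appeal is that `alpha` can be tuned against real purchase data to favour whichever signal proves more reliable.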

Implicit in this algorithm is the knowledge that a title is_about a topic of some kind. This could be done manually, but where’s the fun in that?

In Packt’s domain there already exists a huge, well-moderated corpus of technology concepts and their usage: StackOverflow. The tagging system on StackOverflow not only tells us about all the topics developers across the world are using, it also tells us how those topics are related, by looking at the co-occurrence of tags in questions. So in our graph, StackOverflow tags are nodes in their own right, which represent topics. These nodes are connected via edges, which are weighted to reflect their co-occurrence on StackOverflow:

edge_weight(n,m) = (# of questions tagged with both n & m)/(# questions tagged with n or m)
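
The formula above can be sketched in a few lines of Python. The questions below are made-up examples reduced to their tag sets, and the function name is hypothetical; "tagged with n or m" is computed as count(n) + count(m) minus the questions carrying both tags.

```python
from collections import Counter
from itertools import combinations

# Hypothetical questions, each reduced to its set of tags.
questions = [
    {"python", "matplotlib", "plot"},
    {"python", "matplotlib", "numpy"},
    {"python", "numpy"},
    {"matplotlib", "plot"},
    {"python", "pandas"},
]


def edge_weights(questions):
    """Return {(tag_a, tag_b): weight}, with each pair in sorted order."""
    tag_counts = Counter()
    pair_counts = Counter()
    for tags in questions:
        tag_counts.update(tags)
        # Sorting gives every pair a single canonical key.
        pair_counts.update(combinations(sorted(tags), 2))
    weights = {}
    for (a, b), both in pair_counts.items():
        either = tag_counts[a] + tag_counts[b] - both  # tagged with a or b
        weights[(a, b)] = both / either
    return weights


weights = edge_weights(questions)
print(weights[("matplotlib", "plot")])
```

Only pairs that co-occur at least once get an edge, which keeps the resulting graph sparse.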

So, to find topics related to a given topic, we could execute a query like this:

MATCH (n:StackOverflowTag {name:'Matplotlib'})-[r:related_to]-(m:StackOverflowTag)
RETURN n.name, r.weight, m.name ORDER BY r.weight DESC LIMIT 10

Which would return the following:

   | n.name     | r.weight | m.name
1  | Matplotlib | 0.065699 | Plot
2  | Matplotlib | 0.045678 | Numpy
3  | Matplotlib | 0.029667 | Pandas
4  | Matplotlib | 0.023623 | Python
5  | Matplotlib | 0.023051 | Scipy
6  | Matplotlib | 0.017413 | Histogram
7  | Matplotlib | 0.015618 | Ipython
8  | Matplotlib | 0.013761 | MatplotlibBasemap
9  | Matplotlib | 0.013207 | Python 2.7
10 | Matplotlib | 0.012982 | Legend
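
For readers without a Neo4j instance to hand, the same lookup can be mimicked over a plain edge list in Python. The weights are copied from the first five rows of the table above; the `related_topics` function and the edge-list layout are hypothetical stand-ins for the graph query, not part of the app.

```python
# Weights copied from the table above; edges are treated as undirected.
edges = [
    ("Matplotlib", "Plot", 0.065699),
    ("Matplotlib", "Numpy", 0.045678),
    ("Matplotlib", "Pandas", 0.029667),
    ("Matplotlib", "Python", 0.023623),
    ("Matplotlib", "Scipy", 0.023051),
]


def related_topics(name, edges, limit=10):
    """Mimic the related_to query: neighbours of `name`, heaviest first."""
    neighbours = []
    for a, b, weight in edges:
        if a == name:
            neighbours.append((b, weight))
        elif b == name:
            neighbours.append((a, weight))
    neighbours.sort(key=lambda pair: pair[1], reverse=True)
    return neighbours[:limit]


top = related_topics("Matplotlib", edges, limit=3)
print(top)
```

Because the pattern in the Cypher query has no arrow, it matches the relationship in either direction, and the sketch does the same by checking both ends of each edge.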

There are many more complex relationships you can define between topics like this, too. You can infer directionality in the relationship by looking at the local network, or you could start constructing hypergraphs using the extensive StackExchange API.
