Do more with Python: Creating a graph application with Python, Neo4j, Gephi, and Linkurious
Here is how to build a neat app with graph visualization of Python and related topics from Packt and StackOverflow, combining Gephi, Linkurious, and Neo4j.
By Greg Roberts, Packt Publishing
I love Python, and to celebrate Packt’s Python Week, I’ve spent some time developing an app using some of my favourite tools. The app is a graph visualisation of Python and related topics, and it also shows where all our content fits in. The topics are all StackOverflow tags, related by their co-occurrence in questions on the site.
The app is available to view at gregroberts.github.io/. In this post, I’m going to discuss some of the techniques I used to construct the underlying dataset, and how I turned it into an online application.
Graphs, not charts
Graphs are an incredibly powerful tool for analysing and visualising complex data. In recent years, many different graph database engines have been developed to make use of this novel manner of representing data. These databases offer many benefits over traditional, relational databases because of how the data is stored and accessed.
Here at Packt, I use a Neo4j graph to store and analyse data about our business. Using the Cypher query language, it’s easy to express complicated relations between different nodes succinctly.
It’s not just the technical aspect of graphs that makes them appealing to work with. Seeing the connections between bits of data visualised explicitly, as in a graph, helps you to see the data in a different light and make connections that you might not have spotted otherwise.
This graph has many uses at Packt, from customer segmentation to product recommendations. In the next section, I describe the process I use to generate recommendations from the database.
Make the connection
For product recommendations, I use what’s known as a hybrid filter. This considers both content-based filtering (products x and y are about the same topic) and collaborative filtering (people who bought x also bought y). Each of these methods has strengths and weaknesses, so combining them into one algorithm provides a more accurate signal.
The collaborative aspect is straightforward to implement in Cypher. For a particular product, we want to find out which other product is most frequently bought alongside it. We have all our products and customers stored as nodes, and purchases are stored as edges. Thus, the Cypher query we want looks like this:
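// Find products two purchase hops away: product, customer, product
MATCH (n:Product {title: 'Learning Cypher'})-[r:purchased*2]-(m:Product)
WHERE m <> n
// Count the shared purchase paths per candidate before normalising
WITH n, m, count(DISTINCT r) AS shared
RETURN m.title AS suggestion,
       toFloat(shared) / (n.purchased + m.purchased) AS alsoBought
ORDER BY alsoBought DESC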
This query very efficiently returns the products most commonly purchased alongside the given title. When calculating the weight, we divide by the total units sold of both titles, so we get a proportion returned. We do this so we don’t just get the titles with the most units; we’re effectively calculating the size of the intersection of the two titles’ audiences relative to their overall audience size.
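To make the weighting concrete, here is a minimal Python sketch of the same calculation on toy data. It assumes each customer buys a title at most once, so units sold equals the number of buyers; the customer names and the two comparison titles are hypothetical and exist only for illustration.

```python
def also_bought_score(buyers_x, buyers_y):
    """Shared buyers divided by the combined units sold of both titles."""
    shared = len(buyers_x & buyers_y)
    total = len(buyers_x) + len(buyers_y)
    return shared / total if total else 0.0


# Hypothetical purchase data: title -> set of customers who bought it
purchases = {
    "Learning Cypher": {"ann", "bob", "cat"},
    "Niche Graph Book": {"ann", "bob"},
    "Bestseller": {"ann", "bob", "cat", "dan", "eve", "fay",
                   "gus", "hal", "ian", "jan", "kim", "lee"},
}

target = "Learning Cypher"
scores = {
    title: also_bought_score(purchases[target], buyers)
    for title, buyers in purchases.items()
    if title != target
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

The bestseller shares more buyers with the target in absolute terms (three against two), but once the overlap is divided by the combined audience size, the niche title ranks first.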
The content side of the algorithm looks very similar:
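// Find products that share a topic: product, topic, product
MATCH (n:Product {title: 'Learning Cypher'})-[r:is_about*2]-(m:Product)
WHERE m <> n
// Count the shared topic paths per candidate before normalising
WITH n, m, count(DISTINCT r) AS shared
RETURN m.title AS suggestion,
       toFloat(shared) / (size(n.topics) + size(m.topics)) AS alsoAbout
ORDER BY alsoAbout DESC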
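How the two scores are merged into a single ranking is not spelled out here, so the following is only a sketch of one common option: a weighted sum. The function name, the `alpha` parameter, and the titles A, B, and C are all hypothetical, and in practice the two scores may need rescaling before they are comparable.

```python
def hybrid_scores(also_bought, also_about, alpha=0.5):
    """Blend collaborative and content scores with a weighted sum.

    alpha is an assumed tuning knob: 1.0 uses only the collaborative
    score, 0.0 uses only the content score.
    """
    titles = set(also_bought) | set(also_about)
    return {
        title: alpha * also_bought.get(title, 0.0)
        + (1 - alpha) * also_about.get(title, 0.0)
        for title in titles
    }


# Hypothetical outputs of the two queries for one product
also_bought = {"A": 0.4, "B": 0.2}
also_about = {"B": 0.5, "C": 0.3}

blended = hybrid_scores(also_bought, also_about)
recommendations = sorted(blended, key=blended.get, reverse=True)
print(recommendations)
```

A title that scores moderately on both signals (B) can outrank one that scores well on only a single signal (A or C), which is the point of combining the filters.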
Implicit in this algorithm is knowledge that a title is_about a topic of some kind. This could be done manually, but where’s the fun in that?
In Packt’s domain there already exists a huge, well-moderated corpus of technology concepts and their usage: StackOverflow. The tagging system on StackOverflow not only tells us about all the topics developers across the world are using, it also tells us how those topics are related, by looking at the co-occurrence of tags in questions. So in our graph, StackOverflow tags are nodes in their own right, which represent topics. These nodes are connected via edges, which are weighted to reflect their co-occurrence on StackOverflow:
edge_weight(n, m) = (# of questions tagged with both n and m) / (# of questions tagged with n or m)
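As a rough illustration of this weighting, the sketch below computes it from a handful of made-up questions. It reads "tagged with n or m" as an inclusive or (the union of the two question sets), which makes the weight the Jaccard index of the two tags; the function name and the sample questions are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_weights(questions):
    """Return {(tag_a, tag_b): weight} for every pair of tags seen together."""
    tag_counts = Counter()
    pair_counts = Counter()
    for tags in questions:
        unique = sorted(set(tags))  # sort so each pair has one canonical order
        tag_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    weights = {}
    for (a, b), both in pair_counts.items():
        # Questions tagged with a or b, counting shared questions once
        either = tag_counts[a] + tag_counts[b] - both
        weights[(a, b)] = both / either
    return weights


# Hypothetical questions, each represented by its list of tags
questions = [
    ["python", "matplotlib", "plot"],
    ["python", "matplotlib"],
    ["python", "numpy"],
    ["matplotlib", "plot"],
]

weights = cooccurrence_weights(questions)
for pair, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(pair, round(weight, 3))
```

Each resulting pair and weight would correspond to one weighted edge between two tag nodes in the graph.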
So, to find topics related to a given topic, we could execute a query like this:
MATCH (n:StackOverflowTag {name: 'Matplotlib'})-[r:related_to]-(m:StackOverflowTag)
RETURN n.name, r.weight, m.name ORDER BY r.weight DESC LIMIT 10
Which would return the following:
    n.name      r.weight  m.name
    ----------  --------  -----------------
 1  Matplotlib  0.065699  Plot
 2  Matplotlib  0.045678  Numpy
 3  Matplotlib  0.029667  Pandas
 4  Matplotlib  0.023623  Python
 5  Matplotlib  0.023051  Scipy
 6  Matplotlib  0.017413  Histogram
 7  Matplotlib  0.015618  Ipython
 8  Matplotlib  0.013761  MatplotlibBasemap
 9  Matplotlib  0.013207  Python 2.7
10  Matplotlib  0.012982  Legend
There are many more complex relationships you can define between topics like this, too. You can infer directionality in the relationship by looking at the local network, or you could start constructing hypergraphs using the extensive StackExchange API.