Do more with Python: Creating a graph application with Python, Neo4j, Gephi, and Linkurious

Here is how to build a neat app with graph visualization of Python and related topics from Packt and StackOverflow, combining Gephi, Linkurious, and Neo4j.



So we have our topics, but we still need to connect our content to topics. To do this, I’ve used a two stage process.

Step 1 – Parsing out the topics

We take all the copy (words) pertaining to a particular product as a document representing that product. This includes the title, chapter headings, and all the copy on the website. We use this because it’s already been optimised for search, and should thus carry a fair representation of what the title is about. We then parse this document and keep all the words which match the topics we’ve previously imported.

#...code for fetching all the copy for all the products
key_re=  '\W(%s)\W' % '|'.join(re.escape(i) for i in topic_keywords)
for i in documents:
tags = re.findall(key_re, i[‘copy’])
i['tags'] = map(lambda x: tag_lookup[x],tags)

Having done this for each product, we have a bag of words representing each product, where each word is a recognised topic.

Step 2 – Finding the information

From each of these documents, we want to know the topics which are most important for that document. To do this, we use the Tf-idf algorithm. Tf-idf stands for term frequency, inverse document frequency. The algorithm takes the number of times a term appears in a particular document, and divides it by the proportion of the documents that word appears in. The term frequency factor boosts terms which appear often in a document, whilst the inverse document frequency factor gets rid of terms which are overly common across the entire corpus (for example, the term ‘programming’ is common in our product copy, and whilst most of the documents ARE about programming, this doesn’t provide much discriminating information about each document).

To do all of this, I use python (obviously) and the excellent scikit-learn library. Tf-idf is implemented in the class sklearn.feature_extraction.text.TfidfVectorizer. This class has lots of options you can fiddle with to get more informative results.

importsklearn.feature_extraction.text as skt
tagger = skt.TfidfVectorizer(input = 'content',
encoding = 'utf-8',
decode_error = 'replace',
strip_accents = None,
analyzer = lambda x: x,
ngram_range = (1,1),
max_df = 0.8,
min_df = 0.0,
norm =  'l2',
sublinear_tf = False)

It’s a good idea to use the min_df&max_df arguments of the constructor so as to cut out the most common/obtuse words, to get a more informative weighting. The ‘analyzer’ argument tells it how to get the words from each document, in our case, the documents are already lists of normalised words, so we don’t need anything additional done

#create vectors of all the documents
vectors = tagger.fit_transform(map(lambda x: x['tags'],rows)).toarray()
#get back the topic names to map to the graph
t_map = tagger.get_feature_names()
jobs = []
for ind, vec in enumerate(vectors):
features = filter(lambda x: x[1]>0, zip(t_map,vec))
doc = documents[ind]
for topic, weight in features:
job = ‘’’MERGE (n:StackOverflowTag {name:’%s’})
MERGE (m:Product {id:’%s’})
CREATE UNIQUE (m)-[:is_about {source:’tf_idf’,weight:%d}]-(n)
’’’ % (topic, doc[‘id’], weight)
jobs.append(job)

We then execute all of the jobs using Py2neo’s Batch functionality.

Having done all of this, we can now relate products to each other in terms of what topics they have in common:

MATCH (n:Product {isbn10:'1783988363'})-[r:is_about]-(a)-[q:is_about]-(m:Product {isbn10:'1783289007'})
WITH a.name as topic, r.weight+q.weight AS weight
RETURN topic
ORDER BY weight DESC limit 6

Which returns:

  | topic
---+------------------
1 | Machine Learning
2 | Image
3 | Models
4 | Algorithm
5 | Data
6 | Python

Huzzah! I now have a graph into which I can throw any piece of content about programming or software, and it will fit nicely into the network of topics we’ve developed.

Take a breath

So, that’s how the graph came to be. To communicate with Neo4j from Python, I use the excellent py2neo module, developed by Nigel Small. This module has all sorts of handy abstractions to allow you to work with nodes and edges as native Python objects, and then update your Neo instance with any changes you’ve made.

The graph I’ve spoken about is used for many purposes across the business, and has grown in size and scope significantly over the last year. For this project, I’ve taken from this graph everything relevant to Python.

I started by getting all of our content which is_about Python, or about a topic related to python:

titles = [i.n for i in graph.cypher.execute('''MATCH (n)-[r:is_about]-(m:StackOverflowTag {name:'Python'}) return distinct n''')]
t2 = [i.n for i in graph.cypher.execute('''MATCH (n)-[r:is_about]-(m:StackOverflowTag)-[:related_to]-(m:StackOverflowTag {name:'Python'}) where has(n.name) return distinct n''')]
titles.extend(t2)

then hydrated this further by going one or two hops down each path in various directions, to get a large set of topics and content related to Python.

Visualising the graph

Since I started working with graphs, two visualisation tools I’ve always used are Gephi and Sigma.js.

Gephi is a great solution for analysing and exploring graphical data, allowing you to apply a plethora of different layout options, find out more about the statistics of the network, and to filter and change how the graph is displayed.

Sigma.js is a lightweight JavaScript library which allows you to publish beautiful graph visualisations in a browser, and it copes very well with even very large graphs.

Gephi has a great plugin which allows you to export your graph straight into a web page which you can host, share and adapt.

More recently, Linkurious have made it their mission to bring graph visualisation to the masses. I highly advise trying the demo of their product. It really shows how much value it’s possible to get out of graph based data. Imagine if your Customer Relations team were able to do a single query to view the entire history of a case or customer, laid out as a beautiful graph, full of glyphs and annotations.

Linkurious have built their product on top of Sigma.js, and they’ve made available much of the work they’ve done as the open source Linkurious.js. This is essentially Sigma.js, with a few changes to the API, and an even greater variety of plugins. On Github, each plugin has an API page in the wiki and a downloadable demo. It’s worth cloning the repository just to see the things it’s capable of!