Extracting Knowledge from Knowledge Graphs Using Facebook’s PyTorch-BigGraph
We use state-of-the-art deep learning tools to build a model that predicts a word using the surrounding words as labels.
By Sergey Zelvenskiy, Risk Engineer at Uber
Machine learning gives us the ability to train a model that converts data rows into labels in such a way that similar data rows are mapped to similar or the same labels.
For example, suppose we are building a spam filter for email messages. We have a lot of email messages, some of which are marked as SPAM and some as INBOX. We can build a model that learns to identify the SPAM messages. The messages to be marked as SPAM will be in some way similar to those that are already marked as SPAM.
The concept of similarity is vitally important for machine learning. In the real world, the concept of similarity is very specific to the subject matter and it depends on our knowledge.
The majority of mathematical models, on the other hand, assume that the concept of similarity is already defined. Typically, we represent our data as multidimensional vectors and measure the distance between the vectors.
Feature engineering is the process of converting our knowledge about the real-world objects into numerical representations of such objects. The objects we consider similar are represented as nearby vectors.
For example, suppose we are estimating house prices. Our experience tells us that a house is defined by the number of bedrooms, the number of bathrooms, age, square footage, location, etc. Houses located in the same neighborhood, of similar size and age, should cost about the same amount. We convert our knowledge of the housing market into numbers characterizing the house and use them to estimate its price.
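As a toy illustration of such manual feature engineering (the feature names and numbers below are made up for this example), the house might be encoded as a plain numeric vector:

```python
# A hand-crafted feature vector for a single house (illustrative values only).
# Houses we consider similar should end up as nearby vectors.
house = {
    "bedrooms": 3,
    "bathrooms": 2,
    "age_years": 15,
    "sq_footage": 1800,
    "neighborhood_id": 42,  # would normally be one-hot encoded or embedded, not a raw id
}

feature_vector = [
    house["bedrooms"],
    house["bathrooms"],
    house["age_years"],
    house["sq_footage"],
    house["neighborhood_id"],
]
print(feature_vector)  # [3, 2, 15, 1800, 42]
```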
Unfortunately, manual feature engineering, as described above, has limitations in terms of our ability to convert our knowledge into descriptive features.
Sometimes our knowledge is limited to the principles of similarity, not the exact characteristics that make objects similar. Often our knowledge about the real world is more complex than what can be represented in a simple tabular format. It is typically a graph of interconnected concepts and relationships.
Embedding models allow us to take raw data and automatically transform it into features based on our knowledge of these principles.
Word2Vec
Word2Vec is likely the most famous embedding model; it builds similarity vectors for words. In this case, our knowledge about the world is present in the form of a narrative, represented as texts, which are sequences of words.
Decades of hard work went into attempts to characterize words using manually defined features, with limited success. These solutions typically could not scale to the full body of knowledge or worked only in limited cases.
It all changed when Tomas Mikolov and his team at Google decided to build a model that works based on well-known principles of similarity: words used in similar contexts are typically similar. The context, in this case, is defined by the words located nearby.
With these principles in mind, we can build a graph out of our text by simply connecting each word to its neighbors within a predefined window (typically 5 words).
Now we have a graph of real-world objects (words) connected according to our knowledge.
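A minimal sketch of this windowed co-occurrence idea (the whitespace tokenization and the window size are simplifying assumptions, not the exact Word2Vec preprocessing):

```python
from collections import defaultdict

def build_word_graph(text, window=5):
    """Connect each word to its neighbors within a fixed window,
    producing edges weighted by co-occurrence counts."""
    words = text.lower().split()
    edges = defaultdict(int)
    for i, word in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                edges[(word, words[j])] += 1
    return edges

edges = build_word_graph("the quick brown fox jumps over the lazy dog")
print(list(edges.items())[:5])
```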
Most simple/complex word representation
We are still unable to build any model because the words are not represented in the form of vectors.
If all we need is to convert the words to numbers, there is a simple solution. Let’s take our dictionary and assign each word its position in the dictionary.
For example, if I have three words: cat, caterpillar, kitten.
My vector representation will be as follows: cat-[1], caterpillar-[2] and kitten-[3].
Unfortunately, this does not work. By assigning numbers like this, we implicitly introduce a distance between words. The distance between cat and caterpillar is 1, and the distance between cat and kitten is 2. We are saying that cat is more like caterpillar than kitten, which contradicts our knowledge.
An alternative representation, called one-hot encoding, would do this:
cat — [1,0,0]
caterpillar — [0,1,0]
kitten — [0,0,1]
This schema says that all the words are orthogonal to each other. We admit that we have no preconceived notion of word similarity. We will rely on our knowledge graph (as presented above), which incorporates our principles of word similarity, to build the embeddings.
In the real world, the dictionary size is much larger than just 3; typical dimensionality ranges from tens of thousands to millions. Not only do these vectors fail to represent our notion of similarity, they are also very bulky and can't really be used in practice.
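A small sketch contrasting the two encodings above, using the three-word toy vocabulary from this section:

```python
import numpy as np

vocab = ["cat", "caterpillar", "kitten"]

# Index encoding: implicitly claims "cat" is closer to "caterpillar" than to "kitten".
index_encoding = {word: i + 1 for i, word in enumerate(vocab)}

# One-hot encoding: every word is orthogonal to (equidistant from) every other word.
def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

print(index_encoding)          # {'cat': 1, 'caterpillar': 2, 'kitten': 3}
print(one_hot("caterpillar"))  # [0. 1. 0.]
```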
Building word embeddings
Our knowledge graph gives us a very large number of edges, and each edge can be interpreted as a training example: the start of the edge is the input and the end of the edge is the label. We are building a model that tries to predict a word using the words surrounding it as labels. It is typically done in one of two ways: we either reconstruct a word's vector from the sum of its neighbors, or we do the opposite and attempt to predict the neighbors from the word.
The details of the model are out of scope and are well described in many other posts. Fundamentally, the team used a basic encoder/decoder model that learns a projection from the high-dimensional space (millions of dimensions) to a space of limited dimensionality (typically 300) and back to the high-dimensional space. The goal of the training is to preserve as much information as possible during this compression (i.e., to minimize the cross-entropy).
This concept of learning a low-dimensional projection from a sparse, orthogonal dataset into a denser low-dimensional space is the foundation of many other embedding training models.
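A rough PyTorch sketch of this encoder/decoder idea in the skip-gram style (the vocabulary size, embedding dimension, and the toy training step below are assumptions for illustration, not the original Word2Vec code):

```python
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size, dim=300):
        super().__init__()
        self.encoder = nn.Embedding(vocab_size, dim)  # high-dimensional -> 300 dimensions
        self.decoder = nn.Linear(dim, vocab_size)     # 300 dimensions -> scores over the vocabulary

    def forward(self, center_word_ids):
        return self.decoder(self.encoder(center_word_ids))

model = SkipGram(vocab_size=10_000)
loss_fn = nn.CrossEntropyLoss()  # minimize cross-entropy, as described above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One toy training step: predict a neighbor word from a center word.
center = torch.tensor([42])    # hypothetical center-word id
neighbor = torch.tensor([7])   # hypothetical neighbor-word id
optimizer.zero_grad()
loss = loss_fn(model(center), neighbor)
loss.backward()
optimizer.step()
```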
The model is typically trained on sources like Google's web crawl, Twitter datasets, or Wikipedia. We consume the world's knowledge and build our word embeddings from it.
Properties of Word2Vec embeddings
The important properties of Word2Vec embeddings are the ability to preserve relationships and to expose structural equivalence.
The plots below show the connections between countries and capitals, or between other related concepts.
Essentially, it allows us to do algebra on words, like this:
king - man + woman = queen
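With a pre-trained model, this word algebra can be reproduced in a few lines. The snippet below uses gensim's downloader as assumed tooling; it is not part of the original Word2Vec code:

```python
import gensim.downloader as api

# Pre-trained Word2Vec vectors trained on Google News (large download on first use).
model = api.load("word2vec-google-news-300")

# king - man + woman ~= queen
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# e.g. [('queen', 0.71...)]
```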
Using Word Embeddings
Word embeddings drastically improve tasks like text classification, named entity recognition, and machine translation.
Here is more info: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
Node2Vec
Node2Vec, by A. Grover and J. Leskovec, is a model that analyzes homogeneous weighted graphs by expanding on the ideas behind Word2Vec. The idea behind the paper is that we can characterize a graph node by exploring its surroundings. Our understanding of the world is based on two principles: homophily and structural equivalence.
Homophily
Similar nodes are located nearby.
Examples:
- social network: we are more connected to people like us.
- business location: financial firms, doctors' offices, and marketing companies tend to be located on the same street
- organizational structure: people on the same team share similar traits
Structural equivalence
Different communities share the same structure:
- organizational structure: while the teams may have only weak connections between them, the structure of a team (manager, senior members, newcomers, junior members) is repeated from team to team.
In order to incorporate both principles into our embeddings, the authors of the Node2Vec paper propose a random-walk approach that combines breadth-first sampling to capture structural equivalence and depth-first sampling to capture homophily.
As we can see, the node (u) acts as a hub within the group (s1, s2, s3, s4), which is similar to s6 being a hub for (s7, s5, s8, s9). We discover the (s1, s2, s3, s4) community through DFS-like exploration, and the (u)<->(s6) structural similarity through BFS-like exploration of each node's immediate neighborhood.
We learn about each node by exploring its surroundings. Such exploration translates the graph into a large number of sequences (sentences) produced by random walks that combine BFS and DFS exploration. The BFS/DFS mix is controlled by the weights of the graph edges as well as by the hyperparameters of the model.
Once we have the full body of sequences (sentences), we can apply the Word2Vec method the same way it was applied to texts. It produces graph node embeddings based on the principles we defined as well as on the knowledge contained in the graph.
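A simplified sketch of this pipeline, using plain uniform random walks (DeepWalk-style) rather than node2vec's biased walks, with networkx and gensim as assumed tooling:

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walks(graph, walks_per_node=10, walk_length=20):
    """Turn a graph into 'sentences' of node names via uniform random walks."""
    walks = []
    for _ in range(walks_per_node):
        for node in graph.nodes():
            walk = [node]
            while len(walk) < walk_length:
                neighbors = list(graph.neighbors(walk[-1]))
                if not neighbors:
                    break
                walk.append(random.choice(neighbors))
            walks.append([str(n) for n in walk])
    return walks

graph = nx.les_miserables_graph()  # the coappearance network referenced below
walks = random_walks(graph)

# Treat the walks as sentences and train Word2Vec on them.
model = Word2Vec(walks, vector_size=64, window=5, min_count=0, sg=1)
print(model.wv.most_similar("Valjean", topn=3))
```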
Node2Vec Properties
Node2Vec representations improve models for clustering and classification of nodes. The similarity learned by the embeddings is helpful for tasks like fraud detection.
Complementary visualizations of Les Misérables coappearance network generated by node2vec with label colors reflecting homophily (top) and structural equivalence (bottom). — https://arxiv.org/pdf/1607.00653.pdf
Node2Vec shows significant improvements in link prediction. It improved the ability to reconstruct a graph from which some percentage of the edges had been removed. The link prediction evaluation process is discussed later in this post.
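A rough sketch of the scoring side of that evaluation (the dot-product edge score and the toy data here are assumptions, not the exact protocol discussed later): given embeddings trained on a graph with a fraction of edges held out, a held-out edge should score higher than a sampled non-edge.

```python
import numpy as np

def edge_score(embeddings, u, v):
    """Score a candidate edge as the dot product of its two node embeddings
    (the Hadamard product or cosine similarity are common alternatives)."""
    return float(np.dot(embeddings[u], embeddings[v]))

def link_prediction_accuracy(embeddings, held_out_edges, negative_edges):
    """Fraction of held-out (true) edges that score higher than a sampled non-edge."""
    correct = 0
    for (u, v), (x, y) in zip(held_out_edges, negative_edges):
        if edge_score(embeddings, u, v) > edge_score(embeddings, x, y):
            correct += 1
    return correct / len(held_out_edges)

# Toy usage with random embeddings and made-up node ids.
rng = np.random.default_rng(0)
embeddings = {i: rng.normal(size=16) for i in range(6)}
print(link_prediction_accuracy(embeddings,
                               held_out_edges=[(0, 1), (2, 3)],
                               negative_edges=[(0, 5), (2, 4)]))
```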