Extracting Knowledge from Knowledge Graphs Using Facebook’s PyTorch-BigGraph
We are using state-of-the-art deep learning tools to build a model that predicts a word using the surrounding words as labels.
Below we discuss the paper “PyTorch-BigGraph: A Large-Scale Graph Embedding System” (referred to as PBG from here on) as well as the related family of papers.
Knowledge graphs are a special type of graph that incorporates known entities as well as different types of edges. They represent structural knowledge.
In knowledge graphs, nodes are connected via different types of relationships.
The goal of training is to produce embeddings that represent our knowledge. Once we have the embeddings of the nodes, it should be easy to determine whether the corresponding nodes are connected (or should be connected) in our knowledge graph via a specific type of relationship.
Different models propose different ways of comparing embeddings. The simplest models compare embedding vectors using cosine similarity or the dot product. More complex models apply different weighting schemes to the elements of the vector before comparison. Weighting schemes are represented as matrices and are specific to the type of relationship. We can learn the weighting matrices as part of our training.
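The comparison functions mentioned above can be sketched in a few lines of plain Python. This is only an illustration: the embeddings, the `follows` relation, and the choice of a per-element (diagonal) weighting instead of a full matrix are assumptions made for the example, not values or code from the PBG paper.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity: dot product of the length-normalised vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def weighted_score(src, rel_weights, dst):
    """Relation-specific score: rescale each element of the source
    embedding by the relation's weights, then take the dot product."""
    transformed = [w * s for w, s in zip(rel_weights, src)]
    return dot(transformed, dst)

# Hypothetical 3-dimensional embeddings and one relation's weight vector.
alice = [1.0, 0.0, 1.0]
bob = [1.0, 1.0, 0.0]
follows = [2.0, 1.0, 0.5]

print(dot(alice, bob))                      # 1.0
print(round(cosine(alice, bob), 3))         # 0.5
print(weighted_score(alice, follows, bob))  # 2.0
```

The same pair of nodes can score differently under different relationship types simply because each type carries its own weights.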
We need a way to compute a similarity score between nodes and use this score to estimate the likelihood that these nodes are connected.
Representation of the knowledge graph
Knowledge graphs can be represented as an adjacency tensor. To build it, we use one square matrix for every type of relationship. Each matrix has as many rows and columns as there are nodes in the graph. An entry of the matrix is 1 if the corresponding nodes are connected via this type of relationship and 0 if not. It’s pretty clear that these matrices will be very large and very sparse.
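To make the adjacency tensor concrete, here is a small sketch that builds both the dense form and a sparse form for a toy graph. The node names and relationship types are hypothetical and chosen only for illustration.

```python
# Hypothetical toy graph: (source, relation, destination) triples.
edges = [
    ("alice", "follows", "bob"),
    ("bob", "follows", "carol"),
    ("alice", "likes", "carol"),
]

nodes = sorted({name for s, _, d in edges for name in (s, d)})
relations = sorted({r for _, r, _ in edges})
node_id = {name: i for i, name in enumerate(nodes)}

# Dense form: one N x N matrix of 0/1 values per relation type.
n = len(nodes)
tensor = {r: [[0] * n for _ in range(n)] for r in relations}
for s, r, d in edges:
    tensor[r][node_id[s]][node_id[d]] = 1

# Sparse form: store only the coordinates of the 1 entries.
sparse = {(r, node_id[s], node_id[d]) for s, r, d in edges}

total_cells = len(relations) * n * n
print(tensor["follows"])
print(len(sparse), "non-zero cells out of", total_cells)
```

Even in this tiny example only 3 of 18 cells are non-zero, which is why real systems store the edge list rather than the dense tensor.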
To learn our embeddings, we need to convert each node into a fixed-size vector. Let’s discuss the properties of “good” embeddings.
Good embeddings represent our knowledge expressed in the form of graph edges. Embedding vectors located “nearby” should represent nodes that are more likely to be connected. Based on this observation, we will train our model in such a way that the similarity score of the connected nodes marked as 1 in the adjacency tensor would be higher and the similarity score of the unconnected nodes marked as 0 in the adjacency tensor would be lower.
We are training our embeddings to reconstruct the edges of the knowledge graph from node embeddings with minimum loss of information.
Our training approach has a bit of a problem. We are trying to learn to distinguish between 1 (nodes are connected) and 0 (nodes are not connected) using the graph data. Yet the only data we actually have are the pairs of nodes that are connected. It’s like learning to distinguish cats from dogs by looking only at cats.
Negative sampling is a technique that expands our dataset and provides better training data by using a very simple observation: any randomly selected pair of nodes that is not connected in our graph represents a data sample with label 0. For the purposes of training, the PBG paper proposes to read each edge of the graph and then generate a negative sample in which one of the nodes is replaced with a randomly selected node.
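A minimal sketch of this corruption step might look as follows. The function name `corrupt_edge`, the toy graph, and the decision to reject candidates that happen to be real edges are assumptions for illustration, not details taken from PBG.

```python
import random

def corrupt_edge(edge, all_nodes, known_edges, rng):
    """Return a negative sample: the same edge with its source or
    destination swapped for a random node, skipping known true edges."""
    src, rel, dst = edge
    while True:
        replacement = rng.choice(all_nodes)
        if rng.random() < 0.5:
            candidate = (replacement, rel, dst)  # corrupt the source
        else:
            candidate = (src, rel, replacement)  # corrupt the destination
        if candidate not in known_edges:
            return candidate

# Hypothetical toy graph.
nodes = ["alice", "bob", "carol", "dave"]
edges = {("alice", "follows", "bob"), ("bob", "follows", "carol")}

rng = random.Random(0)
negative = corrupt_edge(("alice", "follows", "bob"), nodes, edges, rng)
print(negative)
```

The relationship type is kept unchanged, so the negative sample differs from the positive edge in exactly one node.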
For each edge, we can assign a positive similarity score and a negative similarity score. The positive similarity score is calculated based on the node embeddings and the edge relationship type weights. The negative similarity score is calculated the same way, but one of the nodes of the edge is corrupted and replaced by a random node.
The ranking loss function is what we optimize during training. It is constructed to establish a configurable margin between the positive and negative similarity scores for all nodes in the graph and all relationship types. The ranking loss is a function of the node embeddings and the relationship-specific weights, which are learned by minimizing the ranking loss.
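As a sketch, a margin-based ranking loss for a single positive/negative pair can be written as a hinge on the score difference. The function name and the default margin of 0.1 are illustrative assumptions.

```python
def margin_ranking_loss(pos_score, neg_score, margin=0.1):
    """Hinge-style ranking loss: zero once the positive edge outscores
    the negative edge by at least `margin`, linear penalty otherwise."""
    return max(0.0, margin - pos_score + neg_score)

# Positive edge wins by 0.7, well beyond the margin: no penalty.
easy = margin_ranking_loss(0.9, 0.2)
# Positive edge wins by only 0.05, less than the margin: small penalty.
hard = margin_ranking_loss(0.5, 0.45)
print(easy, hard)
```

Summing this quantity over all edges and their negative samples gives the total loss that training tries to minimize.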
Now we have everything we need to train the embedding models:
- data — negative and positive edges
- labels — (1 or 0)
- function to optimize (it can be ranking loss, more traditional logistic regression loss or cross entropy softmax loss used in word2vec)
- our parameters, which are embeddings as well as the weight matrices for the similarity score function.
Now it’s a matter of using calculus to find the parameters (the embeddings and the weights) that minimize our loss function.
Stochastic gradient descent
The essence of stochastic gradient descent is to gradually adjust the parameters so that the value of the loss function gradually decreases. To do this, we read the data in small batches and use each batch to calculate an update to the parameters that reduces the loss.
There are multiple ways of doing stochastic gradient descent. The PBG paper uses Adagrad, one of the flavors of stochastic gradient descent, to find the parameters that minimize our loss function. I highly recommend this blog to understand all the flavors of gradient descent: http://ruder.io/optimizing-gradient-descent/index.html#adagrad
Software packages like TensorFlow and PyTorch provide out-of-the-box implementations of the different flavors.
The key element of gradient descent is updating the parameters of the model many times until the loss function stops decreasing. At the end of training, we expect to have embeddings and scoring functions that satisfy the goal of incorporating our knowledge.
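To show what one flavor of this update looks like, here is a sketch of the Adagrad rule applied to a toy quadratic loss. The loss, the learning rate, and the number of steps are assumptions chosen for illustration; in practice one would rely on the optimizer shipped with the framework.

```python
import math

def adagrad_step(params, grads, accum, lr=0.5, eps=1e-10):
    """One Adagrad update, in place: each parameter gets its own step
    size, which shrinks as its squared gradients accumulate."""
    for i, g in enumerate(grads):
        accum[i] += g * g
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)

# Toy loss (an assumption for illustration): sum of (p - target)^2.
target = [3.0, -1.0]
params = [0.0, 0.0]
accum = [0.0, 0.0]

def loss(p):
    return sum((a - b) ** 2 for a, b in zip(p, target))

start = loss(params)
for _ in range(200):
    grads = [2 * (p - t) for p, t in zip(params, target)]
    adagrad_step(params, grads, accum)
end = loss(params)
print(start, "->", end)
```

Parameters that receive large or frequent gradients take smaller steps over time, which suits embeddings where some nodes appear in far more edges than others.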
HogWild — Distributed Stochastic Gradient Descent
Going distributed with stochastic gradient descent poses a challenge. If several workers adjust the parameters simultaneously to minimize the loss function, it seems there needs to be some sort of locking mechanism. In traditional multithreaded development, we lock our data during the update via pessimistic or optimistic locking. Locking slows down the progress but ensures the correctness of our results.
Luckily, the Hogwild paper showed that we can do without a locking mechanism, provided each update touches only a small part of the parameters. We can simply read data in batches, calculate the parameter adjustments, and save them in the shared parameter space without coordinating with other writers. The Hogwild algorithm does exactly that. Training can be distributed, and each Hogwild thread can update our parameters without regard for other threads.
I recommend this blog to get more info on HogWild: https://medium.com/@krishna_srd/parallel-machine-learning-with-hogwild-f945ad7e48a4
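The lock-free idea can be sketched with ordinary Python threads that all write to one shared list of weights. The toy objective, the thread count, and the step size are assumptions for illustration, and CPython threads do not run truly in parallel, so this shows the pattern rather than the speed-up.

```python
import random
import threading

# Hypothetical toy problem: drive each weight toward its target value.
targets = [1.0, 2.0, 3.0, 4.0]
weights = [0.0] * len(targets)  # shared by all threads, no lock

def worker(seed, steps=5000, lr=0.1):
    rng = random.Random(seed)
    for _ in range(steps):
        i = rng.randrange(len(weights))         # sparse: touch one weight
        grad = 2.0 * (weights[i] - targets[i])  # d/dw of (w - target)^2
        weights[i] -= lr * grad                 # unsynchronised write

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print([round(w, 3) for w in weights])
```

An occasional overwritten update costs a little progress, but because every update still moves a weight toward its target, the threads converge without ever taking a lock.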
When the graph spans billions of nodes and trillions of edges, it’s hard to fit all the parameters in the memory of one machine. Training is also slow if we must wait for each batch to finish its calculations before starting the next one. With a graph this large, it is beneficial to parallelize the training and learn the parameters simultaneously. The Facebook team addressed this problem in the PBG paper.
Nodes are split by entity types and then organized into partitions:
- The nodes are partitioned into P buckets and edges are partitioned into PxP buckets. Entity types with small cardinality do not have to be partitioned.
- Training is done in parallel with the following constraints:
  - For each edge bucket (p1, p2) except the first, an edge bucket (p1, *) or (*, p2) must have been trained in a previous iteration.
  - Multiple edge buckets can be trained in parallel as long as they operate on disjoint sets of partitions.
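The bucketing scheme and the disjointness constraint can be sketched as follows. Assigning nodes to partitions by taking the node id modulo P is an assumption made for the example, as are the function names; the paper does not prescribe this particular assignment.

```python
from collections import defaultdict

def partition_of(node_id, num_partitions):
    """Assign a node to a partition; modulo is an illustrative choice."""
    return node_id % num_partitions

def bucket_edges(edges, num_partitions):
    """Group (src, dst) edges into buckets keyed by (p1, p2)."""
    buckets = defaultdict(list)
    for src, dst in edges:
        key = (partition_of(src, num_partitions),
               partition_of(dst, num_partitions))
        buckets[key].append((src, dst))
    return dict(buckets)

def can_train_together(bucket_a, bucket_b):
    """Two buckets may run in parallel only if they share no partition."""
    return set(bucket_a).isdisjoint(bucket_b)

# Hypothetical edge list over integer node ids.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
buckets = bucket_edges(edges, num_partitions=2)
print(buckets)
print(can_train_together((0, 0), (1, 1)))  # True
print(can_train_together((0, 1), (1, 1)))  # False
```

With P = 2, buckets (0, 0) and (1, 1) touch different partitions and can be handed to different machines, while (0, 1) and (1, 1) both need partition 1 and must wait for each other.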
Training happens in parallel on multiple machines, with multiple threads on each machine. Each thread calculates the parameter update based on its allocated bucket and batch of data. The lock server distributes training buckets according to the established constraints. Note that the lock server only controls the distribution of buckets and their data batches across the Hogwild threads, not the parameter updates themselves.
Characteristics of PBG embeddings
Knowledge embeddings can be used in two ways:
- Link predictions.
Link predictions help to fill the gaps in our knowledge by finding nodes that are likely connected or about to be connected.
Example: The graph represents customers and products bought by the customers. Edges are purchase orders. Embeddings can be used to form the next purchase recommendations.
- Learning the properties of nodes
Embeddings can be used as feature vectors supplied as an input to all kinds of classification models. The learned classes can fill the gaps in our knowledge about the properties of the objects.
Evaluating link predictions using MRR/Hits10
This process is described in the paper — “Learning Structured Embeddings of Knowledge Bases” and later was used as the way to measure the quality of embedding models in many other papers including Facebook PBG.
The algorithm takes a subset of test edges and performs the following:
- Corrupt the edge by replacing the beginning or end of the edge with a negatively sampled node.
- Train the model on a partially corrupted dataset
- Calculate aggregate MRR and Hits10 metrics for the edges from the test dataset.
Mean reciprocal rank
MRR, or mean reciprocal rank, is a measure of search quality. We take the uncorrupted node of a test edge and find its “nearest neighbors”, with distance defined by the similarity score. We rank the neighbors by that score and expect the node that was actually connected to appear at the top of the ranking. The reciprocal rank is 1 when the true node is ranked first and shrinks to 1/k when it is ranked k-th, so MRR drops whenever the true node is not ranked on top.
An alternative measure is Hits10 (often written Hits@10), the fraction of test edges for which the true node, the one removed by corruption, appears among the top 10 nearest neighbors.
The PBG paper shows that on many data sets the MRR metric gradually increases as we allocate more resources to training. Up to a point, parallelism does not affect the quality of ranking, yet it saves a tremendous amount of time.
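Both metrics are easy to compute once every test edge has a list of candidate scores. The sketch below assumes each query is given as a list of similarity scores plus the index of the true node; the scores are made up, and a cutoff of 2 is used instead of 10 only because the toy candidate lists are short.

```python
def rank_of_true(scores, true_index):
    """Rank of the true candidate, where rank 1 is the highest score."""
    true_score = scores[true_index]
    return 1 + sum(1 for s in scores if s > true_score)

def evaluate(queries, k=10):
    """queries: list of (candidate_scores, index_of_true_node) pairs.
    Returns (mean reciprocal rank, fraction of queries with rank <= k)."""
    ranks = [rank_of_true(scores, idx) for scores, idx in queries]
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits_at_k = sum(1 for r in ranks if r <= k) / len(ranks)
    return mrr, hits_at_k

# Hypothetical similarity scores for three test edges.
queries = [
    ([0.9, 0.1, 0.3], 0),       # true node ranked 1st
    ([0.2, 0.8, 0.5], 2),       # true node ranked 2nd
    ([0.4, 0.6, 0.7, 0.1], 0),  # true node ranked 3rd
]
mrr, hits_at_2 = evaluate(queries, k=2)
print(round(mrr, 3), round(hits_at_2, 3))  # 0.611 0.667
```

MRR rewards placing the true node as high as possible, while the hits metric only asks whether it made the cutoff.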
Further evaluation can be performed by simply exploring and visualizing the graphs.
The image above is a 2D projection of the embeddings built from the Freebase knowledge graph. As we can see, similar nodes are grouped together. Countries, numbers, scientific journals, and professions form visible clusters even in the carefully prepared 2D projection.
Limitations of the knowledge graph models
A knowledge graph as described above represents a static snapshot of our knowledge. It does not reflect how that knowledge was built up. In the real world, we learn by observing temporal patterns. While it’s possible to learn the similarity between node A and node B, it is hard to see the similarity between node A and node C as it was 3 years ago.
For example, if we look at the forest for one day we will see the similarity between two large sequoia trees. Yet it will be hard to understand which sapling will grow into a large sequoia tree without long term observations of the forest.
Ideally, we need to explore the series of knowledge graphs built at different points in time and then build the embeddings, which will incorporate generational similarities.
Bio: Sergey Zelvenskiy is a risk engineer at Uber.
Original. Reposted with permission.