LinkedIn recently published a post describing the LinkedIn Knowledge Graph (LKG).
It is an impressive achievement, connecting 450M members, 190M historical job listings, 9M companies, 200+ countries, 35K skills in 19 languages, 28K schools, 1.5K fields of study, 600+ degrees, 24K titles in 19 languages, and 500+ certificates, among other entities, as of Oct 6, 2016.
I had an opportunity to ask LinkedIn a few questions, and here are the answers from the leaders of the LKG project and authors of the above post:
- Qi He, Senior Engineering Manager - Machine Learning & Data Mining, Head of Data Standardization at LinkedIn
- Bee-Chung Chen, Senior Staff Engineer & Applied Researcher at LinkedIn
- Deepak Agarwal, VP of Engineering, Head of Relevance at LinkedIn
Q1, Gregory Piatetsky: LinkedIn Knowledge Graph (LKG) is derived primarily from a large volume of user-generated content. One of the big problems is data standardization - for example titles like "Data Scientist", "Predictive Analytics Specialist", or "Data Mining Scientist" can refer to essentially the same job.
Can you give an example of how you solve such standardization questions?
"Data Scientist" is the canonical form of a title entity in the taxonomy. A member or a job with title string "Data Mining Scientist" is standardized to title "Data Scientist" by our title standardizer (a supervised binary classifier) based on title string features and other member/job metadata (e.g., the skills of the member or the skills required by the job).
However, not all similar title strings can be mapped to the same entity by this supervised method, e.g., "Predictive Analytics Specialist" is not standardized to "Data Scientist", partially because collecting high-quality and high-volume training data for this task is challenging. To augment the binary decision in such an entity-level standardization task, we also provide the similarity among these three title strings in the following two ways simultaneously.
First, the LinkedIn title taxonomy has a hierarchical structure: title → super title → function, which enables a higher-level similarity. For example, these three title strings can all belong to the same super title and/or the same function. Downstream data mining applications can select the most suitable title granularity level. In some instances the existing three hierarchical levels are not fine enough, so we have an ongoing effort to add more semantic dimensions to the title taxonomy.
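The title → super title → function hierarchy can be illustrated with a minimal sketch. This is not LinkedIn's implementation: the taxonomy entries, the super title and function names, and the helpers `ancestors` and `finest_shared_level` are all hypothetical, chosen only to show how two titles that differ as entities can still match at a coarser level.

```python
# Hypothetical toy taxonomy: title -> (super title, function).
# Every entry is invented for illustration; it is not LinkedIn data.
TAXONOMY = {
    "Data Scientist": ("Data Science Specialist", "Engineering"),
    "Predictive Analytics Specialist": ("Data Science Specialist", "Engineering"),
    "Software Engineer": ("Software Developer", "Engineering"),
    "Accountant": ("Accounting Specialist", "Finance"),
}

# Levels ordered from finest to coarsest granularity.
LEVELS = ("title", "super_title", "function")


def ancestors(title):
    """Return the title's path through the hierarchy, keyed by level."""
    super_title, function = TAXONOMY[title]
    return {"title": title, "super_title": super_title, "function": function}


def finest_shared_level(title_a, title_b):
    """Return the most specific level at which two titles coincide, or None."""
    path_a, path_b = ancestors(title_a), ancestors(title_b)
    for level in LEVELS:
        if path_a[level] == path_b[level]:
            return level
    return None


if __name__ == "__main__":
    print(finest_shared_level("Data Scientist", "Predictive Analytics Specialist"))
    print(finest_shared_level("Data Scientist", "Software Engineer"))
    print(finest_shared_level("Data Scientist", "Accountant"))
```

A downstream application could then pick the level it needs, for example grouping by `super_title` when exact title matches are too sparse.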
Second, we have unsupervised methods that embed titles into a latent space via deep learning techniques, where the similarity among these three title strings is captured by their measurable proximity in the latent space. This unsupervised graph embedding method effectively leverages other types of entities and the entity relationships in LKG to compute the similarity among these three title strings, without the need to collect high-quality, high-volume training data for this task. However, unsupervised methods do not provide interpretable results. They are useful in applications where they can serve as features in machine-learned models, but less useful where interpretability is important (e.g., explicit dimensions for advertisers to target, search facets, etc.).
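Proximity in a latent space is commonly measured with cosine similarity, and the sketch below assumes that measure; the interview does not say which one LinkedIn uses. The three-dimensional vectors are made up for illustration (real embeddings would be learned from the graph and have many more dimensions), and `EMBEDDINGS`, `cosine_similarity`, and `most_similar` are hypothetical names.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for illustration only.
EMBEDDINGS = {
    "Data Scientist": [0.9, 0.8, 0.1],
    "Data Mining Scientist": [0.85, 0.75, 0.15],
    "Predictive Analytics Specialist": [0.8, 0.9, 0.2],
    "Accountant": [0.1, 0.2, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(title, k=2):
    """Return the k titles closest to `title`, most similar first."""
    query = EMBEDDINGS[title]
    scored = [
        (other, cosine_similarity(query, vector))
        for other, vector in EMBEDDINGS.items()
        if other != title
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    for name, score in most_similar("Data Scientist"):
        print(f"{name}: {score:.3f}")
```

The scores are plain numbers with no explicit meaning attached to any dimension, which is the interpretability limitation mentioned above.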
Q2, GP: Some entities are auto-generated by LinkedIn. Can you give an example and explain the criteria for auto-generation?
Company entities are a good example. A company entity consists of the ID, canonical form, and required attributes like industry, address, URL, etc. Some company entities are created and maintained by companies themselves (e.g., Microsoft delegates an administrator to manage Microsoft's company page on LinkedIn). However, many company entities do not have such administrators. We automatically extract these company entities from member profiles and job descriptions, and apply both machine learning methods and human labels to complete the requisite attributes. We also map LinkedIn members to these auto-created companies. An auto-created company entity is valuable to LinkedIn's business only when the number of LinkedIn members mapped to it exceeds the minimum bar; otherwise, the entity is dropped. Once an auto-created company entity is completed, we insert it into the company database and create its company page automatically.
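The "minimum bar" criterion can be sketched as a simple filter over member-to-company mappings. This is an illustration under stated assumptions: the threshold value, the input format of `(member_id, company_name)` pairs, and the function name `select_auto_created_companies` are hypothetical, and the real pipeline also involves entity extraction and attribute completion that are not modeled here.

```python
# Hypothetical minimum bar; the actual threshold is not given in the interview.
MIN_MEMBERS = 2


def select_auto_created_companies(member_company_pairs, min_members=MIN_MEMBERS):
    """Keep candidate companies whose distinct member count exceeds the bar.

    member_company_pairs: iterable of (member_id, company_name) tuples.
    Returns a dict mapping each kept company to its distinct member count.
    """
    members_by_company = {}
    for member_id, company in member_company_pairs:
        # A set ensures a member mapped twice is only counted once.
        members_by_company.setdefault(company, set()).add(member_id)

    return {
        company: len(members)
        for company, members in members_by_company.items()
        if len(members) > min_members  # "exceeds the minimum bar"
    }


if __name__ == "__main__":
    pairs = [
        (1, "Acme Analytics"),
        (2, "Acme Analytics"),
        (3, "Acme Analytics"),
        (1, "Acme Analytics"),  # duplicate mapping, counted once
        (4, "Tiny Startup"),
    ]
    print(select_auto_created_companies(pairs))
```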
Q3, GP: Another interesting feature is inferring entity relationships, e.g., between members and skills. What approach do you use for inferring such relationships?
We use a supervised classifier to predict the probability that a member has a skill for each pair of <member, skill>. The majority of features are extracted from the member's profile and the member's professional network on LinkedIn.
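To show what scoring a <member, skill> pair might look like, here is a minimal sketch that assumes a logistic model. The interview does not name the classifier type, so that choice is an assumption, as are the feature names, the hand-set weights, and the function `skill_probability`. A real supervised classifier would learn its weights from labeled pairs.

```python
import math

# Hypothetical features and hand-set weights, for illustration only.
# A trained classifier would learn these values from labeled <member, skill> pairs.
WEIGHTS = {
    "skill_in_profile_text": 2.0,               # from the member's profile
    "title_related_to_skill": 1.5,              # from the member's profile
    "fraction_of_connections_with_skill": 3.0,  # from the professional network
}
BIAS = -3.0


def skill_probability(features):
    """Return the modeled probability that the member has the skill.

    features: dict mapping every name in WEIGHTS to a value in [0, 1].
    """
    score = BIAS + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-score))  # logistic function


if __name__ == "__main__":
    strong_evidence = {
        "skill_in_profile_text": 1.0,
        "title_related_to_skill": 1.0,
        "fraction_of_connections_with_skill": 0.8,
    }
    weak_evidence = {
        "skill_in_profile_text": 0.0,
        "title_related_to_skill": 0.0,
        "fraction_of_connections_with_skill": 0.05,
    }
    print(round(skill_probability(strong_evidence), 3))
    print(round(skill_probability(weak_evidence), 3))
```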
Q4, GP: What interesting trends did you find from changes in LKG, e.g., which jobs, locations, and companies are becoming more or less popular? Can you predict economic indicators ahead of their release or other global trends?
We found many interesting trends from changes in LKG. For example, we observed talent migrations by country, talent flows between companies, talent supply and demand by region, hot jobs by skill, gender breakdown by skill, school rankings by job title, most common skillsets by job title, unicorn company rankings by hire and departure ratio, among many other insights. Recently we have started to predict some economic indicators ahead of their releases, but the details are beyond the scope of this post.
Q5, GP: Will there be an API to allow others access to all or some of LKG?
From the engineering side, we externalized some APIs to allow partners outside LinkedIn to access our entity taxonomies. Please contact the LinkedIn Partner Engineering team to access the APIs.
Q6, GP: As a long-time LinkedIn member, I get many LinkedIn connection requests and have accumulated several thousand connections. However, I don't really know most of them, and I suspect this situation is typical for many members. Unlike adding a connection, deleting a connection is not simple (I don't think I have ever done it myself). Is there a concern about connection "inflation", where members have too many connections and connections become less valuable? Perhaps some unused connections should be deleted or, more generally, connection strength could be evaluated and shown to members?
At LinkedIn, we estimate the strength of a connection and the value that a connection can provide to members based on features derived from LKG (e.g., whether the two members are in the same company and have similar titles and skills) and members' connection patterns and interactions. The estimated connection strength and value are used in various products (e.g., ranking updates in the feed), so the connection strength measure is already embodied in some of our products and the member experience. Whether we plan to expose connection strength to members is more of a product roadmap question, and beyond the scope of our team's focus.