I recently started a new newsletter focused on AI education. TheSequence is a no-BS (meaning no hype, no news, etc.) AI-focused newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers and concepts. Please give it a try by subscribing below:
LinkedIn is one of the most popular recruiting platforms on the market. Every day, recruiters from all over the world rely on LinkedIn to source and filter candidates for specific career opportunities. Specifically, LinkedIn Recruiter is the product that helps recruiters build and manage a talent pool that optimizes the chances of a successful hire. The effectiveness of LinkedIn Recruiter is powered by an incredibly sophisticated series of search and recommendation algorithms that combine state-of-the-art machine learning architectures with the pragmatism of real-world systems.
It’s not a secret that LinkedIn has been one of the software giants pushing the boundaries of machine learning research and development. In addition to nurturing one of the richest datasets in the world, LinkedIn has been constantly experimenting with cutting-edge machine learning techniques in order to make artificial intelligence (AI) a first-class citizen of the LinkedIn experience. The recommendation experience in their Recruiter product required all of LinkedIn’s machine learning expertise, as it turned out to be a very unique challenge. In addition to dealing with an incredibly large and constantly growing dataset, LinkedIn Recruiter needs to handle arbitrarily complex queries and filters and deliver results that are relevant to specific criteria. Search environments are so dynamic that they are really hard to model as machine learning problems. In the case of Recruiter, LinkedIn used three-factor criteria to frame the objectives of the search and recommendation model.
1) Relevance: The search results need to not only return relevant candidates but also surface candidates that could be interested in the target position.
2) Query Intelligence: Search results should not only return candidates that match specific criteria but also similar criteria. For instance, a search for machine learning should return candidates that list data science in their skill sets.
3) Personalization: Very often, finding the ideal candidates for a company is based on matching attributes that fall outside the search criteria. Other times, recruiters are not certain of what criteria to use. Personalizing search results is a key element of any successful search and recommendation experience.
A fourth key criterion of the LinkedIn Recruiter search and recommendation experience, not as visible as the previous three, is its focus on simple metrics. To simplify the recommendation experience, LinkedIn modeled a series of key metrics that are tangible indicators of a successful recruitment. For instance, the number of accepted InMails seems to be a clear metric to judge the effectiveness of the search and recommendation processes. From that perspective, LinkedIn uses those key metrics as the objective to maximize in its machine learning algorithms.
The Science: From Linear Regression to Gradient-Boosted Decision Trees
The initial search and recommendation experience in LinkedIn Recruiter was based on linear regression models. While linear regression algorithms are easy to interpret and debug, they fall short when it comes to capturing non-linear correlations in large datasets such as LinkedIn’s. To improve that experience, LinkedIn decided to experiment with Gradient Boosted Decision Trees (GBDT), which combine many weak models into a more powerful tree ensemble. Aside from a larger hypothesis space, GBDT has a few other advantages, such as working well with feature collinearity and handling features with different ranges and missing feature values.
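The contrast can be seen in a minimal sketch (the features, labels, and model settings below are illustrative, not LinkedIn's): when the relevance label depends on feature interactions and thresholds, a GBDT fits the training data far better than a linear model.

```python
# Hypothetical comparison of linear regression vs. GBDT for candidate scoring.
# Feature names and the relevance function are made up for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Toy candidate features: [skill_match, years_experience, title_match],
# with a deliberately non-linear relevance label (interaction + threshold).
X = rng.random((500, 3))
y = X[:, 0] * X[:, 2] + (X[:, 1] > 0.5)

linear = LinearRegression().fit(X, y)
gbdt = GradientBoostingRegressor(n_estimators=100, max_depth=3).fit(X, y)

print("linear R^2:", round(linear.score(X, y), 3))
print("GBDT   R^2:", round(gbdt.score(X, y), 3))
```

On data like this, the tree ensemble captures the interaction between skill match and title match that a single linear fit cannot.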
GBDT by itself provided some tangible improvements over linear regression, but it still failed to address some key challenges of the search experience. In a famous example, searches for dentists returned candidates with software engineering titles because the search models prioritized job-seeking candidates. To improve this, LinkedIn added a series of context-aware features through a technique known as pairwise optimization. Essentially, this method extends GBDT with a pairwise ranking objective that compares candidates within the same context and evaluates which candidate better fits the current search context.
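One common way to implement a pairwise objective with trees (a generic sketch under invented data, not LinkedIn's actual formulation) is to train a GBDT classifier on feature *differences* between two candidates retrieved for the same query, predicting which of the pair is the better fit:

```python
# Illustrative pairwise optimization: learn to compare candidate pairs within
# the same query context instead of scoring candidates independently.
# All data and features are synthetic placeholders.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)

n_queries, cands_per_query, n_feats = 200, 4, 3
X = rng.random((n_queries, cands_per_query, n_feats))
rel = X.sum(axis=2)  # hidden "true" relevance of each candidate

# Build pairwise training examples within each query context.
pair_X, pair_y = [], []
for q in range(n_queries):
    for i in range(cands_per_query):
        for j in range(cands_per_query):
            if i != j:
                pair_X.append(X[q, i] - X[q, j])          # feature difference
                pair_y.append(int(rel[q, i] > rel[q, j]))  # who fits better?

model = GradientBoostingClassifier(n_estimators=50).fit(pair_X, pair_y)

# At ranking time, compare two candidates for the same search context.
better = model.predict([X[0, 0] - X[0, 1]])[0]
```

Because every training example is a within-query comparison, the model learns context-relative fit rather than a global, context-free score.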
Another challenge of the LinkedIn Recruiter experience is matching candidates with related titles, such as “Data Scientist” and “Machine Learning Engineer”. This type of correlation is hard to achieve using GBDT alone. To address that, LinkedIn introduced representation learning techniques based on network embeddings, which provide semantic similarity features. In this model, search results are complemented with candidates whose titles are semantically similar and relevant to the query.
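The intuition is easy to show with toy vectors (the embeddings below are invented; real systems learn them from network and co-occurrence data): titles that appear in similar contexts end up close in embedding space, so cosine similarity surfaces "Data Scientist" for a "Machine Learning Engineer" query.

```python
# Toy semantic-similarity sketch with hand-made title embeddings.
import numpy as np

title_embeddings = {
    "data scientist":            np.array([0.90, 0.80, 0.10]),
    "machine learning engineer": np.array([0.85, 0.90, 0.15]),
    "dentist":                   np.array([0.05, 0.10, 0.95]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = title_embeddings["machine learning engineer"]
ranked = sorted(title_embeddings,
                key=lambda t: -cosine(query_vec, title_embeddings[t]))
print(ranked)  # related titles rank above unrelated ones
```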
Arguably, the most difficult challenge to address in the LinkedIn Recruiter experience was personalization. Conceptually, personalization can be divided into two main groups. Entity-level personalization focuses on incorporating preferences for the different entities in the recruiting process, such as recruiters, contracts, companies, and candidates. To address this challenge, LinkedIn relied on a well-known statistical method called Generalized Linear Mixed models (GLMix), which add entity-specific effects on top of a shared model to improve the results of prediction problems. Specifically, LinkedIn Recruiter used an architecture that combines learning-to-rank features, tree interaction features, and GBDT model scores. Learning-to-rank features are used as input to a pre-trained GBDT model, which generates tree ensembles that are encoded into tree interaction features and a GBDT model score for each data point. Then, using the original learning-to-rank features and their nonlinear transformations in the form of tree interaction features and GBDT model scores, the GLMix model can deliver recruiter-level and contract-level personalization.
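The pipeline described above can be roughly sketched as follows. This is an approximation under stated assumptions, not LinkedIn's implementation: the leaf each tree routes an example to serves as a tree interaction feature, and the per-recruiter random effects of a real GLMix are approximated here with one-hot recruiter indicators in a single ridge regression.

```python
# Rough GLMix-style pipeline: GBDT supplies tree interaction features (leaf
# indices) and a score; a linear model combines them with the raw features
# plus per-recruiter terms. All names and data are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
X = rng.random((300, 4))                  # learning-to-rank features
recruiter = rng.integers(0, 5, size=300)  # which recruiter issued the search
y = X[:, 0] * X[:, 1] + 0.1 * recruiter   # recruiter-specific taste

gbdt = GradientBoostingRegressor(n_estimators=30, max_depth=2).fit(X, y)

# Tree interaction features: the leaf each tree routes the example to.
leaves = gbdt.apply(X).reshape(len(X), -1)
tree_feats = OneHotEncoder(handle_unknown="ignore").fit_transform(leaves).toarray()

gbdt_score = gbdt.predict(X).reshape(-1, 1)
rec_onehot = OneHotEncoder().fit_transform(recruiter.reshape(-1, 1)).toarray()

# Linear model over raw features + tree features + GBDT score + recruiter terms.
Z = np.hstack([X, tree_feats, gbdt_score, rec_onehot])
glmix = Ridge(alpha=1.0).fit(Z, y)
```

The recruiter indicator columns are what let the linear layer shift predictions per recruiter; a full GLMix would fit them as random effects rather than plain coefficients.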
The other type of personalization model required by the LinkedIn Recruiter experience focuses more on the in-session experience. A shortcoming of offline-learned models is that, as the recruiter examines the recommended candidates and provides feedback, that feedback is not taken into account during the current search session. To address this, LinkedIn Recruiter relied on a technique known as multi-armed bandits to improve the recommendations across different groups of candidates. The architecture first separates the potential candidate space for the job into skill groups. Then, a multi-armed bandit model is used to learn which group is more desirable based on the recruiter’s current intent, and the ranking of candidates within each skill group is updated based on the feedback.
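A minimal version of this idea can be sketched with an epsilon-greedy bandit (one of the simplest bandit strategies; the skill groups and acceptance probabilities below are invented for illustration): each accept/reject signal updates the estimated value of a skill group, steering which group is favored next within the session.

```python
# Minimal epsilon-greedy multi-armed bandit over invented skill groups.
import random

random.seed(7)

groups = ["deep learning", "data engineering", "statistics"]
# Hidden probability that the recruiter accepts a candidate from each group.
true_accept_prob = {"deep learning": 0.6, "data engineering": 0.2, "statistics": 0.3}

counts = {g: 0 for g in groups}
rewards = {g: 0.0 for g in groups}
epsilon = 0.1  # fraction of the time we explore a random group

def choose_group():
    if random.random() < epsilon or all(counts[g] == 0 for g in groups):
        return random.choice(groups)
    return max(groups, key=lambda g: rewards[g] / max(counts[g], 1))

for _ in range(2000):
    g = choose_group()
    feedback = 1 if random.random() < true_accept_prob[g] else 0  # accepted?
    counts[g] += 1
    rewards[g] += feedback

best = max(groups, key=lambda g: rewards[g] / max(counts[g], 1))
```

Over a session, the bandit concentrates recommendations on the group the recruiter actually responds to, without retraining any offline model.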
The Architecture
The LinkedIn Recruiter search and recommendation experience was built on a proprietary project called Galene, itself built on top of the Lucene search stack. The machine learning models described in the previous section contribute to building the indexes for the different entities used as part of the search process.
In that architecture, the Galene broker system fans out the search query request to multiple search index partitions. Each partition retrieves the matched documents and applies the machine learning model to retrieved candidates. Each partition ranks a subset of candidates, then the broker gathers the ranked candidates and returns them to the federator. The federator further ranks the retrieved candidates using additional ranking features and the results are delivered to the application.
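The data flow can be made concrete with a schematic sketch (this is not LinkedIn's actual code; the partition contents, scores, and the second-pass feature are placeholders): the broker fans the query out, each partition retrieves and scores its own slice, and the federator re-ranks the merged list.

```python
# Schematic broker/partition/federator flow with placeholder data.
partitions = [
    [{"id": "c1", "score": 0.9}, {"id": "c2", "score": 0.4}],
    [{"id": "c3", "score": 0.7}, {"id": "c4", "score": 0.8}],
]

def partition_search(partition, query):
    # Retrieval + first-pass model scoring happen inside each partition.
    return sorted(partition, key=lambda c: -c["score"])

def broker(query):
    results = []
    for p in partitions:  # fan out the request to every index partition
        results.extend(partition_search(p, query))
    return results

def federator(query):
    candidates = broker(query)
    # Second-pass ranking with additional (here: dummy) ranking features.
    for c in candidates:
        c["final_score"] = c["score"] + 0.05  # placeholder extra feature
    return sorted(candidates, key=lambda c: -c["final_score"])

top = federator("machine learning")[0]["id"]
```

Splitting first-pass scoring across partitions keeps per-node work bounded, while the federator applies the heavier ranking features only to the already-shortlisted candidates.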
LinkedIn is one of the companies building machine learning systems at massive scale. The ideas behind the search and recommendation techniques used in LinkedIn Recruiter are incredibly relevant to many similar systems across different industries. The LinkedIn engineering team published a detailed slide deck that provides more insights into their journey to build a world-class recommendation system.