Interview: Lei Shi, on Unraveling Insights from Unstructured Data

We discuss challenges in leveraging Big Data, important attributes while profiling employers and job seekers, competitive landscape, desired skills in data scientists and more.


Lei Shi is currently the CTO of ChinaHR, one of China's top online recruitment websites. Lei joined Microsoft Research Asia in 2005 as a researcher in the area of Natural Language Processing. Later he joined Yahoo!, where he was in charge of Yahoo!'s search product development in Beijing. He has extensive experience in leveraging big data and machine learning techniques to solve complex problems.

First part of the interview

Here is the second and last part of my interview with him:

Anmol Rajpurohit: Q6. What are the major challenges in leveraging Big Data and Machine Learning for improved decision-making during the hiring process?

Lei Shi: One challenge is unstructured data. A large portion of the content in job postings and CVs is written as free text. Understanding such text, or extracting key information from it, has long been a real challenge for computer scientists, and handling it requires advanced natural language processing techniques such as named entity recognition, information extraction, and sentiment analysis.

Another major challenge is data sparseness. We frequently see CVs or job postings in which much of the information is missing or only partially filled in.
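To make these two challenges concrete, here is a minimal Python sketch. It is not ChinaHR's pipeline: it swaps in simple regular expressions for the named entity recognition and information extraction models mentioned above, and the patterns, field names, and `fill_rate` helper are all assumptions made for illustration. It pulls a few fields out of free text and reports how much of the record could be filled, which is one rough way to see sparseness.

```python
import re

# Hypothetical patterns for illustration only. A production system would use
# trained NER / information-extraction models rather than hand-written regexes.
DEGREE_PATTERN = re.compile(r"\b(PhD|Master|Bachelor)\b", re.IGNORECASE)
YEARS_PATTERN = re.compile(r"(\d+)\s*\+?\s*years?", re.IGNORECASE)
CITY_PATTERN = re.compile(r"\b(Beijing|Shanghai|Shenzhen)\b", re.IGNORECASE)


def extract_fields(text):
    """Extract a few structured fields from free text; missing ones are None."""
    degree = DEGREE_PATTERN.search(text)
    years = YEARS_PATTERN.search(text)
    city = CITY_PATTERN.search(text)
    return {
        "degree": degree.group(1).lower() if degree else None,
        "years_experience": int(years.group(1)) if years else None,
        "city": city.group(1).lower() if city else None,
    }


def fill_rate(fields):
    """Fraction of fields that were actually found (1.0 = fully filled)."""
    filled = sum(1 for value in fields.values() if value is not None)
    return filled / len(fields)


if __name__ == "__main__":
    rich_cv = "Master degree, 5 years of Java development in Beijing"
    sparse_cv = "Looking for a new job"
    for cv in (rich_cv, sparse_cv):
        fields = extract_fields(cv)
        print(fields, fill_rate(fields))
```

The second CV yields no fields at all, which is the kind of sparsely filled record described above; any downstream matching has to cope with such gaps.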

AR: Q7. What are the important attributes when profiling job seekers and employers? What are the major steps in the process of statistical profiling?

LS: In general, anything that characterizes job seekers or employers, or distinguishes them from others for the purpose of employment matching, can be taken as an attribute in profiling. Empirically, important attributes for job seekers include education level, work experience, location, age, etc.

However, with Big Data and data analytics techniques, attributes can also be generated and selected automatically. To represent a profile statistically, we convert it into a mathematical vector of attributes and values. We first define a set of attributes that can characterize the job or job seeker. Then, to assign a value to each attribute, we can calculate it either with predefined rules or with statistical learning models, for example by running topic models on job descriptions to capture their topic distributions.
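The profiling steps just described can be sketched in a few lines of Python. This is an illustrative assumption rather than the method used at ChinaHR: the education scale, the tiny vocabulary, and the function names are hypothetical, and normalized term frequencies stand in for the topic distribution that a real topic model would produce. Two attributes are filled in by predefined rules, the rest come from the text, and cosine similarity compares a job seeker's vector with a job's vector.

```python
import math
from collections import Counter

# Hypothetical rule-based scale and vocabulary, chosen only for this example.
EDUCATION_LEVELS = {"high school": 1, "bachelor": 2, "master": 3, "phd": 4}
VOCABULARY = ["python", "java", "sales", "marketing"]


def profile_vector(education, years_experience, description):
    """Convert a profile into a fixed-length vector of attribute values."""
    # Rule-based attributes, scaled to the range [0, 1].
    education_value = EDUCATION_LEVELS.get(education, 0) / 4
    experience_value = min(years_experience, 20) / 20

    # Text-derived attributes: normalized term frequencies over the vocabulary.
    # This is a simple stand-in for a topic model's topic distribution.
    counts = Counter(description.lower().split())
    total = sum(counts[term] for term in VOCABULARY)
    term_values = [counts[term] / total if total else 0.0 for term in VOCABULARY]

    return [education_value, experience_value] + term_values


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    seeker = profile_vector("master", 5, "python python java")
    python_job = profile_vector("bachelor", 3, "python developer")
    sales_job = profile_vector("bachelor", 3, "sales marketing")
    print(cosine_similarity(seeker, python_job))
    print(cosine_similarity(seeker, sales_job))
```

In this toy setup the seeker scores higher against the Python job than against the sales job, because their text-derived attributes overlap. A real system would choose and weight attributes from data rather than by hand.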

AR: Q8. How does ChinaHR distinguish itself from its competition?

LS: ChinaHR is a top online recruitment brand in China. Founded in 1997, ChinaHR is something like a grandfather in the Chinese Internet industry, with a history far longer than that of its competitors. It has one of the largest job and CV databases in China, and probably in the world. Thanks to its long history and well-received brand, ChinaHR has accumulated a great amount of data, including jobs, CVs, and user behaviour data such as job applications and search clicks. This distinguishes it from competitors in data analytics applications, an area in which it is far ahead.

AR: Q9. Which of the current trends in the Big Data arena are of great interest to you?

LS: Since I'm now in the online recruitment business, the trend of Big Data in this area is currently my greatest concern. However, progress in this area is quite slow. One possible reason is that very few players in this business have both a sufficient volume of data and the capability and experience to leverage it.

AR: Q10. What key qualities do you look for when interviewing for Data Science related positions on your team?

LS: He should be very familiar with machine learning algorithms. He should have a creative mind, because all the work to be done is very new and innovative. Since the volume of data we need to process is really huge, he should have hands-on experience with many large-scale data processing tools and database systems, such as Hadoop, Hive, MongoDB, etc.

AR: Q11. On a personal note, are there any good books that you have been reading lately and would like to recommend?

LS: Personally, I do not read books. But I read lots of papers from technical conferences and journals. Most of the latest developments in big data and analytics are published there.