How to Set Yourself Apart from Other Applicants with Data-Centric AI

This article is designed to help you prepare for the job market and get yourself noticed in the industry.

By Dr. Jennifer Prendki, Founder and CEO at Alectio on December 12, 2022 in Machine Learning

In 2012, Data Scientist was deemed the Sexiest Job of the 21st Century by Harvard Business Review. While the field has significantly matured since then, and there are now many “flavors” of Data Science jobs, things haven’t changed much: being a data scientist is a highly desirable career for fresh grads, and it also attracts lots of people from other fields.

With the fast adoption of Machine Learning across the industry and the fantastic progress in AI research, the good news is that the need for data scientists is not going anywhere anytime soon. There are literally hundreds of thousands of new Data Science positions being opened every year. In fact, employers typically struggle with a shortage of labor. So, with this many opportunities on the market, it can't be that hard for newcomers to land the job of their dreams, right? Unfortunately, things are never that simple, and the reality is that even with so many vacant positions, junior applicants still face huge challenges breaking into the field.

Struggling with Credibility

To understand why employers seem unable to find a suitable match for their Data Science and Machine Learning roles, it is useful to understand how the data scientist role became prevalent in the Tech industry.

Though the term was coined in 2001, Data Science isn’t a new field, even if the way it is practiced has changed dramatically over time. That said, it was only in the late 2000s and early 2010s, when collecting and processing data became easier with the popularization of GPU machines and the sudden hype around Big Data, that it became mainstream for companies to build entire teams dedicated to analyzing and leveraging their data. The problem? The people they were hiring were often brilliant computer scientists and statisticians, but they were rarely capable of converting their skills into tangible business value, often because the companies that had hired them had no idea what data to collect or how to manage it. Those same companies quickly realized that they were losing money (how they fixed that issue is a separate story) and eventually learned the hard way that technical expertise isn’t sufficient to become a great data scientist.

This is how, in spite of the fast-growing need for Data Science and Machine Learning experts, companies tend to be slow to hire and often have inadequate hiring processes for such jobs.

Showcase what you can do, not what you know!

If you are on the hunt for your first job as a data scientist, you already know that a resume loaded with academic credentials might not get you very far. What hiring managers need to see is your ability to use those skills to help their business grow. Your long list of academic publications isn’t going to give them the confidence that you are the right person for the job. In fact, recruiters are particularly suspicious of resumes full of school projects, simply because they see plenty of such resumes among candidates who don’t work out.

That’s because Data Science is probably the technical field that requires the most cross-domain competencies and the most acute communication skills. It’s just not a job you can do all alone. It’s a job that will require you to expand your skills to bridge the gap between the Engineering, Product and Business teams, on whom you will depend for your success. In short, you need to become a bit of a Jack-of-all-Trades while remaining the data expert. So the resumes that catch a recruiter’s eye are those that display something different. Something that wouldn’t be expected from a fresh grad. And something that can prove the candidate is ready to tackle real-life problems.

This is how it became very popular for upcoming data scientists to participate in Kaggle competitions, as they were an easy way to prove their ability to work on realistically sized industry datasets. For a while, ranking reasonably well in a Kaggle competition was a differentiator, and it would land people interviews fairly easily. But participating in a Kaggle competition has since become the new norm. Candidates hardly stand out because of it; it is actually something that’s almost expected nowadays.

So how do you shine as a candidate in 2023? Don’t worry, we’ll get to that soon.

Real-life Data

Kaggle achieved something absolutely amazing: it got an entire generation of aspiring data scientists to hone their Machine Learning skills, and to have fun while doing it. Yet there is one thing that Kaggle was never able to achieve: remedying the most severe shortcoming of academic Data Science education, namely the lack of awareness of Data Preparation. Because Kaggle provides fully ready datasets to train on, competitors have only one thing to do: build, tune and train models, without ever worrying about data quality. Therefore, even with multiple high-quality Kaggle submissions to boast about on their resume, candidates still fall short when it comes to giving hiring managers the confidence that they can do more with real-life data than feed it into a model.

This continues to be a challenge in the way that companies find data talent. Luckily, what is a problem for some can turn out to be a huge opportunity for others. Data Science candidates are still investing most of their time proving their model-building skills, even though it is easy to differentiate oneself by demonstrating strong Data Preparation skills. How to do just that is precisely the topic of the next section.

Coming out on top thanks to the Data-Centric AI Hype

But before jumping into the practical advice, let me clarify what Data-Centric AI is. As is often the case in Data Science, Data-Centric AI is a new term for an old idea: it is the concept of optimizing the performance of a Machine Learning model by putting in the work on the training data, as opposed to the model.

The Data-Centric AI workflow

Traditionally, when building and training a Machine Learning model, data scientists treat their training data as a static object that they feed into a model which they will modify, tune and perfect until they are satisfied with the results. Once they are okay with the validation performance, they consider the model “ready” and move on to testing before having their model deployed. This is called Model-Centric AI, and that’s what you are taught in school.

But on the job, what you will experience will be wildly different: your data will be messy, have missing fields, and be corrupted; worse, there might be no data at all, and you will be expected to collect and organize it. You will have to spend significantly more time preparing your data than building the model, especially since the use of pre-trained models and ML libraries is becoming more and more mainstream. The industry simply calls (and always has called) for a Data-Centric approach to AI.
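To make "messy data" a little more concrete, here is a minimal sketch of the kind of audit you might run before any modeling: counting missing fields and exact duplicates. The `audit_records` helper, the field names and the toy records are all hypothetical and chosen purely for illustration; a real project would likely lean on a dataframe library and far more checks.

```python
from collections import Counter


def audit_records(records, required_fields):
    """Summarize missing fields and exact duplicates in a list of dict records."""
    missing = Counter()
    seen = set()
    duplicates = 0
    for record in records:
        for field in required_fields:
            value = record.get(field)
            # Treat absent keys, None, and blank strings as missing.
            if value is None or (isinstance(value, str) and not value.strip()):
                missing[field] += 1
        # Build a hashable, order-independent fingerprint of the record.
        key = tuple(sorted((k, repr(v)) for k, v in record.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return {"rows": len(records), "missing": dict(missing), "duplicates": duplicates}


# Hypothetical toy dataset with one blank text, one missing label, one duplicate.
records = [
    {"id": 1, "text": "great product", "label": "positive"},
    {"id": 2, "text": "", "label": "negative"},
    {"id": 3, "text": "arrived late", "label": None},
    {"id": 1, "text": "great product", "label": "positive"},
]

report = audit_records(records, ["text", "label"])
print(report)
```

Being able to describe a report like this, and what you did about each issue it surfaced, is the kind of detail that model-only portfolios tend to lack.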

How to showcase your Data-Centric AI Skills - and get the Job

So what better way to sell yourself as a great data scientist than to display your incredible Data-Centric AI skills? By doing that, you would solve the two biggest challenges when it comes to getting your first job as a data scientist:

1. You will be able to distinguish yourself from other candidates and will attract the attention of recruiters by displaying a different type of expertise. This will also demonstrate your ability to stay on top of new trends in Technology and hence, your continuous learning abilities.
2. You will actually prove that you have unique skills in Data Preparation and are capable of dealing with the challenges of real-life data. This will set you apart from other people with similar training but no practical experience with Data Cleaning, which will put the concerns of most recruiters at ease.

Here is some very good news: it is actually not hard at all to do just that, both because not many people are using this strategy yet, and because there are plenty of opportunities to do so. And while most people believe that Data Preparation is mostly about Data Labeling, the truth is that Data-Centric AI is in fact a collection of techniques and processes that consist of massaging training data so that it yields better results at training time. This means that there are many topics that you can start building expertise on.

5 Tips to demonstrate your Data-Centric AI skills

1. Gain as much knowledge as possible about Data Labeling and use that knowledge to shine during interviews. In your new job, data will most likely be raw, so show that you would know how to make it ML-ready. Inform yourself on the tools and techniques typically used to label data (from using third-party labeling companies to get data annotated manually, to more advanced techniques like Weak Supervision). Don’t forget to learn about the operational and business side of Data Labeling (how much it costs, how sharing data with third parties is impacted by Data Privacy laws like GDPR, etc.).
2. Build a small end-to-end Data Labeling tool as a portfolio project. You can easily use open source tools like Streamlit to create the UI.
3. Learn about Data-Centric training paradigms, like Active Learning and Human-in-the-Loop Machine Learning. You can do that quickly by contributing to open source Active Learning libraries. Note that Active Learning is an incredibly rich topic in itself, so don’t stop at least-confidence Active Learning but look also into Transfer Active Learning, BALD, etc.
4. Write introductory and technical content on the topics of Data Labeling, Data Augmentation, Synthetic Data Generation and Data-Centric AI. This will allow you to hone your own Data Prep skills as well as to showcase your understanding of the topic.
5. Recycle your existing projects by emphasizing the work you did in terms of Data Preparation. For example, if you had to manually annotate your own data for a school project, specify clearly in your resume how you did this, and how it impacted the quality of the results. Many people have already been doing Data-Centric AI all along, but just didn’t realize it.
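If least-confidence Active Learning is new to you, the core idea fits in a few lines: ask the current model for class probabilities on unlabeled samples, then send the ones it is least sure about to be labeled first. The sketch below assumes you already have those probabilities; the function names, the `budget` parameter and the numbers are made up for illustration and do not come from any particular library.

```python
def least_confidence_scores(probabilities):
    """Uncertainty of each sample: 1 minus the probability of its top class."""
    return [1.0 - max(p) for p in probabilities]


def select_for_labeling(probabilities, budget):
    """Return the indices of the `budget` most uncertain samples, most uncertain first."""
    scores = least_confidence_scores(probabilities)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]


# Hypothetical model outputs for four unlabeled samples over three classes.
unlabeled_probs = [
    [0.98, 0.01, 0.01],  # very confident, low labeling priority
    [0.40, 0.35, 0.25],
    [0.55, 0.44, 0.01],
    [0.34, 0.33, 0.33],  # nearly uniform, highest labeling priority
]

to_label = select_for_labeling(unlabeled_probs, budget=2)
print(to_label)  # indices of the samples to send to annotators
```

In a full loop, the selected samples would be annotated, added to the training set, and the model retrained before the next selection round. Other strategies such as BALD swap out the scoring function while keeping the same overall loop.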

Using data augmentations for your project is an easy way to showcase Data-Centric AI skills
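As a rough illustration of what data augmentation means in practice, here is a minimal sketch that applies a random crop and a random horizontal flip to a tiny grayscale "image" stored as nested lists. Everything here (the function names, the 4x4 toy image, the 50% flip probability) is an assumption made for the example; real projects would typically use an image library instead.

```python
import random


def horizontal_flip(image):
    """Mirror each row of a 2D image left to right."""
    return [row[::-1] for row in image]


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position inside the image."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop_h)   # randint is inclusive on both ends
    left = rng.randint(0, width - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def augment(image, crop_h, crop_w, rng):
    """Random crop followed by a horizontal flip applied half of the time."""
    out = random_crop(image, crop_h, crop_w, rng)
    if rng.random() < 0.5:
        out = horizontal_flip(out)
    return out


rng = random.Random(0)  # seeded so runs are repeatable
image = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 image
augmented = [augment(image, 3, 3, rng) for _ in range(5)]
for sample in augmented:
    print(sample)
```

One thing worth mentioning in a write-up is whether each transformation preserves the label: a horizontal flip is harmless for many object photos but can change the meaning of digits or text.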

As Data-Centric AI grows in popularity and awareness, Data-Centric AI skills will surely become a must for any data scientist to be hired. Universities will most likely evolve their curricula to include it as a key topic. But for now, any knowledge of Data-Centric AI will certainly set you apart and make you a unique candidate with a genuine interest in practical Machine Learning issues. So don’t miss the opportunity to shine and land your dream job.

Dr. Jennifer Prendki is the founder and CEO of Alectio, the first AI startup focused on the concept of DataPrepOps, a portmanteau term that she coined to refer to the nascent field focused on automating the optimization of a training dataset. She is on a mission to help ML teams build models with less data (leading to both the reduction of ML operations costs and CO2 emissions) and has developed technology that dynamically selects and tunes a dataset that facilitates the training process of a specific ML model.