The secret sauce for growing from a data analyst to a data scientist
Despite the increasing demand and appetite for experienced data scientists, the job is ambiguously described most of the time. Also, the delineation between data science and data analytics or engineering is still loosely defined by a lot of hiring managers.
By Natalia Koupanou, Zoopla.co.uk
Undoubtedly, a data science heatwave has hit most industries, making data scientist “the sexiest job of the 21st century”, as Harvard Business Review called it. Despite the increasing demand and appetite for experienced data scientists, the job is ambiguously described most of the time. Also, the delineation between data science and data analytics or engineering is still loosely defined by a lot of hiring managers. This lack of a real industry standard confuses a lot of professionals who want to switch to a data science role. Speaking to many analysts and software developers, I realise how overwhelming the available information about AI and machine learning (ML) can be. I also know from experience how hard it is to know where to start without any guidance. Currently, I am a data scientist at Zoopla and I’d like to share some lessons I’ve learnt from my personal professional journey from analytics to data science.
Stand firm on a solid mathematical foundation
The majority of ML algorithms are built on multivariable calculus and linear algebra. Highly skilled data scientists are able to change an algorithm at the level of the mathematics and thus drive real improvement in model performance. It is therefore important to have the mathematical skills, especially in statistics and linear algebra. The ability to learn and understand machine learning techniques is a requirement for becoming a data scientist; whether you gained it from a psychology or mathematics degree, a PhD or an online course is not relevant.
Personally, I have a Bachelor’s and a Master’s degree in Engineering from Cambridge University. Typically, STEM bachelor’s degrees already provide the fundamentals in mathematics required to learn machine learning and data science techniques. Many aspiring data scientists are discouraged by the myth that a PhD is a prerequisite for a career in data science. Currently, there are many data scientists with a doctoral degree, but this is not a rule. For example, my former colleague Jorge Brasil, who has a Master’s in applied mathematics, has more than 7 years of experience in data science at top companies including Microsoft.
Tip 1: Focus on your abilities rather than your background.
As a data scientist, you often break difficult, open-ended, badly defined problems into small steps. This is a skill you are trained for during a postgraduate degree over a 3–6 year period. Industry can also offer that skill, and that’s why I personally chose to join an e-commerce start-up after my undergraduate studies, where I was the second member of the digital analytics and pricing team.
Teach yourself before teaching your machine
A data analyst reports, summarises and interprets both historical and current information to make it usable for the business. That is very different from a data scientist, whose role is to summarise data in a way that allows the business to make a prediction about the future or a prescriptive decision. The core task of data scientists is to train, test and optimise ML algorithms, and therefore their skillset is heavily weighted towards ML modelling.
Many blog posts on Medium and other platforms are ideal for beginners and can guide you through specific problems you might want to tackle. Other helpful reads are the following:
- Bishop — Pattern Recognition And Machine Learning (many call it the machine learning bible)
- Hal Daumé III — A course in machine learning
- Michael Nielsen — Neural Networks and Deep Learning
Theory and heavy equations can be overwhelming sometimes, but they should not keep anyone out of the field. An approach that worked for me was doing my reading in parallel with coding. For example, try to build a single-layer perceptron (the simplest kind of neural network) from scratch to fully understand what you’ve read in the books.
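To make that exercise concrete, here is a minimal sketch of a single-layer perceptron in plain Python, trained with the classic perceptron learning rule. The function names and the toy dataset (the logical AND function) are my own illustrative assumptions and are not taken from any of the books above.

```python
"""Single-layer perceptron trained with the perceptron learning rule."""


def predict(weights, bias, x):
    """Return 1 if the weighted sum of the inputs is positive, else 0."""
    activation = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if activation > 0 else 0


def train_perceptron(samples, labels, learning_rate=1.0, epochs=20):
    """Fit weights and bias by nudging them after every misclassified sample."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            error = target - predict(weights, bias, x)  # -1, 0 or +1
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias


# Toy data (assumption): the logical AND function, which is linearly separable.
AND_INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND_LABELS = [0, 0, 0, 1]

weights, bias = train_perceptron(AND_INPUTS, AND_LABELS)
predictions = [predict(weights, bias, x) for x in AND_INPUTS]
print("weights:", weights, "bias:", bias)
print("predictions:", predictions)
```

Because the weights start at zero, the learning rate only rescales them here, so a value of 1.0 keeps the arithmetic easy to follow by hand. Swapping the labels for XOR is a worthwhile follow-up experiment: a single-layer perceptron cannot separate that data, which is exactly the kind of limitation the books discuss.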
Tip 2: Apply the scientific methodologies you’ve learnt.
There are also numerous online courses and master’s programmes with different weightings of theory and practice, covering the fundamentals of machine learning. My favourite picks are:
- Coursera ML course from Andrew Ng, a leader in the field, which covers some basics. It might be better to try the assignments in Python rather than Octave/Matlab as you will be better positioned in the job market if you have stronger Python skills.
- Fast.ai courses (Introduction to Machine Learning for Coders, Practical Deep Learning for Coders, Cutting Edge Deep Learning for Coders) with an inspiring teaching philosophy and a more practical focus, created by ML celebrities Jeremy Howard and Dr Rachel Thomas.
- Stanford University shares the material of a range of AI classes; to name a few that I personally liked: cs231n Convolutional Neural Networks for Visual Recognition and cs224n Natural Language Processing with Deep Learning.
The objective here is neither to memorise formulas and derivations nor to read every page of these books and lectures. You should aim to capture the fundamental concepts that most models and algorithms address, just in different ways, e.g. drop-out layers in neural networks, vanishing gradients and signal/noise relationships. Gaining the ability to relate problems back to these fundamentals will make you a good applied data scientist, whom a lot of employers would want to have.
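As an illustration of how little code it can take to build intuition for one of these fundamentals, the sketch below shows the vanishing gradient effect in a deliberately simplified setting. It assumes a chain of layers with one sigmoid unit each, a weight of 1 and a pre-activation of 0 (all hypothetical choices), so that backpropagation multiplies the gradient by the same factor at every layer.

```python
import math


def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    """Derivative of the sigmoid; its maximum value is 0.25, reached at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def gradient_scale(depth, pre_activation=0.0, weight=1.0):
    """Factor by which the gradient is scaled after passing back through `depth` layers."""
    return (weight * sigmoid_derivative(pre_activation)) ** depth


for depth in (1, 5, 10, 20):
    print(depth, gradient_scale(depth))
```

Even in this best case for the sigmoid, the gradient shrinks by a factor of four per layer, so after ten layers roughly one millionth of it is left. Relating a stalled training run back to this mechanism is the kind of reasoning I mean.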
Research to do science
Tip 3: Pick the right methodology for your business setting and problem.
The real skill of a data scientist is knowing which technology and machine learning methodologies are needed to answer the business questions at hand. The field has been booming over the last decade, and a continuous thirst for knowledge is required to shine as a data science professional. I would strongly advise reading both published academic papers and the ML/AI blogs of different technology companies and key figures in the field. This can prove helpful when you are asked to deliver solutions for abstract problem statements that do not suggest an immediate solution. Finding the right solution by researching what’s out there is 80% of the job done. As Andrej Karpathy put it very well in the Stanford class cs231n, “don’t be a hero”. In my team, we don’t underestimate the effort and time others have put into finding the architecture that currently works best. Instead of rolling our own architecture for a common problem, we import libraries, download pre-trained models and fine-tune them on our data. The business world expects you to deliver (and fail) fast; hence, when possible, you shouldn’t reinvent the wheel but stand on the shoulders of giants.
Work on your programming skills
Python is becoming the world’s most popular coding language and has countless well-tested libraries for data science that are continually updated. Unsurprisingly, most data science teams, including mine, are looking for Python users. So if you don’t know Python, sign up for an online course and learn the basics to get you going. Don’t ignore style guides such as PEP 8, and be patient, since practice will bring the desired results. Also, learning how to use Jupyter will be key for a quicker workflow and data/model exploration.
Tip 4: Practise, practise and practise for stronger, better, faster programming skills.
Entering hackathons, participating in Kaggle competitions and working on personal coding projects are all different ways of improving your programming skills. Identifying or getting involved in data science opportunities that come out of the results of your analysis can be a way to gain experience in your current role. Algorithms for forecasting and anomaly detection are other projects you can ask to work on, even as part of your development as an analyst. I remember my first data science project in the industry was an algorithm to autocomplete search queries on an e-commerce website. This project was initiated by some interesting analytics insights around search and shopping baskets that I reported whilst still a junior analyst.
Gain software engineering skills
Software engineering skills become necessary when you want your models to see the light of production. Cultivating a coding attitude that aims for reproducibility of projects and results via automation is critical for both methodological and legal reasons. In a company with a mature data science culture, someone might create the prototype, someone else might write the production code and someone else might deploy it. In reality though, and regardless of the company’s size, it’s unlikely that you will have all the support required, and knowing just the statistics will not be enough to deliver a data science project.
Tip 5: Automate the steps in your project as early as possible.
Hence, an initial data science bucket list might look like this:
- Reproducible data pipelines (e.g. in Spark and Python): Have you ever had to reproduce an analysis that you did before? Creating a logical data flow (e.g. raw (immutable data) -> intermediate (in-progress work) -> processed (final features)) and using Makefiles will save both you and your colleagues a lot of time. My team and I are huge fans of cookiecutter, which offers a logical project structure.
- End-to-end automation of training and scoring: A model is, most of the time, a living organism: new predictions are needed and the data might shift. This translates to retraining, scoring and refining. It’s therefore a necessity to put your model parameters, secrets and random seeds in config files, break up a DS project into different elements and apply modularity, e.g. by creating a shared feature library which can be used during both training and scoring.
- Unit test coverage: I bet that you’d like to have carefree sleep and uninterrupted holidays. Then, it’s important to write tests for your projects in order to ensure robustness.
- Building an API to provide predictions: To pitch your ideas and models you need to have a proof of concept, and in many cases this is equivalent to a REST API. If you’d rather not use a language other than Python, you can use Flask and Flasgger, which comes with Swagger UI. Swagger will come in useful for documenting and visualising your RESTful web services.
- Containerisation of a data science solution for ECS deployment or a production environment: Docker allows you to isolate projects and their dependencies, move models between environments and run your code in exactly the same way every time, achieving 100% reproducibility. This will help your collaboration with DevOps and engineers, since they can use your containers as a black box without having to know data science.
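A few of the items above (seeds in config, a shared feature function and unit tests) can be sketched in a handful of lines using only the Python standard library. Everything named here is a hypothetical example of mine: the config keys, the listing fields and the function names are assumptions for illustration, not the structure of any real project.

```python
import json
import random

# Hypothetical config; in a real project this would be read from a file such as config.json.
CONFIG = json.loads('{"random_seed": 42, "test_fraction": 0.25}')


def build_features(listing):
    """Shared feature function, meant to be imported by both training and scoring code."""
    return {
        "price_per_bedroom": listing["price"] / max(listing["bedrooms"], 1),
        "has_garden": int(listing.get("garden", False)),
    }


def split_train_test(rows, config):
    """Shuffle with the configured seed so the split is identical on every run."""
    rng = random.Random(config["random_seed"])
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - config["test_fraction"]))
    return shuffled[:cut], shuffled[cut:]


# Unit tests written as plain functions, so a test runner such as pytest can collect them.
def test_features_handle_zero_bedrooms():
    features = build_features({"price": 300000, "bedrooms": 0})
    assert features["price_per_bedroom"] == 300000
    assert features["has_garden"] == 0


def test_split_is_reproducible():
    rows = list(range(20))
    assert split_train_test(rows, CONFIG) == split_train_test(rows, CONFIG)
```

Keeping the seed in the config rather than in the code means a retraining run can be reproduced, or deliberately varied, without touching the pipeline itself, and keeping feature logic in one function avoids training and scoring drifting apart.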
Translate science to domain language
As a data person and a subject matter expert, you can overcome blockers such as a missing business or KPI definition by finding proxies in the data, or by treating it outright as a latent factor which you learn with ML. Data science usually brings disruption to a business and, as a result, you will need to pitch your ideas to senior leadership in order to get the appropriate support and resources. Some might say that making an algorithm understandable for all stakeholders in the business is a form of art. Learning how to translate what I’ve built in order to show its importance to others is something I am constantly having to learn and relearn. As Rebecca Pope, current head of data science and engineering at KPMG, emphasised at the Women of Silicon Roundabout conference: “Always remember that you (not your code) is impactful. People don’t buy algorithms, they trust you and your abilities.” Hence, make sure you devote attention and time to translating maths into a visual narrative that is specific to your vertical industry.
Tip 6: Communicate your work with terms from your vertical industry.
Time to grow
Being in a newly shaped profession is more exciting than hard. Zoopla has given me the opportunity to work in a talented data science team, and working with people I can learn from helps me to achieve my professional goals faster. Finding a team that lets you grow and having a mind like a sponge will accelerate your journey to success. I have been lucky that my line manager, Jan Teichmann, has the experience to mentor me towards becoming a highly skilled data scientist. Ideally, your manager understands your day-to-day job and where you want to get to. Otherwise, find the extra guidance you might need outside your team or company, for example from an alumnus or professor of your university or from a friendly data scientist in your network. Meet-ups and conferences can also be inspiring and help you with this task.
Tip 7: Remember no textbook or course will be as important as mentoring.
To sum up, the skillset you should focus on to launch a career in data science is statistics, multivariable calculus and linear algebra, machine learning, programming skills, software engineering and visualisation skills.
Top tips to achieve your goal:
- Focus on your abilities rather than your background.
- Apply the scientific methodologies you have learnt.
- Pick the right methodology for your business setting and problem.
- Practise, practise and practise for stronger, better, faster programming skills.
- Automate the steps in your project as early as possible.
- Communicate your work with terms from your vertical industry.
- Remember no textbook or course will be as important as mentoring.
What are you waiting for? Relish the opportunity and make the effort to become what you are dreaming of. ;)
Feel free to share the data science love and connect with me on LinkedIn. Special thanks to Jan Teichmann for his feedback and support!
Bio: Natalia Koupanou is a Data Scientist at Zoopla.co.uk.
Original. Reposted with permission.