How to Become a (Good) Data Scientist – Beginner Guide
A guide covering the things you should learn to become a data scientist, including the basics of business intelligence, statistics, programming, and machine learning.
How simple is Data Science?
Sometimes, when you hear data scientists rattle off a dozen algorithms while discussing their experiments or go into the details of TensorFlow usage, you might think that there is no way a layperson can master Data Science. Big Data looks like another mystery of the Universe, locked away in an ivory tower with a handful of present-day alchemists and magicians. At the same time, you hear from everywhere about the urgent need to become data-driven.
The trick is, we used to have only limited and well-structured data. Now, with the global Internet, we are swimming in the never-ending flows of structured, unstructured, and semi-structured data. It gives us more power to understand industrial, commercial or social processes, but at the same time, it requires new tools and technologies.
Data Science is merely a 21st-century extension of mathematics that people have been doing for centuries. In its essence, it is the same skill of using information available to gain insight and improve processes. Whether it’s a small Excel spreadsheet or 100 million records in a database, the goal is always the same: to find value. What makes Data Science different from traditional statistics is that it tries not only to explain values, but to predict future trends.
In other words, we use Data Science for:
Data Science is a newly developed blend of machine learning algorithms, statistics, business intelligence, and programming. This blend helps us reveal hidden patterns from the raw data, which in turn provides insights into business and manufacturing processes.
What should a data scientist know?
To go into Data Science, you need the skills of a business analyst, a statistician, a programmer, and a Machine Learning developer. Luckily, for the first dive into the world of data, you do not need to be an expert in any of these fields. Let’s see what you need and how you can teach yourself the necessary minimum.
When we first look at Data Science and Business Intelligence, we see the similarity: they both focus on “data” to provide favorable outcomes, and they both offer reliable decision-support systems. The difference is that while BI works with static and structured data, Data Science can handle high-speed and complex, multi-structured data from a wide variety of data sources. From a practical perspective, BI helps interpret past data for reporting (Descriptive Analytics), while Data Science analyzes past data to make future predictions (Predictive Analytics or Prescriptive Analytics).
Theory aside, to start a simple Data Science project, you do not need to be an expert Business Analyst. What you need is to:
- have a question or something you’re curious about;
- find and collect relevant data that exists for your area of interest and might answer your question;
- analyze your data with selected tools;
- look at your analysis and try to interpret findings.
As you can see, at the very beginning of your journey, your curiosity and common sense might be sufficient from the BI point of view. In a more complex production environment, there will probably be separate Business Analysts to do the insightful interpreting. However, it is important to have at least a rough idea of BI tasks and strategies.
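The four steps above can be sketched in a few lines of Python. This is only a toy illustration: the question, the day names, and the visitor counts are all invented for the example and do not come from any real dataset.

```python
from statistics import mean

# Step 1: the question (hypothetical): do weekends bring more visitors than weekdays?

# Step 2: the data. These numbers are invented purely for illustration.
daily_visitors = {
    "Mon": 120, "Tue": 115, "Wed": 130, "Thu": 125,
    "Fri": 140, "Sat": 180, "Sun": 170,
}

# Step 3: analyze with a simple tool, here the average of each group.
weekend_days = {"Sat", "Sun"}
weekend = [v for day, v in daily_visitors.items() if day in weekend_days]
weekdays = [v for day, v in daily_visitors.items() if day not in weekend_days]
weekend_avg = mean(weekend)
weekday_avg = mean(weekdays)

# Step 4: interpret the finding in plain language.
if weekend_avg > weekday_avg:
    finding = "Weekends attract more visitors on average."
else:
    finding = "Weekends do not attract more visitors on average."
print(f"weekday avg={weekday_avg}, weekend avg={weekend_avg}: {finding}")
```

With only seven data points, a result like this is a hint rather than a conclusion, which is exactly where the interpreting step and some common sense come in.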
We recommend having a look at the following introductory resources to feel more confident in analytics:
Introduction To The Basic Business Intelligence Concepts: an insightful article giving an overview of the basic concepts in BI;
Business Intelligence for Dummies: step-by-step guidance through BI technologies;
Big Data & Business Intelligence: an online course for beginners;
Business Analytics Fundamentals: another introductory course teaching the basic concepts of BI.
Statistics and probability
Probability and statistics are the basis of Data Science. Statistics is, in simple terms, the use of mathematics to perform technical analysis of data. With the help of statistical methods, we make estimates for further analysis. Statistical methods themselves are dependent on the theory of probability, which allows us to make predictions. Both statistics and probability are separate and complicated fields of mathematics. However, as a beginner data scientist, you can start with 5 basic statistics concepts:
- Statistical features such as bias, variance, mean, median, and percentiles are the first statistical techniques you would apply when exploring a dataset. They are fairly easy to understand and to implement in code, even at the novice level.
- Probability Distributions represent the probabilities of all possible values in an experiment. The most common in Data Science are the Uniform Distribution, in which all outcomes are equally likely; the Gaussian, or Normal, Distribution, in which most observations cluster around the central peak (the mean) and the probabilities of values further away taper off equally in both directions, forming a bell curve; and the Poisson Distribution, a discrete distribution over the number of events in a fixed interval, which is skewed to the right when the average rate is low and looks increasingly like a Gaussian as the rate grows.
- Over and Under Sampling are techniques that help balance datasets. If the majority class is overrepresented, undersampling selects only part of its data so that it matches the size of the minority class. When data is insufficient, oversampling duplicates minority class examples until there are as many of them as in the majority class.
- Dimensionality Reduction. The most common technique is PCA (principal component analysis), which re-expresses the features as a smaller set of uncorrelated components ordered by how much of the variance in the data each one captures, so that the least informative components can be dropped.
- Bayesian Statistics. Finally, Bayesian statistics is an approach applying probability to statistical problems. It provides us with mathematical tools to update our beliefs about random events in light of seeing new data or evidence about those events.
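Three of these concepts (statistical features, over and under sampling, and a Bayesian update) can be tried out with nothing but the Python standard library. The sketch below is a minimal illustration: the sample values, the class labels, the helper names `oversample`, `undersample`, and `bayes_update`, and the test rates are all assumptions made up for the example.

```python
import random
from statistics import mean, median, pvariance, quantiles

# 1. Statistical features on a small, invented sample.
values = [2, 4, 4, 4, 5, 5, 7, 9]
sample_mean = mean(values)            # arithmetic average
sample_median = median(values)        # middle of the sorted values
sample_variance = pvariance(values)   # population variance
quartiles = quantiles(values, n=4)    # three cut points splitting the data in four

# 2. Random over and under sampling of an imbalanced, invented dataset.
def oversample(minority, target_size, rng):
    """Duplicate randomly chosen minority examples until target_size is reached."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra

def undersample(majority, target_size, rng):
    """Keep a random subset of the majority class of size target_size."""
    return rng.sample(majority, target_size)

rng = random.Random(0)  # fixed seed so the run is repeatable
majority = [("ok", i) for i in range(90)]
minority = [("fraud", i) for i in range(10)]
balanced_up = majority + oversample(minority, len(majority), rng)
balanced_down = undersample(majority, len(minority), rng) + minority

# 3. Bayesian update: belief in a hypothesis after seeing positive evidence.
def bayes_update(prior, likelihood, false_positive_rate):
    """Return P(hypothesis | evidence) using Bayes' theorem."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: 1% prior, 95% true positive rate, 5% false positive rate.
after_one_test = bayes_update(0.01, 0.95, 0.05)
# A second positive result updates the belief again, starting from the new prior.
after_two_tests = bayes_update(after_one_test, 0.95, 0.05)

print(sample_mean, sample_median, sample_variance, quartiles)
print(len(balanced_up), len(balanced_down))
print(round(after_one_test, 3), round(after_two_tests, 3))
```

Note how a single positive result moves a 1% prior only to roughly 16%, while a second one pushes it much higher: each new piece of evidence updates the belief produced by the previous one.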
We have selected just a few books and courses that are practice-oriented and can help you get a feel for statistical concepts from the beginning:
Practical Statistics for Data Scientists: 50 Essential Concepts — a solid practical book that introduces essential tools specifically for data science;
Naked Statistics: Stripping the Dread from the Data — an introduction to statistics in simple words;
Statistics and probability — an introductory online course;
Statistics for data science — a special course on statistics developed for data scientists.
Data Science is an exciting field to work in, as it combines advanced statistical and quantitative skills with real-world programming ability. Depending on your background, you are free to choose a programming language to your liking. The most popular in the Data Science community are, however, R, Python, and SQL.
- R is a powerful language specifically designed for Data Science needs. It excels at a huge variety of statistical and data visualization applications and, being open source, has an active community of contributors. In fact, 43 percent of data scientists are using R to solve statistical problems. However, it can be difficult to learn, especially if you have already mastered another programming language.
- Python is another common language in Data Science. 40 percent of respondents surveyed by O’Reilly use Python as their major programming language. Because of its versatility, you can use Python for almost all steps of data analysis. It allows you to create datasets, and you can find almost any type of dataset you need through Google. Easy to learn and ideal for beginners, Python remains attractive to Data Science and Machine Learning experts thanks to more sophisticated libraries such as Google’s TensorFlow.
- SQL (Structured Query Language) is more useful as a data processing language than as an advanced analytical tool. It can help you add, delete, and extract data from a database, carry out analytical functions, and transform database structures. Even though NoSQL and Hadoop have become a large component of Data Science, a data scientist is still expected to be able to write and execute complex queries in SQL.
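To give a flavor of how the languages work together, here is a minimal sketch that runs SQL from Python through the standard `sqlite3` module. The `orders` table, its columns, and its rows are hypothetical and exist only in memory for the duration of the example.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Define a structure: a hypothetical table of customer orders.
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)

# Add data.
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0), ("carol", 5.0)],
)

# Delete data.
conn.execute("DELETE FROM orders WHERE customer = ?", ("carol",))

# Extract data with a simple analytical query: orders and revenue per customer.
rows = conn.execute(
    """
    SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
    """
).fetchall()
conn.close()

for customer, n_orders, total in rows:
    print(customer, n_orders, total)
```

The same `SELECT ... GROUP BY` pattern carries over to production databases; only the connection step changes.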
There are plenty of resources for any programming language and every level of proficiency. We’d suggest visiting DataCamp to explore the basic programming skills needed for Data Science.
If you feel more comfortable with books, the vast collection of O’Reilly’s free programming ebooks will help you choose the language to master.
Machine Learning and AI
Although AI and Data Science usually go hand-in-hand, a large number of data scientists are not proficient in Machine Learning areas and techniques. However, Data Science involves working with large data sets, which requires mastering Machine Learning techniques such as supervised machine learning, decision trees, logistic regression, etc. These skills will help you to solve different data science problems that are based on predictions of major organizational outcomes.
At the entry-level, Machine Learning does not require much knowledge of math or programming, just interest and motivation. The basic thing that you should know about ML is that at its core lies one of three main categories of algorithms: supervised learning, unsupervised learning, and reinforcement learning.
- Supervised Learning is a branch of ML that works on labeled data, in other words, the information you are feeding to the model has a ready answer. Your software learns by making predictions about the output and then comparing it with the actual answer.
- In unsupervised learning, data is not labeled, and the objective of the model is to create some structure from it. Unsupervised learning can be further divided into clustering and association. It is used to find patterns in data, which are especially useful in business intelligence to analyze customer behavior.
- Reinforcement learning is the closest to the way that humans learn, i.e., by trial and error. Here, a performance function is created to tell the model whether what it did brought it closer to its goal or took it further away. Based on this feedback, the model learns and then makes another guess. This cycle repeats, and over time the guesses get better.
With these broad approaches in mind, you have a backbone for analyzing your data and can explore the specific algorithms and techniques that suit you best.
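The difference between the first two categories can be seen in a short sketch written in plain Python. It uses a nearest-neighbor classifier as one possible supervised method and a two-cluster, one-dimensional k-means as one possible unsupervised method. Both were picked for brevity, and the data (hours studied, exam outcomes, purchase amounts) and function names are invented for the example.

```python
from statistics import mean

# Supervised learning: every training example comes with a label (the "ready answer").
def predict_1nn(train, x):
    """Return the label of the training example whose feature is closest to x."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

# Hypothetical labeled data: (hours studied, exam outcome).
train = [(1.0, "fail"), (2.0, "fail"), (3.0, "fail"),
         (6.0, "pass"), (7.0, "pass"), (8.0, "pass")]
test = [(1.4, "fail"), (2.6, "fail"), (6.4, "pass"), (9.0, "pass")]

# Compare each prediction with the actual answer to measure accuracy.
correct = sum(predict_1nn(train, x) == label for x, label in test)
accuracy = correct / len(test)

# Unsupervised learning: no labels, the model has to find structure on its own.
def two_means_1d(points, iterations=10):
    """Split 1-D points into two clusters with a basic k-means loop (k=2)."""
    c1, c2 = min(points), max(points)  # start from the two extreme points
    for _ in range(iterations):
        group1 = [p for p in points if abs(p - c1) <= abs(p - c2)]
        group2 = [p for p in points if abs(p - c1) > abs(p - c2)]
        c1, c2 = mean(group1), mean(group2)  # move each center to its group mean
    return group1, group2

# Hypothetical unlabeled data: purchase amounts from two kinds of customers.
purchases = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
small, large = two_means_1d(purchases)

print("accuracy:", accuracy)
print("clusters:", sorted(small), sorted(large))
```

In practice you would reach for a library such as scikit-learn rather than hand-written loops, but the workflow stays the same: fit on data, predict, and check the result.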
As with programming, there are numerous books and courses in Machine Learning. Here are just a couple of them:
Deep Learning, the textbook by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, is a classic resource recommended for all students who want to master machine and deep learning.
Machine Learning course by Andrew Ng is an absolute classic that leads you through the most popular algorithms in ML.
Machine Learning A-Z™: Hands-On Python & R In Data Science — a Udemy course specifically for novice data scientists that introduces basic ML concepts both in R and Python.
What skills should a data scientist possess?
Now you know the main prerequisites for Data Science. Does it make you a good data scientist? While there is no correct answer, there are several things to take into consideration:
Analytical Mindset: it is a general requirement for any person working with data. However, if common sense might suffice at the entry-level, your analytical thinking should be further backed up by statistical background and knowledge of data structures and machine learning algorithms.
Focus on Problem Solving: when you master a new technology, it is tempting to use it everywhere. However, while it is important to know recent trends and tools, the goal of Data Science is to solve specific problems by extracting knowledge from data. A good data scientist first understands the problem, then defines the requirements for the solution, and only then decides which tools and techniques are the best fit for the task. Don’t forget that stakeholders will never be captivated by the impressive tools you use, only by the effectiveness of your solution.
Domain Knowledge: data scientists need to understand the business problem and choose the appropriate model for the problem. They should be able to interpret the results of their models and iterate quickly to arrive at the final model. They need to have an eye for detail.
Communication Skills: there’s a lot of communication involved in understanding the problem and delivering constant feedback in simple language to the stakeholders. But this is just the surface of the importance of communication: a much more important element is asking the right questions. Besides, data scientists should be able to clearly document their approach so that it is easy for someone else to build on that work and, conversely, to understand research work published in their area.
As you can see, it is the combination of various technical and soft skills that makes up a good data scientist.
Original. Reposted with permission.
Bio: SciForce is a Ukraine-based IT company specializing in the development of software solutions based on science-driven information technologies. We have wide-ranging expertise in many key AI technologies, including Data Mining, Digital Signal Processing, Natural Language Processing, Machine Learning, Image Processing and Computer Vision.