7 Essential Resources & Tips To Get Started With Data Science

This instructional post takes you through connecting the various pieces when studying the data science pipeline. From analysis, to datasets, to MOOCs, to visualizing data, this informative post has some fresh insight.

By Jan R. Benetka.

1. Data Science

Data science is an umbrella term for a collection of techniques from many distinct areas, such as computer science, statistics, and machine learning, to name just a few. The main objective is to extract information from data and turn it into knowledge on which you can base further decisions. It sounds easy, but it is not always straightforward. The process usually comprises many steps, starting with a research question. Once you know what you want to study, you need to obtain the right data, clean it, explore it, and create and evaluate a model. You typically repeat this cycle a couple of times, and only then are you ready to look for a way to communicate your results properly.

The Python for Data Analysis book is a great starting point; it guides you through all these stages and helps you internalize this workflow.
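To make the stages described above more concrete, here is a minimal sketch of one pass through the cycle using only the Python standard library. The data set (hours studied versus exam score) is made up for illustration, and the straight-line model is just one simple choice; none of this comes from the book mentioned above.

```python
import csv
import io
import statistics

# Hypothetical raw data (illustrative values only); one score is missing.
RAW_CSV = """hours,score
1,52
2,55
3,
4,61
5,64
6,67
"""

# 1. Obtain: read the raw rows into dictionaries keyed by column name.
rows = list(csv.DictReader(io.StringIO(RAW_CSV)))

# 2. Clean: drop rows with a missing field and convert text to numbers.
clean = [
    (float(row["hours"]), float(row["score"]))
    for row in rows
    if row["hours"] and row["score"]
]

# 3. Explore: look at a basic summary before modeling.
hours = [h for h, _ in clean]
scores = [s for _, s in clean]
print("rows kept:", len(clean))
print("mean score:", statistics.mean(scores))

# 4. Model: fit a least-squares line, score = slope * hours + intercept.
mean_h = statistics.mean(hours)
mean_s = statistics.mean(scores)
slope = sum((h - mean_h) * (s - mean_s) for h, s in clean) / sum(
    (h - mean_h) ** 2 for h in hours
)
intercept = mean_s - slope * mean_h

# 5. Evaluate: mean absolute error on the same rows.
#    (A real project would evaluate on data held out from fitting.)
errors = [abs(s - (slope * h + intercept)) for h, s in clean]
mae = statistics.mean(errors)
print("slope:", slope, "intercept:", intercept, "MAE:", mae)
```

In practice each step is far messier than this, which is why the cycle usually has to be repeated several times.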

Figure: a tweet giving a definition of a data scientist.

2. Data Set

First of all, you need an interesting data set to play with. Either you already have your own data (congratulations!) or you need to acquire some. We happen to be living in the age of information overload, which probably means that data is everywhere and it's easy to get it, right? Yes and no.

Data is wherever you look; however, it's not always trivial to get what you want. The path of least resistance when searching for data is to explore publicly available data sets. People tend to organize them in curated lists such as 'Awesome Public Datasets' by Xiaming Chen. Alternatively, you can use one of the data repositories like datahub.io. If you don't succeed, you can try to find a public API and collect the precious data yourself. Chances are high that such an API is not available or is very limited. In that case, you have to find a way to extract the data by other means, for example, by scraping webpages. This approach typically requires some data-cleaning steps, which might be costly in terms of time and effort.
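As a rough illustration of the scraping-plus-cleaning route, the sketch below pulls the cells out of an HTML table with Python's built-in html.parser module and then tidies the values. The HTML snippet, the place names, and the numbers are all invented for the example, and the TableCellParser class is a hypothetical helper; a real page would first have to be downloaded and would almost certainly need more cleaning.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []            # finished rows, each a list of cell strings
        self._current_row = None  # row being built, or None outside <tr>
        self._in_cell = False
        self._cell_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._current_row = []
        elif tag in ("td", "th") and self._current_row is not None:
            self._in_cell = True
            self._cell_text = []

    def handle_data(self, data):
        if self._in_cell:
            self._cell_text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            # Strip stray whitespace as a first cleaning step.
            self._current_row.append("".join(self._cell_text).strip())
            self._in_cell = False
        elif tag == "tr" and self._current_row is not None:
            self.rows.append(self._current_row)
            self._current_row = None


# Made-up page fragment standing in for a downloaded webpage.
SAMPLE_HTML = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td> Alphaville </td><td>700,000</td></tr>
  <tr><td>Betatown</td><td> 290,000</td></tr>
</table>
"""

parser = TableCellParser()
parser.feed(SAMPLE_HTML)

# Split off the header row, then convert "700,000"-style text to integers.
header, *records = parser.rows
cleaned = [
    {"city": city, "population": int(population.replace(",", ""))}
    for city, population in records
]
print(header)
print(cleaned)
```

Even in this tiny case, most of the code deals with structure and cleanup rather than with the data itself, which is where the time and effort tend to go.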

3. Statistics

Having a good understanding of statistics is extremely helpful when performing data analysis. A rule of thumb says that the first step after getting a data set is to have a quick look at it, and basic descriptive statistics are your good friend here. If your data set contains numerical variables, you might be interested in their distributions: their center (e.g., the mean) and how spread out they are (e.g., the variance).
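A quick look of this kind can be sketched with Python's built-in statistics module. The page-view numbers below are hypothetical, and describe is just an illustrative helper name.

```python
import statistics

# Hypothetical sample: daily page views for one week (illustrative values).
page_views = [120, 135, 128, 150, 142, 390, 131]


def describe(values):
    """Return a few basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "stdev": statistics.stdev(values),        # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


summary = describe(page_views)
for name, value in summary.items():
    print(f"{name:>8}: {value:.2f}")
```

Comparing the mean with the median is a cheap way to spot skew: here the single unusually large day pulls the mean well above the median.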

In short, statistics offers you a toolbox for understanding your data, distinguishing between correlation and causation, analyzing patterns, modeling, predicting, etc. Last but not least, statistics quantifies the certainty of your outcomes and therefore gives you confidence in your results. In our ZEEF list you can find, among others, this awesome hands-on tutorial called "An Introduction to Statistics" prepared by Thomas Haslwanter.
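On the correlation side, a minimal sketch of the Pearson correlation coefficient looks like this. The two series are invented for illustration, and pearson_r is a hypothetical helper written out by hand so that the arithmetic is visible.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long number lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)


# Hypothetical weekly figures (illustrative values only).
ice_cream_sales = [20, 35, 50, 65, 80]
sunburn_cases = [2, 4, 5, 7, 9]

r = pearson_r(ice_cream_sales, sunburn_cases)
print(f"r = {r:.3f}")
```

A coefficient close to 1 only says that the two series move together. In this made-up example a third factor, sunny weather, would plausibly drive both, so the strong correlation is no evidence that one causes the other.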

4. Machine Learning

In layman's terms, the goal of machine learning algorithms is to learn to make decisions based on data. This approach, in contrast to designing hard-coded algorithms, has the huge benefit that one method can serve many purposes. Moreover, machine learning systems are designed to improve as new data come in. That's exactly why your Amazon account looks different when you're logged in than when you're not - as you're browsing their catalogue, it learns your preferences. Google search, to mention another example, is constantly learning the importance of webpages. You don't have time to manually inspect the thousands of results it returns; all you want is for the ten blue links to be the best hits.
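To show what "learning from data instead of hard-coding rules" can look like, here is a minimal sketch of a k-nearest-neighbors classifier in plain Python. It is a toy stand-in, not how Amazon or Google work; the feature values, labels, and the knn_predict helper are all made up for the example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label for `query` by majority vote of its k nearest points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical training data: two numeric features per shopper and the
# product category that shopper ended up preferring.
train_points = [
    (1.0, 1.2), (0.8, 0.9), (1.3, 1.0),
    (4.0, 4.2), (4.5, 3.9), (3.8, 4.4),
]
train_labels = [
    "books", "books", "books",
    "electronics", "electronics", "electronics",
]

# No rule such as "if feature > 2.5 then electronics" is written anywhere;
# the decision comes entirely from the labeled examples.
first_guess = knn_predict(train_points, train_labels, (1.1, 1.0))
second_guess = knn_predict(train_points, train_labels, (4.1, 4.0))
print(first_guess, second_guess)
```

The same function works unchanged for any labeled numeric data, and adding more examples to the training lists changes its predictions without touching the code.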

If you want to start with machine learning right away, then you should visit Joseph Misiti's GitHub repository, which contains a great hack-first-get-serious-later tutorial called Dive into Machine Learning. It uses Python and one of its most popular ML libraries, scikit-learn.