5 Data Science Projects to Learn 5 Critical Data Science Skills
Learn these to take any data science project idea from brainstorm to deployment.
If you’re trying to break into the data science industry, it can be great to get some projects under your belt. Doing data science projects helps you to develop the skills you’ll need to work as a data scientist. You’ll also have a product you can put on your resume and discuss during interviews, which is critical to show you know what you’re doing.
The data science development cycle is the main pattern of any data science project, whether it’s for a company or for your own personal project. You’ll need to be comfortable with data collection, cleaning, modeling, and visualization to be a proficient data scientist. The specific tool stack you use at your future data science job may vary from the tools I recommend below, but like anything in the computer science world, it’s more about learning how to think than specific syntax or features of one tool over another. After all, if you can create data visualizations using Tableau, you will be able to learn how to do it with Power BI fairly quickly since you are already familiar with the general process for visualizing data.
Getting familiar with the entire data science development cycle at once can be overwhelming. Each step of the cycle requires several skills, and developing all the data scientist skills for all the steps at once will be a frustrating and probably fruitless process. Instead of floundering to try to do them all at once, give yourself a leg-up by structuring your learning journey.
The main obstacle you’ll face is motivation. My preferred method to maintain and feed my motivation is by picking a theme or product when trying to expand my skill sets into a new area. Try to think of an actual product (regardless of how useless or unmarketable it seems) and step through the cycle with that idea.
Follow your passions and take this opportunity to find the cross-section between why you want to get into data science and the rest of your life. If you like running, you can find a data set of race times and training plans followed to see which training plans lead to the largest amount of improvement. Maybe you’re into baking and you want to figure out the popularity of different dishes amongst home bakers by analyzing keyword frequencies of search engines.
Here is a rundown of five mini data science projects you can do. Each will teach you a skill you’ll need to showcase on your resume.
1. Data Collection
Similar to when you’re beginning to cook a meal, you first have to make sure that you’ve gathered all the ingredients you will need. The first step in producing any kind of insights is procuring data. Finding the relevant data for your data analytics project, whether it’s a personal or work project, is a big challenge.
You should be comfortable working with APIs. Think of an API as a formalized agreement between two programs, like the frontend of a website and the server and database that hold and process the data. Both the frontend and the backend are built against the published API, which structures the communication between them. REST APIs are very popular and are used to query a web service for data. You can use APIs similar to the Google Trends API to collect data.
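As a rough illustration of querying a REST API, the sketch below builds a request URL and parses a JSON response using only Python's standard library. The endpoint `https://api.example.com/v1/trends`, its `keyword` and `region` parameters, and the response shape are all hypothetical placeholders, not the real Google Trends API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://api.example.com/v1/trends"


def build_url(keyword, region="US"):
    """Encode the query parameters into a GET request URL."""
    return f"{BASE_URL}?{urlencode({'keyword': keyword, 'region': region})}"


def parse_response(raw_json):
    """Turn a JSON payload into a list of (date, interest) tuples.

    Assumes a made-up response shape:
    {"results": [{"date": ..., "interest": ...}]}
    """
    payload = json.loads(raw_json)
    return [(row["date"], row["interest"]) for row in payload["results"]]


def fetch_trend(keyword, region="US"):
    """Send the GET request and parse the body (requires network access)."""
    with urlopen(build_url(keyword, region), timeout=10) as response:
        return parse_response(response.read().decode("utf-8"))
```

Keeping URL construction and response parsing in separate functions means you can check both against sample data before you ever call a live service.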
Importing Big Data from a Database
You’ll want to create a database on a cloud service (AWS, Azure, or Google Cloud) and connect to it. All the big cloud solution providers have extensive free tiers that are perfect for a hobby data scientist to test things out. Since a lot of consumers, students, and enterprises alike use these big-name products, there’s tons of helpful content covering their free tiers, including extensive documentation and a plethora of Stack Overflow questions. Cloud services are becoming a central part of modern data science, so it’s great for you to get to know them now.
Pick a product and create a database. Amazon and Google both have great documentation for working with their free tier databases. Importing is a fairly straightforward, well-documented process. Google even provides a list of tips and tricks for the best data import strategy, like compressing data to reduce cost.
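To get a feel for the import step before touching a cloud console, here is a minimal local sketch. It uses Python's built-in `sqlite3` module as a stand-in for a cloud database, and gzip-compresses a CSV first, in the spirit of the compression tip. The `race_times` table and its columns are made up for illustration; a real cloud import would go through your provider's own tooling.

```python
import csv
import gzip
import io
import sqlite3


def compress_csv(rows, header):
    """Write rows to an in-memory CSV and gzip it, as you might before uploading."""
    text = io.StringIO()
    writer = csv.writer(text)
    writer.writerow(header)
    writer.writerows(rows)
    return gzip.compress(text.getvalue().encode("utf-8"))


def import_csv(conn, gz_bytes):
    """Decompress the CSV and bulk-insert it into a race_times table."""
    reader = csv.reader(io.StringIO(gzip.decompress(gz_bytes).decode("utf-8")))
    next(reader)  # skip the header row
    conn.execute("CREATE TABLE IF NOT EXISTS race_times (runner TEXT, minutes REAL)")
    conn.executemany(
        "INSERT INTO race_times VALUES (?, ?)",
        ((name, float(minutes)) for name, minutes in reader),
    )
    conn.commit()


# sqlite3 stands in for a cloud database connection here.
conn = sqlite3.connect(":memory:")
blob = compress_csv([("Ana", 52.5), ("Ben", 47.0)], ["runner", "minutes"])
import_csv(conn, blob)
row_count = conn.execute("SELECT COUNT(*) FROM race_times").fetchone()[0]
```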
Sourcing Your Data
There are plenty of open data sources for personal projects. Make sure you avoid datasets that have been overdone, like the iris dataset. You want your project to make a splash on your resume. I’ve gathered a few of my favorite data sources, one that’s quirky, one that focuses more on pop culture, and a third that contains more serious stuff like demographics and health data.
2. Data Cleaning
Cleaning data means that it’s dirty in the first place. I’ve never met a truly clean dataset in the wild, and you probably haven’t either. Cleaning data is an integral part of data science, as dirty data leads to inaccurate findings. Dirty data may contain duplicates, be outdated, incorrect, incomplete, or inconsistent. You will need to learn how to mitigate all of these issues.
According to Tableau, the five steps to data cleaning are removing duplicates, fixing structural issues, filtering out unwanted outliers, handling missing data, and validating the quality of your resulting cleaned dataset.
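The five steps can be sketched on a tiny, made-up dataset in plain Python. This is one possible reading of the steps, not Tableau's own procedure: the records, the 600-minute outlier threshold, and the choice of median imputation are all assumptions. The structural fix runs first here so that the duplicate check can match rows whose names differ only in spacing or capitalization.

```python
from statistics import median

raw = [
    {"runner": "Ana ", "minutes": 52.5},
    {"runner": "ana", "minutes": 52.5},    # duplicate once the name is normalized
    {"runner": "Ben", "minutes": None},    # missing value
    {"runner": "Cy", "minutes": 47.0},
    {"runner": "Dee", "minutes": 5000.0},  # implausible outlier
]


def clean(records, max_minutes=600):
    # Fix structural issues: trim whitespace and unify capitalization.
    rows = [{**r, "runner": r["runner"].strip().title()} for r in records]
    # Remove duplicates, keeping the first occurrence of each row.
    seen, unique = set(), []
    for r in rows:
        key = (r["runner"], r["minutes"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    # Filter unwanted outliers (the threshold is an assumed domain rule).
    kept = [r for r in unique if r["minutes"] is None or r["minutes"] <= max_minutes]
    # Handle missing data by imputing the median of the known values.
    fill = median(r["minutes"] for r in kept if r["minutes"] is not None)
    filled = [
        {**r, "minutes": fill if r["minutes"] is None else r["minutes"]}
        for r in kept
    ]
    # Validate the result before using it.
    assert all(0 < r["minutes"] <= max_minutes for r in filled)
    return filled


cleaned = clean(raw)
```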
How to Clean Dirty Data
Remember that we’re not going for perfect; we’re just going for good enough. Find a balance between putting in maximum effort, which risks an over-corrected dataset, and lazing through the data cleaning process.
Database Trends and Applications has a great guide to step you through the data cleaning process. The most important thing to remember is to document every change you make in the process of cleaning your data. When dealing with incomplete data, for example, you’ll have to make some assumptions and then decisions based on those assumptions. If you haven’t recorded your assumptions along with your replacement or deletion logic, you won’t be able to reintroduce this data if you gain more information or understanding.
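One lightweight way to document changes is to have every cleaning function append to a change log. The sketch below is a hypothetical helper, not something taken from the guide mentioned above; the function name `impute_with_log`, the field names, and the log format are assumptions.

```python
def impute_with_log(records, field, value, reason, log):
    """Fill missing values in `field` and record what was changed and why."""
    result = []
    for index, record in enumerate(records):
        if record[field] is None:
            log.append({"row": index, "field": field, "old": None,
                        "new": value, "reason": reason})
            record = {**record, field: value}  # copy, so the input stays untouched
        result.append(record)
    return result


change_log = []
data = [{"runner": "Ana", "age": 31}, {"runner": "Ben", "age": None}]
fixed = impute_with_log(
    data, "age", 35, "assumed age 35 where no age was recorded", change_log
)
```

Because the original records are left unmodified and each replacement is logged with its row and reason, you can revisit the decision later if better information turns up.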
If you’d like some specific examples of dirty data, Foresight BI has put together exercises for different types of dirty data. Pick the five exercises that seem most challenging to you and give them a go. They’ve got some structure and good example overviews of how the data could look.
3. Data Modeling
Outside of basic statistical analysis, machine learning is a core part of data science. Get comfortable developing, maintaining and deploying machine learning models to take your data science career to the next level.
Building Machine Learning Models
Amazon has a machine learning tutorial to walk you through building, training, and deploying a machine learning model using their SageMaker services. This is a great option if you are completely new to data science or machine learning, as it will hold your hand the whole way, but you’ll still get exposed to the entire process. If you haven’t built, trained, and deployed a model on your own before, I would follow Amazon’s guide.
However, if you’ve got more experience, don’t take the easy way out. Build your model as you normally would, taking care to split the data into testing and training data. Choose the right model depending on the kind of data you have and what kind of prediction you are looking to make (supervised for labeled data, unsupervised for unlabeled, etc).
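Splitting the data can be as simple as shuffling and slicing. Here is a minimal standard-library sketch; the 80/20 ratio and the fixed seed are arbitrary choices for illustration. Libraries such as scikit-learn provide their own `train_test_split` helper, which you would more likely use in practice.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))
train_rows, test_rows = split_rows(rows)
```

The model should only ever see `train_rows` during training; `test_rows` is kept back to estimate how well it generalizes.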
Chris Rawles put together a lovely, detailed guide for how to set up your model to train in the cloud. He used Google Cloud, but the principles he recommends hold true regardless of which cloud provider you choose.
AWS’s Lambda service is great for deploying your code and letting it just run. The pricing model is pay-per-request, so it can be quite cost effective if you’re just using it to practice deployment and maybe show it off to a few interviewers.
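For a sense of what deployed code looks like, here is a minimal sketch of a Python handler for AWS Lambda. The `(event, context)` signature is the convention Lambda uses for Python functions. Everything else is an assumption: the `predict` function is a hard-coded placeholder rather than a trained model, the `weekly_km` feature is made up, and the event is assumed to carry a JSON string under `"body"`, as it would behind an HTTP-style trigger.

```python
import json


def predict(features):
    """Placeholder model: a hypothetical linear score, not a trained model."""
    return 2.0 * features.get("weekly_km", 0.0) + 1.0


def lambda_handler(event, context):
    """Entry point that Lambda invokes once per request."""
    # Assumes the request payload arrives as a JSON string under "body".
    body = json.loads(event.get("body") or "{}")
    return {
        "statusCode": 200,
        "body": json.dumps({"prediction": predict(body)}),
    }
```

You can exercise the handler locally by calling `lambda_handler({"body": '{"weekly_km": 10}'}, None)` before deploying anything.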
Building Regression Models
Linear regression models work best if the outcome you are looking to predict is continuous, while logistic regression is the usual choice if the outcome is binary. Although a regression model is simpler than a neural network or a clustering algorithm, you should train and deploy it as you would other machine learning models.
If you feel out of your depth with machine learning and the never-ending list of tools used for data science, try starting with a digestible exercise. You can build a simple yet effective regression model in Excel. It isn’t anything fancy, and doing this won’t get you a data science job, but it’s a great baby step for beginner data scientists.
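If you would rather see the same baby step in code, a simple one-variable linear regression fits in a few lines of plain Python. This sketch uses ordinary least squares; the training-distance and improvement numbers are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return slope, mean_y - slope * mean_x


# Made-up data: weekly training distance (km) vs. race-time improvement (minutes).
weekly_km = [10, 20, 30, 40]
improvement = [1.0, 2.1, 2.9, 4.0]
slope, intercept = fit_line(weekly_km, improvement)
predicted = intercept + slope * 25  # predicted improvement at 25 km per week
```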
4. Data Visualization
Once you’ve done all the heavy lifting of finding data, cleaning it, developing a model, and producing predictions or insights, it’s time to show off your work! Knowing which type of visualization to go for is vital, since you need to communicate your findings in a simple yet effective manner. Try presenting your findings to friends and family using different visualizations and figure out which ones work better for certain scenarios.
Tableau has become quite famous for their snazzy, attractive visualizations. Pavleenk Kaur has put together a walk-through of the most common visualizations used in Tableau. It walks you through how to connect your data, and helps you understand the tool’s interface by describing the meanings of the colors of different options and describing the pros and cons of different visualizations.
Other BI Tools
Microsoft’s Power BI is great for dashboards, generating reports, and showing off your predictive analysis. It’s great at acting as a centralized data reporting system. With over 200K organizations using it across the globe, it’s a great tool to be familiar with when you apply for data science jobs. Check out this top list of data visualization tools for data scientists.
5. Deployment
Recommendation engines are a great example of data science in practice. If a customer bought a tent, they will probably want to buy sleeping bags, headlamps, and a camp stove, right? Recommendation engines are based on the idea of a co-occurrence matrix, which represents the number of times each row value appears in the same context as each column value.
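A co-occurrence matrix is easy to sketch with a counter keyed by item pairs. In the example below, the "context" is a single order, the purchase baskets are invented, and the `recommend` helper is a simple illustration of ranking by co-occurrence count rather than a production recommender.

```python
from collections import Counter
from itertools import combinations

# Made-up purchase baskets, one set of items per order.
orders = [
    {"tent", "sleeping bag", "headlamp"},
    {"tent", "sleeping bag"},
    {"tent", "camp stove"},
    {"headlamp", "batteries"},
]


def co_occurrence(baskets):
    """Count how often each ordered pair of items appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(basket), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1  # keep the matrix symmetric
    return counts


def recommend(item, counts, top_n=2):
    """Rank other items by how often they were bought together with `item`."""
    scores = {b: n for (a, b), n in counts.items() if a == item}
    # Sort by descending count, breaking ties alphabetically.
    return sorted(scores, key=lambda other: (-scores[other], other))[:top_n]


matrix = co_occurrence(orders)
suggestions = recommend("tent", matrix)
```

Here `matrix[("tent", "sleeping bag")]` is 2 because the two items share two orders, so sleeping bags rank first among the suggestions for a tent buyer.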
Deploying a recommendation engine is the final project on your way to developing all the skills of a data scientist. This area of data science has a lot of overlap with the skills and responsibilities of a software developer, like using Django to create apps online. You can deploy apps like ones produced using Django or other frameworks to the cloud (AWS, Azure, or Google Cloud). These cloud services can provide you with servers and databases, both of which you’ll need to deploy your app and keep it running.
Like a book that’s never published, a data science model that never gets to the point where it's consuming data and outputting live predictions or adjusting its analysis is worth a lot less. Deployment and maintenance should always be your end goal. Learning this now by building a recommendation engine will help you maximize your business impact and perceived performance at your next data science job.
Final Thoughts on Data Science Projects to Learn Data Science Skills
It’s important that you understand the basic building blocks that make up the data science development cycle. I recommend expanding that understanding to include cloud solutions. A data science model is only useful if it is able to make live predictions, continues to consume data to update the model, and makes all of these insights available to its stakeholders.
Whether you are trying to start your own data science company or want to go work as a data scientist at a tech giant, you will need to be comfortable executing the tasks of a data scientist in a cloud environment. With all of the cloud solution providers’ free tiers, there’s no excuse not to sink your teeth into these tools now. If you’re a beginner and want to land your first data science or data analytics job, these 19 data science project ideas can help you. Pick one or all of them - whatever looks like the most fun to you.
Nate Rosidi is a data scientist and works in product strategy. He's also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Connect with him on Twitter: StrataScratch or LinkedIn.