
KDnuggets Home » News » 2021 » Jun » Tutorials, Overviews » Top 10 Data Science Projects for Beginners ( 21:n22 )

Top 10 Data Science Projects for Beginners


Check out these projects for ideas to strengthen your skills and build a portfolio that stands out.



By Natassha Selvaraj, Data Scientist



Photo by Jo Szczepanska on Unsplash

 

As an aspiring data scientist, you must have heard the advice “do data science projects” over a thousand times.

Not only are data science projects a great learning experience, they also help you stand out from the crowd of data science enthusiasts looking to break into the field.

 

However, not all data science projects help your resume stand out. In fact, listing the wrong projects on your portfolio can do more harm than good.

 

In this article, I am going to walk you through the projects that are must-haves on your resume.

I will also provide you with sample datasets to experiment with for each project, along with associated tutorials that will help you complete the project.

 

Skill 1: Data Collection

 

Photo by James Harrison on Unsplash


 

Data collection and pre-processing are among the most important skills to have as a data scientist.

In my data science job, most of my work involves data collection and cleaning in Python. After understanding the business requirement, we need to gain access to relevant data on the Internet.

This can be done with APIs or web scrapers. Once that is done, the data needs to be cleaned and stored in data frames, in a format that can be fed as input into a machine learning model.

This is often the most time-consuming aspect of a data scientist’s job.

I suggest showcasing your skills in data collection and pre-processing by completing the following projects:

 

Web Scraping — Food Reviews Site

 
Tutorial: Zomato Web Scraping with BeautifulSoup

Language: Python

Scraping reviews from a food delivery website is an interesting and practical project to have on your resume.

Simply build a web scraper to collect all the review information from all the web pages of this site, and store it in a data frame.
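As a rough illustration of the parsing step, here is a minimal sketch under stated assumptions. The HTML snippet, the class names (review, author, rating, text), and the ReviewParser helper are all hypothetical, and the sketch uses Python's built-in html.parser rather than the BeautifulSoup library from the tutorial so that it runs without extra installs. A real scraper would also need to download each page and respect the site's terms of use.

```python
from html.parser import HTMLParser

# Hypothetical markup: real review pages use their own tags and class names.
SAMPLE_HTML = """
<div class="review"><span class="author">Ana</span><span class="rating">4.5</span><p class="text">Great noodles, quick delivery.</p></div>
<div class="review"><span class="author">Ben</span><span class="rating">2.0</span><p class="text">Cold food and a long wait.</p></div>
"""


class ReviewParser(HTMLParser):
    """Collects one dict per <div class="review"> block."""

    FIELDS = {"author", "rating", "text"}

    def __init__(self):
        super().__init__()
        self.rows = []        # finished reviews
        self._current = None  # review currently being built
        self._field = None    # field whose text we are inside

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "review":
            self._current = {}
        elif self._current is not None and css_class in self.FIELDS:
            self._field = css_class

    def handle_data(self, data):
        if self._current is not None and self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "div" and self._current is not None:
            self.rows.append(self._current)
            self._current = None
        self._field = None


def parse_reviews(html):
    """Return a list of review dicts with the rating converted to a float."""
    parser = ReviewParser()
    parser.feed(html)
    for row in parser.rows:
        row["rating"] = float(row["rating"])
    return parser.rows


reviews = parse_reviews(SAMPLE_HTML)
for review in reviews:
    print(review)
```

Passing the resulting list of dictionaries to pandas.DataFrame would give one row per review.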

If you want to take this project one step further, you can use the data collected to build a sentiment analysis model and classify which of these reviews are positive and which ones are negative.

The next time you are looking for something to eat, pick a restaurant that has reviews with the best overall sentiment.

 

Web Scraping — Online Course Site

 
Tutorial: Build a Web Scraper with Python in 8 Minutes

Language: Python

Want to find the best online course to take in 2021? It is difficult to scroll through hundreds of data science courses to find an affordable, yet highly rated course.

You can do this by scraping an online course website and storing all the results into a data frame.

Taking this project a step further, you can also create visualizations around variables like price and rating to find a course that is both affordable and of good quality.
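Before plotting anything, the price and rating comparison can be sketched as a simple filter. This is only an illustration: the course titles, column names, thresholds, and the shortlist helper are made up, and scraped data would need cleaning first.

```python
import pandas as pd

# Hypothetical scraped results; a real scrape would have many more rows.
courses = pd.DataFrame({
    "title": ["Intro to ML", "Python for Data", "Deep Learning A-Z", "Stats Basics"],
    "price": [19.99, 12.99, 89.99, 0.0],
    "rating": [4.6, 4.2, 4.8, 3.9],
})


def shortlist(df, max_price, min_rating):
    """Return courses within budget and above a rating floor, best rated first."""
    mask = (df["price"] <= max_price) & (df["rating"] >= min_rating)
    ranked = df[mask].sort_values(["rating", "price"], ascending=[False, True])
    return ranked.reset_index(drop=True)


picks = shortlist(courses, max_price=20, min_rating=4.0)
print(picks)
```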

You can also create a sentiment analysis model and come up with the overall sentiment surrounding each online course. You can then choose to do the course with the highest overall sentiment.

 

Bonus

 
Create some projects where you collect data using an API or some other external tool. These skills will usually come in handy when you start working.

Companies that rely on third-party data often purchase API access, so you will need to collect data with the help of these external tools.

A sample project you could do: Use the Twitter API to collect data related to a specific hashtag and store the data in a data frame.
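API responses usually arrive as nested JSON, so part of this project is flattening them into a table. The sketch below assumes a made-up payload: the field names and the to_frame helper are illustrative and do not reflect the actual Twitter API schema, and no network request is made.

```python
import pandas as pd

# Hypothetical payload: field names are illustrative, not the Twitter API's schema.
payload = {
    "data": [
        {"id": "1", "text": "Loving #datascience", "author": {"name": "ana"}, "likes": 10},
        {"id": "2", "text": "New post on #python", "author": {"name": "ben"}, "likes": 3},
        {"id": "3", "text": "#DataScience tips", "author": {"name": "cy"}, "likes": 7},
    ]
}


def to_frame(payload, hashtag):
    """Flatten nested records and keep rows whose text mentions the hashtag."""
    df = pd.json_normalize(payload["data"])  # nested keys become e.g. "author.name"
    has_tag = df["text"].str.lower().str.contains(hashtag.lower(), regex=False)
    return df[has_tag].reset_index(drop=True)


tweets = to_frame(payload, "#datascience")
print(tweets)
```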

 

Skill 2: Exploratory Data Analysis

 

Photo by Luke Chesser on Unsplash


 

After collecting and storing data, you will need to conduct an analysis of all the variables in your data frame.

You need to observe how each variable is distributed and understand how the variables relate to one another. You must also be able to answer questions using the data available.

This is work you’d be doing very often as a data scientist, perhaps even more so than predictive modelling.

Here are some EDA project ideas:

 

Identifying the risk factors of heart disease

 
Dataset: The Framingham Heart Study

Tutorial: The Framingham Heart Study: Decision Trees

Language: Python or R

This dataset comprises predictors such as cholesterol, age, diabetes, and family history that are used to predict the onset of heart disease in a patient.

You can use Python or R to analyze the relationships present in this dataset, and come up with answers to questions such as:

  • Are patients with diabetes more likely to develop heart disease at an early age?
  • Is there a certain demographic group that is at higher risk of heart disease than others?
  • Does frequent exercise lower the risk of developing heart disease?
  • Are smokers more likely to develop heart disease than non-smokers?
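Questions like these often start with a simple grouped comparison. The sketch below uses a tiny invented table, so the column names and values are assumptions rather than the Framingham data, and the rate_by helper is hypothetical.

```python
import pandas as pd

# Invented toy data; 1 means yes and 0 means no in every column.
patients = pd.DataFrame({
    "smoker":        [1, 1, 1, 1, 0, 0, 0, 0],
    "diabetes":      [0, 1, 0, 1, 0, 0, 1, 0],
    "heart_disease": [1, 1, 0, 1, 0, 0, 1, 0],
})


def rate_by(df, factor, outcome="heart_disease"):
    """Share of patients with the outcome within each level of the factor."""
    return df.groupby(factor)[outcome].mean()


smoker_rates = rate_by(patients, "smoker")
diabetes_rates = rate_by(patients, "diabetes")
print(smoker_rates)
print(diabetes_rates)
```

A gap between groups is a starting point rather than proof of cause, since other variables such as age may differ between the groups as well.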

Being able to answer these questions with the help of available data is a vital skill for a data scientist to have.

Not only will this project help strengthen your skills as an analyst, it will also showcase your ability to derive insight from large datasets.

 

World Happiness Report

 
Dataset: World Happiness Report

Tutorial: World Happiness Report EDA

Language: Python

The World Happiness Report tracks six factors to measure global happiness — life expectancy, economics, social support, absence of corruption, freedom, and generosity.

You can answer the following questions when performing an analysis on this dataset:

  • Which country is the happiest in the world?
  • What are the most important contributing factors to a nation’s happiness?
  • Is overall happiness increasing or decreasing?
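The first two questions can be approached with a ranking and a correlation check. This is a sketch on invented numbers: the country names, column names, and values are assumptions, not figures from the report.

```python
import pandas as pd

# Invented values for illustration only.
report = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "score": [7.5, 6.8, 5.9, 5.1, 4.2],
    "gdp": [1.4, 1.3, 1.0, 0.9, 0.5],
    "generosity": [0.2, 0.3, 0.1, 0.3, 0.2],
})

# Country with the highest happiness score.
happiest = report.loc[report["score"].idxmax(), "country"]

# Pearson correlation of each factor with the score, strongest first.
factor_corr = (
    report[["gdp", "generosity"]]
    .corrwith(report["score"])
    .sort_values(ascending=False)
)

print(happiest)
print(factor_corr)
```

Answering the third question would need one such table per year, so that scores can be compared over time.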

Again, this is a project that will help improve your skillset as an analyst. A trait I’ve seen in most successful data analysts is curiosity.

Data scientists and analysts are always looking for contributing factors.

They are always looking to find relationships between variables, and are constantly asking questions.

If you are an aspiring data scientist, doing projects like this will help you develop an analytical mind.

 

Skill 3: Data Visualization

 

Photo by Lukas Blazek on Unsplash


 

When you start working as a data scientist, your clients and stakeholders will usually be non-technical people.

You will need to break down your insight and present findings to a non-technical audience.

The best way to do this is in the form of visualizations.

Presenting an interactive dashboard will help you convey your insights a lot better, as graphs are easy to understand at a glance.

Because of this, many companies list data visualization as a must-have skill for data science-related positions.

Here are some projects that you can showcase on your portfolio to demonstrate your data visualization skills:

 

Building a Covid-19 Dashboard

 
Dataset: Covid-19 Data Repository at Johns Hopkins University

Tutorial: Building Covid-19 Dashboard with Python and Tableau

Language: Python

You will first need to pre-process the dataset above using Python. Then, you can create an interactive Covid-19 dashboard using Tableau.
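A common pre-processing step for dashboards is reshaping a wide table into one row per place and date. The sketch below assumes a wide layout with one column per date and cumulative counts; the sample values, the output column names, and the to_long helper are made up, so check them against the actual files.

```python
import pandas as pd

# Assumed wide layout: one row per place, one column per date, cumulative counts.
wide = pd.DataFrame({
    "Country/Region": ["X", "Y"],
    "1/22/20": [0, 1],
    "1/23/20": [2, 1],
    "1/24/20": [5, 4],
})


def to_long(df, id_col="Country/Region"):
    """Reshape to one row per place and date, and derive daily new cases."""
    long_df = df.melt(id_vars=[id_col], var_name="date", value_name="confirmed")
    long_df["date"] = pd.to_datetime(long_df["date"], format="%m/%d/%y")
    long_df = long_df.sort_values([id_col, "date"]).reset_index(drop=True)
    # Difference of the cumulative total within each place; the first day
    # has no previous value, so it keeps its own count.
    daily = long_df.groupby(id_col)["confirmed"].diff()
    long_df["new_cases"] = daily.fillna(long_df["confirmed"]).astype(int)
    return long_df


long_df = to_long(wide)
print(long_df)
```

Saving the result with long_df.to_csv gives a file that a dashboard tool can read with date as a single field.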

Tableau is one of the most in-demand data visualization tools, and many entry-level data science positions list it as a requirement.

Building a dashboard using Tableau and showcasing it on your portfolio will help you stand out as it demonstrates your proficiency in using the tool.

 

Building an IMDB-Movie Dataset Dashboard

 
Dataset: IMDb Top Rated Movies

Tutorial: Exploring IMDb Top 250 with Tableau

You can experiment with the IMDb dataset and create an interactive movie dashboard with Tableau.

As I mentioned above, showcasing Tableau dashboards that you have built can help your portfolio stand out.

Another great thing about Tableau is that you can upload your visualizations to Tableau Public, and share the link with anybody who wants to use your dashboard.

This means that potential employers can interact with your dashboard, which sparks interest. Once they are interested in your project and can actually play around with the end product, you are already a step closer to getting the job.

If you want to get started with Tableau, you can visit my tutorial here.

 

Skill 4: Machine Learning

 

Photo by Kevin Ku on Unsplash


 

Finally, you will need to showcase projects that demonstrate your proficiency in machine learning.

I suggest doing both supervised and unsupervised machine learning projects.

 

Sentiment Analysis on Food Reviews

 
Dataset: Amazon Fine Food Reviews Dataset

Tutorial: A beginner’s guide to sentiment analysis with Python

Language: Python

Sentiment analysis is a widely used application of machine learning. Businesses often use it to gauge the overall customer response to their products.

Customers usually talk about products on social media and customer feedback forums. This data can be collected and analyzed to gain an understanding of how different people respond to different marketing strategies.

Based on the sentiment analysis conducted, companies can position their products differently or change their target audience.

I suggest showcasing one sentiment analysis project on your portfolio, as almost all businesses have a social media presence and the need to gauge customer feedback.
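To give a feel for the modelling step, here is a minimal sketch of a text classifier. It is one possible approach and not necessarily the one used in the tutorial: word counts fed into logistic regression with scikit-learn, trained on a handful of invented reviews rather than the Amazon dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented reviews; label 1 is positive and 0 is negative.
train_texts = [
    "great taste love it",
    "delicious and fresh",
    "love this product great",
    "fresh and tasty great value",
    "terrible taste",
    "stale and awful",
    "awful product hate it",
    "hate the terrible smell",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Turn each review into word counts, then fit a linear classifier on them.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

predictions = model.predict(["great and delicious", "awful and stale"])
print(predictions)
```

With real review data you would hold out a test set and report accuracy on it, since eight training sentences say nothing about how well the model generalizes.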

 

Life Expectancy Prediction

 
Dataset: Life Expectancy Dataset

Tutorial: Life Expectancy Regression

Language: Python

In this project, you will predict life expectancy from variables such as education, number of infant deaths, alcohol consumption, and adult mortality. These are population-level measures, so each prediction describes a population rather than an individual.

The sentiment analysis project I listed above is a classification problem, which is why I’m adding a regression problem to the list.
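A regression project follows the same fit-and-evaluate pattern, but the target is a number instead of a class. The sketch below uses synthetic data generated from an assumed formula, so the variable names, coefficients, and resulting score are illustrative only and say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data from an assumed linear relationship plus noise.
rng = np.random.default_rng(0)
n = 200
schooling = rng.uniform(0, 20, n)
adult_mortality = rng.uniform(50, 500, n)
noise = rng.normal(0, 1, n)
life_expectancy = 50 + 2.0 * schooling - 0.05 * adult_mortality + noise

X = np.column_stack([schooling, adult_mortality])
X_train, X_test, y_train, y_test = train_test_split(
    X, life_expectancy, test_size=0.25, random_state=0
)

# Fit on the training split and score on rows the model has not seen.
reg = LinearRegression().fit(X_train, y_train)
r2 = r2_score(y_test, reg.predict(X_test))

print(reg.coef_, reg.intercept_)
print(round(r2, 3))
```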

It is important to showcase a variety of projects on your resume to show your expertise in different areas.

 

Breast Cancer Analysis

 
Dataset: Breast Cancer Dataset

Tutorial: Cluster analysis of breast cancer dataset

Language: Python

In this project, you will use a K-means clustering algorithm to group tumor samples based on their measured attributes, then check how well the resulting clusters separate samples with and without breast cancer.

K-means clustering is an unsupervised learning technique.

It is important to have clustering projects on your portfolio because most real world data is unlabelled.

Even massive datasets collected by companies usually don’t have training labels. As a data scientist you might need to do the labelling yourself using unsupervised learning techniques.
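Here is a minimal sketch of the clustering workflow. It uses the breast cancer data bundled with scikit-learn, which may differ from the dataset linked above, and the choice of two clusters and feature scaling are assumptions. The labels are not used for fitting, only afterwards to see how closely the clusters line up with the diagnoses.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()

# Scale features so that large-valued measurements do not dominate distances.
X = StandardScaler().fit_transform(data.data)

# Fit K-means without looking at the diagnosis labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
clusters = kmeans.fit_predict(X)

# Cluster ids are arbitrary, so take the better of the two possible matchings.
match = np.mean(clusters == data.target)
agreement = max(match, 1 - match)

print(round(agreement, 3))
```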

 

Conclusion

 
You need to showcase projects that display a variety of skills — including data collection, analysis, visualization, and machine learning.

Online courses aren’t sufficient for you to gain skills in all these areas. However, you can find tutorials for almost every kind of project you want to do.

All you need is basic knowledge of Python, and you will be able to follow along with these tutorials.

Once you get all the code to work and are able to follow along properly, you can replicate the solution and work on a variety of different projects on your own.

Remember, it is important to showcase projects on your portfolio if you are a beginner in the field of data science and don’t have a degree or master’s in the subject.

Portfolio projects are one of the best ways to display your skills to a potential employer, especially to land your first entry level job in the field.

Read about how I got my first data science internship here.


Sooner or later, those who win are those who think they can — Paul Tournier


 
Bio: Natassha Selvaraj (LinkedIn) I am currently pursuing a degree in computer science, and I major in data science. My interest lies in the field of machine learning, and I have worked on a variety of projects in this domain. I also enjoy problem-solving and programming, which I do on a daily basis.

Original. Reposted with permission.
