14 Data Science projects to improve your skills

There's a lot of data out there and so many data science techniques to master or review. Check out these great project ideas, ranging from easy to advanced, to develop new skills and strengthen your portfolio.



Photo by Austin Distel on Unsplash.

First of all, I want to give a huge shoutout to all of the nurses, doctors, grocery clerks, public administrators, and anyone else who’s putting their life at risk to serve their community.

Let’s not take this for granted. Take this time in isolation to learn new skills, read books, and improve yourself. For those interested in data, data analytics, or data science, I’m providing a list of fourteen data science projects that you can do during your spare time!

There are three types of projects:

  1. Visualization projects
  2. Exploratory data analysis (EDA) projects
  3. Prediction modeling

 

Visualization Projects

 

Perhaps the quickest projects to complete are data visualizations! Below are three interesting datasets that you can use to create some intriguing visualizations to add to your portfolio.

  • Coronavirus visualizations

Difficulty: Easy
Link to dataset here.

Learn how to build dynamic visualizations using Plotly to show how the coronavirus has spread globally over time like the one above! Plotly is an amazing library that makes data visualizations dynamic, appealing, and simple.

If you want to learn how to build a visualization like the one above, check out my tutorial here.

My friend, Jack, also wrote an article on predicting the coronavirus recovery here!

  • Australian Wildfire Visualizations

Difficulty: Easy
Link to dataset here.

Taken from Vox.

The 2019–2020 bushfire season, also known as the Black Summer, consisted of several extreme wildfires starting in June 2019. The fires burnt an estimated 18.6 million hectares and destroyed over 5,900 buildings, according to Wikipedia.

This makes for an interesting project! Leverage your data visualization skills using Plotly or Matplotlib to show the magnitude and geographical impact of the wildfires.

See how my friend, Jack, predicted Brazil’s wildfire patterns here!

  • Earth Surface Temperature Visualization

Difficulty: Easy-Medium
Link to dataset here.

Photo by William Bossen on Unsplash.

Know any climate change deniers? Create some data visualizations to show how the Earth’s surface temperatures have changed over time. You can do this by creating a line graph or another animated choropleth map!

Bonus: create a prediction model that shows what Earth’s temperatures are expected to be in fifty years.
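
For the bonus, a straight-line fit is a reasonable first baseline. The sketch below fits an ordinary least squares line to yearly mean temperatures and extrapolates it fifty years ahead. The `years` and `temps` values are made-up placeholders, not figures from the dataset, and a linear trend is a simplifying assumption rather than a climate model.

```python
# Made-up yearly averages in degrees Celsius; replace them with values
# aggregated from the real dataset.
years = [1980, 1990, 2000, 2010, 2020]
temps = [14.0, 14.2, 14.4, 14.6, 14.8]


def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least squares line."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(years, temps)

# Extrapolate the fitted trend fifty years past the last observation.
target_year = years[-1] + 50
forecast = slope * target_year + intercept

print(f"Trend: {slope * 10:.2f} degrees per decade")
print(f"Extrapolated temperature in {target_year}: {forecast:.1f}")
```

Once this baseline works, you can compare it against more flexible models and see whether they actually improve on it.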

 

Exploratory Data Analysis Projects

 

Exploratory data analysis (EDA), also known as data exploration, is the step in the data analysis process where you use a number of techniques to better understand the dataset you are working with.

If you want to learn more about EDA, check out my guide here!

  • New York Airbnb Data Exploration

Difficulty: Medium
Link to dataset here.

Photo by Oliver Niblett on Unsplash.

Since 2008, guests and hosts have used Airbnb to expand their travel possibilities and find more personalized ways of experiencing the world. This dataset contains information on 2019 listings in New York, including their geographical information, prices, number of reviews, and more.

Some questions that you can try to answer are as follows:

  • Which hosts are the busiest and why?
  • What areas have more traffic than others, and why is that the case?
  • Are there any relationships between prices, number of reviews, and the number of days that a given listing is booked?
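
The first and third questions can be sketched with nothing but the standard library. The snippet below counts listings per host and computes a Pearson correlation between price and review count. The four rows are invented for illustration, and the field names `host_id`, `price`, and `number_of_reviews` are assumptions about the CSV's columns, so check them against the file you download.

```python
from collections import Counter
from math import sqrt

# Invented rows; in practice, load them with csv.DictReader and cast the
# numeric fields to int or float.
listings = [
    {"host_id": "a", "price": 200, "number_of_reviews": 10},
    {"host_id": "a", "price": 150, "number_of_reviews": 30},
    {"host_id": "b", "price": 100, "number_of_reviews": 50},
    {"host_id": "a", "price": 80, "number_of_reviews": 70},
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Rank hosts by how many listings they manage (a rough proxy for "busy").
listings_per_host = Counter(row["host_id"] for row in listings)
busiest_host, busiest_count = listings_per_host.most_common(1)[0]

prices = [row["price"] for row in listings]
reviews = [row["number_of_reviews"] for row in listings]
price_review_corr = pearson(prices, reviews)

print(f"Busiest host: {busiest_host} ({busiest_count} listings)")
print(f"Correlation between price and reviews: {price_review_corr:.2f}")
```

A coefficient near -1 or 1 only flags a linear association. Answering the "why" still takes a closer look at the listings behind the numbers.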

 

  • Most Important Factors related to Employee Attrition and Performance

Difficulty: Easy
Link to dataset here.

Photo by Campaign Creators on Unsplash.

IBM created a synthetic dataset that you can use to understand how various factors affect employee attrition and satisfaction. Some of the variables include education, job involvement, performance rating, and work-life balance.

Explore this dataset and see which variables have a significant effect on employee satisfaction. Take it a step further and see if you can rank the variables from most to least important.

  • World University Rankings

Difficulty: Easy
Link to dataset here.

Photo by Vasily Koloda on Unsplash.

Do you think your country has the best university in the world? What does it mean to be the ‘best’ university to start with? This dataset contains three global university rankings. Using this data, see if you can answer the following questions:

  • What countries are the top universities in?
  • What are the main factors that determine a university’s world ranking?

 

  • Alcohol and school success

Difficulty: Easy
Link to dataset here.

Photo by Kevin Kelly on Unsplash.

Does alcohol affect students’ grades? If not, what does? This data was obtained from a survey of students in math and Portuguese language courses in secondary school. It contains several variables, such as alcohol consumption, family size, and involvement in extracurriculars.

Using this, explore the relationship between school performance and various factors. As a bonus, see if you can predict a student’s final grade based on other variables!

  • Pokemon Data Exploration

Difficulty: Easy
Link to dataset here.

Taken from Pokemon.com.

For all of you gamers out there, here’s a dataset that contains information on all 802 Pokemon from all seven generations. Here are several questions that you can try to answer!

  • Which generation has the strongest Pokemon? Which has the weakest?
  • What Pokemon type is the strongest? The weakest?
  • Is it possible to build a classifier to identify a legendary Pokemon?
  • Are there any correlations between physical traits and strength stats (attack, defense, speed, etc.)?

 

  • Exploring Factors of Life Expectancy

Difficulty: Easy
Link to dataset here.

The WHO created a dataset on the health status of all countries over time that includes statistics on life expectancy, adult mortality, and more. Using this dataset, explore the relationships between the variables. What has the biggest impact on life expectancy?

This dataset was created to answer the following questions:

  1. Do the predicting factors that were chosen initially really affect life expectancy? Which predicting variables actually affect life expectancy?
  2. Should a country with a lower life expectancy (<65) increase its healthcare expenditure in order to improve its average lifespan?
  3. How do infant and adult mortality rates affect life expectancy?
  4. Does life expectancy have a positive or negative correlation with eating habits, lifestyle, exercise, smoking, drinking alcohol, etc.?
  5. What is the impact of schooling on the lifespan of humans?
  6. Does life expectancy have a positive or negative relationship with drinking alcohol?
  7. Do densely populated countries tend to have a lower life expectancy?
  8. What is the impact of immunization coverage on life expectancy?

Check out my article on Predicting Life Expectancy with Regression for inspiration!

 

Prediction Modeling

 

  • Time Series Forecast on Energy Consumption

Difficulty: Medium-Advanced
Link to dataset here.

Photo by Matthew Henry on Unsplash.

This dataset is composed of power consumption data from PJM’s website. PJM is a regional transmission organization in the United States. Using this dataset, see if you can build a time series model to predict energy consumption. In addition, see if you can find patterns by hour of day, around holidays, and in long-term trends!
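
Before reaching for a full time series model, it helps to have a baseline to beat. The sketch below builds an average load profile by hour of day and a seasonal naive forecast that simply repeats the values from 24 hours earlier. The readings are synthetic (a daily cycle plus a slow upward drift), and `seasonal_naive` is a hypothetical helper name, not part of any library.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean

# Synthetic hourly load in megawatts: a daily cycle plus a slow drift.
start = datetime(2018, 1, 1)
readings = [
    (start + timedelta(hours=h), 1000 + 100 * (h % 24) + h)
    for h in range(72)
]

# Average load for each hour of the day.
by_hour = defaultdict(list)
for timestamp, load in readings:
    by_hour[timestamp.hour].append(load)
hourly_profile = {hour: mean(loads) for hour, loads in by_hour.items()}


def seasonal_naive(history, horizon, season=24):
    """Forecast by repeating the last full season of observations."""
    last_season = history[-season:]
    return [last_season[i % season] for i in range(horizon)]


# Hold out the final day and score the baseline with mean absolute error.
loads = [load for _, load in readings]
train, test = loads[:48], loads[48:]
forecast = seasonal_naive(train, horizon=len(test))
mae = mean([abs(actual - predicted) for actual, predicted in zip(test, forecast)])

print(f"Average load at 17:00: {hourly_profile[17]}")
print(f"Seasonal naive MAE: {mae}")
```

Any model you build afterwards should score a lower error than this baseline on the same held-out period; if it doesn't, the extra complexity isn't paying off.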

 

  • Loan Prediction Forecast

Difficulty: Easy
Link to dataset here.

Photo by Dmitry Demidko on Unsplash.

Taken from Analytics Vidhya, this dataset has 615 rows and 13 columns on past loans that have and haven’t been approved. See if you can create a model that predicts whether a loan will get approved or not.

 

  • Used Car Price Estimator

Difficulty: Medium
Link to dataset here.

Photo by Parker Gibbs on Unsplash.

Craigslist is the world’s largest collection of used vehicles for sale. This dataset is composed of scraped data from Craigslist and is updated every few months. Using this dataset, see if you can create a model that predicts whether a car listing is over or underpriced.

Check out my model that predicts used car prices here!
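
One simple way to frame "over or underpriced" is to compare each asking price with a reference price for comparable cars. The sketch below uses the median price of listings that share a model and year as that reference and flags anything more than 15% away from it. The listings, the field names, the `label_listings` helper, and the 15% tolerance are all illustrative assumptions; a regression model's predicted price could take the place of the median.

```python
from collections import defaultdict
from statistics import median

# Invented listings; the real data has many more fields and far messier values.
listings = [
    {"id": 1, "model": "civic", "year": 2015, "price": 8000},
    {"id": 2, "model": "civic", "year": 2015, "price": 10000},
    {"id": 3, "model": "civic", "year": 2015, "price": 11000},
    {"id": 4, "model": "civic", "year": 2015, "price": 15000},
]


def label_listings(rows, tolerance=0.15):
    """Label each listing relative to the median price of comparable cars."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["model"], row["year"])].append(row["price"])
    reference = {key: median(prices) for key, prices in groups.items()}

    labels = {}
    for row in rows:
        ratio = row["price"] / reference[(row["model"], row["year"])]
        if ratio > 1 + tolerance:
            labels[row["id"]] = "overpriced"
        elif ratio < 1 - tolerance:
            labels[row["id"]] = "underpriced"
        else:
            labels[row["id"]] = "fair"
    return labels


labels = label_listings(listings)
print(labels)
```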

  • Detecting Credit Card Fraud

Difficulty: Medium-Advanced
Link to dataset here.

Photo by rupixen.com on Unsplash.

This dataset presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced, with the positive class (frauds) accounting for 0.172% of all transactions. Learn how to work with unbalanced datasets and build a credit card fraud detection model.
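
The class counts above already show why plain accuracy is a poor metric here. The sketch below scores a do-nothing baseline that labels every transaction as legitimate: it is right almost all of the time yet catches no fraud, which is why precision and recall are more informative. Only the two counts come from the dataset description; the `precision_recall` helper is written here for illustration.

```python
TOTAL_TRANSACTIONS = 284_807
FRAUD_COUNT = 492

# 1 = fraud, 0 = legitimate. Only the class counts matter for this demo.
y_true = [1] * FRAUD_COUNT + [0] * (TOTAL_TRANSACTIONS - FRAUD_COUNT)
y_pred = [0] * TOTAL_TRANSACTIONS  # baseline: never flag anything


def precision_recall(actual, predicted):
    """Return (precision, recall) for the positive class (label 1)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


accuracy = sum(a == p for a, p in zip(y_true, y_pred)) / TOTAL_TRANSACTIONS
precision, recall = precision_recall(y_true, y_pred)

print(f"Baseline accuracy: {accuracy:.2%}")
print(f"Baseline precision: {precision:.2f}, recall: {recall:.2f}")
```

Whatever model you train, judge it on how many frauds it recovers and how many false alarms it raises, not on how often it agrees with the majority class.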

 

  • Skin Cancer Image Detection

Difficulty: Advanced
Link to dataset here.

Photo by Allie Smith on Unsplash.

With over 10,000 images, see if you can build a neural network to detect skin cancer. This is definitely the hardest project and requires extensive knowledge of neural networks and image recognition. Tip: refer to kernels created by other users if you’re stuck!

Original. Reposted with permission.
