Getting Started with Python for Data Science

Back to Basics: A beginner's guide to setting up Python and understanding its role in data science.

By Nisha Arya, Contributing Editor & Marketing and Client Success Manager on September 1, 2023 in Data Science


Summer is over and it’s back to studying or working on your self-development plan. Many of you may have had the summertime to think about what your next steps will be, and if that involves anything to do with Data Science - you need to read this blog.

Generative AI, ChatGPT, Google Bard: you have probably heard these terms a lot over the past few months. With all this buzz, many of you are thinking about getting into a tech field such as Data Science.

People in many different roles want to keep their jobs, so they aim to develop skills that fit the current market. It is a competitive market, and more and more people are becoming interested in Data Science, a sector with thousands of online courses, bootcamps, and Master's (MSc) degrees available.

If you want to know what FREE courses you can take for Data Science, have a read of Top Free Data Science Online Courses for 2023.

With that being said, if you want to crack into the world of Data Science, you need to know about Python.

Role of Python in Data Science

Python was first released in February 1991 by Dutch programmer Guido van Rossum. Its design heavily emphasizes code readability. The structure of the language and its object-oriented approach help new and experienced programmers write clear, understandable code, from small projects to large ones and from small data to big data.

More than three decades later, Python is considered one of the best programming languages to learn today.

Python has a wide variety of libraries and frameworks, so you don’t have to do everything from scratch. These pre-built components contain useful, readable code that you can use in your own programs. Examples include NumPy, Matplotlib, SciPy, BeautifulSoup, and more.

If you would like to know more about Python Libraries, read the following article: Python Libraries Data Scientists Should Know in 2022.

Python is efficient, fast, and reliable, which allows developers to create applications, perform analysis, and produce visualized outputs with minimum effort. That is everything you need to become a Data Scientist!

Setting Up Python

If you’re looking to become a Data Scientist, here is a step-by-step guide to help you get started with Python:

Install Python

First, you will need to download the latest version of Python. You can find out the latest version by heading over to the official website here.

Based on your operating system, follow the installation instructions through to the end.
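Once the installer has finished, one quick way to check that everything works is to run a few lines of Python. This is only a minimal sketch: the variable names are my own, and the check for Python 3 is an assumption based on the syntax used in the examples later in this guide.

```python
import platform
import sys

# Report which Python interpreter is running this script.
version = platform.python_version()
print("Python version:", version)
print("Interpreter path:", sys.executable)

# Assumption: the examples in this guide are written for Python 3.
is_python3 = sys.version_info.major == 3
print("Running Python 3:", is_python3)
```

If the version printed is not the one you just installed, your system may be picking up an older installation first.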

Choose your IDE or Code Editor

An IDE (integrated development environment) is a software application that programmers use to develop code more efficiently. A code editor serves the same purpose, but it is essentially a text editor program.

If you are unsure of which one to choose, I will provide a list of popular options:

When I started my Data Science career, I worked with Visual Studio Code (VSC) and Jupyter Notebook, which I found very useful for learning data science and coding interactively. Once you have chosen one that fits your needs, install it and go through the walk-throughs on how to use it.

Learn The Basics

Before you dive into the deep end of comprehensive projects, you need to first learn the basics. So let’s dive into them.

Variables and Data Types

A variable is a container that stores a data value. Data values come in various data types, such as integers, floating-point numbers, strings, lists, tuples, dictionaries, and more. Learning these is very important, as they form your foundational knowledge.

In the following example, the variable is called name and it contains the value “John”. The data type is a string: name = "John"
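To see a few of these data types side by side, here is a small sketch. Apart from name, all of the variables and values are made up for illustration.

```python
name = "John"                         # str: text
age = 30                              # int: whole number
height_m = 1.82                       # float: number with a decimal point
skills = ["Python", "SQL"]            # list: ordered and changeable
point = (51.5, -0.12)                 # tuple: ordered and fixed
person = {"name": name, "age": age}   # dict: key-value pairs

# type() tells you which data type a value has.
for value in (name, age, height_m, skills, point, person):
    print(type(value).__name__, value)
```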

Operators and Expressions

Operators are symbols that allow computation tasks such as addition, subtraction, multiplication, division, exponentiation etc. An expression in Python is a combination of operators and operands.

For example: x = x + 10
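A short sketch of the common arithmetic operators in action. The starting value of x and the other variable names are arbitrary choices for illustration.

```python
x = 5
x = x + 10            # addition: x is now 15

difference = x - 3    # subtraction: 12
product = x * 2       # multiplication: 30
quotient = x / 4      # division always gives a float: 3.75
power = x ** 2        # exponentiation: 225
remainder = x % 4     # modulo (remainder after division): 3

# Comparison operators produce True or False.
is_large = x > 10     # True

print(x, difference, product, quotient, power, remainder, is_large)
```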

Control Structures

Control structures make your programming life easier by specifying the flow of execution in your code. In Python, there are several types of control structures that you need to learn such as conditional statements, loops, and exception handling.

For example:

x = 5  # example value

if x > 0:
    print("Positive")
else:
    print("Non-positive")
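The other two control structures mentioned above, loops and exception handling, might look something like this. It is a minimal sketch with made-up values, not the only way to write them.

```python
values = [4, -2, 0, 7]

# for loop: repeat the conditional check for every item in a list.
labels = []
for v in values:
    if v > 0:
        labels.append("Positive")
    else:
        labels.append("Non-positive")

# while loop: keep going until the condition becomes False.
countdown = 3
steps = []
while countdown > 0:
    steps.append(countdown)
    countdown -= 1

# Exception handling: catch an error instead of letting the program crash.
try:
    result = 10 / 0
except ZeroDivisionError:
    result = None

print(labels, steps, result)
```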

Functions

A function is a block of code that only runs when it is called. You can create a function using the def keyword.

For example:

def greet(name): 
    return f"Hello, {name}!"
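Defining a function does not run it; you still need to call it. Here is a small sketch that calls greet and adds a second function, mean, which is a hypothetical helper I made up for illustration.

```python
def greet(name):
    """Return a greeting for the given name."""
    return f"Hello, {name}!"


def mean(values):
    """Return the arithmetic mean of a list of numbers (hypothetical helper)."""
    return sum(values) / len(values)


# Calling the functions runs the code inside them.
message = greet("John")
average = mean([2, 4, 6])

print(message)   # Hello, John!
print(average)   # 4.0
```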

Modules and Libraries

A module in Python is a file containing Python definitions and statements. It can define functions, classes, and variables. A library is a collection of related modules or packages. You use modules and libraries by importing them with the import statement.

For example, I mentioned above that Python contains a variety of libraries and frameworks such as NumPy. You can import these different libraries by running:

import numpy as np
import pandas as pd
import math
import random

There are various libraries and modules you can import using Python.
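Once a module is imported, you reach its functions through its name or alias. A rough sketch, assuming NumPy is already installed (math and random ship with Python); the seed value 42 and the variable names are arbitrary.

```python
import math
import random

import numpy as np

# math: standard mathematical functions.
root = math.sqrt(16)            # 4.0

# random: pseudo-random numbers; seeding makes the run repeatable.
random.seed(42)
roll = random.randint(1, 6)     # a whole number from 1 to 6, inclusive

# numpy: fast arrays and element-wise arithmetic.
arr = np.array([1, 2, 3, 4])
arr_mean = arr.mean()           # 2.5
doubled = arr * 2               # every element multiplied by 2

print(root, roll, arr_mean, doubled)
```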

Working with Data

Once you have a better understanding of the basics and how they work, your next step is to use these skills to work with data. You will need to learn how to:

Import and Export Data using Pandas

Pandas is a widely used Python library in the world of data science, as it offers a flexible and intuitive way to handle data sets of all sizes. Let’s say you have your data in a CSV file; you can use pandas to import the dataset like this:

import pandas as pd

example_data = pd.read_csv("data/example_dataset1.csv")
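Exporting works in the opposite direction with to_csv. The sketch below builds a tiny made-up table, writes it to a temporary folder, and reads it back in. The file name scores.csv and the column names are hypothetical.

```python
import os
import tempfile

import pandas as pd

# A small, made-up dataset to export.
df = pd.DataFrame({"name": ["Ana", "Ben"], "score": [91, 78]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scores.csv")

    # Export: index=False leaves the row numbers out of the file.
    df.to_csv(path, index=False)

    # Import: read the same file back into a new DataFrame.
    loaded = pd.read_csv(path)

print(loaded.head())
```

In your own project you would pass a normal file path instead of a temporary folder.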

Data Cleaning and Manipulation

Data cleaning and manipulation are vital steps in the data preprocessing phase of a data science project, as you take raw data and comb through all of its inconsistencies, errors, and missing values to transform it into a structured format that can be used for analysis.

Elements of data cleaning include:

Handling missing values
Duplicate data
Outliers
Data transformation
Data type cleaning
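To make these elements a little more concrete, here is a rough sketch of each one in pandas. The dataset is made up, and the choices in it (treating ages outside 0 to 120 as outliers, filling gaps with the median) are assumptions for illustration, not rules.

```python
import pandas as pd

# Made-up raw data: a duplicate row, a missing age, and an impossible age.
raw = pd.DataFrame({
    "customer": ["a", "b", "b", "c", "d", "e"],
    "age": ["25", "31", "31", None, "29", "240"],
})

# Duplicate data: drop rows that are exact copies.
clean = raw.drop_duplicates()

# Data type cleaning: the ages are stored as text, so convert them to numbers.
clean = clean.assign(age=pd.to_numeric(clean["age"]))

# Outliers: keep ages in an assumed plausible range (missing values stay for now).
in_range = clean["age"].between(0, 120)
clean = clean[in_range | clean["age"].isna()]

# Handling missing values: fill the gaps with the median age.
clean = clean.assign(age=clean["age"].fillna(clean["age"].median()))

# Data transformation: rescale age to the range 0 to 1 (min-max scaling).
age_min, age_max = clean["age"].min(), clean["age"].max()
clean = clean.assign(age_scaled=(clean["age"] - age_min) / (age_max - age_min))

print(clean)
```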

Elements of data manipulation include:

Selecting and filtering data
Sorting data
Grouping data
Joining and merging data
Creating new variables
Pivoting and cross-tabulation
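Each of these manipulation steps maps onto a pandas operation. Here is a minimal sketch using two made-up tables; the column names and values are hypothetical.

```python
import pandas as pd

# Made-up sales data and a lookup table of regional managers.
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "product": ["A", "A", "B", "B"],
    "units": [10, 4, 7, 12],
    "price": [2.0, 2.0, 5.0, 5.0],
})
managers = pd.DataFrame({"region": ["North", "South"],
                         "manager": ["Priya", "Tom"]})

# Creating new variables: revenue per row.
sales["revenue"] = sales["units"] * sales["price"]

# Selecting and filtering: only orders of more than 5 units.
big_orders = sales[sales["units"] > 5]

# Sorting: highest revenue first.
ranked = sales.sort_values("revenue", ascending=False)

# Grouping: total revenue per region.
by_region = sales.groupby("region")["revenue"].sum()

# Joining and merging: attach each region's manager.
merged = sales.merge(managers, on="region", how="left")

# Pivoting: regions as rows, products as columns, units as values.
pivot = sales.pivot_table(index="region", columns="product",
                          values="units", aggfunc="sum")

print(by_region)
print(pivot)
```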

You will need to learn all of these elements and how they are used in Python. Want to start now? You can read Learn Data Cleaning and Preprocessing for Data Science with This Free eBook.

Statistical Analysis

As a data scientist, you will need to comb through your data to identify trends, patterns, and insights. You can achieve this through statistical analysis, which is the process of collecting and analyzing data in order to identify those patterns and trends.

This phase is used to remove bias through numerical analysis, allowing you to further your research, develop statistical models, and more. The conclusions are used in the decision-making process to make future predictions based on past trends.

There are 6 types of statistical analysis:

Descriptive Analysis
Inferential Analysis
Predictive Analysis
Prescriptive Analysis
Exploratory Data Analysis
Causal Analysis
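As a taste of the first type, descriptive analysis summarizes a dataset with a handful of numbers. Here is a minimal sketch using Python's built-in statistics module; the daily sales figures are made up.

```python
import statistics

# Made-up daily sales figures.
daily_sales = [12, 15, 11, 15, 20, 17]

mean_sales = statistics.mean(daily_sales)       # the average value
median_sales = statistics.median(daily_sales)   # the middle value
mode_sales = statistics.mode(daily_sales)       # the most common value
spread = statistics.stdev(daily_sales)          # sample standard deviation

print("Mean:", mean_sales)
print("Median:", median_sales)
print("Mode:", mode_sales)
print("Standard deviation:", round(spread, 2))
```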

In this blog, I will dive a bit more into Exploratory Data Analysis.

Exploratory Data Analysis (EDA)

Once you have cleaned and manipulated data, it is ready for the next step: exploratory data analysis. This is when data scientists analyze and investigate the dataset and create a summary of the main characteristics/variables that can help them gain further insight and create data visualizations.

EDA tools include:

Predictive modeling such as linear regression
Clustering techniques such as K-means clustering
Dimensionality reduction techniques such as Principal Component Analysis (PCA)
Univariate, Bivariate, and Multivariate visualizations

This phase of data science can be the most difficult aspect and requires a lot of practice. Libraries and modules can assist you, but you will need to understand the task at hand and what you want your outcome to be to figure out what EDA tool you need.
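To give you a feel for it, here is a rough sketch of a first pass at EDA on a tiny made-up dataset: a univariate summary, a bivariate correlation, and a simple linear regression fitted with NumPy. The column names and numbers are hypothetical, and a real project would involve far more data and checks.

```python
import numpy as np
import pandas as pd

# Made-up data: hours studied and the exam score achieved.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 61, 67, 72],
})

# Univariate: count, mean, spread, and quartiles for each column.
summary = df.describe()

# Bivariate: how strongly do the two columns move together?
correlation = df["hours_studied"].corr(df["exam_score"])

# Simple linear regression: fit a straight line, score = slope * hours + intercept.
slope, intercept = np.polyfit(df["hours_studied"], df["exam_score"], deg=1)

print(summary)
print("Correlation:", round(correlation, 3))
print("Fitted line: score =", round(slope, 2), "* hours +", round(intercept, 2))
```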

Data Visualisation

EDA is used to gain further insight and create data visualizations. As a data scientist, you will be expected to create visualizations of your findings. These can be basic visualizations such as line charts, bar plots, and scatter plots, but you can also get creative with heatmaps, choropleth maps, and bubble charts.

There are various data visualization libraries that you can use; however, these are the most popular:

Data visualizations allow for better communication, especially for stakeholders who are not highly technically inclined.
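As an illustration, here is a minimal sketch of two basic charts using Matplotlib, one of the libraries mentioned earlier. It assumes Matplotlib is installed; the revenue figures and the output file name are made up, and the "Agg" backend is only there so the script runs without opening a window.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # draw to a file instead of opening a window
import matplotlib.pyplot as plt

# Made-up monthly revenue figures.
months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [120, 135, 128, 150]

# One figure with two charts side by side.
fig, (ax_line, ax_bar) = plt.subplots(1, 2, figsize=(8, 3))

ax_line.plot(months, revenue, marker="o")
ax_line.set_title("Line chart")
ax_line.set_ylabel("Revenue")

ax_bar.bar(months, revenue)
ax_bar.set_title("Bar plot")

fig.tight_layout()

# Save the figure as an image (hypothetical file name, in the temp folder).
output_path = os.path.join(tempfile.gettempdir(), "revenue_charts.png")
fig.savefig(output_path)
print("Saved chart to", output_path)
```

In a Jupyter Notebook you can skip the backend line and the save step, as the chart is displayed right below the cell.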

Wrapping it up

This blog is intended to guide beginners on the steps they will need to take to learn Python in their data science career. Each phase requires time and attention to master. As I could not go into extensive detail on each, I have created a short list that can guide you further:

Nisha Arya is a Data Scientist, Freelance Technical Writer and Community Manager at KDnuggets. She is particularly interested in providing Data Science career advice or tutorials and theory based knowledge around Data Science. She also wishes to explore the different ways Artificial Intelligence is/can benefit the longevity of human life. A keen learner, seeking to broaden her tech knowledge and writing skills, whilst helping guide others.