10 Python Skills for Beginners

Python is among the fastest-growing and most-loved programming languages. Get started with these data science tips.



By Nicole Janeway Bills, Data Scientist at Atlas Research


 

With Python’s straightforward, human-readable syntax, anyone can access impressive capabilities for scientific computing. Python has become the standard language for data science and machine learning, and it was rated in the top three most loved languages in Stack Overflow’s 2020 Developer Survey.

If you’re a newcomer to this much-loved programming language, here are ten tips to help your Python skillset flourish. You can follow along in this Google Colab notebook (plus, a quick video introduction to Google Colab).

 

#10 — List comprehensions

 
A simple, single-line syntax for working with lists, a list comprehension allows you to access and perform an action on individual elements of a list.

The syntax consists of brackets containing an expression like print(plant), followed by a for clause and, optionally, further for or if clauses.
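A minimal sketch of such a comprehension, assuming a greenhouse list in which the six orchid names match the output below and the five remaining plant names are illustrative placeholders:

```python
# Assumed greenhouse list: the orchid names match the printed output;
# the five non-orchid names are placeholders for illustration.
greenhouse = [
    'boat orchid',
    'maidenhair fern',
    'dancing-lady orchid',
    "nun's hood orchid",
    'snake plant',
    'pothos',
    'pennywort',
    'chinese ground orchid',
    'vanilla orchid',
    'tiger orchid',
    'jade plant',
]

# Brackets, an expression, a for clause, and an optional if filter.
[print(plant) for plant in greenhouse if 'orchid' in plant];
```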

will print:

boat orchid
dancing-lady orchid
nun's hood orchid
chinese ground orchid
vanilla orchid
tiger orchid


(Note: the semicolon at the end of the list comprehension will suppress printing the output of the last line of the Jupyter Notebook cell. This way, Jupyter Notebook won’t also print a list of None.)

 

#9 — Single-line if statement

 
Along with the previous tip, the single-line if can help you make your code more concise.

Let’s say we’ve decided we’re interested in identifying whether or not a plant is an orchid. With a single-line if, we start with the value we want to output if the test condition is true.

This code combines a single-line if with a list comprehension to output a 1 where the plant is an orchid and a 0 otherwise.

[1 if 'orchid' in plant else 0 for plant in greenhouse]


will output:

[1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0]
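Here is a self-contained sketch of this tip, assuming a greenhouse list in which only the orchid names come from the article and the other plant names are placeholders. It first uses the conditional expression on a single value, then inside a list comprehension:

```python
# Assumed greenhouse list; the non-orchid names are placeholders.
greenhouse = [
    'boat orchid', 'maidenhair fern', 'dancing-lady orchid',
    "nun's hood orchid", 'snake plant', 'pothos', 'pennywort',
    'chinese ground orchid', 'vanilla orchid', 'tiger orchid',
    'jade plant',
]

# The value for the "true" case comes first, then the test, then the fallback.
first_plant = greenhouse[0]
label = 'orchid' if 'orchid' in first_plant else 'other'
print(label)  # orchid

# The same expression applied to every element of the list.
is_orchid = [1 if 'orchid' in plant else 0 for plant in greenhouse]
print(is_orchid)  # [1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0]
```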


This list is not so interesting on its own, but when combined with the next tip, we’ll see a practical use of the single-line if.

 

#8 — Apply a lambda to a DataFrame column

 
A pandas DataFrame is a structure that can hold tabular data, like Excel for Python. lambda is a Python keyword for defining a small, unnamed function, which gives us a shortcut for performing operations on values in the table.

Let’s say we have a table of information about our greenhouse plants:
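A sketch of how such a table could be built. Only the 'music' column name and the 'bach' and 'full sun' values appear in the article; the 'light' column name, the choice of plants, and all other values are assumptions for illustration:

```python
import pandas as pd

# Hypothetical greenhouse table; 'light' and most cell values are assumed.
data = pd.DataFrame({
    'plant': ['boat orchid', 'maidenhair fern', 'vanilla orchid', 'snake plant'],
    'light': ['full sun', 'shade', 'full sun', 'partial sun'],
    'music': ['bach', 'mozart', 'bach', 'bach'],
})

print(data)
```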

Printing this DataFrame shows one row per plant and one column per attribute.

Let’s say we wanted to know whether a plant prefers a certain German classical composer.

data['music'].apply(lambda x: 1 if x == 'bach' else 0)


will output a Series: the left-hand values are the DataFrame index, and the right-hand values are the result of the single-line if (1 where the plant prefers Bach, 0 otherwise).

lambda represents an “anonymous function.” It allows us to perform operations on the values in a DataFrame without creating a formal function — i.e., a function with def and return statements, which we’ll see in a moment.
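To make the comparison concrete, the sketch below applies the same test twice, once with a lambda and once with an equivalent named function. The Series values and the function name is_bach are assumptions for illustration:

```python
import pandas as pd

# Hypothetical music preferences.
music = pd.Series(['bach', 'mozart', 'bach'])

# Anonymous function, defined inline.
with_lambda = music.apply(lambda x: 1 if x == 'bach' else 0)

# The equivalent formal function, with def and return.
def is_bach(x):
    return 1 if x == 'bach' else 0

with_def = music.apply(is_bach)

print(with_lambda.tolist())          # [1, 0, 1]
print(with_lambda.equals(with_def))  # True
```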

 

#7— Apply a condition to multiple columns

 
Let’s say we want to identify which Bach-loving plants also need full sun, so we can arrange them together in the greenhouse.

First, we create a function by using the def keyword and giving it a name with underscores between words (e.g. sunny_shelf). Appropriately, this naming convention is called snake case.

The function sunny_shelf takes in two parameters as its inputs — the column to check for “full sun” and the column to check for “bach.” The function outputs whether both these conditions are true.

Next, we .apply() this function to the DataFrame and specify which columns should be passed in as parameters. axis=1 tells pandas to apply the function to each row, so that it can read values across columns (versus axis=0, the default, which applies the function to each column). We assign the output of the .apply() function to a new DataFrame column called ‘new_shelf.’
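A minimal sketch of these steps, assuming a small hypothetical table whose 'light' column holds the sun requirement. The article names only the 'music' column, the function sunny_shelf, and the 'new_shelf' output column; everything else here is assumed:

```python
import pandas as pd

# Hypothetical data; the 'light' column name and the values are assumed.
data = pd.DataFrame({
    'plant': ['boat orchid', 'maidenhair fern', 'vanilla orchid', 'snake plant'],
    'light': ['full sun', 'shade', 'full sun', 'partial sun'],
    'music': ['bach', 'mozart', 'bach', 'bach'],
})

def sunny_shelf(light, music):
    """Return True when a plant needs full sun and prefers Bach."""
    return light == 'full sun' and music == 'bach'

# axis=1 passes each row to the lambda, which picks out the two columns.
data['new_shelf'] = data.apply(
    lambda row: sunny_shelf(row['light'], row['music']), axis=1
)

print(data['new_shelf'].tolist())  # [True, False, True, False]
```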

Alternatively, we could use the np.where() function for the same purpose:
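A sketch of that alternative on a small hypothetical table. The 'light' column name and the values are assumptions, and returning True or False (rather than, say, 1 or 0) is a choice made for this example:

```python
import numpy as np
import pandas as pd

# Hypothetical data; the 'light' column name and the values are assumed.
data = pd.DataFrame({
    'plant': ['boat orchid', 'maidenhair fern', 'vanilla orchid', 'snake plant'],
    'light': ['full sun', 'shade', 'full sun', 'partial sun'],
    'music': ['bach', 'mozart', 'bach', 'bach'],
})

# np.where(condition, value_if_true, value_if_false) works on whole columns.
# Each comparison needs its own parentheses when combined with &.
data['new_shelf'] = np.where(
    (data['light'] == 'full sun') & (data['music'] == 'bach'),
    True,
    False,
)

print(data['new_shelf'].tolist())  # [True, False, True, False]
```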

This function from the numpy library checks the two conditions specified above (i.e., that the plant is a lover of full sun and Germanic classical music) and assigns the output to the ‘new_shelf’ column.

For these tips on .apply(), np.where(), and other incredibly useful code snippets, check out Chris Albon’s blog.

 

#6— Break up a long line of code

 
By the way, you can break up any statement inside parentheses, brackets, or braces across multiple lines to keep a single line from running too long. We’ve seen this when we initialized our greenhouse list, created our dataframe of plants, and used the np.where() function.

According to PEP 8, the Python style guide:

The preferred way of wrapping long lines is by using Python’s implied line continuation inside parentheses, brackets, and braces.
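As a brief illustration of implied line continuation, the sketch below wraps a list, a function call, and a long condition inside brackets or parentheses. The names and values are made up for the example:

```python
# A list literal wrapped inside brackets.
orchids = [
    'boat orchid',
    'vanilla orchid',
    'tiger orchid',
]

# A function call wrapped inside parentheses.
summary = ', '.join(
    sorted(orchids)
)

# A long expression wrapped in parentheses instead of using backslashes.
has_long_name = (
    len(orchids) > 2
    and any(len(name) > 12 for name in orchids)
)

print(summary)        # boat orchid, tiger orchid, vanilla orchid
print(has_long_name)  # True
```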

 

#5 — Read in a .csv and set index

 
Now let’s expand our greenhouse, so we have more real-world data to work with. We’ll do this by importing a .csv containing plant data. Follow along by accessing the dataset here.

Let’s say the table contains a unique plant identifier that we want to use as an index in our DataFrame. We can set this up with the index_col parameter.

data = pd.read_csv('greenhouse.csv', index_col='plant_id')
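To try index_col without downloading anything, the sketch below reads a tiny in-memory CSV in place of greenhouse.csv. The plant_id, plant, and price column names follow the article; the rows themselves are invented:

```python
import io

import pandas as pd

# Invented rows standing in for the real greenhouse.csv file.
csv_text = """plant_id,plant,price
101,boat orchid,25.50
102,snake plant,12.00
103,vanilla orchid,40.25
"""

# index_col turns the named column into the row index instead of a data column.
data = pd.read_csv(io.StringIO(csv_text), index_col='plant_id')

print(data.index.name)         # plant_id
print(data.loc[102, 'plant'])  # snake plant
```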



For the fundamentals of exploratory data analysis (EDA) and 9 other useful Python tips, check out the article 10 Underrated Python Skills.
 

 

#4— Format as money

 
How much are we spending on these plants anyway? Let’s format the output of this calculation as money.

'${:,.2f}'.format(data['price'].sum())


will output:

'$15,883.66'


The comma separator makes it easy to see how much cash we’ve dropped so far.
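To unpack the format specification: the comma inserts a thousands separator, .2 keeps two decimal places, and f selects fixed-point notation. A small sketch, using the total from above as a stand-in value and showing the equivalent f-string:

```python
# Stand-in for data['price'].sum(); the value is taken from the output above.
total = 15883.66

# str.format with a format specification after the colon.
formatted = '${:,.2f}'.format(total)
print(formatted)  # $15,883.66

# The same specification inside an f-string.
also_formatted = f'${total:,.2f}'
print(also_formatted == formatted)  # True
```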

 

#3 — Create a pivot table

 
Next, let’s say we want to see the amount spent per plant species. We can pivot with aggregation using either pd.pivot_table() or .groupby().

pd.pivot_table(data, index='plant', values='price', aggfunc='sum')


or

data[['plant','price']].groupby(by='plant').sum()


Either method will output a DataFrame indexed by plant, with the total price for each species.

We can also specify a multilevel pivot table using either method.
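A sketch of both approaches, assuming a small hypothetical table with plant, orchid, and price columns. The names piv0 and piv match the check that follows; which name belongs to which method is a guess:

```python
import pandas as pd

# Hypothetical purchases; the values are invented for illustration.
data = pd.DataFrame({
    'plant': ['boat orchid', 'boat orchid', 'snake plant', 'vanilla orchid'],
    'orchid': [1, 1, 0, 1],
    'price': [25.50, 30.00, 12.00, 40.25],
})

# Passing a list of columns as the index produces a multilevel (MultiIndex) result.
piv0 = pd.pivot_table(
    data, index=['orchid', 'plant'], values='price', aggfunc='sum'
)

# The same aggregation expressed with groupby.
piv = data[['orchid', 'plant', 'price']].groupby(by=['orchid', 'plant']).sum()

print(piv)
```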

Checking piv.equals(piv0) yields True, which confirms that the two methods produce identical DataFrames.

 

#2— Calculate percent of total

 
Curious about how each plant species contributed toward the total cost of the greenhouse? Divide each value by the sum across all the rows, and assign that output to a new column called ‘perc’:

piv['perc'] = piv['price'].div(piv['price'].sum(axis=0))
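The same calculation on a tiny hypothetical pivot table, with prices chosen so the shares are easy to check by eye:

```python
import pandas as pd

# Hypothetical totals per plant; the values are invented for illustration.
piv = pd.DataFrame(
    {'price': [10.0, 30.0, 60.0]},
    index=pd.Index(['pothos', 'snake plant', 'vanilla orchid'], name='plant'),
)

# Divide each price by the column total to get its share of the whole.
piv['perc'] = piv['price'].div(piv['price'].sum(axis=0))

print(piv['perc'].tolist())  # [0.1, 0.3, 0.6]
```

Multiplying the new column by 100 would give percentages rather than fractions.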



 

#1 — Sort by multiple columns

 
Finally, let’s sort our DataFrame so that orchids are at the top and the plants are organized by price descending.

piv.sort_values(['orchid','price'], ascending=False)
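A self-contained sketch of this sort, assuming a small hypothetical table in which orchid is a 1/0 flag. Note that sort_values returns a new DataFrame rather than changing piv in place:

```python
import pandas as pd

# Hypothetical table; the values are invented for illustration.
piv = pd.DataFrame({
    'plant': ['pothos', 'boat orchid', 'snake plant', 'vanilla orchid'],
    'orchid': [0, 1, 0, 1],
    'price': [8.0, 55.5, 12.0, 40.25],
})

# Sort by orchid first, then by price within each group, both descending.
ranked = piv.sort_values(['orchid', 'price'], ascending=False)

print(ranked['plant'].tolist())
# ['boat orchid', 'vanilla orchid', 'snake plant', 'pothos']
```

Passing a list such as ascending=[False, True] would set the direction for each column separately.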



 

Summary

 
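In this article, we covered 10 Python skills that might be useful for beginner Data Scientists. These tips include list comprehensions, single-line if statements, applying a lambda to a DataFrame column, applying a condition to multiple columns, breaking up a long line of code, reading in a .csv and setting the index, formatting as money, creating a pivot table, calculating percent of total, and sorting by multiple columns.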

Access the Colab notebook here, and access the greenhouse dataset here.

I hope this write-up helps you grow your skillset as a new Data Scientist. Thanks for letting me share two of my favorite things, Python and gardening, in one article.

If you enjoyed this story, check out 10 Undervalued Python Skills and 10 Python Skills They Don’t Teach in Bootcamp. Follow me on Medium, LinkedIn, YouTube, and Twitter for more ideas to advance your data science skills.

 

More Great Resources for Data Scientists

 
The Best Data Science Certification You’ve Never Heard Of
A practical guide to the most valuable training in data strategy.
 

5 Must-Read Data Science Papers (and How to Use Them)
Foundational ideas to keep you on top of the data science game.
 

What’s the Difference Between a Data Analyst, Data Scientist, and a Machine Learning Engineer?
Explore the distinction between these common job titles with the analogy of a track meet.
 

How to Future-Proof Your Data Science Project
5 critical elements of ML model selection & deployment
 

Is Your Machine Learning Model Likely to Fail?
5 missteps to avoid in your planning process
 

Bio: Nicole Janeway Bills is a machine learning engineer with experience in commercial and federal consulting. Proficient in Python, SQL, and Tableau, Nicole has business experience in natural language processing (NLP), cloud computing, statistical testing, pricing analysis, and ETL processes, and aims to use this background to connect data with business outcomes and continue to develop technical skillsets.

Original. Reposted with permission.
