
KDnuggets Home » News » 2020 » Dec » Tutorials, Overviews » 10 Python Skills They Don’t Teach in Bootcamp ( 20:n47 )

10 Python Skills They Don’t Teach in Bootcamp


Ascend to new heights in Data Science and Machine Learning with this thrilling list of coding tips.



Figure: Photo by Sandy Torchon on Pexels

 

Data science bootcamps are a ton of fun, but they don’t have time to teach you everything.

The bootcamp experience is like showing up at a theme park. (Except some of the strangers there will become your best friends.) When the ride kicks in, it demands total concentration. Between bouts of intensity, you’ll have the chance to take a breath — trading stories, recommendations, and ideas.

Recapture the thrill of learning new things with this collection of 10 Python skills they don’t teach you in bootcamp.

 

#10 — Set DataFrame display options

 
It’s straightforward to change the way pandas DataFrames are displayed in a Jupyter Notebook. I typically include this code in the same cell as my import statements:
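A minimal sketch of such a cell, assuming these particular option values (they are illustrative choices, and setting max_colwidth to None assumes pandas 1.0 or later):

```python
import pandas as pd

# Show the full contents of each cell instead of truncating long text.
pd.set_option("display.max_colwidth", None)

# Show every column; wide frames can then be scrolled left and right.
pd.set_option("display.max_columns", None)

# Cap the number of rows printed so long frames stay manageable (assumed value).
pd.set_option("display.max_rows", 100)
```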

With these settings, I can fully read cells that may contain a lot of text. I won’t have to worry about overly long dataframes, but I can scroll to the right and left to my heart’s content.

Play around with these options and find what works for you. For more, see the options and settings section of the pandas documentation.

 

#9 — Change how pandas displays numbers

 
If you want to alter how numbers are shown within DataFrames, use these handy options to round floating-point values.

pd.set_option('display.precision', 2) # Round to two decimal places


This second option also adds comma separators between every three digits of larger numbers:

pd.options.display.float_format = '{:,.2f}'.format
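To see the effect without changing global settings, the sketch below applies the same format string inside pd.option_context. The column name and values are made up for illustration.

```python
import pandas as pd

# Hypothetical revenue figures, used only to show the formatting.
sales = pd.DataFrame({"ticket_revenue": [1234567.891, 98.5]})

# option_context applies the float format only inside the with block.
with pd.option_context("display.float_format", "{:,.2f}".format):
    formatted = sales.to_string()

print(formatted)
```

Outside the with block, pandas goes back to whatever float format was set before.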


 

#8 — Import Excel workbook and append sheet name

 
If you’re reading in a workbook with multiple sheets, you can pull them all into one dataframe using:

df = pd.concat(pd.read_excel('Ticket_Sales_Total.xlsx', sheet_name=None), ignore_index=True)


This trick works when your data uses all the same headers and there’s no additional info to be gained from the sheet name.

Alternatively, if you want to read in the sheets and retain some info from the sheet name, you can use a helper function, here called read_excel_sheets().

The function creates a new column ('sheet') containing the last word of the sheet name as its value. If the sheets in Ticket_Sales_Total.xlsx are named Ticket Sales 2017, Ticket Sales 2018, and Ticket Sales 2019, then the read_excel_sheets() function will tag each row with the relevant year from the sheet name.
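A minimal sketch of what read_excel_sheets() could look like. This is not the original function: the helper combine_sheets() is a hypothetical name, split out here so the tagging logic can be tried on in-memory DataFrames without an Excel file.

```python
import pandas as pd


def combine_sheets(sheets):
    """Stack a {sheet_name: DataFrame} mapping into one DataFrame.

    Each row gets a 'sheet' column holding the last word of its sheet name.
    """
    frames = []
    for name, frame in sheets.items():
        tagged = frame.copy()
        tagged["sheet"] = name.split()[-1]  # e.g. "Ticket Sales 2017" -> "2017"
        frames.append(tagged)
    return pd.concat(frames, ignore_index=True)


def read_excel_sheets(path):
    """Read every sheet in the workbook at path and combine them."""
    # sheet_name=None makes pandas return a dict of {sheet_name: DataFrame}.
    sheets = pd.read_excel(path, sheet_name=None)
    return combine_sheets(sheets)
```

With the workbook described above, read_excel_sheets('Ticket_Sales_Total.xlsx') would return one DataFrame whose 'sheet' column holds 2017, 2018, or 2019 as strings.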

Thanks Colin Copland for this tip!

 

#7 — Check a random selection of pandas rows

 
Rather than looking at just the .head() or .tail() of a dataframe, you can see a selection of random rows with:

df.sample(n)


This is useful because in a sorted dataframe, anomalous records may fall into the head or tail, contributing to a distorted perspective while conducting EDA.
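As a small sketch, assuming a made-up, sorted DataFrame of ride wait times: passing random_state makes the random selection reproducible, which helps when you want to share what you saw with a colleague.

```python
import pandas as pd

# Hypothetical data, sorted so that head() and tail() would show only extremes.
df = pd.DataFrame(
    {
        "ride": [f"ride_{i}" for i in range(100)],
        "wait_minutes": range(100),
    }
).sort_values("wait_minutes")

# Draw five random rows; the fixed seed returns the same rows on every run.
sample = df.sample(n=5, random_state=42)
print(sample)
```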

For a sample project with pandas, check out:

Named Entity Recognition for Clinical Text
Use pandas to reformat the 2011 i2b2 dataset into CoNLL format for natural language processing (NLP).

 

#6 — Leverage Predictive Power Score instead of correlation

 
The Predictive Power Score was developed by Florian Wetschoreck and the team at 8080 Labs in order to improve upon correlation metrics.

Correlation is limited because it will miss non-linear relationships (for example, a quadratic relationship charting daily temperature and theme park ticket sales, a step function that represents the ticket price of an amusement against the number of people waiting in line, or the Gaussian function used at the “Guess Your Weight” carnival game). Any relationships related to categorical variables will also be missed by a correlation matrix.

Moreover, correlation lacks the capability to provide information about asymmetry of a relationship. For example, knowing a customer’s favorite part of the park might not predict their favorite ride, but knowing their favorite ride would have much stronger predictive power for evaluating their favorite part of the park.

By contrast, the Predictive Power Score can detect non-linear effects, automatically encodes categorical variables, and quantifies asymmetry. It computes predictive relationships between pairs of columns and provides a score ranging from 0 to 1.

To use, simply import ppscore as pps and call pps.matrix(df).
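To make the limitation of correlation concrete, here is a minimal sketch using only numpy and pandas. The temperature and ticket sales values are synthetic, chosen so that sales depend entirely on temperature through a quadratic curve.

```python
import numpy as np
import pandas as pd

# Synthetic, centered temperatures; sales peak at the midpoint and fall off
# symmetrically on either side (a perfect quadratic relationship).
temperature = np.linspace(-1.0, 1.0, 201)
df = pd.DataFrame(
    {
        "temperature": temperature,
        "ticket_sales": 1.0 - temperature ** 2,
    }
)

# Pearson correlation is essentially zero, although sales are fully
# determined by temperature.
pearson = df["temperature"].corr(df["ticket_sales"])
print(f"Pearson correlation: {pearson:.3f}")

# With ppscore installed, the call named in the article would be:
#   import ppscore as pps
#   pps.matrix(df)
```

A relationship like this is the kind that a correlation matrix hides and that the Predictive Power Score is designed to surface.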


 

#5 — Create a package

 
Modules help to compartmentalize reusable code, such as Python functions, variables, and classes. Getting organized in this way can make code easier to understand and use.

To me, this is the biggest productivity booster for data scientists. It enables you to work faster and make less mistakes. Plus, by writing packages, you also improve your coding skills. — Adam Votava

A package will contain one or more relevant modules. We can create a package named mythemepark, using the following steps:

Step 1 — Create a new folder named MyThemePark.
Step 2 — Inside MyThemePark, create a subfolder with the name mythemepark.
Step 3 — Using a code editor such as Atom, create the modules greet_visitors.py (which will provide code for welcoming visitors as they enter the park), functions.py (which provides code to operate various rides and games), and classes.py (which will provide the templates from which we can instantiate new objects such as amusements, shops, promotions, etc.)
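The three steps can also be scripted. The sketch below makes two assumptions that the steps do not state: the modules live inside the mythemepark subfolder, and an empty __init__.py is added so that Python treats the folder as a regular package.

```python
from pathlib import Path


def scaffold_package(root):
    """Create the MyThemePark/mythemepark layout under root; return the package path."""
    project = Path(root) / "MyThemePark"   # Step 1
    package = project / "mythemepark"      # Step 2
    package.mkdir(parents=True, exist_ok=True)

    # Step 3: create empty module files, to be filled in with an editor.
    for module in ("greet_visitors.py", "functions.py", "classes.py"):
        (package / module).touch()

    # Assumption: mark the folder as a regular package.
    (package / "__init__.py").touch()
    return package
```

Calling scaffold_package(".") creates the layout in the current working directory.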

Notes:

 

#4 — Check size of packages

 
After pip installing all the dependencies for the libraries required to run your theme park, it’s possible your SSD may be a bit cluttered. Checking the size of installed packages will help you understand which packages are taking up the most space. From here, you can make choices about which packages “spark joy,” and proceed to KonMari appropriately.

To find the path to installed packages on your Linux or macOS machine, type:

pip3 show "some_package" | grep "Location:"


This will return the path to all installed packages, something like: /Users/yourname/opt/anaconda3/lib/python3.7/site-packages

Insert that file path into the command below:

du -h path/to/all/packages


where du reports file system disk space usage.

This command will output the size of every directory under that path, including each package. The final line of output will contain the total size of all packages.
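If you would rather stay in Python, a rough equivalent can be sketched with pathlib. The function names here are hypothetical, and the numbers will differ somewhat from du: this sums file sizes, while du reports disk blocks used.

```python
from pathlib import Path


def directory_size(path):
    """Total size in bytes of all regular files under path."""
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file())


def package_sizes(site_packages):
    """Map each top-level folder in site_packages to its size, largest first."""
    sizes = {
        entry.name: directory_size(entry)
        for entry in Path(site_packages).iterdir()
        if entry.is_dir()
    }
    return dict(sorted(sizes.items(), key=lambda item: item[1], reverse=True))
```

Pass in the site-packages path found with pip3 show, for example package_sizes("path/to/all/packages"), and look at the first few entries to find the heaviest packages.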


 

#3 — Check memory usage

 
As with optimizing your workspace, it may also be useful to examine the memory usage of code components. You can do this using Python’s sys.getsizeof function:
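A minimal sketch, using two made-up lists. Note that sys.getsizeof reports the shallow size of an object: for a list, that is the list structure itself and not the objects it refers to.

```python
import sys

# Hypothetical objects to measure.
ride_names = ["Coaster", "Carousel", "Log Flume"]
wait_times = list(range(10_000))

for label, obj in [("ride_names", ride_names), ("wait_times", wait_times)]:
    # getsizeof returns the object's size in bytes as an int.
    print(f"{label}: {sys.getsizeof(obj):,} bytes")
```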

 

#2 — Advance your command line tools

 
Click is a Python package for building intuitive command line interfaces. Click supports option dialogues, user prompts, requests for confirmation, values from environment variables, and more.

Here’s an example script that could be used to request a password from a ride operator:
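A minimal sketch of such a script, built on Click's password-style option. The rot13 encoding is only a stand-in for real encryption, and the sketch assumes the encrypt function is exposed as a console command of the same name (or called under a main guard).

```python
import codecs

import click


@click.command()
@click.option(
    "--password",
    prompt=True,               # ask for the value if it is not passed as an option
    hide_input=True,           # do not echo what the operator types
    confirmation_prompt=True,  # ask a second time to confirm
)
def encrypt(password):
    """Prompt for a password and print a placeholder 'encrypted' form."""
    # rot13 is NOT encryption; it is used here only as a visible transformation.
    click.echo(f"Encrypted password: {codecs.encode(password, 'rot13')}")
```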

When run, the script will output:

$ encrypt
Password: 
Repeat for confirmation:


 

#1 — Check everything for PEP8 compliance

 
The nblint package allows you to run the pycodestyle engine within Jupyter Notebook to lint your code, that is, to check it against the PEP 8 style guide.

Linting highlights any syntactical or stylistic problems in your Python code, making it less error prone and more readable to your colleagues. Linting tools were first introduced by frustrated debuggers in 1978, and the practice does indeed get its name from the act of removing small bits of stray fabric from your clothes coming out of the dryer.

 

Bonus: Clean the conda cache

 
First, a quick note about the difference between pip and conda. pip is the Python Packaging Authority’s recommended tool for installing packages from the Python Package Index (PyPI). conda is a cross-platform package and environment manager from Anaconda.

Generally speaking, it is a bad idea to mix pip and conda package managers. This is because the two managers don’t speak to each other — this can create package conflicts. Consider exclusively using pip within virtual environments unless you are ready to commit to conda.

We’ve already covered how to check the size of packages you’ve pip installed. If you’ve been using the conda package manager, you can free up space by removing unused packages and caches using this code:

conda clean --all


 

Summary

 
One more time, here are the ten tips we’ve covered in this article:

10. Set DataFrame display options
9. Change how pandas displays numbers
8. Import Excel workbook and append sheet name
7. Check a random selection of pandas rows
6. Leverage Predictive Power Score instead of correlation
5. Create a package
4. Check size of packages
3. Check memory usage
2. Advance your command line tools
1. Check everything for PEP8 compliance
Bonus: Clean the conda cache

If you enjoyed this article, follow me on Medium, LinkedIn, YouTube, and Twitter for more ideas to advance your data science skills.

 

Continue developing your skills

 
10 Python Skills for Beginners
Python is the fastest growing, most-beloved programming language. Get started with these Data Science tips.
 

10 Underrated Python Skills
Up your Data Science game with these tips for improving your Python coding for better EDA, target analysis, feature…
 

How to Ace the AWS Cloud Practitioner Certification with Minimal Effort
Forecast: cloudy with a 100% chance of passing on your first try.
 

How to Future-Proof Your Data Science Project
5 critical elements of ML model selection & deployment
 

Walkthrough: Mapping GIS Data in Python
Improve your understanding of geospatial information through GeoPandas DataFrames and Google Colab
 

 
Bio: Nicole Janeway Bills is a machine learning engineer with experience in commercial and federal consulting. Proficient in Python, SQL, and Tableau, Nicole has business experience in natural language processing (NLP), cloud computing, statistical testing, pricing analysis, and ETL processes, and aims to use this background to connect data with business outcomes and continue to develop technical skillsets.

Original. Reposted with permission.
