Jupyter Notebook for Beginners: A Tutorial
The Jupyter Notebook is an incredibly powerful tool for interactively developing and presenting data science projects. Although it is possible to use many different programming languages within Jupyter Notebooks, this article will focus on Python as it is the most common use case.
By Benjamin Pryke, Dataquest
The Jupyter Notebook is an incredibly powerful tool for interactively developing and presenting data science projects. A notebook integrates code and its output into a single document that combines visualisations, narrative text, mathematical equations, and other rich media. The intuitive workflow promotes iterative and rapid development, making notebooks an increasingly popular choice at the heart of contemporary data science, analysis, and increasingly science at large. Best of all, as part of the open source Project Jupyter, they are completely free.
The Jupyter project is the successor to the earlier IPython Notebook, which was first published as a prototype in 2010. Although it is possible to use many different programming languages within Jupyter Notebooks, this article will focus on Python as it is the most common use case.
To get the most out of this tutorial you should be familiar with programming, specifically Python and pandas specifically. That said, if you have experience with another language, the Python in this article shouldn't be too cryptic and pandas should be interpretable. Jupyter Notebooks can also act as a flexible platform for getting to grips with pandas and even Python, as it will become apparent in this article.
- Cover the basics of installing Jupyter and creating your first notebook
- Delve deeper and learn all the important terminology
- Explore how easily notebooks can be shared and published online. Indeed, this article is a Jupyter Notebook! Everything here was written in the Jupyter Notebook environment and you are viewing it in a read-only form.
Example data analysis in a Jupyter Notebook
We will walk through a sample analysis, to answer a real-life question, so you can see how the flow of a notebook makes the task intuitive to work through ourselves, as well as for others to understand when we share it with them.
So, let's say you're a data analyst and you've been tasked with finding out how the profits of the largest companies in the US changed historically. You find a data set of Fortune 500 companies spanning over 50 years since the list's first publication in 1955, put together from Fortune's public archive. We've gone ahead and created a CSV of the data you can use here.
As we shall demonstrate, Jupyter Notebooks are perfectly suited for this investigation. First, let's go ahead and install Jupyter.
The easiest way for a beginner to get started with Jupyter Notebooks is by installing Anaconda. Anaconda is the most widely used Python distribution for data science and comes pre-loaded with all the most popular libraries and tools. As well as Jupyter, some of the biggest Python libraries wrapped up in Anaconda include NumPy, pandas and Matplotlib, though the full 1000+ list is exhaustive. This lets you hit the ground running in your own fully stocked data science workshop without the hassle of managing countless installations or worrying about dependencies and OS-specific (read: Windows-specific) installation issues.
To get Anaconda, simply:
- Download the latest version of Anaconda for Python 3 (ignore Python 2.7).
- Install Anaconda by following the instructions on the download page and/or in the executable.
If you are a more advanced user with Python already installed and prefer to manage your packages manually, you can just use pip:
Creating Your First Notebook
In this section, we're going to see how to run and save notebooks, familiarise ourselves with their structure, and understand the interface. We'll become intimate with some core terminology that will steer you towards a practical understanding of how to use Jupyter Notebooks by yourself and set us up for the next section, which steps through an example data analysis and brings everything we learn here to life.
On Windows, you can run Jupyter via the shortcut Anaconda adds to your start menu, which will open a new tab in your default web browser that should look something like the following screenshot.
This isn't a notebook just yet, but don't panic! There's not much to it. This is the Notebook Dashboard, specifically designed for managing your Jupyter Notebooks. Think of it as the launchpad for exploring, editing and creating your notebooks.
Be aware that the dashboard will give you access only to the files and sub-folders contained within Jupyter's start-up directory; however, the start-up directory can be changed. It is also possible to start the dashboard on any system via the command prompt (or terminal on Unix systems) by entering the command
jupyter notebook; in this case, the current working directory will be the start-up directory.
The astute reader may have noticed that the URL for the dashboard is something like
http://localhost:8888/tree. Localhost is not a website, but indicates that the content is being served from your local machine: your own computer. Jupyter's Notebooks and dashboard are web apps, and Jupyter starts up a local Python server to serve these apps to your web browser, making it essentially platform independent and opening the door to easier sharing on the web.
The dashboard's interface is mostly self-explanatory — though we will come back to it briefly later. So what are we waiting for? Browse to the folder in which you would like to create your first notebook, click the "New" drop-down button in the top-right and select "Python 3" (or the version of your choice).
Hey presto, here we are! Your first Jupyter Notebook will open in new tab — each notebook uses its own tab because you can open multiple notebooks simultaneously. If you switch back to the dashboard, you will see the new file
Untitled.ipynb and you should see some green text that tells you your notebook is running.
What is an ipynb File?
It will be useful to understand what this file really is. Each
.ipynb file is a text file that describes the contents of your notebook in a format called JSON. Each cell and its contents, including image attachments that have been converted into strings of text, is listed therein along with some metadata. You can edit this yourself — if you know what you are doing! — by selecting "Edit > Edit Notebook Metadata" from the menu bar in the notebook.
You can also view the contents of your notebook files by selecting "Edit" from the controls on the dashboard, but the keyword here is "can"; there's no reason other than curiosity to do so unless you really know what you are doing.
The notebook interface
Now that you have an open notebook in front of you, its interface will hopefully not look entirely alien; after all, Jupyter is essentially just an advanced word processor. Why not take a look around? Check out the menus to get a feel for it, especially take a few moments to scroll down the list of commands in the command palette, which is the small button with the keyboard icon (or
Ctrl + Shift + P).
There are two fairly prominent terms that you should notice, which are probably new to you: cells and kernels are key both to understanding Jupyter and to what makes it more than just a word processor. Fortunately, these concepts are not difficult to understand.
- A kernel is a "computational engine" that executes the code contained in a notebook document.
- A cell is a container for text to be displayed in the notebook or code to be executed by the notebook's kernel.
We'll return to kernels a little later, but first let's come to grips with cells. Cells form the body of a notebook. In the screenshot of a new notebook in the section above, that box with the green outline is an empty cell. There are two main cell types that we will cover:
- A code cell contains code to be executed in the kernel and displays its output below.
- A Markdown cell contains text formatted using Markdown and displays its output in-place when it is run.
The first cell in a new notebook is always a code cell. Let's test it out with a classic hello world example. Type
print('Hello World!') into the cell and click the run button in the toolbar above or press
Ctrl + Enter. The result should look like this:
When you ran the cell, its output will have been displayed below and the label to its left will have changed from
In [ ] to
In . The output of a code cell also forms part of the document, which is why you can see it in this article. You can always tell the difference between code and Markdown cells because code cells have that label on the left and Markdown cells do not. The "In" part of the label is simply short for "Input," while the label number indicates when the cell was executed on the kernel — in this case the cell was executed first. Run the cell again and the label will change to
In  because now the cell was the second to be run on the kernel. It will become clearer why this is so useful later on when we take a closer look at kernels.
From the menu bar, click Insert and select Insert Cell Below to create a new code cell underneath your first and try out the following code to see what happens. Do you notice anything different?
This cell doesn't produce any output, but it does take three seconds to execute. Notice how Jupyter signifies that the cell is currently running by changing its label to
In general, the output of a cell comes from any text data specifically printed during the cells execution, as well as the value of the last line in the cell, be it a lone variable, a function call, or something else. For example:
You'll find yourself using this almost constantly in your own projects, and we'll see more of it later on.
One final thing you may have observed when running your cells is that their border turned blue, whereas it was green while you were editing. There is always one "active" cell highlighted with a border whose colour denotes its current mode, where green means "edit mode" and blue is "command mode."
So far we have seen how to run a cell with
Ctrl + Enter, but there are plenty more. Keyboard shortcuts are a very popular aspect of the Jupyter environment because they facilitate a speedy cell-based workflow. Many of these are actions you can carry out on the active cell when it's in command mode.
Below, you'll find a list of some of Jupyter's keyboard shortcuts. You're not expected to pick them up immediately, but the list should give you a good idea of what's possible.
- Toggle between edit and command mode with
- Once in command mode:
- Scroll up and down your cells with your
Bto insert a new cell above or below the active cell.
Mwill transform the active cell to a Markdown cell.
Ywill set the active cell to a code cell.
D + D(
Dtwice) will delete the active cell.
Zwill undo cell deletion.
Downto select multiple cells at once.
- With multple cells selected,
Shift + Mwill merge your selection.
- With multple cells selected,
- Scroll up and down your cells with your
Ctrl + Shift + -, in edit mode, will split the active cell at the cursor.
- You can also click and
Shift + Clickin the margin to the left of your cells to select them.
Go ahead and try these out in your own notebook. Once you've had a play, create a new Markdown cell and we'll learn how to format the text in our notebooks.
Markdown is a lightweight, easy to learn markup language for formatting plain text. Its syntax has a one-to-one correspondance with HTML tags, so some prior knowledge here would be helpful but is definitely not a prerequisite. Remember that this article was written in a Jupyter notebook, so all of the narrative text and images you have seen so far was achieved in Markdown. Let's cover the basics with a quick example.
When attaching images, you have three options:
- Use a URL to an image on the web.
- Use a local URL to an image that you will be keeping alongside your notebook, such as in the same git repo.
- Add an attachment via "Edit > Insert Image"; this will convert the image into a string and store it inside your notebook
- Note that this will make your
.ipynbfile much larger!
There is plenty more detail to Markdown, especially around hyperlinking, and it's also possible to simply include plain HTML. Once you find yourself pushing the limits of the basics above, you can refer to the official guide from the creator, John Gruber, on his website.