Data Science Dividends – A Gentle Introduction to Financial Data Analysis

This post outlines some very basic methods for performing financial data analysis using Python, Pandas, and Matplotlib, focusing mainly on stock price data. A good place for beginners to start.

Are you interested in analyzing financial -- specifically, stock -- data using Python, but have no idea where to begin? This post is a very elementary introduction to stock analysis, mainly by using Pandas and Matplotlib. While no real understanding of the stock market or financial systems is assumed, some understanding of Python would be helpful.

But be warned: financial analysis is abnormally complex, much more so than the elementary material in this post might suggest. It is also an incredibly broad topic, and means different things to different people. Kind of sounds a bit like data science, no?

I add the standard disclaimer for these types of posts that I am not a financial expert or advisor, and what is included herein does not constitute financial advice. This is more educational than anything, but may answer some of the most basic questions on the topic you may have.

I also want to point out that this post leans heavily on this great post by Curtis Miller, which, along with its follow-up, includes much more editorialization than mine does. Much of the approach taken in this post comes from Miller, either directly or via inspiration, as does the inspiration for much of the code. In that regard, thanks go out to Miller for writing a great set of posts on stock data analysis. After reading this overview, you may find that looking over his work, particularly the second post, will help extend what you learn and/or fill in some of the holes.

Credit also goes to the folks at Investopedia for the definitions and explanations scattered throughout, since they are much better at explaining financial concepts than I am.

Getting Stock Data

First things first, and the first thing with data analysis is usually getting our hands on the data. Luckily we don't have to pay for vast amounts of run-of-the-mill financial data these days. Even more luckily, we don't really even have to look for it. After spinning off from the Pandas project, pandas-datareader provides an easy-to-use API for gathering just the type of data we want, and gives quite a bit of flexibility in doing so.

A sample of just how easy it is to use the API to grab stock data from a source -- such as Yahoo! Finance -- for a particular date range and store it in a Pandas dataframe is shown below:

df = data.DataReader(stock, source, start, end)

The following is where we start with our project code, taking care of imports, setting some parameters, and reading in our first bit of data (incorporating the API call shown above). We will read in Google's (GOOG) historical stock data starting from the day of its initial public offering (IPO), August 19, 2004, which is the day it became a publicly traded company.
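As a rough illustration of that setup step, here is a minimal sketch. It assumes pandas-datareader is installed and that the chosen remote source still responds when you run it, which is not guaranteed since remote sources change over time. The helper name `load_stock` and the choice of today's date as the end of the range are assumptions for illustration; only the `DataReader` call and the IPO date come from the text above.

```python
import datetime


def load_stock(symbol, source, start, end):
    """Fetch daily price data into a Pandas dataframe (hypothetical helper)."""
    # Imported here so the parameters below can be defined even where
    # pandas-datareader is not installed; the call itself needs network access.
    from pandas_datareader import data

    return data.DataReader(symbol, source, start, end)


# Parameters for the download
STOCK = "GOOG"
SOURCE = "yahoo"  # assumption: this source is still available to your install
START = datetime.date(2004, 8, 19)  # GOOG's IPO date, as stated in the post
END = datetime.date.today()  # assumption: read everything up to today
```

Calling `load_stock(STOCK, SOURCE, START, END)` should then return the dataframe discussed in the rest of the post, provided the source is reachable.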

Pandas dataframe

So, what data does this dataframe hold? Most of it is quite straightforward, even for the uninitiated:

  • Open - opening stock price for the day
  • High - high stock price of the day, regardless of when during the day it occurred
  • Low - low stock price of the day, regardless of when during the day it occurred
  • Close - closing stock price of the day
  • Volume - volume (amount) of stock traded that day

Adjusted closing price needs a bit more explanation. What, exactly, is adjusted closing price? From Investopedia:

An adjusted closing price is a stock's closing price on any given day of trading that has been amended to include any distributions and corporate actions that occurred at any time prior to the next day's open. The adjusted closing price is often used when examining historical returns or performing a detailed analysis on historical returns.

Since it takes into account distributions and adjustments, a stock's adjusted closing price is a better indicator of historical stock performance than is simple closing price.
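To make the idea of an adjustment concrete, here is a small sketch covering only one kind of corporate action, a stock split. The prices, dates, and the 2-for-1 ratio are all hypothetical, and real adjusted closes also account for distributions such as dividends, which this example ignores.

```python
import pandas as pd

# Hypothetical raw closes around a 2-for-1 split that takes effect on the third day
raw_close = pd.Series(
    [100.0, 102.0, 52.0, 53.0],
    index=pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06", "2020-01-07"]),
)
split_date = pd.Timestamp("2020-01-06")
split_ratio = 2.0

# Divide every close before the split by the ratio so that prices on both
# sides of the split are directly comparable.
before_split = raw_close.index < split_date
adjusted = raw_close.copy()
adjusted.loc[before_split] = adjusted.loc[before_split] / split_ratio

print(pd.DataFrame({"raw": raw_close, "adjusted": adjusted}))
```

Without the adjustment, the raw series would appear to lose about half its value overnight even though shareholders lost nothing.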

Inspecting the Data

Now that we have it, let's investigate the data a bit further by noting some descriptive statistics:

Pandas dataframe
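A table like this can be produced with the dataframe's `describe()` method. The sketch below uses a small synthetic dataframe as a stand-in for the downloaded data, so its numbers are not GOOG's; the column names `High`, `Low`, and `Adj Close` are assumed to match what the data source returns.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the downloaded data (not real prices)
rng = np.random.default_rng(0)
dates = pd.bdate_range("2016-01-04", periods=250)
adj_close = pd.Series(100 + rng.normal(0, 1, len(dates)).cumsum(), index=dates)
demo_df = pd.DataFrame(
    {
        "High": adj_close + 1.0,
        "Low": adj_close - 1.0,
        "Adj Close": adj_close,
    }
)

# count, mean, std, min, quartiles, and max for every numeric column
summary = demo_df.describe()
print(summary)

# Individual figures can also be pulled out directly
print("Absolute high price:", demo_df["High"].max())
print("Absolute low price:", demo_df["Low"].min())
print("Mean adjusted close:", demo_df["Adj Close"].mean())
```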

Note the following:

  • Absolute high price - $1228.88
  • Absolute low price - $95.96
  • Absolute high adjusted close - $813.11
  • Absolute low adjusted close - $49.96
  • Mean adjusted close - $348.81

Since we found above that adjusted close is a good measure of historical stock performance, let's take a quick look at a histogram of it, which shows how often the adjusted close fell within a variety of value ranges:
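One way to draw such a histogram with Matplotlib is sketched here. The series is synthetic and merely stands in for the adjusted close column, and the choice of 20 bins is an assumption rather than the setting used for the original figure.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the adjusted close column (not real prices)
rng = np.random.default_rng(1)
adj_close = pd.Series(100 + rng.normal(0, 1, 500).cumsum(), name="Adj Close")

fig, ax = plt.subplots()
# counts[i] is the number of days whose value fell between
# bin_edges[i] and bin_edges[i + 1]
counts, bin_edges, _ = ax.hist(adj_close, bins=20)
ax.set_xlabel("Adjusted close")
ax.set_ylabel("Number of trading days")
ax.set_title("Distribution of adjusted close (synthetic data)")
```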


I won't provide much meta-analysis of the data in this post, instead focusing on the "hows" of the analysis. The interested reader can follow up elsewhere to satisfy their curiosity.

Visualizing Stock Performance

First, let's plot adjusted close prices for GOOG.

Adjusted closes GOOG
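A plot of this kind takes only a few lines. The following sketch again uses a synthetic, date-indexed series in place of the real adjusted closes, so the shape of the curve is illustrative only.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the adjusted close column (not real prices)
rng = np.random.default_rng(3)
dates = pd.bdate_range("2016-01-04", periods=250)
adj_close = pd.Series(700 + rng.normal(0, 5, len(dates)).cumsum(), index=dates)

fig, ax = plt.subplots(figsize=(10, 5))
adj_close.plot(ax=ax)  # a Pandas series plots against its date index
ax.set_title("Adjusted closes (synthetic stand-in)")
ax.set_xlabel("Date")
ax.set_ylabel("Adjusted close")
```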

We can spot a clear trend here, as well as a few prominent peaks and valleys.

What can be more interesting than this simple graphing of daily adjusted close prices is the rolling mean, which is a constantly updated moving average based on a sliding window of defined size. Why is a rolling mean useful in stock analysis? From Investopedia:

A moving average can help cut down the amount of "noise" on a price chart. Look at the direction of the moving average to get a basic idea of which way the price is moving. Angled up and price is moving up (or was recently) overall, angled down and price is moving down overall, moving sideways and the price is likely in a range.

It turns out that computing a rolling mean on a Pandas data series is very simple, as is visualizing the result. First we take the relevant data from a dataframe and load it into a data series:

goog_df is a <class 'pandas.core.frame.DataFrame'>
goog_ds is a <class 'pandas.core.series.Series'>
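Selecting a single column from a dataframe is what yields the series. In this sketch `goog_df` is a tiny hypothetical stand-in with made-up values, and the `Adj Close` column name is assumed to match the downloaded data.

```python
import pandas as pd

# Tiny hypothetical stand-in for the downloaded dataframe (made-up values)
goog_df = pd.DataFrame(
    {"Close": [101.0, 102.5, 103.0], "Adj Close": [50.5, 51.25, 51.5]},
    index=pd.to_datetime(["2016-01-04", "2016-01-05", "2016-01-06"]),
)

# Indexing with one column name returns a Series that shares the date index
goog_ds = goog_df["Adj Close"]

print("goog_df is a", type(goog_df))
print("goog_ds is a", type(goog_ds))
```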

Then we use the Pandas rolling mean function to compute the moving averages and visualize them. For this experiment we will create three rolling means, roughly corresponding to a weekly moving average, a monthly moving average, and a quarterly moving average, all for 2016's adjusted closing prices.
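In current Pandas versions a rolling mean is written as `Series.rolling(window=n).mean()`. The sketch below applies it to a synthetic series covering the business days of 2016. The short windows of 5 and 20 trading days are assumptions standing in for "weekly" and "monthly", while 120 matches the long window discussed next.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for 2016's adjusted closes (not real prices)
rng = np.random.default_rng(2)
dates = pd.bdate_range("2016-01-04", "2016-12-30")
goog_ds = pd.Series(700 + rng.normal(0, 5, len(dates)).cumsum(), index=dates)

# Window sizes in trading days; 5 and 20 are assumed, 120 is from the post
windows = [5, 20, 120]
rolling_means = {w: goog_ds.rolling(window=w).mean() for w in windows}

fig, ax = plt.subplots(figsize=(10, 5))
goog_ds.plot(ax=ax, alpha=0.5, label="Adjusted close")
for w, series in rolling_means.items():
    series.plot(ax=ax, label=f"{w}-day rolling mean")
ax.set_ylabel("Adjusted close")
ax.legend()
```

Each rolling mean is undefined (NaN) until a full window of observations is available, so the 120-day line only starts partway through the year here. Computing the means on a longer history and then slicing out 2016 would avoid that gap.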


Note that, especially with the 120-day rolling mean, extreme peaks and valleys are smoothed to show the overall performance of the stock over a longer period of time. The longer the sliding window, the more resistant it is to extreme fluctuations. However, a window that is too long lags behind recent price changes, which could be disastrous when the trend is being used to decide on an action.