Using Time Series Encodings to Discover Baseball History’s Most Interesting Seasons
Take me out to the ballgame! Take me out to the crowd! Of the 2,829 seasons played by 101 baseball teams since 1880, which were unlike any others? By using SAX encoding to recognize patterns in time series data, we can find the most unusual years in baseball.
By Adam Faskiwitz, TIBCO.
Baseball is one of the oldest sports in the United States, with a history dating back to the 19th century. Since 1880, there have been 101 different teams who have played a grand total of 2,829 different seasons. By looking at the data, I wanted to statistically uncover which of these 2,829 seasons were anomalies: which teams had seasons unlike any other. To accomplish this, I used a method called SAX (Symbolic Aggregate Approximation) encoding. SAX has three advantages: it acts as a dimensionality reduction tool, it tolerates time series of different lengths, and it makes trends easier to find.
SAX encoding is a method used to simplify time series through the summarization of time intervals. By averaging, binning, and symbolically representing periods, the data becomes much smaller and easier to deal with, while still capturing its important aspects. I came across the method when looking at sensor data from the manufacturing industry. Wanting to find anomalous patterns, I discovered that looking at the unique SAX representations led me to sensors with higher failure rates. Applying the same idea to baseball seasons, I used unique SAX representations to find anomalous seasons.
The data used for this analysis is pulled from the pybaseball library in Python. Its GitHub repository: https://github.com/jldbc/pybaseball
For my analysis, I took every MLB season since 1880 and viewed each as a time series that is represented, at any given point, by the cumulative number of wins minus the number of losses. In baseball lingo, each season is a graph of the number of games above or below 0.500 (even number of wins and losses). Below is the time series for the 2000 Anaheim Angels:
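As a minimal sketch of this representation, assume a season's results are already available as a list of "W" and "L" strings (how they are pulled from pybaseball is not shown here). The function name games_above_500 and the sample results are hypothetical, and treating anything that is neither a win nor a loss as a zero step is my own assumption:

```python
import numpy as np

def games_above_500(results):
    """Return the running (wins - losses) total after each game."""
    # +1 for a win, -1 for a loss, 0 for anything else (assumed: ties)
    steps = [1 if r.startswith("W") else -1 if r.startswith("L") else 0
             for r in results]
    return np.cumsum(steps)

# Hypothetical eight-game stretch, not real data
sample = ["W", "W", "L", "W", "L", "L", "L", "W"]
series = games_above_500(sample)
print(series.tolist())  # [1, 2, 1, 2, 1, 0, -1, 0]
```

A value of 0 means the team sits exactly at 0.500 at that point in the season.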
Representing every season like this, I can store all 2,829 seasons in one pandas dataframe:
The NaN values here are due to the differing lengths of MLB seasons over time. In the 19th century, seasons consisted of around 82 games, while since the 1960s most seasons have been twice as long, with around 162 games. SAX encoding is thankfully able to deal with the problem of differing lengths of seasons without issue.
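To illustrate where the NaN values come from, here is a small sketch that stacks series of different lengths into one dataframe. The season names and values are made up for illustration, and the layout (one row per season, one column per game number) is an assumption about how the dataframe was organized:

```python
import numpy as np
import pandas as pd

# Hypothetical seasons of different lengths (games above/below .500)
seasons = {
    "short_season": np.array([1, 2, 1, 0]),
    "long_season": np.array([-1, 0, 1, 2, 3, 2]),
}

# Building from Series aligns on game number; missing games become NaN
df = pd.DataFrame({name: pd.Series(vals) for name, vals in seasons.items()}).T
print(df)
```

The shorter season is padded with NaN on the right, so every row has the same number of columns.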
Now that there is a numerical representation of each season, I will normalize the data to keep each season on the same scale. After transforming the data, each data point is represented by the number of standard deviations above or below the mean, as measured across all 2,829 seasons. The resulting dataframe:
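The normalization can be sketched as a z-score. It is not stated here whether the mean and standard deviation were computed over all values at once or separately per game number; this sketch assumes a single global mean and standard deviation, and the toy dataframe is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows are seasons, columns are game numbers
df = pd.DataFrame(
    [[1, 2, 3, 4, np.nan, np.nan],
     [-1, -2, -3, -4, -5, -6],
     [1, 0, 1, 0, 1, 0]],
    index=["season_a", "season_b", "season_c"],
    dtype=float,
)

values = df.to_numpy()
mu = np.nanmean(values)      # mean over all non-missing points
sigma = np.nanstd(values)    # standard deviation over the same points
normalized = (df - mu) / sigma
print(normalized.round(2))
```

NaN padding is ignored when computing the statistics and stays NaN afterwards. A per-column variant, (df - df.mean()) / df.std(), would instead compare each game number only against the same game number in other seasons.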
Piecewise Aggregate Approximation
The first step of SAX encoding is performing PAA (Piecewise Aggregate Approximation) on the time series. This method splits the time series into n subsections and then uses the average of each subsection as its new value. Think of PAA as a way to summarize sections of the data. Depending on the number of splits, the resulting dataframe holds a metric of how well a team has been doing in each subsection of the season. In my case, I chose to make n = 5, meaning that each column represents a fifth of the season.
Taking a look at column 0 above, the numbers represent the average performance of the different seasons during the first fifth of the season. As a result, we have metrics that can give a general trend of how the team performed across the whole season.
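A minimal PAA sketch with n = 5 follows. The function name paa is hypothetical, and splitting with np.array_split (which yields nearly equal chunks when the length is not divisible by n) is a simplification I am assuming rather than the exact implementation used for this analysis. Dropping the NaN padding first is how this sketch copes with seasons of different lengths:

```python
import numpy as np

def paa(series, n_segments=5):
    """Average a series over n_segments nearly equal consecutive chunks."""
    values = np.asarray(series, dtype=float)
    values = values[~np.isnan(values)]           # drop NaN padding
    chunks = np.array_split(values, n_segments)  # nearly equal pieces
    return np.array([chunk.mean() for chunk in chunks])

# Hypothetical ten-game season padded to twelve columns
season = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, np.nan, np.nan])
print(paa(season))  # [1.5 3.5 5.5 7.5 9.5]
```

Whether a season has 82 or 162 games, the result is always five numbers, which is what makes seasons of different lengths comparable. The sketch assumes each season has at least n_segments games.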
After PAA, the beauty of the method comes as you convert the results into a single, symbolic representation. Now that we have metrics for each ‘split’ of the season, we can bin these measurements into different categories. Here, it might be valuable to set up the bins so that teams with varying performances are separated. In this case, I ended up choosing four bins, represented by an ‘alphabet’ of A, B, C, & D, which in turn essentially suggests whether a team was horrible, bad, good, or great for a given slice of the season. Looking back at our PAA chart on the left, we can translate those values into the chart on the right and, after aggregating the slices, every season can now be represented simply by a “SAX string.”
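The binning step can be sketched as follows. The exact bin edges used for this analysis are not given, so the breakpoints below (-0.67, 0, 0.67, the cut points that split a standard normal distribution into four roughly equal parts and are commonly used for a four-letter SAX alphabet) are an assumption, as is the function name to_sax_string:

```python
import numpy as np

# Assumed breakpoints: approximate quartiles of a standard normal distribution
BREAKPOINTS = np.array([-0.67, 0.0, 0.67])
ALPHABET = "ABCD"  # A = horrible, B = bad, C = good, D = great

def to_sax_string(paa_values):
    """Map each PAA value to a letter and join the letters into one string."""
    # searchsorted returns, for each value, how many breakpoints it meets or exceeds
    idx = np.searchsorted(BREAKPOINTS, paa_values, side="right")
    return "".join(ALPHABET[i] for i in idx)

# Hypothetical normalized PAA values for one season
print(to_sax_string([-1.2, -0.3, 0.1, 0.9, 0.2]))  # ABCDC
```

With five slices and four letters there are 4^5 = 1,024 possible SAX strings, so many seasons necessarily share a string while a few may have one to themselves.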
The value of the SAX strings becomes clear now that we can easily count how many seasons share each SAX string. For example, the most frequent SAX string is ‘CCCCC,’ which represents 628 seasons. What is particularly interesting is looking at the seasons where the SAX string is unique. There are only a few of them, and they are easy to find. A unique encoding tells you that the trend of that particular season was unlike any other season in baseball history; according to SAX encoding, these seasons are numerically ‘interesting.’ As a baseball fan myself, I found the results a bit shocking, and they helped me discover truly unusual teams that I had never known about. Take a look at the visual below:
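Counting the strings and pulling out the unique ones takes only a few lines of pandas. The season labels and SAX strings in this sketch are invented for illustration and are not results from the analysis:

```python
import pandas as pd

# Hypothetical SAX strings keyed by hypothetical season labels
sax = pd.Series({
    "season_1": "CCCCC",
    "season_2": "CCCCC",
    "season_3": "BBCCD",
    "season_4": "CCCCC",
    "season_5": "ADDDA",
    "season_6": "BBBBB",
    "season_7": "BBBBB",
})

counts = sax.value_counts()               # frequency of each SAX string
unique_strings = counts[counts == 1].index
unique_seasons = sax[sax.isin(unique_strings)]

print(counts.idxmax())  # CCCCC
print(unique_seasons)
```

Here season_3 and season_5 are the only seasons whose strings appear exactly once, so they are the ones flagged as anomalous.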
Highlighted above are some of the unique values from the SAX encoding method. Using my settings, there ended up being 20 total seasons that were unique. From these 20, we discover some remarkable seasons. The 2001 & 2002 Athletics seasons had a movie made about them; the 1972 Philadelphia Phillies are truly one of the most puzzling teams ever, and the 1914 Boston Braves should have a movie made about them! The list goes on…
Here is a graph of the most common SAX string compared with some of the unique seasons:
SAX String: CCCCC
When trying to find the most interesting seasons in history, someone could look one by one at every team’s time series and take note of which might look different. Instead, SAX reduces the dimensionality of thousands of time series and quickly produces results that point to just 20 seasons.
While not quite as well known as other techniques, SAX is a great tool for pattern recognition on time series data and was the perfect, simple choice for approaching this question. What’s more, the same technique can be used for finding anomalous time series in manufacturing and possibly many other fields, including the medical, automotive, and telecommunications industries.
The SAX Encoding algorithm is available as a plug-in for the TIBCO Data Science platform.
Original. Reposted with permission.