Building AI Models for High-Frequency Streaming Data
This post is the first in a two-part series on AI for streaming data. Here, we’ll walk through strategies for aligning times and resampling the data.
Heather Gorr, Ph.D., Senior MATLAB Product Manager, MathWorks
We hear about AI everywhere. Machine learning models are now incorporated into several applications, such as medical devices and automated vehicles. These systems include many sensors, streaming data from hardware. The model is applied to the data in the stream and predictions are sent to a dashboard, database, or another device (repeatedly!).
Data prep and model development challenges are exacerbated by such high-frequency time-series data. Consider monitoring equipment with sensors for temperature, pressure, current, etc. Each sensor has a slightly different sampling rate, or time step, and the readings must be synchronized into one data set (with the same times) for multivariable analysis. It can be difficult to know where to start, but established techniques exist, and the right choice depends on the application and the data.
Part 2 will focus on choosing machine learning models for a streaming architecture. First, let’s “sync up” on the approach.
Synchronizing Data
What does it mean to “synchronize data”? Think of “synchronizing your wristwatches”: the times are aligned and the data are merged into one data set, illustrated below. This approach sounds easy enough, but it requires planning (and experimentation!), considering the data, sampling rates, and requirements throughout the rest of the system.
To fellow data scientists, this looks like a “join” operation, using the time as the key variable. But since the sensors are often sampled at different, high-frequency rates, a typical join may result in too much missing data or an inconsistent time step.
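To make the join problem concrete, here is a minimal sketch in plain Python. The sensor names, readings, and sampling intervals are hypothetical, chosen only to show how joining on time alone leaves gaps and an uneven time step.

```python
# Hypothetical readings: timestamp in milliseconds -> value.
temperature = {0: 20.0, 10: 20.1, 20: 20.3, 30: 20.2}  # sampled every 10 ms
pressure = {0: 101.2, 15: 101.3, 30: 101.1}            # sampled every 15 ms


def outer_join(a, b):
    """Join two {time: value} series on time; None marks a missing sample."""
    times = sorted(set(a) | set(b))
    return [(t, a.get(t), b.get(t)) for t in times]


joined = outer_join(temperature, pressure)
for row in joined:
    print(row)

# Count rows where at least one sensor has no reading at that time.
missing = sum(1 for _, temp, pres in joined if temp is None or pres is None)
print(f"{missing} of {len(joined)} rows have a missing value")
```

Only the timestamps shared by both sensors (0 and 30 ms) produce complete rows, and the joined time step alternates between 10 ms and 5 ms.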
Step 1: Aligning Times
It makes sense to first decide on the desired times for the application. When integrating into a streaming application, the rest of the system must also be considered. There may be a specific time step or sample rate in mind for the problem (e.g., hourly or every 10 seconds). Alternatively, the original timestamps from one of the data sets can be used to align the remaining data.
For example, for the equipment sensors, the sample rate is 1000 Hz (number of samples per second), or 0.001 seconds between data points. In the streaming system, 1 second of data will be processed at a time. We create a time vector from 0 to 1 seconds, with a time step of 0.001 seconds, then “resample” the data to the new times.
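The time vector described above could be built as in the following sketch. The variable names are illustrative, and including both endpoints (0 and 1 second) is an assumption; a real streaming system might drop the last point so that consecutive windows do not overlap.

```python
sample_rate_hz = 1000  # from the example: 1000 samples per second
window_seconds = 1     # one second of data is processed at a time

# Derive each time from an integer sample index instead of repeatedly
# adding 0.001, which would accumulate floating-point rounding error.
n_steps = sample_rate_hz * window_seconds
new_times = [i / sample_rate_hz for i in range(n_steps + 1)]

print(len(new_times), new_times[0], new_times[1], new_times[-1])
```

With both endpoints included, the vector holds 1001 points spaced 0.001 seconds apart.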
Step 2: Resampling Data
The art of data synchronization is in deciding how to fill in the data points where the times don’t match. The data are resampled from the original to the new times. Several common methods for doing this are listed in the table below. The choice of which to use will depend on the initial alignment of the time vectors and the application requirements.
Type | Methods
Fill | Missing; Constant
Nearest neighbor | Nearest; Previous or next
Aggregation | Summary statistics; Sum; Product; Count; First or last
Interpolation | Linear; Polynomial; Spline; Shape-preserving cubic; Modified Akima cubic
Figure 2: Common methods for resampling data. © 1984–2020 The MathWorks, Inc.
So, where do we start? When unsure of the time alignment between the data sets, it is common to fill with missing data (like an outer join) or a constant value. This can be helpful as a first step, particularly when working with many sensors. Exploring the resulting data and visualizing will help determine how to proceed based on the time steps and amount of missing data.
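As a rough illustration of this first step, the sketch below resamples one hypothetical sensor onto a new time grid, filling unmatched times with either a missing marker (`None`) or a constant. The function name `fill_resample` and the data are assumptions for the example, not part of any library.

```python
def fill_resample(series, new_times, fill_value=None):
    """Keep values whose timestamps match exactly; fill all other times."""
    return [series.get(t, fill_value) for t in new_times]


# Hypothetical pressure readings: timestamp in milliseconds -> value.
pressure = {0: 101.2, 15: 101.3, 30: 101.1}
new_times = list(range(0, 31, 5))  # target grid: every 5 ms from 0 to 30

with_missing = fill_resample(pressure, new_times)
with_constant = fill_resample(pressure, new_times, fill_value=0.0)

# The share of missing points is a quick guide to how well the times align.
missing_fraction = with_missing.count(None) / len(with_missing)
print(with_missing)
print(with_constant)
print(f"missing fraction: {missing_fraction:.2f}")
```

A large missing fraction, as in this toy example, suggests that the times are not closely aligned and that aggregation or interpolation is likely a better fit than filling.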
If the times are closely aligned, any of the methods listed in the table above could be used, ensuring it makes sense for the application. When the times are not as closely aligned, it’s more common to aggregate or interpolate the data. Think of hourly data turned to daily data: how would we represent all data over 24 hours in one data point? In this example, an aggregation (daily mean) would be appropriate. For non-numeric data, it is common to use the count, mode, or nearest neighbor method.
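The hourly-to-daily case can be sketched with the standard library alone. The readings here are synthetic (the value is simply the hour index) and `daily_mean` is a hypothetical helper written for this example.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean

start = datetime(2020, 1, 1)
# Synthetic hourly readings over two days; each value is the hour index.
hourly = [(start + timedelta(hours=h), float(h)) for h in range(48)]


def daily_mean(samples):
    """Group (timestamp, value) pairs by calendar day and average each group."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        buckets[timestamp.date()].append(value)
    return {day: mean(values) for day, values in sorted(buckets.items())}


daily = daily_mean(hourly)
for day, value in daily.items():
    print(day, value)
```

Swapping `mean` for `sum`, `len`, or `statistics.mode` in the same grouping step would give the sum, count, or mode aggregations instead.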
With sensor data, interpolation is the most common approach. The times are generally off only slightly, so there aren’t as many gaps to fill and the trends in the data are known. Linear interpolation is very common, as it is simple to understand. However, it can be less precise when the points are farther apart, so a polynomial or spline interpolant is better suited to those cases. To retain more of the trend, it is common to use shape-preserving piecewise cubic (“pchip”) or modified Akima piecewise cubic Hermite (“makima”) interpolants. Keep in mind that these interpolation methods require the timestamps, not the sensor values, to be monotonically increasing (sorted in time, with no repeated times); they do not need to be evenly spaced.
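For intuition, here is a minimal linear-interpolation sketch in plain Python. It is not how any particular library implements resampling. The function name and data are hypothetical, and returning `None` outside the sampled range (no extrapolation) is a simplifying assumption.

```python
from bisect import bisect_right


def linear_resample(times, values, new_times):
    """Linearly interpolate (times, values) onto new_times."""
    # Interpolation needs sorted timestamps with no repeats.
    if any(t1 <= t0 for t0, t1 in zip(times, times[1:])):
        raise ValueError("timestamps must be strictly increasing")
    resampled = []
    for t in new_times:
        if t < times[0] or t > times[-1]:
            resampled.append(None)  # assumption: do not extrapolate
            continue
        i = bisect_right(times, t) - 1  # last sample at or before t
        if i == len(times) - 1:         # t falls exactly on the final sample
            resampled.append(values[-1])
            continue
        t0, t1 = times[i], times[i + 1]
        v0, v1 = values[i], values[i + 1]
        resampled.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
    return resampled


# Hypothetical sensor sampled every 15 ms, resampled onto a 10 ms grid.
times = [0, 15, 30]
values = [100.0, 103.0, 100.0]
new_times = [0, 10, 20, 30, 40]
resampled = linear_resample(times, values, new_times)
print(resampled)
```

Note that the peak of 103.0 at 15 ms never appears in the resampled output, because no new time lands on it. This is one reason to compare interpolation methods and time steps against the original signal before settling on one.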
This might sound challenging, but the good news is that these tasks are common enough that they’re built into APIs and modules in common data science platforms. For example, MATLAB provides a synchronize() function with many of the aforementioned options. You can also embed an app into your script to explore different time steps and resampling methods, illustrated below. This can help with fast experimentation and decision-making (trial and error, anyone?).
Once you’ve aligned the data, the sky’s the limit! However, a few more data prep considerations often must be addressed before building models with sensor data. It’s common to smooth and downsample further, then explore the frequency domain before building models. This topic will be discussed in the next post, including data prep for machine or deep learning models for this type of data.
In this blog post, we discussed strategies for high-frequency data synchronization through aligning timestamps and resampling the data. We considered the initial time-alignment and problem requirements to help decide on a suitable resampling method. Though these problems can be challenging, we also saw that tools such as MATLAB can help you experiment with different methods for aligning and resampling. For example, the Live Editor task shown above will help to explore many resampling methods quickly. Once the data are in the same data set on the same times, further analysis can be more easily performed.
In Part 2 of this series, we will focus on choosing machine and deep learning models for high-frequency data. We will then discuss integrating the data prep and modeling into a streaming architecture to complete the application.
To learn more about the topic covered in this blog, see the resources below or email me at hgorr@mathworks.com.
Resources
- Retime and Synchronize Timetable Variables Using Different Methods (example): This try-in-your-browser example shows how to fill in gaps in timetable variables using different methods for different variables.
- Synchronize Timetables (example): This example lets you interactively collect variables from all input timetables, synchronize them to a common time vector, and return the result as a single timetable.
- MATLAB for Data Science (web page): Learn how to explore data, build machine learning models, and do predictive analytics with MATLAB.
- MATLAB and Simulink for Signal Processing (web page): Find out how to analyze signals and time-series data and model, design, and simulate signal processing systems with MATLAB.