Building AI Models for High-Frequency Streaming Data


Sponsored Post.

Heather Gorr, Ph.D., Senior MATLAB Product Manager, MathWorks

We hear about AI everywhere. Machine learning models are now incorporated into many applications, such as medical devices and automated vehicles. These systems include many sensors streaming data from hardware. The model is applied to the data in the stream, and predictions are sent to a dashboard, database, or another device (repeatedly!).

Data prep and model development challenges are exacerbated with such high-frequency, time-series data. Consider monitoring equipment with sensors for temperature, pressure, current, etc. Each sensor has a slightly different sampling rate, or time step, and the readings must be synchronized into one data set (with the same times) for multivariate analysis. It can be difficult to know where to start, but proven techniques exist, and the right choice depends on the application and the data.

This post is the first in a two-part series on AI for streaming data. Here, we’ll walk through strategies for aligning times and resampling the data. Part 2 will focus on choosing machine learning models for a streaming architecture. First, let’s “sync up” on the approach.

Synchronizing Data

What does it mean to “synchronize data”? Think of “synchronizing your wristwatches”: the times are aligned and the data are merged into one data set, illustrated below. This approach sounds easy enough, but it requires planning (and experimentation!), considering the data, sampling rates, and requirements throughout the rest of the system.


Figure 1: Animation illustrating the synchronization of multirate data into one data set. The times are aligned and the data are resampled accordingly. © 1984–2020 The MathWorks, Inc.


To fellow data scientists, this looks like a “join” operation, using time as the key variable. But since the sensors are often sampled at different, high-frequency rates, a typical join may result in too much missing data or an inconsistent time step.

Step 1: Aligning Times

It makes sense to first think about the desired times for the application. When integrating into a streaming application, the rest of the system must also be considered. Maybe there’s a specific time step or sample rate in mind for the problem (e.g., hourly or every 10 seconds). Alternatively, the original timestamps from one of the data sets could be used to align the remaining data.

For example, say the equipment sensors have a sample rate of 1000 Hz (1000 samples per second), or 0.001 seconds between data points, and the streaming system processes 1 second of data at a time. We create a time vector from 0 to 1 second with a time step of 0.001 seconds, then “resample” the data to the new times, as sketched below.
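As a minimal sketch of this step in MATLAB, using the retime function (a relative of the synchronize function discussed later); the sensorTT timetable and its randomly generated readings are hypothetical stand-ins for a real sensor feed:

```matlab
% Hypothetical raw readings: ~1 kHz samples with slightly irregular timestamps
rawTimes = seconds(cumsum(0.0005 + 0.001*rand(1000,1)));
sensorTT = timetable(rawTimes, 20 + randn(1000,1), 'VariableNames', {'Temp'});

% Target time base: 0 to 1 second with a 0.001-second time step
fs = 1000;                                % desired sample rate (Hz)
newTimes = seconds((0:1/fs:1)');          % regular time vector
alignedTT = retime(sensorTT, newTimes, 'linear');   % resample to the new times
```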

Step 2: Resampling Data

The art of data synchronization is in deciding how to fill in the data points where the times don’t match. The data are resampled from the original to the new times. Several common methods for doing this are listed in the table below. The choice of which to use will depend on the initial alignment of the time vectors and the application requirements.

Type                Method
Fill                Missing; Constant
Nearest neighbor    Nearest; Previous or next
Aggregation         Summary statistics; Sum; Product; Count; First or last
Interpolation       Linear; Polynomial; Spline; Shape-preserving cubic; Modified Akima cubic

Figure 2: Common methods for resampling data. © 1984–2020 The MathWorks, Inc.

So, where do we start? When unsure of the time alignment between the data sets, it is common to fill with missing data (like an outer join) or a constant value. This can be a helpful first step, particularly when working with many sensors. Exploring and visualizing the resulting data will help determine how to proceed, based on the time steps and the amount of missing data. A quick sketch of this first pass follows.
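Here is a hedged sketch of that first pass in MATLAB using the synchronize function; the tempTT and pressTT timetables and their values are hypothetical:

```matlab
% Two hypothetical sensors sampled at slightly different rates
tempTT  = timetable(seconds(0:0.10:1)',   20 + randn(11,1), 'VariableNames', {'Temp'});
pressTT = timetable(seconds(0:0.15:0.9)', 101 + randn(7,1), 'VariableNames', {'Press'});

% First pass: align on the union of all timestamps and fill the gaps
% with missing values (similar to an outer join on time)
TT = synchronize(tempTT, pressTT, 'union', 'fillwithmissing');
summary(TT)   % inspect the time steps and the amount of missing data
```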

If the times are closely aligned, any of the methods listed in the table above could be used, as long as it makes sense for the application. When the times are not as closely aligned, it’s more common to aggregate or interpolate the data. Think of converting hourly data to daily data: how would we represent 24 hours of data in one data point? Here, an aggregation (the daily mean) would be appropriate, as sketched below. For non-numeric data, it is common to use the count, mode, or nearest-neighbor method.
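For instance, a short MATLAB sketch of that hourly-to-daily aggregation (the hourlyTT timetable and its values are hypothetical):

```matlab
% Hypothetical hourly readings spanning three days
hourlyTimes = datetime(2020,1,1) + hours(0:71)';
hourlyTT = timetable(hourlyTimes, 20 + randn(72,1), 'VariableNames', {'Temp'});

% Aggregate to one value per day using the daily mean
dailyTT = retime(hourlyTT, 'daily', 'mean');
```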


Figure 3: Animation illustrating data synchronization by filling with missing values. © 1984–2020 The MathWorks, Inc.


With sensor data, interpolation is the most common approach. The times are generally only slightly off, so there aren’t many gaps to fill and the trends are well understood. Linear interpolation is very common, as it is simple to understand. However, it can be less accurate when the points are farther apart, so a polynomial or spline interpolant is better suited to those cases. To retain more of the trend, it is common to use shape-preserving piecewise cubic (“pchip”) or modified Akima piecewise cubic Hermite (“makima”) interpolants. Keep in mind that these interpolation methods require the timestamps to be unique and monotonically increasing (i.e., sorted in time).
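As an illustrative comparison in MATLAB (the signal is synthetic and the variable names are placeholders):

```matlab
% A synthetic, irregularly sampled signal to resample three ways
rawTimes = seconds(cumsum(0.0005 + 0.001*rand(1000,1)));
signalTT = timetable(rawTimes, sin(2*pi*seconds(rawTimes)) + 0.05*randn(1000,1), ...
                     'VariableNames', {'Signal'});
newTimes = seconds((0:0.001:1)');

ttLinear = retime(signalTT, newTimes, 'linear');   % simple and fast
ttPchip  = retime(signalTT, newTimes, 'pchip');    % shape-preserving cubic
ttMakima = retime(signalTT, newTimes, 'makima');   % modified Akima cubic
```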

This might sound challenging, but the good news is that these tasks are common enough to be built into APIs and modules in popular data science platforms. For example, MATLAB provides a synchronize() function with many of the aforementioned options, as sketched below. You can also embed an app (a Live Editor task) into your script to explore different time steps and resampling methods, illustrated in Figure 4. This helps with fast experimentation and decision-making (trial and error, anyone?).
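Pulling the pieces together, one synchronize call can align multiple sensors onto a regular time base; as in the earlier sketches, the tempTT and pressTT timetables here are hypothetical:

```matlab
% Hypothetical raw sensors sampled at different, irregular rates
tempTT  = timetable(seconds(cumsum(rand(90,1))*0.010),  20 + randn(90,1),  'VariableNames', {'Temp'});
pressTT = timetable(seconds(cumsum(rand(120,1))*0.008), 101 + randn(120,1), 'VariableNames', {'Press'});

% Align both onto a regular 1 ms time base, interpolating linearly
TT = synchronize(tempTT, pressTT, 'regular', 'linear', 'TimeStep', milliseconds(1));
```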


Figure 4: Synchronizing data with various methods using a Live Editor task in MATLAB. © 1984–2020 The MathWorks, Inc.


Once you’ve aligned the data, the sky’s the limit! However, a few more data prep steps are often needed before building models with sensor data: it’s common to smooth and downsample further, then explore the frequency domain. These steps will be discussed in the next post, along with data prep for machine and deep learning models for this type of data.

In this blog post, we discussed strategies for synchronizing high-frequency data by aligning timestamps and resampling. We considered the initial time alignment and the problem requirements to help decide on a suitable resampling method. Though these problems can be challenging, tools such as MATLAB can help you experiment with different methods for aligning and resampling; for example, the Live Editor task shown above lets you explore many resampling methods quickly. Once the data are merged into one data set on a common time vector, further analysis becomes much easier.

In Part 2 of this series, we will focus on choosing machine and deep learning models for high-frequency data. We will then discuss integrating the data prep and modeling into a streaming architecture to complete the application.

To learn more about the topic covered in this blog, see the resources below or email me at hgorr@mathworks.com.

Resources