Building AI Models for High-Frequency Streaming Data – Part Two

MathWorks, MATLAB, Streaming Analytics, Time Series

Streaming high-frequency data

What is streaming? If streaming movies or music comes to mind, you’ve got the right idea! Data arrives continuously, but instead of simply watching, you must act on the information. Predictions must therefore be made and reported continuously.

So, what does this mean for an AI model? Consider an example of predicting equipment failure using sensors for temperature, pressure, and current. The flow looks something like this:

The raw sensor data is passed to a messaging service for initial data management. Then additional data processing and model predictions are performed. The model is updated based on recent data, and results are sent to a dashboard (repeatedly!).

The first step is to plan out the system with the team. It is important to capture requirements and decide on parameters throughout the system before building anything. It is also helpful to build a full streaming prototype as early as possible, then come back to tune algorithms.

In our example, we used Apache Kafka for messaging, which is a distributed streaming platform with APIs for many languages to facilitate reading and writing data to the stream. One of those APIs is a MATLAB interface, which we used here. We can also specify how to manage out-of-order data, buffering, and other parameters ideal for high-frequency data.

One important parameter to consider is the time window. It controls how much data enters the system for each prediction, and you must decide on it before approaching data prep or model training. In our example, we chose one second, which is reasonable for the mathematical assumptions and model updates.
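To make the time window concrete, here is a minimal sketch in Python (our example itself used MATLAB) of grouping timestamped samples into fixed, non-overlapping windows. The function name `group_into_windows` and the sample readings are hypothetical, chosen only for illustration. In a real system the messaging layer typically handles windowing and out-of-order data.

```python
from collections import defaultdict


def group_into_windows(samples, window_seconds=1.0):
    """Group (timestamp, value) samples into fixed, non-overlapping time windows.

    Samples may arrive out of order: each one is assigned to a window by its
    own timestamp, and values inside a window are sorted by time.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Integer index of the window this sample falls into.
        index = int(timestamp // window_seconds)
        buckets[index].append((timestamp, value))
    # Return (window start time, values) pairs in time order.
    return [
        (index * window_seconds, [v for _, v in sorted(buckets[index])])
        for index in sorted(buckets)
    ]


# Hypothetical temperature readings: (seconds since start, value).
# The sample at 0.90 s arrives after the one at 1.20 s.
readings = [(0.10, 20.1), (0.60, 20.3), (1.20, 20.9), (0.90, 20.4), (2.05, 21.5)]
windows = group_into_windows(readings, window_seconds=1.0)
for start, values in windows:
    print(start, values)
```

The late sample at 0.90 s still lands in the first window because it is assigned by timestamp rather than arrival order.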

Data preparation for machine learning

Part 1 of this series focused on time alignment and synchronization of the sensor data. Now let’s think about representing the data to train a model. First, you need failure data to predict failures. Don’t worry, there’s no need to break your equipment (repeatedly) if you don’t have enough, as failure scenarios can be simulated! In our example, we apply various faults to a physical model using Simulink. We used the generated data from many simulations, along with the experimental data, to train the model.

Since only one second of data passes through the stream at a time, it’s important to choose a representation that captures the most information (and the least noise). It’s common to use features from the frequency domain, like the FFT and power spectrum, as in our case. We live in the time domain, so the frequency domain might sound uncomfortable, but it just means we’re analyzing the data with respect to frequency instead of time. We won’t get into it here, but you can learn more with examples on signal prep for machine and deep learning and a practical introduction to time-frequency analysis.
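As a rough illustration of a frequency-domain feature, the Python sketch below computes a power spectrum for a one-second window and picks out its dominant frequency. To stay self-contained it uses a naive discrete Fourier transform rather than an FFT library, and the 64 Hz sample rate, the 8 Hz test signal, and the function names are all assumptions made for this example. The division by the window length is just one common scaling convention.

```python
import cmath
import math


def power_spectrum(signal, sample_rate):
    """Return (frequencies, power) for a real signal using a naive DFT.

    This is O(n^2); an FFT gives the same coefficients much faster.
    """
    n = len(signal)
    freqs, power = [], []
    # A real signal only needs the bins from 0 Hz up to the Nyquist frequency.
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(signal)
        )
        freqs.append(k * sample_rate / n)
        power.append(abs(coeff) ** 2 / n)
    return freqs, power


def dominant_frequency(signal, sample_rate):
    """Return the frequency (Hz) of the strongest non-DC component."""
    freqs, power = power_spectrum(signal, sample_rate)
    # Skip bin 0, which only reflects the constant offset of the signal.
    peak = max(range(1, len(power)), key=power.__getitem__)
    return freqs[peak]


# One second of a hypothetical sensor signal: an 8 Hz vibration plus an offset.
sample_rate = 64
window = [
    math.sin(2 * math.pi * 8 * i / sample_rate) + 0.5 for i in range(sample_rate)
]
peak_hz = dominant_frequency(window, sample_rate)
print(peak_hz)
```

A handful of values like this (peak frequency, power in selected bands) can then serve as the feature vector for the model instead of the raw samples.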

AI modeling approach

There are many resources for comparing various algorithms, so let’s focus on how streaming affects the choice of model. In general, models suited to time series and forecasting are used frequently and include:

Traditional time-series models (curve fitting, ARIMA, GARCH)
Machine learning models (nonlinear: trees, SVMs, Gaussian processes)
Deep learning models (multilayer perceptron, CNNs, LSTMs, TCNs)

Any of these could work in our example, but there are several key aspects to consider first for streaming. The training data set includes only one second of data at a time, so the algorithm must be capable of learning in this condition and robust to noise. Also, the model needs to be updated over time as new data enters the system, without retraining on historical data. The model predictions and updates must also be fast and easily distributed, which can greatly influence the choice of algorithm. Generally, keep it simple when streaming.
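To show what updating a model without retraining on historical data can look like, here is a minimal Python sketch of a nearest-centroid classifier whose class centroids are running means. This is a deliberately simple stand-in, not the model used in our example, and the class name, feature values, and labels are hypothetical.

```python
class RunningCentroidClassifier:
    """Nearest-centroid classifier updated one labeled example at a time."""

    def __init__(self):
        self.counts = {}     # label -> number of examples seen so far
        self.centroids = {}  # label -> running mean of the feature vectors

    def update(self, features, label):
        """Fold one labeled example into the model without revisiting old data."""
        count = self.counts.get(label, 0) + 1
        mean = self.centroids.get(label, [0.0] * len(features))
        # Incremental mean: new_mean = old_mean + (x - old_mean) / count
        self.centroids[label] = [m + (x - m) / count for m, x in zip(mean, features)]
        self.counts[label] = count

    def predict(self, features):
        """Return the label whose centroid is closest (squared Euclidean distance)."""
        def distance(label):
            return sum((x - m) ** 2 for x, m in zip(features, self.centroids[label]))
        return min(self.centroids, key=distance)


model = RunningCentroidClassifier()
# Hypothetical per-window feature vectors with known labels.
stream = [
    ([1.0, 0.1], "normal"),
    ([5.0, 2.0], "fault"),
    ([1.2, 0.3], "normal"),
    ([4.6, 2.4], "fault"),
]
for features, label in stream:
    model.update(features, label)

prediction = model.predict([4.8, 2.1])
print(prediction)
```

Each update touches only a count and a mean per class, so its cost does not grow with the amount of history, which is the property that matters in the stream.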

In our example, we prioritized getting the streaming prototype running in production, so we needed to select and train a model quickly. We used the Classification Learner and Deep Network Designer apps in MATLAB to explore models, then exported the most accurate model. We used a classification tree ensemble for predicting faults and regression for estimating the remaining lifetime, both of which are fast and updateable in the stream.

Once the model is trained and validated, we can start integrating. The steps for data prep, model prediction, and updating the model state are performed in a single function, which accepts the window of data and the model as inputs and returns the predictions and the updated model as outputs. With this signature, the model can be cached in memory to facilitate rapid updates while avoiding additional network latency. Here, we used an open source data structure for caching and storing state that is included with MATLAB Production Server, which made it easy to integrate and test the model caching within the streaming environment.
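The function signature described above can be sketched in Python as follows. Everything here is an assumption for illustration: the "model" is a plain dictionary holding a running baseline rather than a trained classifier, the threshold is arbitrary, and an ordinary dictionary stands in for the persistent cache used in our example.

```python
def process_window(window, model):
    """Prepare one window, predict, and return (prediction, updated_model).

    `model` is a plain dict holding a running baseline, a toy stand-in for a
    trained model. The input dict is not modified.
    """
    # Data prep: reduce the window to a single feature (its mean).
    feature = sum(window) / len(window)
    # Prediction: flag a fault when the feature is far above the baseline.
    is_fault = model["count"] > 0 and feature > model["baseline"] + model["threshold"]
    # Model update: fold healthy windows into the running baseline.
    updated = dict(model)
    if not is_fault:
        updated["count"] = model["count"] + 1
        updated["baseline"] = (
            model["baseline"] + (feature - model["baseline"]) / updated["count"]
        )
    return is_fault, updated


# In-memory cache of model state, keyed by a hypothetical machine ID.
cache = {}


def handle(machine_id, window):
    """Load the cached model, process one window, and store the updated model."""
    model = cache.get(machine_id, {"baseline": 0.0, "count": 0, "threshold": 5.0})
    prediction, cache[machine_id] = process_window(window, model)
    return prediction


results = [handle("pump-1", w) for w in ([20.0, 20.2], [20.1, 20.3], [29.0, 31.0])]
print(results)
```

Because `process_window` takes the model in and hands the updated model back, it holds no hidden state of its own, so the caller is free to keep the state wherever is fastest.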

Putting it all together

Planning is crucial for streaming. Capture the requirements for the time window, data types, and other expectations throughout the stream, and communicate them during the development process. In addition, standard software practices like source control, documentation, and unit testing will help facilitate development.

It is also important to ease the code handoffs with teammates. For example, as data scientists, we may be sharing our data prep and modeling with a system architect. In our example, we used MATLAB to create a library with our code and model, and the library can be called from many programming languages. Packaging the code this way captures dependencies and creates a readme file for the integration steps. We also used the testing environment to run our code via a local host within the live streaming architecture, which is helpful for debugging.

Integrating AI models into streaming applications can be challenging. In this post, we discussed considerations for training and implementing models for streaming systems. It is important to consider the requirements from the different parts of the system before approaching data prep and algorithm development. Many common models for time series are appropriate, but the need for the model to be updated over time will influence the choice of algorithm. Caching the model also helps maintain the low latency needed in these systems. Tools like MATLAB and Apache Kafka can help integrate the data prep and AI modeling into the streaming architecture for an easier implementation.

To learn more about streaming and deploying AI, see the resources below or email me at hgorr@mathworks.com.

Deploying Predictive Maintenance Algorithms to the Cloud and Edge (article): Using a packaging machine as an example, this article shows how to develop a predictive maintenance algorithm and deploy it in a production IT/OT system with MATLAB.
Deploying AI for Near Real-Time Manufacturing Decisions (conference talk): Learn more about the example discussed in this post, especially simulating data, training models, and integrating MATLAB code in the streaming environment using Apache Kafka.
Enterprise and IT Systems (overview): Find out how MATLAB code is production ready and can be securely deployed and integrated with enterprise IT systems, data sources, and operational technologies.
Use a Data Cache to Persist Data (example): Learn how to use persistence to provide a mechanism to cache data between calls to MATLAB code running on a server instance.
