Onboarding Your Machine Learning Program

Machine learning's popularity continues to grow, and it has ingrained itself in pretty much every industry. This article contains lessons from a data scientist on how to unlock its full potential.



By Adam Hunt, Chief Data Scientist, RiskIQ


These days, ‘machine learning’ is a buzzword you can’t avoid while reading about pretty much any industry.

Its ability to "outthink" humans is touted as a magical ROI booster that can drastically maximize productivity while minimizing resource expenditure. The security industry is no different. With internet-scale attack campaigns overwhelming security teams that struggle to process alerts quickly enough amid oceans of data, machine learning was supposed to be the silver bullet for any modern cybersecurity problem. However, with great hype often comes great disappointment, and we're now experiencing the blowback from a growing number of people who believe it hasn't lived up to expectations.

The truth is, machine learning is no silver bullet. However, that doesn't mean it isn't immensely helpful to security programs and crucial to the future of cybersecurity; people just need to reconsider the way they use it. Rather than treating it as an all-powerful robot overlord, the secret to unlocking its potential is treating it as a very junior employee.

Machine Learning Isn't the Smartest One in the Room

Machine learning models are fast, tireless, retentive, and completely without any common sense. Just as with any intern on their first day, you shouldn't assume a model knows how your organization works, nor, necessarily, the concepts you hope it will eventually master. When you start a machine learning program, think of it as an onboarding process. In the beginning, you need to check in on your models frequently and spend a lot of time getting them started in the right direction. At first, the models, which you hope will drive your business to new heights by processing terabytes of data at awesome speed, won't even understand the task you're asking of them.

Machine learning's inability to think critically is probably the source of most of the disappointment, and it's why humans need to have a very prominent role in the machine-learning age of cybersecurity. Because your models are low-level (but diligent) taskmasters that can't see the big picture, you need to continually spoon-feed them instructions. Over time, they'll see patterns in your feedback and begin to get the hang of what you want them to look for.

As your models learn, you'll need to check in on them less and less, but they can't and shouldn't ever be completely autonomous. They don't see things the way you see them and don't follow a thought process like our own. They can quickly stray from the task at hand, sending your entire program into disarray.

Here’s how to make the most of your machine learning program so that it can live up to the hype:

Implement Safety Nets and Monitoring:
Once you think your model is performing well, you need a few things in place to make sure it doesn't go off the rails. Before you go and build a pipeline, make sure you have the proper safety nets in place. The first of these safety nets is what we call a tripwire: if your model classifies more instances than you expect within a certain period, the tripwire automatically disables it. This measure is critical to prevent your model from running out of control.
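A minimal sketch of what such a tripwire could look like, assuming a scikit-learn-style model with a predict method and the convention that a label of 1 means "threat"; the class name, window, and limits are illustrative, not part of any particular product:

import time

class Tripwire:
    """Disable a model if it flags more instances than expected
    within a rolling time window (illustrative sketch)."""

    def __init__(self, model, max_hits, window_seconds):
        self.model = model
        self.max_hits = max_hits
        self.window = window_seconds
        self.hit_times = []      # timestamps of "threat" classifications
        self.enabled = True

    def predict(self, features):
        if not self.enabled:
            return None          # tripped: a human must review and re-enable
        label = self.model.predict([features])[0]
        if label == 1:           # assumed convention: 1 means "threat"
            now = time.time()
            self.hit_times = [t for t in self.hit_times if now - t < self.window]
            self.hit_times.append(now)
            if len(self.hit_times) > self.max_hits:
                self.enabled = False   # trip the wire and stop classifying
        return label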

Going rogue is extremely common for models when they're first released because, although you've provided your initial model with a pristine, hand-curated data set from which to learn, the real world is dirty in ways you could never anticipate. Just like a fresh college graduate, your model will encounter things that didn't appear in its textbook, causing it to default to biases formed through its training data.


For example, if your training data contains only cats and dogs, when you give the model a fish, it will try to classify it as either a cat or a dog. Unlike a human with common sense, your model will need to be corrected, learn from its mistakes, and try again. The algorithm used to train your model also has inherent biases. Just like people, every model forms its own view of the problem. At first, it makes assumptions that oversimplify the solution (we'll get into this later).

The next safety net is a whitelist. Whitelists are lists of items your models should ignore. In a perfect world, you wouldn't need them because you would invest the time in engineering better features and retraining your model until it gets a specific example right. However, when you need to act now, you will be thankful you have them. While not ideal, whitelists not only prevent your current model from classifying an instance incorrectly, they also help your future models.
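A sketch of how a whitelist might sit in front of a model; the entries, label names, and 0.5 cutoff are purely illustrative, assuming a scikit-learn-style predict_proba:

# Known-benign indicators are checked before the model ever sees them.
WHITELIST = {"example.com", "internal-tool.corp.local"}   # hypothetical entries

def classify(indicator, features, model):
    if indicator in WHITELIST:
        return "benign"                       # skip the model entirely
    score = model.predict_proba([features])[0][1]
    return "threat" if score >= 0.5 else "benign"

# Whitelisted items also double as labeled benign examples for the next
# training run, which is how they help your future models.
future_training_examples = [(item, "benign") for item in WHITELIST]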

Prevent Degradation:
Your model may work well at first, but without proper feedback, its performance will degrade over time (its precision during the first week will be better than in the tenth week). How long it takes the model to degrade to an unacceptable level depends on your tolerance and its ability to generalize to the problem.

The world changes all the time, and it's important that your model changes with it. If you need your model to keep up with current trends, selecting an instance-based model, or a model that can learn incrementally, is critical. Just as providing frequent feedback helps an employee learn and grow, your model needs the same kind of feedback.
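One way to fold that feedback in incrementally, sketched with scikit-learn's partial_fit; the synthetic arrays here are stand-ins for your own feature vectors and analyst labels:

import numpy as np
from sklearn.linear_model import SGDClassifier

# Stand-in data; replace with your own features and labels.
rng = np.random.default_rng(0)
X_initial, y_initial = rng.normal(size=(200, 10)), rng.integers(0, 2, 200)

model = SGDClassifier()
model.partial_fit(X_initial, y_initial, classes=[0, 1])    # first training pass

# Each week, fold in freshly labeled feedback instead of retraining from scratch.
for week in range(10):
    X_week, y_week = rng.normal(size=(50, 10)), rng.integers(0, 2, 50)
    model.partial_fit(X_week, y_week)                        # incremental update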

Active Learning:
Active learning places an expert in the loop. When the model is unsure how to categorize a certain instance, having the ability to ask for help is critical. Models typically provide a probability or score with their prediction, which gets turned into a binary decision based on some threshold you've provided (e.g., threat or not a threat).

But with no guidance, things get problematic, and fast. Imagine a junior security researcher who doesn't know how to assess a certain threat. They think something might be malicious, but they aren't quite sure. They fire off an email to you requesting help, but that email doesn't get answered for a month or more.

Left to their own devices, the employee may make an incorrect assumption. In the case where the instance scored just below the cutoff but the threat was real, the model will continue to ignore it, resulting in a potentially serious false negative. If it chose to act instead, the model would continue to flag benign instances, generating a flood of false positives. Developing a feedback mechanism that lets your model identify and surface questionable items is critical to its success.
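A minimal sketch of such a feedback mechanism: scores falling in an uncertainty band are routed to a human review queue instead of being forced through a single cutoff. The 0.4 to 0.6 band is an illustrative choice, not a recommendation:

# Route uncertain predictions to an analyst instead of forcing a binary call.
LOWER, UPPER = 0.4, 0.6    # illustrative uncertainty band around the cutoff

def triage(score):
    if score >= UPPER:
        return "threat"         # confident positive: raise an alert
    if score <= LOWER:
        return "benign"         # confident negative: ignore
    return "ask_analyst"        # unsure: surface the item for human labeling

# Analyst answers become new labeled examples that feed the next
# incremental update or retraining run.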

Blending and Co-training:
Everyone knows collaboration and diversity help organizations grow. When CEOs surround themselves with "yes-men" or lone wolves decide they can do better by themselves, ideas stagnate. Machine learning models are no different. Data scientists have their "go-to" algorithm for training models. It's important not only to try other algorithms, but to try other algorithms together.
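A small sketch of blending several algorithms with scikit-learn's StackingClassifier; the synthetic dataset and the particular estimators chosen are stand-ins, not a prescribed recipe:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Stand-in data; replace with your own labeled instances.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Diverse base models whose outputs are combined by a simple meta-learner,
# rather than relying on a single "go-to" algorithm.
ensemble = StackingClassifier(
    estimators=[
        ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
)
ensemble.fit(X, y)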

Conclusion
We live in a data-driven society, in which humans really can’t go it alone. With some work, machine learning can be used to leverage your employees’ knowledge and abilities to fill a necessary gap in the talent pool. However, machine learning models are not something you can set and forget. They need frequent feedback and monitoring to provide you with the best performance. Do yourself a favor and make providing that feedback easy. The time you invest in it will pay dividends.

Bio: Adam Hunt is Chief Data Scientist at RiskIQ. He holds a Ph.D. in Physics (2013) from Princeton.
