How To Stay Competitive In Machine Learning Business
To stay competitive in the machine learning business, you have to be better than your rivals, not the best possible, says one of the field's leading experts. He defines simple rules here to make that happen. Let's see how.
By Vin Vashishta, V-Squared
While the majority of businesses are just getting their feet wet in the machine learning space, many are already reaping the benefits. The technology is moving forward rapidly. Getting left behind is a big concern for the early adopters and a driver for fast followers.
Staying competitive in a rapidly moving, emergent technology is already a challenge. With machine learning, that's compounded by technical complexity, a talent shortage, and a constantly changing landscape of open source products. Recognized leaders in the field like Google are employing some creative thinking to stay ahead of companies like Facebook, IBM, and the more recent challenger Microsoft. Through acquisitions, an active presence in open source projects, and crowdsourcing solutions to problems they've been unable to tackle internally, Google has managed to stay at the top of their game.
Most companies don't need to "go Google" when it comes to machine learning. While the business case for machine learning is compelling, the majority of business needs aren't expansive enough to justify a major player's expenditure. So where's the middle ground? I take my solution to that question from systems theory. For a business to be competitive in machine learning, it needs to be better than its rivals, not the best possible. There are some simple rules which create the environment for that to happen. Here are a few of mine. This list is like any other open source project. Contributions are welcome in the comments section.
Create & Protect Unique Datasets
I want to start by moving the focus away from big algorithm. Big algorithm is the belief that the bigger or more complex the algorithm, the better. If that were true, why would Google and Facebook be so active in giving their algorithms away through open source contributions?
Both those companies, and a number of other players in the machine learning space, derive their advantage from their access to unique datasets. The reason behind that is pretty straightforward. The learning behind machine learning happens using datasets. That means given the same algorithm, three businesses with three different datasets (different in quality, size, data points, etc.) will come up with three different models/predictions/outputs of some kind. The more unique the dataset, the greater the difference and potential advantage.
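To make that concrete, here is a minimal sketch (the businesses, data, and numbers are all made up for illustration): three businesses run the exact same algorithm, ordinary least squares via NumPy's `polyfit`, on datasets that differ only in size and noise, and each ends up with a different fitted model.

```python
import numpy as np

rng = np.random.default_rng(42)

def fit_line(x, y):
    # The *same* algorithm (ordinary least squares, degree-1 fit) for everyone.
    return np.polyfit(x, y, deg=1)  # returns [slope, intercept]

# Three hypothetical businesses whose datasets differ in size and noise level,
# even though the underlying process they measure is identical.
models = {}
for name, (n, noise) in {"a": (20, 0.1), "b": (200, 1.0), "c": (50, 3.0)}.items():
    x = rng.uniform(0, 10, n)
    y = 2.0 * x + 1.0 + rng.normal(0, noise, n)
    models[name] = fit_line(x, y)

# Identical algorithm, different data -> different models.
for name, (slope, intercept) in models.items():
    print(f"business {name}: slope={slope:.3f}, intercept={intercept:.3f}")
```

The algorithm contributes nothing to differentiation here; every difference in the resulting models comes from the data each business holds, which is the point of the paragraph above.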
To be unique, the dataset can’t be reverse engineered to an openly available dataset like Wikidump or BLS reports. What does that mean? If a machine learning team takes all the text in Wikipedia and creates a database of the most common words found in the same sentences together, that’s not a unique dataset and neither is any dataset derived from it. That can be reverse engineered to an openly available dataset. Say a business has a customer support call center. Is the data from that unique? In some senses, yes but remember that many businesses also have similar data from their call centers. While the data is specific to one business, its general trends will be repeated in the data available to other businesses.
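The Wikipedia example above can be sketched in a few lines (the toy corpus is a stand-in for any public text): because the counting is purely mechanical, anyone with the same public source reproduces the same dataset, which is exactly why such a derived dataset is not unique.

```python
# Sentence-level word co-occurrence counts, as described above.
# Given the same public text, any team reproduces this dataset exactly.
from collections import Counter
from itertools import combinations

text = "The cat sat. The cat purred. The dog sat."  # stand-in for a public corpus

pairs = Counter()
for sentence in text.lower().split("."):
    words = sorted(set(sentence.split()))  # unique words in the sentence
    pairs.update(combinations(words, 2))   # count each co-occurring pair once

print(pairs.most_common(3))
```

No matter who runs it, the output is identical, so the resulting "dataset" confers no advantage over a competitor with the same public source.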
So what is unique? Think about Twitter. Does anyone else, without paying for it, have access to all that content from their users? Facebook and maybe one or two others. That’s it and it makes their feed pretty unique. Let’s go back to Google. Does anyone else have access to that type of search data? No. Those are unique datasets. How do you get access to unique datasets without “going Google?” This is where business strategy planning comes into play.
The reason shifting from big algorithm to unique data is so important is it adds the question, “How do we create unique datasets?” to the strategy planning process. Most companies see hiring data scientists and machine learning experts as the only necessary step in adopting the technology. That’s what makes big algorithm thinking such a problem.
Most of the generic advice out there tells businesses to hire data scientists and give them business problems. That's a start but it leaves out the fuel they need to solve those problems better than the nearest competitor. That fuel is data. Getting that data is a companywide initiative needing sponsors across departments and leadership tiers.
The first place to look is at existing data. In most mid-sized to large companies, even some smaller businesses that have been around for a while, data is everywhere and no one person really has a handle on what the business has in its possession. At one of my clients, a unique dataset was sitting in a spreadsheet that one employee had been responsible for maintaining over the last five years. Others have found goldmines in data collected from their website, marketing surveys, and mobile apps. Unique data is there for the finding once a company's mindset has been changed to understand its value.
Data science and machine learning teams also need a process for sourcing unique data. That frequently crosses departments in the business and it needs a process around it to be successful. I can't tell you how many times during projects I've said, "This would be so much better if I just had the data…" I started building the processes around getting that data after discovering how hard it is to put out the call to other teams throughout the business.
This is one of the many reasons that the machine learning team cannot be siloed. Its members need to be strong communicators and evangelists. That's because we often need marketing to gather new data and web/app development teams to build new functionality for a project requested by the strategy team. If those teams don't have an understanding of the need and a process to follow, requests for data can go into a black hole.
Once these unique datasets have been built and put into production, they need the same protection as any other trade secret. That has two parts to it: protection from theft and protection from inadvertent exposure. Put the dataset in one place and control access to it. Don’t allow multiple teams to keep duplicates in any number of systems. That makes it hard to know which copy is the “master” and turns securing it into a nightmare proposition. Advertise the fact that this dataset is proprietary and key to the company’s competitive advantage to all those with access. I’ve seen cases where a team hands over proprietary datasets to a customer because they had no idea it was of any value. It should be spelled out in confidentiality agreements and new hire training.
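One lightweight way to keep a single "master" copy honest is to register a checksum for the canonical dataset and verify any file that claims to be it. This is a hedged sketch, not a complete security scheme; the dataset contents and registry names are hypothetical.

```python
# Sketch: fingerprint the canonical "master" dataset so tampered or stale
# copies can be detected instead of silently circulating between teams.
import hashlib

def digest(data: bytes) -> str:
    # SHA-256 fingerprint of the dataset's raw bytes.
    return hashlib.sha256(data).hexdigest()

master_copy = b"customer_id,lifetime_value\n17,1042.50\n"  # made-up dataset
registry = {"customer_value_v1": digest(master_copy)}       # kept under access control

def is_authentic(name: str, data: bytes) -> bool:
    # A file claiming to be the dataset must match the registered digest.
    return registry.get(name) == digest(data)

print(is_authentic("customer_value_v1", master_copy))                   # True
print(is_authentic("customer_value_v1", master_copy + b"18,999.99\n"))  # False
```

The registry itself then becomes the one artifact to lock down, which is far easier than chasing duplicates across every team's systems.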
Don't Chase Unicorns
I want to move away from another common machine learning misconception: the all-in-one machine learning engineer. That's the belief that one person needs all the skills to handle a machine learning project from start to finish. With the current talent shortage, companies are struggling to make this paradigm work in a cost-effective way.
It's impractical to source an entire team of unicorns for a few reasons. There are fewer than 20,000 practicing, experienced data scientists globally. Now segment that group by people with significant (five years or more) machine learning experience and you're down to a very small group. It's neither cost-effective nor a good use of time to source a team of rock stars.
The alternative is to build and train the team from within. I’ve seen a couple of key benefits from this approach. The first is domain and cultural expertise. Internal candidates know the business and are experienced in navigating its culture. They’re insiders with the support of colleagues rather than outsiders pushing a new agenda. They know the core business and have firsthand experience with business challenges.
The learning curve can be steep, not to mention expensive, if the talent isn’t sourced correctly. The internal search needs to be structured around a few roles rather than looking to train candidates to fill all roles. The first role is a data science software engineer. This person takes the models and datasets from prototype to production. Their primary skill is software development, typically across multiple programming languages like Python, Java, R, C, C++, Scala, etc. They’re also familiar with mathematical concepts and notations as well as algorithmic programming and libraries. They don’t need to be statistical masters or linear algebra experts but they do need to understand the basics to be a translation layer for the team.
The second role feeds into the data science software engineer. This is the data science analyst. Their core competency is the math and stats concepts that turn data into actionable insights. They are experts in data science with decent programming skills; enough to go from concept to rough prototype in at least one language. They listen for key terms in a business problem that indicate a particular algorithmic approach; optimization problems versus classification problems for example. They can translate that into experiments and then into a proof of concept solution.
The third role ties in closely with the data science analyst, also feeding solutions to the data science software engineer. The machine learning analyst has a core competency around machine learning and deep learning algorithms. This is often grouped in with data science but they are two different areas of study which are realistically found in two different individuals. This person focuses much more time on the research side and less on the prototyping side. While a data science analyst will produce a solution or two a month, the machine learning analyst produces a solution every two or three months. Their focus is on the toughest business challenges; those that aren't handled well by established data science approaches.
Depending on the amount of data stored and processed as well as the number of solutions implemented, a data engineer might also be needed to maintain the infrastructure. This person has a strong background in platforms like Spark, Hadoop, AWS, etc. The team may also need an expert in data governance. This person is an expert in data security, data sourcing, data privacy, and the numerous regulations surrounding data management.
How do these people get trained to successfully transition from their current roles into the new ones? Certifications and online coursework are the best approach. Khan Academy has several offerings to get developers with math backgrounds more familiar with advanced math concepts. There are coding boot camps and online training programs to build software development proficiency. There are a good number of data science and machine learning curricula offered by colleges and universities. These can take a candidate with a few pieces of the puzzle and get them ready to succeed.
I also strongly advise bringing on an expert in the field to provide experience-based guidance to the team. This doesn't have to be a full-time or permanent hire. In some cases, 10 hours a week over the long term, or a 6-month contract, is all the business needs to be ready to go on their own. As the team begins to learn from each other, having an expert there to direct that learning and avoid the pitfalls is worth every penny.
Manage the Research Side of the Process
There are two parts to this. First is realizing how large the role of research is in machine learning initiatives. The second is understanding the research process and how to manage it so it delivers results on budget.
There are a few drivers to the research stage. What data can we get to solve the business problem? What approach will work best given the business problem and available data? (There are other considerations…I’m simplifying this a bit.) What open source tools are out there and how much modification will be necessary to make them work for the need?
All of this will take about a month to complete for a typical business problem. Why? Machine learning is changing daily. Algorithms, open source projects, open data, platforms, and a few other pieces are constantly being introduced and improved upon. What worked yesterday may not be the best solution today. Without the research phase, a business could get stuck with a dated solution. Those are more expensive than they need to be and become a burden on forward progress.
The research process starts with a clear understanding of the business need(s) and all considerations. The closer the team is to the problem, the more effective the solution. Spend time on the constraints. How does the solution have to work? Who will use the solution and what are their requirements? Define acceptable accuracy. Details matter for machine learning projects. Without this first step, machine learning projects inevitably go off the rails.
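The "define acceptable accuracy" and "how does the solution have to work" steps can be captured as an explicit acceptance gate agreed with stakeholders before modeling starts. A minimal sketch, where the thresholds and metric names are hypothetical placeholders for whatever the business actually requires:

```python
# Hypothetical acceptance criteria, agreed with stakeholders up front.
REQUIREMENTS = {
    "min_accuracy": 0.85,    # acceptable accuracy, defined before modeling
    "max_latency_ms": 200,   # how the solution has to work in production
}

def meets_requirements(metrics: dict) -> bool:
    # A prototype ships only when it satisfies every agreed constraint.
    return (metrics["accuracy"] >= REQUIREMENTS["min_accuracy"]
            and metrics["latency_ms"] <= REQUIREMENTS["max_latency_ms"])

prototype = {"accuracy": 0.88, "latency_ms": 150}  # made-up evaluation results
print(meets_requirements(prototype))  # True for these made-up numbers
```

Writing the gate down like this keeps "deploy when the prototype meets the business need" from drifting into "deploy when the team feels it's good enough."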
From a solid understanding of the need comes quality research into a solution. Here is where the drivers get answered: data sourcing, data quality, approach, tools/framework. Typically, I'll research 3 or 4 solutions because the prototyping process is iterative. Build what shows the most promise but have multiple backups if the first approach isn't proven out. Validate every hypothesis even when the results are promising. Assumptions kill machine learning projects. Don't wait for perfect. Deploy when the prototype meets the business need. Improvements can be made in maintenance mode. There's no reason to delay returning value to the business.
Manage the process. Keeping research on track means keeping it close to the business need and keeping it transparent. I recommend weekly meetings between the team, project stakeholders, and users. Talk through the research and, later in the process, demo solutions. Encourage questions from everyone. Let the research be subjected to rigorous vetting. Make the team support every assumption. Look for biases in both the data and the team's approach.
3 Basic Steps
The reason these three steps will keep a business competitive is that so few businesses practice them. Most companies working on machine learning projects think algorithmically, chase unicorns, and let the research side of the process get out of control. This creates an environment where delivering quality solutions is very difficult. A company doing one of these three things well sets itself apart. Doing all three well creates the conditions to stay competitive no matter how quickly the machine learning landscape evolves.
Original post. Reposted with permission.
Bio: Vin Vashishta has built the most trusted brand in data science and machine learning around the concepts of simplicity and profitability. He is followed by Walmart, Accenture, Microsoft, IBM, and many other industry leaders. He has been recognized by Agilience, Dataconomy, and Onalytica as a thought leader in predictive analytics and machine learning.