Industry Predictions: Main AI, Big Data, Data Science Developments in 2017 and Trends for 2018
Here is a treasure trove of analysis and predictions from 17 leading companies in AI, Big Data, Data Science, and Machine Learning: What happened in 2017 and what will 2018 bring?
"What were the main developments in AI, Big Data, Data Science, Machine Learning in 2017 and what are key trends for 2018?"
Some of the main topics that emerged are: AI is the new Big Data, More data moving to the cloud, Hybrid Cloud, Deep Learning Hype, Machine Learning in the Enterprise, AI/ML becoming industry specific, Self-Service BI, Automated Data Science/Machine Learning, Kubernetes, GDPR, Spark, and Streaming Data.
Here are analysis and predictions from 17 firms: Alation, Arcadia Data, AtScale, BlueData, Dataiku, DataStax, IBM Analytics, IBM Cloud, Infogix, Kaggle, KNIME, MathWorks, RapidMiner, Splice Machine, Splunk, StreamSets, and Unravel.
Satyen Sangani, co-founder and CEO of Alation.
- Fear of Cloud Lock-in Will Result in Cloud Sprawl: As CIOs try to diversify investment in their compute providers, inclusive of their own on-premise capabilities, the diversification will result in data, services and algorithms spreading across multiple clouds. Finding information or code within a single cloud is tough enough. The data silos built from multiple clouds will be deep and far apart, pushing the cost of management onto humans that need to understand the infrastructure.
- Microservices Will Result in Macro-confusion: With the proliferation of containers and microservices, the cost of software creation, deployment and infrastructure will further decrease. Which services exist? How are they being used? How do we know if the service is deprecated? Who/what else is using the service?
- Buyers Bias from Buying Dumbed-Down Data Interfaces to Smartening Up Their Workforce: With "simple" business intelligence (BI) and pretty dashboards having been the talk of the BI landscape, organizations are coming to grips with the fact that they still can't trust their data. At scale, with the vast variety, complexity and volume of data, traditional governance methods are failing to get trusted answers to data consumers. Consequently, organizations will shift away from simplistic dashboards toward teaching people to be more data literate, with the best interfaces helping with this challenge.
Steve Wooledge, VP marketing and Dale Kim, Senior Director, products and solutions, at Arcadia Data.
On AI: Artificial intelligence (AI) deserves the same treatment Hadoop and other big data technologies have received lately. If the industry is trying to balance the hype around big data-oriented products, it has to make sure not to overhype the arrival of AI. This is not to suggest that AI has no place in current and future-looking big data projects, just that we are not at a point in time yet where we can reliably turn business decision-making processes over entirely to machines. Instead, in 2018 the industry will begin to modernize BI with machine assistance rather than AI-driven tasks. Think of it as power steering versus self-driving cars. Business users will get more direction on how to gain better insights faster, as they don't need to be told what the right insights are. We're so enamored by the idea of AI, but the reality is it's not ready to act on its own in the context of analyzing data for business users.
On BI: We'll also start to see a shift in which organizations will bring BI to the data. BI and big data have hit a bit of a brick wall. Companies have spent a lot of money on their data infrastructures, but many are left wondering why they have to wait so long for their reports. Part of the problem is that companies are capturing their data in a data lake built on a technology like Hadoop, but they are not taking full advantage of the power of the data lake. Next year and moving forward, we'll start to see more companies bringing the processing to the data, a core tenet of Hadoop and data lakes, with respect to their BI workloads. This will speed the time to insight and improve the ROI companies see on their big data infrastructure investments.
Bruno Aziza, CMO and Josh Klahr, VP of Product, AtScale.
1. As the Business Intelligence space grew by 60% in 2015 and then began to dwindle, AI took off. A recent article in the Harvard Business Review stated that companies are not ready for AI until they have a sound analytics foundation. AI is the natural evolution of the investment that companies have made in Big Data and analytics, and in 2018, companies will need to ensure that they have a strong analytics foundation in order to prepare for AI.
2. In 2018, companies will migrate their Big Data to the cloud. According to AtScale's Data Maturity Survey, 72% of respondents said that they plan on deploying Big Data in the cloud within the next five years.
3. 2018 will be a hybrid world. Though companies will move to the cloud, they won't completely replace their Big Data environments. Some assets will be on-premise forever.
4. In 2017, enterprises' Centers of Excellence were full-service, whereas in 2018 they will become centers of enablement. In the past, business users would submit their requirements and receive reports; now they can serve themselves with data that scales but is also governed. COEs will be like salad bars - you can help yourself and create your own salads.
Anant Chintamaneni, VP of Product, BlueData.
Kubernetes has won the container orchestration wars - it's clearly the de facto standard for stateless applications (such as web servers) and microservices. But what about Big Data and stateful applications? Over the next year, Kubernetes will address challenges in using the platform for long-running, distributed, multi-service Big Data applications, including persistent storage, security, performance, and several other operational requirements. Big Data applications break the typical assumptions of container orchestration. Kubernetes will address these issues over the next 12-24 months as it continues to see growing adoption.
Florian Douetteau, CEO at Dataiku
In 2017 Data Governance Took Center Stage: Unfortunately, data breaches are becoming more common than most of us are comfortable with, most notably the Equifax disaster. Pushing data governance even further upstage was the passing of the EU General Data Protection Regulation (GDPR), which enterprises grappled with throughout 2017 (and will continue to face in 2018).
For 2018, I predict:
1. Data Team Manager Will Become a Specialty - As data teams begin to get more organized and more robust, they will begin to get broken down into specialties (such as data analyst, data scientist, data engineer, data ops, etc.). With this continued specialization, the role of the project manager or team leader (which so far hasn't been as prominent in the data space as in other teams across the enterprise) gains more importance.
2. Automated ML Will Become a Commodity - Automated machine learning (ML), i.e., the ability to automatically search in the feature transformation and model space, will become a commodity and is already leveraged by most of the software toolkits available. With this transformation, data science will become less about framework expertise and more about data management and successfully articulating business needs.
3. Sales Bots Will Start to Work Thanks to ML and Global Machine Conversation Libraries - Bot systems, especially in business-to-customer transactions, are (more often than not) hard-coded sets of rules. In 2018, these systems will evolve thanks to machine learning and the commoditization of machine learning frameworks trained on real human-to-human conversation.
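To make the second prediction concrete, here is a minimal sketch of what "automatically searching the feature transformation and model space" can look like. It is an illustration only, not how any particular toolkit works: the toy data, the three candidate transforms, and the two candidate models are all assumptions made up for this example.

```python
import math
from itertools import product

# Hypothetical toy data: y depends on x quadratically.
xs = [float(i) for i in range(1, 21)]
ys = [3.0 * x * x + 2.0 for x in xs]

# Simple holdout split: even positions train, odd positions validate.
pairs = list(zip(xs, ys))
train = pairs[0::2]
valid = pairs[1::2]

# Candidate feature transformations (assumed for illustration).
TRANSFORMS = {
    "identity": lambda x: x,
    "square": lambda x: x * x,
    "log": math.log,
}

def fit_mean(points):
    """Baseline model: always predict the mean target."""
    mean_y = sum(y for _, y in points) / len(points)
    return lambda f: mean_y

def fit_linear(points):
    """One-feature least-squares line."""
    n = len(points)
    mean_f = sum(f for f, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var = sum((f - mean_f) ** 2 for f, _ in points)
    cov = sum((f - mean_f) * (y - mean_y) for f, y in points)
    slope = cov / var
    intercept = mean_y - slope * mean_f
    return lambda f: slope * f + intercept

# Candidate models (assumed for illustration).
MODELS = {"mean": fit_mean, "linear": fit_linear}

def validation_mse(transform, fit):
    """Fit on transformed training data, score on the validation split."""
    model = fit([(transform(x), y) for x, y in train])
    errors = [(model(transform(x)) - y) ** 2 for x, y in valid]
    return sum(errors) / len(errors)

def auto_search():
    """Try every (transform, model) pair and keep the best validation score."""
    scores = {
        (t_name, m_name): validation_mse(t, fit)
        for (t_name, t), (m_name, fit) in product(TRANSFORMS.items(), MODELS.items())
    }
    best = min(scores, key=scores.get)
    return best, scores

best_pipeline, all_scores = auto_search()
print(best_pipeline, all_scores[best_pipeline])
```

The search itself is a few lines; deciding which data to feed it and which business question the score should answer is the part that remains manual, which is the point of the prediction.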
Patrick McFadin, Vice President Developer Relations, DataStax
What should you be doing in 2018 - New Year's Resolutions for Data Scientists
Resolution #1 - Be prepared for suspicion around AI
Today, AI is the flavor of the month. Areas like Artificial Intelligence and machine learning are being pitched as helping to improve performance within applications, while deep learning is also growing in interest.
However, while this hype might open up budgets for the future, it should not be a surprise that the hype will lead to over-inflated expectations. Prepare for this, build concrete models and business cases together, and you can avoid the hype crash and the suspicion that will come with it.
Resolution #2 - Get familiar with streaming and translytics alongside more traditional batch processes
For some use cases, traditional batch-style analytics runs will be the best fit. For others, analytics that run while a transaction is taking place will be required.
At its simplest, streaming analytics works on items that meet specific conditions: as events take place, they are analyzed immediately. For companies that want to work on data at scale, hundreds or thousands of actions might take place every second that all need to be analyzed when they happen. Batch processing simply can't keep up with the volume.
Alongside this, there's a new category Forrester is calling translytics. This covers how to make use of your operational data by analyzing it at the time it is created. The end result should be similar output to streaming analytics. Knowing when to implement stream processing, when to choose a translytics database and when to use batch will be important.
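The difference between batch and streaming can be sketched in a few lines. This is a toy illustration under stated assumptions, not a real streaming framework: the event list, the `THRESHOLD` condition, and the function names are all hypothetical.

```python
# Hypothetical events: (user, amount) pairs arriving over time.
EVENTS = [
    ("ann", 20.0),
    ("bob", 950.0),
    ("ann", 15.0),
    ("cat", 1200.0),
    ("bob", 30.0),
]
THRESHOLD = 500.0  # assumed condition: flag any amount above this value

consumed = []  # records which events the source has emitted so far

def event_source():
    """Emit events one at a time, as a live feed would."""
    for event in EVENTS:
        consumed.append(event)
        yield event

def batch_flag(events):
    """Batch style: collect the full dataset first, then analyze it."""
    collected = list(events)
    return [event for event in collected if event[1] > THRESHOLD]

def stream_flag(events):
    """Streaming style: test each event on arrival and emit matches at once."""
    for event in events:
        if event[1] > THRESHOLD:
            yield event

alerts = stream_flag(event_source())
first_alert = next(alerts)                  # available before the feed ends
events_seen_at_first_alert = len(consumed)  # only part of the feed was read
remaining_alerts = list(alerts)
batch_alerts = batch_flag(EVENTS)

print(first_alert, events_seen_at_first_alert)
print(batch_alerts == [first_alert] + remaining_alerts)
```

Both approaches flag the same events; what differs is when the answer is available. The streaming version produces its first alert after reading only part of the feed, while the batch version cannot answer until it has everything.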
Resolution #3 - Plan ahead on who you trust to work with in order to avoid lock-in
You can't do everything yourself these days. You can build and run your own data center, or use a cloud provider, or both. For enterprises, moving over to public cloud does offer reduced capital cost but can also lead to more expensive operational costs over time.
However, one of the biggest issues is what you might be able to do if and when you want to move to another provider. Are there costs for taking your data out of a cloud provider's clutches? Or is there a simple migration path available? What services are exclusive to only one provider?
It's impossible to avoid "lock-in" - you have to work with someone. However, it is up to you whom you choose to build on, and you have a choice of whom to bet on as a long-term partner.
Dinesh Nirmal, Vice President, Analytics Development, IBM.
Machine Learning will continue to make inroads into the Enterprise - There is serious work being done with ML in the enterprise, but it's not as sophisticated as what we see in the new-age ML applications. So, while we may not see the equivalent of a self-driving car in the enterprise, ML will make serious inroads in finance, manufacturing, healthcare, and several other industries. We will also see ML used increasingly to automate mundane tasks of the data center and data management, in general. Heavy time- and resource-consuming tasks, like data matching and metadata creation, will be automated with greater frequency using ML next year, dramatically freeing up administrators to do more core data center work.
Natural Language interfaces will become more commonplace (and less frustrating) - Beyond voice-activated search assistants already in the marketplace, over the next year we'll see natural language interfaces integrated into more applications. And they'll work better too!
Jason McGee, IBM Fellow and VP, IBM Cloud.
Reaching a tipping point in maturity: containers, Kubernetes, and serverless
Microservices architectures built on containers and serverless computing have revolutionized the speed at which apps can be built and how they connect into the most competitive technologies today: AI, blockchain and machine learning. In 2018, we will see the adoption of these technologies reach a tipping point. They will move from early adoption to becoming the de facto standard for complex and production-ready apps across industries and companies of all sizes.
This shift is being driven by new tools that emerged in 2017 -- like Grafeas, Istio and Composer -- that enable developers to more securely manage and coordinate the many moving parts created by building with containers, serverless and microservices. These tools are enabling greater visibility for the developer including who is working with data, what's being changed, and who has access, leading to better security. The result will be an uptick in the development of mature apps that can span and operate across multiple systems, teams and data streams.
Emily Washington, SVP of Product Management at Infogix.
In 2017 we saw Big Data become the norm as many organizations adopted some sort of big data environment. In response, we've seen an increase in the adoption of self-service data preparation tools which enable organizations to prepare data regardless of the data type. These tools allow them to leverage their Big Data to better understand their customers and deliver an improved customer experience. In addition, organizations are now applying machine learning, artificial intelligence, and advanced analytics to use cases beyond customer behavior and financial forecasting. Because of this, we saw many technologies incorporate machine learning into their solutions.
We expect this trend to continue into 2018, and we will continue to see the convergence of a broad range of data management technologies, such as data quality, analytics, governance, and metadata management. Extracting meaningful insights and increasing operational efficacy requires integrated tools that enable users to quickly ingest, prepare, analyze, act on, and govern data. We also expect to see increased importance placed on data governance. Regulatory pressures are increasing, data is constantly being amassed, it is more critical than ever to communicate accurately and effectively with customers, teams have greater access to data within an organization, and leveraging advanced analytics is a must, all of which makes data governance crucial.
Anthony Goldbloom, Kaggle.
I'm excited about Kaggle's public data platform (www.kaggle.com/datasets). It's actually overtaking competitions as the main driver of activity on Kaggle. We now have over 6000 datasets on most topics a machine learner or data scientist could care about. Historically the UC Irvine data repository has been a valuable resource for the data science and ML community. Kaggle's public data platform supercharges that.
Michael Berthold, KNIME
2017 has seen the arrival of Big Data in the real world. Some of the early hype has cooled down and we see less of it but what we do see is much more serious and puts big data to real use.
The same still waits to happen for Deep Learning but so far 2017 has spent much of its energy on creating a mess when it comes to terminology. Many of the younger people now confuse Machine Learning with Deep Learning and fall into the same trap that we stumbled through in the '90s, believing that Neural Networks will solve all data problems. In 2018 I see this trend continuing for a little longer.
Behind all that buzz a lot of people will still struggle with the classic issues: automated deployment of analytical results as well as monitoring and management of many thousands of predictive models. Especially for the latter I see a lot of progress next year. Part of the management push is also automating parameter sweeps; the guys from H2O have done some interesting work on this front and I am looking forward to seeing slightly more guided versions of this showing up on our end...
Data Analysts, particularly in Europe, will have to face data privacy issues and build analytical applications that can explain their decisions - an interesting challenge for those Deep Learning folks.
Seth DeLand, product marketing manager for data analytics group, MathWorks
Trend: Machine Learning and Deep Learning
- As it becomes easier and easier to apply machine learning techniques, more products and services will incorporate machine learning models. Embedded systems, typically used for controls and diagnostics, will incorporate machine learning models that can detect previously unobservable phenomena (e.g., detecting a driver's style of driving, or classifying whether a machine is likely to break down or not). In 2018, we'll continue to see machine learning models being incorporated in new places, especially in edge nodes and embedded processors.
- While deep learning continues to look promising, there is still a lot of design and tuning necessary to train a useful deep network. Techniques such as automated hyperparameter tuning appear well-positioned to reduce this work, which should ramp up the pace of adoption of deep learning.
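As a rough illustration of automated hyperparameter tuning, the sketch below sweeps a small grid of learning rates for a one-parameter gradient-descent fit and keeps the one with the lowest final loss. It is a deliberately tiny stand-in, not a deep network: the data, the candidate grid, and the step count are assumptions chosen for the example, and real tuners use smarter strategies than an exhaustive grid.

```python
# Hypothetical data generated by y = 2 * x; the model is y_hat = w * x.
XS = [1.0, 2.0, 3.0, 4.0]
YS = [2.0, 4.0, 6.0, 8.0]

def loss(w):
    """Mean squared error of the one-parameter model."""
    return sum((w * x - y) ** 2 for x, y in zip(XS, YS)) / len(XS)

def train(learning_rate, steps=20):
    """Plain gradient descent from w = 0; returns the final loss."""
    w = 0.0
    for _ in range(steps):
        grad = sum(2.0 * (w * x - y) * x for x, y in zip(XS, YS)) / len(XS)
        w -= learning_rate * grad
    return loss(w)

# The hyperparameter grid to sweep (assumed values).
CANDIDATE_RATES = [0.0001, 0.001, 0.01, 0.1, 1.0]

def sweep(candidates):
    """Train once per candidate and return the best rate plus all results."""
    results = {rate: train(rate) for rate in candidates}
    best_rate = min(results, key=results.get)
    return best_rate, results

best_rate, results = sweep(CANDIDATE_RATES)
print(best_rate, results[best_rate])
```

On this toy problem the smallest rates barely move the weight in 20 steps and the largest one diverges, so the sweep settles on a value in between without anyone hand-tuning it.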
Ingo Mierswa, Founder & President, RapidMiner
The need for automation in model building will continue, but it will go beyond mindless number crunching. To make the automated models more relevant, practitioners will need better ways to define their background knowledge about use cases and data to get meaningful models.
Many warn about the dangers of Pure AI. AI will get a reality check in the next year. Practical AI will rise and bring together all necessary components. Machine learning will remain at the core, but knowledge management, optimization, planning, and communication will be integrated with ML and made available as services. This will lead to more integration of ML and AI into business processes and automated decision making, mostly driven by IoT applications.
Machine learning starts to learn with context, i.e. the algorithms will make more use of their memory about previous situations and decisions. This will solve some of the problems with stream-based machine learning, which currently tends to forget all data after it was seen once.
The deep learning hype of course will continue, especially around unsupervised and generative learning. We will need to find more high-value use cases beyond image, audio, or video analysis or the hype will start to fade soon again.
Finally, new international standards on handling personal data (GDPR) will require more model understandability and the explanation of decisions. This will pose new challenges for automation and deep learning. Interpretable models and trails for model-based decisions will become standard practices.
Monte Zweben, CEO of Splice Machine
- Online Predictive Processing (OLPP) emerges as a new approach to combining OLTP, OLAP, streaming, and machine learning in one platform
- AI is the new Big Data: Companies race to do it whether they know they need it or not
- The Hadoop era of disillusionment hits full stride, with many companies drowning in their data lakes, unable to get an ROI because of the complexity of duct-taping Hadoop-based compute engines
- SQL is reborn as many companies realize their Hadoop-based data lakes need traditional database operations, such as in-place record updates and indexes to power applications
- The state of the art for OLPP databases will be indexed by rows for fast access and updates, but stored in columnar encodings for massive storage savings and scan speeds for analytics
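The last bullet describes a layout that is easy to sketch: values live in per-column arrays so analytic scans touch only one column, while a row index maps each key to its position so single-record reads and in-place updates stay fast. The class below is a toy illustration under those assumptions. The name `HybridTable` and its methods are hypothetical, it applies no compression or columnar encoding, and it is not how Splice Machine or any real database is implemented.

```python
class HybridTable:
    """Toy table: columnar storage plus a primary-key row index."""

    def __init__(self, column_names):
        # One list per column, so a scan reads a single contiguous list.
        self.columns = {name: [] for name in column_names}
        # Primary key -> row position, so point lookups avoid scanning.
        self.row_index = {}

    def insert(self, key, **values):
        if key in self.row_index:
            raise KeyError(f"duplicate key: {key!r}")
        self.row_index[key] = len(self.row_index)
        for name, column in self.columns.items():
            column.append(values[name])

    def update(self, key, **values):
        """In-place record update located through the row index."""
        position = self.row_index[key]
        for name, value in values.items():
            self.columns[name][position] = value

    def get(self, key):
        """Reassemble one record from the column lists."""
        position = self.row_index[key]
        return {name: column[position] for name, column in self.columns.items()}

    def column_sum(self, name):
        """Analytic scan over a single column."""
        return sum(self.columns[name])


orders = HybridTable(["customer", "amount"])
orders.insert(1, customer="ann", amount=20)
orders.insert(2, customer="bob", amount=35)
orders.insert(3, customer="cat", amount=50)
orders.update(2, amount=40)  # transactional-style update in place

print(orders.get(2))
print(orders.column_sum("amount"))
```

The update and the lookup go through the index, while the sum reads only the `amount` column and never touches `customer`, which is the access pattern split the bullet describes.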
Toufic Boubez, VP of Engineering on AI and machine learning, Splunk.
- The buzz stops here: Artificial intelligence and machine learning are often misunderstood and misused terms. Many startups and larger technology companies attempt to boost their appeal by forcing an association with these phrases. Well, the buzz will have to stop in 2018. This will be the year we begin to demand substance to justify claims of anything that's capable of using data to predict any outcome of any relevance for business, IT or security. While 2018 will not be the year when AI capabilities mature to match human skills and capacity, AI using machine learning will increasingly help organizations make decisions on massive amounts of data that otherwise would be difficult for us to make sense of.
- AI and ML become industry-specific: AI using machine learning will increasingly provide financial services organizations with the ability to recognize fraud, identify anomalies in user behaviors, and suggest precise steps customers can take to mitigate these threats. Also, the rise of computational journalism will significantly impact the trajectory of the media industry across the U.S. and throughout the world. In 2018, we will see more and more journalists work collaboratively with data scientists, just as they are doing at the Pulitzer-nominated Atlanta Journal-Constitution. Journalists will turn to experts in AI, machine learning and natural language processing (NLP) to discover newsworthy stories with maximum relevance for local, national and global audiences - shining a light on issues that might never have been discovered previously.
- AI and ML go mainstream in B2B: Increased access to voluminous real-time data carries the additional burden of identifying relevant signals in a noisy sea of information. Whether it's predicting and preventing a critical IT infrastructure outage, or identifying a single unwanted user in traffic of millions, these are among the most crucial and requested AI and machine learning capabilities. Removing mundane tasks, and empowering machines to learn on their own, holds promise for increased innovation, productivity and workplace satisfaction.
Clarke Patterson, head of product marketing, StreamSets
In 2017 we saw confusion in the stream processing market with regard to which stream processing framework to use. Apache Flink, Spark Streaming, Kafka Streams and other alternatives emerged, all of which on the surface offer similar capabilities. Businesses that use these frameworks are scratching their heads as to which one to use, wondering whether a clear leader will emerge. The net result is an unwanted side-effect: 'solution sprawl' and a lack of oversight and control over ingested data.
In 2018, expect more of the same, although a leader may emerge. Initial confusion will shift to standardization with most companies picking their favorite. While Spark Streaming appears to be the lead horse, expect sprawl due to residue left from prior investments and multiple frameworks persisting across the business. Fortunately, businesses can use multiple frameworks without worrying about losing control over their data by selecting a data operations platform that includes a living data map with auto-updating capabilities. This allows application of continuous integration and continuous deployment methods for stream processing within data flows.
Kunal Agarwal, CEO, Unravel
My first prediction is that the enterprise will focus on mission-critical big data applications rather than technologies. In the past, people were focused on learning the various big data technologies: Hadoop, Spark, Kafka, Cassandra, etc. It took time for users to understand, differentiate, and ultimately deploy them. There was a lot of debate and plenty of hype. Now that organizations have cut through the noise and figured all that out, they're concerned about actually putting their data to use.
Take the recommendation engine for example, a critical app for almost all web companies. Consider Netflix: Their recommendation engine isn't just a nice add-on that enhances the user experience, it's absolutely fundamental to the experience and to Netflix's bottom line. Their platform depends on the ability to accurately suggest relevant movies and TV shows to people - otherwise, it'd be almost impossible for viewers to dig through their enormous library.
Netflix, like the enterprise in general, doesn't really care about the technology being used. It's not important which distribution, database or analytics they're using; what matters is the result. The enterprise has realized this and we can expect to see an increased adoption of an application-centric approach to big data in the coming year.