Key Data Science, Machine Learning, AI and Analytics Developments of 2022
It's the end of the year, and so it's time for KDnuggets to assemble a team of experts and get to the bottom of what the most important data science, machine learning, AI and analytics developments of 2022 were.
Photo by Artturi Jalli on Unsplash
The end of the year is imminently upon us, and just as we have done for a long time, KDnuggets has reached out to a varied cast of experts to solicit their input on the following question:
What do you think were the key data science, machine learning, AI and analytics developments of 2022?
We asked participants for approximately 400 words on the topics, and have collected the responses below. The responses are presented mostly in alphabetical order. We have held back KDnuggets input and a special guest response to the end. Also note that some of the responses deviate slightly into the realm of predictions for 2023, but they have also been included for your benefit.
Enjoy the insights provided below, and wishing everyone a successful 2023.
Digital Twins Get Physical: We will see large-scale digital twins of physical processes that are complex and multi-scale, such as weather and climate models, seismic phenomena and material properties. This will accelerate current scientific simulations as much as a million-x, and enable new scientific insights and discoveries.
Generalist AI Agents: AI agents will solve open-ended tasks with natural language instructions and large-scale reinforcement learning, while harnessing foundation models — those large AI models trained on a vast quantity of unlabeled data at scale — to enable agents that can parse any type of request and adapt to new types of questions over time.
These predictions come from here, with permission.
2022 was a tumultuous year for the global economy. The world of Data Science was directly impacted as organizations began to emphasize saving money and “doing more with less.” These changes created two significant trends. First, data teams began increasing the reuse of machine learning (ML) features, and second, new ML technology tools that optimize the ML development processes became widespread.
The increased interest in reusing features gave greater importance to investments in feature stores. But in turn, this increased interest in feature stores created a new problem for organizations — specifically, how to feed these feature stores with increased volumes without hiring an army of data scientists and data engineers. The desire to maximize the usefulness of feature stores inevitably led to a demand for new tools and platforms that automate and optimize feature discovery and feature engineering processes.
The agenda to “do more with less” will continue to be a top priority as we enter 2023. We expect this trend to drive growth in selected investments for technologies that help automate processes for the analysis and use of data — whether it’s for the development of ML models or advanced analytics applications. Platforms that help data scientists and engineers do their jobs better, faster, and with less assistance from subject-matter experts will continue to be in high demand.
One area in which we have witnessed progress this year is Responsible AI. 2023 should be the year in which we see accelerated governance and adoption, along with concrete frameworks, to deliver on what should be our united goal: responsible AI becoming the norm for every organization.
Data Science: Applying Vector Databases for Vector Similarity Search
In data science over the past 2-3 years, I believe the most exciting developments have been the abundance of open-source, large pre-trained deep learning models, and the way the embeddings generated by these models, through frameworks and repositories such as TensorFlow and Hugging Face (which recently raised $100M), are leveraged for a variety of downstream ML use cases with minimal fine-tuning.
The major new development in the past year / 2022 is the ability to apply these models, frameworks and embeddings also for real-time use cases such as recommendations or sentiment analysis, using vector databases for vector similarity search with filtered searches. This not only extends the application of embeddings to a wide variety of real-time use cases, it also extends its accessibility to software developers who are not data science or machine learning experts. It enables developers to enrich any application with AI, with a few lines of code, abstracting the complexity from the developers, helping democratize those giant power hungry and data intensive deep learning models.
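As a rough illustration of what "vector similarity search with filtered searches" means, here is a minimal sketch in plain Python. It is not how any particular vector database works internally: real systems use approximate nearest-neighbor indexes rather than a full scan, and the item names, categories, and three-dimensional embeddings below are made up for the example.

```python
import math
from typing import Dict, List, Optional, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filtered_search(
    index: List[Dict],
    query: Vector,
    k: int = 2,
    category: Optional[str] = None,
) -> List[Tuple[str, float]]:
    """Return the k items most similar to `query`, optionally restricted to one category."""
    # Metadata filter first, then similarity ranking over the survivors.
    candidates = [
        item for item in index
        if category is None or item["category"] == category
    ]
    scored = [
        (item["id"], cosine_similarity(query, item["embedding"]))
        for item in candidates
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy index: in practice the embeddings come from a pre-trained model.
INDEX = [
    {"id": "running-shoes", "category": "sports", "embedding": [0.9, 0.1, 0.0]},
    {"id": "trail-shoes", "category": "sports", "embedding": [0.8, 0.2, 0.1]},
    {"id": "dress-shoes", "category": "formal", "embedding": [0.7, 0.1, 0.6]},
    {"id": "necktie", "category": "formal", "embedding": [0.0, 0.2, 0.9]},
]

query_embedding = [1.0, 0.1, 0.0]
top_sports = filtered_search(INDEX, query_embedding, k=2, category="sports")
top_any = filtered_search(INDEX, query_embedding, k=3)
```

The application developer only supplies a query embedding and a filter; the ranking logic stays behind the search call, which is the abstraction the paragraph above describes.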
In the past year we have seen technology companies, open source libraries, venture capitalists and startups jumping on this opportunity. Vector database technology is starting to mature, and benchmarks are being published, such as the recent one from Jina AI comparing the different technologies on one million vector embeddings. These benchmarks emphasize latency and throughput while maintaining high accuracy.
Machine Learning: Emergence of Enterprise-ready Feature Stores
For machine learning engineering, the most exciting development in the past 2-3 years I believe is the emergence of Machine Learning Operations or MLOps and the important role of feature stores for machine learning as the cornerstone of these platforms.
The major new development in the past year / 2022 in this domain is the increased maturity of feature stores, with the introduction of enterprise-ready feature stores covering many types of real-time AI/ML use cases across open source, commercial, and DIY (build-your-own) feature stores. Some notable examples: LinkedIn open sourced Feathr, its battle-tested, scalable feature store, which enables feature engineering and computation in the feature store itself. Tecton, a commercial feature store that recently raised $100M and became an MLOps unicorn, added support for Redis Enterprise as its online store to accommodate low-latency or high-throughput use cases, supporting streaming and real-time features. And finally, companies such as iFood and Gojek, which built their own feature stores several years ago, have now upgraded their online feature store from the open source Redis in-memory database to its enterprise-ready version, supporting low-latency and high-scale use cases.
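For readers new to the idea, the core pattern of a feature store can be sketched in a few lines: feature definitions are registered once, computed values are cached in an online store for fast lookups, and several models reuse them. The class and method names below (`TinyFeatureStore`, `register`, `materialize`, `lookup`) are hypothetical and do not correspond to the API of Feathr, Tecton, or any other product; a plain dictionary stands in for the low-latency online store.

```python
from typing import Callable, Dict, List


class TinyFeatureStore:
    """Toy feature store: definitions are registered once and served from memory."""

    def __init__(self) -> None:
        # Feature name -> function that computes the feature from raw data.
        self._definitions: Dict[str, Callable[[dict], float]] = {}
        # Entity id -> {feature name -> value}; stands in for the online store.
        self._online: Dict[str, Dict[str, float]] = {}

    def register(self, name: str, fn: Callable[[dict], float]) -> None:
        """Register a feature definition under a shared name."""
        self._definitions[name] = fn

    def materialize(self, entity_id: str, raw: dict) -> None:
        """Compute every registered feature for one entity and cache the values."""
        self._online[entity_id] = {
            name: fn(raw) for name, fn in self._definitions.items()
        }

    def lookup(self, entity_id: str, names: List[str]) -> Dict[str, float]:
        """Serve precomputed feature values for one entity."""
        row = self._online[entity_id]
        return {name: row[name] for name in names}


store = TinyFeatureStore()
store.register("order_count", lambda raw: float(len(raw["orders"])))
store.register("avg_order_value", lambda raw: sum(raw["orders"]) / len(raw["orders"]))

# Hypothetical raw data for one customer.
store.materialize("customer-42", {"orders": [20.0, 30.0, 40.0]})

# Two different (imaginary) models reuse the same feature definitions.
churn_inputs = store.lookup("customer-42", ["order_count", "avg_order_value"])
upsell_inputs = store.lookup("customer-42", ["avg_order_value"])
```

Production feature stores add what this sketch leaves out: streaming ingestion, point-in-time correctness for training data, and a networked online store such as Redis.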
The most important language to learn isn’t Python; it’s SQL. Databases of all sizes are on a tear. Many workloads are moving to the cloud (and powerful cloud data warehouses in particular), finally reaching a tipping point as a combination of features and price make it difficult for any company to hold out. And when data is available locally, new in-memory databases like DuckDB make it possible to use advanced, SQL-based query engines from a laptop, from a serverless function, even from the browser itself. These ubiquitous SQL-based tools are crowding out yesterday’s heavily scripted approaches to data manipulation because they empower users to work with data where it sits, rather than have to extract it, manipulate it, and re-insert it.
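To make the contrast concrete, here is a small sketch that computes the same per-customer totals twice: once by pulling every row into application code, and once by letting the SQL engine aggregate where the data sits. It uses Python's built-in `sqlite3` module purely as a stand-in for an embedded SQL engine, because it ships with Python; it is not DuckDB, and the `orders` table and its contents are invented for the example.

```python
import sqlite3

# In-memory database standing in for an embedded SQL engine (assumption:
# sqlite3 is used here only because it is part of the standard library).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0), ("carol", 7.5), ("bob", 2.5)],
)

# Scripted approach: extract every row, then aggregate in application code.
totals_scripted = {}
for customer, amount in conn.execute("SELECT customer, amount FROM orders"):
    totals_scripted[customer] = totals_scripted.get(customer, 0.0) + amount

# SQL approach: the engine aggregates in place and returns only the result.
totals_sql = dict(
    conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
    ).fetchall()
)
conn.close()
```

Both produce the same totals, but the second version moves one row per customer out of the engine instead of one row per order, which is where the savings come from as tables grow.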
Charles Martin is Founder at Calculation Consulting, an AI Specialist and Distinguished Engineer in NLP & Search, and the Inventor of weightwatcher.ai. Charles touches on a few developments that definitely affect lots of businesses in their AI aspirations.
ML and AI are just about everywhere now. I have received many inquiries this past year about developing ML and AI products for customers, and a common story I hear is, "We have many models running in production, but we don't really understand why they work, when they will break, and how to fix them."
For many companies, their ML/ AI deployments have evolved into incredibly complex systems and the winners will not be the ones who build the best and most accurate models, but the ones who can manage this complexity.
In particular, while it has become much much easier to build and deploy ML models, managing the underlying data has become much much harder. Data quality, access, and governance remain deep challenges for companies looking to leverage ML and AI.
In my experience, the key challenge facing data governance is what I call "Data Quality Mismatch." Contrary to common belief, and to the never-ending complaints from data scientists, the data is not simply of low quality; it is just not of high enough quality for the ML or AI product they are trying to build and maintain today. Many companies are trying to build ML and AI solutions for more complex products, using data collected from older, existing, and simpler products. Consequently, the quality of this older data is only good enough for the product it was originally designed for. For example, data cannot simply be moved from a low-quality reporting product and reused in a high-performance ML product without expecting to find numerous and exhausting data quality issues, issues that had no material impact on the original reporting system.
Implementing machine learning successfully in a company requires careful planning and management of the data to ensure that it is high-quality, accurate, and protected. It is important for companies to carefully manage their data and work with experienced machine learning experts to identify and address any problems with their models.
Because I see so many problems in production ML and AI models in my own consulting practice, I have been researching and developing tools, like the open-source weightwatcher tool, which can help companies detect unexpected and nearly undetectable problems in their AI models, without even needing access to test or training data. Check it out here.
Spatial analytics was once dominated by public-sector applications, given their ubiquitous access to the data, but in 2022 the commercial sector saw dozens of spatial analytics applications move into production.
The cost of sensors and devices that generate geospatial data is falling rapidly, and such devices are proliferating accordingly. The cost of location-enabled chips for cellular connectivity has declined by 70% over the past 6 years. Costs of launching a satellite have fallen sharply over the past decade on a per-kilogram basis, meaning more data-collecting satellite launches. The expansion of 5G networks is aiding in the collection of greater volumes of geospatial data. The result is that in 2022, connected devices capable of sharing their location generated over 15 zettabytes of data, making location-enriched sensor data the fastest growing kind of data in the world.
Spatio-temporal databases matured and increased their presence in the cloud at the start of this decade, providing data scientists with scalable tools to fuse data sets on geospatial dimensions (i.e., joining a long/lat to a polygon), and track and analyze objects in motion. Spatial analysis in particular is extremely compute intensive, which has historically limited the amount of data that can be processed or required exotic and expensive GPU architectures that were out of reach for most organizations. Recent advances in query vectorization (aka data level parallelism) have significantly improved the efficiency of windowing functions, derived columns, and predicate joins essential to advanced spatio-temporal analytics at scale.
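As an illustration of "joining a long/lat to a polygon," the sketch below assigns points to the zone that contains them using the classic ray-casting test. It is a plain-Python sketch, not how a spatio-temporal database implements the join: it treats longitude and latitude as flat planar coordinates, scans every zone for every point, and uses made-up zone and vehicle names. The per-point, per-edge arithmetic in the inner loop is the kind of work that vectorized engines run over many rows at once.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # (longitude, latitude), treated as planar x/y


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray casting: a point is inside if a ray to its right crosses an odd number of edges."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def spatial_join(
    points: Dict[str, Point],
    zones: Dict[str, List[Point]],
) -> Dict[str, Optional[str]]:
    """Map each point id to the first zone containing it, or None if no zone does."""
    result: Dict[str, Optional[str]] = {}
    for point_id, point in points.items():
        result[point_id] = next(
            (name for name, polygon in zones.items() if point_in_polygon(point, polygon)),
            None,
        )
    return result


# Hypothetical zones and vehicle positions.
ZONES = {
    "downtown": [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)],
    "harbor": [(5.0, 0.0), (9.0, 0.0), (7.0, 4.0)],
}
POSITIONS = {
    "truck-1": (2.0, 2.0),
    "truck-2": (7.0, 1.0),
    "truck-3": (10.0, 10.0),
}

assignments = spatial_join(POSITIONS, ZONES)
```

At real scale, the nested loop over points and zones is replaced by spatial indexes and batched computation, but the containment test at the core of the join is the same idea.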
Last year saw innovators across industries take advantage of the unique opportunities stemming from real-time spatial data. Real-time streams of breadcrumb data from every connected Ford F-150 pickup are now fused on spatial dimensions with roads, charging stations, weather events, traffic data and other sources, resulting in new in-car services. At Liberty Mutual, real-time streams of weather events fused on spatial dimensions to building footprints are used to estimate liability during catastrophic weather events, resulting in more accurate and timely claims management. As the energy sector moved quickly in 2022 to respond to supply shortages, companies like SM Energy are fusing drill bit sensor readings with geological readings at a scale that was previously intractable, resulting in faster well drilling at lower costs. T-Mobile extended its lead as the fastest 5G network as measured by download speeds (per OpenSignal) by fusing cell phone signals on spatial dimensions with buildings and roads to detect and address 5G coverage weak spots, resulting in better cell service for customers.
In 2022, we witnessed an ever growing, cross-industry and robust adoption of AI algorithms and data science techniques, both among data citizens as well as in large organizations.
Traditional one-man businesses such as physicians, teachers, accountants, consultants, auditors, lawyers, and many other professional figures understood the value of data and embraced a data-informed culture to innovate and stay competitive. To close the gap, reduce upskilling costs (idle time & money), and leverage the power of AI-driven solutions, the adoption of low-code/no-code data science platforms became more prominent than ever before. Through intuitive visual interfaces, these platforms enable data citizens to build data workflows and collaborate with software experts and data scientists, triggering a positive snowball effect on their job performance.
On the other side of the spectrum, organizations that started the transition to data-driven decision making in the previous years consolidated and expanded the implementation of advanced AI solutions in 2022. With AI maturity came the need for reliable deployment, greater development agility, and improved operational efficiency, triggering the adoption of CI/CD procedures and best practices. To automatically productionize data science for constant development, testing, integration, deployment, monitoring, and versioning, AI-mature organizations increasingly relied on intuitive SaaS technology to strengthen team collaboration, minimize IT bottlenecks with centralized administration and data governance, and scale to empower any number of users, running any number of workflows in a single environment.
The higher complexity and diversification of data science operations sharpened the standardization of roles in the analytics industry. In 2022, organizations abandoned the inherently erroneous concept of the “unicorn data scientist” and moved towards different data professional figures and standardized data roles. Data curators, data engineers, data scientists, data analysts, automation specialists, ML engineers - just to name a few - became recognized job titles, each with a specific education background and set of skills.
Finally, the AI maturity of organizations and the arrival of data citizens on the scene have encouraged a number of data literacy initiatives. Following this need and considering the post-covid scenario, the number of courses, events, books, videos, learnathons and other initiatives surged, especially in the second half of 2022, to keep up with the demand.
2023 will be a year of acceleration for the operationalisation of widespread usage of analytics and ML in all functions of enterprises.
For years, early adopters have already been building out systems to automate a host of mundane tasks and to focus on higher-value activities: this has included everything from financial reporting, to data cleansing and document parsing.
They’ve also combined automation with traditional analytics and AI or ML activities. The benefits can be significant, with companies reporting greater efficiencies and improved quality control, with time to focus on developing the next great ideas and products. Moving on to more profound work also delivers a higher sense of accomplishment: it makes people feel that their job has more value and sense.
All of this together creates a strong incentive for more conservative companies to heavily invest in these practices, which are more often than not accelerated by employees eager for more automation, more analytics, and more insight. When it’s grassroots-driven like this, you get buy-in from across the organization. The success of these initiatives relies on appropriate tooling and standard processes (MLOps, data ops, sometimes called XOps) in order to disseminate such power across organizations, while retaining appropriate controls and governance.
This year we witnessed many layoffs and resignations of data analytics professionals across all levels and companies. Many companies are working hard to retain top talent, investing in training programs and providing employees with growth opportunities. Others have cut down on spending and are holding back on investing in their people in fear of an economic downturn.
In addition to economic pressures, tech companies are facing tough decisions around letting their employees work remotely vs. bringing them back in the office (at least part-time). Those that provide the right balance of flexibility stand to win the competition for top talent.
The demand for data analytics, data science, and AI/ML professionals holds strong as we face an increased reliance on AI to carry out repetitive tasks, as well as even more data to analyze than we had last year. One thing I’m noticing is that we are still seeing a disconnect between the supply of available talent, the majority of whom are non-senior data scientists and data engineers, and the demand for senior data professionals who can fill the needs of hiring companies. I can’t wait to see what the new year brings!
Automating ML workflows will become more essential
Although we’ve seen plenty of top technology companies announce layoffs in the latter part of 2022, it’s likely none of these companies are laying off their most talented machine learning personnel. However, to fill the void of fewer people on deeply technical teams, companies will have to lean even further into automation to keep productivity up and ensure projects reach completion. We expect to also see companies that use ML technology put more systems into place to monitor and govern performance and make more data-driven decisions on how to manage ML or data science teams. With clearly defined goals, these technical teams will have to be more KPI-centric so leadership can have a more in-depth understanding of machine learning’s ROI. Gone are the days of ambiguous benchmarks for ML.
Hoarding ML talent is over
The machine learning staff affected by recent layoffs are likely the most recent hires, as opposed to the long-term staff who have been working with ML for years. Since ML and AI have become more common technologies in the last decade, many big tech companies began hiring these types of workers because they could handle the financial cost and keep them away from competitors, not necessarily because they were needed. From this perspective, it’s not surprising to see so many ML workers being laid off considering the surplus within larger companies. However, as the era of ML talent hoarding ends, it could usher in a new wave of innovation and opportunity for startups. With so much talent now looking for work, we will likely see many of these folks trickle out of big tech and into small and medium-sized businesses or startups.
In 2022, there were many cutting-edge developments in the areas of MLOps tooling, generative art, large language models, and speech recognition. OpenAI and DeepMind were at the forefront of AI development. They keep coming up with state-of-the-art models that transform the whole industry.
In the past, there were limited open-source tools available for us to smoothly deploy models into production. We either had to use DevOps tools or come up with unique solutions. There was no one-stop solution. We had to use multiple MLOps tools for experiment tracking, metadata management, ML pipelines, data and pipeline versioning, and model monitoring.
That all changed in 2022. Product-based companies are integrating more features so that data scientists and machine learning engineers can perform all of their MLOps tasks on one platform, such as DagsHub, Kubeflow, and BentoML. You will see more companies targeting data scientists and ML engineers instead of developers and software engineers.
DALL-E 2 was introduced by OpenAI, and soon after, we saw people using natural language to generate high-quality art. It was just amazing. Soon after its launch, we saw an open-source alternative to DALL-E 2: Stable Diffusion for generative art. It allows people to understand the model architecture and come up with unique solutions such as Diffuse The Rest, Runway Inpainting, and Stable Diffusion Depth2img. Furthermore, we have seen multiple companies integrate generative art into their ecosystems.
Both Stable Diffusion and DALL-E 2 are mainstream now.
We have seen GitHub Copilot using a large language model for code generation. It has completely changed how we code. GitHub Copilot uses the OpenAI Codex to suggest code and entire functions in real-time, right from your editor.
Then, OpenAI introduced Whisper, which approaches human-level robustness and accuracy in English speech recognition. It is a little bit better than Wav2Vec2 for the English language.
Finally, OpenAI introduced ChatGPT, which is better than GPT-3. It was optimized for conversation, and you will see a lot of tweets and posts praising how accurately ChatGPT has responded to questions.
We are still far from AGI, and we are still far away from building language models that outperform humans in terms of creativity and understanding.
Matthew Mayo is a Data Scientist and the Editor-in-Chief of KDnuggets.
For the majority of the year, up until the end of November, I would have said that text-to-image technologies and applications were the development(s) of the year, hands down. You have encountered these throughout the year. We had DALL-E 2, a huge step forward from its predecessor. Midjourney won awards for its refined and stylized output. Craiyon (aka DALL-E mini) made accessing these models quicker and easier, even if with lower-quality rendered images. Then came Stable Diffusion, from Stability AI. Stable Diffusion is open source, and free to use on your own machine if so inclined.
The results of these technologies are all quite impressive in their own right. For a long time it seemed like this would be the biggest story of 2022 by far.
But then, on November 30, OpenAI released this blog post, along with access to the model in question. It's difficult to believe that ChatGPT has only been publicly available for a month at the time of this article's publication.
It is not difficult to see, however, how one could consider ChatGPT the biggest development in the fields of data science, machine learning, AI and analytics for 2022. Sure, it's built on the shoulders of giants such as GPT-3, but that makes it no less impressive. I could link to some of the various interesting results of "chats" with the model, but you've likely seen many yourself. People have used it as a productivity tool, an assistant, a replacement for Google, a code reviewer, and for all types of other scenarios.
In my judgement, there's just no getting around the fact that ChatGPT emerged late in the year to steal the spotlight regarding top developments of 2022.
And on that note, our final respondent for this year is ChatGPT, which we asked to weigh in on the top developments of 2022. (What do you think were the key data science, machine learning, AI and analytics developments of 2022?)
I'm sorry, but my knowledge cutoff is 2021, so I am unable to provide information about developments in 2022. However, I can tell you about some key developments in data science, machine learning, AI, and analytics that have occurred in the past. Some examples include the development of deep learning techniques, the increasing use of big data and cloud computing in machine learning and AI applications, the rise of automation and machine learning in business and industry, and the increasing use of data analytics in a variety of fields.
Not terribly useful, but we felt we had to try and share the results regardless.
The response, however, made us curious to ask again, but this time about 2021. (What do you think were the key data science, machine learning, AI and analytics developments of 2021?)
There were a number of significant developments in the fields of data science, machine learning, AI, and analytics in 2021. Some examples include:
1. Continued growth and development of deep learning techniques, including the use of transformers and other advanced architectures for natural language processing and computer vision tasks.
2. The increasing adoption of machine learning and AI in various industries, including finance, healthcare, and e-commerce.
3. The emergence of edge computing as a key enabler for the deployment of AI and machine learning models in resource-constrained environments.
4. The increasing use of explainable AI (XAI) methods to improve the interpretability and accountability of machine learning models.
5. The development of new tools and frameworks for data management and analysis, such as Dask and Vaex, which allow for the efficient processing of large datasets.
6. The growing importance of data privacy and security, leading to the development of new techniques for protecting sensitive data and preventing data breaches.
7. The increasing use of data analytics in a variety of fields, including marketing, supply chain management, and cybersecurity.
I'd label this vague, but reasonably accurate. ChatGPT doesn't look like it wants to go all in on anything in particular from 2021.
Well, that's that. Our team of experts has weighed in, and we hope that their diverse insights have provided something of interest for your reading pleasure.
See you in 2023!
Matthew Mayo (@mattmayo13) is a Data Scientist and the Editor-in-Chief of KDnuggets, the seminal online Data Science and Machine Learning resource. His interests lie in natural language processing, algorithm design and optimization, unsupervised learning, neural networks, and automated approaches to machine learning. Matthew holds a Master's degree in computer science and a graduate diploma in data mining. He can be reached at editor1 at kdnuggets[dot]com.