The evolving marketplace of data now includes many firms that support a variety of needs from organizations looking to grow with data. This listing of the key players categorized by target market provides an interesting picture of this exciting industry sector.
Many technical challenges must be overcome to achieve successful delivery of machine learning solutions at scale. This article shares best practices we encountered while architecting and applying a model deployment platform within a large organization, including required functionality, the recommendation for a scalable deployment pattern, and techniques for testing and performance tuning models to maximize platform throughput.
The data build tool (dbt) is gaining in popularity and use, and this hands-on tutorial covers creating complex models, using variables and functions, running tests, generating docs, and many more features.
In this post, I want to share some of the tools that I have been exploring recently and show you how I use them and how they helped improve the efficiency of my workflow. The two I will talk about in particular are Snowflake and Dask. Two very different tools but ones that complement each other well especially as part of the ML Lifecycle.
Let's take a look at 5 different Python data structures and see how they could be used to store data we might be processing in our everyday tasks, as well as the relative memory they use for storage and time they take to create and access.
The process of mastering new knowledge often requires multiple passes to ensure the information is deeply understood. If you already began your journey into machine learning and data science, then you are likely ready for a refresher on topics you previously covered. This eight-week self-learning path will help you recapture the foundations and prepare you for future success in applying these skills.
Modern AI/ML systems’ success has been critically dependent on their ability to process massive amounts of raw data in a parallel fashion using task-optimized hardware. Can we leverage the power of GPU and distributed computing for regular data processing jobs too?
Standard cross-validation on time series data is not possible because the data model is sequential, which does not lend well to splitting the data into statistically useful training and validation sets. However, a new approach called Reconstructive Cross-validation may pave the way toward performing this type of important analysis for predictive models with temporal datasets.
The goal of classification is a primary and widely-used application of machine learning algorithms. However, if careful consideration through additional analysis is not taken into the subtlety in the results of an even an apparently straightforward binary classifier, then the deeper meaning of your prediction may be obscured.
The fast Walsh Hadamard transform is a simple and useful algorithm for machine learning that was popular in the 1960s and early 1970s. This useful approach should be more widely appreciated and applied for its efficiency.
There are many distribution functions considered in statistics and machine learning, which can seem daunting to understand at first. Many are actually closely related, and with these intuitive explanations of the most important probability distributions, you can begin to appreciate the observations of data these distributions communicate.
Surfing the professional career wave in data science is a hot prospect for many looking to get their start in the world. The digital revolution continues to create many exciting new opportunities. But, jumping in too fast without fully establishing your foundational skills can be detrimental to your success, as is suggested by this advice for data science newbies from Peter Norvig, the Director of Research at Google.
In this post we discuss the concepts of bias and fairness in the Machine Learning world, and show how ML biases often reflect existing biases in society. Additionally, We discuss various methods for testing and enforcing fairness in ML models.
Training efficient deep learning models with any software tool is nothing without an infrastructure of robust and performant compute power. Here, current software and hardware ecosystems are reviewed that you might consider in your development when the highest performance possible is needed.
As an aspiring data scientist, it is easy to get overwhelmed by the abundance of resources available on the Internet. With these 6 online courses, you can develop yourself from a novice to experienced in less than a year, and prepare you with the skills necessary to land a job in data science.
Geometric Deep Learning is an attempt for geometric unification of a broad class of machine learning problems from the perspectives of symmetry and invariance. These principles not only underlie the breakthrough performance of convolutional neural networks and the recent success of graph neural networks but also provide a principled way to construct new types of problem-specific inductive biases.
A new role of the Analytics Engineer is an exciting opportunity that crosses the skill sets of a Data Analyst and Data Engineer. Here, we describe how this position can evolve at an organization, and recommend self-learning resources that can be used to prepare for the multifaceted responsibilities.
WeightWatcher is an open-source, diagnostic tool for evaluating the performance of (pre)-trained and fine-tuned Deep Neural Networks. It is based on state-of-the-art research into Why Deep Learning Works.
This post discusses the SwAV (Swapping Assignments between multiple Views of the same image) method from the paper “Unsupervised Learning of Visual Features by Contrasting Cluster Assignments” by M. Caron et al.
With the right software, hardware, and techniques at your fingertips, your capability to effectively develop high-performing models now hinges on leveraging automation to expedite the experimental process and building with the most efficient model architectures for your data.
While the Pandas library remains a crucial workhorse in data processing and management for data science, some limitations exist that can impact efficiencies, especially with very large data sets. Here, a few interesting alternatives to Pandas are introduced to improve your large data handling performance.
Becoming a professional in the field of data science takes more than just book-smarts. You need to have experience with real-world data sets, frequently-used tools, and an intuition for solutions that you can only gain from hands-on experience. These resources will jump start developing your practical skills.
Becoming a professional data scientist may not be as easy as "1... 2... 3...", but these 10 steps can be your self-learning roadmap to kickstarting your future in the exciting and ever-expanding field of data science.
In this article, we’ll cover a few of the most interesting — and powerful — of these techniques — focusing specifically on semantic search. We’ll learn how they work, what they’re good at, and how we can implement them ourselves.
Now that you are ready to efficiently build advanced deep learning models with the right software and hardware tools, the techniques involved in implementing such efforts must be explored to improve model quality and obtain the performance that your organization desires.
Want your social media algorithms to show you actual algorithms? Spare a moment during your social media scrolling to learn a bit of data science. Here are suggestions for at-a-glance access to good ideas and tips on your favorite platforms.