Finding data and understanding its meaning is the traditional "daily grind" of a Data Scientist. Whale, a new lightweight data discovery, documentation, and quality engine for your data warehouse under development by Dataframe, helps your data science team search data more efficiently and automate its data metrics.
Developing machine learning models as products that deliver business value remains a new field with uncharted paths toward success. Applying well-established software development approaches, such as agile, is not straightforward, but may still offer a solid foundation to guide success.
As a core method in the Data Scientist's toolbox, k-means clustering is valuable but can be limited based on the structure of the data. Can expanded methods like PAM (partitioning around medoids), CLARA, and CLARANS provide better solutions, and what is the future of these algorithms?
The Poisson distribution, named after the French mathematician Siméon Denis Poisson, is a discrete probability distribution describing the probability that an event will occur a certain number of times in a fixed time (or space) interval.
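As a minimal sketch of the definition above, the probability of seeing exactly k events in an interval can be computed from the standard Poisson formula P(X = k) = e^(-λ) λ^k / k!. The function name `poisson_pmf` and the rate of 3 events per interval are illustrative assumptions, not taken from the article.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events when the mean count per interval is lam."""
    if k < 0:
        return 0.0
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Hypothetical example: an average of 3 events per interval.
rate = 3.0
probabilities = [poisson_pmf(k, rate) for k in range(6)]
for k, p in enumerate(probabilities):
    print(f"P(X = {k}) = {p:.4f}")
```

Summing the probabilities over all k gives 1, which is a quick sanity check for any hand-written implementation.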
So much happened in the world during 2020 that it may have been easy to miss the great progress in the world of AI. To catch you up quickly, check out this curated list of the latest breakthroughs in AI by release date, along with a video explanation, link to an in-depth article, and code.
By Louis (What's AI) Bouchard on Dec 28, 2020 in AI, Research, Trends
XGBoost is a tree-based ensemble machine learning algorithm and a scalable system for tree boosting. Read more for an overview of the parameters that make it work, and when you would use the algorithm.
A feature store is a data warehouse of features for machine learning. Unlike a conventional data warehouse, it is a dual database: one store serves features at low latency to online applications, and another holds large volumes of features. Learn how Data Scientists leverage this capability in production-deployed models.
A practical deep dive on production monitoring architectures for machine learning at scale using real-time metrics, outlier detectors, drift detectors, metrics servers and explainers.
In classification problems, the proportion of cases in each class largely determines the base rate of the predictions produced by the model. Therefore, if you use sampling techniques that change this proportion, there is a good chance you will want to rescale or calibrate your predictions before using them in the wild.
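One common way to rescale such predictions is a prior-shift correction, sketched below. It assumes the resampling changed only the class proportions (not how each class looks) and that the model is reasonably calibrated on the resampled data. The function name and the example rates are hypothetical, and this may not be the method the article itself uses.

```python
def correct_for_resampling(p_sampled: float,
                           original_rate: float,
                           sampled_rate: float) -> float:
    """Map a predicted positive-class probability back to the original base rate.

    p_sampled     -- probability from a model trained on resampled data
    original_rate -- share of positives in the original data
    sampled_rate  -- share of positives in the resampled training data
    """
    # Reweight each class by the ratio of its original prior to its sampled prior.
    pos = p_sampled * original_rate / sampled_rate
    neg = (1.0 - p_sampled) * (1.0 - original_rate) / (1.0 - sampled_rate)
    return pos / (pos + neg)


# Hypothetical example: positives are 1% of the real data,
# but the training set was balanced to 50/50.
corrected = correct_for_resampling(0.5, original_rate=0.01, sampled_rate=0.5)
print(round(corrected, 4))
```

In this example a score of 0.5 from the balanced model maps back to roughly the original 1% base rate, which is what an uninformative prediction should look like.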
SQL is an essential programming language for data analysis and processing. So, SQL questions are always part of the interview process for data science-related jobs, including data analysts, data scientists, and data engineers. Become familiar with these common patterns seen in SQL interview questions and follow our tips on how to neatly handle each with SQL queries.
This article explains the goals of anomaly detection and outlines the approaches used to solve specific use cases for anomaly detection and condition monitoring.
Check out the newest addition to our free eBook collection, Data Science and Machine Learning: Mathematical and Statistical Methods, and start building your statistical learning foundation today.
This article covers matrix decomposition and the underlying concepts of eigenvalues (lambdas) and eigenvectors, and discusses the purpose behind using matrices and vectors in linear algebra.
No matter the field in which you hold some expertise, sharing your skills to benefit the lives of others or supporting non-profit organizations that try to make the world a better place is a noble and time-worthy personal pursuit. Many opportunities exist in data science to contribute to meaningful projects and crucial needs from your local community to a global scale.
This article compares the performance of the well-known pandas library with pypolars, a rising DataFrame library written in Rust. See how they compare.
AdaBoost builds on decision trees with a depth of one, so it is a forest of stumps rather than full trees. It works by putting more weight on difficult-to-classify instances and less on those already handled well. The algorithm was developed to solve both classification and regression problems. Learn to build it from scratch here.
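The reweighting idea can be sketched as a single boosting round. This assumes the classic discrete AdaBoost for binary labels in {-1, +1}. The function name and toy data are hypothetical, and the article's own from-scratch version may differ.

```python
import math


def adaboost_reweight(weights, y_true, y_pred):
    """Run one AdaBoost weight update for a fitted stump.

    Returns the stump's vote weight (alpha) and the normalized sample weights.
    Labels and predictions are assumed to be -1 or +1.
    """
    # Weighted error: total weight of the misclassified samples.
    err = sum(w for w, y, h in zip(weights, y_true, y_pred) if y != h)
    # Clamp to avoid division by zero or log(0) for a perfect or useless stump.
    err = min(max(err, 1e-10), 1.0 - 1e-10)
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified samples (y * h == -1) are scaled up, correct ones scaled down.
    updated = [w * math.exp(-alpha * y * h)
               for w, y, h in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Toy example: four equally weighted samples, the last one misclassified.
weights = [0.25, 0.25, 0.25, 0.25]
y_true = [1, 1, -1, -1]
y_pred = [1, 1, -1, 1]
alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
print(alpha, new_weights)
```

After the update, the misclassified sample carries half of the total weight, so the next stump is pushed to get it right.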
Lift the curse of dimensionality by mastering the application of three important techniques that will help you reduce the dimensionality of your data, even if it is not linearly separable.
Transparency, explainability, and trust are pressing topics in AI/ML today. While much has been written about why they are important and what you need to do, no tools have existed until now.
With so much to learn and so many advancements to follow in the field of data science, there is a core set of foundational concepts that remain essential. Twenty of these ideas are highlighted here and are key to review when preparing for a job interview or just to refresh your appreciation of the basics.
This sample covers probability mass functions and probability density functions. You'll see how to understand and represent these distribution functions and their link with histograms.
If you are preparing for data engineering interviews, then follow these technical recommendations regarding your resume, programming skills, SQL acumen, and system design problem-solving, as well as the non-technical aspects of your upcoming interview session.
A Q&A covering the most frequently asked questions about Data Engineering: What does a data engineer do? What is a data pipeline? What is a data warehouse? How is a data engineer different from a data scientist? What skills and programming languages do you need to learn to become a data engineer?
This post is the first in a two-part series on AI for streaming data. Here, we’ll walk through strategies for aligning times and resampling the data.
As the fields related to AI and Data Science expand, they are becoming more complex, with more options and specializations to consider. If you are beginning your journey toward becoming an expert in Artificial Intelligence, this roadmap will help you decide what to learn next while steering clear of the latest hype.
NoSQL can offer an advantage to those who are entering Data Science and Analytics, as well as to applications with high-performance needs that traditional SQL databases can't meet.
There's a lot of data out there and so many data science techniques to master or review. Check out these great project ideas from easy to advanced difficulty levels to develop new skills and strengthen your portfolio.