Search results for spark dataset
-
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets
In this blog, I explore three sets of APIs—RDDs, DataFrames, and Datasets—available in a pre-release preview of Apache Spark 2.0; why and when you should use each set; outline their performance and optimization benefits; and enumerate scenarios when to use DataFrames and Datasets instead of RDDs.https://www.kdnuggets.com/2017/08/three-apache-spark-apis-rdds-dataframes-datasets.html
-
Data Validation for PySpark Applications using Pandera
New features and concepts.https://www.kdnuggets.com/2023/08/data-validation-pyspark-applications-pandera.html
-
PySpark for Data Science
In this tutorial, we will learn to initiate the Spark session, load and process the data, perform data analysis, and train a machine learning model.https://www.kdnuggets.com/2023/02/pyspark-data-science.html
-
Movie Recommendations with Spark Collaborative Filtering
Not sure what movie to watch? Ask your recommender system.https://www.kdnuggets.com/2021/12/movie-recommendations-spark-collaborative-filtering.html
-
Querying the Most Granular Demographics Dataset
Having access to broad and detailed population data can potentially offer enormous value to any organization looking to interact with specific demographics. However, access alone is not sufficient without being able to leverage advanced techniques to explore and visualize the data.https://www.kdnuggets.com/2021/08/querying-granular-demographic-dataset.html
-
Awesome list of datasets in 100+ categories
With an estimated 44 zettabytes of data in existence in our digital world today and approximately 2.5 quintillion bytes of new data generated daily, there is a lot of data out there you could tap into for your data science projects. It's pretty hard to curate through such a massive universe of data, but this collection is a great start. Here, you can find data from cancer genomes to UFO reports, as well as years of air quality data to 200,000 jokes. Dive into this ocean of data to explore as you learn how to apply data science techniques or leverage your expertise to discover something new.https://www.kdnuggets.com/2021/05/awesome-list-datasets.html
-
Working with Spark, Python or SQL on Azure Databricks
Here we look at some ways to interchangeably work with Python, PySpark and SQL using Azure Databricks, an Apache Spark-based big data analytics service designed for data science and data engineering offered by Microsoft.https://www.kdnuggets.com/2020/08/spark-python-sql-azure-databricks.html
-
Apache Spark Cluster on Docker
Build your own Apache Spark cluster in standalone mode on Docker with a JupyterLab interface.https://www.kdnuggets.com/2020/07/apache-spark-cluster-docker.html
-
Apache Spark on Dataproc vs. Google BigQuery
This post looks at research undertaken to provide interactive business intelligence reports and visualizations for thousands of end users, in the hopes of addressing some of the challenges to architects and engineers looking at moving to Google Cloud Platform in selecting the best technology stack based on their requirements and to process large volumes of data in a cost effective yet reliable manner.https://www.kdnuggets.com/2020/07/apache-spark-dataproc-vs-google-bigquery.html
-
LinkedIn Open Sources a Small Component to Simplify the TensorFlow-Spark Interoperability
Spark-TFRecord enables the processing of TensorFlow’s TFRecord structures in Apache Spark.https://www.kdnuggets.com/2020/05/linkedin-open-sources-small-component-tensorflow-spark-interoperability.html
-
The Benefits & Examples of Using Apache Spark with PySpark
Apache Spark runs fast, offers robust, distributed, fault-tolerant data objects, and integrates beautifully with the world of machine learning and graph analytics. Learn more here.https://www.kdnuggets.com/2020/04/benefits-apache-spark-pyspark.html
-
3 Best Sites to Find Datasets for your Data Science Projects
When first learning data science, you will inevitably find yourself looking for more datasets to practice with. Here, we recommend the 3 best sites to find datasets to spark your next data science project.https://www.kdnuggets.com/2020/04/best-sites-datasets-data-science.html
-
Spark NLP 101: LightPipeline
A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage. Now let’s see how this can be done in Spark NLP using Annotators and Transformers.https://www.kdnuggets.com/2019/11/spark-nlp-101-lightpipeline.html
-
Time Series Analysis: A Simple Example with KNIME and Spark
The task: train and evaluate a simple time series model using a random forest of regression trees and the NYC Yellow taxi dataset.https://www.kdnuggets.com/2019/10/time-series-analysis-simple-example-knime-spark.html
-
Learn how to use PySpark in under 5 minutes (Installation + Tutorial)
Apache Spark is one of the hottest and largest open source projects in data processing, a framework with rich high-level APIs for programming languages such as Scala, Python, Java, and R. It realizes the potential of bringing together both Big Data and machine learning.https://www.kdnuggets.com/2019/08/learn-pyspark-installation-tutorial.html
-
Analyzing Tweets with NLP in Minutes with Spark, Optimus and Twint
Social media has been gold for studying the way people communicate and behave. In this article I’ll show you the easiest way of analyzing tweets without the Twitter API, in a way that scales to Big Data.https://www.kdnuggets.com/2019/05/analyzing-tweets-nlp-spark-optimus-twint.html
-
Practical Apache Spark in 10 Minutes
Check out this series of articles on Apache Spark. Each part is a 10 minute tutorial on a particular Apache Spark topic. Read on to get up to speed using Spark.https://www.kdnuggets.com/2019/01/practical-apache-spark-10-minutes.html
-
Apache Spark Introduction for Beginners
An extensive introduction to Apache Spark, including a look at the evolution of the product, use cases, architecture, ecosystem components, core concepts and more.https://www.kdnuggets.com/2018/10/apache-spark-introduction-beginners.html
-
Project Hydrogen, new initiative based on Apache Spark to support AI and Data Science
An introduction to Project Hydrogen: how it can assist machine learning and AI frameworks on Apache Spark and what distinguishes it from other open source projects.https://www.kdnuggets.com/2018/08/databricks-project-hydrogen-apache-spark.html
-
Introduction to Apache Spark
This is the first blog in this series to analyze Big Data using Spark. It provides an introduction to Spark and its ecosystem.https://www.kdnuggets.com/2018/07/introduction-apache-spark.html
-
Deep Learning With Apache Spark: Part 1
The first part of a full discussion on how to do Distributed Deep Learning with Apache Spark. This part: What is Spark, basics on Spark+DL and a little more.https://www.kdnuggets.com/2018/04/deep-learning-apache-spark-part-1.html
-
[ebook] 7 Steps for a Developer to Learn Apache Spark
We offer a step-by-step guide to technical content and related assets to help you learn Apache Spark, whether you're getting started with Spark or are an accomplished developer.https://www.kdnuggets.com/2018/04/databricks-ebook-7-steps-learn-apache-spark.html
-
A Simple XGBoost Tutorial Using the Iris Dataset
This is an overview of the XGBoost machine learning algorithm, which is fast and shows good results. This example uses multiclass prediction with the Iris dataset from Scikit-learn.https://www.kdnuggets.com/2017/03/simple-xgboost-tutorial-iris-dataset.html
-
Apache Spark Key Terms, Explained
An overview of 13 core Apache Spark concepts, presented with focus and clarity in mind. A great beginner's overview of essential Spark terminology.https://www.kdnuggets.com/2016/06/spark-key-terms-explained.html
-
XGBoost: Implementing the Winningest Kaggle Algorithm in Spark and Flink
An overview of XGBoost4J, a JVM-based implementation of XGBoost, one of the most successful recent machine learning algorithms in Kaggle competitions, with distributed support for Spark and Flink.https://www.kdnuggets.com/2016/03/xgboost-implementing-winningest-kaggle-algorithm-spark-flink.html
-
Auto-Scaling scikit-learn with Spark
Databricks gives us an overview of the spark-sklearn library, which automatically and seamlessly distributes model tuning on a Spark cluster, without impacting workflow.https://www.kdnuggets.com/2016/02/auto-scaling-scikit-learn-spark.html
-
9 Must-Have Datasets for Investigating Recommender Systems
Gain some insight into a variety of useful datasets for recommender systems, including data descriptions, appropriate uses, and some practical comparison.https://www.kdnuggets.com/2016/02/nine-datasets-investigating-recommender-systems.html
-
Deep Learning with Spark and TensorFlow
The integration of TensorFlow with Spark leverages the distributed framework for hyperparameter tuning and model deployment at scale. Both time savings and improved error rates are demonstrated.https://www.kdnuggets.com/2016/01/deep-learning-spark-tensorflow.html
-
How to Check Hypotheses with Bootstrap and Apache Spark
Learn how to leverage bootstrap sampling to test hypotheses, and how to implement it in Apache Spark and Scala with a complete code example.https://www.kdnuggets.com/2016/01/hypothesis-testing-bootstrap-apache-spark.html
-
Yahoo Releases the Largest-ever Machine Learning Dataset for Researchers
Are you interested in massive amounts of data for research? Yahoo has just released the largest-ever machine learning dataset to the research community.https://www.kdnuggets.com/2016/01/yahoo-largest-machine-learning-dataset.html
-
Spark + Deep Learning: Distributed Deep Neural Network Training with SparkNet
Training deep neural nets can take precious time and resources. By leveraging an existing distributed batch processing framework, SparkNet can train neural nets quickly and efficiently.https://www.kdnuggets.com/2015/12/spark-deep-learning-training-with-sparknet.html
-
Fast Big Data: Apache Flink vs Apache Spark for Streaming Data
Real-time stream processing has been gaining momentum in the recent past, and the major tools enabling it are Apache Spark and Apache Flink. Learn, with the help of a case study, about data processing, data flow, and data management using these tools.https://www.kdnuggets.com/2015/11/fast-big-data-apache-flink-spark-streaming.html
-
Spark SQL for Real-Time Analytics
Apache Spark is the hottest topic in Big Data. This tutorial discusses why Spark SQL is becoming the preferred method for Real Time Analytics and for the next frontier, IoT (Internet of Things).https://www.kdnuggets.com/2015/09/spark-sql-real-time-analytics.html
-
Exclusive Interview: Matei Zaharia, creator of Apache Spark, on Spark, Hadoop, Flink, and Big Data in 2020
Apache Spark is one of the hottest Big Data technologies in 2015. KDnuggets talks to Matei Zaharia, creator of Apache Spark, about key things to know about it, why it is not a replacement for Hadoop, how it is better than Flink, and his vision for Big Data in 2020.https://www.kdnuggets.com/2015/05/interview-matei-zaharia-creator-apache-spark.html
-
Awesome Public Datasets on GitHub
A long, categorized list of large datasets (available for public use) to try your analytics skills on. Which one would you pick?https://www.kdnuggets.com/2015/04/awesome-public-datasets-github.html
-
Bitcoin tools and datasets
Bitcoin, a secure and anonymous internet currency, has recently experienced a bubble in value and attention. Here is a very useful (and free) set of data extraction scripts and datasets for analysts interested in Bitcoin.https://www.kdnuggets.com/2013/04/bitcoin-tools-datasets.html
-
2018’s Top 7 R Packages for Data Science and AI
This is a list of the best packages that changed our lives this year, compiled from my weekly digests.https://www.kdnuggets.com/2019/01/vazquez-2018-top-7-r-packages.html
-
Unlock the Secrets of LLMs in 60-Minute with Andrej Karpathy
Karpathy's talk provides a comprehensive yet accessible introduction to large language models, explaining their capabilities, future potential, and associated security risks in an engaging manner.https://www.kdnuggets.com/unlock-the-secrets-of-llms-in-a-60-minute-with-andrej-karpathy
-
A Data Lake, You Call It? It’s a Data Swamp
How and why the data lake architecture often fails to meet its promises. And how better governance helps mitigate such challenges.https://www.kdnuggets.com/a-data-lake-you-call-it-it-a-data-swamp
-
The Only Free Course You Need To Become a Professional Data Engineer
Data Engineering ZoomCamp offers free access to reading materials, video tutorials, assignments, homework, projects, and workshops.https://www.kdnuggets.com/the-only-free-course-you-need-to-become-a-professional-data-engineer
-
Read This Before Making a Career Switch to Data Science
From Skill Assessment to Networking: Your Roadmap to Thriving in the World of Data Science.https://www.kdnuggets.com/read-this-before-making-a-career-switch-to-data-science
-
Pandas vs. Polars: A Comparative Analysis of Python’s Dataframe Libraries
An in-depth analysis of their syntax, speed, and usability. Which one is the best to use when working with data?https://www.kdnuggets.com/pandas-vs-polars-a-comparative-analysis-of-python-dataframe-libraries
-
Mastering the Data Universe: Key Steps to a Thriving Data Science Career
This article covered the six main pillars of a data science career from learning skills to getting a job.https://www.kdnuggets.com/mastering-the-data-universe-key-steps-to-a-thriving-data-science-career
-
A Brief History of the Neural Networks
From the biological neuron to LLMs: How AI became smart.https://www.kdnuggets.com/a-brief-history-of-the-neural-networks
-
7 Best Cloud Database Platforms
Cloud databases have made it easier and cheaper to develop enterprise-level applications, offering flexibility, convenience, and standard database functionality. See what KDnuggets recommends.https://www.kdnuggets.com/7-best-cloud-database-platforms
-
Best Practices for Building ETLs for ML
This article talks about several best practices for writing ETLs for building training datasets. It delves into several software engineering techniques and patterns applied to ML.https://www.kdnuggets.com/best-practices-for-building-etls-for-ml
-
Introduction to Cloud Computing for Data Science
And the Power Duo of Modern Tech.https://www.kdnuggets.com/introduction-to-cloud-computing-for-data-science
-
The 5 Best AI Tools For Maximizing Productivity
KDnuggets reviews a diverse set of 5 AI tools to help maximize your productivity. Have a look and see what our recommendations include.https://www.kdnuggets.com/the-5-best-ai-tools-for-maximizing-productivity
-
Working with Big Data: Tools and Techniques
Where do you start in a field as vast as big data? Which tools and techniques to use? We explore this and talk about the most common tools in big data.https://www.kdnuggets.com/working-with-big-data-tools-and-techniques
-
Data Management Principles for Data Science
Back to Basics: Understanding key data management principles that data scientists should know.https://www.kdnuggets.com/data-management-principles-for-data-science
-
Creating Visuals with Matplotlib and Seaborn
Learn the basic Python visualization packages for your work.https://www.kdnuggets.com/creating-visuals-with-matplotlib-and-seaborn
-
LangChain + Streamlit + Llama: Bringing Conversational AI to Your Local Machine
Integrating Open Source LLMs and LangChain for Free Generative Question Answering (No API Key required).https://www.kdnuggets.com/2023/08/langchain-streamlit-llama-bringing-conversational-ai-local-machine.html
-
AI: Large Language & Visual Models
This article discusses the significance of large language and visual models in AI, their capabilities, potential synergies, challenges such as data bias, ethical considerations, and their impact on the market, highlighting their potential for advancing the field of artificial intelligence.https://www.kdnuggets.com/2023/06/ai-large-language-visual-models.html
-
Ten Years of AI in Review
From image classification to chatbot therapy.https://www.kdnuggets.com/2023/06/ten-years-ai-review.html
-
10 Jupyter Notebook Tips and Tricks for Data Scientists
Unlock the full potential of Jupyter Notebook with expert tips and techniques, including time-saving shortcuts, powerful magic functions, and advanced features, to boost your productivity.https://www.kdnuggets.com/2023/06/10-jupyter-notebook-tips-tricks-data-scientists.html
-
The Top AutoML Frameworks You Should Consider in 2023
AutoML frameworks are powerful tools for data analysts and machine learning specialists that can automate data preprocessing, model selection, hyperparameter tuning, and even perform complex tasks like feature engineering.https://www.kdnuggets.com/2023/05/best-automl-frameworks-2023.html
-
The Role of Open Source Tools in Accelerating Data Science Progress
Open source tools have had a pivotal role in the evolution of data science, from providing the foundation for analysis, to fueling the innovation that shapes today's landscape. The open source impact on data science is demonstrated best by looking at the relationship's past, present, and future.https://www.kdnuggets.com/2023/05/role-open-source-tools-accelerating-data-science-progress.html
-
Data Analytics Tools You Need To Know in 2023
What tools do you need to know to be a successful data analyst?https://www.kdnuggets.com/2023/05/data-analytics-tools-need-know-2023.html
-
Introducing Healthcare-Specific Large Language Models from John Snow Labs
John Snow Labs recently released a new LLM called BioGPT-JSL and capabilities tuned specifically to the medical domain. This article summarizes three things you should know about it. https://www.kdnuggets.com/2023/04/john-snow-introducing-healthcare-specific-large-language-models-john-snow-labs.html
-
Top 19 Skills You Need to Know in 2023 to Be a Data Scientist
Skills like the ability to clean, transform, statistically analyze, visualize, communicate, and predict data.https://www.kdnuggets.com/2023/04/top-19-skills-need-know-2023-data-scientist.html
-
Introducing the Testing Library for Natural Language Processing
Deliver reliable, safe and effective NLP models.https://www.kdnuggets.com/2023/04/introducing-testing-library-natural-language-processing.html
-
Announcing PyCaret 3.0: Open-source, Low-code Machine Learning in Python
Exploring the Latest Enhancements and Features of PyCaret 3.0.https://www.kdnuggets.com/2023/03/announcing-pycaret-30-opensource-lowcode-machine-learning-python.html
-
7 Best Tools for Machine Learning Experiment Tracking
Tools for organizing machine learning experiments, source code, artifacts, models registry, and visualization in one place.https://www.kdnuggets.com/2023/02/7-best-tools-machine-learning-experiment-tracking.html
-
Learn Data Engineering From These GitHub Repositories
Kickstart your Data Engineering career with these curated GitHub repositories.https://www.kdnuggets.com/2023/02/learn-data-engineering-github-repositories.html
-
Overcome Your Data Quality Issues with Great Expectations
Bad data costs organizations money, reputation, and time. Hence it is very important to monitor and validate data quality continuously.https://www.kdnuggets.com/2023/01/overcome-data-quality-issues-great-expectations.html
-
12 Essential Commands for Streamlit
Learn about the most commonly used Streamlit commands and build a customized web application.https://www.kdnuggets.com/2023/01/12-essential-commands-streamlit.html
-
Data Science Minimum: 10 Essential Skills You Need to Know to Start Doing Data Science
Data science is ever-evolving, so mastering its foundational technical and soft skills will help you be successful in a career as a Data Scientist, as well as pursue advanced concepts, such as deep learning and artificial intelligence.https://www.kdnuggets.com/2020/10/data-science-minimum-10-essential-skills.html
-
Top 38 Python Libraries for Data Science, Data Visualization & Machine Learning
This article compiles the 38 top Python libraries for data science, data visualization & machine learning, as best determined by KDnuggets staff.https://www.kdnuggets.com/2020/11/top-python-libraries-data-science-data-visualization-machine-learning.html
-
Data Science Projects That Can Help You Solve Real World Problems
The best way to learn Data Science is by solving real-world problems with the data and building your own portfolio. In this article, we will discuss three projects that you can work on to build your portfolio and impress interviewers.https://www.kdnuggets.com/2022/11/data-science-projects-help-solve-real-world-problems.html
-
The Complete Data Engineering Study Roadmap
Everything you need to know to start your career in Data Engineering.https://www.kdnuggets.com/2022/11/complete-data-engineering-study-roadmap.html
-
How LinkedIn Uses Machine Learning To Rank Your Feed
In this post, you will learn to clarify business problems & constraints, understand problem statements, select evaluation metrics, overcome technical challenges, and design high-level systems.https://www.kdnuggets.com/2022/11/linkedin-uses-machine-learning-rank-feed.html
-
Top 10 MLOps Tools to Optimize & Manage Machine Learning Lifecycle
As more businesses experiment with data, they realize that developing a machine learning (ML) model is only one of many steps in the ML lifecycle.https://www.kdnuggets.com/2022/10/top-10-mlops-tools-optimize-manage-machine-learning-lifecycle.html
-
Is OLAP Dead?
OLAP enables citizen analysts to quickly, efficiently, and cost-effectively uncover new business insights at a reduced time-to-value.https://www.kdnuggets.com/2022/10/olap-dead.html
-
25 Advanced SQL Interview Questions for Data Scientists
Check out this collection of advanced SQL interview questions with answers.https://www.kdnuggets.com/2022/10/25-advanced-sql-interview-questions-data-scientists.html
-
3 Simple Ways to Speed Up Your Python Code
The post explains three popular frameworks, PySpark, Dask, and Ray, and discusses various factors to select the most appropriate one for your project.https://www.kdnuggets.com/2022/10/3-simple-ways-speed-python-code.html
-
Top Open Source Large Language Models
In this article, we will discuss the importance of large language models and suggest some of the top open source models and the NLP tasks they can be used for.https://www.kdnuggets.com/2022/09/john-snow-top-open-source-large-language-models.html
-
KDnuggets News, September 14: Free Python for Data Science Course • Everything You’ve Ever Wanted to Know About Machine Learning
Free Python for Data Science Course • Everything You’ve Ever Wanted to Know About Machine Learning • Progress Bars in Python with tqdm for Fun and Profit • 7 Tips for Python Beginners • 7 Data Analytics Interview Questions & Answershttps://www.kdnuggets.com/2022/n36.html
-
7 Things You Didn’t Know You Could do with a Low Code Tool
Surprisingly easy solutions for complex data problems.https://www.kdnuggets.com/2022/09/7-things-didnt-know-could-low-code-tool.html
-
Machine Learning Metadata Store
In this article, we will learn about metadata stores, the need for them, their components, and metadata store management.https://www.kdnuggets.com/2022/08/machine-learning-metadata-store.html
-
Is There a Way to Bridge the MLOps Tools Gap?
Converting Jupyter notebooks to a well-designed software system is a mandatory step in every ML project. But there is a notable lack of tooling to assist developers with such translation, beyond the basic nbconvert utility.https://www.kdnuggets.com/2022/08/way-bridge-mlops-tools-gap.html
-
Most In-demand Artificial Intelligence Skills To Learn In 2022
Artificial Intelligence (AI) is the process of programming a computer that can reason and learn like a human being and make decisions for itself.https://www.kdnuggets.com/2022/08/indemand-artificial-intelligence-skills-learn-2022.html
-
10 Modern Data Engineering Tools
Learn about the modern tools for data orchestration, data storage, analytical engineering, batch processing, and data streaming.https://www.kdnuggets.com/2022/07/10-modern-data-engineering-tools.html
-
Data Science, Statistics and Machine Learning Dictionary
Check out this curated list of the most used data science terminology and get a leg up on your learning.https://www.kdnuggets.com/2022/05/data-science-statistics-machine-learning-dictionary.html
-
Feature Stores for Real-time AI & Machine Learning
Real-time AI/ML is on the rise and feature stores are key to successfully deploying them. Read on to see how the choice of online store and the feature store architecture play important roles in determining its performance and cost.https://www.kdnuggets.com/2022/03/feature-stores-realtime-ai-machine-learning.html
-
How to Process a DataFrame with Millions of Rows in Seconds
TLDR; process it with a new Python Data Processing Engine in the Cloud.https://www.kdnuggets.com/2022/01/process-dataframe-millions-rows-seconds.html
-
Query Your Pandas DataFrames with SQL
Learn how to query your Pandas DataFrames using the standard SQL SELECT statement, seamlessly from within your Python code.https://www.kdnuggets.com/2021/10/query-pandas-dataframes-sql.html
-
How to Speed Up XGBoost Model Training
XGBoost is an open-source implementation of gradient boosting designed for speed and performance. However, even XGBoost training can sometimes be slow. This article will review the advantages and disadvantages of each approach as well as go over how to get started.https://www.kdnuggets.com/2021/12/speed-xgboost-model-training.html
-
Cloud ML In Perspective: Surprises of 2021, Projections for 2022
Let’s take a closer look at the Cloud ML market in 2021 in retrospect (with occasional drills into the realities of 2020, too). Read this in-depth analysis.https://www.kdnuggets.com/2021/12/cloud-ml-perspective-surprises-2021-projections-2022.html
-
My First Six Months as a Data Scientist
The technical and non-technical lessons I’ve learned.https://www.kdnuggets.com/2021/12/first-six-months-data-scientist.html
-
Introduction to Clustering in Python with PyCaret
A step-by-step, beginner-friendly tutorial for unsupervised clustering tasks in Python using PyCaret.https://www.kdnuggets.com/2021/12/introduction-clustering-python-pycaret.html
-
Inside recommendations: how a recommender system recommends
We describe types of recommender systems, more specifically, algorithms and methods for content-based systems, collaborative filtering, and hybrid systems.https://www.kdnuggets.com/2021/11/recommendations-recommender-system.html
-
Four Basic Steps in Data Preparation
What we would like to do here is introduce four very basic and very general steps in data preparation for machine learning algorithms. We will describe how and why to apply such transformations within a specific example.https://www.kdnuggets.com/2021/10/four-basic-steps-data-preparation.html
-
The 20 Python Packages You Need For Machine Learning and Data Science
Do you do Python? Do you do data science and machine learning? Then, you need to do these crucial Python libraries that enable nearly all you will want to do.https://www.kdnuggets.com/2021/10/20-python-packages.html
-
Data science SQL interview questions from top tech firms
As a data scientist, there is one thing you really need to understand and know how to handle: data. With SQL being a foundational technical approach for working with data, it should not be surprising that the top tech companies will ask about your SQL skills during an interview. Here, we cover the key concepts tested so you can best prepare for your next data science interview.https://www.kdnuggets.com/2021/10/data-science-sql-interview-questions.html
-
Messy Data is Beautiful
Once these types of data have been cleaned, they do more than show organized data sets. They reveal unlimited possibilities, and AI analytics can reveal these possibilities faster and more efficiently than ever before.https://www.kdnuggets.com/2021/09/sparkbeyond-messy-data-is-beautiful.html
-
Data Engineering Technologies 2021
Emerging technologies supporting the field of data engineering are growing at a rapid clip. This curated list includes the most important offerings available in 2021.https://www.kdnuggets.com/2021/09/data-engineering-technologies-2021.html
-
What Is The Real Difference Between Data Engineers and Data Scientists?
To launch your data career, you’ll need both theoretical knowledge and applied skills. Bootcamp programs like Springboard’s Data Science Career Track and Data Engineering Career Track can help make you job-ready through hands-on, project-based learning and one-on-one mentorship. Wondering which data career path is right for you? Read on to find out.https://www.kdnuggets.com/2021/09/springboard-difference-data-engineers-data-scientists.html
-
A Data Science Portfolio That Will Land You The Job
Landing a data science job is no easy feat, especially during the COVID-19 pandemic. This article provides aspiring data scientists with advice on building a data science portfolio that stands out.https://www.kdnuggets.com/2021/09/data-science-portfolio-job.html
-
How My Learning Path Changed After Becoming a Data Scientist
I keep learning but in a different way.https://www.kdnuggets.com/2021/08/learning-path-changed-becoming-data-scientist.html
-
GPU-Powered Data Science (NOT Deep Learning) with RAPIDS
How to utilize the power of your GPU for regular data science and machine learning even if you do not do a lot of deep learning work.https://www.kdnuggets.com/2021/08/gpu-powered-data-science-deep-learning-rapids.html
-
Get Interactive Plots Directly With Pandas
Telling a story with data is a core function for any Data Scientist, and creating data visualizations that are simultaneously illuminating and appealing can be challenging. This tutorial reviews how to create Plotly and Bokeh plots directly through Pandas plotting syntax, which will help you convert static visualizations into interactive counterparts -- and take your analysis to the next level.https://www.kdnuggets.com/2021/06/interactive-plots-directly-pandas.html
-
Building a Knowledge Graph for Job Search Using BERT
A guide on how to create knowledge graphs using NER and Relation Extraction.https://www.kdnuggets.com/2021/06/knowledge-graph-job-search-bert.html
-
Top 10 Data Science Projects for Beginners
Check out these projects for ideas to strengthen your skills and build a portfolio that stands out.https://www.kdnuggets.com/2021/06/top-10-data-science-projects-beginners.html
-
Make Pandas 3 Times Faster with PyPolars
Learn how to speed up your Pandas workflow using the PyPolars library.https://www.kdnuggets.com/2021/05/pandas-faster-pypolars.html
-
Data Validation in Machine Learning is Imperative, Not Optional
Before we reach model training in the pipeline, there are various components like data ingestion, data versioning, data validation, and data pre-processing that need to be executed. In this article, we will discuss data validation, why it is important, its challenges, and more.https://www.kdnuggets.com/2021/05/data-validation-machine-learning-imperative.html
-
How to Speed Up Pandas with Modin
The Modin library has the ability to scale your pandas workflows by changing one line of code and integration with the Python ecosystem and Ray clusters. This tutorial goes over how to get started with Modin and how it can speed up your pandas workflows.https://www.kdnuggets.com/2021/03/speed-up-pandas-modin.html
-
Beautiful decision tree visualizations with dtreeviz
Improve the old way of plotting the decision trees and never go back!https://www.kdnuggets.com/2021/03/beautiful-decision-tree-visualizations-dtreeviz.html
-
Inside the Architecture Powering Data Quality Management at Uber
Data Quality Monitor implements novel statistical methods for anomaly detection and quality management in large data infrastructures.https://www.kdnuggets.com/2021/02/inside-architecture-powering-data-quality-management-uber.html
-
The Best Tool for Data Blending is KNIME
These are the lessons and best practices I learned in many years of experience in data blending, and the software that became my most important tool in my day-to-day work.https://www.kdnuggets.com/2021/01/best-tool-data-blending-knime.html
-
Model Experiments, Tracking and Registration using MLflow on Databricks
This post covers how StreamSets can help expedite operations at some of the most crucial stages of Machine Learning Lifecycle and MLOps, and demonstrates integration with Databricks and MLflow.https://www.kdnuggets.com/2021/01/model-experiments-tracking-registration-mlflow-databricks.html
-
10 Python Skills They Don’t Teach in Bootcamp
Ascend to new heights in Data Science and Machine Learning with this thrilling list of coding tips.https://www.kdnuggets.com/2020/12/10-python-skills-dont-teach-bootcamp.html
-
Top Python Libraries for Deep Learning, Natural Language Processing & Computer Vision
This article compiles the 30 top Python libraries for deep learning, natural language processing & computer vision, as best determined by KDnuggets staff.https://www.kdnuggets.com/2020/11/top-python-libraries-deep-learning-natural-language-processing-computer-vision.html
-
Pandas on Steroids: End to End Data Science in Python with Dask
End to end parallelized data science from reading big data to data manipulation to visualisation to machine learning.https://www.kdnuggets.com/2020/11/pandas-steroids-dask-python-data-science.html
-
Data scientist or machine learning engineer? Which is a better career option?
In order to build automated data processing systems, we require professionals like Machine Learning Engineers and Data Scientists. But which of these is a better career option right now? Read on to find out.https://www.kdnuggets.com/2020/11/greatlearning-data-scientist-machine-learning-engineer.html
-
How to be a 10x data scientist
If you are a Data Scientist looking to make it to the next level, then there are many opportunities to up your game and your efficiency to stand out from the others. Some of these recommendations that you can follow are straightforward, and others are rarely followed, but they will all pay back in dividends of time and effectiveness for your career.https://www.kdnuggets.com/2020/10/10x-data-scientist.html
-
How LinkedIn Uses Machine Learning in its Recruiter Recommendation Systems
LinkedIn uses some very innovative machine learning techniques to optimize candidate recommendations.https://www.kdnuggets.com/2020/10/linkedin-machine-learning-recruiter-recommendation-systems.html
-
5 Challenges to Scaling Machine Learning Models
ML models are hard to translate into active business gains. To understand the common pitfalls in productionizing ML models, let’s dive into the top 5 challenges that organizations face.https://www.kdnuggets.com/2020/10/5-challenges-scaling-machine-learning-models.html
-
LinkedIn’s Pro-ML Architecture Summarizes Best Practices for Building Machine Learning at Scale
The reference architecture is powering mission critical machine learning workflows within LinkedIn.https://www.kdnuggets.com/2020/09/linkedin-pro-ml-architecture-best-practices-building-machine-learning-scale.html
-
Unpopular Opinion – Data Scientists Should Be More End-to-End
Can a do-it-all Data Scientist really be more effective at delivering new value from data? While it might sound exhausting, important efficiencies can exist that might bring better value to the business even faster.https://www.kdnuggets.com/2020/09/data-scientists-should-be-more-end-to-end.html
-
Online Certificates/Courses in AI, Data Science, Machine Learning from Top Universities
We present the online courses and certificates in AI, Data Science, Machine Learning, and related topics from the top 20 universities in the world.https://www.kdnuggets.com/2020/09/online-certificates-ai-data-science-machine-learning-top.html
-
Here’s what you need to look for in a model server to build ML-powered services
More applications are being infused with machine learning while MLOps processes and best practices are becoming well established. Critical to these software and systems are the servers that run the models, which should feature key capabilities to drive successful enterprise-scale productionizing of machine learning.https://www.kdnuggets.com/2020/09/model-server-build-ml-powered-services.html
-
The List of Top 10 Lists in Data Science
The list of Top 10 lists that Data Scientists -- from enthusiasts to those who want to jump start a career -- must know to smoothly navigate a path through this field.https://www.kdnuggets.com/2020/08/top-10-lists-data-science.html
-
Netflix’s Polynote is a New Open Source Framework to Build Better Data Science Notebooks
The new notebook environment provides substantial improvements to streamline experimentation in machine learning workflows.https://www.kdnuggets.com/2020/08/netflix-polynote-open-source-framework-better-data-science-notebooks.html
-
Know What Employers are Expecting for a Data Scientist Role in 2020
The analysis is done from 1000+ recent Data scientist jobs, extracted from job portals using web scraping.https://www.kdnuggets.com/2020/08/employers-expecting-data-scientist-role-2020.html
-
A Tour of End-to-End Machine Learning Platforms
An end-to-end machine learning platform needs a holistic approach. If you’re interested in learning more about a few well-known ML platforms, you’ve come to the right place!https://www.kdnuggets.com/2020/07/tour-end-to-end-machine-learning-platforms.html
-
What I learned from looking at 200 machine learning tools
While hundreds of machine learning tools are available today, the ML software landscape may still be underdeveloped with more room to mature. This review considers the state of ML tools, existing challenges, and which frameworks are addressing the future of machine learning software.https://www.kdnuggets.com/2020/07/200-machine-learning-tools.html
-
Some Things Uber Learned from Running Machine Learning at Scale
Uber machine learning runtime Michelangelo has been in operation for a few years. What has the Uber team learned?https://www.kdnuggets.com/2020/07/some-things-uber-learned-machine-learning-scale.html
-
Uber’s Ludwig is an Open Source Framework for Low-Code Machine Learning
The new framework allows developers with minimal experience to create and train machine learning models.https://www.kdnuggets.com/2020/06/uber-ludwig-open-source-framework-machine-learning.html
-
How to Think Like a Data Scientist
So what does it take to become a data scientist? For some pointers on the skills for success, I interviewed Ben Chu, who is a Senior Data Scientist at Refinitiv Labs.https://www.kdnuggets.com/2020/05/think-like-data-scientist-data-analyst.html
-
State of the Machine Learning and AI Industry
Enterprises are struggling to launch machine learning models that encapsulate the optimization of business processes. These are now the essential components of data-driven applications and AI services that can improve legacy rule-based business processes, increase productivity, and deliver results. In the current state of the industry, many companies are turning to off-the-shelf platforms to increase expectations for success in applying machine learning.https://www.kdnuggets.com/2020/04/machine-learning-ai-industry.html