Search results for spark dataset

    Found 269 documents, 5928 searched:

  • A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets

    In this blog, I explore three sets of APIs—RDDs, DataFrames, and Datasets—available in a pre-release preview of Apache Spark 2.0; why and when you should use each set; outline their performance and optimization benefits; and enumerate scenarios when to use DataFrames and Datasets instead of RDDs.

    https://www.kdnuggets.com/2017/08/three-apache-spark-apis-rdds-dataframes-datasets.html

  • Data Validation for PySpark Applications using Pandera

    New features and concepts.

    https://www.kdnuggets.com/2023/08/data-validation-pyspark-applications-pandera.html

  • PySpark for Data Science

    In this tutorial, we will learn to initiate the Spark session, load and process the data, perform data analysis, and train a machine learning model.

    https://www.kdnuggets.com/2023/02/pyspark-data-science.html

  • Movie Recommendations with Spark Collaborative Filtering

    Not sure what movie to watch? Ask your recommender system.

    https://www.kdnuggets.com/2021/12/movie-recommendations-spark-collaborative-filtering.html

  • Querying the Most Granular Demographics Dataset

    Having access to broad and detailed population data can potentially offer enormous value to any organization looking to interact with specific demographics. However, access alone is not sufficient without being able to leverage advanced techniques to explore and visualize the data.

    https://www.kdnuggets.com/2021/08/querying-granular-demographic-dataset.html

  • Awesome list of datasets in 100+ categories

    With an estimated 44 zettabytes of data in existence in our digital world today and approximately 2.5 quintillion bytes of new data generated daily, there is a lot of data out there you could tap into for your data science projects. It's pretty hard to curate through such a massive universe of data, but this collection is a great start. Here, you can find data from cancer genomes to UFO reports, as well as years of air quality data to 200,000 jokes. Dive into this ocean of data to explore as you learn how to apply data science techniques or leverage your expertise to discover something new.

    https://www.kdnuggets.com/2021/05/awesome-list-datasets.html

  • Working with Spark, Python or SQL on Azure Databricks

    Here we look at some ways to interchangeably work with Python, PySpark and SQL using Azure Databricks, an Apache Spark-based big data analytics service designed for data science and data engineering offered by Microsoft.

    https://www.kdnuggets.com/2020/08/spark-python-sql-azure-databricks.html

  • Apache Spark Cluster on Docker

    Build your own Apache Spark cluster in standalone mode on Docker with a JupyterLab interface.

    https://www.kdnuggets.com/2020/07/apache-spark-cluster-docker.html

  • Apache Spark on Dataproc vs. Google BigQuery

    This post looks at research undertaken to provide interactive business intelligence reports and visualizations for thousands of end users, in the hopes of addressing some of the challenges to architects and engineers looking at moving to Google Cloud Platform in selecting the best technology stack based on their requirements and to process large volumes of data in a cost effective yet reliable manner.

    https://www.kdnuggets.com/2020/07/apache-spark-dataproc-vs-google-bigquery.html

  • LinkedIn Open Sources a Small Component to Simplify the TensorFlow-Spark Interoperability

    Spark-TFRecord enables the processing of TensorFlow’s TFRecord structures in Apache Spark.

    https://www.kdnuggets.com/2020/05/linkedin-open-sources-small-component-tensorflow-spark-interoperability.html

  • The Benefits & Examples of Using Apache Spark with PySpark

    Apache Spark runs fast, offers robust, distributed, fault-tolerant data objects, and integrates beautifully with the world of machine learning and graph analytics. Learn more here.

    https://www.kdnuggets.com/2020/04/benefits-apache-spark-pyspark.html

  • 3 Best Sites to Find Datasets for your Data Science Projects

    When first learning data science, you will inevitably find yourself looking for more datasets to practice with. Here, we recommend the 3 best sites to find datasets to spark your next data science project.

    https://www.kdnuggets.com/2020/04/best-sites-datasets-data-science.html

  • Spark NLP 101: LightPipeline

    A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage. Now let’s see how this can be done in Spark NLP using Annotators and Transformers.

    https://www.kdnuggets.com/2019/11/spark-nlp-101-lightpipeline.html

  • Time Series Analysis: A Simple Example with KNIME and Spark

    The task: train and evaluate a simple time series model using a random forest of regression trees and the NYC Yellow taxi dataset.

    https://www.kdnuggets.com/2019/10/time-series-analysis-simple-example-knime-spark.html

  • Learn how to use PySpark in under 5 minutes (Installation + Tutorial)

    Apache Spark is one of the hottest and largest open source projects in data processing, a framework with rich high-level APIs for programming languages like Scala, Python, Java and R. It realizes the potential of bringing together both Big Data and machine learning.

    https://www.kdnuggets.com/2019/08/learn-pyspark-installation-tutorial.html

  • Analyzing Tweets with NLP in Minutes with Spark, Optimus and Twint

    Social media has been gold for studying the way people communicate and behave. In this article, I’ll show you the easiest way of analyzing tweets without the Twitter API, in a way that scales to Big Data.

    https://www.kdnuggets.com/2019/05/analyzing-tweets-nlp-spark-optimus-twint.html

  • Practical Apache Spark in 10 Minutes

    Check out this series of articles on Apache Spark. Each part is a 10 minute tutorial on a particular Apache Spark topic. Read on to get up to speed using Spark.

    https://www.kdnuggets.com/2019/01/practical-apache-spark-10-minutes.html

  • Apache Spark Introduction for Beginners

    An extensive introduction to Apache Spark, including a look at the evolution of the product, use cases, architecture, ecosystem components, core concepts and more.

    https://www.kdnuggets.com/2018/10/apache-spark-introduction-beginners.html

  • Project Hydrogen, new initiative based on Apache Spark to support AI and Data Science

    An introduction to Project Hydrogen: how it can assist machine learning and AI frameworks on Apache Spark and what distinguishes it from other open source projects.

    https://www.kdnuggets.com/2018/08/databricks-project-hydrogen-apache-spark.html

  • Introduction to Apache Spark

    This is the first blog in this series to analyze Big Data using Spark. It provides an introduction to Spark and its ecosystem.

    https://www.kdnuggets.com/2018/07/introduction-apache-spark.html

  • Deep Learning With Apache Spark: Part 1

    First part of a full discussion on how to do Distributed Deep Learning with Apache Spark. This part: What is Spark, basics on Spark+DL and a little more.

    https://www.kdnuggets.com/2018/04/deep-learning-apache-spark-part-1.html

  • [ebook] 7 Steps for a Developer to Learn Apache Spark

    We offer a step-by-step guide to technical content and related assets that help you learn Apache Spark, whether you're getting started with Spark or are an accomplished developer.

    https://www.kdnuggets.com/2018/04/databricks-ebook-7-steps-learn-apache-spark.html

  • A Simple XGBoost Tutorial Using the Iris Dataset

    This is an overview of the XGBoost machine learning algorithm, which is fast and shows good results. This example uses multiclass prediction with the Iris dataset from Scikit-learn.

    https://www.kdnuggets.com/2017/03/simple-xgboost-tutorial-iris-dataset.html

  • Apache Spark Key Terms, Explained

    An overview of 13 core Apache Spark concepts, presented with focus and clarity in mind. A great beginner's overview of essential Spark terminology.

    https://www.kdnuggets.com/2016/06/spark-key-terms-explained.html

  • XGBoost: Implementing the Winningest Kaggle Algorithm in Spark and Flink

    An overview of XGBoost4J, a JVM-based implementation of XGBoost, one of the most successful recent machine learning algorithms in Kaggle competitions, with distributed support for Spark and Flink.

    https://www.kdnuggets.com/2016/03/xgboost-implementing-winningest-kaggle-algorithm-spark-flink.html

  • Auto-Scaling scikit-learn with Spark

    Databricks gives us an overview of the spark-sklearn library, which automatically and seamlessly distributes model tuning on a Spark cluster, without impacting workflow.

    https://www.kdnuggets.com/2016/02/auto-scaling-scikit-learn-spark.html

  • 9 Must-Have Datasets for Investigating Recommender Systems

    Gain some insight into a variety of useful datasets for recommender systems, including data descriptions, appropriate uses, and some practical comparison.

    https://www.kdnuggets.com/2016/02/nine-datasets-investigating-recommender-systems.html

  • Deep Learning with Spark and TensorFlow

    The integration of TensorFlow with Spark leverages the distributed framework for hyperparameter tuning and model deployment at scale. Both time savings and improved error rates are demonstrated.

    https://www.kdnuggets.com/2016/01/deep-learning-spark-tensorflow.html

  • How to Check Hypotheses with Bootstrap and Apache Spark

    Learn how to leverage bootstrap sampling to test hypotheses, and how to implement in Apache Spark and Scala with a complete code example.

    https://www.kdnuggets.com/2016/01/hypothesis-testing-bootstrap-apache-spark.html

  • Yahoo Releases the Largest-ever Machine Learning Dataset for Researchers

    Are you interested in massive amounts of data for research? Yahoo has just released the largest-ever machine learning dataset to the research community.

    https://www.kdnuggets.com/2016/01/yahoo-largest-machine-learning-dataset.html

  • Spark + Deep Learning: Distributed Deep Neural Network Training with SparkNet

    Training deep neural nets can take precious time and resources. By leveraging an existing distributed batch processing framework, SparkNet can train neural nets quickly and efficiently.

    https://www.kdnuggets.com/2015/12/spark-deep-learning-training-with-sparknet.html

  • Fast Big Data: Apache Flink vs Apache Spark for Streaming Data

    Real-time stream processing has been gaining momentum in the recent past, and the major tools enabling it are Apache Spark and Apache Flink. Learn, with the help of a case study, about data processing, data flow, and data management using these tools.

    https://www.kdnuggets.com/2015/11/fast-big-data-apache-flink-spark-streaming.html

  • Spark SQL for Real-Time Analytics

    Apache Spark is the hottest topic in Big Data. This tutorial discusses why Spark SQL is becoming the preferred method for Real Time Analytics and for the next frontier, IoT (Internet of Things).

    https://www.kdnuggets.com/2015/09/spark-sql-real-time-analytics.html

  • Exclusive Interview: Matei Zaharia, creator of Apache Spark, on Spark, Hadoop, Flink, and Big Data in 2020

    Apache Spark is one of the hottest Big Data technologies in 2015. KDnuggets talks to Matei Zaharia, creator of Apache Spark, about key things to know about it, why it is not a replacement for Hadoop, how it is better than Flink, and his vision for Big Data in 2020.

    https://www.kdnuggets.com/2015/05/interview-matei-zaharia-creator-apache-spark.html

  • Awesome Public Datasets on GitHub

    A long, categorized list of large datasets (available for public use) to try your analytics skills on. Which one would you pick?

    https://www.kdnuggets.com/2015/04/awesome-public-datasets-github.html

  • Bitcoin tools and datasets

    Bitcoin, a secure and anonymous internet currency, has recently experienced a bubble in value and attention. Here is a very useful (and free) set of data extraction scripts and datasets for analysts interested in Bitcoin.

    https://www.kdnuggets.com/2013/04/bitcoin-tools-datasets.html

  • 2018’s Top 7 R Packages for Data Science and AI

    This is a list of the best packages that changed our lives this year, compiled from my weekly digests.

    https://www.kdnuggets.com/2019/01/vazquez-2018-top-7-r-packages.html

  • Unlock the Secrets of LLMs in 60-Minute with Andrej Karpathy

    Karpathy's talk provides a comprehensive yet accessible introduction to large language models, explaining their capabilities, future potential, and associated security risks in an engaging manner.

    https://www.kdnuggets.com/unlock-the-secrets-of-llms-in-a-60-minute-with-andrej-karpathy

  • A Data Lake, You Call It? It’s a Data Swamp

    How and why the data lake architecture often fails to meet its promises. And how better governance helps mitigate such challenges.

    https://www.kdnuggets.com/a-data-lake-you-call-it-it-a-data-swamp

  • The Only Free Course You Need To Become a Professional Data Engineer

    Data Engineering ZoomCamp offers free access to reading materials, video tutorials, assignments, homeworks, projects, and workshops.

    https://www.kdnuggets.com/the-only-free-course-you-need-to-become-a-professional-data-engineer

  • Read This Before Making a Career Switch to Data Science

    From Skill Assessment to Networking: Your Roadmap to Thriving in the World of Data Science.

    https://www.kdnuggets.com/read-this-before-making-a-career-switch-to-data-science

  • Pandas vs. Polars: A Comparative Analysis of Python’s Dataframe Libraries

    An in-depth analysis of their syntax, speed, and usability. Which one is the best to use when working with data?

    https://www.kdnuggets.com/pandas-vs-polars-a-comparative-analysis-of-python-dataframe-libraries

  • Mastering the Data Universe: Key Steps to a Thriving Data Science Career

    This article covered the six main pillars of a data science career from learning skills to getting a job.

    https://www.kdnuggets.com/mastering-the-data-universe-key-steps-to-a-thriving-data-science-career

  • A Brief History of the Neural Networks

    From the biological neuron to LLMs: How AI became smart.

    https://www.kdnuggets.com/a-brief-history-of-the-neural-networks

  • 7 Best Cloud Database Platforms

    Cloud databases have made it easier and cheaper to develop enterprise-level applications, offering flexibility, convenience, and standard database functionality. See what KDnuggets recommends.

    https://www.kdnuggets.com/7-best-cloud-database-platforms

  • Best Practices for Building ETLs for ML

    This article talks about several best practices for writing ETLs for building training datasets. It delves into several software engineering techniques and patterns applied to ML.

    https://www.kdnuggets.com/best-practices-for-building-etls-for-ml

  • Introduction to Cloud Computing for Data Science

    And the Power Duo of Modern Tech.

    https://www.kdnuggets.com/introduction-to-cloud-computing-for-data-science

  • The 5 Best AI Tools For Maximizing Productivity

    KDnuggets reviews a diverse set of 5 AI tools to help maximize your productivity. Have a look and see what our recommendations include.

    https://www.kdnuggets.com/the-5-best-ai-tools-for-maximizing-productivity

  • Working with Big Data: Tools and Techniques

    Where do you start in a field as vast as big data? Which tools and techniques to use? We explore this and talk about the most common tools in big data.

    https://www.kdnuggets.com/working-with-big-data-tools-and-techniques

  • Data Management Principles for Data Science

    Back to Basics: Understanding key data management principles that data scientists should know.

    https://www.kdnuggets.com/data-management-principles-for-data-science

  • Creating Visuals with Matplotlib and Seaborn

    Learn the basic Python package visualization for your work.

    https://www.kdnuggets.com/creating-visuals-with-matplotlib-and-seaborn

  • LangChain + Streamlit + Llama: Bringing Conversational AI to Your Local Machine

    Integrating Open Source LLMs and LangChain for Free Generative Question Answering (No API Key required).

    https://www.kdnuggets.com/2023/08/langchain-streamlit-llama-bringing-conversational-ai-local-machine.html

  • AI: Large Language & Visual Models

    This article discusses the significance of large language and visual models in AI, their capabilities, potential synergies, challenges such as data bias, ethical considerations, and their impact on the market, highlighting their potential for advancing the field of artificial intelligence.

    https://www.kdnuggets.com/2023/06/ai-large-language-visual-models.html

  • Ten Years of AI in Review

    From image classification to chatbot therapy.

    https://www.kdnuggets.com/2023/06/ten-years-ai-review.html

  • 10 Jupyter Notebook Tips and Tricks for Data Scientists

    Unlock the full potential of Jupyter Notebook with expert tips and techniques, including time-saving shortcuts, powerful magic functions, and advanced features, to boost your productivity.

    https://www.kdnuggets.com/2023/06/10-jupyter-notebook-tips-tricks-data-scientists.html

  • The Top AutoML Frameworks You Should Consider in 2023

    AutoML frameworks are powerful tools for data analysts and machine learning specialists that can automate data preprocessing, model selection, hyperparameter tuning, and even perform complex tasks like feature engineering.

    https://www.kdnuggets.com/2023/05/best-automl-frameworks-2023.html

  • The Role of Open Source Tools in Accelerating Data Science Progress

    Open source tools have had a pivotal role in the evolution of data science, from providing the foundation for analysis, to fueling the innovation that shapes today's landscape. The open source impact on data science is demonstrated best by looking at the relationship's past, present, and future.

    https://www.kdnuggets.com/2023/05/role-open-source-tools-accelerating-data-science-progress.html

  • Data Analytics Tools You Need To Know in 2023

    What tools do you need to know to be a successful data analyst?

    https://www.kdnuggets.com/2023/05/data-analytics-tools-need-know-2023.html

  • Introducing Healthcare-Specific Large Language Models from John Snow Labs

    John Snow Labs recently released a new LLM called BioGPT-JSL and capabilities tuned specifically to the medical domain. This article summarizes three things you should know about it. 

    https://www.kdnuggets.com/2023/04/john-snow-introducing-healthcare-specific-large-language-models-john-snow-labs.html

  • Top 19 Skills You Need to Know in 2023 to Be a Data Scientist

    Skills like the ability to clean, transform, statistically analyze, visualize, communicate, and predict data.

    https://www.kdnuggets.com/2023/04/top-19-skills-need-know-2023-data-scientist.html

  • Introducing the Testing Library for Natural Language Processing

    Deliver reliable, safe and effective NLP models.

    https://www.kdnuggets.com/2023/04/introducing-testing-library-natural-language-processing.html

  • Announcing PyCaret 3.0: Open-source, Low-code Machine Learning in Python

    Exploring the Latest Enhancements and Features of PyCaret 3.0.

    https://www.kdnuggets.com/2023/03/announcing-pycaret-30-opensource-lowcode-machine-learning-python.html

  • 7 Best Tools for Machine Learning Experiment Tracking

    Tools for organizing machine learning experiments, source code, artifacts, models registry, and visualization in one place.

    https://www.kdnuggets.com/2023/02/7-best-tools-machine-learning-experiment-tracking.html

  • Learn Data Engineering From These GitHub Repositories

    Kickstart your Data Engineering career with these curated GitHub repositories.

    https://www.kdnuggets.com/2023/02/learn-data-engineering-github-repositories.html

  • Overcome Your Data Quality Issues with Great Expectations

    Bad data costs organizations money, reputation, and time. Hence it is very important to monitor and validate data quality continuously.

    https://www.kdnuggets.com/2023/01/overcome-data-quality-issues-great-expectations.html

  • 12 Essential Commands for Streamlit

    Learn about the most commonly used Streamlit commands and build a customized web application.

    https://www.kdnuggets.com/2023/01/12-essential-commands-streamlit.html

  • Data Science Minimum: 10 Essential Skills You Need to Know to Start Doing Data Science

    Data science is ever-evolving, so mastering its foundational technical and soft skills will help you be successful in a career as a Data Scientist, as well as pursue advanced concepts, such as deep learning and artificial intelligence.

    https://www.kdnuggets.com/2020/10/data-science-minimum-10-essential-skills.html

  • Top 38 Python Libraries for Data Science, Data Visualization & Machine Learning

    This article compiles the 38 top Python libraries for data science, data visualization & machine learning, as best determined by KDnuggets staff.

    https://www.kdnuggets.com/2020/11/top-python-libraries-data-science-data-visualization-machine-learning.html

  • Data Science Projects That Can Help You Solve Real World Problems

    The best way to learn Data Science is by solving real-world problems with the data and building your own portfolio. In this article, we will discuss three projects that you can work on to build your portfolio and impress interviewers.

    https://www.kdnuggets.com/2022/11/data-science-projects-help-solve-real-world-problems.html

  • The Complete Data Engineering Study Roadmap

    Everything you need to know to start your career in Data Engineering.

    https://www.kdnuggets.com/2022/11/complete-data-engineering-study-roadmap.html

  • How LinkedIn Uses Machine Learning To Rank Your Feed

    In this post, you will learn to clarify business problems & constraints, understand problem statements, select evaluation metrics, overcome technical challenges, and design high-level systems.

    https://www.kdnuggets.com/2022/11/linkedin-uses-machine-learning-rank-feed.html

  • Top 10 MLOps Tools to Optimize & Manage Machine Learning Lifecycle

    As more businesses experiment with data, they realize that developing a machine learning (ML) model is only one of many steps in the ML lifecycle.

    https://www.kdnuggets.com/2022/10/top-10-mlops-tools-optimize-manage-machine-learning-lifecycle.html

  • Is OLAP Dead?

    OLAP enables citizen analysts to quickly, efficiently, and cost-effectively uncover new business insights at a reduced time-to-value.

    https://www.kdnuggets.com/2022/10/olap-dead.html

  • 25 Advanced SQL Interview Questions for Data Scientists

    Check out this collection of advanced SQL interview questions with answers.

    https://www.kdnuggets.com/2022/10/25-advanced-sql-interview-questions-data-scientists.html

  • 3 Simple Ways to Speed Up Your Python Code

    The post explains three popular frameworks, PySpark, Dask, and Ray, and discusses various factors to select the most appropriate one for your project.

    https://www.kdnuggets.com/2022/10/3-simple-ways-speed-python-code.html

  • Top Open Source Large Language Models

    In this article, we will discuss the importance of large language models and suggest some of the top open source models and the NLP tasks they can be used for.

    https://www.kdnuggets.com/2022/09/john-snow-top-open-source-large-language-models.html

  • KDnuggets News, September 14: Free Python for Data Science Course • Everything You’ve Ever Wanted to Know About Machine Learning

    Free Python for Data Science Course • Everything You’ve Ever Wanted to Know About Machine Learning • Progress Bars in Python with tqdm for Fun and Profit • 7 Tips for Python Beginners • 7 Data Analytics Interview Questions & Answers

    https://www.kdnuggets.com/2022/n36.html

  • 7 Things You Didn’t Know You Could do with a Low Code Tool

    Surprisingly easy solutions for complex data problems.

    https://www.kdnuggets.com/2022/09/7-things-didnt-know-could-low-code-tool.html

  • Machine Learning Metadata Store

    In this article, we will learn about metadata stores, the need for them, their components, and metadata store management.

    https://www.kdnuggets.com/2022/08/machine-learning-metadata-store.html

  • Is There a Way to Bridge the MLOps Tools Gap?

    Converting Jupyter notebooks to a well-designed software system is a mandatory step in every ML project. But there is a notable lack of tooling to assist developers with such translation, beyond the basic nbconvert utility.

    https://www.kdnuggets.com/2022/08/way-bridge-mlops-tools-gap.html

  • Most In-demand Artificial Intelligence Skills To Learn In 2022

    Artificial Intelligence (AI) is the process of programming a computer that can reason and learn like a human being and make decisions for itself.

    https://www.kdnuggets.com/2022/08/indemand-artificial-intelligence-skills-learn-2022.html

  • 10 Modern Data Engineering Tools

    Learn about the modern tools for data orchestration, data storage, analytical engineering, batch processing, and data streaming.

    https://www.kdnuggets.com/2022/07/10-modern-data-engineering-tools.html

  • Data Science, Statistics and Machine Learning Dictionary

    Check out this curated list of the most used data science terminology and get a leg up on your learning.

    https://www.kdnuggets.com/2022/05/data-science-statistics-machine-learning-dictionary.html

  • Feature Stores for Real-time AI & Machine Learning

    Real-time AI/ML is on the rise and feature stores are key to successfully deploying them. Read on to see how the choice of online store and the feature store architecture play important roles in determining its performance and cost.

    https://www.kdnuggets.com/2022/03/feature-stores-realtime-ai-machine-learning.html

  • How to Process a DataFrame with Millions of Rows in Seconds

    TLDR; process it with a new Python Data Processing Engine in the Cloud.

    https://www.kdnuggets.com/2022/01/process-dataframe-millions-rows-seconds.html

  • Query Your Pandas DataFrames with SQL

    Learn how to query your Pandas DataFrames using the standard SQL SELECT statement, seamlessly from within your Python code.

    https://www.kdnuggets.com/2021/10/query-pandas-dataframes-sql.html

  • How to Speed Up XGBoost Model Training

    XGBoost is an open-source implementation of gradient boosting designed for speed and performance. However, even XGBoost training can sometimes be slow. This article will review the advantages and disadvantages of each approach as well as go over how to get started.

    https://www.kdnuggets.com/2021/12/speed-xgboost-model-training.html

  • Cloud ML In Perspective: Surprises of 2021, Projections for 2022

    Let’s take a closer look at the Cloud ML market in 2021 in retrospect (with occasional drills into the realities of 2020, too). Read this in-depth analysis.

    https://www.kdnuggets.com/2021/12/cloud-ml-perspective-surprises-2021-projections-2022.html

  • My First Six Months as a Data Scientist

    The technical and non-technical lessons I’ve learned.

    https://www.kdnuggets.com/2021/12/first-six-months-data-scientist.html

  • Introduction to Clustering in Python with PyCaret

    A step-by-step, beginner-friendly tutorial for unsupervised clustering tasks in Python using PyCaret.

    https://www.kdnuggets.com/2021/12/introduction-clustering-python-pycaret.html

  • Inside recommendations: how a recommender system recommends

    We describe types of recommender systems, more specifically, algorithms and methods for content-based systems, collaborative filtering, and hybrid systems.

    https://www.kdnuggets.com/2021/11/recommendations-recommender-system.html

  • Four Basic Steps in Data Preparation

    What we would like to do here is introduce four very basic and very general steps in data preparation for machine learning algorithms. We will describe how and why to apply such transformations within a specific example.

    https://www.kdnuggets.com/2021/10/four-basic-steps-data-preparation.html

  • The 20 Python Packages You Need For Machine Learning and Data Science

    Do you do Python? Do you do data science and machine learning? Then, you need to do these crucial Python libraries that enable nearly all you will want to do.

    https://www.kdnuggets.com/2021/10/20-python-packages.html

  • Data science SQL interview questions from top tech firms

    As a data scientist, there is one thing you really need to understand and know how to handle: data. With SQL being a foundational technical approach for working with data, it should not be surprising that the top tech companies will ask about your SQL skills during an interview. Here, we cover the key concepts tested so you can best prepare for your next data science interview.

    https://www.kdnuggets.com/2021/10/data-science-sql-interview-questions.html

  • Messy Data is Beautiful

    Once these types of data have been cleaned, they do more than show organized data sets. They reveal unlimited possibilities, and AI analytics can reveal these possibilities faster and more efficiently than ever before.

    https://www.kdnuggets.com/2021/09/sparkbeyond-messy-data-is-beautiful.html

  • Data Engineering Technologies 2021

    Emerging technologies supporting the field of data engineering are growing at a rapid clip. This curated list includes the most important offerings available in 2021.

    https://www.kdnuggets.com/2021/09/data-engineering-technologies-2021.html

  • What Is The Real Difference Between Data Engineers and Data Scientists?

    To launch your data career, you’ll need both theoretical knowledge and applied skills. Bootcamp programs like Springboard’s Data Science Career Track and Data Engineering Career Track can help make you job-ready through hands-on, project-based learning and one-on-one mentorship. Wondering which data career path is right for you? Read on to find out.

    https://www.kdnuggets.com/2021/09/springboard-difference-data-engineers-data-scientists.html

  • A Data Science Portfolio That Will Land You The Job

    Landing a data science job is no easy feat, especially during the COVID-19 pandemic. This article provides aspiring data scientists with advice on building a data science portfolio that stands out.

    https://www.kdnuggets.com/2021/09/data-science-portfolio-job.html

  • How My Learning Path Changed After Becoming a Data Scientist

    I keep learning but in a different way.

    https://www.kdnuggets.com/2021/08/learning-path-changed-becoming-data-scientist.html

  • GPU-Powered Data Science (NOT Deep Learning) with RAPIDS

    How to utilize the power of your GPU for regular data science and machine learning even if you do not do a lot of deep learning work.

    https://www.kdnuggets.com/2021/08/gpu-powered-data-science-deep-learning-rapids.html

  • Get Interactive Plots Directly With Pandas

    Telling a story with data is a core function for any Data Scientist, and creating data visualizations that are simultaneously illuminating and appealing can be challenging. This tutorial reviews how to create Plotly and Bokeh plots directly through Pandas plotting syntax, which will help you convert static visualizations into interactive counterparts -- and take your analysis to the next level.

    https://www.kdnuggets.com/2021/06/interactive-plots-directly-pandas.html

  • Building a Knowledge Graph for Job Search Using BERT

    A guide on how to create knowledge graphs using NER and Relation Extraction.

    https://www.kdnuggets.com/2021/06/knowledge-graph-job-search-bert.html

  • Top 10 Data Science Projects for Beginners

    Check out these projects for ideas to strengthen your skills and build a portfolio that stands out.

    https://www.kdnuggets.com/2021/06/top-10-data-science-projects-beginners.html

  • Make Pandas 3 Times Faster with PyPolars

    Learn how to speed up your Pandas workflow using the PyPolars library.

    https://www.kdnuggets.com/2021/05/pandas-faster-pypolars.html

  • Data Validation in Machine Learning is Imperative, Not Optional

    Before we reach model training in the pipeline, there are various components like data ingestion, data versioning, data validation, and data pre-processing that need to be executed. In this article, we will discuss data validation, why it is important, its challenges, and more.

    https://www.kdnuggets.com/2021/05/data-validation-machine-learning-imperative.html

  • How to Speed Up Pandas with Modin

    The Modin library can scale your pandas workflows with a one-line code change, and it integrates with the Python ecosystem and Ray clusters. This tutorial goes over how to get started with Modin and how it can speed up your pandas workflows.

    https://www.kdnuggets.com/2021/03/speed-up-pandas-modin.html

  • Beautiful decision tree visualizations with dtreeviz

    Improve the old way of plotting the decision trees and never go back!

    https://www.kdnuggets.com/2021/03/beautiful-decision-tree-visualizations-dtreeviz.html

  • Inside the Architecture Powering Data Quality Management at Uber

    Data Quality Monitor implements novel statistical methods for anomaly detection and quality management in large data infrastructures.

    https://www.kdnuggets.com/2021/02/inside-architecture-powering-data-quality-management-uber.html

  • The Best Tool for Data Blending is KNIME

    These are the lessons and best practices I learned in many years of experience in data blending, and the software that became my most important tool in my day-to-day work.

    https://www.kdnuggets.com/2021/01/best-tool-data-blending-knime.html

  • Model Experiments, Tracking and Registration using MLflow on Databricks

    This post covers how StreamSets can help expedite operations at some of the most crucial stages of Machine Learning Lifecycle and MLOps, and demonstrates integration with Databricks and MLflow.

    https://www.kdnuggets.com/2021/01/model-experiments-tracking-registration-mlflow-databricks.html

  • 10 Python Skills They Don’t Teach in Bootcamp

    Ascend to new heights in Data Science and Machine Learning with this thrilling list of coding tips.

    https://www.kdnuggets.com/2020/12/10-python-skills-dont-teach-bootcamp.html

  • Top Python Libraries for Deep Learning, Natural Language Processing & Computer Vision

    This article compiles the 30 top Python libraries for deep learning, natural language processing & computer vision, as best determined by KDnuggets staff.

    https://www.kdnuggets.com/2020/11/top-python-libraries-deep-learning-natural-language-processing-computer-vision.html

  • Pandas on Steroids: End to End Data Science in Python with Dask

    End to end parallelized data science from reading big data to data manipulation to visualisation to machine learning.

    https://www.kdnuggets.com/2020/11/pandas-steroids-dask-python-data-science.html

  • Data scientist or machine learning engineer? Which is a better career option?

    In order to build automated data processing systems, we require professionals like Machine Learning Engineers and Data Scientists. But which of these is a better career option right now? Read on to find out.

    https://www.kdnuggets.com/2020/11/greatlearning-data-scientist-machine-learning-engineer.html

  • How to be a 10x data scientist

    If you are a Data Scientist looking to make it to the next level, there are many opportunities to up your game and your efficiency to stand out from the others. Some of these recommendations are straightforward, and others are rarely followed, but they will all pay dividends in time and effectiveness for your career.

    https://www.kdnuggets.com/2020/10/10x-data-scientist.html

  • How LinkedIn Uses Machine Learning in its Recruiter Recommendation Systems

    LinkedIn uses some very innovative machine learning techniques to optimize candidate recommendations.

    https://www.kdnuggets.com/2020/10/linkedin-machine-learning-recruiter-recommendation-systems.html

  • 5 Challenges to Scaling Machine Learning Models

    ML models are hard to translate into active business gains. To understand the common pitfalls in productionizing ML models, let’s dive into the top 5 challenges that organizations face.

    https://www.kdnuggets.com/2020/10/5-challenges-scaling-machine-learning-models.html

  • LinkedIn’s Pro-ML Architecture Summarizes Best Practices for Building Machine Learning at Scale

    The reference architecture is powering mission critical machine learning workflows within LinkedIn.

    https://www.kdnuggets.com/2020/09/linkedin-pro-ml-architecture-best-practices-building-machine-learning-scale.html

  • Unpopular Opinion – Data Scientists Should Be More End-to-End

    Can a do-it-all Data Scientist really be more effective at delivering new value from data? While it might sound exhausting, important efficiencies can exist that might bring better value to the business even faster.

    https://www.kdnuggets.com/2020/09/data-scientists-should-be-more-end-to-end.html

  • Online Certificates/Courses in AI, Data Science, Machine Learning from Top Universities

    We present the online courses and certificates in AI, Data Science, Machine Learning, and related topics from the top 20 universities in the world.

    https://www.kdnuggets.com/2020/09/online-certificates-ai-data-science-machine-learning-top.html

  • Here’s what you need to look for in a model server to build ML-powered services

    More applications are being infused with machine learning while MLOps processes and best practices are becoming well established. Critical to these software and systems are the servers that run the models, which should feature key capabilities to drive successful enterprise-scale productionizing of machine learning.

    https://www.kdnuggets.com/2020/09/model-server-build-ml-powered-services.html

  • The List of Top 10 Lists in Data Science

    The list of Top 10 lists that Data Scientists -- from enthusiasts to those who want to jump start a career -- must know to smoothly navigate a path through this field.

    https://www.kdnuggets.com/2020/08/top-10-lists-data-science.html

  • Netflix’s Polynote is a New Open Source Framework to Build Better Data Science Notebooks

    The new notebook environment provides substantial improvements to streamline experimentation in machine learning workflows.

    https://www.kdnuggets.com/2020/08/netflix-polynote-open-source-framework-better-data-science-notebooks.html

  • Know What Employers are Expecting for a Data Scientist Role in 2020

    The analysis is based on 1000+ recent data scientist job postings, extracted from job portals using web scraping.

    https://www.kdnuggets.com/2020/08/employers-expecting-data-scientist-role-2020.html

  • A Tour of End-to-End Machine Learning Platforms

    An end-to-end machine learning platform needs a holistic approach. If you’re interested in learning more about a few well-known ML platforms, you’ve come to the right place!

    https://www.kdnuggets.com/2020/07/tour-end-to-end-machine-learning-platforms.html

  • What I learned from looking at 200 machine learning tools

    While hundreds of machine learning tools are available today, the ML software landscape may still be underdeveloped with more room to mature. This review considers the state of ML tools, existing challenges, and which frameworks are addressing the future of machine learning software.

    https://www.kdnuggets.com/2020/07/200-machine-learning-tools.html

  • Some Things Uber Learned from Running Machine Learning at Scale

    Uber's machine learning runtime, Michelangelo, has been in operation for a few years. What has the Uber team learned?

    https://www.kdnuggets.com/2020/07/some-things-uber-learned-machine-learning-scale.html

  • Uber’s Ludwig is an Open Source Framework for Low-Code Machine Learning

    The new framework allows developers with minimal experience to create and train machine learning models.

    https://www.kdnuggets.com/2020/06/uber-ludwig-open-source-framework-machine-learning.html

  • How to Think Like a Data Scientist

    So what does it take to become a data scientist? For some pointers on the skills for success, I interviewed Ben Chu, who is a Senior Data Scientist at Refinitiv Labs.

    https://www.kdnuggets.com/2020/05/think-like-data-scientist-data-analyst.html

  • State of the Machine Learning and AI Industry

    Enterprises are struggling to launch machine learning models that encapsulate the optimization of business processes. These are now the essential components of data-driven applications and AI services that can improve legacy rule-based business processes, increase productivity, and deliver results. In the current state of the industry, many companies are turning to off-the-shelf platforms to increase expectations for success in applying machine learning.

    https://www.kdnuggets.com/2020/04/machine-learning-ai-industry.html
