- Movie Recommendations with Spark Collaborative Filtering - Dec 1, 2021.
Not sure what movie to watch? Ask your recommender system.
Apache Spark, Collaborative, Knime, Low-Code, Recommender Systems
- The Most In Demand Skills for Data Engineers in 2021 - May 18, 2021.
If you are preparing to make a career in data or are looking for opportunities to skill-up in your current data-centric role, then this analysis of in-demand skills for 2021, based on over 17,000 Data Engineer job postings, should offer you a good idea as to which programming languages and software tools are increasing and decreasing in importance.
Apache Spark, AWS, Data Engineer, Data Science Skills, Data Scientist, Python, Skills, SQL
- How to Acquire the Most Wanted Data Science Skills - Nov 13, 2020.
We recently surveyed KDnuggets readers to determine the "most wanted" data science skills. Since they seem to be those most in demand from practitioners, here is a collection of resources for getting started with this learning.
Algorithms, Amazon, Apache Spark, AWS, Computer Vision, Data Science, Data Science Skills, Deep Learning, Docker, NLP, NoSQL, PyTorch, Reinforcement Learning, TensorFlow
- Working with Spark, Python or SQL on Azure Databricks - Aug 27, 2020.
Here we look at some ways to interchangeably work with Python, PySpark and SQL using Azure Databricks, an Apache Spark-based big data analytics service designed for data science and data engineering offered by Microsoft.
Apache Spark, Databricks, Microsoft Azure, Python, SQL
- Unifying Data Pipelines and Machine Learning with Apache Spark™ and Amazon SageMaker - Aug 25, 2020.
Roll up your sleeves and charge up because you’re invited to an interactive, virtual Machine Learning workshop run by Amazon Web Services, Databricks, and Immuta on September 10.
Apache Spark, AWS, Immuta, Sagemaker, Webinar
- Containerization of PySpark Using Kubernetes - Aug 6, 2020.
This article demonstrates how to use Spark on Kubernetes. It also includes a brief comparison of the cluster managers available for Spark.
Apache Spark, Containers, Kubernetes
- 5 Apache Spark Best Practices For Data Science - Aug 4, 2020.
Check out these best practices for Spark that the author wishes they knew before starting their project.
Apache Spark, Best Practices, Data Science
- KDnuggets™ News 20:n29, Jul 29: Easy Guide To Data Preprocessing In Python; Building a better Spark UI; Computational Algebra for Coders: The Free Course - Jul 29, 2020.
An easy guide to data pre-processing in Python; Monitoring Apache Spark with a better Spark UI; Computational Linear Algebra for Coders: the free course; Labelling data with Snorkel; Bayesian Statistics.
Apache Spark, Bayesian, Data Preprocessing, Linear Algebra, Python
- Monitoring Apache Spark – We’re building a better Spark UI - Jul 23, 2020.
Data Mechanics is developing a free monitoring UI tool for Apache Spark to replace the Spark UI with a better UX, new metrics, and automated performance recommendations. Preview these high-level feedback features, and consider trying it out to support its first release.
Apache Spark, Monitoring, UI/UX
- Apache Spark Cluster on Docker - Jul 22, 2020.
Build your own Apache Spark cluster in standalone mode on Docker with a JupyterLab interface.
Apache Spark, Data Engineering, Docker, Jupyter, Python
- Apache Spark on Dataproc vs. Google BigQuery - Jul 15, 2020.
This post looks at research undertaken to provide interactive business intelligence reports and visualizations for thousands of end users. It aims to help architects and engineers moving to Google Cloud Platform select the best technology stack for their requirements and process large volumes of data in a cost-effective yet reliable manner.
Apache Spark, BigQuery, Google
- The Benefits & Examples of Using Apache Spark with PySpark - Apr 21, 2020.
Apache Spark runs fast, offers robust, distributed, fault-tolerant data objects, and integrates beautifully with the world of machine learning and graph analytics. Learn more here.
Apache Spark, Data Management, Python, SQL
- Spark NLP 101: LightPipeline - Nov 27, 2019.
A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage. Now let’s see how this can be done in Spark NLP using Annotators and Transformers.
Apache Spark, NLP, Pipeline, Spark NLP
- KDnuggets™ News 19:n41, Oct 30: Feature Selection: Beyond feature importance?; Time Series Analysis Using KNIME and Spark - Oct 30, 2019.
This week in KDnuggets: Feature Selection: Beyond feature importance?; Time Series Analysis: A Simple Example with KNIME and Spark; 5 Advanced Features of Pandas and How to Use Them; How to Measure Foot Traffic Using Data Analytics; Introduction to Natural Language Processing (NLP); and much, much more!
Apache Spark, Data Analytics, Feature Selection, Knime, NLP, Pandas, Python, scikit-learn, Time Series
- Time Series Analysis: A Simple Example with KNIME and Spark - Oct 23, 2019.
The task: train and evaluate a simple time series model using a random forest of regression trees and the NYC Yellow taxi dataset.
Apache Spark, Knime, Rosaria Silipo, Seasonality, Time Series
- Learn how to use PySpark in under 5 minutes (Installation + Tutorial) - Aug 13, 2019.
Apache Spark is one of the hottest and largest open source projects in data processing, with rich high-level APIs for programming languages like Scala, Python, Java and R. It realizes the potential of bringing together both Big Data and machine learning.
Apache Spark, Big Data, Data Science, Python
- Online Workshop: How to set up Kubernetes for all your machine learning workflows - Jul 17, 2019.
Join this free live online workshop, Jul 31 @12 PM ET, to learn how to set up your Kubernetes cluster, so you can run Spark, TensorFlow, and any ML framework instantly, touching on the entire machine learning pipeline from model training to model deployment.
Apache Spark, cnvrg.io, Kubernetes, Machine Learning, TensorFlow
- Spark NLP: Getting Started With The World’s Most Widely Used NLP Library In The Enterprise - Jun 18, 2019.
The Spark NLP library has become a popular AI framework that delivers speed and scalability to your projects. Check out what's under the hood and learn how to get started with Spark NLP from John Snow Labs.
Apache Spark, Enterprise, John Snow Labs, NLP, Spark NLP
- Scalable Python Code with Pandas UDFs: A Data Science Application - Jun 13, 2019.
There is still a gap between the corpus of libraries that developers want to apply in a scalable runtime and the set of libraries that support distributed execution. This post discusses how to bridge this gap using the functionality provided by Pandas UDFs in Spark 2.3+.
Apache Spark, Big Data, Pandas, Python
- What you need to know: The Modern Open-Source Data Science/Machine Learning Ecosystem - Jun 10, 2019.
We identify the 6 tools in the modern open-source Data Science ecosystem, examine the Python vs R question, and determine which tools are used the most with Deep Learning and Big Data.
Anaconda, Apache Spark, Big Data Software, Deep Learning, Excel, Keras, Poll, Python, R, RapidMiner, scikit-learn, Software, SQL, Tableau, TensorFlow
- Why physical storage of your database tables might matter - May 31, 2019.
Follow this investigation into why physical storage of your database tables might matter, from problem identification to possible issue resolutions.
Apache Spark, Databases, Postgres, SQL

- Python leads the 11 top Data Science, Machine Learning platforms: Trends and Analysis - May 30, 2019.
Python continues to lead the top Data Science platforms, but R and RapidMiner hold their share; Almost 50% have used Deep Learning tools; SQL is steady; Consolidation continues.
Anaconda, Apache Spark, Deep Learning, Excel, Keras, Poll, Python, R, RapidMiner, scikit-learn, Software, SQL, TensorFlow
- Analyzing Tweets with NLP in Minutes with Spark, Optimus and Twint - May 24, 2019.
Social media has been gold for studying the way people communicate and behave. In this article I’ll show you the easiest way to analyze tweets without the Twitter API, in a way that scales to Big Data.
Apache Spark, Big Data, Deep Learning, Machine Learning, NLP, Optimus, Python, Twint
- Data Science with Optimus Part 2: Setting your DataOps Environment - Apr 16, 2019.
Breaking down data science with Python, Spark and Optimus. Today: Data Operations for Data Science. Here we’ll learn to set up Git, Travis CI and DVC for our project.
Apache Spark, Data Operations, Data Science, Python, Workflow
- Data Science with Optimus Part 1: Intro - Apr 15, 2019.
With Optimus you can clean your data, prepare it, analyze it, create profilers and plots, and perform machine learning and deep learning, all in a distributed fashion, because on the back-end we have Spark, TensorFlow, Sparkling Water and Keras. It’s super easy to use.
Apache Spark, Data Science, Python, Workflow
- Scaling Big Data and AI – Spark + AI Summit 2019 - Mar 25, 2019.
Data and AI are all about scale. Databricks is bringing the Spark + AI Summit to San Francisco Apr 23-25. Check out the full list of sessions at Summit to see more exciting talks. Use code KDNuggets200 and get $200 off registration.
AI, Apache Spark, CA, Databricks, San Francisco
- Rapidly Build and Run Apache Spark Applications in the Cloud with StreamAnalytix on AWS Marketplace - Mar 1, 2019.
StreamAnalytix is an Apache Spark based big data analytics and machine learning platform. It offers an intuitive visual development environment to rapidly build and operationalize batch + streaming applications, across industries, data formats, and use cases.
Apache Spark, AWS, Impetus, StreamAnalytix, Streaming Analytics
- 10 Trending Data Science Topics at ODSC East 2019 - Feb 7, 2019.
ODSC East 2019, Boston, Apr 30 - May 3, will host 300+ of the leading experts in data science and AI. Here are a few standout topics and presentations in this rapidly evolving field. Register for ODSC East at 50% off till Feb 8.
Apache Spark, Boston, Data Science, LSTM, Machine Learning, ODSC, Python
- Practical Apache Spark in 10 Minutes - Jan 11, 2019.
Check out this series of articles on Apache Spark. Each part is a 10 minute tutorial on a particular Apache Spark topic. Read on to get up to speed using Spark.
Apache Spark
- Spark + AI Summit: learn best practices in ML and DL, latest frameworks, and more – special KDnuggets offer - Dec 14, 2018.
Check the agenda for the Spark + AI Summit in San Francisco on April 23-25, 2019, comprising 12 technical tracks on data and AI across verticals, and get the biggest discount: $700 off until Dec 31.
AI, Apache Spark, CA, Databricks, San Francisco
- [ebook] Manipulating Data in Apache Spark - Oct 29, 2018.
In this ebook from Databricks, learn how DataFrames leverage the power of distributed processing through Spark, how to make big data processing easier for a wider audience, and more.
Apache Spark, Clustering, Databricks, ebook, Free ebook
- KDnuggets™ News 18:n40, Oct 24: Graphs Are The Next Frontier In Data Science; Apache Spark Intro for Beginners - Oct 24, 2018.
Apache Spark, Graph Databases
- Apache Spark Introduction for Beginners - Oct 18, 2018.
An extensive introduction to Apache Spark, including a look at the evolution of the product, use cases, architecture, ecosystem components, core concepts and more.
Apache Spark, Beginners, Hadoop, R
- Big Data Day Camp: Big Data Tools & Techniques (October 25-26) - Oct 4, 2018.
Learn how to use data to make wise, actionable, data-driven decisions! Our first 2-day camp, Big Data Tools & Techniques, is October 25-26 at Qualcomm Institute, UCSD.
Apache Spark, Big Data, Deep Learning, Hadoop, Kafka
- ebook: Aggregating Data with Apache Spark™ - Sep 12, 2018.
Learn why cluster computing makes Spark the ideal processing engine for complex aggregations, the different types of aggregations that you can do with Spark, and more.
Apache Spark, Data Preparation, Databricks, ebook
- Optimus v2: Agile Data Science Workflows Made Easy - Aug 30, 2018.
Looking for a library to skyrocket your productivity as a Data Scientist? Check this out!
Apache Spark, Machine Learning, Pandas, Python
- Project Hydrogen, new initiative based on Apache Spark to support AI and Data Science - Aug 16, 2018.
An introduction to Project Hydrogen: how it can assist machine learning and AI frameworks on Apache Spark and what distinguishes it from other open source projects.
AI, Apache Spark, Data Science, Databricks, Distributed Computing, Production
- [eBook] A Unified Approach to Analytics with Apache Spark - Jul 25, 2018.
How your data scientists and engineers can build models and data pipelines rapidly while collaborating with the business - download the ebook now.
Analytics, Apache Spark, Data Science, Databricks, ebook
- Introduction to Apache Spark - Jul 6, 2018.
This is the first blog in this series to analyze Big Data using Spark. It provides an introduction to Spark and its ecosystem.
Apache Spark, Data Processing, Distributed Systems
- [ebook] Apache Spark™ Under the Hood - Jun 27, 2018.
Learn how to install and run Spark yourself; a summary of Spark core architecture and concepts; Spark's powerful language APIs and how you can use them.
Apache Spark, Databricks, ebook, PyTorch, R, scikit-learn, TensorFlow
- The 6 components of Open-Source Data Science/ Machine Learning Ecosystem; Did Python declare victory over R? - Jun 6, 2018.
We find 6 tools form the modern open source Data Science / Machine Learning ecosystem; examine whether Python declared victory over R; and review which tools are most associated with Deep Learning and Big Data.
Anaconda, Apache Spark, Data Science, Keras, Machine Learning, Open Source, Poll, Python, R, RapidMiner, Scala, scikit-learn, TensorFlow
- Deep Learning With Apache Spark: Part 2 - May 23, 2018.
In this article I’ll continue the discussion on Deep Learning with Apache Spark. I will focus entirely on the DL pipelines library and how to use it from scratch.
Apache Spark, Deep Learning, Keras, SQL, TensorFlow
- Mastering Advanced Analytics with Apache Spark - May 22, 2018.
Get the ebook with a collection of the most popular technical blog posts that introduce you to machine learning on Apache Spark, and highlight many of the major developments around Spark MLlib and GraphX.
Advanced Analytics, Apache Spark, Databricks, Graph Analytics, Machine Learning, MLlib
- Regeneron Pharmaceuticals: Spark R&D Developer (Data Engineer) - May 15, 2018.
Seeking an R&D Spark Developer to join the Genome Informatics team to expand the RGC’s big data infrastructure and develop new algorithms/tools to support various workflows/analyses throughout the RGC and Regeneron.
Apache Spark, Data Engineer, Developer, NY, Regeneron, Tarrytown
- KDnuggets™ News 18:n19, May 9: KDnuggets Poll: What tools you used for Analytics/Data Science Projects? 8 Useful Advices for Aspiring Data Scientists - May 9, 2018.
Also: Boost your data science skills. Learn linear algebra; Apache Spark: Python vs. Scala; Getting Started with spaCy for Natural Language Processing.
Advice, Apache Spark, Data Science, Mathematics
- Apache Spark: Python vs. Scala - May 4, 2018.
When it comes to using the Apache Spark framework, the data science community is divided into two camps: one prefers Scala, the other Python. This article compares the two, listing their pros and cons.
Apache Spark, Java, Python, Scala
- KDnuggets™ News 18:n17, Apr 25: Python Regular Expressions Cheat Sheet; Deep Learning With Apache Spark; Building a Question Answering Model - Apr 25, 2018.
Also: Derivation of Convolutional Neural Network from Fully Connected Network Step-By-Step; Presto for Data Scientists - SQL on anything; Why Deep Learning is perfect for NLP (Natural Language Processing); Top 16 Open Source Deep Learning Libraries and Platforms
Apache Spark, Cheat Sheet, Deep Learning, NLP, Python, Question answering, SQL
- Deep Learning With Apache Spark: Part 1 - Apr 18, 2018.
The first part of a full discussion on how to do Distributed Deep Learning with Apache Spark. This part: what Spark is, the basics of Spark+DL, and a little more.
Apache Spark, Databricks, Deep Learning, Pipeline
- [ebook] 7 Steps for a Developer to Learn Apache Spark - Apr 17, 2018.
We offer a step-by-step guide to technical content and related assets to help you learn Apache Spark, whether you're getting started with Spark or are an accomplished developer.
Apache Spark, Databricks, Developer, ebook, Spark SQL
- Making Machine Learning Simple - Mar 20, 2018.
Learn how to build better models with support for multiple data sources and feature extraction at scale, simplify operations with on-demand cluster management, and more.
Apache Spark, Databricks, Feature Extraction, Machine Learning
- Ranking Popular Distributed Computing Packages for Data Science - Mar 20, 2018.
We examined 140 frameworks and distributed programming packages and came up with a list of the top 20 distributed computing packages useful for Data Science, based on a combination of GitHub, Stack Overflow, and Google results.
Apache Spark, Data Science, Distributed Systems, GitHub, Hadoop
- [eBook] Solving 4 Big Problems in Data Science - Mar 6, 2018.
Insights and tools from leading data science teams to accelerate results.
Apache Spark, Big Data, Cloud Computing, Databricks, Deployment, ebook
- A powerful new IDE to build, test, and run Apache Spark applications on your desktop for free! - Feb 23, 2018.
Build enterprise-grade functionally rich Spark applications with the aid of an intuitive drag-and-drop user interface and a wide array of pre-built Spark operators.
Apache Spark, Impetus, StreamAnalytix
- The Data Scientist’s Guide to Apache Spark™ - Feb 16, 2018.
How data scientists can leverage Spark for advanced analytics.
Advanced Analytics, Apache Spark, Data Scientist, Databricks, Matei Zaharia
- Top 15 Scala Libraries for Data Science in 2018 - Feb 9, 2018.
For your convenience, we have prepared a comprehensive overview of the most important libraries used to perform machine learning and Data Science tasks in Scala.
Apache Spark, Data Analysis, Data Science, Data Visualization, Machine Learning, NLP, Scala
- Kogentix Automated Machine Learning Platform - Jan 24, 2018.
Kogentix Automated Machine Learning Platform is the only solution we have seen that runs natively on Spark and includes all of the elements required to build and run a machine learning application.
Apache Spark, Automated Machine Learning, Data Visualization, Kogentix
- Best Data Science, Machine Learning Courses from Udemy, only $10 until Dec 21 - Dec 14, 2017.
Holiday Dev & IT sale on best courses from Udemy, including Data Science, Machine Learning, Python, Spark, Tableau, and Hadoop - only $10 until Dec 21, 2017.
Apache Spark, Hadoop, Machine Learning, Online Education, Python, Tableau, Udemy
- Graph Analytics Using Big Data - Dec 4, 2017.
An overview and a small tutorial showing how to analyze a dataset using Apache Spark, graphframes, and Java.
Apache Spark, Big Data, Graph Analytics, India, Java
- Machine Learning with Optimus on Apache Spark - Nov 30, 2017.
The way most Machine Learning models work on Spark is not straightforward, and they need lots of feature engineering to work. That’s why we created the feature engineering section inside the Optimus Data Frame Transformer.
Apache Spark, Data Science, Feature Engineering, Machine Learning, MLlib, Python, Workflow
- KDnuggets™ News 17:n45, Nov 29: New Poll: Data Science Methods Used? Deep Learning Specialization: 21 Lessons Learned - Nov 29, 2017.
Also: The 10 Statistical Techniques Data Scientists Need to Master; Did Spark Really Kill Hadoop? A Framework for Textual Data Science.
Andrew Ng, Apache Spark, Deep Learning, Statistics, Text Mining
- Natural Language Processing Library for Apache Spark – free to use - Nov 28, 2017.
Introducing the Natural Language Processing Library for Apache Spark - and yes, you can actually use it for free! This post will give you a great overview of John Snow Labs NLP Library for Apache Spark.
Apache Spark, API, GitHub, John Snow Labs, Machine Learning, NLP
- Did Spark Really Kill Hadoop? - Nov 22, 2017.
A comprehensive survey conducted by iDatalabs shows us the trends of the future of these two Data Science technologies.
Apache Spark, Big Data, Hadoop, iDatalabs
- [eBook] A Gentle Introduction to Apache Spark™ - Nov 21, 2017.
If you are a developer or data scientist interested in big data, Spark is the tool for you. Download this ebook to learn why Spark is a popular choice for data analytics, what tools and features are available, and much more.
Apache Spark, Databricks, ebook, Free ebook
- How (& Why) Data Scientists and Data Engineers Should Share a Platform - Nov 17, 2017.
Sharing one platform has some obvious benefits for Data Science and Data Engineering teams, but technical, language and process challenges often make this difficult. Learn how one company implemented a single cloud platform for R, Python and other workloads – and some of the unexpected benefits they discovered along the way.
Apache Spark, Cazena, Data Science Platform, Hadoop, Python, R
- Best Data Science, Machine Learning Courses from Udemy, only $10 until Nov 28 - Black Friday/Cybermonday sale - Nov 17, 2017.
Black Friday/Cybermonday sale on best courses from Udemy, including Data Science, Machine Learning, Python, Spark, Tableau, and Hadoop - only $10 until Nov 28, 2017.
Apache Spark, Hadoop, Machine Learning, Online Education, Python, Tableau, Udemy
- PySpark SQL Cheat Sheet: Big Data in Python - Nov 16, 2017.
PySpark is a Spark Python API that exposes the Spark programming model to Python. With it, you can speed up analytic applications. With Spark, you can get started with big data processing, as it has built-in modules for streaming, SQL, machine learning and graph processing.
Apache Spark, Big Data, DataCamp, Python, SQL
- Best Data Science, Machine Learning Courses from Udemy (only $12 until Oct 31) - Oct 27, 2017.
Fall sale on best courses from Udemy, including Data Science, Machine Learning, Python, Spark, Tableau, and Hadoop - only $12 until Oct 31, 2017.
Apache Spark, Hadoop, Machine Learning, Online Education, Python, Tableau, Udemy
- Build, Test and Run Spark Applications at No Cost with StreamAnalytix Visual Spark Studio - Oct 25, 2017.
Experience the Ease and Speed of Building Spark Application on Your Desktop. Free to download and use!
Apache Spark, Impetus, StreamAnalytix, Streaming Analytics
- KDnuggets™ News 17:n41, Oct 25: Learning git not enough to become data scientist; Peak Data Scientist Demand? Top Machine Learning w. R videos - Oct 25, 2017.
Becoming a data scientist after a science PhD; New Poll: When will demand for Data Scientists/Machine Learning experts peak? It Only Takes One Line of Code to Run Regression.
Apache Spark, Data Scientist, Machine Learning, R
- Data Scientist Guide to Apache Spark - Oct 20, 2017.
Learn how data scientists can leverage Spark for advanced analytics with The Data Scientist’s Guide to Apache Spark, from Databricks!
Apache Spark, Data Science, Data Scientist, Databricks, Free ebook
- Spark – The Definitive Guide – exclusive preview - Sep 25, 2017.
Get an exclusive preview of "Spark: The Definitive Guide" from Databricks! Learn how Spark runs on a cluster, see examples in SQL, Python and Scala, learn about Structured Streaming and Machine Learning, and more.
Apache Spark, Databricks, Free ebook, Python, Scala, SQL
- The Easy Button for R & Python on Spark, Webinar Oct 18 - Sep 22, 2017.
Learn five solid reasons to use managed services for Cloudera for R, Python and other advanced analytics on Spark & Hadoop in the cloud.
Apache Spark, Cazena, Cloud Analytics, Cloudera, Python, R
- Benchmarking Big Data SQL Platforms in the Cloud - Sep 21, 2017.
TPC-DS benchmarks demonstrate Databricks Runtime 3.0's superior performance. Sign up for a Databricks account to get the fastest performance.
Apache Spark, AWS, Benchmark, Cloud Computing, Databricks, Presto
- Using Apache SystemML™ with Hortonworks Data Platform - Sep 18, 2017.
Learn how to add Apache SystemML to an existing Hortonworks Data Platform (HDP) 2.6.1 cluster for Apache Spark. Users interested in Python, Scala, Spark, or Zeppelin can run Apache SystemML as described here.
Apache, Apache Spark, Apache SystemML, Hortonworks, IBM, Machine Learning
- Best Data Science, Machine Learning Courses from Udemy (only $12 until Sep 20) - Sep 14, 2017.
Back-to-school sale on best courses from Udemy, including Data Science, Machine Learning, Python, Spark, Tableau, and Hadoop - only $12 until Sep 20, 2017.
Apache Spark, Hadoop, Machine Learning, Online Education, Python, Tableau, Udemy
- A Vision for Making Deep Learning Simple - Sep 5, 2017.
This post introduces Deep Learning Pipelines from Databricks, a new open-source library aimed at enabling everyone to easily integrate scalable deep learning into their workflows, from machine learning practitioners to business analysts.
Apache Spark, Databricks, Deep Learning, Hyperparameter
- Spark Summit Europe – Big Ideas About Big Data - KDnuggets Offer - Aug 24, 2017.
Spark Summit will bring together more than 1,200 developers, data scientists, analysts, researchers, and business pros from around the world. Reg by Aug 25 to catch early bird rates and save extra 15% w. code KD824.
Apache Spark, Big Data, Databricks, Dublin, Summit
- A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets - Aug 22, 2017.
In this blog, I explore three sets of APIs—RDDs, DataFrames, and Datasets—available in a pre-release preview of Apache Spark 2.0; why and when you should use each set; outline their performance and optimization benefits; and enumerate scenarios when to use DataFrames and Datasets instead of RDDs.
Apache Spark, API
- Best Data Science, Machine Learning Courses from Udemy (only $10 or $12 till Aug 10) - Aug 6, 2017.
Back-to-school sale on best courses from Udemy, including Data Science, Machine Learning, Python, Spark, Tableau, and Hadoop - only $10 or $12 until Aug 10, 2017.
Apache Spark, Hadoop, Machine Learning, Online Education, Python, Tableau, Udemy
- Apache Flink: The Next Distributed Data Processing Revolution? - Jul 5, 2017.
Will Apache Flink displace Apache Spark as the new champion of Big Data Processing? We compare Spark and Apache Flink performance for batch processing and stream processing.
Apache Spark, Big Data, Flink, Streaming Analytics
- Pitfalls in pseudo-random number sampling at scale with Apache Spark - Jun 27, 2017.
Large scale simulation of random number generation is possible with today’s high speed & scalable distributed computing frameworks. Let’s understand how it can be achieved using Apache Spark.
Apache Spark, GitHub, Random, RDD
- Will Apache Spark Finally Advance Genomic Data Analysis? - Jun 23, 2017.
Spark has been useful in mapping out genetic traits that can be associated with certain diseases and the genetic makeup of microorganisms that live in our bodies.
Apache Spark, Data Analysis, Genomics
- Spark with Scala – ACM Professional Development Seminar, Santa Clara, Aug 5 - Jun 22, 2017.
This class will introduce Apache Spark 2, focusing on using it for data analysis. Taught by Sujee Maniyam on behalf of the local ACM chapter, SFbayACM.
Apache Spark, CA, Santa Clara, Scala, SFbayACM
- Best Data Science Courses from Udemy (only $10 till June 21) - Jun 19, 2017.
Here are some of the best courses in data science from Udemy, covering Data Science, Machine Learning, Python, Spark, Tableau, and Hadoop - only $10 until June 21, 2017.
Apache Spark, Hadoop, Machine Learning, Online Education, Python, Tableau, Udemy
- Machine learning made simple with Apache Spark - Jun 15, 2017.
Powered by Apache Spark, Databricks provides an end-to-end platform designed to help data engineers and data scientists easily implement advanced analytics at scale. Download the Making Machine Learning Simple Whitepaper from Databricks to learn more.
Apache Spark, Databricks, White Paper
- How Feature Engineering Can Help You Do Well in a Kaggle Competition – Part I - Jun 8, 2017.
As I scrolled through the leaderboard page, I found my name in the 19th position, which was in the top 2% of nearly 1,000 competitors. Not bad for the first Kaggle competition I had decided to put a real effort in!
Apache Spark, Feature Engineering, Jupyter, Kaggle, Machine Learning, Python
- Data Science for Newbies: An Introductory Tutorial Series for Software Engineers - May 31, 2017.
This post summarizes and links to the individual tutorials which make up this introductory look at data science for newbies, mainly focusing on the tools, with a practical bent, written by a software engineer from the perspective of a software engineering approach.
Apache Spark, Data Science, Jupyter, Machine Learning, Pandas, Python, Reddit, Scala, SQL
- Webinar: A New Era of Data Science – Unlocking Big Data Insights with Machine Learning and Spark, May 31 - May 19, 2017.
Learn about Big Data technologies and trends, Democratizing Big Data analytics, Big Data and the Cloud, and more in this webcast with top experts Dean Abbott and Mamdouh Refaat.
Angoss, Apache Spark, Dean Abbott, Machine Learning
- Best Data Science Courses from Udemy (only $10 till May 27) - May 17, 2017.
Here is a list of the best courses in data science from Udemy, covering Data Science, Machine Learning, Python, Spark, Tableau, and Hadoop - only $10 until May 27, 2017.
Apache Spark, Hadoop, Machine Learning, Online Education, Python, Tableau, Udemy
- Regeneron: Spark R&D Developer - May 10, 2017.
Seeking an R&D Spark Developer to join the Genome Informatics team to expand the RGC’s big data infrastructure and develop new algorithms/tools to support various workflows/analyses throughout the RGC and Regeneron.
Apache Spark, Developer, NY, R&D, Regeneron, Tarrytown
- Spark Summit – Explore the future of Data Science and Machine Learning, San Francisco, June 5-7 – KDnuggets Offer - Apr 25, 2017.
Choose from over 175 sessions in 10 different tracks, including Developer, Data Science, Enterprise Applications, Machine Learning, Streaming and Spark Experience & Use Cases. Save 15% with code KDNUGGETS.
Apache Spark, CA, Data Science, Developers, Hackathon, San Francisco, Summit
- Best Data Science Courses from Udemy (only $10 till Apr 29) - Apr 24, 2017.
Here is a list of the best courses in data science from Udemy, covering Data Science, Machine Learning, Python, Spark, Tableau, and Hadoop - only $10 until April 29, 2017.
Apache Spark, Hadoop, Machine Learning, Online Education, Python, Tableau, Udemy
- Webinar: R with RStudio, Spark, sparklyr in Minutes, April 26 - Apr 11, 2017.
Find out how to expand R capabilities with RStudio + sparklyr on Apache Spark on a fast cloud platform, and how simple it is to get started in the cloud with Cazena Data Science Sandbox as a Service.
Apache Spark, Cazena, Cloudera, Data Science Platform, Rstudio
- Grunion, Query Optimization Tool for Data Science and Big Data - Mar 14, 2017.
Grunion is a patent-pending query optimization, translation, and federation framework built to help bridge the gap between data science and data engineering teams. Read more to request access.
Apache Spark, Benchmark, Data Workflow, Datascience.com, NoSQL, SQL
- Best Data Science Courses from Udemy (only $19 till Mar 31) - Mar 10, 2017.
Here is a list of the best courses in data science from Udemy, covering Data Science, Machine Learning, Python, Spark, Tableau, and Hadoop - only $19 until March 31, 2017.
Apache Spark, Hadoop, Machine Learning, Online Education, Python, Tableau, Udemy
- Gartner Data Science Platforms – A Deeper Look - Mar 3, 2017.
Thomas Dinsmore's critical examination of the Gartner 2017 MQ of Data Science Platforms, including vendors who are out, in, or have big changes, Hadoop and Spark integration, open source software, and what Data Scientists actually use.
Apache Spark, Data Science Platform, Gartner, IBM, Python, R, SAS, Thomas Dinsmore
- The 6 Best Data Science Courses from Udemy (only $10 till Feb 28) - Feb 25, 2017.
Here is a list of the best courses in data science from Udemy, covering Data Science, Machine Learning, Python, Spark, Tableau, and Hadoop - only $10 until Feb 28, 2017.
Apache Spark, Hadoop, Machine Learning, Online Education, Python, Tableau, Udemy
- Apache Arrow and Apache Parquet: Why We Needed Different Projects for Columnar Data, On Disk and In-Memory - Feb 16, 2017.
Apache Parquet and Apache Arrow both focus on improving the performance and efficiency of data analytics. The two projects optimize performance for on-disk and in-memory processing, respectively.
Apache, Apache Arrow, Apache Spark, Data Science, Dremio, In-Memory Computing, Machine Learning, Python
- Spark Streaming Innovation Contest - Feb 15, 2017.
Build a Spark application on StreamAnalytix, a real-time streaming analytics platform, and win $10K. Register by March 31, 2017.
Apache Spark, Competition, Impetus, Streaming Analytics
- Navigating the World of Big Data Analytics - Dec 8, 2016.
Fulcrum Agile Analytics Lab helps our partners test new technologies, new methodologies, and new data sets quickly in an environment that can scale up and down and that meets all of their security and compliance requirements. Read on to learn more and schedule a consultation.
Agile, Analytics, Apache Spark, Big Data, Consulting, Hadoop
- Ten Take-Aways from IBM World of Watson - Nov 14, 2016.
“Enterprise applications, Cloud, Cognitive computing and IBM Watson.” Yes, you guessed it right. This article covers highlights of the 2016 World of Watson conference held in Las Vegas, NV.
AI, Apache Spark, Cognitive Computing, IBM, Watson
- How Hadoop, Spark, and Data Science are evolving – Nov 10 Webinar - Nov 8, 2016.
Find out how Hadoop and Spark are evolving for Data Science in this Nov 10 webinar and live Q&A with guest speaker, Forrester VP and Principal Analyst Mike Gualtieri.
Apache Spark, Cazena, Data Lakes, Hadoop, Mike Gualtieri
- Apache: Big Data Europe (Nov. 14-16) – Leading Event for Big Data Technologists - Oct 13, 2016.
Apache: Big Data Europe (Nov 14-16, Seville, Spain) will gather together the Apache projects, people and technologies working in Big Data, ubiquitous computing and data engineering and science to educate, collaborate and connect. Register by Nov 3 to save over $250!
Apache, Apache Spark, Big Data, Europe, Hadoop, Spain
- O’Reilly Live Training–Real-time. Real experts. Real learning. - Sep 26, 2016.
Get intensive, hands-on training from O'Reilly's expert network on critical data topics - from SQL fundamentals to distributed computing; enterprise strategy to data science at scale.
Apache Spark, Courses, Distributed Systems, Hadoop, O'Reilly, scikit-learn, SQL
- Spark for Scale: Machine Learning for Big Data - Sep 23, 2016.
This post discusses the fundamental concepts for working with big data using distributed computing, and introduces the tools you need to build machine learning models.
Apache Spark, Big Data, Hadoop, HDFS, Machine Learning, MapReduce
- KDnuggets™ News 16:n34, Sep 21: The Great Algorithm Tutorial Roundup; 7 Steps to Mastering Apache Spark 2.0 - Sep 21, 2016.
The Great Algorithm Tutorial Roundup; 7 Steps to Mastering Apache Spark 2.0; Machine Learning in a Year: From Total Noob to Effective Practitioner; Learning From Data (Introductory Machine Learning) Caltech MOOC
Algorithms, Apache Spark, Career, Data Scientist, Decision Trees, Machine Learning, MOOC
- 7 Steps to Mastering Apache Spark 2.0 - Sep 16, 2016.
Looking for a comprehensive guide on going from zero to Apache Spark hero in steps? Look no further! Written by our friends at Databricks, this exclusive guide provides a solid foundation for those looking to master Apache Spark 2.0.
7 Steps, Apache Spark, Databricks
- A simple approach to anomaly detection in periodic big data streams - Aug 24, 2016.
We describe a simple and scalable algorithm that can detect rare and potentially irregular behavior in a time series with periodic patterns. It performs similarly to Twitter's more complex approach.
Anomaly Detection, Apache Spark, BMW, Time Series, Twitter
- Big Data Key Terms, Explained - Aug 11, 2016.
Just getting started with Big Data, or looking to iron out the wrinkles in your current understanding? Check out these 20 Big Data-related terms and their concise definitions.
3Vs of Big Data, Apache Spark, Big Data, Business Intelligence, Cloud Computing, Data Warehouse, Explained, Hadoop, Key Terms, Predictive Analytics
- Online Courses: Big Data Projects and Data Science Pipelines - Jul 14, 2016.
Check out these online courses from O'Reilly Media on managing big data projects and building distributed data pipelines.
Apache Spark, Big Data, Cassandra, Kafka, O'Reilly
- BigDebug: Debugging Primitives for Interactive Big Data Processing in Spark - Jun 27, 2016.
An overview of a recent paper outlining BigDebug, which provides real-time interactive debugging support for Data-Intensive Scalable Computing (DISC) systems, or more particularly, Apache Spark.
Apache Spark, Big Data, Data Science
- Achieving End-to-end Security for Apache Spark with Databricks - Jun 23, 2016.
The Databricks just-in-time data platform takes a holistic approach to solving the enterprise security challenge by building all the facets of security — encryption, identity management, role-based access control, data governance, and compliance standards — natively into the data platform with DBES.
Apache Spark, Databricks, Security
- Apache Spark Key Terms, Explained - Jun 13, 2016.
An overview of 13 core Apache Spark concepts, presented with focus and clarity in mind. A great beginner's overview of essential Spark terminology.
Apache Spark, Databricks, Dataset, Explained, Key Terms, RDD, Tungsten
- Infinite Data Overlap Detection Arrives to Speed Business Insights - Jun 8, 2016.
Infinite Data Overlap Detection (IDOD) is a new, Spark-based technology that empowers non-technical business users to automatically discover data patterns and blend any data type for any set of values from multiple sources – both inside and outside the enterprise.
Apache Spark, ClearStory Data, Data Cleaning, Data Preparation
- Open Data Science in Collaborative Workflows – IBM June 6 event - Jun 3, 2016.
On June 6, IBM will share important announcements for making R, Spark, and open data science a sustainable business reality at the Apache Spark Maker Community Event in San Francisco. Attend in person or watch live.
Apache Spark, Collaborative, Data Science, IBM, R
- Analyzing Log Data with Spark and Looker - May 31, 2016.
Learn how to set up a modern pipeline that collects, processes, and analyzes high-volume, machine-generated data. This on-demand webinar discusses popular collection mechanisms, does a hands-on log-parsing example in Spark, and shows how to use Looker to get insights from event data.
Apache Spark, Looker, Streaming Analytics
- Jupyter+Spark+Mesos: An “Opinionated” Docker Image - May 31, 2016.
Check out these "opinionated" Docker-based stacks for Jupyter, including one that combines Jupyter and Spark right out of the gate.
Apache Spark, Docker, IBM, Jupyter
- Hadoop Key Terms, Explained - May 30, 2016.
A straightforward overview of 16 core Hadoop ecosystem concepts. No Big Picture discussion, just the facts.
Apache Spark, Explained, Hadoop, HBase, HDFS, Key Terms, MapReduce, YARN
- Be Part of Spark Summit 2016, the Premier Big Data Event Dedicated to Apache Spark - May 25, 2016.
Whether you’re an Apache Spark newbie or a hardcore enthusiast, Spark Summit, June 6-8 in San Francisco, is the place to be to gain new insights and make valuable connections. Use promo code KDNuggets to save 15%.
Apache Spark, CA, Databricks, San Francisco
- Boosting Productivity of the Next-Generation Data Scientist: IBM June 6 event - May 20, 2016.
On June 6, IBM will share important announcements for making R, Spark, and open data science a sustainable business reality at the Apache Spark Maker Community Event in San Francisco. Attend in person or watch live.
Apache Spark, IBM, R
- Spark 2.0 Preview Now on Databricks Community Edition: Easier, Faster, Smarter - May 17, 2016.
The preview of Spark 2.0 is here, and it promises to be easier, faster, and smarter.
Apache Spark, Databricks, SQL
- Spark with Tungsten Burns Brighter - May 4, 2016.
Apache Spark is one of the hottest technologies for data science and analytics. A project called Tungsten represents a huge leap forward for Spark, particularly in the area of performance. Understand how it works, and why it improves Spark performance so much.
Apache Spark, In-Memory Computing, Tungsten
- Top Data Science Courses on Udemy - Apr 27, 2016.
An overview of the very best that Udemy has to offer in data science education. Includes courses covering machine learning, Python, Hadoop, visualization, and more.
Apache Spark, Brendan Martin, Data Science, Hadoop, Machine Learning, Python, Udemy
- Analyzing Log Data with Spark & Looker, Webinar May 18 - Apr 22, 2016.
Learn how to set up a modern pipeline that collects, processes, and analyzes high-volume, machine-generated data. We’ll talk about popular collection mechanisms, do a hands-on log-parsing example in Spark, and discuss how to use Looker to get insights from event data.
Apache Spark, Looker, Streaming Analytics
- XGBoost: Implementing the Winningest Kaggle Algorithm in Spark and Flink - Mar 24, 2016.
An overview of XGBoost4J, a JVM-based implementation of XGBoost, one of the most successful recent machine learning algorithms in Kaggle competitions, with distributed support for Spark and Flink.
Apache Spark, Distributed Systems, Flink, Kaggle, XGBoost
- Introducing GraphFrames, a Graph Processing Library for Apache Spark - Mar 7, 2016.
An overview of Spark's new GraphFrames, a graph processing library based on DataFrames, built in a collaboration between Databricks, UC Berkeley's AMPLab, and MIT.
Apache Spark, Databricks, Graph Analytics
- Apache Big Data, Vancouver, May 9-12, KDnuggets Discount, Early bird ends Mar 6 - Mar 4, 2016.
Apache Big Data brings together the full suite of Big Data open source projects - check the amazing lineup of keynotes and breakout sessions and save with code APBD16KDN20.
Apache, Apache Spark, Big Data, Canada, Doug Cutting, Hadoop, Matei Zaharia, Vancouver
- Top Big Data Processing Frameworks - Mar 3, 2016.
A discussion of 5 Big Data processing frameworks: Hadoop, Spark, Flink, Storm, and Samza. An overview of each is given and comparative insights are provided, along with links to external resources on particular related topics.
Apache Samza, Apache Spark, Apache Storm, Flink, Hadoop
- Top Spark Ecosystem Projects - Mar 2, 2016.
Apache Spark has developed a rich ecosystem, including both official and third party tools. We have a look at 5 third party projects which complement Spark in 5 different ways.
Apache Mesos, Apache Spark, Cassandra, Databricks, Distributed Systems
- KDnuggets™ News 16:n08, Mar 2: Citizen Data Scientist Mirage; Spark Tipping Point; 80% Machine Learning - Mar 2, 2016.
The Mirage of a Citizen Data Scientist; Why Spark Reached the Tipping Point in 2015; The Machine Learning Problem of The Next Decade; How The Algorithm Economy And Containers Are Changing The Apps.
Algorithms, Apache Spark, Data Visualization, IBM Watson, TensorFlow