- Difference between distributed learning versus federated learning algorithms - Nov 19, 2021.
Want to know the difference between distributed and federated learning? Read this article to find out.
Algorithms, Distributed Systems, Federated Learning
- How to Speed Up Pandas with Modin - Mar 10, 2021.
The Modin library can scale your pandas workflows with a one-line code change, and it integrates with the Python ecosystem and Ray clusters. This tutorial goes over how to get started with Modin and how it can speed up your pandas workflows.
Data Science, Distributed Systems, Modin, Pandas, Python, Workflow
- Getting Started with Distributed Machine Learning with PyTorch and Ray - Mar 3, 2021.
Ray is a popular framework for distributed Python that can be paired with PyTorch to rapidly scale machine learning applications.
Distributed Systems, Machine Learning, Python, PyTorch
- How to Speed up Scikit-Learn Model Training - Feb 11, 2021.
Scikit-Learn is an easy-to-use Python library for machine learning. However, scikit-learn models can sometimes take a long time to train. The question becomes: how do you create the best scikit-learn model in the least amount of time?
Distributed Systems, Hyperparameter, Machine Learning, Optimization, Parallelism, Python, scikit-learn, Training
- Train sklearn 100x Faster - Sep 11, 2019.
As compute gets cheaper and time to market for machine learning solutions becomes more critical, we’ve explored options for speeding up model training. One of those solutions is to combine elements from Spark and scikit-learn into our own hybrid solution.
Distributed Systems, Machine Learning, Python, scikit-learn, Training
- Distributed Artificial Intelligence: A primer on Multi-Agent Systems, Agent-Based Modeling, and Swarm Intelligence - Apr 18, 2019.
Distributed Artificial Intelligence (DAI) is a class of technologies and methods that span from swarm intelligence to multi-agent technologies. It is one of the subsets of AI where simulation has greater importance than point prediction.
AI, Distributed Systems, Modeling, Swarm Intelligence
- Regeneron Pharmaceuticals: Genomics Data Scientist (Distributed Systems) (Tarrytown, NY) - Feb 15, 2019.
Seeking an R&D Spark Developer to join the Genome Informatics team to expand the RGC’s big data infrastructure and develop new algorithms/tools to support various workflows/analyses throughout the RGC and Regeneron.
Data Scientist, Distributed Systems, Genomics, NY, Regeneron Pharmaceuticals, Tarrytown
- Kinetica: Sr. Software Engineer (Machine Learning) [Arlington, VA] - Aug 21, 2018.
Join an accomplished team to help build out a new scalable, distributed machine learning and data science platform with tight integrations and pipelines to a distributed, sharded GPU-powered database.
Database, Distributed Systems, GPU, Kinetica, Machine Learning, Software Engineer
- Regeneron Pharmaceuticals: Genomics Data Scientist (Distributed Systems) (Tarrytown, NY) - Aug 2, 2018.
Seeking an R&D Spark Developer to join the Genome Informatics team to expand the RGC’s big data infrastructure and develop new algorithms/tools to support various workflows/analyses throughout the RGC and Regeneron.
Data Scientist, Distributed Systems, Genomics, NY, Regeneron Pharmaceuticals, Tarrytown
- Introduction to Apache Spark - Jul 6, 2018.
This is the first blog in this series to analyze Big Data using Spark. It provides an introduction to Spark and its ecosystem.
Apache Spark, Data Processing, Distributed Systems
- Ranking Popular Distributed Computing Packages for Data Science - Mar 20, 2018.
We examined 140 frameworks and distributed programming packages and came up with a list of the top 20 distributed computing packages useful for Data Science, based on a combination of GitHub, Stack Overflow, and Google results.
Apache Spark, Data Science, Distributed Systems, GitHub, Hadoop
- Introducing Dask-SearchCV: Distributed hyperparameter optimization with Scikit-Learn - May 12, 2017.
We introduce a new library for doing distributed hyperparameter optimization with Scikit-Learn estimators. We compare it to the existing Scikit-Learn implementations, and discuss when it may be useful compared to other approaches.
Dask, Distributed Computing, Distributed Systems, Machine Learning, Optimization, scikit-learn
- 5 Machine Learning Projects You Can No Longer Overlook, May - May 10, 2017.
In this month's installment of Machine Learning Projects You Can No Longer Overlook, we find some data preparation and exploration tools, a (the?) reinforcement learning "framework," a new automated machine learning library, and yet another distributed deep learning library.
Automated Machine Learning, Data Exploration, Deep Learning, Distributed Systems, Machine Learning, Overlook, Pandas, Reinforcement Learning
- Dask and Pandas and XGBoost: Playing nicely between distributed systems - Apr 27, 2017.
This blog post gives a quick example using Dask.dataframe to do distributed Pandas data wrangling, then using a new dask-xgboost package to set up an XGBoost cluster inside the Dask cluster and perform the handoff.
Dask, Distributed Systems, Pandas, Python, XGBoost
- O’Reilly Live Training–Real-time. Real experts. Real learning. - Sep 26, 2016.
Get intensive, hands-on training from O'Reilly's expert network on critical data topics - from SQL fundamentals to distributed computing; enterprise strategy to data science at scale.
Apache Spark, Courses, Distributed Systems, Hadoop, O'Reilly, scikit-learn, SQL
- Understanding Modern Data Systems - Jun 2, 2016.
A look at the four characteristics that differentiate data infrastructure development from traditional development, and the key issues to look out for.
Big Data, Data Infrastructure, Distributed Systems
- XGBoost: Implementing the Winningest Kaggle Algorithm in Spark and Flink - Mar 24, 2016.
An overview of XGBoost4J, a JVM-based implementation of XGBoost, one of the most successful recent machine learning algorithms in Kaggle competitions, with distributed support for Spark and Flink.
Apache Spark, Distributed Systems, Flink, Kaggle, XGBoost
- Top Spark Ecosystem Projects - Mar 2, 2016.
Apache Spark has developed a rich ecosystem, including both official and third party tools. We have a look at 5 third party projects which complement Spark in 5 different ways.
Apache Mesos, Apache Spark, Cassandra, Databricks, Distributed Systems
- Distributed TensorFlow Has Arrived - Mar 1, 2016.
Google has open sourced its distributed version of TensorFlow. Get the info on it here, and catch up on some other TensorFlow news at the same time.
Deep Learning, Distributed Systems, Google, Matthew Mayo, TensorFlow
- Yahoo! CaffeOnSpark: Distributed Deep Learning on Big Data Clusters - Feb 29, 2016.
Get an overview of Yahoo!'s CaffeOnSpark, the latest entrant into the world of distributed deep learning, directly from the developers.
Apache Spark, Caffe, Deep Learning, Distributed Systems
- Big Data Projects and Distributed Data Science Pipelines – online courses - Feb 15, 2016.
If you're managing big data projects or building distributed data science systems, you will find these online courses very useful: Building Distributed Pipelines for Data, March 1-3, and Managing Successful Big Data Projects, March 15-16.
Big Data, Data Science, Distributed Systems, O'Reilly, Online Education, Project Fail
- Deep Learning with Spark and TensorFlow - Jan 28, 2016.
The integration of TensorFlow with Spark leverages the distributed framework for hyperparameter tuning and model deployment at scale. Both time savings and improved error rates are demonstrated.
Apache Spark, Deep Learning, Distributed Systems, TensorFlow
- Spark + Deep Learning: Distributed Deep Neural Network Training with SparkNet - Dec 4, 2015.
Training deep neural nets can take precious time and resources. By leveraging an existing distributed batch processing framework, SparkNet can train neural nets quickly and efficiently.
Apache Spark, Caffe, Deep Learning, Distributed Systems, H2O, Matthew Mayo, Neural Networks
- The Big ‘Big Data’ Question: Hadoop or Spark? - Aug 5, 2015.
Because they share a considerable number of similarities, Hadoop and Spark are often wrongly considered the same. Bernard carefully explains the differences between the two and how to choose the right one (or both) for your business needs.
Apache Spark, Bernard Marr, Data Science Tools, Distributed Systems, Hadoop, Machine Learning, Performance, RDD
- Interview: James Taylor, Salesforce on Phoenix + HBase – The Future of Big Data - Jun 6, 2015.
We discuss the advantages of Phoenix, upcoming features, forthcoming support for transactions, trends, advice, and more.
Apache Phoenix, Distributed Systems, Future, HBase, Interview, James Taylor, Salesforce
- Interview: Dave McCrory, Basho on Why Data Gravity Cannot be Ignored in Architecture Design - Mar 17, 2015.
We discuss data gravity and its implications, Riak Enterprise 2.0, Riak CS 1.5, the competitive landscape, challenges, and more.
Basho, Challenges, Competition, Dave McCrory, Distributed Systems, Interview
- Interview: Dave McCrory, Basho on Distributed Database Needs of a Future Enterprise - Mar 16, 2015.
We discuss the future of distributed storage for the enterprise, scale-up vs. scale-out, software design patterns in the cloud era, the microservices model, and the place for legacy databases in modern enterprise IT.
Basho, Cloud Computing, Databases, Dave McCrory, Distributed Systems, Integration, Interview, SQL
- Interview: Peter Alvaro, UC Berkeley, on Consistency Challenge in Distributed Systems - Dec 17, 2014.
We discuss the performance limitations caused by treating the datastore as a black box, consistency as an application-level property, Dedalus, and the LDFI approach to testing.
Consistency, Data, Databases, Distributed Systems, NoSQL, Peter Alvaro, UC Berkeley