Search results for hdfs

Found 65 documents, 5922 searched:

HDFS vs. HBase : All you need to know
Hadoop Distributed File System (HDFS) and HBase (Hadoop database) are key components of the Big Data ecosystem. This blog explains the difference between HDFS and HBase with real-life use cases where each is the best fit.
https://www.kdnuggets.com/2017/05/hdfs-hbase-need-know.html
Best Practices for Building ETLs for ML
This article talks about several best practices for writing ETLs for building training datasets. It delves into several software engineering techniques and patterns applied to ML.
https://www.kdnuggets.com/best-practices-for-building-etls-for-ml
Working with Big Data: Tools and Techniques
Where do you start in a field as vast as big data? Which tools and techniques to use? We explore this and talk about the most common tools in big data.
https://www.kdnuggets.com/working-with-big-data-tools-and-techniques
How to Digest 15 Billion Logs Per Day and Keep Big Queries Within 1 Second
This article describes a large-scale data warehousing use case to provide reference for data engineers who are looking for log analytic solutions. It introduces the log processing architecture and real-case practice in data ingestion, storage, and queries.
https://www.kdnuggets.com/how-to-digest-15-billion-logs-per-day-and-keep-big-queries-within-1-second
Scaling Data Management Through Apache Gobblin
Software companies can manage big data at a hyper-scale on different infrastructure stacks using Apache Gobblin.
https://www.kdnuggets.com/2023/01/scaling-data-management-apache-gobblin.html
Top 10 MLOps Tools to Optimize & Manage Machine Learning Lifecycle
As more businesses experiment with data, they realize that developing a machine learning (ML) model is only one of many steps in the ML lifecycle.
https://www.kdnuggets.com/2022/10/top-10-mlops-tools-optimize-manage-machine-learning-lifecycle.html
Movie Recommendations with Spark Collaborative Filtering
Not sure what movie to watch? Ask your recommender system.
https://www.kdnuggets.com/2021/12/movie-recommendations-spark-collaborative-filtering.html
Inside recommendations: how a recommender system recommends
We describe types of recommender systems, more specifically, algorithms and methods for content-based systems, collaborative filtering, and hybrid systems.
https://www.kdnuggets.com/2021/11/recommendations-recommender-system.html
CSV Files for Storage? No Thanks. There’s a Better Option
Saving data to CSV’s is costing you both money and disk space. It’s time to end it.
https://www.kdnuggets.com/2021/08/csv-files-storage-better-option.html
Apache Spark Cluster on Docker
Build your own Apache Spark cluster in standalone mode on Docker with a JupyterLab interface.
https://www.kdnuggets.com/2020/07/apache-spark-cluster-docker.html
Some Things Uber Learned from Running Machine Learning at Scale
Uber machine learning runtime Michelangelo has been in operation for a few years. What has the Uber team learned?
https://www.kdnuggets.com/2020/07/some-things-uber-learned-machine-learning-scale.html
The Architecture Used at LinkedIn to Improve Feature Management in Machine Learning Models
The new typed feature schema streamlined the reusability of features across thousands of machine learning models.
https://www.kdnuggets.com/2020/05/architecture-linkedin-feature-management-machine-learning-models.html
The Benefits & Examples of Using Apache Spark with PySpark
Apache Spark runs fast, offers robust, distributed, fault-tolerant data objects, and integrates beautifully with the world of machine learning and graph analytics. Learn more here.
https://www.kdnuggets.com/2020/04/benefits-apache-spark-pyspark.html
State of the Machine Learning and AI Industry
Enterprises are struggling to launch machine learning models that encapsulate the optimization of business processes. These are now the essential components of data-driven applications and AI services that can improve legacy rule-based business processes, increase productivity, and deliver results. In the current state of the industry, many companies are turning to off-the-shelf platforms to increase expectations for success in applying machine learning.
https://www.kdnuggets.com/2020/04/machine-learning-ai-industry.html
Everything a Data Scientist Should Know About Data Management
For full-stack data science mastery, you must understand data management along with all the bells and whistles of machine learning. This high-level overview is a road map for the history and current state of the expansive options for data storage and infrastructure solutions.
https://www.kdnuggets.com/2019/10/data-scientist-data-management.html
How to Become More Marketable as a Data Scientist
As a data scientist, you are in high demand. So, how can you increase your marketability even more? Check out these current trends in skills most desired by employers in 2019.
https://www.kdnuggets.com/2019/08/marketable-data-scientist.html
Learn how to use PySpark in under 5 minutes (Installation + Tutorial)
Apache Spark is one of the hottest and largest open source projects in data processing, with rich high-level APIs for programming languages like Scala, Python, Java and R. It realizes the potential of bringing together both Big Data and machine learning.
https://www.kdnuggets.com/2019/08/learn-pyspark-installation-tutorial.html
The Data Science Gold Rush: Top Jobs in Data Science and How to Secure Them
Because big data touches almost every industry across the board, those who aren’t already working in data and analytics will soon be utilizing the technology for its undeniable business benefits. Whichever way you slice it, the future of work is through data.
https://www.kdnuggets.com/2019/01/top-jobs-data-science.html
Ontology and Data Science
In simple words, one can say that ontology is the study of what there is. But there is another part to that definition that will help us in the following sections, and that is ontology is usually also taken to encompass problems about the most general features and relations of the entities which do exist.
https://www.kdnuggets.com/2019/01/ontology-data-science.html
Practical Apache Spark in 10 Minutes
Check out this series of articles on Apache Spark. Each part is a 10 minute tutorial on a particular Apache Spark topic. Read on to get up to speed using Spark.
https://www.kdnuggets.com/2019/01/practical-apache-spark-10-minutes.html
Apache Spark Introduction for Beginners
An extensive introduction to Apache Spark, including a look at the evolution of the product, use cases, architecture, ecosystem components, core concepts and more.
https://www.kdnuggets.com/2018/10/apache-spark-introduction-beginners.html
Things you should know when traveling via the Big Data Engineering hype-train
Maybe you want to join the Big Data world? Or maybe you are already there and want to validate your knowledge? Or maybe you just want to know what Big Data Engineers do and what skills they use? If so, you may find the following article quite useful.
https://www.kdnuggets.com/2018/10/big-data-engineering-hype-train.html
Hadoop for Beginners
An introduction to Hadoop, a framework that enables you to store and process large data sets in parallel and distributed fashion.
https://www.kdnuggets.com/2018/09/hadoop-beginners.html
Introduction to Apache Spark
This is the first blog in this series to analyze Big Data using Spark. It provides an introduction to Spark and its ecosystem.
https://www.kdnuggets.com/2018/07/introduction-apache-spark.html
Apache Spark : Python vs. Scala
When it comes to using the Apache Spark framework, the data science community is divided into two camps: one prefers Scala, the other Python. This article compares the two, listing their pros and cons.
https://www.kdnuggets.com/2018/05/apache-spark-python-scala.html
Presto for Data Scientists – SQL on anything
Presto enables data scientists to run interactive SQL across multiple data sources. This open source engine supports querying anything, anywhere, and at large scale.
https://www.kdnuggets.com/2018/04/presto-data-scientists-sql.html
Deep Learning With Apache Spark: Part 1
First part on a full discussion on how to do Distributed Deep Learning with Apache Spark. This part: What is Spark, basics on Spark+DL and a little more.
https://www.kdnuggets.com/2018/04/deep-learning-apache-spark-part-1.html
Ranking Popular Distributed Computing Packages for Data Science
We examined 140 frameworks and distributed programming packages and came up with a list of top 20 distributed computing packages useful for Data Science, based on a combination of Github, Stack Overflow, and Google results.
https://www.kdnuggets.com/2018/03/top-distributed-computing-packages-data-science.html
My Journey into Deep Learning
In this post I’ll share how I’ve been studying Deep Learning and using it to solve data science problems. It’s an informal post but with interesting content (I hope).
https://www.kdnuggets.com/2018/01/journey-into-deep-learning.html
Comparing Machine Learning as a Service: Amazon, Microsoft Azure, Google Cloud AI
A complete and unbiased comparison of the three most common Cloud Technologies for Machine Learning as a Service.
https://www.kdnuggets.com/2018/01/mlaas-amazon-microsoft-azure-google-cloud-ai.html
Big Data: Main Developments in 2017 and Key Trends in 2018
As we bid farewell to one year and look to ring in another, KDnuggets has solicited opinions from numerous Big Data experts as to the most important developments of 2017 and their 2018 key trend predictions.
https://www.kdnuggets.com/2017/12/big-data-main-developments-2017-key-trends-2018.html
Graph Analytics Using Big Data
An overview and a small tutorial showing how to analyze a dataset using Apache Spark, graphframes, and Java.
https://www.kdnuggets.com/2017/12/graph-analytics-using-big-data.html
Updates & Upserts in Hadoop Ecosystem with Apache Kudu
A new open source Apache Hadoop ecosystem project, Apache Kudu completes Hadoop's storage layer to enable fast analytics on fast data.
https://www.kdnuggets.com/2017/10/updates-upserts-hadoop-ecosystem-apache-kudu.html
Are Data Lakes Fake News?
The quick answer is yes, and the biggest problem is that the term “Data Lakes” has been overloaded by vendors and analysts with different meanings, resulting in an ill-defined and blurry concept.
https://www.kdnuggets.com/2017/09/data-lakes-fake-news.html
277 Data Science Key Terms, Explained
This is a collection of 277 data science key terms, explained with a no-nonsense, concise approach. Read on to find terminology related to Big Data, machine learning, natural language processing, descriptive statistics, and much more.
https://www.kdnuggets.com/2017/09/data-science-key-terms-explained.html
Apache Flink: The Next Distributed Data Processing Revolution?
Will Apache Flink displace Apache Spark as the new champion of Big Data Processing? We compare Spark and Apache Flink performance for batch processing and stream processing.
https://www.kdnuggets.com/2017/07/apache-flink-distributed-data-processing-revolution.html
How Feature Engineering Can Help You Do Well in a Kaggle Competition – Part I
As I scroll through the leaderboard page, I found my name in the 19th position, which was the top 2% from nearly 1,000 competitors. Not bad for the first Kaggle competition I had decided to put a real effort in!
https://www.kdnuggets.com/2017/06/feature-engineering-help-kaggle-competition-1.html
Top Stories, May 22-28: Analytics, Data Science, Machine Learning Software Poll Results; Machine Learning Crash Course
New Leader, Trends, and Surprises in Analytics, Data Science, Machine Learning Software Poll; Machine Learning Crash Course: Part 1; Text Mining 101: Mining Information From A Resume; Data science platforms are on the rise and IBM is leading the way; An Introduction to the MXNet Python API
https://www.kdnuggets.com/2017/05/top-news-week-0522-0528.html
Simplifying Data Pipelines in Hadoop: Overcoming the Growing Pains
Moving to Hadoop is not without its challenges—there are so many options, from tools to approaches, that can have a significant impact on the future success of a business’ strategy. Data management and data pipelining can be particularly difficult.
https://www.kdnuggets.com/2017/05/simplify-data-pipelines-hadoop.html
Data Science & Machine Learning Platforms for the Enterprise
A resilient Data Science Platform is a necessity to every centralized data science team within a large corporation. It helps them centralize, reuse, and productionize their models at peta scale.
https://www.kdnuggets.com/2017/05/data-science-machine-learning-platforms-enterprise.html
Apache Arrow and Apache Parquet: Why We Needed Different Projects for Columnar Data, On Disk and In-Memory
Apache Parquet and Apache Arrow both focus on improving performance and efficiency of data analytics. These two projects optimize performance for on-disk and in-memory processing.
https://www.kdnuggets.com/2017/02/apache-arrow-parquet-columnar-data.html
Why the Data Scientist and Data Engineer Need to Understand Virtualization in the Cloud
This article covers the value of understanding the virtualization constructs for the data scientist and data engineer as they deploy their analysis onto all kinds of cloud platforms. Virtualization is a key enabling layer of software for these data workers to be aware of and to achieve optimal results from.
https://www.kdnuggets.com/2017/01/data-scientist-engineer-understand-virtualization-cloud.html
50+ Data Science, Machine Learning Cheat Sheets, updated
Gear up to speed and have concepts and commands handy in Data Science, Data Mining, and Machine learning algorithms with these cheat sheets covering R, Python, Django, MySQL, SQL, Hadoop, Apache Spark, Matlab, and Java.
https://www.kdnuggets.com/2016/12/data-science-machine-learning-cheat-sheets-updated.html
Evaluating HTAP Databases for Machine Learning Applications
Businesses are producing a greater number of intelligent applications, which traditional databases are unable to support. A new class of databases, Hybrid Transactional and Analytical Processing (HTAP) databases, offers a variety of capabilities with specific strengths and weaknesses to consider. This article aims to give application developers and data scientists a better understanding of the HTAP database ecosystem so they can make the right choice for their intelligent application.
https://www.kdnuggets.com/2016/11/evaluating-htap-databases-machine-learning-applications.html
The top 5 Big Data courses to help you break into the industry
Here is an updated and in-depth review of the top 5 providers of Big Data and Data Science courses: Simplilearn, Cloudera, Big Data University, Hortonworks, and Coursera.
https://www.kdnuggets.com/2016/08/simplilearn-5-big-data-courses.html
Big Data Key Terms, Explained
Just getting started with Big Data, or looking to iron out the wrinkles in your current understanding? Check out these 20 Big Data-related terms and their concise definitions.
https://www.kdnuggets.com/2016/08/big-data-key-terms-explained.html
Apache Spark Key Terms, Explained
An overview of 13 core Apache Spark concepts, presented with focus and clarity in mind. A great beginner's overview of essential Spark terminology.
https://www.kdnuggets.com/2016/06/spark-key-terms-explained.html
R, Python Duel As Top Analytics, Data Science software – KDnuggets 2016 Software Poll Results
R remains the leading tool, with 49% share, but Python grows faster and almost catches up to R. RapidMiner remains the most popular general Data Science platform. Big Data tools used by almost 40%, and Deep Learning usage doubles.
https://www.kdnuggets.com/2016/06/r-python-top-analytics-data-mining-data-science-software.html
Top 10 Data Science Resources on Github
The top 10 data science projects on Github are chiefly composed of a number of tutorials and educational resources for learning and doing data science. Have a look at the resources others are using and learning from.
https://www.kdnuggets.com/2016/03/top-10-data-science-github.html
Top Big Data Processing Frameworks
A discussion of 5 Big Data processing frameworks: Hadoop, Spark, Flink, Storm, and Samza. An overview of each is given and comparative insights are provided, along with links to external resources on particular related topics.
https://www.kdnuggets.com/2016/03/top-big-data-processing-frameworks.html
Top Spark Ecosystem Projects
Apache Spark has developed a rich ecosystem, including both official and third party tools. We have a look at 5 third party projects which complement Spark in 5 different ways.
https://www.kdnuggets.com/2016/03/top-spark-ecosystem-projects.html
Python Data Science with Pandas vs Spark DataFrame: Key Differences
A post describing the key differences between Pandas and Spark's DataFrame format, including specifics on important regular processing features, with code samples.
https://www.kdnuggets.com/2016/01/python-data-science-pandas-spark-dataframe-differences.html
50 Deep Learning Software Tools and Platforms, Updated
We present the popular software & toolkit resources for Deep Learning, including Caffe, Cuda-convnet, Deeplearning4j, Pylearn2, Theano, and Torch. Explore the new list!
https://www.kdnuggets.com/2015/12/deep-learning-tools.html
50+ Data Science and Machine Learning Cheat Sheets
Gear up to speed and have Data Science & Data Mining concepts and commands handy with these cheatsheets covering R, Python, Django, MySQL, SQL, Hadoop, Apache Spark and Machine learning algorithms.
https://www.kdnuggets.com/2015/07/good-data-science-machine-learning-cheat-sheets.html
R leads RapidMiner, Python catches up, Big Data tools grow, Spark ignites
R is the most popular overall tool among data miners, although Python usage is growing faster. RapidMiner continues to be most popular suite for data mining/data science. Hadoop/Big Data tools usage grew to 29%, propelled by 3x growth in Spark. Other tools with strong growth include H2O (0xdata), Actian, MLlib, and Alteryx.
https://www.kdnuggets.com/2015/05/poll-r-rapidminer-python-big-data-spark.html
Hadoop as a Service: 18 Cloud Options
Hadoop as a service in the cloud makes big data applications and projects easier to approach and these 18 platforms each provide their own unique solutions.
https://www.kdnuggets.com/2015/04/hadoop-as-service-18-cloud-options.html
Interview: Arno Candel, H2O.ai on the Basics of Deep Learning to Get You Started
We discuss how Deep Learning is different from the other methods of Machine Learning, unique characteristics and benefits of Deep Learning, and the key components of H2O architecture.
https://www.kdnuggets.com/2015/01/interview-arno-candel-0xdata-deep-learning.html
IE Masters in Analytics and Big Data – first hand report
First hand report on Master in business analytics and big data program at IE (Madrid, Spain) - why, what, how, days, and challenges.
https://www.kdnuggets.com/2015/01/ie-data-science-education-first-hand-report.html
16 NoSQL, NewSQL Databases To Watch
NoSQL and NewSQL databases have become much more important with the proliferation of big, mobile, and networked data, and these sixteen database solutions are some of the biggest up-and-comers.
https://www.kdnuggets.com/2014/12/16-nosql-newsql-databases-to-watch.html
R and Hadoop make Machine Learning Possible for Everyone
R and Hadoop make machine learning approachable enough for inexperienced users to begin analyzing and visualizing interesting data to start down the path in this lucrative field.
https://www.kdnuggets.com/2014/11/r-hadoop-make-machine-learning-possible-everyone.html
18 essential Hadoop tools
Hadoop tools develop at a rapid rate, and keeping up with the latest can be difficult. Here we detail 18 of the most essential tools that work well with Hadoop.
https://www.kdnuggets.com/2014/08/18-essential-hadoop-tools.html
KDnuggets Analytics, Data Mining, Data Science Software Poll – Analyzed
We analyze the results of KDnuggets Software Poll, including correlations between tools, and relationships between commercial, free, and Hadoop/Big Data tools. We identify a potential capability gap. Download anonymized data and analyze it yourself.
https://www.kdnuggets.com/2014/06/analytics-data-mining-data-science-software-poll-analyzed.html
KDnuggets 15th Annual Analytics, Data Mining, Data Science Software Poll: RapidMiner Continues To Lead
With over 3,000 data miners taking part in KDnuggets 15th Annual Software Poll, RapidMiner continues to lead. Free software is used much more outside US, and Hadoop usage grows fastest in Asia.
https://www.kdnuggets.com/2014/06/kdnuggets-annual-software-poll-rapidminer-continues-lead.html
Poll Results: Data Types/Sources Analyzed
Trends in data sources for data mining include: table data dominates, followed by time series and text; audio, JSON grows in popularity, while itemsets decline; 70% access DB engines, but only 20% access NoSQL stores; Hadoop, MongoDB used more for text; Europe is lagging in NoSQL usage.
https://www.kdnuggets.com/2014/05/poll-results-data-types-sources-analyzed.html
KDnuggets™ News 13:n02, Jan 30
Features (10) | Software (4) | Courses, Events (2) | Webcasts (3) | Jobs (12) | Academic (5) | Competitions (4) | Publications (12) | NewsBriefs
https://www.kdnuggets.com/2013/n02.html
