Search results for hdfs

    Found 65 documents, 5922 searched:

  • HDFS vs. HBase : All you need to know (Silver Blog, May 2017)

    Hadoop Distributed File System (HDFS) and HBase (Hadoop database) are key components of the Big Data ecosystem. This blog explains the difference between HDFS and HBase, with real-life use cases where each is the best fit.

    https://www.kdnuggets.com/2017/05/hdfs-hbase-need-know.html

  • Best Practices for Building ETLs for ML

    This article talks about several best practices for writing ETLs for building training datasets. It delves into several software engineering techniques and patterns applied to ML.

    https://www.kdnuggets.com/best-practices-for-building-etls-for-ml

  • Working with Big Data: Tools and Techniques

    Where do you start in a field as vast as big data? Which tools and techniques to use? We explore this and talk about the most common tools in big data.

    https://www.kdnuggets.com/working-with-big-data-tools-and-techniques

  • How to Digest 15 Billion Logs Per Day and Keep Big Queries Within 1 Second

    This article describes a large-scale data warehousing use case to provide reference for data engineers who are looking for log analytic solutions. It introduces the log processing architecture and real-case practice in data ingestion, storage, and queries.

    https://www.kdnuggets.com/how-to-digest-15-billion-logs-per-day-and-keep-big-queries-within-1-second

  • Scaling Data Management Through Apache Gobblin

    Software companies can manage big data at a hyper-scale on different infrastructure stacks using Apache Gobblin.

    https://www.kdnuggets.com/2023/01/scaling-data-management-apache-gobblin.html

  • Top 10 MLOps Tools to Optimize & Manage Machine Learning Lifecycle

    As more businesses experiment with data, they realize that developing a machine learning (ML) model is only one of many steps in the ML lifecycle.

    https://www.kdnuggets.com/2022/10/top-10-mlops-tools-optimize-manage-machine-learning-lifecycle.html

  • Movie Recommendations with Spark Collaborative Filtering

    Not sure what movie to watch? Ask your recommender system.

    https://www.kdnuggets.com/2021/12/movie-recommendations-spark-collaborative-filtering.html

  • Inside recommendations: how a recommender system recommends

    We describe types of recommender systems, more specifically, algorithms and methods for content-based systems, collaborative filtering, and hybrid systems.

    https://www.kdnuggets.com/2021/11/recommendations-recommender-system.html

  • CSV Files for Storage? No Thanks. There’s a Better Option

    Saving data to CSVs is costing you both money and disk space. It’s time to end it.

    https://www.kdnuggets.com/2021/08/csv-files-storage-better-option.html

  • Apache Spark Cluster on Docker

    Build your own Apache Spark cluster in standalone mode on Docker with a JupyterLab interface.

    https://www.kdnuggets.com/2020/07/apache-spark-cluster-docker.html

  • Some Things Uber Learned from Running Machine Learning at Scale

    Uber's machine learning runtime, Michelangelo, has been in operation for a few years. What has the Uber team learned?

    https://www.kdnuggets.com/2020/07/some-things-uber-learned-machine-learning-scale.html

  • The Architecture Used at LinkedIn to Improve Feature Management in Machine Learning Models

    The new typed feature schema streamlined the reusability of features across thousands of machine learning models.

    https://www.kdnuggets.com/2020/05/architecture-linkedin-feature-management-machine-learning-models.html

  • The Benefits & Examples of Using Apache Spark with PySpark

    Apache Spark runs fast, offers robust, distributed, fault-tolerant data objects, and integrates beautifully with the world of machine learning and graph analytics. Learn more here.

    https://www.kdnuggets.com/2020/04/benefits-apache-spark-pyspark.html

  • State of the Machine Learning and AI Industry

    Enterprises are struggling to launch machine learning models that encapsulate the optimization of business processes. These are now the essential components of data-driven applications and AI services that can improve legacy rule-based business processes, increase productivity, and deliver results. In the current state of the industry, many companies are turning to off-the-shelf platforms to increase expectations for success in applying machine learning.

    https://www.kdnuggets.com/2020/04/machine-learning-ai-industry.html

  • Everything a Data Scientist Should Know About Data Management (Silver Blog, Platinum Blog)

    For full-stack data science mastery, you must understand data management along with all the bells and whistles of machine learning. This high-level overview is a road map for the history and current state of the expansive options for data storage and infrastructure solutions.

    https://www.kdnuggets.com/2019/10/data-scientist-data-management.html

  • How to Become More Marketable as a Data Scientist (Silver Blog, Platinum Blog)

    As a data scientist, you are in high demand. So, how can you increase your marketability even more? Check out these current trends in skills most desired by employers in 2019.

    https://www.kdnuggets.com/2019/08/marketable-data-scientist.html

  • Learn how to use PySpark in under 5 minutes (Installation + Tutorial)

    Apache Spark is one of the hottest and largest open source projects in data processing, a framework with rich high-level APIs for programming languages like Scala, Python, Java and R. It realizes the potential of bringing together both Big Data and machine learning.

    https://www.kdnuggets.com/2019/08/learn-pyspark-installation-tutorial.html

  • The Data Science Gold Rush: Top Jobs in Data Science and How to Secure Them

    Because big data touches almost every industry across the board, those who aren’t already working in data and analytics will soon be utilizing the technology for its undeniable business benefits. Whichever way you slice it, the future of work is through data.

    https://www.kdnuggets.com/2019/01/top-jobs-data-science.html

  • Ontology and Data Science (Silver Blog)

    In simple words, one can say that ontology is the study of what there is. But there is another part to that definition that will help us in the following sections: ontology is usually also taken to encompass problems about the most general features and relations of the entities which do exist.

    https://www.kdnuggets.com/2019/01/ontology-data-science.html

  • Practical Apache Spark in 10 Minutes

    Check out this series of articles on Apache Spark. Each part is a 10 minute tutorial on a particular Apache Spark topic. Read on to get up to speed using Spark.

    https://www.kdnuggets.com/2019/01/practical-apache-spark-10-minutes.html

  • Apache Spark Introduction for Beginners (Silver Blog)

    An extensive introduction to Apache Spark, including a look at the evolution of the product, use cases, architecture, ecosystem components, core concepts and more.

    https://www.kdnuggets.com/2018/10/apache-spark-introduction-beginners.html

  • Things you should know when traveling via the Big Data Engineering hype-train

    Maybe you want to join the Big Data world? Or maybe you are already there and want to validate your knowledge? Or maybe you just want to know what Big Data Engineers do and what skills they use? If so, you may find the following article quite useful.

    https://www.kdnuggets.com/2018/10/big-data-engineering-hype-train.html

  • Hadoop for Beginners (Silver Blog)

    An introduction to Hadoop, a framework that enables you to store and process large data sets in parallel and distributed fashion.

    https://www.kdnuggets.com/2018/09/hadoop-beginners.html

  • Introduction to Apache Spark

    This is the first blog in this series to analyze Big Data using Spark. It provides an introduction to Spark and its ecosystem.

    https://www.kdnuggets.com/2018/07/introduction-apache-spark.html

  • Apache Spark : Python vs. Scala (Silver Blog)

    When it comes to using the Apache Spark framework, the data science community is divided into two camps: one prefers Scala, the other Python. This article compares the two, listing their pros and cons.

    https://www.kdnuggets.com/2018/05/apache-spark-python-scala.html

  • Presto for Data Scientists – SQL on anything

    Presto enables data scientists to run interactive SQL across multiple data sources. This open source engine supports querying anything, anywhere, and at large scale.

    https://www.kdnuggets.com/2018/04/presto-data-scientists-sql.html

  • Deep Learning With Apache Spark: Part 1

    First part of a full discussion of how to do Distributed Deep Learning with Apache Spark. This part: what Spark is, the basics of Spark+DL, and a little more.

    https://www.kdnuggets.com/2018/04/deep-learning-apache-spark-part-1.html

  • Ranking Popular Distributed Computing Packages for Data Science

    We examined 140 frameworks and distributed programming packages and came up with a list of the top 20 distributed computing packages useful for Data Science, based on a combination of GitHub, Stack Overflow, and Google results.

    https://www.kdnuggets.com/2018/03/top-distributed-computing-packages-data-science.html

  • My Journey into Deep Learning

    In this post I’ll share how I’ve been studying Deep Learning and using it to solve data science problems. It’s an informal post but with interesting content (I hope).

    https://www.kdnuggets.com/2018/01/journey-into-deep-learning.html

  • Comparing Machine Learning as a Service: Amazon, Microsoft Azure, Google Cloud AI (Gold Blog)

    A complete and unbiased comparison of the three most common Cloud Technologies for Machine Learning as a Service.

    https://www.kdnuggets.com/2018/01/mlaas-amazon-microsoft-azure-google-cloud-ai.html

  • Big Data: Main Developments in 2017 and Key Trends in 2018 (Silver Blog)

    As we bid farewell to one year and look to ring in another, KDnuggets has solicited opinions from numerous Big Data experts as to the most important developments of 2017 and their 2018 key trend predictions.

    https://www.kdnuggets.com/2017/12/big-data-main-developments-2017-key-trends-2018.html

  • Graph Analytics Using Big Data

    An overview and a small tutorial showing how to analyze a dataset using Apache Spark, graphframes, and Java.

    https://www.kdnuggets.com/2017/12/graph-analytics-using-big-data.html

  • Updates & Upserts in Hadoop Ecosystem with Apache Kudu

    A new open source Apache Hadoop ecosystem project, Apache Kudu completes Hadoop's storage layer to enable fast analytics on fast data.

    https://www.kdnuggets.com/2017/10/updates-upserts-hadoop-ecosystem-apache-kudu.html

  • Are Data Lakes Fake News? (Silver Blog, Sep 2017)

    The quick answer is yes, and the biggest problem is that the term “Data Lakes” has been overloaded by vendors and analysts with different meanings, resulting in an ill-defined and blurry concept.

    https://www.kdnuggets.com/2017/09/data-lakes-fake-news.html

  • 277 Data Science Key Terms, Explained (Silver Blog, Sep 2017)

    This is a collection of 277 data science key terms, explained with a no-nonsense, concise approach. Read on to find terminology related to Big Data, machine learning, natural language processing, descriptive statistics, and much more.

    https://www.kdnuggets.com/2017/09/data-science-key-terms-explained.html

  • Apache Flink: The Next Distributed Data Processing Revolution? (Silver Blog, Jul 2017)

    Will Apache Flink displace Apache Spark as the new champion of Big Data Processing? We compare Spark and Apache Flink performance for batch processing and stream processing.

    https://www.kdnuggets.com/2017/07/apache-flink-distributed-data-processing-revolution.html

  • How Feature Engineering Can Help You Do Well in a Kaggle Competition – Part I

    As I scrolled through the leaderboard page, I found my name in the 19th position, which was in the top 2% of nearly 1,000 competitors. Not bad for the first Kaggle competition I had decided to put real effort into!

    https://www.kdnuggets.com/2017/06/feature-engineering-help-kaggle-competition-1.html

  • Top Stories, May 22-28: Analytics, Data Science, Machine Learning Software Poll Results; Machine Learning Crash Course

    New Leader, Trends, and Surprises in Analytics, Data Science, Machine Learning Software Poll; Machine Learning Crash Course: Part 1; Text Mining 101: Mining Information From A Resume; Data science platforms are on the rise and IBM is leading the way; An Introduction to the MXNet Python API

    https://www.kdnuggets.com/2017/05/top-news-week-0522-0528.html

  • Simplifying Data Pipelines in Hadoop: Overcoming the Growing Pains

    Moving to Hadoop is not without its challenges—there are so many options, from tools to approaches, that can have a significant impact on the future success of a business’ strategy. Data management and data pipelining can be particularly difficult.

    https://www.kdnuggets.com/2017/05/simplify-data-pipelines-hadoop.html

  • Data Science & Machine Learning Platforms for the Enterprise

    A resilient Data Science Platform is a necessity for every centralized data science team within a large corporation. It helps them centralize, reuse, and productionize their models at petascale.

    https://www.kdnuggets.com/2017/05/data-science-machine-learning-platforms-enterprise.html

  • Apache Arrow and Apache Parquet: Why We Needed Different Projects for Columnar Data, On Disk and In-Memory

    Apache Parquet and Apache Arrow both focus on improving the performance and efficiency of data analytics. These two projects optimize performance for on-disk and in-memory processing.

    https://www.kdnuggets.com/2017/02/apache-arrow-parquet-columnar-data.html

  • Why the Data Scientist and Data Engineer Need to Understand Virtualization in the Cloud

    This article covers the value of understanding the virtualization constructs for the data scientist and data engineer as they deploy their analysis onto all kinds of cloud platforms. Virtualization is a key enabling layer of software for these data workers to be aware of and to achieve optimal results from.

    https://www.kdnuggets.com/2017/01/data-scientist-engineer-understand-virtualization-cloud.html

  • 50+ Data Science, Machine Learning Cheat Sheets, updated (Gold Blog, Dec 2016)

    Gear up to speed and have concepts and commands handy in Data Science, Data Mining, and Machine learning algorithms with these cheat sheets covering R, Python, Django, MySQL, SQL, Hadoop, Apache Spark, Matlab, and Java.

    https://www.kdnuggets.com/2016/12/data-science-machine-learning-cheat-sheets-updated.html

  • Evaluating HTAP Databases for Machine Learning Applications

    Businesses are producing a greater number of intelligent applications, which traditional databases are unable to support. A new class of databases, Hybrid Transactional and Analytical Processing (HTAP) databases, offers a variety of capabilities with specific strengths and weaknesses to consider. This article aims to give application developers and data scientists a better understanding of the HTAP database ecosystem so they can make the right choice for their intelligent application.

    https://www.kdnuggets.com/2016/11/evaluating-htap-databases-machine-learning-applications.html

  • The top 5 Big Data courses to help you break into the industry

    Here is an updated and in-depth review of the top 5 providers of Big Data and Data Science courses: Simplilearn, Cloudera, Big Data University, Hortonworks, and Coursera.

    https://www.kdnuggets.com/2016/08/simplilearn-5-big-data-courses.html

  • Big Data Key Terms, Explained

    Just getting started with Big Data, or looking to iron out the wrinkles in your current understanding? Check out these 20 Big Data-related terms and their concise definitions.

    https://www.kdnuggets.com/2016/08/big-data-key-terms-explained.html

  • Apache Spark Key Terms, Explained

    An overview of 13 core Apache Spark concepts, presented with focus and clarity in mind. A great beginner's overview of essential Spark terminology.

    https://www.kdnuggets.com/2016/06/spark-key-terms-explained.html

  • R, Python Duel As Top Analytics, Data Science software – KDnuggets 2016 Software Poll Results

    R remains the leading tool, with 49% share, but Python grows faster and almost catches up to R. RapidMiner remains the most popular general Data Science platform. Big Data tools used by almost 40%, and Deep Learning usage doubles.

    https://www.kdnuggets.com/2016/06/r-python-top-analytics-data-mining-data-science-software.html

  • Top 10 Data Science Resources on Github

    The top 10 data science projects on GitHub are chiefly composed of tutorials and educational resources for learning and doing data science. Have a look at the resources others are using and learning from.

    https://www.kdnuggets.com/2016/03/top-10-data-science-github.html

  • Top Big Data Processing Frameworks

    A discussion of 5 Big Data processing frameworks: Hadoop, Spark, Flink, Storm, and Samza. An overview of each is given and comparative insights are provided, along with links to external resources on particular related topics.

    https://www.kdnuggets.com/2016/03/top-big-data-processing-frameworks.html

  • Top Spark Ecosystem Projects

    Apache Spark has developed a rich ecosystem, including both official and third party tools. We have a look at 5 third party projects which complement Spark in 5 different ways.

    https://www.kdnuggets.com/2016/03/top-spark-ecosystem-projects.html

  • Python Data Science with Pandas vs Spark DataFrame: Key Differences

    A post describing the key differences between Pandas and Spark's DataFrame format, including specifics on important regular processing features, with code samples.

    https://www.kdnuggets.com/2016/01/python-data-science-pandas-spark-dataframe-differences.html

  • 50 Deep Learning Software Tools and Platforms, Updated

    We present the popular software & toolkit resources for Deep Learning, including Caffe, Cuda-convnet, Deeplearning4j, Pylearn2, Theano, and Torch. Explore the new list!

    https://www.kdnuggets.com/2015/12/deep-learning-tools.html

  • 50+ Data Science and Machine Learning Cheat Sheets

    Gear up to speed and have Data Science & Data Mining concepts and commands handy with these cheatsheets covering R, Python, Django, MySQL, SQL, Hadoop, Apache Spark and Machine learning algorithms.

    https://www.kdnuggets.com/2015/07/good-data-science-machine-learning-cheat-sheets.html

  • R leads RapidMiner, Python catches up, Big Data tools grow, Spark ignites

    R is the most popular overall tool among data miners, although Python usage is growing faster. RapidMiner continues to be the most popular suite for data mining/data science. Hadoop/Big Data tools usage grew to 29%, propelled by 3x growth in Spark. Other tools with strong growth include H2O (0xdata), Actian, MLlib, and Alteryx.

    https://www.kdnuggets.com/2015/05/poll-r-rapidminer-python-big-data-spark.html

  • Hadoop as a Service: 18 Cloud Options

    Hadoop as a service in the cloud makes big data applications and projects easier to approach and these 18 platforms each provide their own unique solutions.

    https://www.kdnuggets.com/2015/04/hadoop-as-service-18-cloud-options.html

  • Interview: Arno Candel, H2O.ai on the Basics of Deep Learning to Get You Started

    We discuss how Deep Learning is different from the other methods of Machine Learning, unique characteristics and benefits of Deep Learning, and the key components of H2O architecture.

    https://www.kdnuggets.com/2015/01/interview-arno-candel-0xdata-deep-learning.html

  • IE Masters in Analytics and Big Data – first hand report

    First-hand report on the Master in Business Analytics and Big Data program at IE (Madrid, Spain): why, what, how, days, and challenges.

    https://www.kdnuggets.com/2015/01/ie-data-science-education-first-hand-report.html

  • 16 NoSQL, NewSQL Databases To Watch

    NoSQL and NewSQL databases have become much more important with the proliferation of big, mobile, and networked data, and these sixteen database solutions are some of the biggest up-and-comers.

    https://www.kdnuggets.com/2014/12/16-nosql-newsql-databases-to-watch.html

  • R and Hadoop make Machine Learning Possible for Everyone

    R and Hadoop make machine learning approachable enough for inexperienced users to begin analyzing and visualizing interesting data to start down the path in this lucrative field.

    https://www.kdnuggets.com/2014/11/r-hadoop-make-machine-learning-possible-everyone.html

  • 18 essential Hadoop tools

    Hadoop tools develop at a rapid rate, and keeping up with the latest can be difficult. Here we detail 18 of the most essential tools that work well with Hadoop.

    https://www.kdnuggets.com/2014/08/18-essential-hadoop-tools.html

  • KDnuggets Analytics, Data Mining, Data Science Software Poll – Analyzed

    We analyze the results of KDnuggets Software Poll, including correlations between tools, and relationships between commercial, free, and Hadoop/Big Data tools. We identify a potential capability gap. Download anonymized data and analyze it yourself.

    https://www.kdnuggets.com/2014/06/analytics-data-mining-data-science-software-poll-analyzed.html

  • KDnuggets 15th Annual Analytics, Data Mining, Data Science Software Poll: RapidMiner Continues To Lead

    With over 3,000 data miners taking part in KDnuggets 15th Annual Software Poll, RapidMiner continues to lead. Free software is used much more outside the US, and Hadoop usage grows fastest in Asia.

    https://www.kdnuggets.com/2014/06/kdnuggets-annual-software-poll-rapidminer-continues-lead.html

  • Poll Results: Data Types/Sources Analyzed

    Trends in data sources for data mining include: table data dominates, followed by time series and text; audio and JSON grow in popularity, while itemsets decline; 70% access DB engines, but only 20% access NoSQL stores; Hadoop and MongoDB are used more for text; Europe is lagging in NoSQL usage.

    https://www.kdnuggets.com/2014/05/poll-results-data-types-sources-analyzed.html

  • KDnuggets™ News 13:n02, Jan 30

    Features (10) | Software (4) | Courses, Events (2) | Webcasts (3) | Jobs (12) | Academic (5) | Competitions (4) | Publications (12) | NewsBriefs

    https://www.kdnuggets.com/2013/n02.html
