Search results for datasets

    Found 100 documents, 8822 searched:

  • Datasets for Data Mining and Data Science

    ...he largest repository of standardized and structured statistical data, with over 25 billion data points, 4.3 billion datasets, 400+ source databases. Datasets.co, datasets for data geeks, find and share Machine Learning datasets. DataSF.org, a clearinghouse of datasets available from the City &...

    https://www.kdnuggets.com/datasets/index.html

  • Interesting Social Media Datasets

    ...asets to experiment with: The Stanford Large Network Dataset Collection (SNAP) is an excellent resource because not only does it have a wide range of datasets from different sources, but it also has datasets of varying size, which can be useful depending on your applications. SNAP is also a library...

    https://www.kdnuggets.com/2014/08/interesting-social-media-datasets.html

  • 9 Must-Have Datasets for Investigating Recommender Systems

    ...e summarized below. Some of them are standards of the recommender system world, while others are a little more non-traditional. These non-traditional datasets are the ones we are most excited about because we think they will most closely mimic the types of data seen in the wild. Before we get...

    https://www.kdnuggets.com/2016/02/nine-datasets-investigating-recommender-systems.html

  • Awesome Public Datasets on GitHub

    ...etworks traffic, employing statistical and machine learning techniques on distributed processing platforms such as Apache Spark. So, which dataset would you pick today? Would you like to add anything to this list? Let us know your thoughts in the comments below. Related: KDnuggets Datasets for Data...

    https://www.kdnuggets.com/2015/04/awesome-public-datasets-github.html

  • A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets

    ...rk 2.0; why and when you should use each set; outline their performance and optimization benefits; and enumerate scenarios when to use DataFrames and Datasets instead of RDDs. Mostly, I will focus on DataFrames and Datasets, because in Apache Spark 2.0, these two APIs are unified. Our primary...

    https://www.kdnuggets.com/2017/08/three-apache-spark-apis-rdds-dataframes-datasets.html

  • Where can I find good datasets for data mining?

    ...d this discussion Free Public Datasets Many conversations happen on Google group get.theinfo groups.google.com/group/get-theinfo and www.reddit.com/r/datasets/ - subreddit devoted to datasets Quora answer www.quora.com/Where-can-I-get-large-datasets-open-to-the-public?q=dataset OpenData...

    https://www.kdnuggets.com/faq/datasets-for-data-mining.html

  • Choosing an Open Source Machine Learning Library: TensorFlow, Theano, Torch, scikit-learn, Caffe

    …e. Despite using a less common language than Python, it’s widely adopted – Facebook, Google, and Twitter are known for using it in their AI projects. Datasets and models You can find a list of popular datasets to be loaded for use in Torch on its GitHub cheatsheet page. Moreover, Facebook released…

    https://www.kdnuggets.com/2017/11/choosing-open-source-machine-learning-library.html

  • Data-Planet Statistical Datasets

    .... Editors: Use Data-Planet Statistical Datasets to identify and create content for publication and research. Add Your Data to Data-Planet Statistical Datasets! Add your data to Data-Planet Statistical Datasets and integrate it with all the other data in our system. Many customers purchase or create...

    https://www.kdnuggets.com/2015/11/data-planet-statistical-datasets.html

  • Announcing Microsoft Research Open Data, a cloud hosted platform for sharing datasets

    ...ets, facilitate collaboration between researchers using cloud-based resources and enable reproducibility of research. It has been seeded with over 50 datasets with more being added incrementally. All datasets are available for any researcher / data scientist to freely download, or with a few...

    https://www.kdnuggets.com/2018/06/microsoft-research-open-data.html

  • Introducing VisualData: A Search Engine for Computer Vision Datasets

    ...ew ideas. The answer to the problem is open datasets. Instead of building your own dataset, there already exists a rich collection of computer vision datasets contributed by academic researchers, hobbyists and companies. These datasets include diverse topics from recognizing objects to...

    https://www.kdnuggets.com/2018/09/introducing-visualdata-search-engine-computer-vision-datasets.html

  • Datasets Over Algorithms

    ...the public credit for ending the last AI winter,” concluded Alexander Wissner-Gross, “the real news might be that prioritizing the cultivation of new datasets and research communities around them could be essential to extending the present AI summer.” We wonder if algorithmic trading systems might...

    https://www.kdnuggets.com/2016/05/datasets-over-algorithms.html

  • Big RAM is eating big data – Size of datasets used for analytics

    …ase of datasets of 10^0.075 ~ 1.2 that is 20%. The median dataset size increases from 6 GB (2006) to 30 GB (2015). That’s all tiny, even more for raw datasets, and it implies that over 50% of analytics professionals work with datasets that (even in raw form) can fit in the memory of a single…

    https://www.kdnuggets.com/2015/11/big-ram-big-data-size-datasets.html

  • KDnuggets™ News 15:n11, Apr 15: Big Data Predictive Analytics Gainers & Losers; Awesome Public Datasets

    ...s, Intro to Machine Learning with sci-kit, and more. Top stories for Apr 5-11: 10 things statistics taught us about big data analysis; Awesome Public Datasets on GitHub - Apr 12, 2015. 10 things statistics taught us about big data analysis; The Grammar of Data Science: Python vs R; Predictive...

    https://www.kdnuggets.com/2015/n11.html

  • 10 Data Acquisition Strategies for Startups

    ...ng as the core technology of their business. While many algorithms and software tools are open sourced and shared across the research community, good datasets are usually proprietary and hard to build. Owning a large, domain-specific dataset can therefore become a significant source of competitive...

    https://www.kdnuggets.com/2016/06/10-data-acquisition-strategies-startups.html

  • Top 10 Open Dataset Resources on Github

    ...Congressional districts as GeoJSON, versioned within Git 10. CERN Open Data Portal Stars: 79, Forks: 34 This is the source code for the CERN Open Data Portal, described as "the access point to a growing range of data produced through the research performed at CERN." Related: Awesome Public...

    https://www.kdnuggets.com/2016/05/top-10-datasets-github.html

  • 7 Steps to Mastering Apache Spark 2.0

    ...because they form the core data abstraction concept in Spark and underpin all other higher-level data abstractions and APIs, including DataFrames and Datasets. In Spark 2.0, DataFrames and Datasets, built upon RDDs, form the core high-level and structured distributed data abstraction, across most...

    https://www.kdnuggets.com/2016/09/7-steps-mastering-apache-spark.html

  • Top Datasets on Reddit

    By Gregory Piatetsky, Dec 28, 2013. Thanks to +RichGillin for a pointer to a Reddit page on Datasets www.reddit.com/r/datasets/ The top datasets for December 2013 include NFL Game Metadata Since 1980 (CSV file). mapItOut reddit user explains how to link the metadata with the results: Download the...

    https://www.kdnuggets.com/2013/12/top-datasets-on-reddit.html

  • Behind the Dream of Data Work as it Could Be

    ...t. I have a feeling I’m going to get a lot more out of data.world. In the next few days, I’ll get emails that prompt me to learn more about Uploading Datasets and Combining Datasets. I think I’ll upload some new OW data and see what people do with it. I’m excited to share my findings internally,...

    https://www.kdnuggets.com/2016/09/behind-dream-data-work.html

  • Three techniques to improve machine learning model performance with imbalanced datasets

    ...llowing subsections, I describe three techniques I used to overcome the data imbalance problem. First, let’s get started familiarizing with datasets: Datasets: There are three labels [1, 2, 3] in the training data which makes the problem a multi-class problem. Training datasets have 17 features and...

    https://www.kdnuggets.com/2018/06/three-techniques-improve-machine-learning-model-performance-imbalanced-datasets.html

  • Coursera / Stanford Mining Massive Datasets MOOC, Jan-Mar 2015

    ...apReduce algorithms from good algorithms in general. The rest of the course is devoted to algorithms for extracting models and information from large datasets. Students will learn how Google's PageRank algorithm models importance of Web pages and some of the many extensions that have been used for...

    https://www.kdnuggets.com/2015/01/coursera-stanford-mining-massive-datasets-mooc-jan.html

  • Mining Massive Datasets, free Stanford online course, starts Oct 11

    ...capital firm. Jeff Ullman, is a retired professor of Computer Science at Stanford. About the Course The course is based on the text Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, and Jeff Ullman, who by coincidence are also the instructors for the course. The book is published by...

    https://www.kdnuggets.com/2016/09/stanford-mining-massive-datasets-course.html

  • What is Academic Torrents and Where is Data Sharing Going?

    ...ome examples are: Video Lectures: academictorrents.com/collection/video-lectures Deep Learning: academictorrents.com/collection/deep-learning Spatial Datasets: academictorrents.com/collection/spatial-datasets Computer Vision: academictorrents.com/collection/computer-vision Standard webservers can...

    https://www.kdnuggets.com/2016/10/academic-torrents-data-sharing.html

  • Coursera / Stanford Mining Massive Datasets MOOC

    ...apReduce algorithms from good algorithms in general. The rest of the course is devoted to algorithms for extracting models and information from large datasets. Students will learn how Google's PageRank algorithm models importance of Web pages and some of the many extensions that have been used for...

    https://www.kdnuggets.com/2014/09/coursera-stanford-mining-massive-datasets-mooc.html

  • Top KDnuggets tweets, May 25-31: 19 Free eBooks to learn #programming with #Python; Awesome collection of public datasets on Github

    ...Degree - numerous free books and resources https://t.co/EyAgE1aKll https://t.co/LtfcF2XsFd Most Clicked: Check out this awesome collection of public #datasets on @Github! https://t.co/kU7ffMvD4x https://t.co/WuDFh6v7P1 Top 10 most engaging Tweets Introducing our Hybrid lda2vec Algorithm via Stitch...

    https://www.kdnuggets.com/2016/06/top-tweets-may25-31.html

  • Coursera/Stanford “Mining Massive Datasets”, free online course

    ...apReduce algorithms from good algorithms in general. The rest of the course is devoted to algorithms for extracting models and information from large datasets. Participants will learn how Google's PageRank algorithm models importance of Web pages and some of the many extensions that have been used...

    https://www.kdnuggets.com/2015/07/coursera-stanford-mining-massive-datasets-free-online-course.html

  • Lessons Learned From Benchmarking Fast Machine Learning Algorithms

    …he leaf-wise strategy grows the tree by splitting the data at the nodes with the highest loss change. Level-wise growth is usually better for smaller datasets whereas leaf-wise tends to overfit. Leaf-wise growth tends to excel in larger datasets where it is considerably faster than level-wise…

    https://www.kdnuggets.com/2017/08/lessons-benchmarking-fast-machine-learning-algorithms.html

  • US Open Data Action Plan and Datasets

    ...l agency websites as well as on the federal open data repository, Data.gov, which celebrated its fifth anniversary a few weeks ago. More than 100,000 datasets are available for download on Data.gov (http://catalog.data.gov/dataset). Moreover, some of the datasets are also available in an...

    https://www.kdnuggets.com/2014/05/us-open-data-action-plan-data-sets.html

  • 11 Clever Methods of Overfitting and how to avoid them

    ...he past problems that helps on future problems. Data set selection subverts this and is very difficult to detect. Remedy: Use comparisons on standard datasets. Select datasets without using the test set. Good Contest performance can’t be faked this way. 9. Reprobleming: Alter the problem so that...

    https://www.kdnuggets.com/2015/01/clever-methods-overfitting-avoid.html

  • Top stories for Sep 14-20: Coursera Mining Massive Datasets MOOC; Rattle package for Data Mining in R

    Most viewed news items Coursera / Stanford Mining Massive Datasets MOOC - Sep 16, 2014. Rattle package for Data Mining and Data Science in R - Sep 17, 2014. Hiring Data Scientists: What to look for? - Sep 9, 2014. edX “Learning From Data” Caltech course - Sep 18, 2014. Most Viewed...

    https://www.kdnuggets.com/2014/09/top-news-week-sep-14.html

  • The new Enigma Public – the platform connecting people to data

    …y, unusual, and essential datasets in Enigma Public. Whether you’re new to public data or a briefed veteran, you’ll find guidance on working with key datasets plus quick descriptions of datasets you’ve (probably) never heard of or explored. Check back with Enigma Public frequently as we update the…

    https://www.kdnuggets.com/2017/09/new-enigma-public-platform.html

  • Curated, Cleansed Datasets can Make A World of Difference

    …g and its ability to help accelerate data science and add value to the quality of big data analysis. To seed the platform, our data contains the open datasets that we’ve prepared as a result of over 90+ projects executed for leading global enterprises. Datasets are organized by business use-case or…

    https://www.kdnuggets.com/2015/09/crowdanalytix-curated-cleansed-datasets.html

  • KDnuggets™ News 16:n16, May 4: How to Remove Duplicates from Large Data; Datasets over Algorithms; When Automation goes too far

    ...r AI: Google, Facebook, Amazon, Apple; Comprehensive Guide to Learning #Python    Quote "Perhaps the most important news of our day is that datasets - not algorithms - might be the key limiting factor to development of human-level artificial intelligence" - Alexander Wissner-Gross in...

    https://www.kdnuggets.com/2016/n16.html

  • How To Stay Competitive In Machine Learning Business

    …t’s it and it makes their feed pretty unique. Let’s go back to Google. Does anyone else have access to that type of search data? No. Those are unique datasets. How do you get access to unique datasets without “going Google?” This is where business strategy planning comes into play. The reason…

    https://www.kdnuggets.com/2017/01/stay-competitive-machine-learning-business.html

  • Interview: Andrew Duguay, Prevedere on Economic Intelligence from Integrating Public Datasets

    ...lities of Economic Intelligence software by Prevedere? AD: The main features of Prevedere Economic Intelligence include the following: Housing global datasets in a clear and consistent fashion, updated on a daily basis A patent pending correlation engine that will compare the 1.5 million data sets...

    https://www.kdnuggets.com/2015/07/interview-andrew-duguay-prevedere-economic-intelligence.html

  • Book: Mining of Massive Datasets, 2nd Edition, free download

    Mining of Massive Datasets, by Jure Leskovec @jure, Anand Rajaraman @anand_raj, and Jeff Ullman. The first edition was published by Cambridge University Press, and you get a 20% discount by buying it here. The second edition of the book will also be published soon. Jure Leskovec was added as a...

    https://www.kdnuggets.com/2014/02/book-mining-massive-datasets-2nd-edition-free-download.html

  • Top /r/MachineLearning posts, Jan 11-17

    ...is post is interesting not just because it has some good datasets (there are many places to find great datasets) but because it focuses on open-source datasets. Most of these datasets should offer some form of open-source license, though I would check the particular dataset you’re interested to be...

    https://www.kdnuggets.com/2015/01/top-machine-learning-posts-jan11-17.html

  • SBP15 Grand Data Challenge

    ...se datasets are merely intended to provide a starting point, and are not required for the submission. Contestants are encouraged to provide their own datasets for the community. All of the datasets that follow are available on the SBP Grand Challenge website (http://sbp-conference.org/challenge/):...

    https://www.kdnuggets.com/2014/12/sbp15-grand-data-challenge.html

  • Visualizing Time-Series Change

    ...of communicating change in our dataset. Periodic Change   While plotting change in absolute units allows us to make comparisons within specific datasets, it is not particularly effective for comparing change across datasets with vastly different scales. If we examine the periods of 1990–1994,...

    https://www.kdnuggets.com/2017/03/visualizing-time-series-change.html

  • Top stories for Apr 12-18: Awesome Public Datasets on GitHub; Cloud Machine Learning Wars: Amazon vs IBM Watson vs Microsoft Azure

    ...for Text Understanding from Scratch - Mar 13, 2015. Top stories for Apr 5-11: 10 things statistics taught us about big data analysis; Awesome Public Datasets on GitHub - Apr 12, 2015. Interview: Ksenija Draskovic, Verizon on Dissecting the Anatomy of Predictive Analytics Projects - Apr 15, 2015....

    https://www.kdnuggets.com/2015/04/top-news-week-apr-12.html

  • 7 Ways to Get High-Quality Labeled Training Data at Low Cost

    ...ficient for training your models. To bootstrap training, you might pretrain with free public data that is roughly related to your domain. If the free datasets include acceptable labels, all the better. You might then retrain the model on smaller, higher quality, labeled datasets that are directly...

    https://www.kdnuggets.com/2017/06/acquiring-quality-labeled-training-data.html

  • Building a Data Science Portfolio: Machine Learning Project Part 1

    ...ing direct value. Some good places to find datasets like this are: /r/datasets – a subreddit that has hundreds of interesting datasets. Google Public Datasets – public datasets available through Google BigQuery. Awesome datasets – a list of datasets, hosted on Github. As you look through these...

    https://www.kdnuggets.com/2016/07/building-data-science-portfolio-machine-learning-project-part-1.html

  • Contest Winner: Winning the AutoML Challenge with Auto-sklearn

    ...hyperparameter settings for previous similar datasets. Specifically, Auto-sklearn comes with a database of previous optimization runs on 140 diverse datasets from OpenML. For a new dataset, it first identifies the most similar datasets and starts from the saved best settings for those. A second...

    https://www.kdnuggets.com/2016/08/winning-automl-challenge-auto-sklearn.html

  • Tour of Real-World Machine Learning Problems

    ...ee Access Challenge. Given historical resource access changes for employees predict the resources required by employees.   Most Popular Research Datasets The next 10 machine learning problems are the most popular on the University California at Irvine Machine Learning Repository website that...

    https://www.kdnuggets.com/2015/12/tour-real-world-machine-learning-problems.html

  • Top KDnuggets tweets, Sep 15-16

    ...and Free) #DataScience Books on Statistical Learning, Machine Learning, Mining #BigData & more t.co/xzZyODrYyG Best #BigData Deal: Mining Massive Datasets on Coursera, by top Stanford researchers, starts Sep 29 and is free t.co/U2zOimghbQ Coursera / Stanford Mining Massive Datasets MOOC...

    https://www.kdnuggets.com/2014/09/top-tweets-sep15-16.html

  • CrowdSignals.io, Building Big Mobile Social Sensor dataset

    …//crowdsignals.io, Indiegogo Page: http://igg.me/at/crowdsignals Videos: Crowdfunding video: https://vimeo.com/159919080, Endorsements reel: https://vimeo.com/141841480 Related: 9 Must-Have Datasets for Investigating Recommender Systems Awesome Public Datasets on GitHub Interesting Social Media…

    https://www.kdnuggets.com/2016/03/crowdsignals-io-big-mobile-social-sensor-dataset.html

  • Top April stories: Awesome Public Datasets on GitHub; Forrester Wave Big Data Predictive Analytics – Gainers and Losers

    ...Analytics 2015: Gainers and Losers - Apr 3, 2015. Cloud Machine Learning Wars: Amazon vs IBM Watson vs Microsoft Azure - Apr 16, 2015. Awesome Public Datasets on GitHub - Apr 6, 2015. Top LinkedIn Groups for Analytics, Big Data, Data Mining, and Data Science - from “Big Bang” to Now -...

    https://www.kdnuggets.com/2015/05/top-news-2015-apr.html

  • Deep Feature Synthesis: How Automated Feature Engineering Works

    ...most common type of enterprise data used today: a survey of 16,000 data scientists on Kaggle found that they spent 65% of their time using relational datasets. 2. Across datasets, many features are derived by using similar mathematical operations. To understand this, let’s consider a dataset of...

    https://www.kdnuggets.com/2018/02/deep-feature-synthesis-automated-feature-engineering.html

  • Data Analyst

    ...Chartis consulting engagement teams Work with client to aid data request fulfillment Receive major client datasets Thoroughly QA and cleanse all raw datasets Build out datasets to maximize utility using advanced data management tools Execute core set of analytics for strategy, operations, or...

    https://www.kdnuggets.com/jobs/13/05-30-chartis-data-analyst.html

  • The Chartis Group: Data Manager

    ...Chartis consulting engagement teams Work with client to aid data request fulfillment Receive major client datasets Thoroughly QA and cleanse all raw datasets Build out datasets to maximize utility using advanced data management tools Execute core set of analytics for strategy, operations, or...

    https://www.kdnuggets.com/jobs/14/02-27-chartis-data-manager.html

  • Google BigQuery Public Datasets

    ...highlighting open datasets on Google BigQuery - once data is loaded there, you can make it public, let others analyze with SQL. Here are some notable Datasets publicly available on Google BigQuery (via reddit). GDELT Worldwide news GDELT announcement Top words queries All events:...

    https://www.kdnuggets.com/2015/02/google-bigquery-public-datasets.html

  • Scikit Flow: Easy Deep Learning with TensorFlow and Scikit-learn

    ...% score) Want to try a Naive Bayes classifier? That doesn't require much of a change: from sklearn.naive_bayes import GaussianNB from sklearn import datasets, metrics iris = datasets.load_iris( ) classifier = GaussianNB( ) classifier.fit(iris.data, iris.target) score =...

    https://www.kdnuggets.com/2016/02/scikit-flow-easy-deep-learning-tensorflow-scikit-learn.html

  • Dataiku Data Science Studio – intuitive solution for data professionals

    ...e an easy way to initiate data manipulations, such as cleansing and aggregation. DSS 2.0 now features: The Split recipe: For the creation of multiple datasets from a singular dataset; The Stack recipe: For the vertical stacking of datasets; The Sample/Filter recipe: For the selection and sampling...

    https://www.kdnuggets.com/2015/07/dataiku-data-science-studio.html

  • Apache Spark Key Terms, Explained

    ..."json" ) # Execute SQL query results = context.sql( """SELECT * FROM people JOIN json ...""" ) 4. Dataset Introduced in Spark 1.6, the goal of Spark Datasets is to provide an API that allows users to easily express transformations on domain objects, while also providing the performance and...

    https://www.kdnuggets.com/2016/06/spark-key-terms-explained.html

  • Doing Data Science: A Kaggle Walkthrough Part 5 – Adding New Data

    ...ures that our dataset includes all users regardless of this fact. The second step could use an inner or an outer join, as both the device and actions datasets should contain all users. In this case we use an outer join just to ensure that if a user is missing from one of the datasets (for whatever...

    https://www.kdnuggets.com/2016/06/doing-data-science-kaggle-walkthrough-adding-new-data.html

  • Top /r/MachineLearning posts, August 2018: Everybody Dance Now; Stanford class Machine Learning cheat sheets; Academic Torrents for sharing enormous datasets

    ...linked in the sidebar! A lot of people would find this information extremely useful.” 3. Academic Torrents: A distributed system for sharing enormous datasets “We've designed a distributed system for sharing enormous datasets - for researchers, by researchers. The result is a scalable, secure, and...

    https://www.kdnuggets.com/2018/09/top-reddit-machine-learning-august.html

  • Applying Deep Learning to Real-world Problems

    ...websites which contain a collection of various trained models by academics, companies and deep learning enthusiasts. See here, here, or here. Public datasets: there are many datasets out there on the web. So don’t waste time on collecting the dataset yourself, but rather check if there is already...

    https://www.kdnuggets.com/2017/06/applying-deep-learning-real-world-problems.html

  • Challenges in Machine Learning for Trust

    …xpert to tell if a user’s details are fraudulent. Also in some cases, experts might disagree on the labels. A second challenge is lack of open source datasets and models to build upon. There are many datasets for image recognition like MNIST, CIFAR, Imagenet etc4. Numerous researchers have worked…

    https://www.kdnuggets.com/2017/05/challenges-machine-learning-trust.html

  • Data: Portals, Government, State, City, Local, and Public

    ...a comprehensive source of US statistics and more. United States Census Bureau. USA: State, City, and Local CA: San Francisco Data, a clearinghouse of datasets available from the City and County of San Francisco, CA. IL: Chicago data . NY: New York NYC Open Dat WA: Seattle data . Canada Canada Open...

    https://www.kdnuggets.com/2013/07/data-portals-government-state-city-local-public.html

  • Data: APIs, Hubs, Marketplaces, and Platforms

    ...reamX, a global marketplace for commercial data, bringing together buyers and vendors of data onto one simple-to-use platform. dataX, a collection of datasets curated and crowdsourced by CrowdAnalytix community of data scientists Enigma Public, the world's broadest collection of public data....

    https://www.kdnuggets.com/datasets/api-hub-marketplace-platform.html

  • Do You Need Big Data or Smart Data? Part 1

    ...tain point of time. Hence, building your predictive models on such huge datasets may not be beneficial. Let’s look at the example below that compares datasets. i) Datasets with 100 observations ii) Datasets with 1,000,000 observations The above datasets have similar mean and standard deviation, but...

    https://www.kdnuggets.com/2016/06/big-data-smart-data-part-1.html

  • Top KDnuggets tweets, May 8-9: Essential Data Mining Cheat Sheet; Quandl R Package – 5M free datasets, clever data search

    ...: Has #BigData Made Anonymity Impossible? Yes. There is no "digital privacy" anymore bit.ly/11TBXln Most Retweeted: Quandl R Package - 5,000,000 free datasets and very clever data search #rstats bit.ly/12UX4mE Most Favorited: Essential! Data Mining Cheat Sheet - Discovering and Visualizing Patterns...

    https://www.kdnuggets.com/2013/05/top-tweets-may08-may09.html

  • Bitcoin tools and datasets

    ...o securely and anonymously retain and transfer value in a decentralized P2P network, and is one of the largest open microeconomic transaction network datasets available. Recently, Bitcoin has experienced a bubble, causing its total value to quadruple (from 500M USD to 2B USD) over the past month...

    https://www.kdnuggets.com/2013/04/bitcoin-tools-datasets.html

  • Data: Government, State, City, Local and Public

    ...visualization of US public data. DataFerrett, a data mining tool that accesses and manipulates TheDataWeb, a collection of many on-line US Government datasets. EconData, thousands of economic time series, produced by a number of US Government agencies. FEDSTATS, a comprehensive source of US...

    https://www.kdnuggets.com/datasets/government-local-public.html

  • KDnuggets™ News 14:n26, Oct 8

    ...y Databricks, and Spark tutorial events at major universities. Top stories for Sep 28 - Oct 4: Mirador, a free tool for visual exploration of complex datasets - Oct 5, 2014. Mirador, a free tool for visual exploration of complex datasets; Data Science is mainly a Human Science; Get Started in Text...

    https://www.kdnuggets.com/2014/n26.html

  • KDnuggets 14:n01, Unicorn Data Scientists vs Team? Top Datasets on Reddit; What is wrong with Data Science

    Date: Jan 08, 2014 Latest KDnuggets News 14:n01, (Jan 08, 2014) Features: New Poll: Data Science Skills - Individual vs Team Approach Top Datasets on Reddit "Data Scientist" catches "Statistician", surpasses "Data Miner" PAW: Predictive Analytics World for Manufacturing, Chicago, June 17-18...

    https://www.kdnuggets.com/2014/01/pub-kdnuggets-14-n01-unicorn-data-scientists-vs-team-top-datasets-reddit-what-is-wrong-with-data-science.html

  • Top stories for Sep 28 – Oct 4: Mirador, a free tool for visual exploration of complex datasets

    Most viewed news items Mirador, a free tool for visual exploration of complex datasets - Oct 1, 2014. Top stories for Sep 21-27: Data Science is mainly a Human Science; Data Analytics for Business Leaders Explained - Sep 28, 2014. Get Started in Text Analytics - Sep 30, 2014. Automotive Customer...

    https://www.kdnuggets.com/2014/10/top-news-week-sep-28.html

  • Top stories for Apr 5-11: 10 things statistics taught us about big data analysis; Awesome Public Datasets on GitHub

    ...items 10 things statistics taught us about big data analysis - Feb 10, 2015. The Grammar of Data Science: Python vs R - Mar 28, 2015. Awesome Public Datasets on GitHub - Apr 6, 2015. 7 common mistakes when doing Machine Learning - Mar 7, 2015. Forrester Wave(tm) Big Data Predictive Analytics 2015:...

    https://www.kdnuggets.com/2015/04/top-news-week-apr-5.html

  • Mirador, a free tool for visual exploration of complex datasets

    ...ration, sent to me by Andres Colubri, a researcher at Harvard University and the Broad Institute. Mirador is a tool for visual exploration of complex datasets, developed by the Sabeti Lab at Harvard University, the Broad Institute, and Fathom Information Design. Fathom was founded by Ben Fry, a...

    https://www.kdnuggets.com/2014/10/mirador-visual-exploration-complex-datasets.html

  • Medical Image Analysis with Deep Learning 

    ...thy. Dicom Library : DICOM Library is a free online medical DICOM image or video file sharing service for educational and scientific purposes. Osirix Datasets: Provides a large range of human datasets acquired through a variety of imaging modalities. Visible Human Datasets: Parts of the Visible...

    https://www.kdnuggets.com/2017/03/medical-image-analysis-deep-learning.html

  • A Solution to Missing Data: Imputation Using R

    ...ted datasets for modelling. With this in mind, I can use two functions - with() and pool(). The with() function can be used to fit a model on all the datasets just as in the following example of linear model #fit a linear model on all datasets together...

    https://www.kdnuggets.com/2017/09/missing-data-imputation-using-r.html

  • How to Organize Data Labeling for Machine Learning: Approaches and Tools

    ...lications. It can be used for training neural networks — models used for object recognition tasks. Such projects require specialists to prepare large datasets consisting of text, image, audio, or video files. The more complex the task, the larger the network and training dataset. When a huge amount...

    https://www.kdnuggets.com/2018/05/data-labeling-machine-learning.html

  • Tidying Data in Python

    ...h b 2 Jane Doe b 11 Mary Johnson b 1 Tidying messy datasets   Through the following examples extracted from Wickham’s paper, we’ll wrangle messy datasets into the tidy format. The goal here is not to analyze the datasets but rather prepare them in a standardized way prior to the analysis....

    https://www.kdnuggets.com/2017/01/tidying-data-python.html

  • The Doing Part of Learning Data Science

    ...courses, “These techniques are easy to learn, hard to Master”. True. The MOOC courses on Machine Learning or even initial hands on experiences on toy datasets can very well give an illusion of competence, unless one has really gone into matters in point 1 above in depth and correlated Machine...

    https://www.kdnuggets.com/2018/02/doing-part-learning-data-science.html

  • Detecting Sarcasm with Deep Convolutional Neural Networks

    ...sed model consistently outperforms all the other models. Generalizability capabilities of the models were tested and the main finding was that if the datasets differed in nature, this significantly impacted the results. (See visualization of the datasets rendered via PCA below). For instance,...

    https://www.kdnuggets.com/2018/06/detecting-sarcasm-deep-convolutional-neural-networks.html

  • Additions to KDnuggets Directory in November

    ...d business. In Datasets :: Government and Public Data CMS.gov Centers for Medicare and Medicaid Services, Research, Statistics, Data, and Systems. In Datasets HitCompanies Datasets, comprehensive data on random 10,000 UK companies sampled from HitCompanies, updated automatically using AI/Machine...

    https://www.kdnuggets.com/2013/12/added-to-kdnuggets-in-november.html

  • Learning from Imbalanced Classes

    ..., the example classes were balanced, meaning there were approximately the same number of examples of each class. Instructors usually employ cleaned up datasets so as to concentrate on teaching specific algorithms or techniques without getting distracted by other issues. Usually you’re shown examples...

    https://www.kdnuggets.com/2016/08/learning-from-imbalanced-classes.html

  • Top KDnuggets tweets, Feb 4-5: A data scientist collection of useful and open datasets http; Big Data For Dummies

    Most popular KDnuggets tweets (see twitter.com/kdnuggets ) for Feb 4-5 were Top 10 Tweets A data scientist collection of useful and open datasets bit.ly/Y6cBNZ Never thought will see this title, but here it is ... "Big Data For Dummies" amzn.to/XaAOnm David Brooks finds small anecdotes in...

    https://www.kdnuggets.com/2013/02/top-tweets-feb4-feb5.html

  • Top stories for Dec 29 – Jan 4: Unicorn Data Scientists vs Data Science Teams; Top Datasets on Reddit

    Most viewed news items Unicorn Data Scientists vs Data Science Teams - Dec 30, 2013. Top Datasets on Reddit - Dec 28, 2013. Top stories in December: A Programmer Guide to Data Mining - Free Download; 3 Stages of Big Data - Jan 2, 2014. Top stories for Dec 22-29: Data Mining Applications with R;...

    https://www.kdnuggets.com/2014/01/top-news-week-Dec-29.html

  • KDnuggets™ News 15:n24, Jul 29: Big Data to Big Profits; Mining Massive Datasets; Data for Humanity

    ...Features: From Big Data to Big Profits: A Lesson from Google's Nest; Coursera / Stanford "Mining Massive Datasets", free online course; Data for Humanity: A Request for Support; To Code or Not to Code with KNIME; Data Mining/Data Science "Nobel Prize": ACM...

    https://www.kdnuggets.com/2015/n24.html

  • Top stories, Apr 24-30: How to Remove Duplicates in Large Datasets; The “Thinking” Part of “Thinking Like A Data Scientist”

    ...st unique views went down, and : post views went up. Larger triangle indicates larger change. Apr 24-30 Most Shared How to Remove Duplicates in Large Datasets - Apr 27, 2016. The "Thinking" Part of "Thinking Like A Data Scientist" - Apr 26, 2016. Microsoft is Becoming M(ai)crosoft - Apr 25, 2016....

    https://www.kdnuggets.com/2016/05/top-news-week-0424-0430.html

  • How to Remove Duplicates in Large Datasets

    comments By Suresh Kondamudi, CleverTap. Dealing with large datasets is often daunting. With limited computing resources, particularly memory, it can be challenging to perform even basic tasks like counting distinct elements, membership check, filtering duplicate elements, finding minimum,…

    https://www.kdnuggets.com/2016/04/clevertap-remove-duplicates-large-datasets.html

  • Auto-Scaling scikit-learn with Spark

    ...ibrary, and reports the best model back to the master: The code is the same as before, except for a one-line change: from sklearn import grid_search, datasets from sklearn.ensemble import RandomForestClassifier # Use spark_sklearn’s grid search instead: from spark_sklearn import GridSearchCV digits...

    https://www.kdnuggets.com/2016/02/auto-scaling-scikit-learn-spark.html

  • Implementing Your Own k-Nearest Neighbor Algorithm Using Python

    ...ew of kNN can be read here. A more in depth implementation with weighting and search trees is here. Full script The full script follows: from sklearn.datasets import load_iris from sklearn import cross_validation from sklearn.metrics import classification_report, accuracy_score from operator import...

    https://www.kdnuggets.com/2016/01/implementing-your-own-knn-using-python.html

  • Approaching (Almost) Any Machine Learning Problem

    ...boost instead of the implementation of GBM in scikit-learn since xgboost is much faster and more scalable. We can also do feature selection of sparse datasets using RandomForestClassifier / RandomForestRegressor and xgboost. Another popular method for feature selection from positive sparse datasets...

    https://www.kdnuggets.com/2016/08/approaching-almost-any-machine-learning-problem.html

  • Data Preparation Tips, Tricks, and Tools: An Interview with the Insiders

    ...elatively up to date (depending on the task), and in a format that we may be able to work with. If we are lucky, there are APIs or maybe even curated datasets out there. For instance, for a soccer-prediction hobby project, I ended up writing cron-tab powered Python web scrapers for dozens of...

    https://www.kdnuggets.com/2016/10/data-preparation-tips-tricks-tools.html

  • NYC Taxi Hackathon – find privacy risks in public taxi datasets

    By Jeff Garber, Director of Technology and Innovation, New York City Taxi and Limousine Commission (TLC) Help TLC provide more public data! The TLC has been a pioneer in sharing big data since 2010. With over 21,000 medallion taxis and street-hail liveries equipped to capture GPS-enabled trip...

    https://www.kdnuggets.com/2016/09/nyc-taxi-hackathon-privacy-public-datasets.html

  • Strata + Hadoop World 2015 Singapore – Day 2 Highlights

    …nt opinion they select samples which are not representative. Why to do qualitative research- it tells you why and how behind the what of quantitative datasets, can also reveal opportunities for innovation with existing datasets. Valuable Insights from Selected Sessions: Building and deploying real…

    https://www.kdnuggets.com/2015/12/strata-hadoop-2015-singapore-highlights-day2.html

  • Predicting purchases at retail stores using HPE Vertica and Dataiku DSS

    …for you. Great, isn’t it? 2. Data preparation The files are disparate, have unnecessary data… With Dataiku DSS, we can easily join, filter, split datasets together using visual recipes, with no need to code. So what do we do now? We first merge the 3 initial datasets (transactions, products,…

    https://www.kdnuggets.com/2016/06/dataiku-predicting-purchases-retail-stores-hpe-vertica.html

  • Caravel: Airbnb’s data exploration platform

    …ther accelerates analysis cycles by taking delays out of the equation. A thin semantic layer Caravel allows you to manage a thin layer to enrich your datasets’ metadata. This simple layer defines how your dataset is exposed to the user and is composed of: Descriptions, definitions, and verbose…

    https://www.kdnuggets.com/2016/04/caravel-airbnb-data-exploration-platform.html

  • Small Data requires Specialized Deep Learning and Yann LeCun response

    ...ith that, as well as several new operators that are now common in deep learning (such as ReLUs and contrast normalization). It’s only since 2010 with datasets like LabeLMe and ImageNet that computer vision datasets have been large enough to train large convolutional nets on natural images. I do...

    https://www.kdnuggets.com/2015/03/small-data-specialized-deep-learning-yann-lecun.html

  • Data APIs, Hubs, Marketplaces, Platforms, and Search Engines

    …base of information about companies. Enigma, wants to be “Google for public data” and provide easy access to government, NGO, and other public domain datasets. Exversion, search 100K+ datasets, consume them through one simple API or upload your own data to collaborate, publish, or share. Factual…

    https://www.kdnuggets.com/2013/08/data-apis-hubs-marketplaces-platforms-search-engines.html

  • Additions to KDnuggets in July

    ...tivity, literacy, education, housing, urbanisation, fertility, mortality, and more. Data.gov.au provides an easy way to find, access and reuse public datasets from the Australian Government. Australian Bureau of Statistics, access to the full range of ABS statistical and reference information. Open...

    https://www.kdnuggets.com/2013/08/added-to-kdnuggets-in-july.html

  • 2013 Dec: Analytics, Big Data, Data Mining and Data Science News

    ...tive approaches to leveraging advanced statistical and econometric modeling techniques to perform marketing mix modeling research on multiple massive datasets. Analytics Researcher, Adobe Research at Adobe, San Jose, CA - Dec 11, 2013. Research the next generation digital marketing applications and...

    https://www.kdnuggets.com/2013/12/index.html

  • Chordalysis: a new method to discover the structure of data

    ...elling. The result is Chordalysis: a log-linear analysis method for high-dimensional data. Chordalysis makes it possible to discover the structure of datasets with hundreds of variables on a standard computer. So far we've applied it successfully to datasets with up to 750 variables. A model...

    https://www.kdnuggets.com/2013/11/chordalysis-new-method-to-discover-structure-data.html

  • Poll Results: Largest Dataset Analyzed/Data Mined

    ...Pandre, Ph.D. - @ Gregory: thanks for doing this poll for many years! I think your poll is showing that "Big Data" is a Science fiction: only 10% of datasets above 10TB and almost 79% of datasets are below 1TB in size... Gregory Piatetsky-Shapiro - Not every data miner voted in this poll, but the...

    https://www.kdnuggets.com/2013/04/poll-results-largest-dataset-analyzed-data-mined.html

  • New Hybrid Rare-Event Sampling Technique for Fraud Detection

    ...represent subsets of Noise observations. We compute subsets of NT as NT1,NT2, ... to contain disjoint samples from NT. We build models from training datasets (ST, SV, SH, NT1), (ST, SV, SH, NT2), (ST, SV, SH, NT3) with corresponding validation and test datasets using noise subsets NT1, NV1, and...

    https://www.kdnuggets.com/2015/04/new-hybrid-rare-event-sampling-technique-fraud-detection.html

  • Cloud Machine Learning Wars: Amazon vs IBM Watson vs Microsoft Azure

    ...ific prediction problems. As Amazon's service does not feature deep learning or machine perception functionality, and can only be trained on supplied datasets (as opposed to more universal datasets like Imagenet, or large text corpora), it's unlikely to compete directly with MetaMind. Zachary Chase...

    https://www.kdnuggets.com/2015/04/cloud-machine-learning-amazon-ibm-watson-microsoft-azure.html

  • Additions to KDnuggets Directory in August

    ...ublish, search, and get updates about data. DataProvider, crawls the web to create a database of information about companies. Exversion, search 100K+ datasets, consume them through one simple API or upload your own data to collaborate, publish, or share. Factual location platform enriches mobile...

    https://www.kdnuggets.com/2013/09/added-to-kdnuggets-august.html

  • Understanding Convolutional Neural Networks for NLP

    …apply it to Sentiment Analysis and Text Categorization tasks. Results show that learning directly from character-level input works very well on large datasets (millions of examples), but underperforms simpler models on smaller datasets (hundreds of thousands of examples). [17] explores to…

    https://www.kdnuggets.com/2015/11/understanding-convolutional-neural-networks-nlp.html
