- The List of Top 10 Lists in Data Science - Aug 14, 2020.
The list of Top 10 lists that Data Scientists -- from enthusiasts to those who want to jump start a career -- must know to smoothly navigate a path through this field.
- New Poll: What was the largest dataset you analyzed / data mined? - Jun 9, 2020.
Take part in KDnuggets' latest survey to have your voice heard, and let the community know the largest dataset size you have worked with.
- Dataset Splitting Best Practices in Python - May 26, 2020.
If you are splitting your dataset into training and testing data, you need to keep some things in mind. This discussion of 3 best practices for doing so includes a demonstration of how to implement each consideration in Python.
- Data context and how to get started with understanding COVID-19 data - Apr 22, 2020.
If you are already applying your Data Science skills or getting ready to contribute to analyzing COVID-19 data, then be sure to take sufficient time to appreciate the context of the numbers to focus on what's most important as we collaborate on this global battle.
- 3 Best Sites to Find Datasets for your Data Science Projects - Apr 9, 2020.
When first learning data science, you will inevitably find yourself looking for more datasets to practice with. Here, we recommend the 3 best sites to find datasets to spark your next data science project.
- 10 Must-read Machine Learning Articles (March 2020) - Apr 9, 2020.
This list will feature some of the recent work and discoveries happening in machine learning, as well as guides and resources for both beginner and intermediate data scientists.
- 20+ Machine Learning Datasets & Project Ideas - Mar 9, 2020.
Upgrading your machine learning, AI, and Data Science skills requires practice. To practice, you need to develop models with a large amount of data. Finding good datasets to work with can be challenging, so this article discusses more than 20 great datasets along with machine learning project ideas for you to tackle today.
- The Big Bad NLP Database: Access Nearly 300 Datasets - Feb 28, 2020.
Check out this database of nearly 300 freely accessible NLP datasets, curated from around the internet.
- Passive Data Collection and Actionable Results: What to Know - Feb 21, 2020.
There are plenty of ways to get actionable results by using passive data. However, such an outcome will not happen without careful forethought. Data analysts must consider several crucial specifics, including what questions they want and expect the information to answer, and how they'll apply the findings to aid the business.
- Google Dataset Search Provides Access to 25 Million Datasets - Jan 29, 2020.
Google's dataset search is out of beta, and provides centralized access to 25 million datasets.
- The 5 Most Useful Techniques to Handle Imbalanced Datasets - Jan 22, 2020.
This post explains the various techniques you can use to handle imbalanced datasets.
- What is Data Catalog and Why You Should Care? - Dec 23, 2019.
Learn why data catalogs could be just the thing you need to meet the challenges of data and metadata management and collaboration.
- Data Sources 101 - Oct 28, 2019.
Data collection is one of the first steps of the data lifecycle — you need to get all the data you require in the first place. To collect the right data, you need to know where to find it and determine the effort involved in collecting it. This article answers the most basic question: where does all the data you need (or might need) come from?
- Know Your Data: Part 2 - Oct 8, 2019.
To build an effective learning model, you must understand the quality issues that exist in data and how to detect and deal with them. In general, data quality issues fall into four major categories.
- Training a Machine Learning Engineer - Oct 3, 2019.
There is no clear outline on how to study Machine Learning/Deep Learning, so many individuals apply every algorithm they have heard of and hope that one of the implemented algorithms works for the problem at hand. Below, I've listed some of the steps that one should adopt while solving a machine learning problem.
- Know Your Data: Part 1 - Sep 30, 2019.
This article will introduce the different types of data sets, data objects, and attributes.
- Version Control for Data Science: Tracking Machine Learning Models and Datasets - Sep 13, 2019.
I am a Git god, why do I need another version control system for Machine Learning Projects?
- 5 Ways to Deal with the Lack of Data in Machine Learning - Jun 10, 2019.
Effective solutions exist when you don't have enough data for your models. While there is no perfect approach, five proven ways will get your model to production.
- How to Automate Tasks on GitHub With Machine Learning for Fun and Profit - May 3, 2019.
Check out this tutorial on how to build a GitHub App that predicts and applies issue labels using TensorFlow and public datasets.
- Synthetic Data Generation: A must-have skill for new data scientists - Dec 27, 2018.
A brief rundown of methods/packages/ideas to generate synthetic data for self-driven data science projects and deep diving into machine learning methods.
- Handling Imbalanced Datasets in Deep Learning - Dec 4, 2018.
It’s important to understand why we should balance classes so that we can be sure it’s a valuable investment. Class balancing techniques are only really necessary when we actually care about the minority classes.
- Machine Learning Classification: A Dataset-based Pictorial - Nov 5, 2018.
In order to relate machine learning classification to the practical, let's see how this concept plays out, step by step (and with images), specifically in direct relation to a dataset.
- New Poll: What was the largest dataset you analyzed / data mined? - Oct 12, 2018.
The new KDnuggets Poll asks: What was the largest dataset you analyzed / data mined? Please vote and we will analyze the trends and publish the results.
- Semantic Interoperability: Are you training your AI by mixing data sources that look the same but aren’t? - Oct 9, 2018.
Semantic interoperability is a challenge in AI systems, especially since data has become increasingly complex. The other issue is that semantic interoperability may be compromised when people use the same system differently.
- Introducing VisualData: A Search Engine for Computer Vision Datasets - Sep 26, 2018.
Instead of building your own dataset, you can draw on a rich collection of computer vision datasets contributed by academic researchers, hobbyists, and companies.
- Announcing Microsoft Research Open Data, a cloud hosted platform for sharing datasets - Jun 28, 2018.
Microsoft announces Microsoft Research Open Data, datasets representing many years of data curation and research efforts by Microsoft that were published as research outcomes.
- How (dis)similar are my train and test data? - Jun 7, 2018.
This article examines a scenario where your machine learning model can fail.
- Human Involvement Helps Researchers Perfect New Algorithms to Train Robots - Mar 22, 2018.
Many underestimate the role of humans in the successful deployment of AI solutions. The Alegion engine produces AI training data and enables content moderation, sentiment analysis, data enrichment, tagging, categorization, and more.
- Training Sets, Test Sets, and 10-fold Cross-validation - Jan 9, 2018.
More generally, in evaluating any data mining algorithm, if our test set is a subset of our training data the results will be optimistic and often overly optimistic. So that doesn’t seem like a great idea.
- 70 Amazing Free Data Sources You Should Know - Dec 20, 2017.
70 free data sources for 2017 on government, crime, health, financial and economic data, marketing and social media, journalism and media, real estate, company directory and review, and more to start working on your data projects.
- How (and Why) to Create a Good Validation Set - Nov 24, 2017.
The definitions of training, validation, and test sets can be fairly nuanced, and the terms are sometimes inconsistently used. In the deep learning community, “test-time inference” is often used to refer to evaluating on data in production, which is not the technical definition of a test set.
- Building a Wikipedia Text Corpus for Natural Language Processing - Nov 23, 2017.
Wikipedia is a rich source of well-organized textual data, and a vast collection of knowledge. What we will do here is build a corpus from the set of English Wikipedia articles, which is freely and conveniently available online.
- 5 Machine Learning Projects You Can No Longer Overlook – Episode VI - Sep 20, 2017.
Deep learning, data preparation, data visualization, oh my! Check out the latest installment of '5 Machine Learning Projects You Can No Longer Overlook' for insight on... well, what machine learning projects you can no longer overlook.
- The new Enigma Public – the platform connecting people to data - Sep 11, 2017.
Public data has tremendous potential, and different people can use it to solve a variety of problems. Enigma relaunches Enigma Public, the platform connecting people to data.
- Interesting Things Learned as a Student of Machine Learning - Jun 29, 2017.
Did you ever learn something you didn't really want to? The path to machine learning mastery is paved with such collateral knowledge. Here are a few examples of such information I have gleaned while trekking away.
- Data for Democracy: The First Two Months of D4D - Feb 20, 2017.
Let’s hear how the Data for Democracy organisation uses Data Science for democracy and the well-being of human societies.
- More Data or Better Algorithms: The Sweet Spot - Jan 17, 2017.
We examine the sweet spot for data-driven Machine Learning companies, where it is neither too easy nor too hard to collect the needed data.
- Data Sources for Cool Data Science Projects - Dec 20, 2016.
One of the biggest obstacles to successful projects has been getting access to interesting data. Here are some more cool public data sources you can use for your next project.
- Largest Dataset Analyzed Poll shows surprising stability, more junior Data Scientists - Nov 8, 2016.
The majority (57%) of respondents only worked with Gigabyte range data. More junior Data Scientists enter the market, but Petabyte Big Data Scientists still stand apart.
- What is Academic Torrents and Where is Data Sharing Going? - Oct 26, 2016.
Learn more about Academic Torrents, a platform for researchers to share data consisting of a site where users can search for datasets, and a BitTorrent backbone which makes sharing data scalable and fast.
- New Poll: What was the largest dataset you analyzed / data mined? - Oct 22, 2016.
The new KDnuggets Poll asks: What was the largest dataset you analyzed / data mined? Please vote.
- Data Science Basics: 3 Insights for Beginners - Sep 22, 2016.
For data science beginners, 3 elementary issues are given overview treatment: supervised vs. unsupervised learning, decision tree pruning, and training vs. testing datasets.
- 10 Data Acquisition Strategies for Startups - Jun 14, 2016.
An interesting discussion of the myriad methods in which startups may choose to acquire data, often the most overlooked and important aspect of a startup's success (or failure).
- Top KDnuggets tweets, May 25-31: 19 Free eBooks to learn #programming with #Python; Awesome collection of public datasets on Github - Jun 1, 2016.
Introducing Hybrid lda2vec Algorithm via Stitch Fix; #DeepLearning and Deep #Gaussian Processes - explainer; Awesome collection of public #datasets on Github; #DataScience foundations: 19 Free eBooks to learn #programming with #Python.
- Top 10 Open Dataset Resources on Github - May 31, 2016.
The top open dataset repositories on Github include a variety of data, freely available for use by researchers, practitioners, and students alike.
- Datasets Over Algorithms - May 3, 2016.
The average elapsed time between key algorithm proposals and corresponding advances is about 18 years; the average elapsed time between key dataset availabilities and corresponding advances is less than 3 years, 6 times faster.
- CrowdSignals.io, Building Big Mobile Social Sensor dataset - Mar 25, 2016.
CrowdSignals.io is a crowdfunding campaign to generate the largest mobile and sensor dataset available to the Data Science community for use in research and product development.
- Interconnecting World Open Data Portals, Mar 8 Webinar - Feb 24, 2016.
Join OpenDataSoft for a web conference to contribute to building the next evolution of the List of 1600 Open Data portals worldwide, dubbed Open Data Inception by its creators.
- 9 Must-Have Datasets for Investigating Recommender Systems - Feb 11, 2016.
Gain some insight into a variety of useful datasets for recommender systems, including data descriptions, appropriate uses, and some practical comparison.
- Tour of Real-World Machine Learning Problems - Dec 26, 2015.
The tour lists 20 interesting real-world machine learning problems for data science enthusiasts to learn by solving.
- Poll Results: Where is Big Data? For most, Largest Dataset Analyzed is in laptop-size GB range - Aug 18, 2015.
A majority of data scientists (56%) work in the Gigabyte dataset range. We note a small increase in Petabyte (web-scale) data miners, and a decline in Megabyte data miners. US, Australia/NZ, and Asia lead in percentage of Terabyte and Petabyte analysts.
- Interview: Andrew Duguay, Prevedere on Economic Intelligence from Integrating Public Datasets - Jul 30, 2015.
We discuss Analytics at Prevedere Software, understanding the impact of external factors on a company’s performance, features of the in-memory correlation engine, and economic intelligence by Prevedere.
- Additions to KDnuggets Directory in April - May 3, 2015.
20+ new meetings, including Smartcon (Istanbul), Collab. Data Science, Boston Data Festival, SIGMOD 2016, ICDM 2016; Awesome public datasets; DecisionIQ, VisualText and more.
- KDnuggets™ News 15:n11, Apr 15: Big Data Predictive Analytics Gainers & Losers; Awesome Public Datasets - Apr 15, 2015.
Awesome Public Datasets on GitHub; Gold Mine or Blind Alley? Functional Programming for Machine Learning; Inside Deep Learning - Convolutional networks; KDnuggets Free Pass to Strata Hadoop World London.
- Top /r/MachineLearning Posts, Mar 29-Apr 4: Andrew Ng AMA, Deep Learning for NLP, and OpenCL Convnets - Apr 10, 2015.
Andrew Ng's upcoming AMA, scikit-learn updates, Richard Socher's Deep Learning NLP videos, Criteo's huge new dataset, and convolutional neural networks on OpenCL are the top topics discussed this week on /r/MachineLearning.
- Awesome Public Datasets on GitHub - Apr 6, 2015.
A long, categorized list of large datasets (available for public use) to try your analytics skills on. Which one would you pick?
- Interview: Anthony Bak, Ayasdi on Novel Insights using Topological Summaries - Jan 29, 2015.
We discuss examples of Topological Data Analysis (TDA) revealing new insights, the recommended approach for creating Topological Summaries, manual vs. automated approaches, and trends.
- Top /r/MachineLearning posts, Jan 11-17 - Jan 18, 2015.
SVMs, open source datasets, Bayesian decision theory, game AI, and deep learning visualizations are all featured in the past week's top /r/MachineLearning posts.
- SBP15 Grand Data Challenge - Dec 5, 2014.
Use social media analytics on public data to help analyze and explore social inequality and aid the disadvantaged in SBP15 Grand Data Challenge. Submissions due Jan 20.
- Free Urban Data – What’s It Good For? - Nov 1, 2014.
See how the increasing availability of free urban datasets that has come with more cities participating in free data programs can be applied to solve interesting problems in this Big Data article.
- TweetNLP: Twitter Natural Language Processing - Oct 24, 2014.
A short overview of Natural Language Processing tools and utilities developed by Prof. Noah Smith, CMU and his team to analyze Twitter data.
- Top KDnuggets tweets, Oct 17-19: Air traffic analyzed to predict Ebola spread; Cool public data for data science - Oct 20, 2014.
Air traffic data analyzed to predict Ebola spread; Some cool public data sources you can use for your next data science project; Data science can't be point and click! Finding random correlation is too easy; Bayes Rule in an animated gif.
- Interactive Network and Graph Data Repository - Oct 17, 2014.
The network repository currently hosts 500+ graphs/networks that span 19 collections of graphs from social science, machine learning, scientific computing, and many others.
- MOOC: “Process Mining: Data science in Action” - Sep 10, 2014.
This 6-week online course provides data science knowledge that can be applied directly to analyze and improve processes in a variety of domains.
- Top KDnuggets tweets, Aug 13-14: Boyfriend as a statistically “significant” other - Aug 15, 2014.
xkcd: Boyfriend as a statistically "significant" other; Interesting Social Media Datasets; Sibyl: a System for Large Scale Machine Learning at Google; We don't need such hype: "Big Data scientists get 100 recruiter emails a day".
- Interesting Social Media Datasets - Aug 13, 2014.
Learn about some of the many interesting social media datasets available to you, some of which are quite new, and the different features and challenges they offer you for your next big data science project.
- Top KDnuggets tweets, May 30 – Jun 1: Guide to Setting Up an R-Hadoop; 100+ Interesting Data Sets - Jun 2, 2014.
Tutorial: Step-by-Step Guide to Setting Up an R - #Hadoop System; 100+ Interesting Data Sets for Statistics (and Data Science); #BigData sets available for free - big list from Data Science Central; Twitter to release all tweets to scientists - a research boon and an ethical dilemma.
- US Open Data Action Plan and Datasets - May 31, 2014.
We summarize the key findings in the recently released US Open Data Action Plan, highlighting the principles, commitments, datasets released and future outlook.
- Top KDnuggets tweets, Mar 21-23: Machine Learning in Parallel with SVM; Good Data Sets for Data Science Practice - Mar 24, 2014.
Machine Learning in Parallel with SVM, GLM; Good Data Sets for Data Science Practice: Big enough, requires data engineering, rich; Cartoon: Why Madame Zaza, Fortune Teller, changes to Predictive Analytics; Top 45 #BigData Tools and Platforms for Developers.