KDnuggets Home » Polls » More Data or Better Algorithm? (Apr 2008)

Poll
What will usually give better improvement in data mining results: [122 votes total]

Adding more data (55) 45%
Using more advanced algorithms (24) 20%
It depends (please comment) (43) 35%


Comments

Ed Freeman, More Data or Better Algorithms?
What really helps:
1) Having a good problem to work on
2) Having a good approach to that problem
3) Having decent data. Quality is much more important than quantity.
4) Handling the data well -- good transforms, good missing value handling, making sure that all approaches make sense for the problem.

Dean Abbott, More Data or More Advanced Algorithms
The question is a good one, and generally speaking, I find that more data, or better representation of existing data, gives better performance than more advanced algorithms. Usually, I find that just more data is not enough, but better features (particularly multi-variate features) can provide significant model improvement.

However, just as important is the expertise of the modeler using the algorithm. For example, a top-notch modeler can wring more information out of less data than a novice modeler can with lots of data.

Greg Safarz, more data
I have consistently gotten better results by adding more data to a problem rather than trying new methods. I do not mean more data in terms of added observations, but more columns of data. More attributes and features win hands down. You can have the best algorithms ever invented, but if you do not have the 'right' information to put through those algorithms you have nothing.

Janaki Gopalan, More data or better algorithm
In my experience mining data, whether we need more data or a better algorithm is definitely problem dependent.
Ideally, more data is always the best bet because we get more data to 'train', more data to 'test' and more data to 'validate'. Increasing the sample space in all areas of data pre-processing will make a better model. Nevertheless, I worked on mining real time breast cancer data and my problem was that I had limited data. But the limited data had enough potential to qualify for mining, and the challenge was to find better algorithms or combinations of techniques to mine the data and get better results.
A balance of both (reasonable sample space of data with a good algorithm) will probably work out well.

Peter, Data or Algorithms
In my experience, more salient features created out of higher quality data nearly always trump adding more data or trying different algorithms while holding fixed the features and data quality.

Jozo Kovac, What is important to YOU?
You can have enough data and a proven algorithm and still have questionable results.
But what are "results"? Model accuracy, model benefits in the real world, new extracted knowledge (rules) about your customers?
Always keep your real world goals in mind and decide after considering all aspects. Gathering more data or owning a new algorithm can be much more time consuming than good analysis and a clever solution.
Sometimes you create more models, sometimes you change cutoffs, redefine the target, divide the task into subtasks, combine scoring with other criteria, change processes and actions related to scoring, ... There are always ways to be successful.

Patrick Herron, It depends upon "improvement"
More data or better algorithms? Generally the answer is yes to both. However, what constitutes "improvement" is entirely dependent on the context and the project requirements. Do you wish to increase speed? The answer, I would guess, is better algorithms. Lower costs? Better algorithms. Improve accuracy, where cost and speed are of little issue? More data, please. Even this answer is too general; there are many exceptions to what I am saying.

Alexandru Floares, More data or better algorithms
If the data set is small, e.g., the number of cases is less than 10 times the number of features, and the quality is reasonable, adding data can improve the accuracy. If the data quality is low, adding data can improve the accuracy by increasing the number of informative cases that remain in the data set after pre-processing or cleaning the initial data.
On the algorithm side, balancing unbalanced data (e.g. two classes: Class A 10% and Class B 90%) can improve the accuracy and ensemble methods (boosting, bagging, etc.) can improve the accuracy of the results.
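
As a rough illustration of the two ideas in this comment, here is a minimal Python sketch: a check for the "fewer than 10 cases per feature" rule of thumb, and one simple way to balance classes by randomly duplicating minority-class rows. The function names, the toy data, and the choice of random oversampling are assumptions made for illustration; the comment does not say which balancing method was meant.

```python
import random
from collections import Counter


def is_small_dataset(n_cases, n_features, ratio=10):
    """Rule of thumb from the comment: small if cases < ratio * features."""
    return n_cases < ratio * n_features


def oversample_minority(rows, labels, seed=0):
    """Randomly duplicate rows of smaller classes until every class
    has as many rows as the largest class (random oversampling)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        # All original rows that belong to this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - count)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Hypothetical data: 100 cases, 20 features, classes split 10% / 90%.
rows = [[i] * 20 for i in range(100)]
labels = ["A"] * 10 + ["B"] * 90

small = is_small_dataset(len(rows), n_features=20)
balanced_rows, balanced_labels = oversample_minority(rows, labels)
balanced_counts = Counter(balanced_labels)
print(small, dict(balanced_counts))
```

With this toy data the rule of thumb flags the set as small (100 cases against 20 features), and both classes end up with 90 rows. In practice, oversampling would normally be applied only to the training portion of the data, so that duplicated rows do not leak into the test set.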

Louise Francis, Data Quality
Given the current grim situation in my industry (insurance) and many other industries with respect to data quality, I believe effort expended in creating databases with high quality data will yield a greater improvement than applying more sophisticated techniques to poor data. Part of creating better databases could involve including more data from more sources in data mining databases. However, significant effort is needed to make data from all sources accurate, valid, complete and timely.

Laurence Moseley, More data or better algorithms
We need both. However, the 'more data' part would be valuable only if the quality of the data is improved. Given that many of the interesting and valuable findings occur at the extreme ends of the distribution, poor quality data can have a substantial deleterious effect on what we can reasonably conclude. If, say, we are looking at the demographic characteristics of someone in a disease category in which the prevalence is at the 1 in 1,000 level, the accuracy with which those characteristics have been recorded becomes a major consideration.

Raj Nagappan, more data vs better algorithm
I find that a better algorithm may improve results by 2-5% with considerable effort. But improving the data that goes in can easily bump up several algorithms at the same time by up to 10%.
