Requests

From: Gregory Piatetsky-Shapiro gps
Date: Thu, 25 Jan 2001 10:22:51 +0100
Subject: How to deal with the case when one class is rare?

Two readers asked similar questions about building models when one class is rare. Please reply to the editor and I will summarize the latest collective wisdom. I have usually had good success with building a balanced sample first and then building models on that.

Gregory PS.

--
daniel.perruchoud@ubs.com writes:

Given: a sample of 100'000 targets "0" and 1'000 targets "1"
Wanted: a reasonable oversampling strategy

Which, in your opinion, is the better solution for modeling:
(A) take only 1'000 targets "0" and all of the 1'000 targets "1", or
(B) sample 25'000 targets "0" and 25'000 targets "1" by replication, or
(C) some other strategy with a smaller enrichment/boosting factor?

--
A similar question is asked by Uwe Steinlein steinlein@gmx.de:

I'm a PhD student from Germany and I want to write about the influence of samples on the results, or on the variables that come up, e.g. in a decision tree or a regression. When I have a dataset with 200.000 non-responders and just 500 responders, I take a sample of 5.000 non-responders and all of the responders and run my models. When I do that with different samples I can get different results (different influencing variables, different gain charts...), and I want to see whether there is a way to increase the quality of the results.

Do you know whether there are papers that deal with this kind of problem? Maybe some exist in the area of medical diagnostics (they have the same problem with cancer: 2 in 1000 have it), but I didn't find anything.

Thank you very much in advance, greetings
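
For readers who want to try the two concrete options from the first question, strategies (A) and (B) might be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names, the `(record_id, target)` toy records, and the fixed random seed are hypothetical and do not come from either reader, and the sketch makes no claim about which strategy models better.

```python
import random


def balanced_undersample(majority, minority, seed=0):
    """Strategy (A): keep every rare case and draw an equal number
    of common cases without replacement."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + list(minority)


def oversample_by_replication(majority, minority, n_per_class, seed=0):
    """Strategy (B): draw n_per_class common cases without replacement
    and n_per_class rare cases with replacement (i.e. by replication)."""
    rng = random.Random(seed)
    sampled_majority = rng.sample(majority, n_per_class)
    replicated_minority = rng.choices(minority, k=n_per_class)
    return sampled_majority + replicated_minority


# Toy records (hypothetical): (record_id, target) pairs in the
# proportions given in the question.
zeros = [(i, 0) for i in range(100_000)]
ones = [(i, 1) for i in range(100_000, 101_000)]

sample_a = balanced_undersample(zeros, ones)             # 1'000 + 1'000
sample_b = oversample_by_replication(zeros, ones, 25_000)  # 25'000 + 25'000
```

Changing `seed` yields a different draw of the common class, which is one simple way to observe the sample-to-sample variation described in the second question.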
Copyright © 2001 KDnuggets.