KDnuggets News 01:03, item 35, Requests

KDnuggets : News : 2001 : n03 : item35 (previous | next)

Requests

From: Gregory Piatetsky-Shapiro gps
Date: Thu, 25 Jan 2001 10:22:51 +0100
Subject: How to deal with the case when one class is rare?

Two readers asked similar questions about building models when one class is rare. Please reply to the editor, and I will summarize the latest collective wisdom. I have usually had good success with building a balanced sample first and building models on that.

Gregory PS.

--

daniel.perruchoud@ubs.com writes:

given: a sample of 100'000 targets "0" and 1'000 targets "1"

wanted: a reasonable oversampling strategy

Which, in your view, is the better solution for modeling:

(A) take only 1'000 targets "0" and all the 1'000 targets "1"

(B) sample 25'000 targets "0" and 25'000 targets "1" by replication
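The two options can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a recommendation: the record layout (a list of `(id, target)` pairs), the function names `undersample` and `oversample_by_replication`, and the fixed seed are all assumptions made for the example. It also assumes that "by replication" in (B) means drawing the rare class with replacement.

```python
import random


def undersample(majority, minority, rng):
    """Option (A): keep every minority record and draw an equal number
    of majority records without replacement."""
    return rng.sample(majority, len(minority)) + list(minority)


def oversample_by_replication(majority, minority, n_per_class, rng):
    """Option (B): draw n_per_class majority records without replacement
    and n_per_class minority records WITH replacement (replication)."""
    zeros = rng.sample(majority, n_per_class)
    ones = rng.choices(minority, k=n_per_class)
    return zeros + ones


# Hypothetical stand-in data: each record is (id, target).
rng = random.Random(0)
majority = [(i, 0) for i in range(100_000)]
minority = [(i, 1) for i in range(1_000)]

sample_a = undersample(majority, minority, rng)
sample_b = oversample_by_replication(majority, minority, 25_000, rng)

# Count how many different "1" records actually appear in sample (B).
distinct_ones_b = len({rec for rec in sample_b if rec[1] == 1})
print(len(sample_a), len(sample_b), distinct_ones_b)
```

Both samples are balanced, but they differ in what they contain: (A) discards 99'000 of the "0" records, while (B) keeps more of them and repeats each "1" record about 25 times on average, so it still holds at most 1'000 distinct "1" cases.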

--

A similar question is asked by Uwe Steinlein (steinlein@gmx.de):

I'm a PhD student from Germany, and I want to write about the influence of samples on the results, or on the variables that come up, e.g. in a decision tree or a regression. When I have a dataset with 200,000 non-responders and just 500 responders, I take a sample of 5,000 non-responders and all of the responders and run my models. When I do that with different samples, I can get different results (different influencing variables, different gain charts...), and I want to see whether there is a way to increase the quality of the results. Do you know if there are papers that deal with this kind of problem? Maybe some exist in the area of medical diagnostics (they have the same problem with cancer: 2 in 1,000 have it), but I didn't find anything.

Thank you very much in advance. Greetings
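One way to make the sample-to-sample variation described here visible is to repeat the draw of non-responders many times and record which variable comes out on top each time. The sketch below is a hedged illustration only: the synthetic data, the function names `top_variable` and `stability`, and the scaled-down sizes are assumptions, and a simple difference in class means stands in for a real model such as a decision tree or regression.

```python
import random
from collections import Counter
from statistics import mean


def top_variable(sample, n_vars):
    """Return the index of the variable with the largest absolute
    difference between responder and non-responder means."""
    scores = []
    for j in range(n_vars):
        ones = [x[j] for x, y in sample if y == 1]
        zeros = [x[j] for x, y in sample if y == 0]
        scores.append(abs(mean(ones) - mean(zeros)))
    return max(range(n_vars), key=scores.__getitem__)


def stability(non_responders, responders, n_draws, n_neg, n_vars, seed=0):
    """Redraw n_neg non-responders n_draws times, always keeping all
    responders, and count how often each variable ranks first."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_draws):
        sample = rng.sample(non_responders, n_neg) + responders
        counts[top_variable(sample, n_vars)] += 1
    return counts


# Hypothetical synthetic data, scaled down for speed: variables 0 and 1
# are weakly shifted for responders, variable 2 is pure noise.
data_rng = random.Random(1)
n_vars = 3
non_responders = [
    ([data_rng.gauss(0, 1) for _ in range(n_vars)], 0) for _ in range(20_000)
]
responders = [
    ([data_rng.gauss(0.3, 1), data_rng.gauss(0.3, 1), data_rng.gauss(0, 1)], 1)
    for _ in range(50)
]

counts = stability(non_responders, responders, n_draws=20, n_neg=500, n_vars=n_vars)
print(counts)
```

If the counts concentrate on one variable, the ranking is stable across draws; if they spread over several variables, the "influencing variables" depend on which non-responders happened to be sampled.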
