Genetic Data Mining: The Correlation Coefficient (KDnuggets News 08:17, item 24, Publications)

Publications

From: Bruce Ratner
Date: 05 Sep 2008
Subject: Genetic Data Mining: The Correlation Coefficient

Assessing the relationship between a predictor variable and a target variable is an essential task in building a statistical linear regression model. If the relationship is straight-line (linear), then no extra work of straightening it is needed: simply test the predictor variable's statistical importance to decide whether it stays in the model.

If the relationship is not linear, then one of the two variables (or sometimes both) is re-expressed so that the "re-expressed" relationship is as linear as the data permit. The re-expressed variable is then tested for inclusion in the model. Most methods of assessing relationships among variables are based on the well-known correlation coefficient, which is often misused because its linearity assumption (i.e., that the true underlying relationship is straight-line) is not checked with a scatterplot.
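
To make the point about re-expression concrete, here is a minimal sketch in Python. It is not the GenIQ Model; it uses a plain logarithmic re-expression on made-up data, and the names pearson_r, x, and y are illustrative assumptions only. It shows how the correlation coefficient understates a strong but curved relationship, and how straightening the target raises it.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data (an assumption for illustration): the target grows
# exponentially with the predictor, so the relationship is curved.
x = list(range(1, 11))
y = [math.exp(0.5 * v) for v in x]

# Correlation on the raw, curved relationship.
r_raw = pearson_r(x, y)

# Correlation after re-expressing the target with a log transform,
# which straightens an exponential relationship.
r_log = pearson_r(x, [math.log(v) for v in y])

print(f"r(x, y)     = {r_raw:.3f}")
print(f"r(x, log y) = {r_log:.3f}")
```

In practice the right re-expression is rarely known in advance, which is the gap a data-straightener method is meant to fill; a scatterplot of the raw and re-expressed pairs remains the basic check on the linearity assumption.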

The purpose of this article is to illustrate a genetic data mining method, the GenIQ Model, that is one of the better "data-straightener" methods available. I use a small dataset to keep the "GenIQ data-straightener" method tractable and to make it attractive enough for the everyday model builder to add to the modeler's toolkit. I present a succinct discussion of the genetic-based method, a basic statement of the GenIQ Model, and what is "good-to-know" about the GenIQ Model output, to ease the understanding of genetic data mining for the correlation coefficient.