Assignment 4: Feature Reduction

Start from the ALL_AML_train_processed data set (zipped file in the Data directory). This data was thresholded to >= 20 and <= 16,000, with genes in rows. You should have generated it as part of Assignment 3. Name it ALL_AML_gr.thr.train.csv. (Note: almost all of this data preprocessing is much easier with data in genes-in-rows format. We will convert the data to genes-in-columns when we are ready to do the modeling.)

1. Examining gene variation

A. Write a program (or a script) to compute the fold difference for each gene. Fold difference is the maximum value across samples divided by the minimum value. This value is frequently used by biologists to assess gene variability.

B. What is the largest fold difference and how many genes have it?

C. What is the lowest fold difference and how many genes have it?

D. Count how many genes have a fold difference (Val) in each of the following ranges:

Range	Count
Val <= 2	..
2 < Val <= 4	..
4 < Val <= 8	..
8 < Val <= 16	..
16 < Val <= 32	..
32 < Val <= 64	..
64 < Val <= 128	..
128 < Val <= 256	..
256 < Val <= 512	..
512 < Val	..

E. Extra credit: Graph the distribution of fold differences appropriately.
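
As a starting point for parts A and D, here is a minimal sketch of the fold difference and range counts, not a reference solution. The gene names and values are made up for illustration, and reading ALL_AML_gr.thr.train.csv is left out because its exact column layout is not assumed here. Because the data was thresholded to >= 20, the minimum is never zero, so the division is safe.

```python
from collections import Counter

# Upper edges of the ranges in the table above.
EDGES = [2, 4, 8, 16, 32, 64, 128, 256, 512]

def fold_difference(values):
    """Maximum value across samples divided by the minimum value."""
    return max(values) / min(values)

def bin_label(fold):
    """Return the label of the table range that contains `fold`."""
    lower = None
    for upper in EDGES:
        if fold <= upper:
            if lower is None:
                return f"Val <= {upper}"
            return f"{lower} < Val <= {upper}"
        lower = upper
    return f"{lower} < Val"

# Hypothetical toy data: gene name -> expression values across samples.
genes = {
    "geneA": [20, 20, 20],
    "geneB": [20, 100, 16000],
    "geneC": [40, 80, 120],
    "geneD": [100, 150, 200],
}

folds = {name: fold_difference(vals) for name, vals in genes.items()}
counts = Counter(bin_label(f) for f in folds.values())

for label, n in counts.items():
    print(label, n)
```

Looping over the range edges, rather than taking a logarithm, keeps the boundary cases (a fold difference of exactly 2, 4, 8, ...) in the lower range, as the table requires.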

2. Finding most significant genes

For the train set, samples 1-27 belong to class ALL, and samples 28-38 belong to class AML.
For each gene, let Avg1, Avg2 be the average expression values in class ALL and class AML, respectively.
Let Stdev1, Stdev2 be the corresponding sample standard deviations, which can be computed as

Stdev = sqrt((N*Sum_sq - Sum_val*Sum_val)/(N*(N-1)))

Here N is the number of observations,
Sum_val is the sum of values,
Sum_sq is the sum of squares of values.
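
The formula above can be sketched in Python as follows. The function name and the sample values are illustrative assumptions. The clamp to zero is an added guard, not part of the formula: with floating-point input, rounding can make the numerator slightly negative for near-constant values. The result is checked against the standard library's statistics.stdev, which also computes the sample (N-1) standard deviation.

```python
import math
import statistics

def stdev_from_sums(values):
    """Sample standard deviation from N, Sum_val and Sum_sq."""
    n = len(values)
    sum_val = sum(values)
    sum_sq = sum(v * v for v in values)
    numerator = n * sum_sq - sum_val * sum_val
    # Added guard: clamp tiny negative rounding errors to zero.
    return math.sqrt(max(numerator, 0.0) / (n * (n - 1)))

# Hypothetical expression values for one gene in one class.
sample = [20.0, 35.0, 50.0, 80.0]

print(stdev_from_sums(sample))
print(statistics.stdev(sample))
```

The sums form needs only one pass over the data, which is why it is convenient when processing a file row by row.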

Signal to Noise (S2N) ratio is defined as (Avg1 - Avg2) / (Stdev1 + Stdev2)

T-value is defined as (Avg1 - Avg2) / sqrt(Stdev1*Stdev1/N1 + Stdev2*Stdev2/N2), where N1 and N2 are the numbers of samples in class 1 and class 2.
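
The two definitions above can be sketched for a single gene as shown below. The function name and the input values are hypothetical. This sketch does not handle a gene whose values are constant in both classes: both denominators are then zero and Python raises ZeroDivisionError, so such genes would need separate treatment.

```python
import math
import statistics

def s2n_and_t(class1, class2):
    """Return (S2N, T-value) for one gene, given its values in each class."""
    avg1, avg2 = statistics.mean(class1), statistics.mean(class2)
    sd1, sd2 = statistics.stdev(class1), statistics.stdev(class2)
    n1, n2 = len(class1), len(class2)
    s2n = (avg1 - avg2) / (sd1 + sd2)
    t_value = (avg1 - avg2) / math.sqrt(sd1 * sd1 / n1 + sd2 * sd2 / n2)
    return s2n, t_value

# Hypothetical values for one gene: class 1 samples, then class 2 samples.
class1 = [100.0, 120.0, 140.0]
class2 = [20.0, 30.0, 40.0]

s2n, t_value = s2n_and_t(class1, class2)
print(s2n, t_value)
```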

A. Write a script that will compute, for each gene, the average and standard deviation for both classes ("ALL" and "AML").
Also compute the T-value and Signal to Noise ratio for each gene.

B. Select, for each class, the top 50 genes with the highest S2N ratio.

Which gene has the highest S2N for "ALL" (i.e. its high value is most correlated with class "ALL")? Which has the 50th highest? Give gene names and S2N values.
Answer the same questions for class "AML".
What is the relationship between S2N values for ALL and AML?

C. Select, for each class, the top 50 genes with the highest T-value.

Which gene has the highest T-value for class "ALL"? Which has the 50th highest? Give gene names and T-values.
Answer the same questions for class "AML".
What is the relationship between T-values for ALL and AML? Will a similar relationship hold if there are more than 2 classes?

D. How many genes are in common between the top 50 genes for ALL selected using S2N and those selected using T-value? How many genes are in common among the top 3 genes in each list?

E. Answer the same questions for the top genes for "AML".
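
Ranking genes and comparing the resulting lists can be sketched as below. The gene names, scores, and the helper top_k are hypothetical, and k is set to 2 only to keep the toy example small. The sketch assumes scores were computed with ALL as class 1, so it ranks by highest score for ALL and, as one possible choice, by lowest score for AML.

```python
def top_k(scores, k, highest=True):
    """Return the k gene names with the highest (or lowest) scores."""
    ranked = sorted(scores, key=scores.get, reverse=highest)
    return ranked[:k]

# Hypothetical per-gene scores, computed with ALL as class 1.
s2n = {"g1": 3.0, "g2": 1.2, "g3": -0.4, "g4": -2.5, "g5": 0.9}
t_val = {"g1": 6.9, "g2": 0.8, "g3": -1.0, "g4": -5.1, "g5": 2.2}

top_all_s2n = top_k(s2n, 2)
top_all_t = top_k(t_val, 2)
common_all = set(top_all_s2n) & set(top_all_t)

# Assumption: genes most associated with AML have the lowest scores.
top_aml_s2n = top_k(s2n, 2, highest=False)

print(top_all_s2n, top_all_t, sorted(common_all), top_aml_s2n)
```

Set intersection ignores rank order, so it answers "how many genes are in common" but not whether the shared genes appear at the same positions in both lists.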

3. Lessons Learned

What have you learned so far about the process of feature selection and data preparation?