Start from the ALL_AML_train_processed data set (zipped file in the Data directory). This data was thresholded to >= 20 and <= 16,000, with genes in rows. You should have generated it as part of assignment 3. Name it ALL_AML_gr.thr.train.csv. (Note: almost all of this data preprocessing is much easier with the data in genes-in-rows format. We will convert the data to genes-in-columns when we are ready to do the modeling.)
B. What is the largest fold difference and how many genes have it?
C. What is the lowest fold difference and how many genes have it?
D. Count how many genes have a fold ratio in each of the following ranges:
| Range | Count |
| --- | --- |
| Val <= 2 | .. |
| 2 < Val <= 4 | .. |
| 4 < Val <= 8 | .. |
| 8 < Val <= 16 | .. |
| 16 < Val <= 32 | .. |
| 32 < Val <= 64 | .. |
| 64 < Val <= 128 | .. |
| 128 < Val <= 256 | .. |
| 256 < Val <= 512 | .. |
| 512 < Val | .. |
E. Extra credit: Graph the fold ratio distribution appropriately.
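The range counts in part D can be sketched as below. This is only an illustration under assumptions: it presumes the per-gene fold ratios have already been computed into a list, and the names `EDGES`, `LABELS`, and `count_fold_ranges` are made up for this example. The sample values are invented and are not from the data set.

```python
from bisect import bisect_left

# Upper edges of the closed bins in the table: Val <= 2, 2 < Val <= 4, ..., 256 < Val <= 512.
EDGES = [2, 4, 8, 16, 32, 64, 128, 256, 512]
LABELS = (
    ["Val <= 2"]
    + [f"{lo} < Val <= {hi}" for lo, hi in zip(EDGES, EDGES[1:])]
    + ["512 < Val"]
)

def count_fold_ranges(fold_ratios):
    """Count how many fold ratios fall into each range of the table."""
    counts = [0] * len(LABELS)
    for value in fold_ratios:
        # bisect_left gives the index of the first edge >= value, so a value
        # exactly equal to an edge lands in the bin that the edge closes.
        counts[bisect_left(EDGES, value)] += 1
    return dict(zip(LABELS, counts))

# Invented fold ratios, for illustration only.
example = [1.0, 2.0, 2.5, 4.0, 16.0, 17.0, 800.0]
result = count_fold_ranges(example)
for label, count in result.items():
    print(f"{label:>16} : {count}")
```

Because the bins double in width at each step, the same counts could serve as the basis for a bar chart on a log2 axis, which is one possible reading of "appropriately" in part E.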
Stdev = sqrt((N*Sum_sq - Sum_val*Sum_val)/(N*(N-1)))
Here N is the number of observations,
Sum_val is the sum of the values, and
Sum_sq is the sum of the squares of the values.
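As a quick check of the formula above, here is a minimal sketch that computes the standard deviation from N, Sum_val, and Sum_sq. The function name `stdev_from_sums` is hypothetical, and the clamp to zero is an added safeguard against floating-point rounding, not part of the formula.

```python
import math

def stdev_from_sums(values):
    """Sample standard deviation via sqrt((N*Sum_sq - Sum_val^2) / (N*(N-1)))."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    sum_val = sum(values)
    sum_sq = sum(v * v for v in values)
    variance = (n * sum_sq - sum_val * sum_val) / (n * (n - 1))
    # Rounding can make the numerator very slightly negative for constant data.
    return math.sqrt(max(variance, 0.0))

# For 1..5: N = 5, Sum_val = 15, Sum_sq = 55, so variance = (275 - 225) / 20 = 2.5.
print(stdev_from_sums([1, 2, 3, 4, 5]))
```

This is the sample (N-1) form of the standard deviation, so it should agree with `statistics.stdev` from the Python standard library.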
Signal to Noise (S2N) ratio is defined as (Avg1 - Avg2) / (Stdev1 + Stdev2)
T-value is defined as (Avg1 - Avg2) / sqrt(Stdev1*Stdev1/N1 + Stdev2*Stdev2/N2)
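The two definitions can be written as small functions, as in the sketch below. The function and parameter names are made up for this example, and the first term under the square root is read as Stdev1 squared over N1. The numbers in the usage lines are invented to make the arithmetic easy to follow.

```python
import math

def signal_to_noise(avg1, stdev1, avg2, stdev2):
    """S2N = (Avg1 - Avg2) / (Stdev1 + Stdev2)."""
    return (avg1 - avg2) / (stdev1 + stdev2)

def t_value(avg1, stdev1, n1, avg2, stdev2, n2):
    """T = (Avg1 - Avg2) / sqrt(Stdev1^2 / N1 + Stdev2^2 / N2)."""
    return (avg1 - avg2) / math.sqrt(stdev1 * stdev1 / n1 + stdev2 * stdev2 / n2)

# Invented numbers: S2N = 40 / 16 = 2.5, T = 40 / sqrt(100/25 + 36/3) = 40 / 4 = 10.
s2n = signal_to_noise(100.0, 10.0, 60.0, 6.0)
t = t_value(100.0, 10.0, 25, 60.0, 6.0, 3)
print(s2n, t)
```

Both functions divide by zero when both standard deviations are 0, which can happen for a gene whose values were all clipped to the same threshold. How to treat such genes is left open here.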
A. Write a script that will compute, for each gene, the average and standard deviation for both classes ("ALL" and "AML").
Also compute the T-value and Signal to Noise ratio for each gene.
B. For each class, select the top 50 genes with the highest S2N ratio.
C. For each class, select the top 50 genes with the highest T-value.
D. How many genes are in common between the top 50 genes for ALL selected using S2N and those selected using T-value? How many genes are in common among the top 3 genes in each list?
E. Same question for the top genes for "AML".
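Parts B through E amount to ranking genes by a score and intersecting the resulting lists. The sketch below shows one way to do that, under a stated assumption: class 1 is ALL and class 2 is AML, so the top genes for ALL have the highest scores and the top genes for AML have the lowest (most negative) scores. The helper names and the toy scores are invented for illustration, with k = 2 standing in for 50.

```python
def top_genes(scores, k, highest=True):
    """Return the k gene names with the highest (or lowest) score."""
    ranked = sorted(scores, key=scores.get, reverse=highest)
    return ranked[:k]

def common_genes(list_a, list_b):
    """Genes that appear in both lists, in sorted order."""
    return sorted(set(list_a) & set(list_b))

# Toy per-gene scores computed as (ALL - AML); not real results.
s2n_scores = {"g1": 1.2, "g2": 0.9, "g3": -0.1, "g4": -1.5, "g5": 0.4}
t_scores = {"g1": 5.0, "g2": 2.0, "g3": -0.5, "g4": -6.0, "g5": 3.0}

k = 2
all_common = common_genes(top_genes(s2n_scores, k), top_genes(t_scores, k))
aml_common = common_genes(
    top_genes(s2n_scores, k, highest=False),
    top_genes(t_scores, k, highest=False),
)
print("ALL:", all_common)
print("AML:", aml_common)
```

An alternative reading is to recompute the scores with the classes swapped and always take the highest values. With the formulas as given, swapping the classes only flips the sign of each score, so both readings select the same genes.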