Assignment 3: Data Cleaning and Preparing for Modeling

The previous assignment used a preselected subset of the top 50 genes from a particular leukemia dataset. In this assignment you will do the work of a real data miner: you will work with an actual genetic dataset, starting from the beginning.

You will see that the process of data mining frequently has many small steps that all need to be done correctly to get good results. However tedious these steps may seem, the goal is a worthy one: helping to make an early diagnosis of leukemia, a common form of cancer. Making a correct diagnosis is literally a life-and-death decision, so we need to be careful to do the analysis correctly.

3A: Get data

Take the file ALL_AML_original_data.zip from the Data directory and extract from it:

Train file: data_set_ALL_AML_train.txt
Test file: data_set_ALL_AML_independent.txt
Sample and class data: table_ALL_AML_samples.txt

This data comes from pioneering work by Todd Golub et al. at the MIT Whitehead Institute (now the MIT Broad Institute).

1. Rename the train file to ALL_AML_grow.train.orig.txt and the test file to ALL_AML_grow.test.orig.txt.

Convention: we use the same file root for files of similar type and use different extensions for different versions of these files. Here "orig" stands for original input files and "grow" stands for genes in rows. We will use extension .tmp for temporary files that are typically used for just one step in the process and can be deleted later.

Note: The pioneering analysis by the MIT biologists is described in their paper "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring".

Both train and test datasets are tab-delimited files with 7130 records (one header record plus 7129 data records).
The "train" file should have 78 fields (2 + 2 x 38 samples) and the "test" file 70 fields (2 + 2 x 34 samples). The first two fields are
Gene Description (a long description like GB DEF = PDGFRalpha protein) and
Gene Accession Number (a short name like X95095_at).

The remaining fields come in pairs: an expression value in a column headed by the sample number (e.g. 1, 2, ..., 38) and an Affymetrix "call" (P if the gene is present, A if absent, M if marginal).

Think of the training data as a very tall and narrow table with 7130 rows and 78 columns. Note that it is "sideways" from a machine learning point of view: the attributes (genes) are in rows, and the observations (samples) are in columns. This is the standard format for microarray data, but to use it with machine learning tools like WEKA, we will need to transpose (flip) the matrix to make files with genes in columns and samples in rows. We will do that in step 3B.6 of this assignment.
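The transposition described above can be illustrated on a tiny table. This is only a sketch in Python (the assignment itself asks for a Java program or shell script); the function name transpose and the three-row table are illustrative, and the sketch assumes every row has the same number of fields.

```python
def transpose(rows):
    """Flip a rectangular table: rows become columns and columns become rows."""
    widths = {len(r) for r in rows}
    if len(widths) > 1:
        raise ValueError("all rows must have the same number of fields")
    return [list(col) for col in zip(*rows)]


# Illustrative "genes in rows" table: one header row, then one row per gene.
genes_in_rows = [
    ["ID", "1", "2"],
    ["A28102_at", "151", "263"],
    ["AB000114_at", "72", "21"],
]

# After transposing, each row is one sample and each column is one gene.
genes_in_cols = transpose(genes_in_rows)
for row in genes_in_cols:
    print(",".join(row))
```

Transposing twice returns the original table, which is a quick sanity check on any implementation.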

Here is a small extract:

Gene Description	Gene Accession Number	1	call	2	call	...
GB DEF = GABAa receptor alpha-3 subunit	A28102_at	151	A	263	P	...
...	AB000114_at	72	A	21	A	...
...	AB000115_at	281	A	250	P	...
...	AB000220_at	36	A	43	A	...
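To make the field layout concrete, the following Python sketch splits one record into its description, accession number, expression values, and calls. The helper name split_record is hypothetical, and the sample record uses only the first two value/call pairs shown in the extract; a real record has 38 (train) or 34 (test) such pairs.

```python
def split_record(line):
    """Split one tab-delimited record into its four logical parts."""
    fields = line.rstrip("\n").split("\t")
    description, accession = fields[0], fields[1]
    values = fields[2::2]  # 1-based fields 3, 5, 7, ...: expression values
    calls = fields[3::2]   # 1-based fields 4, 6, 8, ...: P / A / M calls
    return description, accession, values, calls


# First data record of the extract, truncated to two value/call pairs.
record = "GB DEF = GABAa receptor alpha-3 subunit\tA28102_at\t151\tA\t263\tP"
description, accession, values, calls = split_record(record)

# Keeping the accession number plus the values corresponds to
# keeping 1-based fields 2, 3, 5, 7, ...
kept = [accession] + values
print(kept)
```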

3B: Clean the data

Perform the following cleaning steps on both the train and test sets. Use Unix tools, scripts, or other tools for each task.

Document all the steps and create intermediate files for each step. After each step, report the number of fields and records in the train and test files. (Hint: Use the Unix command wc to find the number of records and use awk or gawk to find the number of fields.)
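As an alternative to wc and awk, the same report can be produced with a short script. The sketch below is in Python and is only one possible approach; the function name count_records_and_fields is hypothetical, and the two-line in-memory sample stands in for a real file. Reporting every distinct field count, rather than just the first, also reveals ragged records.

```python
import io
from collections import Counter


def count_records_and_fields(lines, delimiter="\t"):
    """Return (record count, Counter mapping field count -> number of records)."""
    n_records = 0
    field_counts = Counter()
    for line in lines:
        n_records += 1
        field_counts[len(line.rstrip("\r\n").split(delimiter))] += 1
    return n_records, field_counts


# In-memory stand-in for a file; with real data one would instead use, e.g.,
#   with open("ALL_AML_grow.train.orig.txt") as f:
#       n_records, field_counts = count_records_and_fields(f)
sample = io.StringIO(
    "Gene Description\tGene Accession Number\t1\tcall\n"
    "GB DEF = GABAa receptor alpha-3 subunit\tA28102_at\t151\tA\n"
)
n_records, field_counts = count_records_and_fields(sample)
print(n_records, dict(field_counts))
```

For comma-separated intermediate files, pass delimiter="," instead.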

Microarray Data Cleaning Steps

1. Remove the initial records with Gene Description containing "control". (Those are Affymetrix controls, not human genes.) Call the resulting files ALL_AML_grow.train.noaffy.tmp and ALL_AML_grow.test.noaffy.tmp.
   Hint: You can use the Unix command grep to remove the control records.
   How many such control records are in each file?
2. Remove the first field (long description) and the "call" fields, i.e. keep fields numbered 2,3,5,7,9,... Hint: use the Unix cut command to do that.
3. Replace all tabs with commas.
4. Change "Gene Accession Number" to "ID" in the first record. (You can use emacs here.)
   (Note: That will prevent possible problems that some data mining tools have with blanks in field names.)
5. Normalize the data: threshold each value so that the minimum is 20 and the maximum is 16,000, i.e. replace values below 20 with 20 and values above 16,000 with 16,000. (Note: Expression values less than 20 or over 16,000 were considered unreliable by biologists for this experiment.)
   Write a small Java program or Perl script to do that.
   Call the generated files ALL_AML_grow.train.norm.tmp and ALL_AML_grow.test.norm.tmp.
6. Write a short Java program or shell script to transpose the training and test data to get
   ALL_AML_gcol.train.tmp and ALL_AML_gcol.test.tmp ("gcol" stands for genes in columns). These files should each have 7071 fields, with 39 records in the "train" dataset and 35 records in the "test" dataset.
7. Extract from the file table_ALL_AML_samples.txt the tables
   ALL_AML_idclass.train.txt and ALL_AML_idclass.test.txt with sample id and sample labels, space separated.
   Here you can use a combination of Unix commands and manual editing in emacs.
   Add a header row with "ID Class" to each of the files.
   File ALL_AML_idclass.train.txt should have 39 records and two columns. The first record (header) has "ID Class", the next 27 records have class "ALL", and the last 11 records have class "AML". Be sure to remove all stray spaces and tabs from this file, leaving only the single space that separates the two columns.
   ALL_AML_idclass.test.txt should have 20 "ALL" samples and 14 "AML" samples, intermixed.
8. Note that the sample numbers in the ALL_AML_gcol.*.tmp files are in a different order than in the *idclass files. Use Unix commands to create combined files ALL_AML_gcol_class.train.csv and ALL_AML_gcol_class.test.csv which have ID as the first field, Class as the last field, and gene expression fields in between.
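Two of the steps above, thresholding the values to the range 20 to 16,000 and attaching class labels to samples whose IDs appear in a different order, can be sketched as follows. This is a minimal Python illustration, not the required Java, Perl, or Unix solution; the function names clamp and attach_classes are hypothetical, and the tiny tables (including their class labels) are made up for the example. The join is done through a lookup keyed on sample ID, so the order of the two inputs does not matter.

```python
def clamp(value, low=20, high=16000):
    """Limit an expression value to the range [low, high]."""
    return max(low, min(high, value))


def attach_classes(gcol_rows, idclass_rows):
    """Append the class label to each sample row, matching on sample ID.

    gcol_rows: header row plus one row per sample, with ID as the first field.
    idclass_rows: header row plus [ID, Class] pairs, in any order.
    """
    class_by_id = {sample_id: label for sample_id, label in idclass_rows[1:]}
    combined = [gcol_rows[0] + ["Class"]]
    for row in gcol_rows[1:]:
        # A KeyError here means a sample has no label in the idclass table.
        combined.append(row + [class_by_id[row[0]]])
    return combined


# Thresholding: in-range values are unchanged, out-of-range values are pulled in.
print([clamp(v) for v in (-5, 151, 20000)])

# Made-up "genes in columns" table whose samples are NOT in idclass order.
gcol = [
    ["ID", "A28102_at", "AB000114_at"],
    ["2", "263", "21"],
    ["1", "151", "72"],
]
# Made-up labels, for illustration only.
idclass = [["ID", "Class"], ["1", "ALL"], ["2", "AML"]]

combined = attach_classes(gcol, idclass)
for row in combined:
    print(",".join(row))
```

With Unix tools, the equivalent approach is to sort both files on the ID field before joining them.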

3C: Build Models on a full dataset

As in assignment 2, convert
ALL_AML_gcol_class.train.csv to ALL_AML_allgenes.train.arff
ALL_AML_gcol_class.test.csv to ALL_AML_allgenes.test.arff
Using ALL_AML_allgenes.train.arff as train file and ALL_AML_allgenes.test.arff as test, build a model using OneR. What accuracy do you get?
Now, excluding the field ID, build models using OneR, NaiveBayes Simple, and J4.8, using the training set only.
What models and error rates do you get with each method?
Warning: some of the methods may not finish or give you errors due to the large number of attributes for this data.
If you got this far, congratulations!
Based on your experience, what three things are important in the process of data mining?