KDnuggets : Data Mining Course : Assignments : Final Project

Data Mining Course Final Project

Predict disease classes using genetic microarray data


Gene data is in genes-in-rows format, comma-separated values.
Take the final_project_data.zip file from the Data directory and unzip it to extract 3 files:


Training data: file pp5i_train.gr.csv, with 7070 genes (no Affy controls) for 69 samples. A separate file, pp5i_train_class.txt, has the class for each sample, in the same order as the samples in pp5i_train.gr.csv. There are 5 classes, labelled EPD, JPA, MED, MGL, RHB.

Test data: file pp5i_test.gr.csv, with 23 unlabelled samples and same genes. You can assume that the class distribution is similar.

Your goal is to learn the best model from the training data and use it to predict the label (class) for each sample in test data. You will also need to write a paper describing your effort.

Randomization experiments showed that random guessing gets about 10-12 correct answers out of 23.

The final grade will be a combination of effort (40%), presentation (30%), and the accuracy of prediction, measured as 3*(Number_correct_answers - 11). The maximum grade is 106.
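
As a worked example of the grading formula, the sketch below assumes that effort and presentation contribute up to 40 and 30 points respectively. The page gives them as percentages, so this reading is an assumption, although it is consistent with the stated maximum of 106. The function names are illustrative only.

```python
def accuracy_points(correct, baseline=11, weight=3):
    """Accuracy component of the grade: 3 * (Number_correct_answers - 11)."""
    return weight * (correct - baseline)


def max_grade(n_test=23, effort=40, presentation=30):
    """Highest possible grade, assuming effort and presentation are point values."""
    return effort + presentation + accuracy_points(n_test)


print(accuracy_points(11))  # 0: a result at the random-guessing level earns nothing
print(accuracy_points(23))  # 36: all 23 test samples correct
print(max_grade())          # 106
```

Note that a result below 11 correct answers gives a negative accuracy component under this formula.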

Below are suggested steps for doing this experiment, but you can vary and improve on the suggested approach, as long as you produce a prediction for the test set and describe your results.

Important Hints

Be sure that you don't use the sample number as one of the predictors. Training data is ordered by class, so sample number will appear to be a good predictor on cross-validation, but it will not work on the test data!

One of the MED samples in the training data is very likely misclassified (by a human). So the best result you can expect on cross-validation is one error (on a MED sample) out of 69. However, this should not affect your accuracy on the test set (all labels there are correct).

You can make all the runs from the Weka GUI, but if you learn a Unix shell, you can run these repeated experiments much more easily from the shell. (Caution: Weka cross-validation uses a random number seed that differs between the GUI and the shell, so cross-validation results may be slightly different when you call Weka from the shell than when you use Weka Explorer.)

You can complete the project using only simple steps, but the more advanced steps will give you extra credit and probably higher accuracy.

The following steps suggest one way of finding the best model -- you are welcome to make improvements, where you think appropriate.

See also Questions and Answers at the end.

Step 1. Data Cleaning

Threshold both train and test data to a minimum value of 20, maximum of 16,000.
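
A minimal sketch of this thresholding step, using only the Python standard library. It assumes that the first row of each CSV file is a header and that the first column holds the gene identifier; check this against the actual files. The function names and the tiny sample table are made up for illustration.

```python
import csv
import io

LOW, HIGH = 20, 16000  # thresholds given in Step 1


def clip_cell(text, low=LOW, high=HIGH):
    """Return the cell unchanged if it lies in [low, high], else the nearer bound."""
    value = float(text)
    if value < low:
        return str(low)
    if value > high:
        return str(high)
    return text


def threshold_rows(rows, low=LOW, high=HIGH):
    """Clip every expression value; keep the header row and gene-id column as-is.

    Assumes row 0 is a header and column 0 holds the gene identifier.
    """
    rows = list(rows)
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [header] + [
        [row[0]] + [clip_cell(cell, low, high) for cell in row[1:]] for row in body
    ]


# Tiny made-up table standing in for pp5i_train.gr.csv or pp5i_test.gr.csv.
sample = "ID,S1,S2,S3\nG1,-5,150,20000\nG2,20,16000,88.5\n"
clipped = threshold_rows(csv.reader(io.StringIO(sample)))
for row in clipped:
    print(",".join(row))
```

The same function would be applied to both the train and the test file, so that the two stay on the same scale.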

Step 2. Selecting top genes by class

Step 3. Find the best classifier/best gene set combination

Use the following Weka classifiers:

a. For each classifier, using default settings, measure classifier accuracy on the training set using previously generated files with top N=2,4,6,8,10,12,15,20,25,30 genes.
For IBk, test accuracy with K=2, 3 and 4.

b. Select the model and the gene set with the lowest cross-validation error.
Optional: once you have found the gene set with the lowest cross-validation error, you can vary 1-2 additional relevant parameters for each classifier to see if the accuracy improves. E.g., for J4.8, you can vary reducedErrorPruning and binarySplits.

c. Use the gene names from the best train gene set to extract the data for these genes from the test set.
d. Convert the test set to genes-in-columns format.
e. Add a Class column with "?" values as the last column.
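
Steps c to e can be sketched in a few lines of Python. This is only one possible approach, and it assumes the genes-in-rows test file has a header row and the gene identifier in the first column. The sample identifier is deliberately left out of the output, in line with the hint about not using the sample number as a predictor. The function name and the sample data are hypothetical.

```python
import csv
import io


def build_test_table(test_rows, best_genes):
    """Turn a genes-in-rows table into genes-in-columns for the chosen genes.

    test_rows: list of rows, header first, one row per gene afterwards.
    best_genes: gene identifiers in the same order as in the train file.
    Returns a table with one row per sample and a final Class column of "?".
    """
    header, body = test_rows[0], test_rows[1:]
    n_samples = len(header) - 1
    by_gene = {row[0]: row[1:] for row in body}

    missing = [gene for gene in best_genes if gene not in by_gene]
    if missing:
        raise KeyError(f"genes not found in test data: {missing}")

    table = [list(best_genes) + ["Class"]]
    for i in range(n_samples):
        # One output row per sample: its value for each selected gene, then "?".
        table.append([by_gene[gene][i] for gene in best_genes] + ["?"])
    return table


# Made-up stand-in for the thresholded test data: 3 genes, 2 samples.
sample = "ID,T1,T2\nG1,20,35\nG2,400,16000\nG3,77,91\n"
rows = list(csv.reader(io.StringIO(sample)))
table = build_test_table(rows, ["G3", "G1"])
for row in table:
    print(",".join(row))
```

Keeping the gene columns in the same order as in the train file makes it easier to get matching attribute declarations later.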

Step 4. Generate predictions for the test set

You should now have the best train file, call it pp5i_train.bestN.csv (with 69 samples and bestN genes, for whatever bestN you found), and a corresponding test file, call it pp5i_test.bestN.csv, with the same genes and 23 test samples. The train file will have all Class values, while the Class column of the test file will contain only "?".

a. Convert test file to arff format (you should already have .arff for train file from Step 3).

Important: In Weka, the variable declarations must be exactly the same in the test and train files.
To achieve that, change the Class entry in the pp5i_test.bestN.arff header section to be the same as in the train file, i.e.

@attribute Class {MED,MGL,RHB,EPD,JPA}
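
If you prefer not to edit the header by hand, the change can be scripted. The sketch below copies the Class declaration from the train header into the test header. It assumes the attribute is named exactly Class and is not quoted; the "string" type in the sample test header is made up for illustration, since the declaration Weka actually writes for a column of "?" values may differ.

```python
def sync_class_attribute(train_arff_text, test_arff_text, name="Class"):
    """Replace the test file's Class declaration with the one from the train file."""

    def is_class_decl(line):
        parts = line.strip().split(None, 2)
        return (
            len(parts) == 3
            and parts[0].lower() == "@attribute"
            and parts[1] == name
        )

    train_decl = next(
        (line for line in train_arff_text.splitlines() if is_class_decl(line)), None
    )
    if train_decl is None:
        raise ValueError(f"no '@attribute {name}' line in the train header")

    out = []
    in_header = True
    for line in test_arff_text.splitlines():
        if line.strip().lower() == "@data":
            in_header = False  # never touch anything in the data section
        out.append(train_decl if in_header and is_class_decl(line) else line)
    return "\n".join(out) + "\n"


# Made-up miniature ARFF files with a single gene attribute.
train = (
    "@relation train\n@attribute G1 numeric\n"
    "@attribute Class {MED,MGL,RHB,EPD,JPA}\n@data\n20,MED\n"
)
test = (
    "@relation test\n@attribute G1 numeric\n"
    "@attribute Class string\n@data\n35,?\n"
)
fixed = sync_class_attribute(train, test)
print(fixed)
```

Only the Class line changes; the gene attributes still need to appear in the same order in both files.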

b. Use the best train file and the matching test file and generate predictions for the test file class.

If you are using GUI, then

If you are using a shell, then you can generate predictions using, e.g.

java weka.classifiers.Classifier -t train_data.arff -T test.data.arff -p 0
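
For repeated runs it can help to generate such command lines with a short script. The following is a sketch only: the classifier class names are those used in recent Weka 3 releases and should be checked against your installed version, the file name pattern pp5i_train.top{n}.arff is hypothetical, and 10 folds is an assumed setting for the -x option. It builds the cross-validation commands for the gene counts listed in Step 3a and a prediction command of the form shown above; it does not run Weka itself.

```python
# Classifier command-line fragments (assumed class names, verify locally).
CLASSIFIERS = {
    "IBk-K2": ["weka.classifiers.lazy.IBk", "-K", "2"],
    "IBk-K3": ["weka.classifiers.lazy.IBk", "-K", "3"],
    "IBk-K4": ["weka.classifiers.lazy.IBk", "-K", "4"],
    "J48": ["weka.classifiers.trees.J48"],
}
GENE_COUNTS = [2, 4, 6, 8, 10, 12, 15, 20, 25, 30]  # values of N from Step 3a


def cv_command(classifier_args, train_arff, folds=10):
    """Cross-validation on the train file (-x sets the number of folds)."""
    return ["java", *classifier_args, "-t", train_arff, "-x", str(folds)]


def predict_command(classifier_args, train_arff, test_arff):
    """Train on one file and print predictions for the other (-p 0)."""
    return ["java", *classifier_args, "-t", train_arff, "-T", test_arff, "-p", "0"]


def sweep_commands(pattern="pp5i_train.top{n}.arff"):
    """One cross-validation command per (classifier, N) combination."""
    return [
        (name, n, cv_command(args, pattern.format(n=n)))
        for name, args in CLASSIFIERS.items()
        for n in GENE_COUNTS
    ]


for name, n, cmd in sweep_commands()[:2]:
    print(name, n, " ".join(cmd))

print(" ".join(predict_command(CLASSIFIERS["J48"],
                               "pp5i_train.bestN.arff",
                               "pp5i_test.bestN.arff")))
```

The printed lines can be pasted into a shell, or each argument list can be passed to subprocess.run if Weka is on the Java classpath.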

Step 5. Write a paper describing your effort.

Document each step.

For each classifier used, give a paragraph describing this classifier.
Give a graph showing error rate versus number of genes.

Describe which classifier and which number of genes you selected.

Comment on the relative strengths and weaknesses of the classifiers you used for this type of data.

Questions and Answers
