Predict Treatment Outcome

Note: For this assignment, we used CART from Salford Systems, which was available to us under an educational license. If CART is not available, another decision tree tool, such as J4.8 in Weka, can be used instead.

Start with the genes-leukemia.csv dataset used in Assignment 2 (see the Dataset directory).

As the target field (the class to predict), use TREATMENT_RESPONSE, which has the values Success, Failure, or "?" (missing).

Step 1. Examine the records where TREATMENT_RESPONSE is non-missing.

Q1: How many such records are there?

Q2: Can you describe these records using other sample fields (e.g., Year from XXXX to YYYY, or Gender = X)?

Q3: Why is it not correct to build predictive models for TREATMENT_RESPONSE using records where it is missing?
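
The selection in Step 1 can be sketched in Python with the standard library alone. This is only an illustration under stated assumptions: the inline rows are made-up stand-ins for genes-leukemia.csv, and Year and Gender are just the example field names from Q2. Only SNUM, TREATMENT_RESPONSE, and the "?" marker come from the assignment text.

```python
import csv
import io

# Made-up stand-in for genes-leukemia.csv. Only SNUM, TREATMENT_RESPONSE and
# the "?" missing marker come from the assignment; the rest is hypothetical.
SAMPLE_CSV = """SNUM,Year,Gender,TREATMENT_RESPONSE
1,1990,M,Success
2,1991,F,?
3,1990,F,Failure
4,1992,M,?
5,1990,M,Success
"""

def non_missing_rows(rows, field="TREATMENT_RESPONSE", missing="?"):
    """Return the rows whose `field` holds a real value, not the missing marker."""
    return [r for r in rows if r[field].strip() not in ("", missing)]

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
labeled = non_missing_rows(rows)

print(len(labeled), "records with a known treatment response")
# Summarize the other sample fields over the selected records (compare Q2).
for field in ("Year", "Gender"):
    print(field, sorted({r[field] for r in labeled}))
```

To run this on the real file, replace io.StringIO(SAMPLE_CSV) with an open file handle for genes-leukemia.csv.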

Step 2. Select only the records with non-missing TREATMENT_RESPONSE. Keep SNUM (sample number), but remove sample fields whose values are all the same or all missing. Call the reduced dataset genes-reduced.csv.

Q4: Which sample fields should you keep?
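
One possible way to find the fields to remove in Step 2 is sketched below. The records and the Year, Gender, and Source field names are hypothetical. The sketch also makes an assumption of its own: a field counts as uninformative when its non-missing values are all identical, or when it has no non-missing values at all.

```python
def uninformative_fields(rows, keep=("SNUM",), missing="?"):
    """List fields whose non-missing values are all identical or entirely absent."""
    drop = []
    for field in rows[0]:
        if field in keep:
            continue  # always keep the sample number
        values = {r[field] for r in rows if r[field] not in ("", missing)}
        if len(values) <= 1:
            drop.append(field)
    return drop

# Hypothetical records that already passed the non-missing filter of Step 1.
rows = [
    {"SNUM": "1", "Year": "1990", "Gender": "M", "Source": "?",
     "TREATMENT_RESPONSE": "Success"},
    {"SNUM": "3", "Year": "1990", "Gender": "F", "Source": "?",
     "TREATMENT_RESPONSE": "Failure"},
    {"SNUM": "5", "Year": "1990", "Gender": "M", "Source": "?",
     "TREATMENT_RESPONSE": "Success"},
]

drop = uninformative_fields(rows)
reduced = [{k: v for k, v in r.items() if k not in drop} for r in rows]
print("dropped:", drop)
print("kept:", list(reduced[0]))
```

The reduced rows could then be written out as genes-reduced.csv with csv.DictWriter.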

Step 3. Build a CART model using leave-one-out cross-validation.

Q5: What tree do you get, and what is the expected error rate?

Q6: What are the important variables and their relative importance, according to CART?

Q7: Remove the top predictor and re-run CART. What do you get?
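
Leave-one-out cross-validation holds out each sample in turn, builds the model on the remaining samples, and tests it on the held-out one; the estimated error rate is the fraction of held-out samples that are misclassified. The sketch below illustrates only this procedure. The learner is a one-split decision stump, which is a deliberately simple stand-in and not CART, and the two-gene expression values are made up.

```python
from collections import Counter

def fit_stump(X, y):
    """Pick the single (feature, threshold) split with the fewest training errors."""
    majority = Counter(y).most_common(1)[0][0]
    # Fallback if no split helps: predict the majority class on both sides.
    best = (sum(label != majority for label in y), 0, float("inf"), majority, majority)
    classes = sorted(set(y))
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold between adjacent values
            for left in classes:
                for right in classes:
                    if left == right:
                        continue
                    errors = sum((left if row[j] <= t else right) != label
                                 for row, label in zip(X, y))
                    if errors < best[0]:
                        best = (errors, j, t, left, right)
    return best[1:]

def predict_stump(model, row):
    j, t, left, right = model
    return left if row[j] <= t else right

def loo_error(X, y):
    """Fraction of samples misclassified when each one is held out in turn."""
    wrong = 0
    for i in range(len(X)):
        model = fit_stump(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        wrong += predict_stump(model, X[i]) != y[i]
    return wrong / len(X)

# Made-up expression values for two hypothetical genes.
X = [[120, 30], [150, 42], [135, 55], [900, 38], [870, 47], [950, 51]]
y = ["Failure", "Failure", "Failure", "Success", "Success", "Success"]

print("split on full data:", fit_stump(X, y))
print("leave-one-out error:", loo_error(X, y))
```

Removing the top predictor, as in Q7, corresponds here to deleting the chosen column from X and calling loo_error again.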

Step 4. Extra credit (10%).
Use Google to search the web for the name of the top gene that predicts the outcome, and briefly report the relevant information you find.

Step 5. Randomization test.
Randomize the TREATMENT_RESPONSE variable 10 times and re-run CART on each randomized class assignment.

Q8: Report the trees and error rates you get.

Q9: Based on the results in Q8, do you think the tree that you found with the original data is significant?
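
The randomization procedure in Step 5 can be sketched as follows. The idea is to shuffle the class labels, which breaks any real link between genes and outcome, and then compare the error rates on the shuffled labels with the error rate on the original labels. The sketch rests on assumptions: the data are made up, and a leave-one-out 1-nearest-neighbour rule stands in for re-running CART. The function names are illustrative.

```python
import random

def loo_1nn_error(X, y):
    """Leave-one-out error of a 1-nearest-neighbour rule (stand-in for CART)."""
    wrong = 0
    for i, row in enumerate(X):
        others = [k for k in range(len(X)) if k != i]
        nearest = min(others,
                      key=lambda k: sum((a - b) ** 2 for a, b in zip(row, X[k])))
        wrong += y[nearest] != y[i]
    return wrong / len(X)

def randomization_errors(X, y, error_fn, rounds=10, seed=0):
    """Shuffle the labels `rounds` times and record the error rate each time."""
    rng = random.Random(seed)  # fixed seed so the runs are repeatable
    errors = []
    for _ in range(rounds):
        shuffled = y[:]        # copy, so the original labels stay intact
        rng.shuffle(shuffled)  # same class counts, random assignment
        errors.append(error_fn(X, shuffled))
    return errors

# Made-up expression values for two hypothetical genes.
X = [[120, 30], [150, 42], [135, 55], [900, 38], [870, 47], [950, 51]]
y = ["Failure", "Failure", "Failure", "Success", "Success", "Success"]

original = loo_1nn_error(X, y)
shuffled_errors = randomization_errors(X, y, loo_1nn_error)
as_good = sum(e <= original for e in shuffled_errors)

print("error on original labels:", original)
print("errors on shuffled labels:", shuffled_errors)
print("shuffled runs at least as good as the original:", as_good, "of", len(shuffled_errors))
```

If the shuffled labels often give an error rate as low as the original one, the original tree is unlikely to reflect a real relationship. With only 10 rounds this is a rough check rather than a precise significance level.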