Start with the genes-leukemia.csv dataset used in Assignment 2 (see the Dataset directory).
As the target (class) field, use TREATMENT_RESPONSE, which has the values Success, Failure, or "?" (missing).
Step 1. Examine the records where TREATMENT_RESPONSE is non-missing.
Q1: How many such records are there?
Q2: Can you describe these records using other sample fields (e.g. Year from XXXX to YYYY, or Gender = X, etc.)?
Q3: Why is it not correct to build predictive models for TREATMENT_RESPONSE using records where it is missing?
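The Step 1 filter can be sketched as follows, using only Python's standard library. The rows are invented toy data, and the Year and Gender columns are assumed from the examples in Q2; only SNUM and TREATMENT_RESPONSE are field names given in this assignment. For the real file, replace the in-memory text with open("genes-leukemia.csv", newline="").

```python
import csv
import io
from collections import Counter

# Invented toy stand-in for genes-leukemia.csv (not real data).
TOY_CSV = """SNUM,Year,Gender,TREATMENT_RESPONSE
1,1990,M,Success
2,1991,F,?
3,1990,M,Failure
4,1992,?,?
5,1990,M,Success
"""

def load_rows(handle):
    """Read a CSV file object into a list of dicts, one per record."""
    return list(csv.DictReader(handle))

def non_missing(rows, field="TREATMENT_RESPONSE", missing="?"):
    """Keep only the records whose target value is known."""
    return [r for r in rows if r[field] != missing]

def describe(rows, fields):
    """Count the distinct values of each listed field."""
    return {f: Counter(r[f] for r in rows) for f in fields}

rows = load_rows(io.StringIO(TOY_CSV))
known = non_missing(rows)
summary = describe(known, ["Year", "Gender"])

print(len(known))  # number of records with a known response
print(summary)     # value counts that help describe those records
```

The value counts are a starting point for Q2: a field that takes a single value across all kept records describes the subset directly.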
Step 2. Select only the records with non-missing TREATMENT_RESPONSE. Keep SNUM (sample number), but remove sample fields that are all the same or missing. Call the reduced dataset genes-reduced.csv.
Q4: Which sample fields should you keep?
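One way to approach Step 2 is sketched below. It reads "all the same or missing" as "at most one distinct non-missing value", which is an assumption about the intended rule. The rows, and the Year, Gender, and Source columns, are invented for illustration. SNUM is kept because it is never passed in as a candidate for removal.

```python
import csv
import io

# Invented toy rows, already filtered to non-missing TREATMENT_RESPONSE.
rows = [
    {"SNUM": "1", "Year": "1990", "Gender": "M", "Source": "?", "TREATMENT_RESPONSE": "Success"},
    {"SNUM": "3", "Year": "1990", "Gender": "F", "Source": "?", "TREATMENT_RESPONSE": "Failure"},
    {"SNUM": "5", "Year": "1990", "Gender": "M", "Source": "?", "TREATMENT_RESPONSE": "Success"},
]

def uninformative(rows, field, missing="?"):
    """True if the field has at most one distinct non-missing value."""
    return len({r[field] for r in rows if r[field] != missing}) <= 1

def reduce_fields(rows, candidates, missing="?"):
    """Drop every candidate field that is constant or entirely missing."""
    drop = [f for f in candidates if uninformative(rows, f, missing)]
    kept = [f for f in rows[0] if f not in drop]
    return [{f: r[f] for f in kept} for r in rows], drop

reduced, dropped = reduce_fields(rows, candidates=["Year", "Gender", "Source"])

# Written to memory here; use open("genes-reduced.csv", "w", newline="") for a file.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(reduced[0]))
writer.writeheader()
writer.writerows(reduced)

print(dropped)
print(out.getvalue())
```

Only the sample fields are passed as candidates, so the gene-expression columns and the target are never considered for removal.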
Step 3. Build a CART model using leave-one-out cross-validation.
Q5: What tree do you get, and what is the expected error rate?
Q6: What are the important variables and their relative importance, according to CART?
Q7: Remove the top predictor and re-run CART. What do you get?
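The leave-one-out procedure in Step 3 can be illustrated with a small standard-library sketch. To stay short, it uses a one-split tree (a decision stump chosen by Gini impurity) as a stand-in learner. This is not a full CART: there is no recursive splitting, pruning, or variable-importance ranking. The data are invented, and all function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(X, y):
    """Return (feature, threshold, left_label, right_label) for the split
    with the lowest weighted Gini impurity."""
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between neighbouring values
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t, majority(left), majority(right))
    if best is None:  # every feature is constant: predict the majority class
        return (0, float("inf"), majority(y), majority(y))
    return best[1:]

def predict(stump, row):
    j, t, left_label, right_label = stump
    return left_label if row[j] <= t else right_label

def loo_error(X, y):
    """Train on all records but one, test on the held-out record, repeat for each."""
    errors = 0
    for i in range(len(X)):
        stump = fit_stump(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        errors += predict(stump, X[i]) != y[i]
    return errors / len(X)

# Invented data: feature 0 separates the classes, feature 1 does not.
X = [[0.1, 5.0], [0.2, 1.0], [0.3, 4.0], [0.4, 2.0],
     [1.1, 3.0], [1.2, 6.0], [1.3, 0.5], [1.4, 2.5]]
y = ["Failure"] * 4 + ["Success"] * 4

top_feature = fit_stump(X, y)[0]
full_error = loo_error(X, y)

# Q7-style re-run: drop the top predictor and estimate the error again.
X_without_top = [[v for j, v in enumerate(row) if j != top_feature] for row in X]
reduced_error = loo_error(X_without_top, y)

print(top_feature, full_error, reduced_error)
```

With n records, leave-one-out fits n models, each on n - 1 records, so the error estimate uses every record for testing exactly once.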
Step 4. Extra credit (10%).
Use Google to search the web for the name of the top gene that predicts the outcome, and briefly report the relevant information you find.
Step 5. Randomization test.
Randomize the TREATMENT_RESPONSE variable 10 times (i.e. randomly shuffle its values across the records) and re-run CART on each randomized class.
Q8: Report the trees and error rates you get.
Q9: Based on the results in Q8, do you think the tree that you found with the original data is significant?
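The randomization test in Step 5 can be sketched as below. The shuffling logic is the point of the example. The learner is passed in as error_fn, and the one used here is a deliberately simple 1-nearest-neighbour rule on a single invented feature, not CART. In practice error_fn would be whatever runs your CART with leave-one-out cross-validation. All names and data are hypothetical.

```python
import random

def loo_error_1nn(xs, ys):
    """Leave-one-out error of a 1-nearest-neighbour rule on one numeric
    feature. Stand-in learner only; replace with a CART run."""
    errors = 0
    for i, x in enumerate(xs):
        others = (k for k in range(len(xs)) if k != i)
        nearest = min(others, key=lambda k: abs(xs[k] - x))
        errors += ys[nearest] != ys[i]
    return errors / len(xs)

def randomization_errors(xs, ys, error_fn, rounds=10, seed=0):
    """Shuffle the class labels `rounds` times and record the error each time.
    The features stay fixed, so any real feature-label link is broken."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    results = []
    for _ in range(rounds):
        shuffled = ys[:]       # copy, so the original labels are untouched
        rng.shuffle(shuffled)
        results.append(error_fn(xs, shuffled))
    return results

# Invented data in which the feature separates the two classes.
xs = [0.1, 0.2, 0.3, 0.4, 1.1, 1.2, 1.3, 1.4]
ys = ["Failure"] * 4 + ["Success"] * 4

observed = loo_error_1nn(xs, ys)
shuffled_errors = randomization_errors(xs, ys, loo_error_1nn)

# How many randomized runs did at least as well as the original labels?
as_good = sum(e <= observed for e in shuffled_errors)
print(observed, shuffled_errors, as_good)
```

If the error on the original labels is clearly lower than every error on the shuffled labels, that is evidence the original tree reflects real structure. With only 10 randomizations the comparison is coarse, so treat it as a rough check.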