Assignment 2: Preparing the data and mining it

A. Take the file genes-leukemia.csv (see the description of the data) and convert it to the Weka file genes-a.arff.
You can convert the file either with a text editor such as emacs (the brute-force way) or with a Weka command that converts a .csv file to .arff (the smart way).
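To make clear what the conversion in step A involves, here is a minimal Python sketch of a CSV-to-ARFF converter. It is not Weka's own converter and is no substitute for finding the Weka command. It assumes the first CSV row holds the field names, treats a column as numeric when every non-missing value parses as a number, and treats every other column as nominal. The function names and the toy rows are invented for illustration; only CLASS and Source are field names taken from this assignment.

```python
import csv
import io


def is_number(text):
    """Return True if text parses as a float (used to guess numeric columns)."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def arff_quote(text):
    """Quote a name or value that contains spaces or ARFF punctuation."""
    if any(ch in text for ch in " ,{}%'\"\t"):
        return "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"
    return text


def csv_to_arff(csv_text, relation):
    """Convert CSV text (header row first) into ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + arff_quote(relation), ""]
    for col, name in enumerate(header):
        # Empty fields and "?" are treated as missing values.
        values = [row[col] for row in data if row[col] not in ("", "?")]
        if all(is_number(v) for v in values):
            kind = "numeric"
        else:
            # Nominal: list distinct values in order of first appearance.
            distinct = list(dict.fromkeys(values))
            kind = "{" + ",".join(arff_quote(v) for v in distinct) + "}"
        lines.append("@attribute " + arff_quote(name) + " " + kind)
    lines += ["", "@data"]
    for row in data:
        lines.append(",".join("?" if v in ("", "?") else arff_quote(v) for v in row))
    return "\n".join(lines) + "\n"


# Toy input invented for illustration; the real file has many more columns.
sample = "ID,Source,G1,CLASS\ns1,A,12.5,ALL\ns2,B,-3,AML\n"
print(csv_to_arff(sample, "genes-a"))
```

For the real file you would read genes-leukemia.csv into a string and write the result to genes-a.arff. Check the generated @attribute lines by hand, since the numeric/nominal guess is only a heuristic.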
B. The target field is CLASS. Run J48 on genes-leukemia with the "Use training set" test option.
C. Use genes-leukemia.arff to create two subsets:
genes-leukemia-train.arff, with the first 38 samples (s1 ... s38) of the data
genes-leukemia-test.arff, with the remaining 34 samples (s39 ... s72).
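The split in step C can be done in a text editor, but a short script makes it repeatable. The following Python sketch copies the ARFF header into both outputs and divides the data rows at a given position. It assumes one instance per line after @data and that the rows are already in sample order (s1 first); the function name split_arff and the toy data are hypothetical.

```python
def split_arff(arff_text, n_train):
    """Return (train_text, test_text); both keep the full ARFF header."""
    lines = arff_text.splitlines()
    # Locate the @data marker (ARFF keywords are case-insensitive).
    data_at = next(i for i, line in enumerate(lines)
                   if line.strip().lower() == "@data")
    header = lines[:data_at + 1]
    # Keep only instance rows: skip blank lines and % comments.
    rows = [line for line in lines[data_at + 1:]
            if line.strip() and not line.lstrip().startswith("%")]
    train = "\n".join(header + rows[:n_train]) + "\n"
    test = "\n".join(header + rows[n_train:]) + "\n"
    return train, test


# Toy ARFF invented for illustration: 5 rows split 3 / 2.
toy = (
    "@relation toy\n"
    "@attribute x numeric\n"
    "@attribute CLASS {a,b}\n"
    "@data\n"
    "1,a\n2,b\n3,a\n4,b\n5,a\n"
)
train_text, test_text = split_arff(toy, 3)
print(train_text)
print(test_text)
```

For this assignment you would call split_arff with n_train=38 on the contents of genes-leukemia.arff and save the two results as genes-leukemia-train.arff and genes-leukemia-test.arff.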
D. Train J48 on genes-leukemia-train.arff and specify "Use training set" as the test option.
What decision tree do you get? What is its accuracy?
E. Now specify genes-leukemia-test.arff as the test set.
What decision tree do you get, and how does its accuracy compare to the one in the previous question?
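When comparing the results of steps D and E, it helps to remember that accuracy is simply the fraction of instances whose predicted class equals the actual class. The Python sketch below computes it from two label lists; the function name and the labels are made up for illustration and are not results from this data.

```python
def accuracy(actual, predicted):
    """Fraction of positions where the predicted label equals the actual one."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Made-up labels: 3 of 4 predictions match.
print(accuracy(["a", "b", "a", "a"], ["a", "b", "b", "a"]))  # prints 0.75
```

Evaluating on the training set and on a separate test set uses the same formula; only the set of instances being scored changes.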
F. Now remove the field "Source" from the data (uncheck the box next to Source and click Apply Filter in the top menu)
and repeat steps D and E.
What do you observe? Does the accuracy on the test set improve, and if so, why do you think it does?
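The field removal in step F can also be reproduced outside the Weka GUI. The Python sketch below drops one attribute from ARFF text by deleting its @attribute line and the matching column in every data row. It assumes unquoted attribute names, one instance per line, and no commas inside values; the function name remove_attribute and the toy data are hypothetical.

```python
def remove_attribute(arff_text, name):
    """Return ARFF text without the attribute called name.

    Raises ValueError if no attribute with that name was declared.
    """
    out, names, drop, in_data = [], [], None, False
    for line in arff_text.splitlines():
        stripped = line.strip()
        if in_data:
            if stripped and not stripped.startswith("%"):
                # Delete the column that belonged to the removed attribute.
                fields = stripped.split(",")
                del fields[drop]
                line = ",".join(fields)
            out.append(line)
        elif stripped.lower().startswith("@attribute"):
            attr = stripped.split()[1]
            names.append(attr)
            if attr != name:
                out.append(line)
        else:
            if stripped.lower() == "@data":
                drop = names.index(name)  # column position to delete
                in_data = True
            out.append(line)
    return "\n".join(out) + "\n"


# Toy ARFF invented for illustration.
toy = (
    "@relation toy\n"
    "@attribute Source {x,y}\n"
    "@attribute g numeric\n"
    "@attribute CLASS {a,b}\n"
    "@data\n"
    "x,1.0,a\n"
    "y,2.0,b\n"
)
print(remove_attribute(toy, "Source"))
```

Apply the same removal to both the training and the test file so that their attribute lists stay identical.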
G. Extra credit: Which classifier gives the highest accuracy on the test set?