- A. Take the file genes-leukemia.csv
(see the accompanying description of the data) and convert it to the Weka file genes-a.arff.
You can convert the file either by hand in a text editor such as emacs (the brute-force way) or with a Weka command that converts a .csv file to .arff (the smart way).
- B. The target field is CLASS. Run J48 on genes-leukemia with the "Use training set" test option.
- C. Use genes-leukemia.arff to create two subsets:
genes-leukemia-train.arff, with the first 38 samples (s1 ... s38) of the data
genes-leukemia-test.arff, with the remaining 34 samples (s39 ... s72).
- D. Train J48 on genes-leukemia-train.arff and specify
"Use training set" as the test option.
What decision tree do you get? What is its accuracy?
- E. Now specify genes-leukemia-test.arff as the test set.
What decision tree do you get, and how does its accuracy compare to the one in the previous question?
- F. Now remove the field "Source" from the data given to the classifier
(uncheck the box next to Source and click Apply Filter in the top menu)
and repeat steps D and E.
What do you observe? Does the accuracy on the test set improve, and if so, why do you think it does?
- G. Extra credit: which classifier gives the highest accuracy on the test set?
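The data preparation in steps A, C and F can also be scripted outside Weka. The following is a minimal Python sketch, not the Weka converter: it assumes the CSV has a header row, the same number of cells in every row, one sample per line in the order s1 ... s72, and "?" or an empty cell for a missing value. The function names `csv_to_arff` and `split_arff` and the `skip` parameter are made up for this illustration.

```python
import csv
import io

_MISSING = ("", "?")  # assumption: how missing cells appear in the CSV


def _is_number(value):
    """Return True if value parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def _quote(token):
    """Single-quote a name or value that is empty or has ARFF-special characters."""
    if token and not any(ch in token for ch in " ,{}%'\"\t"):
        return token
    return "'" + token.replace("\\", "\\\\").replace("'", "\\'") + "'"


def csv_to_arff(csv_text, relation, skip=()):
    """Convert CSV text (first row = column names) to ARFF text.

    Columns named in `skip` are left out, e.g. skip=("Source",) for step F.
    A column is numeric if every non-missing cell parses as a number,
    otherwise it becomes a nominal attribute listing its distinct values.
    """
    rows = [r for r in csv.reader(io.StringIO(csv_text)) if r]
    header, data = rows[0], rows[1:]
    keep = [i for i, name in enumerate(header) if name not in skip]
    lines = ["@relation " + _quote(relation), ""]
    for i in keep:
        values = [r[i] for r in data if r[i] not in _MISSING]
        if all(_is_number(v) for v in values):
            kind = "numeric"
        else:
            kind = "{" + ",".join(_quote(v) for v in sorted(set(values))) + "}"
        lines.append("@attribute %s %s" % (_quote(header[i]), kind))
    lines += ["", "@data"]
    for r in data:
        cells = ("?" if r[i] in _MISSING else _quote(r[i]) for i in keep)
        lines.append(",".join(cells))
    return "\n".join(lines) + "\n"


def split_arff(arff_text, n_train):
    """Split ARFF text into (train, test): the first n_train rows, then the rest.

    Both parts get a copy of the full header, so nominal attributes declare
    the same values in the training and test files.
    """
    lines = arff_text.splitlines()
    # ARFF keywords are case-insensitive, so match @data loosely.
    start = next(i for i, ln in enumerate(lines)
                 if ln.strip().lower() == "@data") + 1
    header = lines[:start]
    rows = [ln for ln in lines[start:]
            if ln.strip() and not ln.lstrip().startswith("%")]
    train = "\n".join(header + rows[:n_train]) + "\n"
    test = "\n".join(header + rows[n_train:]) + "\n"
    return train, test
```

To use it, read genes-leukemia.csv into a string, pass it to `csv_to_arff`, and call `split_arff(arff_text, 38)` to obtain the contents of genes-leukemia-train.arff and genes-leukemia-test.arff. Inferring nominal values from the full data set before splitting keeps the two headers identical, which Weka expects when a separate test set is supplied.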
Assignment 2: Preparing the data and mining it