Assignment 2: Preparing the data and mining it

  • A. Take the file genes-leukemia.csv (here is the description of the data) and convert it to the Weka file genes-a.arff.
    You can convert the file either by using a text editor such as emacs (the brute-force way) or by finding a Weka command that converts a .csv file to .arff (the smart way).
  • B. The target field is CLASS. Run J48 on genes-leukemia with the "Use training set" test option.
  • C. Use genes-leukemia.arff to create two subsets:
    genes-leukemia-train.arff, with the first 38 samples (s1 ... s38) of the data
    genes-leukemia-test.arff, with the remaining 34 samples (s39 ... s72).
  • D. Train J48 on genes-leukemia-train.arff and specify "Use training set" as the test option.
    What decision tree do you get? What is its accuracy?
  • E. Now specify genes-leukemia-test.arff as the test set.
    What decision tree do you get, and how does its accuracy compare to the one from step D?
  • F. Now remove the field "Source" from the data given to the classifier (uncheck the box next to Source, then click Apply Filter in the top menu)
    and repeat steps D and E.
    What do you observe? Does the accuracy on the test set improve, and if so, why do you think it does?
  • G. Extra credit: Which classifier gives the highest accuracy on the test set?
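
For steps A and C, the conversion and the 38/34 split can also be scripted outside Weka. The following is only a minimal sketch in plain Python, not the Weka converter: it assumes the CSV has a single header row, marks missing values with "?" or an empty cell, and that a column is numeric exactly when every non-missing value parses as a number (everything else becomes a nominal attribute). The helper names csv_to_arff and split_arff, the gene column G1, and the three-row sample are hypothetical and are not taken from genes-leukemia.csv.

```python
import csv
import io


def _is_number(text):
    """Return True if text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def _quote(text):
    """Single-quote a name or value containing characters special in ARFF."""
    if text == "" or any(c in text for c in " ,{}%'\""):
        return "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"
    return text


def csv_to_arff(csv_text, relation):
    """Convert CSV text (one header row) to ARFF text.

    Assumption: a column is numeric if all non-missing values parse as
    floats; otherwise it is nominal over the distinct values seen.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + _quote(relation), ""]
    for col, name in enumerate(header):
        values = [row[col] for row in data if row[col] not in ("", "?")]
        if values and all(_is_number(v) for v in values):
            kind = "numeric"
        else:
            kind = "{" + ",".join(_quote(v) for v in sorted(set(values))) + "}"
        lines.append("@attribute " + _quote(name) + " " + kind)
    lines += ["", "@data"]
    for row in data:
        lines.append(",".join("?" if v in ("", "?") else _quote(v) for v in row))
    return "\n".join(lines) + "\n"


def split_arff(arff_text, n_train):
    """Split ARFF text into (train, test); the first n_train rows go to train.

    Both parts reuse the full header, so their attribute declarations match.
    """
    lines = arff_text.splitlines()
    start = next(i for i, line in enumerate(lines)
                 if line.strip().lower() == "@data") + 1
    header = lines[:start]
    rows = [line for line in lines[start:]
            if line.strip() and not line.lstrip().startswith("%")]
    train = "\n".join(header + rows[:n_train]) + "\n"
    test = "\n".join(header + rows[n_train:]) + "\n"
    return train, test


# Hypothetical three-sample data set, for illustration only.
sample = "ID,Source,G1,CLASS\ns1,A,1.5,ALL\ns2,B,-2,AML\ns3,A,?,ALL\n"
arff = csv_to_arff(sample, "genes-demo")
train, test = split_arff(arff, 2)
print(arff)
```

With the real data, one would read genes-leukemia.csv into a string, call split_arff with n_train=38, and write the results to the .arff files named in steps A and C. Reusing the complete header in both subsets is deliberate: Weka expects the training and test files to declare the same attributes, including the same nominal values, even if some values never occur in one subset.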