KDnuggets Data Mining Community's Top Resource since 1997
for Data Mining and Analytics Software, Jobs, Consulting, Courses, and more
 

Assignment 2: Preparing the data and mining it

  • A. Take the file genes-leukemia.csv (here is the description of the data) and convert it to the Weka file genes-a.arff.
    You can convert the file either with a text editor such as emacs (the brute-force way) or by finding a Weka command that converts a .csv file to .arff (the smart way).
  • B. The target field is CLASS. Run J48 on genes-leukemia with the "Use training set" test option.
  • C. Use genes-leukemia.arff to create two subsets:
    genes-leukemia-train.arff, with the first 38 samples (s1 ... s38) of the data
    genes-leukemia-test.arff, with the remaining 34 samples (s39 ... s72).
  • D. Train J48 on genes-leukemia-train.arff and specify "Use training set" as the test option.
    What decision tree do you get? What is its accuracy?
  • E. Now specify genes-leukemia-test.arff as the test set.
    What decision tree do you get, and how does its accuracy compare to the one in the previous step?
  • F. Now remove the field "Source" from the classifier's input (uncheck the box next to Source and click Apply Filter in the top menu)
    and repeat steps D and E.
    What do you observe? Does the accuracy on the test set improve, and if so, why do you think it does?

  • G. Extra credit: which classifier gives the highest accuracy on the test set?
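
Steps A and C can also be scripted rather than done by hand. The following is a minimal Python sketch, not the Weka converter itself. It assumes the CSV has a header row, that names and values contain no spaces or commas, that missing values are written as "?", and that the rows are already in sample order. The function names csv_to_arff and split_arff and the toy data are made up for illustration.

```python
import csv
import io


def _is_number(text):
    """Return True if text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation, nominal=("CLASS",)):
    """Build ARFF header lines and data lines from CSV text with a header row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    names, records = rows[0], rows[1:]
    header = [f"@relation {relation}", ""]
    for i, name in enumerate(names):
        # Assumption: "?" marks a missing value and is ignored when typing a column.
        values = [r[i] for r in records if r[i] != "?"]
        if name in nominal or not all(_is_number(v) for v in values):
            # Nominal attribute: distinct values in order of first appearance.
            distinct = list(dict.fromkeys(values))
            header.append(f"@attribute {name} {{{','.join(distinct)}}}")
        else:
            header.append(f"@attribute {name} numeric")
    header += ["", "@data"]
    data = [",".join(r) for r in records]
    return header, data


def split_arff(header, data, n_train):
    """Return (train, test) ARFF text; both reuse the same header."""
    train = "\n".join(header + data[:n_train]) + "\n"
    test = "\n".join(header + data[n_train:]) + "\n"
    return train, test


# Toy data (made up); the real file would be read from genes-leukemia.csv.
sample = "SNUM,CLASS,Source,G1\n1,ALL,lab1,-214\n2,AML,lab2,88\n3,ALL,lab1,15\n"
header, data = csv_to_arff(sample, "genes-toy")
train, test = split_arff(header, data, 2)
print(train)
```

For the assignment itself, n_train would be 38, so that the first 38 rows go to the training file and the remaining 34 to the test file. Both outputs share one header built from the full data, so the nominal value lists match between the training and test files.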

Copyright © 2009 KDnuggets.