Web Mining Course: Assignment 5 - Final Project

This assignment builds on previous work and also separates bot visits from human visits. You should build upon the visit parsing program developed for Assignment 4 as the starting point for your work.

Refine the program to use the combination of IP address and user agent as the visit identifier. This lets you split hits that come from the same IP address but with different user agents into separate visits.
In this dataset there are over 200 such IP addresses, including some that use both human and bot user agents.
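The (IP, user agent) visit key can be sketched as follows. This is a minimal illustration, not the Assignment 4 program: it assumes the log uses the Apache Combined Log Format, and the names `LOG_RE`, `parse_line`, and `group_hits` are hypothetical. It also ignores any time-based splitting of visits that your existing program may already do.

```python
import re
from collections import defaultdict

# Assumption: Combined Log Format. Adjust the pattern if the course log differs.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

def group_hits(lines):
    """Group parsed hits by the (ip, user agent) pair."""
    visits = defaultdict(list)
    for line in lines:
        hit = parse_line(line)
        if hit is None:
            continue  # skip malformed lines
        visits[(hit["ip"], hit["agent"])].append(hit)
    return visits
```

With this key, two hits from the same IP address that carry different user agent strings land in two different entries of `visits`.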

Apply the program to the full one-day log file da-11-16.ipntld.log (extract it from the kdlog.zip archive).

A. Modify the visit parsing program to compute, for each visit, based on successful GETs,

  • number of requests of primary HTML pages (assume that these are files ending in .html, .htm, and / )
  • number of requests of component files (assume that these are files ending in .css, .gif, .jpg, .js, and .ico )
  • number of requests for the robots.txt file
Also print the totals for each of the above across all visits.
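The three counters above can be sketched as follows. This is one possible approach, not a required design. It assumes that a "successful GET" means method GET with status 200, that query strings should be ignored, and that matching is case-insensitive. The names `request_kind` and `count_kinds` are hypothetical.

```python
from collections import Counter

PRIMARY_SUFFIXES = (".html", ".htm", "/")
COMPONENT_SUFFIXES = (".css", ".gif", ".jpg", ".js", ".ico")

def request_kind(path):
    """Classify a request path as 'robots', 'primary', 'component', or 'other'."""
    # Assumption: drop the query string and compare in lower case.
    path = path.split("?", 1)[0].lower()
    if path.endswith("robots.txt"):
        return "robots"
    if path.endswith(PRIMARY_SUFFIXES):
        return "primary"
    if path.endswith(COMPONENT_SUFFIXES):
        return "component"
    return "other"

def count_kinds(requests):
    """Count request kinds over an iterable of (method, path, status) tuples.

    Only successful GETs (assumed here to mean status 200) are counted.
    """
    counts = Counter()
    for method, path, status in requests:
        if method == "GET" and status == 200:
            counts[request_kind(path)] += 1
    return counts
```

Because `Counter` objects can be added together, the overall totals can be obtained by summing the per-visit counters.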

A1: What are the total numbers of visits, hits, OK GETs, 404 requests, primary HTML requests, component requests, and robots.txt requests for d100.log?

A2: What are the same totals for da-11-16.ipntld.log?

B. Add code to classify each visit as bot or human.
Classify the visit as a bot visit if

  • User agent is a recognizable bot
  • or visit included a request for robots.txt file
  • or visit had 0 component requests (this rule is usually true, but there are some exceptions, which you can ignore for this project).
See lecture 3a for discussion on bot user agents. It is enough to include most common bot user agents, which you can determine by analysing the log file.
Be sure to classify user agents containing libwww or Java/ as bots, so that they are excluded from the human visits.
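The three rules can be sketched as a function that returns every rule a visit matches, which also makes the per-rule counts easy to collect. The `BOT_MARKERS` list is an assumption for illustration only: the actual list should come from your own analysis of the log. The sketch reads the libwww and Java/ instruction as "treat these agents as bots". The names `bot_reasons` and `is_bot` are hypothetical, and `counts` is assumed to be a per-visit mapping with "robots" and "component" entries.

```python
# Assumption: illustrative substrings only; build the real list from the log.
BOT_MARKERS = ("bot", "crawler", "spider", "slurp", "libwww", "java/")

def bot_reasons(agent, counts):
    """Return the list of rules that mark this visit as a bot visit.

    agent:  user agent string of the visit
    counts: mapping with per-visit 'robots' and 'component' request counts
    """
    reasons = []
    ua = agent.lower()
    if any(marker in ua for marker in BOT_MARKERS):
        reasons.append("agent")          # rule 1: recognizable bot agent
    if counts.get("robots", 0) > 0:
        reasons.append("robots")         # rule 2: requested robots.txt
    if counts.get("component", 0) == 0:
        reasons.append("no_components")  # rule 3: no component requests
    return reasons

def is_bot(agent, counts):
    """A visit is a bot visit if at least one rule matches."""
    return bool(bot_reasons(agent, counts))
```

Since a visit can match more than one rule, the per-rule counts may add up to more than the total number of bot visits.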

B1: How many bot visits and how many human visits did you find? Also count how many bot visits fell under each rule above.

B2: Extra credit: what additional rules can you use to classify a visit as a bot visit?

C. Save human visits in da-11-16.human.log, and recompute stats from Assignment 3 for the human visits.
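Saving the human visits might look like the sketch below. It assumes you kept the raw log lines for each visit in a dict keyed by the visit identifier and collected the keys of bot visits in a set. Both layouts, and the name `save_human_visits`, are hypothetical.

```python
def save_human_visits(visits, bot_keys, out_path):
    """Write the raw log lines of every non-bot visit to out_path.

    visits:   dict mapping a visit key to a list of raw log lines (assumed layout)
    bot_keys: set of visit keys already classified as bot visits
    Returns the number of lines written.
    """
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for key, lines in visits.items():
            if key in bot_keys:
                continue  # skip bot visits
            for line in lines:
                out.write(line.rstrip("\n") + "\n")
                written += 1
    return written
```

Note that this writes lines grouped by visit rather than in their original order. If your Assignment 3 statistics depend on chronological order, an alternative is to make a second pass over the original file and copy each line whose visit key is not in `bot_keys`.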

Note: First, recompute the Top 20 IP addresses by hits, listing the user agent for each. If you see a recognizable bot user agent in that list, go back and adjust your program.

What interesting observations can you make?
