KDnuggets : Web Mining Course : Assignment 5 - Final Project

Web Mining Course: Assignment 5 - Final Project

This assignment builds on previous work and adds the separation of bot visits from human visits. Use the visit parsing program developed for Assignment 4 as the starting point for your work.

Refine the program to use the combination of IP address and user agent as the visit identifier. This lets you separate hits from the same IP address but with different user agents into separate visits.
In this dataset there are over 200 such IP addresses, including some that use both human and bot user agents.
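The composite visit identifier can be sketched as follows. This is a minimal illustration, assuming the log uses the Apache Combined Log Format (the actual layout of the course logs may differ) and ignoring any session time-out logic from Assignment 4. The names `LOG_PATTERN`, `parse_line`, and `group_hits` are hypothetical, not part of the course code.

```python
import re
from collections import defaultdict

# Assumption: Apache Combined Log Format. Adjust the pattern if the
# course log files use a different field layout.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

def group_hits(lines):
    """Group parsed hits by the (ip, user agent) pair."""
    visits = defaultdict(list)
    for line in lines:
        hit = parse_line(line)
        if hit is None:
            continue  # skip malformed lines rather than guessing at fields
        visits[(hit["ip"], hit["agent"])].append(hit)
    return visits
```

With this key, two hits from the same IP address land in different visits whenever their user agent strings differ, which is what separates a bot and a human sharing one address.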

Apply the program to the full one-day log file da-11-16.ipntld.log (extract it from the kdlog.zip archive).

A. Modify the visit parsing program to compute, for each visit, based on successful GETs,

Also print the totals for the above.

A1: What are the total numbers of visits, hits, OK Gets, 404 requests, primary HTML requests, components, and robots.txt requests for d100.log?

A2: What are the same totals for da-11-16.ipntld.log?
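One possible way to accumulate the per-visit counts and totals asked for in A1 and A2 is sketched below. Several choices here are assumptions rather than course definitions: a "successful GET" is taken to mean a GET with status 200, a "primary HTML request" is a successful GET whose path ends in `.html`, `.htm`, or `/`, and every other successful GET (except robots.txt) is a "component". Each hit is assumed to be a dict with `request` and `status` string fields; the function names are hypothetical.

```python
FIELDS = ("hits", "ok_gets", "not_found", "primary_html",
          "components", "robots_txt")

# Assumption: primary pages are recognized by their path suffix.
HTML_SUFFIXES = (".html", ".htm", "/")

def split_request(request):
    """Split 'GET /path HTTP/1.1' into (method, path); ('', '') if malformed."""
    parts = request.split()
    if len(parts) < 2:
        return "", ""
    return parts[0], parts[1]

def visit_stats(hits):
    """Count hit categories for one visit (a list of hit dicts)."""
    stats = dict.fromkeys(FIELDS, 0)
    for hit in hits:
        method, path = split_request(hit["request"])
        path = path.split("?", 1)[0].lower()  # drop any query string
        stats["hits"] += 1
        if hit["status"] == "404":
            stats["not_found"] += 1
        if path == "/robots.txt":
            stats["robots_txt"] += 1
        if method == "GET" and hit["status"] == "200":
            stats["ok_gets"] += 1
            if path.endswith(HTML_SUFFIXES):
                stats["primary_html"] += 1
            elif path != "/robots.txt":
                stats["components"] += 1
    return stats

def totals(visits):
    """Sum the per-visit stats over a dict mapping visit key -> hits."""
    total = dict.fromkeys(FIELDS, 0)
    total["visits"] = len(visits)
    for hits in visits.values():
        for name, value in visit_stats(hits).items():
            total[name] += value
    return total
```

Running `totals` once on the visits from d100.log and once on those from da-11-16.ipntld.log would give the two sets of numbers; if the course defines primary requests or components differently, only `visit_stats` needs to change.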

B. Add code to classify each visit as bot or human.
Classify the visit as a bot visit if

See lecture 3a for discussion on bot user agents. It is enough to include most common bot user agents, which you can determine by analysing the log file.
Be sure to exclude user agents with libwww and Java/ in their strings.
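A minimal sketch of such a classifier follows. It assumes two illustrative rules, which are not necessarily the rules intended by the assignment: the user agent contains a known bot marker, or the visit requested robots.txt. It also reads "exclude" above as "do not count as human", so libwww and Java/ agents are treated as bots. The marker list and function names are hypothetical; the list should be extended from what the log file actually contains.

```python
from collections import Counter

# Assumption: a small, illustrative list of lowercase substrings.
BOT_AGENT_MARKERS = ("bot", "crawler", "spider", "slurp", "libwww", "java/")

def bot_rules(agent, hits):
    """Return the names of the (assumed) bot rules that this visit triggers."""
    rules = []
    agent_lc = agent.lower()
    if any(marker in agent_lc for marker in BOT_AGENT_MARKERS):
        rules.append("bot_user_agent")
    # The path is the second token of a request such as 'GET /robots.txt HTTP/1.0'.
    if any(hit["request"].split()[1:2] == ["/robots.txt"] for hit in hits):
        rules.append("requested_robots_txt")
    return rules

def classify(visits):
    """Split visits keyed by (ip, agent) into bots and humans, counting rule matches."""
    bots, humans, rule_counts = {}, {}, Counter()
    for key, hits in visits.items():
        rules = bot_rules(key[1], hits)
        if rules:
            bots[key] = hits
            rule_counts.update(rules)  # a visit may match more than one rule
        else:
            humans[key] = hits
    return bots, humans, rule_counts
```

Because a visit can trigger several rules at once, the per-rule counts may add up to more than the number of bot visits.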

B1: How many bot and human visits did you find? Also count how many bot visits fell under each rule above.

B2: Extra credit: what additional rules can you use to classify a visit as a bot visit?

C. Save human visits in da-11-16.human.log, and recompute stats from Assignment 3 for the human visits.

Note: First, recompute the Top 20 IP addresses by hits, including the user agent. If you see a recognizable bot user agent among them, go back and adjust your program.
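The check described in the note, together with saving the human visits, could be sketched as below. This assumes the human visits are held as a dict mapping an (ip, user agent) pair to the list of raw log lines for that visit; `top_visitors` and `save_visits` are hypothetical helper names.

```python
def top_visitors(visits, n=20):
    """Return up to n (ip, agent, hit_count) tuples, ordered by hits descending."""
    ranked = sorted(visits.items(), key=lambda item: len(item[1]), reverse=True)
    return [(ip, agent, len(lines)) for (ip, agent), lines in ranked[:n]]

def save_visits(visits, path):
    """Write the raw log lines of all visits to path, one visit after another."""
    with open(path, "w", encoding="utf-8") as out:
        for lines in visits.values():
            for line in lines:
                out.write(line.rstrip("\n") + "\n")
```

Printing `top_visitors(humans)` and scanning the user agent column makes leftover bots easy to spot before calling `save_visits(humans, "da-11-16.human.log")`. Note that this writes lines grouped by visit rather than in the original chronological order; sort by timestamp first if the Assignment 3 statistics depend on ordering.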

What interesting observations can you make?

