Refine the program to use the combination of IP address and user agent as the visit identifier. This separates hits that come from the same IP address but use different user agents into separate visits.
In this dataset there are over 200 such IP addresses, including some that use
both human and bot user agents.
Apply the refined program to the full one-day log file da-11-16.ipntld.log (extract it from the kdlog.zip archive).
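As a starting point, the (IP, user agent) visit key can be sketched as below. This is a minimal sketch, not the required solution. It assumes the log lines follow the Apache combined log format, and the names LOG_RE, parse_line and group_hits are made up for illustration. If your existing program also splits visits on a time gap, that logic is not shown here and would still have to be applied within each key.

```python
import re
from collections import defaultdict

# Assumption: each line is in Apache combined log format, e.g.
# ip - - [time] "METHOD url PROTOCOL" status size "referrer" "agent"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

def group_hits(lines):
    """Group parsed hits under the key (ip, user agent)."""
    visits = defaultdict(list)
    for line in lines:
        hit = parse_line(line)
        if hit is None:
            continue  # skip lines that do not match the assumed format
        visits[(hit["ip"], hit["agent"])].append(hit)
    return visits
```

With this key, two hits from the same IP address end up in different visits whenever their user agent strings differ.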
A. Modify the visit parsing program to compute, for each visit, based on successful GETs,
A1: What are the total numbers of visits, hits, OK GETs, 404 requests, primary HTML requests, components, and robots.txt requests for d100.log?
A2: What are the same totals for da-11-16.ipntld.log?
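One possible way to accumulate the totals asked for in A1 and A2 is sketched below. The definitions are assumptions, not the course's official ones: an OK GET is taken to be a GET with status 200, a primary HTML request is an OK GET whose path ends in .html, .htm or /, and a component is an OK GET whose path ends in one of the extensions in COMPONENT_EXTS. Adjust these to match the definitions used in class. The input is assumed to be a dict that maps each visit key to a list of hit dicts with the keys "method", "url" and "status".

```python
from collections import Counter
from urllib.parse import urlsplit

# Assumption: these extensions mark page components (images, styles, scripts).
COMPONENT_EXTS = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")

def classify_hit(hit):
    """Return the set of counter names that this hit contributes to."""
    labels = {"hits"}
    path = urlsplit(hit["url"]).path.lower()  # drop any query string
    if hit["status"] == "404":
        labels.add("404")
    if path == "/robots.txt":
        labels.add("robots.txt")
    # Assumption: "OK GET" means method GET with status 200.
    if hit["method"] == "GET" and hit["status"] == "200":
        labels.add("ok_gets")
        if path.endswith(COMPONENT_EXTS):
            labels.add("components")
        elif path.endswith((".html", ".htm", "/")):
            labels.add("primary_html")
    return labels

def totals(visits):
    """Sum the counters over all visits; visits maps a key to a list of hits."""
    counts = Counter(visits=len(visits))
    for hits in visits.values():
        for hit in hits:
            counts.update(classify_hit(hit))
    return counts
```

Running totals() once on d100.log and once on da-11-16.ipntld.log gives the two sets of numbers to report.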
B. Add code to classify each visit as bot or human.
Classify the visit as a bot visit if
B1: How many bot and human visits did you find? Also count how many bot visits fell under each rule above.
B2: Extra credit: what additional rules can you use to classify a visit as a bot visit?
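The per-rule counting in B1 can be organized as a table of named rules, as in the sketch below. The two rules shown are hypothetical placeholders chosen only to make the example run. They are not the assignment's rule list, which you should substitute. The sketch assumes visits is a dict keyed by (ip, agent) whose values are lists of hit dicts with a "url" key, and the names RULES and classify_visits are made up for illustration.

```python
from collections import Counter

# Hypothetical placeholder rules; replace them with the assignment's own rules.
# Each rule takes (ip, agent, hits) and returns True if the visit looks like a bot.
RULES = {
    "requested_robots_txt": lambda ip, agent, hits: any(
        h["url"].startswith("/robots.txt") for h in hits
    ),
    "agent_names_a_bot": lambda ip, agent, hits: any(
        word in agent.lower() for word in ("bot", "crawler", "spider")
    ),
}

def classify_visits(visits):
    """Label each visit 'bot' or 'human' and count the matches per rule."""
    rule_counts = Counter()
    labels = {}
    for (ip, agent), hits in visits.items():
        matched = [name for name, rule in RULES.items() if rule(ip, agent, hits)]
        rule_counts.update(matched)  # a visit may be counted under several rules
        labels[(ip, agent)] = "bot" if matched else "human"
    return labels, rule_counts
```

Because one visit can match more than one rule, the per-rule counts may add up to more than the number of bot visits.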
C. Save human visits in da-11-16.human.log, and recompute stats from Assignment 3 for the human visits.
Note: First, recompute the Top 20 IP addresses by hits, including the user agent. If you see a recognizable bot user agent there, go back and adjust your program.
What interesting observations can you make?
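The Top 20 check from the note above can be sketched as follows, assuming the human visits are held in a dict keyed by (ip, agent) with a list of hits as each value. The function name top_visitors is made up for illustration.

```python
from collections import Counter

def top_visitors(visits, n=20):
    """Return the n (ip, agent) keys with the most hits, as (key, hit_count) pairs."""
    hit_counts = Counter({key: len(hits) for key, hits in visits.items()})
    return hit_counts.most_common(n)

def print_top_visitors(visits, n=20):
    """Print one line per top visitor so bot-like user agents are easy to spot."""
    for (ip, agent), count in top_visitors(visits, n):
        print(f"{count:8d}  {ip}  {agent}")
```

If a user agent in this list is recognizably a bot, the classification step missed it and needs another rule.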