Web Mining Course: Assignment 4 - Visit level analysis

Download web_log_parse.txt, change its file extension to .pl, and get the script to run.

Modify it to separate the log file into visits. Assume initially that all hits from the same IP address on the same day belong to the same visit. For extra credit, add a parameter that sets the largest interval allowed between primary page requests within one visit, so that a longer gap starts a new visit.
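The grouping rule above can be sketched as follows. This is an illustration in Python rather than the course's Perl script, and it rests on several assumptions: that d100.log is in Apache Combined Log Format, that hits arrive in time order, and that the optional gap is measured between any two consecutive hits from the same IP (a simplification of the "primary page" rule). The names parse_line, split_visits, and max_gap are hypothetical.

```python
import re
from datetime import datetime

# Assumption: Apache Combined Log Format; the referer/agent part is optional.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_line(line):
    """Return a dict describing one hit, or None if the line does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    hit = m.groupdict()
    hit["time"] = datetime.strptime(hit["ts"], "%d/%b/%Y:%H:%M:%S %z")
    return hit

def split_visits(hits, max_gap=None):
    """Group time-ordered hits into visits.

    Baseline rule: same IP and same calendar day means same visit.
    If max_gap (seconds) is given, a longer pause between consecutive
    hits from the same IP also starts a new visit.
    """
    visits = []    # list of visits, each a list of hits
    current = {}   # (ip, date) -> index of that key's open visit
    for hit in hits:
        key = (hit["ip"], hit["time"].date())
        idx = current.get(key)
        if idx is not None and max_gap is not None:
            gap = (hit["time"] - visits[idx][-1]["time"]).total_seconds()
            if gap > max_gap:
                idx = None  # pause too long: close the open visit
        if idx is None:
            visits.append([])
            idx = len(visits) - 1
            current[key] = idx
        visits[idx].append(hit)
    return visits
```

Lines that do not match the assumed format (for example, malformed request strings) come back as None and would need to be counted separately if the totals are to agree with a plain line count.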

Apply it to the d100.log file and compute, for each visit:

  1. total number of hits
  2. number of successful GETs (status code 200 or 304)
  3. number of requests with a 404 (not found) status code
  4. visit start (as HHMMSS)
  5. visit length (in seconds)
  6. visit agent (assume that the user agent is the same throughout the visit and take it from the first request)
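The six per-visit values listed above could be computed along these lines. This is a minimal Python sketch, not the Perl solution; it assumes each hit is a dict with hypothetical keys "method", "status" (an integer), "time" (a datetime), and "agent", and that the hits of a visit are in time order.

```python
from datetime import datetime

def visit_stats(visit):
    """Summarise one visit, given as a non-empty, time-ordered list of hits."""
    first, last = visit[0], visit[-1]
    return {
        "hits": len(visit),
        # Successful GETs: method GET with status 200 or 304.
        "ok_gets": sum(1 for h in visit
                       if h["method"] == "GET" and h["status"] in (200, 304)),
        "not_found": sum(1 for h in visit if h["status"] == 404),
        "start": first["time"].strftime("%H%M%S"),
        # Length is the time from the first hit to the last hit, in seconds.
        "length": int((last["time"] - first["time"]).total_seconds()),
        # The agent is taken from the first request only.
        "agent": first["agent"],
    }

# Made-up sample data, for illustration only.
sample = [
    {"method": "GET", "status": 200, "agent": "Mozilla/4.08",
     "time": datetime(2000, 10, 10, 13, 55, 36)},
    {"method": "GET", "status": 404, "agent": "Mozilla/4.08",
     "time": datetime(2000, 10, 10, 13, 55, 40)},
    {"method": "GET", "status": 304, "agent": "Mozilla/4.08",
     "time": datetime(2000, 10, 10, 13, 56, 40)},
]
stats = visit_stats(sample)
print(stats["hits"], stats["ok_gets"], stats["not_found"],
      stats["start"], stats["length"])   # prints: 3 2 1 135536 64
```

Note that a visit consisting of a single hit has length 0 under this definition.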

Write the visit information to a tab-separated file, d100.log.visits.
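Writing the tab-separated output might look like the sketch below, again in Python for illustration. The column order follows the six per-visit values; the field names, the write_visits helper, and the optional header row are assumptions, since the assignment does not specify them.

```python
import csv

# Hypothetical column names, one per required per-visit value.
FIELDS = ["hits", "ok_gets", "not_found", "start", "length", "agent"]

def write_visits(rows, path, header=False):
    """Write one tab-separated line per visit; rows are dicts keyed by FIELDS."""
    with open(path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        if header:
            writer.writerow(FIELDS)
        for row in rows:
            writer.writerow([row[name] for name in FIELDS])
```

A call such as write_visits(all_stats, "d100.log.visits") would then produce the file, where all_stats is a list of per-visit dicts. The csv module quotes any field that itself contains a tab or a double quote, which matters because user agent strings are free-form text.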

For verification, also print to the screen the total counts, over all visits, of

  • hits,
  • successful GETs,
  • 404 requests.

Verify the total counts obtained from the Perl program using Unix tools.

What interesting observations can you make?

