(wadm IV) Log Analyzer
In the earlier section we looked at how to create a parser by defining a little language that understood what
we were trying to parse. Here we will try to create a little language that we can use to look at the data in a
more comprehensive way.
One of the more interesting things that the Web Server provides is the access log, which tells us who is visiting
our site, how many pages they visited, who is referring to us, and how many (if any) pages have broken
links (i.e. how many 404s we have).
Other interesting information includes the types of UserAgents, the pages that seem to get hit the most, our entry
points (the pages that a unique ip will have accessed first), our exit points (the pages that a unique ip
accessed last) and so on.
The commands that wadm provides for logs:
> info commands *log*
get-log enable-access-log set-log-prop get-access-log get-access-log-prop get-log-prop
set-access-log-buffer-prop rotate-log get-access-log-buffer-prop disable-access-log
While wadm does provide a means for us to get the access log as a whole, these commands are not meant to be
used directly (i.e. they just dump the whole log as one chunk, without any way to filter the information down to
what is relevant for us).
But this is not a very serious handicap as long as we have the scripting ability. In fact, it gives us the freedom
to define whatever interface we would like for accessing the data we are interested in.
Substructure: Access Log from wadm
> get-access-log --config=test agneyam
format=%Ses->client.ip% - %Req->vars.auth-user% [%SYSDATE%] "%Req->reqpb.clf-request%"
%Req->srvhdrs.clf-status% %Req->srvhdrs.content-length%
webcache.sun.com - - [20/Oct/2006:06:27:57 -0700] "GET / HTTP/1.1" 200 355
webcache.sun.com - - [20/Oct/2006:06:27:58 -0700] "GET /favicon.ico HTTP/1.1" 200 3574
webcache.sun.com - - [20/Oct/2006:06:28:03 -0700] "POST /cgi-bin/test-cgi HTTP/1.1" 200 596
varunam.sun.com - - [20/Oct/2006:06:28:22 -0700] "GET / HTTP/1.1" 200 355
varunam.sun.com - - [20/Oct/2006:06:28:22 -0700] "GET /favicon.ico HTTP/1.1" 200 3574
varunam.sun.com - - [20/Oct/2006:06:28:24 -0700] "POST /cgi-bin/test-cgi HTTP/1.1" 200 602
varunam.sun.com - - [20/Oct/2006:06:28:48 -0700] "GET /cgi-bin/test-cgi?myvar=new HTTP/1.1" 200 573
varunam.sun.com - - [20/Oct/2006:06:29:06 -0700] "GET /whoami.html HTTP/1.1" 404 292
hokus-pokus.sun.com - - [20/Oct/2006:23:09:15 +0530] "GET / HTTP/1.1" 200 355
(You can also modify the settings of the access log by using the following CLIs)
> get-access-log-prop --config=test
enabled=true
file=../logs/access
format=%Ses->client.ip% - %Req->vars.auth-user% [%SYSDATE%] "%Req->reqpb.clf-request%"
%Req->srvhdrs.clf-status% %Req->srvhdrs.content-length%
Assuming your new custom format is in the variable $myformat
> set-access-log-prop --config=test format=$myformat
Choosing a superstructure
The access log is a matrix that is arranged chronologically. SQL (Structured Query Language) is generally
useful for mining data out of such matrices, all the more so because users will be able to transfer at least a part
of their familiarity with SQL to our API.
While it is useful to stick close to SQL, I am changing it slightly to make it easier to parse. In general,
this is what I am going to do:
Syntax
select {select list} from {from list} where {where list} group by {id}
The where part and everything after it are optional.
Each keyword is followed by the list associated with it; the list contains the variables that will be
extracted from the logs.
Some Use Cases
# Avoiding this common use case for now, to keep the code a little simpler:
# select {*} from machine1
# The reason for choosing $xxx names is that we will be able to replace them with the corresponding
# values just by evaluating them. This makes it much simpler to construct the result.
select {$ip,$request,$response} from machine1
select {$ip,$request,$response} from {machine1,machine2}
select {$ip,$request,$response} from machine1 where {$response == 404}
select {$ip [:sum $size] } from machine1 where {$response == 404} group by {$ip}
More assumptions and shortcuts (in the interest of simplicity)
While it would be easy to parse the format line from get-access-log-prop,
format=%Ses->client.ip% - %Req->vars.auth-user% [%SYSDATE%] "%Req->reqpb.clf-request%" %Req->srvhdrs.clf-status% %Req->srvhdrs.content-length%
we won't be parsing it. Instead, we will use fixed positions for now, i.e. from the above:
$ip : 0
#dummy field -
$auth : 2
$date : 3
# you can always massage the data to strip the quotes (") from the request, and you will get the $method, $uri and $http from the $request.
$request : 5
$response : 6
$size : 7
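As an illustration of those positions, a hypothetical parse-line helper could bind them like this (the actual db.tcl keeps this logic inside its own helpers, such as subst_in_context; this sketch only shows the idea):
proc parse-line {line} {
    # keep the bracketed date and the quoted request together as single fields,
    # matching the positions listed above
    regexp {^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]+)" (\S+) (\S+)} $line \
        -> ip dummy auth date request response size
    return [list $ip $auth $date $request $response $size]
}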
Implementing the Language
As always it is a good idea to start with a namespace.
db.tcl
namespace eval Logs {
    namespace export select
    proc select args {
        puts ">$args"
    }
}
Remember what we discussed in the last section? args is Tcl's variable-arguments mechanism,
so any Tcl proc that takes args as its parameter will accept any number of arguments. With
that in mind, the proc select will be able to slurp the entire query that we throw at it.
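As a tiny aside, here is that mechanism in isolation (show is just a throwaway name for this demonstration):
> proc show args { puts "[llength $args] arguments: $args" }
> show a b c
3 arguments: a b c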
The namespace export marks for export the procedures that we want others to be able to
use natively, i.e. by running
> namespace import Logs::select
from the global namespace, the user will be able to make use of our language without polluting his
namespace with our local procedures.
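With the stub above in place, a quick check shows the whole query arriving in select as args:
> namespace import Logs::select
> select {$ip $response} from machine1
>{$ip $response} from machine1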
The select statement.
proc select args {
    set select {}
    set from {}
    set where {}
    set group {}
    set group_by {}
    set lastword select
    foreach word $args {
        # look for our delimiters
        switch $word {
            from {
                set lastword from
            }
            where {
                set lastword where
            }
            group {
                set lastword group
            }
            by {
                set lastword group_by
            }
            default {
                lappend $lastword $word
            }
        }
    }
    exec_select [join $select { }] [join $from { }] [join $where { }] [join $group_by { }]
}
What we are doing here is taking the easy way out. We just look for the keywords 'from', 'where',
'group' and 'by' to extract the lists following them, using Tcl's dynamic nature to simply change the
variable being appended to.
By now you would have understood the reason why we went for a {$ip $response} model rather than simply
$ip $response in the select statement. The reason is that if we had gone for the $ip $response model,
Tcl would have evaluated the variables in the caller's context rather than waiting for us to supply
them; the braces {} provide us a convenient bubble that protects them against premature evaluation.
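A small demonstration of that bubble:
> set response 200
200
> puts "$response == 404"
200 == 404
> puts {$response == 404}
$response == 404
The braced form hands us the literal string, which we can evaluate later, once the field variables are bound per log line.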
Check it out with this exec_select stub:
proc exec_select {select from where group} {
    puts "-------------"
    puts "select>"
    puts $select
    puts "from>"
    puts $from
    puts "where>"
    puts $where
    puts "group_by>"
    puts $group
    puts "-------------"
}
> select {$ip -> $response} from agneyam where {$response == 404}
-------------
select>
$ip -> $response
from>
agneyam
where>
$response == 404
group_by>

-------------
Using the $select list as an output formatter, and $where as the condition
Here is what the exec_select would look like.
proc exec_select {select from where group} {
    set collect 0
    # check if we have grouping statements (like :sum) in the select list.
    #
    # if there are, then we use the current select statement as a template for two kinds of
    # statements: one to keep adding each grouped variable to an accumulator, and the other
    # to fetch all the data from the accumulator at the end of the loop and construct the result.
    # if there are none, then we just have one template which contains the variables that will be
    # expanded inline at the end of each iteration.
    # use_select and collect_select are the two templates. use_select sends data to the
    # accumulator and collect_select retrieves it.
    if {[regexp {:[a-zA-Z]+} $select]} then {
        set collect 1
        regsub -nocase -all {:([a-zA-Z]+) } $select {use_\1 [incr use_id]:$id } use_select
        regsub -nocase -all {:([a-zA-Z]+) } $select {collect_\1 [incr use_id]:$id } collect_select
    } else {
        set use_select $select
    }
    # process the from list here.
    # interleave-logs will collect the logs from all the machines supplied and splat them into one list.
    set cmd {[interleave-logs $from]}
    set all_logs [subst $cmd]
    foreach line $all_logs {
        set use_id 0
        set id [subst_in_context $line $group]
        if {[string length $where] > 0} {
            if [expr $where] {
                eval_select $line $use_select
            }
        } else {
            eval_select $line $use_select
        }
    }
    if $collect {
        foreach id [array names Logs::context] {
            set line $Logs::context($id)
            set use_id 0
            puts [subst_in_context $line $collect_select]
            unset Logs::context($id)
        }
    }
}
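To see what the templating step produces, here is the substitution applied by hand to one of our earlier queries:
set select {$ip [:sum $size] }
regsub -nocase -all {:([a-zA-Z]+) } $select {use_\1 [incr use_id]:$id } use_select
puts $use_select
;# prints: $ip [use_sum [incr use_id]:$id $size]
When this template is later run through subst_in_context for a log line, each [use_sum 1:$id $size] call delivers that line's $size to the accumulator for its group.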
interleave-logs is a tiny procedure that mixes together the logs from multiple machines.
proc interleave-logs {args} {
    set fmt [Date::GetFormat]
    Date::SetFormat {\[DD/MMM/YYYY:T24S}
    set logs {}
    foreach machine $args {
        set log [get-log $machine]
        set logs [concat $logs $log]
    }
    set sorted [lsort -command compare-dates $logs]
    Date::SetFormat $fmt
    return $sorted
}
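compare-dates comes from the date.tcl utilities listed at the end of this page. Purely as an illustration (not the actual date.tcl code), a stand-in comparator on a modern Tcl (8.5+) with the built-in clock command might look like:
proc compare-dates {a b} {
    # field 3 of a log line looks like "[20/Oct/2006:06:27:57"; strip the bracket
    set ta [clock scan [string trim [lindex [split $a] 3] \[] -format {%d/%b/%Y:%H:%M:%S}]
    set tb [clock scan [string trim [lindex [split $b] 3] \[] -format {%d/%b/%Y:%H:%M:%S}]
    expr {$ta - $tb}
}
lsort -command expects a negative, zero, or positive result, which the subtraction provides.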
The accumulator is pretty simple.
proc use_sum {_id val} {
    # make the caller's per-line variables visible for the [subst $_id] below
    upvar 1 ip ip
    upvar 1 auth auth
    upvar 1 date date
    upvar 1 request request
    upvar 1 response response
    set id [subst $_id]
    # append the value to this group's list in the accumulator
    if [info exists Logs::results($id)] {
        array set Logs::results [list $id "$Logs::results($id) $val"]
    } else {
        array set Logs::results [list $id "$val"]
    }
    return $val
}
proc collect_sum {_id val} {
    upvar 1 ip ip
    upvar 1 auth auth
    upvar 1 date date
    upvar 1 request request
    upvar 1 response response
    upvar 1 size size
    set id [subst $_id]
    # reduce the accumulated list for this group to a single sum
    set lst $Logs::results($id)
    set sum 0
    foreach l $lst {
        set sum [expr $sum + $l]
    }
    set Logs::results($id) {}
    return $sum
}
To define any new [:xxx] procedure that you want to apply in the select statement, you just have to
copy the previous procedures, define use_xxx and collect_xxx, and change the list processing in
collect_xxx accordingly (as given below; a full example follows).
set sum 0
foreach l $lst {
    set sum [your_func $sum $l]
}
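For instance, a hypothetical :max aggregate needs a use_max that mirrors use_sum (only the name changes) plus this collector:
proc collect_max {_id val} {
    upvar 1 ip ip
    upvar 1 auth auth
    upvar 1 date date
    upvar 1 request request
    upvar 1 response response
    upvar 1 size size
    set id [subst $_id]
    set lst $Logs::results($id)
    # the only real change from collect_sum: reduce with max instead of +
    set max 0
    foreach l $lst {
        if {$l > $max} { set max $l }
    }
    set Logs::results($id) {}
    return $max
}
With that sourced, select {$ip [:max $size] } from agneyam group by {$ip} would report the largest response size seen per client.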
As you would have noticed, there is some redundancy that can be eliminated, but this form is
easier to understand than it would be after factoring out the common portion.
The complete db.tcl is available here.
Using it
1) Make sure that date.tcl from the section below is in your current directory along with db.tcl
> source db.tcl
> select {$ip, $request, $response } from agneyam
webcache.sun.com, GET /favicon.ico HTTP/1.1, 200
webcache.sun.com, POST /cgi-bin/test-cgi HTTP/1.1, 200
varunam.sun.com, GET /favicon.ico HTTP/1.1, 200
varunam.sun.com, GET / HTTP/1.1, 200
varunam.sun.com, POST /cgi-bin/test-cgi HTTP/1.1, 200
varunam.sun.com, GET /cgi-bin/test-cgi?myvar=new HTTP/1.1, 200
varunam.sun.com, GET /whoami.html HTTP/1.1, 404
varunam.sun.com, GET /mexico.html HTTP/1.1, 404
hokus-pokus.sun.com, GET / HTTP/1.1, 200
> select {$ip, $request, $response } from agneyam where {$response == 404}
varunam.sun.com, GET /whoami.html HTTP/1.1, 404
varunam.sun.com, GET /mexico.html HTTP/1.1, 404
> select {$ip [:sum $size] } from agneyam group by {$ip}
hokus-pokus.sun.com 355
varunam.sun.com 5688
webcache.sun.com 4170
> select {$response [:sum $size] } from agneyam group by {$response}
404 584
200 9629
> select {$ip [:sum 1] } from agneyam group by {$ip}
hokus-pokus.sun.com 1
varunam.sun.com 6
webcache.sun.com 2
> select {$response [:sum 1] } from agneyam group by {$response}
404 2
200 7
Required utilities:
date.tcl from here or here (external)
Download the source file and save it as date.tcl. You can load it with the command below before you source db.tcl (db.tcl already sources it for you):
> source date.tcl
The complete db.tcl is available here.