Using DTS and Analysis Services to Analyse Blog Logs

Following on from the entry I wrote about analysing blog logs, one or two people have
asked me how I set everything up to do this. It's a really good use of Analysis Services,
actually, and offers a good example of how Microsoft's business intelligence tools work
together. Nothing is too complicated, although there are a number of steps involved to
get everything working. If you've a general idea of what DTS and Analysis Services are,
you should be able to follow these instructions fairly well - drop a note in the comments
section if you're struggling and I'll see what I can do to help out. Note that the blogs
hosted here use BlogX, and so what I've written here presumes a log format that matches
the one on our server.

Step 1: Generating a raw log file

The first step is of course to get hold of the raw log files that contain the
information to be analysed. My server stores a separate log file for each day, so I use
a batch file that calls the standard command-line utility wget to pull these files down
over HTTP. (Incidentally, here's a quick-and-dirty C# version of wget that I wrote.)
Once I've got hold of the files, I use a batch file to concatenate them into a single
log file called bloglog.txt, which is about 60MB in size at the last count.

Step 2: Generating a destination table

Next we need to create a database table into which the cleansed, transformed version
of the log files will be inserted. To achieve this, simply execute the following SQL
(or similar):

    CREATE TABLE [Hits] (
        [HitID] [bigint] IDENTITY (1, 1) NOT NULL,
        [HitTime] [datetime] NULL,
        [URL] [varchar] (400) NULL,
        [URLReferrer] [varchar] (400) NULL,
        [UserAgent] [varchar] (400) NULL,
        CONSTRAINT [PK_Hits] PRIMARY KEY CLUSTERED ([HitID])
    )

Step 3: Generating a DTS Package

In DTS Designer, create a package with two connections - one for the raw tab-delimited
file (a Text File (Source) connection pointing at bloglog.txt) and one for the SQL
Server database - and add a Transform Data task between them. The main thing this task
needs to do is copy the appropriate columns across from the text file to the Hits table,
which Copy Column transforms handle nicely. For the date column, though, you need an
ActiveX Script transform, since the timestamp in the log file has to be reshaped into a
format the datetime column will accept. Here's the VBScript to slice the input
appropriately:

    ' Transform the source timestamp into a format SQL Server will accept:
    ' take the date (characters 1-10) and the time (characters 12-19) of Col003,
    ' which assumes an ISO-style value such as 2003-10-01T12:34:56.
    Function Main()
       DTSDestination("HitTime") = Left(DTSSource("Col003"), 10) & _
          " " & Mid(DTSSource("Col003"), 12, 8)
       Main = DTSTransformStat_OK
    End Function

Test the package, and then execute it. If all goes well, you should now have a SQL
Server table containing the contents of the log file. (If you'd like to save time in
this step, you can download the DTS package I created and modify it to match your own
table and source file names.)
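
As a quick sanity check (assuming the table and column names from step 2), you can run
something like this against the database to confirm that the rows arrived and the dates
were parsed correctly:

    -- Rough sanity check: how many hits were loaded, and over what date range?
    SELECT COUNT(*)       AS TotalHits,
           MIN([HitTime]) AS EarliestHit,
           MAX([HitTime]) AS LatestHit
    FROM [Hits]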

(Optional) You might now want to go back and add some steps to the DTS package to
clean up the table before each execution (see the sketch below), run the batch file
created in step 1, and even process the cube we'll create in step 4. This is how the
final DTS package should look:
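
For the table clean-up itself, an Execute SQL task at the start of the package is
enough. A minimal sketch, assuming the Hits table from step 2:

    -- Clear out the previous load before re-importing the full log file
    DELETE FROM [Hits]

(TRUNCATE TABLE [Hits] would be faster on a large table and also resets the identity
seed, which is harmless here since HitID carries no meaning.)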

Step 4: Creating an Analysis Services Cube

Within Analysis Manager, create a new database and set its data source to the SQL
Server database you've just populated. Now right-click on the Cubes node and choose the
Cube Wizard. In a more complex schema you'd split the fact table (the values you're
browsing) and the dimension tables (the things you use to slice and dice those values)
into separate tables, but here everything is conveniently stored in the one table. HitID
is the measure - the value itself has no meaning, but we'll change how it's aggregated
in a moment.

Create four dimensions, using the star schema model and accepting all the defaults:

1. URL (standard dimension type based on the URL column)
2. Referrer (standard dimension type based on the URLReferrer column)
3. User Agent (standard dimension type based on the UserAgent column)
4. Time (time dimension type, year / quarter / month / day hierarchy, based on the
   HitTime column)

When the wizard dumps you unceremoniously into the Cube Editor, select the HitID measure
and click the Properties button in the bottom left-hand corner. Set the Aggregate
Function to Count rather than Sum, so that the aggregations count how many hits there
are instead of summing the meaningless HitID values. Lastly, choose Tools / Process
Cube, accept the default storage design, and execute the process. The cube is now
available for browsing.

NB If you created the dimensions above as shared, make sure you choose the option to
incrementally update the shared dimensions - otherwise processing will fail on
subsequent runs.
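
If it helps to see what the cube is doing, the Count measure is roughly equivalent to
running GROUP BY queries against the Hits table from step 2 - something like this, which
counts hits per calendar month:

    -- Roughly what the cube pre-aggregates along the Time dimension:
    -- number of hits per calendar month
    SELECT YEAR([HitTime])  AS HitYear,
           MONTH([HitTime]) AS HitMonth,
           COUNT([HitID])   AS Hits
    FROM [Hits]
    GROUP BY YEAR([HitTime]), MONTH([HitTime])
    ORDER BY HitYear, HitMonth

The difference, of course, is that Analysis Services pre-computes many of these
aggregations across the dimensions, so browsing them is fast.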

Step 5: Use Excel to Browse the Cube

Now comes the fun part! In Excel, choose Data / PivotTable and PivotChart Report.
Select External Data Source and get the data from the OLAP cube (you might have to
set up a new query using Microsoft Query). Accept all the other defaults, and you'll
wind up with an empty PivotTable embedded in the spreadsheet. You can now drag fields
onto the PivotTable - drag the HitID measure into the data area in the centre of the
table, and then drag the dimensions onto the rows or columns. The context-sensitive
menus allow you to customise sorting, grouping and formatting. Create a PivotChart to
see a graphical view, such as this:

Well done if you've got this far! Have fun analysing your blog data and identifying
some of the trends. You might be surprised at which kinds of blog entries are the most
popular, or which aggregators are visiting your site...