Getting Started with Compute Cluster Server

If you're an infrastructure person, you'll likely be involved in the coming months or years with one of Microsoft's newest offerings on the Windows Server platform. The Compute Cluster Edition (CCE) is a new version of Windows Server 2003. Like the Web Server Edition, it is a specially priced version of Windows, intended for servers used as compute cluster nodes. The Compute Cluster Pack (CCP) is the actual software, which can be installed on the CCE, Standard, or Enterprise Editions of Windows Server 2003 SP1 or R2. It contains a job scheduler, the Microsoft MPI stack, and management tools for operating a high-performance computing (HPC) environment. Both the CCP and CCE run only on 64-bit hardware.

The compute cluster server is something that many infrastructure folks will have some difficulty understanding, as it currently lives in the world of engineers doing complex math computations or parametric sweeps (feeding thousands of variations of the input variables into the same equation to compare the outcomes). Fortunately, the software is easy enough to install and configure, and I wanted to make the technology easier to understand for a wider audience. In this post, I will give a short demonstration of how the components in the CCP work, so that you'll have a baseline for understanding how such systems might need to be designed or troubleshot.

This lab uses two machines: Headnode01 and ComputeNode01. The following configurations were used during the design of the lab environment. Those settings that should be matched identically for completing the lab are noted with an asterisk.

| SETTING | HEADNODE01 | COMPUTENODE01 |
|---|---|---|
| NICs* | Dual (one plugged into the network, one to the crossover cable) | Single (crossover cable to head node) |
| Memory | 4 GB | 4 GB |
| Processor | AMD Opteron 64-bit 1.6 GHz Dual Processors | AMD Opteron 64-bit 1.6 GHz Dual Processors |
| Operating System | Windows Server 2003 R2 Enterprise Edition | Windows Server 2003 R2 Enterprise Edition |
| Domain Membership* | LAB DOMAIN | LAB DOMAIN |
| Compute Cluster Pack* | Installed, configured as head node (No RIS) | Installed, configured as compute node reporting to the head node. Admin Tools are included. |
| Networking* | Public and Private Networks (no MPI Network used); uses Windows connection sharing via Cluster Pack features | Private network only |
| Additional Folders* | C:\Resources and C:\Output | None |
| Shares* | Resources (Full Control – Everyone) and Output (Full Control – Everyone) | None |
| Environment Variables* | HResources = \\headnode01\resources and HOutput = \\headnode01\Output | HResources = \\headnode01\resources and HOutput = \\headnode01\Output |

Business Problem to Solve

For this lab, our business problem to solve is Internet Information Services (IIS) log analysis. To complete the lab, you will need access to up to 500 log files from an IIS web server; I presume that anyone reading this document has an IIS server somewhere at their disposal. Those log files will need to be copied to the Resources folder on the head node. Log files should be recorded daily or hourly, depending on the server, and should average at least 50 records per file.

TASK: Copy up to 500 Internet Information Services log files to the Resources folder on the compute cluster head node. For some of the exercises in the lab, the log files are presumed to be in the Microsoft IIS Log File Format (although you could make code adjustments to accommodate other types).
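As one way to stage the files, a command along the following lines could be run on the head node. This is only a sketch: the server name webserver01, the LogFiles share, and the W3SVC1 site folder are hypothetical placeholders, so substitute the location of your own IIS logs.

```batch
REM Hypothetical source path: adjust the server, share, and site folder to your environment.
REM The trailing backslash marks C:\Resources\ as a folder.
REM /Y suppresses the overwrite prompt if a file of the same name already exists.
XCOPY \\webserver01\LogFiles\W3SVC1\*.log C:\Resources\ /Y
```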

Exercise #1 – Scheduling Simple Jobs

The basic job we're accomplishing is quite simple: we must count the total number of hits recorded in each log file. For the first exercise, we will do this with a standard shell script. The shell script is called CountHits.bat and should be saved in the Output folder on the head node.

The script accepts a single parameter (a log file) and uses the FOR command with the /F switch to open the file passed and set a variable to count each line read in the file. NOTE: This command doesn't work if you have spaces in your file paths. The output for the file is a simple ECHO command that returns the total number of lines counted in the text file.

LISTING 1: CountHits.bat

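@ECHO OFF
SETLOCAL

REM Start from zero so a Count value inherited from the environment cannot skew the total
SET Count=0

REM Read the file passed as %1 and call :Calc once for each line
FOR /F %%J IN (%1) DO CALL :Calc %%J
ECHO Total Hits for %~n1: %Count%
GOTO END

:Calc
SET /a Count=Count+1
REM Return to the FOR loop rather than falling through to :END
GOTO :EOF

:END
ENDLOCAL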
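If your log paths do contain spaces, a variant along these lines may help. It is a sketch rather than part of the lab: it relies on the usebackq option of FOR /F, which lets a quoted value be treated as a file name, and it assumes the caller passes the path in double quotes. Like the original, it does not count blank lines, because FOR /F skips them.

```batch
@ECHO OFF
SETLOCAL
REM Sketch of a space-tolerant CountHits.bat; not the version used in the lab.
SET Count=0

REM %~1 strips any quotes from the argument so it can be re-quoted safely.
REM usebackq makes FOR /F read "..." as a file name instead of a literal string.
REM "delims=" keeps each line whole, so no time is spent splitting it into tokens.
FOR /F "usebackq delims=" %%J IN ("%~1") DO SET /A Count+=1

ECHO Total Hits for %~n1: %Count%
ENDLOCAL
```

With this variant, a file would be passed as CountHits.bat "C:\My Logs\in050412.log" (a made-up path), and any loop that calls the script would need to quote its loop variable in the same way.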

Task #1 – Running a Single Instance of the Job

For comparison purposes, our first task will be to use the command prompt to process all the log files stored in the Resources folder on the head node. Open a command prompt, navigate to the Output folder, and type the following command:

FOR %I IN (%HResources%\*.*) DO CountHits.bat %I

Your output should look similar to the following.

C:\Output>FOR %I IN (%HResources%\*.*) DO CountHits.bat %I

C:\Output>CountHits.bat \\headnode01\resources\in050412.log
Total Hits for in050412: 18

C:\Output>CountHits.bat \\headnode01\resources\in050413.log
Total Hits for in050413: 106

C:\Output>CountHits.bat \\headnode01\resources\in050414.log
Total Hits for in050414: 31

C:\Output>CountHits.bat \\headnode01\resources\in050415.log
Total Hits for in050415: 62

C:\Output>CountHits.bat \\headnode01\resources\in050416.log
Total Hits for in050416: 63

(Continues for each of the files that must be processed)…
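Since this run serves as the baseline for comparison, it can be useful to note how long it takes. The following is a minimal sketch of a wrapper batch file; the name TimeSerial.bat is made up for illustration, and it assumes it is run from the Output folder alongside CountHits.bat.

```batch
@ECHO OFF
REM TimeSerial.bat (hypothetical name): brackets the serial run with timestamps.
REM %TIME% is expanded as each ECHO line is read, so the two values differ.
ECHO Started:  %TIME%

REM Inside a batch file the loop variable is written %%I, and CALL is needed
REM so that control returns to this script after each CountHits.bat run.
FOR %%I IN (%HResources%\*.*) DO CALL CountHits.bat %%I

ECHO Finished: %TIME%
```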

Task #2 – Submitting Count Hits as a Compute Cluster Job

For this task, we will be taking the standard job run in task #1 and submitting it to the compute cluster so that processing of the individual log files can be distributed across multiple servers. Because the job could be running across potentially hundreds of nodes (in the real world) we need a way to aggregate our results.

We will accomplish this by creating a single Job on the head node. Within that job, we will create hundreds of individual tasks (one for each log file that we will be examining). Each task will be individually configured to use some of the built-in properties of tasks for parallel jobs: Command Line, Standard Output, and Standard Error. Because we'll be creating so many individual tasks, we will use the JOB command from the command prompt to create these tasks.

The basic process we will be following is:

  1. Create a new, un-submitted job in the compute cluster queue
  2. For each file found in the Resources folder, create a task with
    1. A unique command that passes the log file name to the CountHits batch file
    2. A standard output location to redirect the results of the ECHO statement in the batch file
    3. A standard error location to redirect any problems encountered on any of the tasks.

Create the Job

In the Output folder on the head node, create a new batch file called MakeParallel.bat. Open it in Notepad and enter the code from Listing 2. The batch file essentially consists of two commands, one to create the job, and one to loop through each log file and create a task. The job creation command is as follows:

job new /numprocessors:4-4 /jobname:BasicParallelJob

This creates a new job named "BasicParallelJob" that uses a minimum of 4 and a maximum of 4 processors. For purposes of the batch file, however, we wrap the command in a FOR /F loop so that we can capture the ID of the newly created job and use it in the next command to associate each task with the job.
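To see the token parsing in isolation, the sketch below runs FOR /F over a literal string instead of the real command. The sample text is an assumption about what job new prints, inferred from the tokens=4 option used in Listing 2; check the actual output on your own cluster before relying on it.

```batch
@ECHO OFF
REM Assumed sample of the text printed by "job new"; the real wording may differ.
REM With the default space/tab delimiters the tokens are: Created | job, | ID: | 15
FOR /F "tokens=4" %%j IN ("Created job, ID: 15") DO SET JobID=%%j

REM Prints: Captured job ID: 15
ECHO Captured job ID: %JobID%
```

If the message on your system has a different number of words, the tokens value would need to change to match.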

LISTING 2: MakeParallel.bat

@ECHO OFF
for /F "usebackq tokens=4" %%j in (`job new /numprocessors:4-4 /jobname:BasicParallelJob`) do SET JobID=%%j

FOR %%I IN (%HResources%\*.*) DO job add %JobID% /stdOut:%HOutput%\Results-%%~nI.log /stdErr:%HOutput%\Errors-%%~nI.log %HOutput%\CountHits.bat %%I

The second command instantiates a FOR loop and, for each log file discovered in the Resources Folder, it creates a task, using the following syntax:

JOB ADD [JOB ID] /stdOut:[Path]\Results-[Log File Name].log /stdErr:[Path]\Errors-[Log File Name].log [Path]\CountHits.bat [Log File Path]

If you're new to shell scripting, there may be a few things that need to be cleared up. First, inside a batch file the FOR loop variable must be written with a doubled percent sign (%%I): the batch parser consumes one % and passes %I on to the FOR command. Environment variables such as %HResources% keep the single-% form on each side and are replaced with their values when the line is read. That is why some of the variables have %% and others are single %s. Also, we are using the ~n between the % sign and the actual parameter name (I). This tells the shell interpreter that we want just the file name, without the path or the extension.
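A quick way to get a feel for these modifiers is to print a few of them for each file. The following is a small sketch, assuming the HResources environment variable from the lab setup is defined; save it as a batch file under any name you like.

```batch
@ECHO OFF
REM Prints several expansions of the loop variable for each file in the Resources share.
REM %%I and its modifiers use the doubled %; %HResources% uses single % on each side.
FOR %%I IN (%HResources%\*.*) DO (
    ECHO Full path : %%I
    ECHO Name only : %%~nI
    ECHO Extension : %%~xI
    ECHO Name + ext: %%~nxI
)
```

For a file such as \\headnode01\resources\in050412.log, the "Name only" line shows in050412, which is the value used to build the Results and Errors file names in Listing 2.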

Run MakeParallel.bat from the command prompt, then open the Compute Cluster Administrator, select View Job Schedule, and find your job. You should notice a few things:

  1. The job status is Not Submitted
  2. Upon selecting the job, the individual tasks should be listed in the lower pane
  3. The job is configured to use 4 processors, but the individual tasks only use 1 processor (this is okay; CCS will balance things out).
  4. If you right-click on the job in the GUI, you cannot submit the job

Submit the Job

Before we submit the job, let's get the head node configured to watch the process. Make sure your windows are arranged so that you can see the following:

  1. The OUTPUT folder in Explorer (we'll watch for the result files to be posted)
  2. The _Total % Processor Time counter for each node in the cluster, shown on the System Monitor in the Compute Cluster Administrator

Once your viewing environment is set up, submit the job by typing the following at the command prompt:

JOB SUBMIT /ID:15

where 15 is the actual job ID assigned to your job. On a two-node system, you should see large spikes on both nodes simultaneously, although the spike on the head node may be considerably higher (since it is handling the scheduling, output files, etc.).
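Once all the tasks have finished, the per-file results still need to be combined, as noted at the start of Task #2. The following is one possible sketch, not part of the original lab. It assumes each Results-*.log file in the Output share holds a single line in the form written by CountHits.bat, such as "Total Hits for in050412: 18", and that the log file names contain no colons.

```batch
@ECHO OFF
SETLOCAL
REM Sketch: sums the counts from every Results-*.log written by the cluster tasks.
SET Total=0

REM Outer loop: each result file. Inner loop: split its line at the colon and
REM take the second piece (the count). SET /A ignores the leading space.
FOR %%F IN (%HOutput%\Results-*.log) DO (
    FOR /F "usebackq tokens=2 delims=:" %%T IN ("%%F") DO SET /A Total+=%%T
)

ECHO Total hits across all logs: %Total%
ENDLOCAL
```

Empty result files contribute nothing to the total, so it is worth checking the matching Errors-*.log files if the figure looks low.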

Conclusion

The goal of this short tutorial is merely to expose infrastructure people to some of the parallelization possibilities in compute cluster server, using simple tools and batch files to solve a business problem that they might relate to. Most of the actual problems solved on a compute cluster server will be far more challenging (weather prediction, drug compounding, aerodynamics, etc.). However, with just a baseline understanding of how the Microsoft components in the architecture operate, setting up such systems in your own organization should become slightly easier.