The content of this article is based on experience I gained in a large data warehouse project. My initial intent in writing this article is to provide a good starting point for other DWH projects in which ETL processes have to be designed and built.
In the first chapter I write about the data sources and data source formats. My experience was that it is very important to define the sources carefully, precisely, and completely.
After examining the structure of the source data, I will have a look at the data content and at the parts that are important for ETL processes.
The third chapter is an accumulation of functional requirements one may face when designing ETL processes.
Don’t consider this document a comprehensive guide to designing ETL processes. My intent is to maintain it by extending and adding content over time.
The nature of a data warehouse project is that data is collected from various sources and in various formats. So one of the most essential tasks is to define strict contracts for the interfaces through which the data is collected.
Here are some questions which should be answered by the contract:
- Where does the data come from?
- Which data formats are used?
- How is the data formatted?
- Which information is necessary to handle the data correctly?
- How are different data types formatted? (number, date, time, ...)
- Is there one record in the source for one Business Case? (for example one record for one customer)
- How are deleted records marked?
- Is it a complete set of data? (e.g. all customers or just a set of changes)
- Are trailing whitespaces part of the data?
- Data source
  - E.g. data files in a directory or on a share, database tables or views, web services, …
- Data format
  - Comma-separated files, EBCDIC-encoded files, etc.
  - How are the files encoded? (code page)
- Database collation
- Language settings
  - Date formats
  - Number formats
My recommendation is to build a repository where you store the mentioned contracts. This repository can be an Excel workbook or, even better, a SQL Server database. If the repository contains a comprehensive view of all the contracts and data structures of your sources, you can use it as the meta-data source for your ETL processes. For example, in the data warehouse project mentioned at the beginning of this article, we generated some of our ETL processes automatically from the information in such a meta-data store.
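To make the idea of a contract repository concrete, here is a minimal sketch of how such meta-data could drive code generation. All field names, the example file path, and the `generate_staging_ddl` helper are illustrative assumptions, not taken from the project described above:

```python
# A minimal interface-contract repository as plain Python data.
# In practice this would live in Excel or a SQL Server database.
contracts = {
    "customer": {
        "source": "/import/crm/customer.csv",   # hypothetical path
        "format": "csv",
        "encoding": "utf-8",
        "date_format": "%Y-%m-%d",
        "columns": ["CustomerId", "Name", "BirthDate"],
        "complete_snapshot": True,               # full set vs. delta delivery
    },
}

def generate_staging_ddl(entity: str) -> str:
    """Generate a CREATE TABLE statement for a raw staging table from
    the contract meta-data (all columns as text, as is common for
    staging areas)."""
    cols = ",\n  ".join(f"{c} VARCHAR(255)" for c in contracts[entity]["columns"])
    return f"CREATE TABLE stg_{entity} (\n  {cols}\n);"

ddl = generate_staging_ddl("customer")
```

The same meta-data could just as well feed the generation of SSIS package definitions or validation rules.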
Data validity period
In many cases it is important for the ETL processes to know how to interpret the data, for example if the source systems provide information about the time spans within which the data is valid. If this data is important for the target system, detailed information about the meaning of the fields defining the validity period is required.
Here are some questions which should be answerable:
- How is the validity information provided? (TimeFrom – TimeTo, timestamp)
- How is the time information to be interpreted? E.g. what does a value for ValidTo mean? (is the data valid for dates < ValidTo or <= ValidTo?)
Historical data in a data warehouse
Data in a data warehouse is stored in one of two forms: with history or without history. Historized data can be subdivided into point-in-time related data and time-span related data.
There are three types of how data can be stored with respect to history:
- Type 1: No history information is stored. Existing data is overwritten by new values.
- Type 2: The history of data changes is preserved. Every data row has a valid from and valid to date indicating the time period of the data’s validity.
- Type 3: This type is used for data that is valid for a specific point in time. Every record has a timestamp column indicating the point in time where it is valid.
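The three types can be sketched with illustrative rows; the column names (`CustomerId`, `City`, `ValidFrom`, `ValidTo`, `SnapshotDate`) are assumptions for the sake of the example:

```python
# Type 1 -- no history; the row is simply overwritten:
type1 = {"CustomerId": 1, "City": "Hamburg"}   # the old value "Berlin" is lost

# Type 2 -- validity period per row; a change closes the old row
# and opens a new one:
type2 = [
    {"CustomerId": 1, "City": "Berlin",  "ValidFrom": "2009-01-01", "ValidTo": "2010-03-15"},
    {"CustomerId": 1, "City": "Hamburg", "ValidFrom": "2010-03-15", "ValidTo": None},
]

# Type 3 -- point-in-time data; every record carries one timestamp:
type3 = [
    {"CustomerId": 1, "City": "Berlin",  "SnapshotDate": "2010-02-28"},
    {"CustomerId": 1, "City": "Hamburg", "SnapshotDate": "2010-03-31"},
]
```

Note that in the Type 2 rows the old ValidTo equals the new ValidFrom; whether that boundary day belongs to the old or the new row is exactly the `<` vs. `<=` question discussed below.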
The decision which type you use for your data should be made in the design phase of your target database. The important information for the ETL developer is where the history information for Type 2 and Type 3 data comes from.
- What time information can be used to make up the validity period information?
- How to identify a deleted/invalid data record
- How to handle validity period information from the source system
- What does the validity period information mean? (e.g. an exact definition of what a ValidTo date means: <, <=)
- What granularity is used to define validity periods?
What time information can be used to make up the validity period information
To build up a validity period you need information about the point in time at which a data change takes effect. This point in time can, for example, be the timestamp of when the import process is running. But in most cases it is determined by the source system, so you have to find a way to hand this information over from the source system to your ETL process.
How to identify a deleted/invalid data record
The question is how to identify records which are no longer valid in the source system. This can be accomplished by sending the record with a flag indicating the record status. If the source system always sends a complete list of valid records, another method is to match the records from the source system with the records in the target system and to set all records which are in the target but not in the source to invalid.
Setting a record to invalid can mean one of the following things:
- Delete the record in the target system (not the best choice)
- Mark the record as invalid (e.g. deleted timestamp in type 1 data, or set ValidToDate in type 2 data)
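The matching approach for complete deliveries can be sketched as a simple set comparison; the row layout and the Type 2 style of closing rows are illustrative assumptions:

```python
def mark_deleted(source_keys, target_rows, load_date):
    """If the source always delivers the complete set of valid records,
    any open row in the target whose key is missing from the delivery
    is closed (Type 2) instead of being physically deleted."""
    delivered = set(source_keys)
    for row in target_rows:
        if row["ValidTo"] is None and row["Key"] not in delivered:
            row["ValidTo"] = load_date

target = [
    {"Key": 1, "ValidFrom": "2010-01-01", "ValidTo": None},
    {"Key": 2, "ValidFrom": "2010-01-01", "ValidTo": None},
]
mark_deleted([1], target, "2010-04-01")   # key 2 is no longer delivered
# the row for key 2 is now closed, the row for key 1 stays open
```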
How to handle validity period information from the source system
Maybe the system your data comes from already uses validity periods for its data, and this information is handed over to the new system. In this case you have to define what this information means to the new system.
What does the validity period information mean?
Define exactly how validity periods are interpreted. What does a specific ValidTo date mean? In terms of an SQL statement: does ValidTo mean ‘<’ or ‘<=’?
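The two readings disagree on exactly one day, which is why the contract must pin this down. A small illustration (field names assumed):

```python
from datetime import date

row = {"ValidFrom": date(2010, 1, 1), "ValidTo": date(2010, 3, 15)}

def is_valid_exclusive(row, day):
    # ValidTo means '<': the row is NOT valid on ValidTo itself
    return row["ValidFrom"] <= day < row["ValidTo"]

def is_valid_inclusive(row, day):
    # ValidTo means '<=': the row is still valid on ValidTo
    return row["ValidFrom"] <= day <= row["ValidTo"]

is_valid_exclusive(row, date(2010, 3, 15))  # False
is_valid_inclusive(row, date(2010, 3, 15))  # True
```

For any day before the boundary the two definitions agree; loading data with the wrong assumption therefore produces errors that only show up on period boundaries, which makes them hard to spot.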
What granularity is used to define validity periods?
In some cases the validity of data is specified at a coarse granularity, for example on a weekly or monthly basis. This leaves a lot of room for interpretation. Take, for example, a data record which is defined as valid for January 2010. What does this mean? Is the data still valid starting from the first of February, or is it valid only for January? All these questions have to be clarified before the development of the ETL processes can start.
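One way to remove the ambiguity is to expand every coarse-grained period into an explicit interval during loading. A sketch, assuming the contract says a month covers exactly that month as a half-open interval:

```python
from datetime import date

def month_to_interval(year, month):
    """Expand a coarse-grained validity ("valid for January 2010") into
    an explicit half-open interval [first day of month, first day of
    the next month). Whether the period really ends with the month or
    extends until the next delivery must be defined in the contract."""
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start, end

start, end = month_to_interval(2010, 1)   # 2010-01-01 up to (excl.) 2010-02-01
```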
Functional Requirements for the ETL processes
My recommendation when planning your ETL is to think of all the functional requirements your packages must meet and to build some prototype or template packages.
- Transactional behavior / restart ability
- Batch processing
One important thing to consider is the behavior of your ETL processes in the case of an error, and how you can recover your system into a consistent state after a process abortion.
So the first task is to identify the objects affected by your processes. In a DWH project, where the task of the processes is to move data from a source to a destination, usually the only affected system is the target database. Fortunately there is a very good, well-known mechanism to ensure the consistency of a database: transactions. So the way to ensure that your processes do not leave your destination database in an inconsistent state is to make sure that all of the work done in the destination runs within one transaction.
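The all-or-nothing behavior can be sketched with SQLite standing in for the target database (table and data are made up; with SQL Server the same idea applies via an explicit transaction around the load):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.commit()

rows = [(1, "Alice"), (2, "Bob"), (1, "Duplicate")]   # third row violates the PK

try:
    with conn:   # one transaction: commits on success, rolls back on error
        conn.executemany("INSERT INTO customer VALUES (?, ?)", rows)
except sqlite3.IntegrityError:
    pass  # the whole batch is rolled back, not only the failing row

count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
# count == 0: the aborted load left the destination unchanged,
# so the process can simply be restarted
```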
Integration Services includes logging features that write information about all kinds of events during package execution to a customizable location. This logging information is very technical, and it is hard to extract the required business information from these logs. So one thing you should think about when designing your packages is what additional information is useful or required, and how and where to log this data.
- Package Start and Stop time
- The Batch Name or ID the package execution was part of
- Package parameters
- Names of the input files the package has processed
- Number of records imported/declined/…
To gather the logging requirements, consider questions such as:
- How to analyze package execution errors in the production system
- How to analyze data errors
- How to analyze the loading process (plausibility checks)
- Is the amount of loaded data comparable to the last load?
- Is the load performance stable?
- Is there an increase in the count of erroneous input records?
Interview the business specialists and the IT staff who will operate the system to get a comprehensive list of all the questions that should be answerable from the logged data.
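A business-oriented audit record as listed above could look like the following sketch; every field name here is an assumption, and in a real project the record would be written to an audit table, not an in-memory list:

```python
from datetime import datetime

def log_package_run(audit_log, package, batch_id, start, stop,
                    input_files, imported, declined):
    """Append one business-level audit record for a package execution,
    in addition to the technical SSIS log."""
    audit_log.append({
        "Package": package,
        "BatchId": batch_id,
        "Start": start,
        "Stop": stop,
        "InputFiles": input_files,
        "RecordsImported": imported,
        "RecordsDeclined": declined,
    })

audit_log = []
log_package_run(audit_log, "LoadCustomer", "BATCH-2010-03-15",
                datetime(2010, 3, 15, 2, 0), datetime(2010, 3, 15, 2, 12),
                ["customer.csv"], imported=1200, declined=3)
```

Records of this shape make the plausibility checks above straightforward: compare `RecordsImported` and run duration against previous executions of the same package.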
Usually, data delivered from a source system is provided in multiple parts (e.g. data files or data tables); these parts are related to each other, and it is necessary to process them in one batch. Considering this requirement is very important in the design phase of your ETL processes, because the single processing units (ETL packages) must be aware of the batch they are part of.
Take, for example, a source system which provides customer data split into multiple parts. One part (source file) contains the customers’ personal data (name, age, …). A second part contains the details of the customers’ contact information (email addresses, phone numbers, …). Assume that we have two SSIS packages to import the customer data in this simple example. It is obvious that the packages belong together and that the package which imports the customers’ personal data must run before the package importing the customers’ contact information. This is a very simplified example where it is necessary to process the packages in one batch, with distinct dependencies between the members of the batch.
- Members of a batch are related to each other
- Batches should be as small as possible to minimize dependencies.
- All related parts must be in one batch
- Package executions should be assigned to a distinct batch (batch identifier)
- Logging information should include information about the batch (e.g. a batch identifier)
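The points above can be sketched as a tiny batch runner: every execution is tagged with one batch identifier, and the members run in dependency order. The package names and the `run_package` callable are illustrative:

```python
import uuid

def run_batch(packages, run_package):
    """Execute the packages of one batch in the given dependency order,
    handing every execution the same batch identifier (which would also
    go into the audit log)."""
    batch_id = str(uuid.uuid4())
    for pkg in packages:
        run_package(pkg, batch_id)
    return batch_id

# Dependency order from the customer example: personal data first.
packages = ["ImportCustomerBase", "ImportCustomerContact"]

executed = []
run_batch(packages, lambda pkg, bid: executed.append((pkg, bid)))
# all executions share one batch id, in the declared order
```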
A rule of thumb is to use bulk inserts whenever possible. The performance improvement can be tremendous compared to non-bulk inserts when you are inserting more than just a few hundred records.
- Use table locks
- Parallel writes into one target table must be avoided
- If your target tables are connected with foreign keys, you have to pay special attention to the dependencies of your ETL processes (details in the next chapter).
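The difference between row-by-row and set-based inserts can be illustrated with SQLite as a stand-in (with SQL Server the analogue would be a bulk insert or the fast-load option of an SSIS destination); table and data are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact (id INTEGER, value REAL)")

rows = [(i, i * 0.5) for i in range(1000)]

# One set-based statement in one transaction instead of 1000
# individual round trips:
with conn:
    conn.executemany("INSERT INTO fact VALUES (?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM fact").fetchone()[0]  # 1000
```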
DWH and Foreign-Key constraints
This chapter is not a comprehensive essay about foreign key constraints. I just want to write down some of my experiences with this topic.
The scenario is from a project I was involved in. My task was to load data from data files into a database with a normalized schema. The nature of a normalized database schema is that there are many relatively small tables (in terms of the number of columns) which are heavily related to each other. These relationships must be taken into account when the database is loaded. For example, the customer base table must be loaded before the customer details table, and so on. The rule of thumb is: if there is a dependency between two tables in the data schema, this dependency must be addressed when loading the data into the database. In addition to this rule, there might be more dependencies to consider, for example if you are using bulk inserts to write data into the target tables.
Here is an example of how bulk inserts can affect the ordering of your ETL processes.
In the example there are three tables in the target database. One ETL process – an SSIS package – exists to load each of the tables. The tables are related to each other, as you can see in the picture.
When you use a bulk insert operation to insert data into the CustomerCountry table, you should not bulk insert data into the CustomerAddress table at the same time. Inserting into both tables at the same time can lead to erroneous behavior in one of the insert tasks. Bulk insert operations require that the schema of the involved tables and objects stays the same during the whole operation, but bulk inserting into a table with a constraint changes the meta-data of the constraint. This change is treated as a schema change of the constraint in the referenced table. So if a second bulk insert operation is running, even on a different table, this operation may end with an error message.
As you can see in the example, you have to consider a lot of dependencies when you design your package execution plan. So not using foreign key constraints can make your work easier and your load process faster, because you can load your warehouse with more parallelism.
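If you do keep the constraints, a valid load order can be derived mechanically from the foreign-key dependencies. A sketch using a topological sort; the dependency map is an assumption modeled on the tables named above:

```python
from graphlib import TopologicalSorter

# Map each table to the tables it references via foreign keys
# (illustrative; in practice this would be read from the catalog views).
fk_depends_on = {
    "CustomerAddress": {"Customer", "CustomerCountry"},
    "CustomerCountry": set(),
    "Customer": set(),
}

# static_order() yields referenced tables before the tables that
# reference them, i.e. a valid sequential load order.
load_order = list(TopologicalSorter(fk_depends_on).static_order())
```

Tables with no path between them in this graph are candidates for parallel loading, subject to the bulk-insert caveat described above.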