Database normalization simplified?

--Making Sense of Database Normalization, Hopefully--

 

Following is my take on database normalization. Hopefully this simplified version will help your understand of a sometime complicated subject a little better.

 

When setting up a database, one of the most important steps to keep in mind in the process is normalization. Normalization is the process of organizing data in the database. This is done by applying a set of rules to the database where each rule results in a particular database form. As you move from one form to another, your database must meet additional criteria. The goal of meeting the different "normal" forms is to protect the data and to make the database more flexible by eliminating redundancy and inconsistent dependency.

 

Redundant data wastes disk space and creates maintenance problems. If data that exists in more than one place must be changed, the data must be changed in exactly the same way in all locations which is time consuming and prone to errors. A change to a customer address is much easier to do if that data is stored only in the Customers table and nowhere else in the database.

 

Inconsistent dependency is a database design that makes certain assumptions about the location of data. For example, while it is intuitive for a user to look in the Customers table for the address of a particular customer, it may not make sense to look there for the salary of the employee who calls on that customer. The employee's salary is related to, or dependent on, the employee and thus should be moved to the Employees table. Inconsistent dependencies can make data difficult to access because the path to find the data may not be logical, or may be missing or broken.

 

--First Normal Form--

Getting back to normal forms, a database is said to be in First Normal Form when all entities have a unique identifier or key, and when every column in every table contains only a single value and doesn't contain a repeating group or composite field. One example of a repeating group is an array that contains a collection of similar values. This might consist of storing multiple departments that an employee worked in. A composite field contains multiple pieces of information in a column. For example, EmployeeNumber should not be composed of social security number and last name because these are separate attributes. This may save space but is difficult to retrieve individual pieces of data from the composite. For the repeating group situation, one way to normalize is to split the table into two tables. The first table holds the primary key and all of the non-repeating columns. The second table holds the same primary key (foreign key) as the first table and a column containing a single value from the repeating group. Thus, each value from the repeating group would occupy a single row in the second table. For a composite column, you would break the column up into individual columns.

 

--Second Normal Form--

A database is in Second Normal Form when it is in First Normal Form plus every non-primary key column in the table must depend on the entire primary key, not just part of it, assuming that the primary key is made up of composite columns. For example, assume that you have a primary key that consists of columns OrderNumber and OrderItemNumber. OrderNumber uniquely identifies a customer order and OrderItemNumber identifies an item in the order. A column like ItemNumber would be dependent on the entire primary key while a column like DateOrdered would only depend on the OrderNumber field. Note that this rule implies that a table with a single column primary key is by default in Second Normal Form.

 

--Third Normal Form--

A database is in Third Normal Form when it is in Second Normal Form and each column that isn't part of the primary key doesn't depend on another column that isn't part of the primary key. For example, suppose that a table contains a primary key of EmployeeNumber and the dependent fields Street, City, State, and ZipCode. This table is in Second Normal Form but not Third Normal Form as City and State are unique for (and thus dependent on) a particular ZipCode. This results in ZipCode fields being duplicated in multiple records. To fix this, you would need to create another table with ZipCode as the primary key and then move the City and State columns to the new table. There exists several other normal forms but Third Normal Form is typically as far as you need to go for most databases in order to strike a good compromise between performance and redundancy of data. In some situations, Third Normal Form may even be too much. For example, if you have a Customers table and you want to eliminate all possible interfield dependencies, you must create separate tables for cities, ZIP codes, sales representatives, customer classes, and any other factor that may be duplicated in multiple records. This in turn may degrade performance as many small tables may exceed open file and memory capacities. In this case, it may be more feasible to apply Third Normal Form only to data that changes frequently.