How should database models be persisted?

The conventional way to persist the semantic data in a graphical database design/modeling tool is as a single binary file. This is easy to implement but does not mesh well with the text-oriented tools developers typically use (e.g. editors, source code control systems, file diff and file searching utilities). In addition, if something goes wrong, recovering your data is difficult. Even if the file is persisted as XML, it requires users to understand and work with a new file format.

I think a better way to persist semantic information is as a small number of T-SQL creation scripts. Information that has no T-SQL equivalent, like graphical layout, should be stored in a single XML file. This allows users to edit, search and compare versions of the database model using standard tools. Most importantly, users can easily resolve change conflicts with those same tools, since the model is stored in a natural and familiar form (i.e. T-SQL).
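To make the split concrete, here is a minimal sketch of what persisting a model this way might look like. The names (`Table`, `save_model`, the file names) are mine, purely for illustration: the semantic half goes into an ordinary T-SQL creation script that users can diff and merge, and the purely visual half goes into a small XML file.

```python
import xml.etree.ElementTree as ET

# Hypothetical, minimal model: a table with typed columns plus a shape position.
class Table:
    def __init__(self, name, columns, x=0, y=0):
        self.name = name
        self.columns = columns  # list of (column_name, tsql_type) pairs
        self.x, self.y = x, y   # graphical layout only; no T-SQL equivalent

def save_model(tables, script_path, layout_path):
    # Semantic data: an ordinary T-SQL creation script users can edit and diff.
    with open(script_path, "w") as script:
        for t in tables:
            cols = ",\n    ".join(f"[{n}] {ty}" for n, ty in t.columns)
            script.write(f"CREATE TABLE [dbo].[{t.name}] (\n    {cols}\n)\nGO\n\n")

    # Presentation data: layout with no T-SQL equivalent goes into one XML file.
    root = ET.Element("layout")
    for t in tables:
        ET.SubElement(root, "shape", table=t.name, x=str(t.x), y=str(t.y))
    ET.ElementTree(root).write(layout_path)

save_model(
    [Table("Customer",
           [("CustomerId", "int NOT NULL"), ("Name", "nvarchar(100)")],
           x=40, y=60)],
    "model.sql", "model.layout.xml")
```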

This approach worked well in the design of the new Visual Studio 2005 class designer. The main issues we encountered were the cost of parsing a large amount of source at startup and the question of what to do with class shapes that no longer mapped to classes in the source.

We solved the problem of parsing large amounts of source through careful design and performance engineering, and by developing an infrastructure that delayed loading data until it was actually required. Delay loading is a common technique in distributed systems but unusual in modeling tools.
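The idea is simple to sketch, although the real infrastructure is of course more involved. In this hypothetical illustration (the names `ClassModelElement` and `parse_source` are mine, not the class designer's), a model element holds only a path to its source file, and the expensive parse happens the first time something actually asks for the data:

```python
def parse_source(path):
    # Stand-in for a real language parser: treat each line as a member name.
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

class ClassModelElement:
    """Minimal delay-loading sketch: hold a source reference, parse on demand."""

    def __init__(self, source_path):
        self.source_path = source_path
        self._members = None  # nothing parsed at startup

    @property
    def members(self):
        if self._members is None:
            # The expensive parse is deferred until the data is actually needed.
            self._members = parse_source(self.source_path)
        return self._members
```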

If users change, say, a class name without the class designer open, then, when the designer is next opened, the corresponding class shape needs to be mapped to the new class. Because classes do not change much, we can usually match the class shape to the renamed class. Because several classes may have changed, we developed an efficient algorithm that makes the best overall choice. However, sometimes there is no matching class. In those cases, we show the class shape in a special way, and the user can either manually match the shape to the corresponding class or just delete the shape. The approach is not perfect, but given the pragmatics of how people develop software, it works almost all of the time without affecting existing development processes.
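The post doesn't describe the actual algorithm, so take this as a minimal sketch of the general idea rather than what shipped: score every orphaned shape against every unclaimed class by name similarity, take the best-scoring pairs first, and leave anything below a threshold unmatched so it can be shown specially. All names and the threshold here are hypothetical.

```python
from difflib import SequenceMatcher

def match_shapes_to_classes(orphaned_shapes, unclaimed_classes, threshold=0.6):
    """Greedy best-overall matching of shape names to (possibly renamed) classes."""
    scores = [(SequenceMatcher(None, s, c).ratio(), s, c)
              for s in orphaned_shapes for c in unclaimed_classes]
    matches, used_shapes, used_classes = {}, set(), set()
    for score, shape, cls in sorted(scores, reverse=True):
        if score < threshold:
            break  # remaining pairs are too dissimilar; leave shapes orphaned
        if shape not in used_shapes and cls not in used_classes:
            matches[shape] = cls
            used_shapes.add(shape)
            used_classes.add(cls)
    return matches  # shapes missing from the result get shown in a special way

print(match_shapes_to_classes(["CustomerOrder", "Invoice"],
                              ["CustomerOrders", "InvoiceLine", "Shipment"]))
```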

Weighing against this view is the fact that database modeling customers tend to see T-SQL creation scripts as something a tool generates rather than as the fundamental objects the tool manipulates. Even for those customers, I think the ability to have several people working on the model at once, and the reduced risk of data loss, will be compelling advantages.

Given the desire to persist database information as natural T-SQL scripts, how should the scripts be structured? There seem to be two basic ways that T-SQL developers factor their scripts: as a few large files containing scripts for related objects, or as many small files containing scripts for individual objects. I prefer files containing several related objects, but plenty of smart people favor the second approach. I know of one team at Microsoft where even the primary key creation script for a table gets its own file. However, I have never heard a good explanation of why some prefer many small files.

My theory is that the small files approach is a historical artifact of having to work with primitive, or nonexistent, source code control systems that did not allow simultaneous changes to a single file. Without modern tools, the cost of integrating changes would be large enough that many small files would make a lot of sense. With modern tools, the benefits of easy navigation between related objects outweigh the cost of integrating changes.

From a tools perspective, it is generally best to generate only a small number of files, since the cognitive burden on the user will be lower. If users mostly edit database creation scripts through a graphical tool, then a single file that reduces file management issues becomes even more attractive.

I know I am probably missing something, so I would appreciate hearing from people on why they prefer one approach or the other. In particular, I would like to hear why they prefer the many small files approach.


This posting is provided "AS IS" with no warranties, and confers no rights.