How should database models be persisted?


The conventional way to persist the semantic data in a graphical database design/modeling tool is as a single binary file. This is easy to implement but does not mesh well with the text oriented tools developers typically use (e.g. editors, source code control systems, file diff and file searching utilities). In addition, if something goes wrong recovering your data is difficult. Even if the file is persisted as XML, it requires users to understand and work with a new file format. I think a better way to persist semantic information is as a small number of T-SQL creation scripts. Information, like graphical layout, that has no T-SQL equivalent should be stored in a single XML file. This allows users to edit, search and compare versions of the database model using standard tools. Most importantly, users can easily resolve change conflicts using standard tools since the model is stored in a natural and familiar form (i.e. T-SQL).


 


This approach worked well in the design of the new Visual Studio 2005 class designer. The main issues we encountered in the design was the cost of parsing a large amount of source at startup and the problem of what to do with class shapes that no longer mapped to classes in the source.


 


We solved the problem of parsing large amounts of source by careful design and performance engineering as well as developing an infrastructure that delayed loading data until it was actually required. Delay loading data is a common technique in distributed systems but unusual in modeling tools.


 


If users change, say, a class name without the class designer open then, when the class designer is next opened, the corresponding class shape needs to be mapped to the new class. Because classes do not change much, we can usually match the class shape to the renamed class. Because several classes may be changed, we developed an efficient algorithm that makes the best overall choice. However, sometimes there is no matching class. In those cases, we show the class shape in a special way and the user can either manually match the class to the corresponding class shape or just delete the shape. The approach is not perfect but the pragmatics of how people develop software means that it works almost all of the time without affecting existing development processes.


 


Weighing against this view is that database modeling customers accept the idea of T-SQL creation scripts as something that a tool generates instead of the fundamental objects that the tool manipulates. Even for those customers, I think the ability to have several people working on the model at once and less data loss will be compelling advantages.


 


Given the desire to persist database information as natural T-SQL scripts, how should the T-SQL scripts be structured? There seems to be two basic ways that T-SQL developers factor their scripts—as several large files containing scripts for related objects, or many small files containing scripts for individual objects. I prefer the approach of files containing several related objects but plenty of smart people favor the second approach. I know of one team, at Microsoft, where even the primary key creation script for a table gets its own file. However, I have never heard a good explanation as to why some prefer many small files.


 


My theory is that the small files approach is a historical artifact of having to work with primitive, or nonexistent, source code control systems that did not allow simultaneous changes to a single file. Without modern tools, the cost of integrating changes would be large enough that many small files would make a lot of sense. With modern tools, the benefits of easy navigation between related objects outweigh the cost of integrating changes.


 


From a tools perspective, it generally best to generate only a small number of files since the cognitive burden on the user will be lower. If users mostly edit database creation scripts through a graphical tool then a single file that reduces file management issues becomes even more attractive.


 


I know I am probably missing something so I would appreciate hearing from people on why they prefer one approach or the other. In particular, I would like to hear why they prefer the many small files approach.


  


This posting is provided “AS IS” with no warranties, and confers no rights.


 


Comments (3)

  1. LeeB says:

    Hi Anthony,

    I believe there is a lot to be gained from using XML as the persistence format for database models. An XML model is perfect for post-processing to generate documentation and using as the input to code generation tools, etc. I would not like to even consider using raw T-SQL as the input to code gen!

    As you described above, a T-SQL model would need an additional file (XML) to hold designer information that couldn’t be stored as T-SQL. Why create two files when an XML file could capture everything?

    The other downside I see to using T-SQL is that it ties the designer to SQL Server. This may of course be intentional but it would prevent models from using constructs which can not be captured in T-SQL.

  2. Thanks for your feedback Lee. Using T-SQL creation scripts as a persistence mechanism does not prevent you from exposing a rich object model. Nor does it prevent you from generating XML reports, database change scripts, etc.

    The nice thing about the DSL framework that the class designer is built on is that you can easily link DSLs, via their object models, together to do interesting things. For example, from a database model you could link a class diagram that represented the LINQ classes to access the database. As you update the table definitions the classes could be automatically updated.

    This posting is provided "AS IS" with no warranties, and confers no rights.

  3. David says:

    This is not really on the point, but nevertheless 😉

    I think there is a key difference how SQL databases and normal source code is dealt with during development. With normal sourcecode I might save a history of that code, but that history is not relevant to produce a build. A specific version of the source code is all that is needed to build the binaries for that version. While one could treat creation scripts for a database the same way, ie just say that a snapshot of the database structure can be stored as SQL creation scripts and those scripts can then be used to create the database, I think this ignores the major, major issue that you need change scripts that change deployed versions of your database to a new version of your database structure. I believe any modelling tool for databases should have a very strong answer to this issue, and integrate this very well into the source code control system. I would think something like this would do it (but I might be wrong): For every version in the source code system have create scripts that generate a new, empty database. Also, for every version have change scripts that modify the last publicly released database schema to the new version, while preserving the data in the database.

    Gee, this is a really crappy description of a really complicated problem. I hope you get what I mean 😉 I just have the feeling that EVERY program that uses a SQL database will face the problem that the structure of that database will need to change with a new version of the program. Right now developers have to do every step of that by hand: Create the change scripts, make sure they are robust, integrate all that into setup etc. I feel a modelling tool for that space that integrates with source code control and release management would be great. I also feel that a tool that can just capture a snapshot of a database structure would not be adequate for that….

    I am very excited about this blog, by the way!