EDMGen2 – Now with Reverse Engineering options

 


The information in this post is out of date.

Visit msdn.com/data/ef for the latest information on current and past releases of EF.


 

Visual Studio 2008 and .Net 3.5 SP1 come with two tools that are capable of supporting a database-first application design scenario: the EdmGen command-line tool and the New ADO.Net Entity Model item template. These two tools produce Entity Framework metadata artifacts from a database instance, but the generated model and mapping have a one-to-one correspondence with the database. Any inheritance or partitioning relationships that may be naturally present in the schema or data of that instance are not represented in the generated model.

The new version of EdmGen2, recently released on Code Gallery, includes an option “RetrofitModel”. This extension is the result of collaboration with the database team at Microsoft Research, notably summer intern Ankit Malpani. When using this option, EdmGen2 uses data mining techniques to determine if there are TPH, TPT, or vertical partitioning relationships that are inherent in the instance. It then generates a suitable model and mapping in the form of csdl, msl, ssdl, and edmx files. The following figure illustrates one of the patterns that EdmGen2 looks for in an instance, and the resulting model:

clip_image002

To use the tool, launch EdmGen2 with the following arguments:

/RetrofitModel "connection string" "provider" "model name"

For instance, the following will connect to a local AdventureWorks database, and create files and an EDM instance named AVWorks:

EdmGen2 /RetrofitModel "Server=(local); Integrated Security=true; Initial Catalog=AdventureWorks;" "System.Data.SqlClient" "AVWorks"

There is an optional final parameter that can be added. The parameter must be a numeric value between 0 and 1. This parameter represents a threshold ratio of elements in a generated child entity to the number of elements in its parent in a hierarchy. If the ratio of any generated inheritance falls below this threshold, the child entity is effectively removed and merged back into the parent. The tool may identify potential inheritance relationships on an insignificant portion of an entity; this parameter allows the user to define what “insignificant” means. If no parameter is specified, a default value of 0.05 is used.

Internally, EdmGen2 uses a sequence of rules that mine the database for patterns in data and schema that provide evidence of inheritance or partitioning relationships. We first use the System.Data.Entity.Design API to generate a default model. We then identify any entities that represent many-to-many associations and turn them into associations. Finally, we run the following rules, in order, to iteratively refine the model:

  1. Look for TPH patterns by looking for a discriminator column. We look for a column with relatively few active values (currently, 6 or fewer), partition the entity on that value, and determine if there are patterns in null values from the partitioning. We do this pattern twice – the first time assumes that a good discriminator column can be found by following associations, and the second time assumes that the discriminator column is found inline.
  2. Look for TPH patterns by looking strictly for patterns in null values, without a discriminator column. We build a frequent-pattern tree out of a table, using the presence or absence of values in each column as the features to identify. This tree can be built from two passes of the data. We use the frequent-pattern tree to determine if there are mutual exclusion relationships between properties in an entity.
  3. Push down associations in a hierarchy. If entity A has an association with entity B, C is a child of B, and all of A’s association set references instances of C, push the association down to C.
  4. Look for TPT and vertical partitioning using associations. Look for any 1-1 associations that remain after the previous steps. If there is a 1-1 relationship between the instances of the entities as well, merge the entities. If not, create an inheritance link.
  5. Check for abstraction. If, for any entity E, the union of all of the instances of E’s children sum up to all of E’s instances, mark E as abstract.

We have defined rules that can identify TPC relationships and horizontal partitioning as well. These rules were excluded from this release to ensure that all generated models could be viewable in the Visual Studio design surface. In the future, if there is sufficient demand, we will add this functionality to the release. We are also interested to see if there are additional patterns that signify inheritance hierarchies in live production databases that are not covered by these rules, and if so, how we can generalize them for automatic identification.

 

- James Terwilliger, Post Doc researcher, MSR