Data load is one of the most important processes in databases. However, relational databases are not optimal for loading complex relational data.
JSON can be used to improve performance and reduce complexity in data load process if you serialize some entities as JSON collections. In this post we will see how you can use JSON columns in data load process.
Problem – loading data in relational structures
In some cases relational model is not perfect for loading data. Imagine that you need to load set of related entities, such as customers, their attachments, orders/order items. You might end-up with complex multistage process of data load because related entities cannot be inserted until parent rows are saved.
The simplest structure with two related tables is shown on the following diagram:
In this example we have one primary table with sales orders (SalesOrderHeader) and child table (SalesOrderDetail) that contains related order items. Load works fine if you need to load external data into single table – just use some kind of bulk load and everything will be fine.
However, what happens if you need to load sales orders with related order items? If Orders table has an auto-generated primary key (e.g. IDENTITY or GUID), you would need to load one sales order, read generated value of primary key, use this value to load related order items.
Now imagine more complex structure such as customers, their attachments, orders/order items, etc. You will need to import one customer, read customer id, use that customer id to insert related attachments, use customer id to import one sales order, read generated order id to import related order items, then proceed with next order, and when you finish proceed with next customer. Imagine all these locks, read/writes, and multiple active result sets for every customer record.
There are tricks that can help you such as if you have some column that represents alternate key you might simply load records and connect them using these keys, and then update foreign keys with a separate process, or you might use IDENTITY INSERT ON/OFF to generate ids from application layer. This is not perfect solution, and if you cannot use these approaches you might have very slow import process.
When you need to load complex relational data, you might wish to have some NoSQL-like database where entities are represented as complex rich-format documents and where you simply insert one document with all related information with a single action. However, you don’t need NoSQL databases if you use JSON in Sql Server.
In this example we will talk about loading related information via staging column. Staging column is similar to staging tables – it is a temporary column that contains collection of data related to primary table. This column enables us to simplify load process. We can load information in two phases:
- Phase 1 – bulk load sales order data with a collection of sales order items into staging table
- Phase 2 – move data from staging column into child table
Phase 1: Loading related data using staging column
SQL Server 2016/Azure SQL Db enables you to store related/complex data structures as JSON documents. You just need to define text columns and add JSON constraint on them:
ALTER TABLE Sales.SalesOrderHeader
ADD OrderItems NVARCHAR(MAX)
CONSTRAINT [SalesOrder items must be formatted as JSON array]
In this example we have added a column that will contain order items for some sales order, instead of using separate table. There is no new syntax other than one additional built-in function that validates that JSON text is properly formatted (this check constraint is optional – you can use just a plain text column). Note that there is no custom data type – JSON is stored as a plain text. Since there is no custom binary format, plain text is optimized for data load – you can format related information as JSON string in app layer and directly load it into text columns. There are not additional parsing or format conversion – just load text as-is.
Now you can load every order and related order items with a single insert or bcp to single table (under assumption that you have formatted related records as JSON). As a result you might get the following content:
If you load this result set into a table with additional column that will contain related order items formatted as JSON, you don’t need to care about generated primary keys since you don’t update separate order items table. If you use some kind of bulk import it is the best way to load your data and it matches ideal case – loading data into a single table.
Note that you can create column without JSON constraint if you want faster load process.
Phase 2: Move information from staging column into child table
In this phase, we are using JSON column as a “staging column”. In the second stage you can “distribute” related information from a staging column into target related tables (i.e. Sales.SalesOrderItems in our example). Query that moves order items from JSON array into separate table is shown in the following script:
INSERT INTO Sales.SalesOrderItems (
OrderId, CarrierTrackingNumber, OrderQty, UnitPrice, UnitPriceDiscount, LineTotal) SELECT Sales.SalesOrderHeader.OrderId, CarrierTrackingNumber, OrderQty, UnitPrice, UnitPriceDiscount, LineTotal FROM Sales.SalesOrderHeader
CROSS APPLY OPENJSON (OrderItems)
OrderQty int '$.Item.Qty',
UnitPrice float '$.Item.Price',
UnitPriceDiscount float '$.Item.Discount',
LineTotal float '$.Item.Total'
Primary keys in SalesOrder table are already generated during the initial load and available. OPENJSON will unpack JSON array of orders from the OrderItems column, create inline table for each cell using the schema and mappings in the WITH clause. this query will attach orderid column from a parent row to this table using CROSS APPLY operator. Then this result set can be directly inserted in SalesOrderItems table.
Once you load data you can clean or drop staging JSON column or leave it for next load.
New JSON feature in SQL Server can help you to simplify your load process and gain better performance. Storing data as JSON might improve your load performance in some cases. If you have complex structures that should be loaded, and you have problem with parallel inserts in different tables, lock, etc., storing related info in JSON format might help you. Instead of loading data in a set of relational table, you can store all complex entities (arrays, objects) in JSON text column, insert everything in a single table, and then choose do you want to keep related JSON data or move it to child tables in the second stage.