Azure SQL DW Learning, Expanding

So I've been involved with learning SQL DW by jumping in the water.


Not all functions and methods work the same as they did before, when you look up a function in MSDN/TechNet, make sure you see SQLDW as one of the listed supported database types.

First lesson:

Natural keys, due to the nature of SQL DW going across several servers, you cannot rely on identity ID's anymore.

For example, lets say you have a DateDIM, with all the standard type of columns:

  • Date
  • FiscalYear
  • FiscalMonth
  • FiscalQuarter
  • CalendarYear
  • CalendarMonth
  • CalendarQuarter
  • IsHoliday

Note: No ID column.

The best way to use the above is by using the date column as a natural key, so this would relate to the measure table by date

  • Date
  • Value
  • Status
  • Event

All the values would be natural keys and hence Date to Date.

Yes you learned relational structures and normalization, you learned that for most business apps 3NF was all that was ever needed.

Yet with Azure SQL DW it's a different world:

Azure SQL Data Warehouse is a massively parallel processing (MPP) cloud-based, scale-out, relational database capable of processing massive volumes of data.

What does that mean, it means that several parallel servers process the data and is stored across several servers. Hence why Identity isn't suitable.

Before you say it, I know what you are thinking, GUID!, noooooooooooo!

Even in a traditional SQL database GUIDs were not a good way to create identities, now granted sometimes they were the only way. For example syncing across databases or processes, workflows, etc.. Yet GUIDs caused their own problems, mainly when indexing and searching.

Some applications such as SharePoint solved this problem by creating incremental GUID's, yea odd huh, but they did it, if you ever look at a database created by SharePoint, you will see all the GUIDs are incremental. Does this solve the problem? Well maybe for that application, but keep in mind that's one big key.

Natural keys are a better way to go, well why didn't we use them before? Different reasons, normally the size, a natural key could be a persons name "Jose c Fuentes" is much larger than "1". Think of 4NF and 5NF and think of natural keys and you get why it works better with SQL DW.

Azure SQL DW can process massive amounts of records, the key size isn't a factor like it once was and allows a cleaner relationship of data between measure groups and dimensions.

Second Lesson:

Azure SQL DW can process massive amounts of data, I've had it process 1 billion records in under 20 minutes. Never would have seen this before in a traditional SQL database.

There are limits, transitional limits exist, you may not have a large enough allocation to your SQL DW instance to allow X number of transitions to be processed at a single time. There is no solid number as it really depends on the data, how big the rows are and data types as to how many transactions any given table can be processed on a given SQL DW instance.

The best method I've had to work out issues is to process the data by date over time.

Lets look at a quick example to update a record.

DECLARE @StartDate date
DECLARE @EndDate date
Declare @FeedType varchar(10)
SET @StartDate = '2007-07-01'
SET @EndDate = '4/30/2015'
SET @FeedType = 'Cycles'
-- using a while loop increment from the start date
-- to the end date
DECLARE @LoopDate date
SET @LoopDate = @StartDate
WHILE @LoopDate <= @EndDate
Update [dbo].[DataRead]
Set DataReadDateTime  = cast(DataReadDate as datetime) +  cast(DataReadTime as datetime)
WHERE DataReadDate = @loopDate and DataFeedtype = @FeedType
 SET @LoopDate = DateAdd(d, 1, @LoopDate)

What the above does is process the data by date, it is possible to also do this by year

WHERE DataReadDate BETWEEN '1/1/2007' AND '12/31/2007'

You could loop and add by year instead of day and process by the year.

This will then allow updates to happen by the year and commit those records rather than risk hitting the limit and having it fail.

NOTE: when a transaction issue occurs, it rolls back the whole transaction. So any time spent is lost, with the above loop, days, years can be processed and if a transaction error occurs, simply make adjustments and process by a smaller increment.

Third Lesson:

SQL Management Studio, get the latest and greatest, 16 or 17 works fine.

This probably should have been one of the first lessons, but I figured you would use SQL Management Studio or better yet Visual Studio.

First thing you will notice is you can't use a designer for modifying any tables, views, you have to do everything by SQL.

Trick, build your structure (no data) in VS.NET or using a local SQL database, then simply Script it to the clipboard and paste it in an query window open to SQL DW database in question. Execute and you have your table structures.

Note: you cannot rename as the SP_rename SP is missing, so keep that in mind.

When building out my DIM's I found it easier to do this on the local SQL server, then simply do a Export to the SQL DW. Yes that works! You can also push data quickly into SQL DW using a local database in order to right click the local database (master works) Tasks->import/export data. Doesn't really matter which one you pick. First you select the source (could be a flat file, excel, an SQL server, etc.) then you select the destination (SQL DW), run though the wizard and boom data is migrated.

Don't do this to move billions of rows, but if you have very large CSV files, you can quickly move the data easily enough.

Forth Lesson:

Stage the data, lots of people forget about staging, after all, in todays age of big data and unstructured data, you just throw all the data in a hole and boom magically it's all usable.

Lets face some reality, the machine learning and other tools have to be tuned to figure out what data is in a mess of documents. Throughout this discussion we have not been talking about unstructured data. This has all been about structured data, and since we are dealing with structured data, sometimes pulling in raw data just doesn't mash well with other data that have been imported. Hence the need to restructure and combine. So no you didn't loose your jobs yet to some magic, sales people will show you all kinds of things. Though when you end up with 10 different areas with cities in them and some are spelled wrong, or differently and in slightly different structures or names (towns vs. cities) it gets harder to mash the data together. This is when restructuring the data with common elements plays its greatest role.

Staging the data in place and moving, as you saw above with processing by date/year, you can parse the data, clean it up, add staging columns, and then move the processed data to a new home that makes it easier.

Why? You can just fight with the tools to build reports that are inconstant across different structures or build a structure, populate it and have happy people happy to use the structure for all their reports. One way or another you will spend time, but once restructured you are not spending that time over and over again with the data visualization tools.

That's about all I have to talk about thus far, I've been learning a lot about Power BI, the biggest headache thus far is I can't have multiple columns (as of 5/31/2017) in the legend. Which really bites. Also multi-key structures are not liked, we will talk about that later as well.





Comments (0)

Skip to main content