PreScan: SQL Schema Data Types

Continuing on from my previous post, PreScan: SQL Database Schema Distribution, this scan looks at the datatypes used in the database you need to tune.

Perhaps one of the less useful of the pre-scans, it is still handy. It highlights the use of deprecated datatypes, new feature usage & uncommon design. So it really doesn't matter how many INTs or VARCHARs there are. You are looking to see if the less common datatypes are in use.

It is background info & will only hold your attention for 1-3 mins. But it may alert you to something out of the ordinary. Remember to run it against the Model database first. This is the template for all User databases & holds the System Schema. So if Model holds 2 sql_variant cols then sql_variant is only interesting if there are more than 2 in your database.

Some of the things I look for are :-

NTEXT, TEXT & IMAGE fields Eventually these will need to be converted to NVARCHAR(MAX), VARCHAR(MAX) & VARBINARY(MAX) or they will block a future upgrade. Converting also lets you use TSQL string functions & can remove some messy special code. It might be an opportunity to speed up searches by using Full Text; you could also consider FileStream to improve performance & streamline operations.
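A minimal sketch of the conversion, using a hypothetical dbo.Documents table with a TEXT column called Notes:

```sql
-- Convert the deprecated TEXT column in place to VARCHAR(MAX).
ALTER TABLE dbo.Documents ALTER COLUMN Notes VARCHAR(MAX) NULL;

-- Optional: after conversion the data still sits in out-of-row LOB pages;
-- a self-update pulls values back in-row where they fit.
UPDATE dbo.Documents SET Notes = Notes;
```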

User Defined Types I like these as they reduce the potential for error by ensuring data type consistency throughout the schema. They are especially useful for key columns & commonly used attributes like Surname, Phone & Address fields. But you need to get them right, as changing them later is a nightmare.
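A sketch of the idea, with illustrative names:

```sql
-- An alias type guarantees every Surname column is declared identically.
CREATE TYPE dbo.Surname FROM NVARCHAR(60) NOT NULL;

CREATE TABLE dbo.Customer
(
    CustomerID INT IDENTITY(1,1) PRIMARY KEY,
    Surname    dbo.Surname   -- same definition everywhere it appears
);
```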

SQL_VARIANT This sets off an alarm bell as usually it means some attempt at Object Relational Design or a central Application Parameter table. Both can cause huge locking issues & overhead, so it is often worthwhile to examine closely. Note: The system schema in SQL2008 has 2, ignore these.

UNIQUEIDENTIFIER I’ve never been a huge fan of this 16 byte datatype, but it is handy for storing GUIDs. I have seen people use these for primary keys, yet have no code to deal with the very small but real possibility of a duplicate primary key. Many use them to resolve the issues of replicated systems; I prefer a composite key with a site ID, but can see merit in the GUID approach. Others use it to hibernate objects in their code into the database. Whatever the reason, examine it closely as often there are more efficient alternatives.
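If GUID keys are staying, one common mitigation (sketched here with illustrative names) is to default them to NEWSEQUENTIALID(), which avoids the page splits random NEWID() values cause in a clustered index:

```sql
CREATE TABLE dbo.Orders
(
    OrderID   UNIQUEIDENTIFIER NOT NULL
              CONSTRAINT DF_Orders_OrderID DEFAULT NEWSEQUENTIALID()
              CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED,
    OrderDate DATETIME NOT NULL
);
```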

TIMESTAMP Often this is used to implement Optimistic Concurrency Control, which is basically a way to free up record locks by rolling your own transaction. I feel this approach is inferior to Optimistic Concurrency Control by Value, where instead of looking to see if the timestamp has changed since you read the record, you look to see if any of the values you want to change are different. This increases concurrency as it permits multiple people to simultaneously update different fields in the same row without collision. For that reason I dislike seeing timestamp cols. Note: TIMESTAMP is an 8 byte binary counter which increments each time the row is modified; it has nothing to do with the actual time.
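Optimistic Concurrency by Value boils down to an UPDATE that only succeeds if the fields you read are unchanged. A sketch, with illustrative names:

```sql
UPDATE dbo.Customer
SET    Phone = @NewPhone
WHERE  CustomerID = @CustomerID
  AND  Phone      = @OldPhone;   -- value check replaces the timestamp check

IF @@ROWCOUNT = 0
    RAISERROR('Phone was changed by another user; reload & retry.', 16, 1);
```

Someone editing Address in the same row at the same moment sails straight through, which is exactly the extra concurrency described above.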

FLOAT is rarely useful in business systems due to its lack of precision. Check they aren’t storing financial data in float types. For most applications Financial Transactions should be stored in Money (or SmallMoney), which holds up to 4 decimal places (ie 1/100 of a cent). However Financial Institutions, Interest Rate calculations or Overseas currency amounts should consider Decimal. FOREX rates are quoted in fractional PIPs (5 decimal digits) so don’t use the default precision for Decimal; size it to the precision you actually need & save the bytes.
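For example, a FOREX rate table sized for fractional PIPs (illustrative names):

```sql
CREATE TABLE dbo.FxRate
(
    CurrencyPair CHAR(6)      NOT NULL,  -- eg 'AUDUSD'
    RateDate     DATE         NOT NULL,
    Rate         DECIMAL(9,5) NOT NULL   -- 5 decimal places in 5 bytes;
);                                       -- default-precision DECIMAL(18,x) takes 9
```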

XML Extensive use of XML is neither good nor bad. It is bad if you are trying to put your entire database into 1 XML field. But generally it is underused by DBAs & can be a very powerful, efficient solution when used correctly.
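For instance, XML columns can be queried directly in TSQL, and with an XML index a search like this (illustrative schema) avoids shredding the document in application code:

```sql
SELECT OrderID
FROM   dbo.OrderDetail
WHERE  ItemData.exist('/Items/Item[@Sku="ABC123"]') = 1;
```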

GEOGRAPHY, GEOMETRY There are spatial aspects to this database. 

HIERARCHYID Could be good. Many schemas have recursive structures which often cause pain. If they have converted these to HierarchyID that may be 1 less problem to solve. That said, HierarchyIDs aren’t the magical solution I’d hoped they would be. You do need to manage their updates carefully, which can be a source of data corruption.
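The payoff is that recursive questions stop needing recursive queries; a sketch with illustrative names:

```sql
-- "All reports under a manager" becomes a single indexable predicate
-- instead of a recursive CTE walking the tree.
SELECT EmployeeID
FROM   dbo.Employee
WHERE  OrgNode.IsDescendantOf(@ManagerNode) = 1;
```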

BINARY & BIT Bitmasks & flags suggest a more sophisticated algorithm; most often that is a good thing. It is worth finding out what they are doing. Rarely will this be a bottleneck, but large binary fields can be a workaround for IMAGE, so FileStream might be relevant.

SMALLINT, TINYINT, SMALLMONEY, SMALLDATETIME tell me the DBA was concerned with efficient storage & actually looked at the range of values contained in each field. This boosts your confidence in the design.

DATE, DATETIME2 or some of the other new date types introduced in SQL 2008 suggest the schema was enhanced or designed recently. It is less likely to have the common bottlenecks caused by trimming Time from Dates. But it also raises a flag if the data is being consumed by SSIS or SSAS. These apps will process the higher precision of DATETIME2, but most of their TIME functions do not recognise DATETIME2 as a valid input format. This can be a pain. Similarly all the ParallelPeriod style functions in SSAS & PowerPivot think Date is the lowest granularity. So you need to do your own thing if you want to analyse in Hourly, 5 min, Min or Second.
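The classic trimming bottleneck the new types remove looks like this (illustrative table):

```sql
-- Pre-2008 style: wrapping the column in a function kills the index seek
-- SELECT ... WHERE CONVERT(VARCHAR(8), OrderDate, 112) = '20100601'

-- With DATE, the same filter is a sargable range:
DECLARE @Day DATE = '2010-06-01';

SELECT OrderID
FROM   dbo.Orders
WHERE  OrderDate >= @Day
  AND  OrderDate <  DATEADD(DAY, 1, @Day);
```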

NVARCHAR / NCHAR vs VARCHAR / CHAR Neither is correct, but it is useful to see if they are used consistently. Joins, Filters & Comparisons between ASCII & Unicode types will cause implicit conversions. This may result in an Index Scan rather than an Index Seek. It can be a pain dealing with Sorting & Collation sequences. It slows development as the UI may require different datatypes to hold the double byte characters & often the programmers don’t even think about it. It also may cause problems if your company expands into overseas markets & parts of your application can deal with different characters while other parts break. If your entire database is Varchar / Char then you gain from reduced storage, reduced IO & more data remaining in cache. But you lose language flexibility. Either way make a decision & be consistent. The exception here is tables holding state internal to your application. They could be Char while the rest of the system is NChar, as they will never hold data input by the end user.
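The implicit conversion trap looks like this (illustrative schema with a VARCHAR Surname column):

```sql
DECLARE @Name NVARCHAR(60) = N'Smith';

-- The VARCHAR column is converted to NVARCHAR to match the parameter,
-- which (on SQL collations) turns an index seek into a scan:
SELECT CustomerID FROM dbo.Customer WHERE Surname = @Name;

-- Matching the parameter to the column type restores the seek:
SELECT CustomerID FROM dbo.Customer
WHERE  Surname = CAST(@Name AS VARCHAR(60));
```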

 

Sample of TSQL for this type of Pre-Scan

  

/* =========================================
   Pre-Scan: DataTypes of Columns
   Notes: What to look for :-
   Deprecated types: Image, Text
   New types: XML, Spatial (Geometry/Geography), CLR, HierarchyID, DateTime+
   Challenging types: uniqueidentifier (GUID), sql_variant, binary (what for)
   User-Defined Types: Are they extensive?
   ========================================= */
;WITH cols AS (
    SELECT T.name AS [DataType], C.*
    FROM sys.types AS T
    JOIN sys.columns AS C ON C.user_type_id = T.user_type_id
        -- join on user_type_id (not system_type_id) so User-Defined Types
        -- report under their own name & columns aren't double counted
)
SELECT COUNT(*) AS [Count], DataType
FROM cols
GROUP BY DataType
-- HAVING COUNT(*) > 10
ORDER BY 1 DESC

 

Sample of Output

PreScan Datatype Examples

In the above example I’m comparing the distribution of datatypes in a few production databases.

The Model database is the template included in every database you create. I’ve highlighted in Purple some of the datatypes of interest. In the other databases, if they are completely purple then they aren’t using that type in any of the user tables. Where I’ve highlighted the numbers, you need to subtract Model’s count to calculate the real number.

Legend

Red: Deprecated features Text & Image.
The Orange bands draw your attention to the Float vs Money discussion.
The Dark Green is New feature usage.
Light Green: they considered the range of values the column will hold & didn’t blindly map everything to INT.
Blues: ASCII vs Unicode chars.
Yellow: Explore how these are being used.

Quick Impressions:

Debtrak: Mainly ASCII based app. Need to check the use of Float & whether it holds financial info. What are they using Text & Image for? (Probably to hold Scanned Documents)

Oh: No Money types; how is Float being used? Mostly NChar but not all, so why are the Char types different? What are they using GUIDs for? (possibly as keys for Scanned Images). Lots of Image fields. Is this a CRM, or are they using Decimal instead of Money?

Spinn: UIDs are huge; I bet they are used for all PKs & FKs, bummer. What are the Bit fields for? No Date / Time fields, so probably written pre-2008. More Char than INT fields, probably a CRM or text based system; clearly UIDs could be counted as INTs for this comparison.

ReportServer: Database used by SSRS. I leave it to you to check out why they are using GUID’s, Binary & Image fields. You can bet there will be some redesign when they ship the next SQL release.

AssetTrax: Uses Spatial Fields, Unicode & not much else. Probably a central store for a Mobile GPS system, perhaps it is an optional module for a bigger system.

SRDDBA: No Unicode, mainly Numbers, Dates, Characters. The 17 Numeric types could hold financials but I’d suspect this to be non-financial app, perhaps reference data.

I hope you find this useful. Let me know if you want me to publish more scans.

Dave