Using Checksum to create random data sets

Here’s a quick tip if you don’t want to try to use random and figure out seed, etc.  If you’ve got a lot of columns in a table, it is pretty likely that the checksum for all those data values are going to be random across your primary keys.

I recently needed to create a 5% random sample of my equities, so just ran the following query:

select TradingSymbol
from dbo.EquitySymbol
where EquityType=0
and CHECKSUM(*) % 20 = 0

That got me 367 out of about 7300 possible rows.  If you’re doing random sampling, I suggest the following table structure.

image

Then, you can create a random sample by inserting a sample ID into Data Sample and then insert from your selection into the DataSampleValue table:

insert into olap.DataSample select 'Random Sample 5% for EquityType=0'
select @@IDENTITY

insert into olap.DataSampleEquity
select 18, TradingSymbol
from dbo.EquitySymbol
where EquityType=0
and CHECKSUM(*) % 20 = 0

select * from olap.DataSample

 

(No column name) DataValue
18 AAII
18 AAON
18 ABC
18 ADGF
18 ADLS
18 ADP
18 AEC-B
18 AFCB
18 AFFM
18 AIMC
18 AIR
18 AKT
18 ALLB
18 ALZM
18