Indexes and SSDs and Random versus Sequential Access

If you’re one of the dozen or so people who follow my blog, you’re aware that I’m using SSDs for my doctoral research project involving simulations to find solutions to complex data correlation problems.  I finally have started to get down to the level of really looking at various query plans.  What I’ve found is that SQL often makes the decision to use a sequential scan plan rather than an index seek as the selectivity for a table decreases.

Decreasing the level of selectivity makes a lot of sense for traditional storage devices due to the latency costs associated with index seeks.  However, SSDs are a game changer as the access speed go down to near-zilch.  Recently, I have uploaded several months of intraday rows all the way back to January, 2009.  In my end-of-day history table, I recently added a field which I compute from the intraday data in order to speed up correlation processing for multiple queries that need to compare the intraday trading patterns.  The field will be maintained as part of a daily job that does a bunch of aggregations (you can find some earlier posts on the subject of why I sometimes chose to do this type of stuff for my database, rather than relying on queries or using indexed views).

In any case, There are 21,462,159 15 minute intraday rows in my database since January 4, 2010.  Along with this there are 2,086,605 daily history rows since January, 2010.  There are typically about 30 intraday 15 minute bars per equity per day.  Not all of the daily history rows have associated intraday rows.

My first attempt at updating the Intraday data was as follows:

update dbo.EquityHistory
set DayIntervalCount = (select COUNT(*) from dbo.EquityIntraday ei
where ei.MarketDate = EquityHistory.MarketDate
and ei.PeriodLength=15
and ei.TradingSymbol = dbo.EquityHistory.TradingSymbol)
from dbo.equityhistory
where EquityHistory.MarketDate > '20100101'

After 15 minutes, and not very impressive throughput on the SSDs, I cancelled the query.  The sub-plan that was used to feed the rows into the update is shown in the below graphic:

image

Notice, the Index Scan rather than seek on the Intraday rows, despite the fact that there is an index on Intraday containing the symbol, period length, and marketdate – all the fields involved in the search predicate.  However, the 20 million rows represent about 25 % of the total number of rows in the Intraday table – enough for SQL to say forget the index and decide to read the table sequentially.

My intuition was that if the index was used, this should happen quickly, to try this out, I experimented with lowering the granularity of the update such that a smaller percentage of rows were selected from Intraday.  Based on this, I found the maximum range I could use in order to get a scan on Intraday was about 6 trading days.  So, changing the query to:

update dbo.EquityHistory
set DayIntervalCount = (select COUNT(*) from dbo.EquityIntraday ei
where ei.MarketDate = EquityHistory.MarketDate
and ei.PeriodLength=15
and ei.TradingSymbol = dbo.EquityHistory.TradingSymbol)
from dbo.equityhistory
where EquityHistory.MarketDate between '20100101'  and ‘20100110’'

gave me the below query sub-plan.

image

 

Executing this update for the 6 days only took 5 seconds.  So, doing the math, there are about 144 trading days since January 1.  Divide 6 into 144 gives us 24.  24 times 5 = 120 seconds.  Based on that, in theory at least ,the update should only take around 2 minutes using this plan.

At this stage, I could have used the sp_create_plan_guide something like below, but took the quick and dirty way and just put the hints in the query.

EXEC sp_create_plan_guide
    @name = N'Update_History_IntradayCounts_FromIntraday',
    @stmt = N'update dbo.EquityHistory
set DayIntervalCount = (select COUNT(*) from dbo.EquityIntraday ei
where ei.MarketDate = EquityHistory.MarketDate
and ei.PeriodLength=15
and ei.TradingSymbol = dbo.EquityHistory.TradingSymbol)
from dbo.equityhistory
where EquityHistory.MarketDate > ''20100101'' ',
    @type = N'SQL',
    @module_or_batch = NULL,
    @params = NULL,
    @hints = N'OPTION (TABLE HINT ( ei, INDEX( IX_EquityIntraday_DateSymbolPeriodLengthInterval ), FORCESEEK ) )';
GO

So, what was the result of using the hint to force a seek on the index instead of a scan for retrieving the 20 million rows to update 2 million aggregation rows?

image

Update time comparison on SSD Storage

Using query hint: 2:14
Sequential Access: 10:26*
Benefit Ratio: 5 to 1

*First time, the sequential took over 15 minutes before I cancelled it, second time tested, it finished much faster, probably due to data caching.

As you can see, changing the query plan was effective.  So, you might ask: Would changing the query plan have the same benefit for a database not on SSD – is this just a SSD-issue?  I just happen to have a copy of the database on non-SSD storage.  So, here are the update times using both methods on that database:

Update time on non-SSD storage

Using query hint: 4:50
Sequential Access:  17:29
Benefit Ratio: 4 to 1

So, we see that the query hint on non-SSD was also more effective, but, it was more effective on SSD (5 to 1 benefit versus 4 to 1 benefit)

There are a couple of implications for this related to SSDs:

1) For large operations that end up using sequential access, you may need to revisit the query plan and tweak them to be seek-based to get the best performance.

2) SQL Server is a cost-based optimizer, not results-oriented.  SQL doesn’t factor in what type of storage your rows are located on, it will use the same cost calculation techniques whether the data is on SSD, on a RAM device, on a high-speed 15k SAS drive, or on a 5400 RPM laptop drive.  It also does not factor in database caching – i.e. with a large amount of memory, the data may already be in memory, making random access almost as fast as sequential access.  This type of issue makes the argument for a metrics-based optimizer for SQL – one that actually looks as the results of using different query plans and factors this in when determining a plan, rather than simply the estimates of performance.

 

Technorati Tags: SSD,SQL Server Performance,SQL Server 2008,SQL Server Optimization,Query Hints