As part of my research work, I generate thousands of simulations on an almost daily basis for various scenarios. One of my scenarios involves determining strategies for selecting the most profitable short and long stock entry/exit positions. Calculating this involves data mining stock history information, analyzing moving averages of the equities, and correlating parameters associated with the equities to find the the optimal entry and exit position for a given stock. I've found one way to calculate this is using brute force computational power to simulate the trades over historic time periods and measure the results. (There probably is an easier way with advanced calculus using derivatives, etc.). In order to play out the simulation, I need to track the position of the equity associated with the order - whether there is an order pending, if the position has been opened, or if the position has been closed, or if the order was cancelled. For example, I need to measure the dollar value of positions as well as cash in the portfolio to tell how much "buying power" remains to purchase additional equities at a given point of time in the simulation.
I track dates associated with all of the events, so a "normalized" way to do this is simply to look at the date as shown in these views:
I can then index on the cancel-date, open-date, and close-date in combination with other fields to try to optimize the queries. But, why do this, when you only need a single status since each state is mutually exclusive? Based on that, we can create a computed field called OrderStatus (we'll come back to the other computed column "Profit" in a moment) that just has the status as in:
CREATE TABLE [Simulator].[PortfolioOrder]( [PortfolioOrderId] [int] IDENTITY(1,1) NOT NULL, [PortfolioId] [int] NOT NULL, [TradingSymbol] [varchar](50) NOT NULL, [OpenDate] [date] NULL, [CloseDate] [date] NULL, [CancelDate] [date] NULL, [TradingMethod] [char](1) NOT NULL,
[Shares] [smallmoney] NOT NULL,
[FilledPriceOpen] [smallmoney] NULL, [FilledPriceClose] [smallmoney] NULL, ... [Profit] AS ([shares]*
when [TradingMethod]='L' then [FilledPriceClose]-[FilledPriceOpen]
else [FilledPriceOpen]-[FilledPriceClose] end) PERSISTED,
[OrderStatus] AS (CONVERT([tinyint], case when [CancelDate] IS NOT NULL then (0) when [CloseDate] IS NULL AND [OpenDate] IS NULL then (1) when [CloseDate] IS NULL AND [OpenDate] IS NOT NULL then (2) else (3) end,(0))) PERSISTED,
So, how does the computed field help us with performance? First, I need to make sure the computed field is persisted and the caveat on that is that only deterministic values can be computed. Not only that, the computed field must be derived directly from columns in the table, rather than other functions that look at other tables. For example, I have a current value column on my portfolio that is computed based on the value of all of the underlying positions, but that column is not persist-able since it depends on a function that sums the results from the order detail data (I could use an indexed view for this to get around that, but there are quite a few rules and restrictions around indexed views - see my earlier blog on indexed views).
The second thing to do is to make sure that the storage for the computed field is optimal. I wouldn't normally use a floating value for example as a persisted field that gets indexed since floating values are approximate. Also, I want to make sure the computed field uses as little storage as possible, so that my index is as small as possible, thus potentially reducing the number of levels required for a large number of rows. That is why I use the CONVERT function for the order status - by default, SQL would set this to be an integer field which is 4 bytes, but all I really need is a single byte or TINYINT.
Note, that I also use the included columns technique to only include what is needed. I leverage another persisted computed field called profit rather than the all the source fields needed to calculate the profit so that a covering index will work for my queries, thus eliminating the need to do a lookup back to the base table (see the link at end of article for more info on covering indexes). This keeps the included data small, which also contributes to a smaller size index entry.
Using the profit field as an include field along with other relevant fields, and then using the order status gives us the following index, which is just right for the types of queries we need to run in order to perform a stock trading simulation. This allows us to be able to quickly calculate the profit (or loss) and hence current buying power for the portfolio as well as find orders in a given status (such as newly submitted orders with a limit price) that may need to still be opened into a position or open positions that need to be closed to seal a profit.
Index for finding profit of closed orders for portfolio using computed fields:
CREATE NONCLUSTERED INDEX [IX_PortfolioOrder_Status] ON [Simulator].[PortfolioOrder] ( [PortfolioId] ASC, -- 4 bytes [OrderStatus] ASC -- 1 byte ) INCLUDE ( [Profit]) - 4 bytes
Total: 9 bytes
Contrast that with the below index structure that had additional included columns for (Shares, PriceAtClose, PriceAtOpen instead of just Profit) and included the cancel date to filter out invalid orders, or alternatively did not include any of the date fields, but then had to filter after retrieving by PortfolioId.
Index for finding profit of closed orders for portfolio without benefit of computed fields:
CREATE NONCLUSTERED INDEX [IX_PortfolioOrder_Status] ON [Simulator].[PortfolioOrder] ( [PortfolioId] ASC, -- 4 bytes [CloseDate] ASC-- 4 bytes, probably not worth having this as part of index and just incorporate
-- into the include list and filter. ) INCLUDE ( [FilledPriceAtOpen], [FilledPriceAtClose], [TradingMethod], [Shares]) - 13 bytes
Total: 21 bytes
So, how effective is this technique? I"m sorry I don't have hard numbers before-and-after, but I can honestly tell you that my simulations for this application are running about twice as fast now. I believe the main reason is that the number of index levels was reduced, because I was at a threshold due to the number of rows in the Portfolio Orders. Up to a few hundred thousand rows, the performance was similar, however once we cross that number, we require a second index level due to the sheer length of the index when using the . Each additional level in the index requires another I/O to per row lookup.
This brings up another point about indexes and performance. If you experience an overnight-drop in performance on some queries as more data is added, it's probably because one of your indexes just had to add another level to support the binary tree. You'll want to understand how SQL stores indexes to get a better insight on this and other ways to mitigate the issue if merely shrinking the size of the supporting columns is not feasible. Take a look especially at filtered indexing (which we could have used for this scenario, also), where an index can be defined based on horizontal partitioning - i.e. only create this index for rows meeting a specific filter.
This points out the need to carefully consider the lengths of the columns you are indexing as well as the sizes of the fields you include in your include list (at some point it doesn't make sense if you're including every field, may as well forget the list and do a lookup). First, do you really need to use the whole column? If not, you might want to use a computed field to get a substring of what you really need to index. Second, are you using the right data type? If you're shamelessly using GUIDs for primary keys, shame on you, you are wasting 4 times the storage space of a single identity (and not only that creating fragmentation because your inserts are happening all over the tree causing page splits).
I'd like to go deeper into how to calculate number of bytes per row in an index, etc, but don't have time and there are other resources that provide a much better explanation than I can give. For more information, see the following links:
- Computed columns including their use in indexes: http://www.mssqltips.com/tip.asp?tip=1702 - Has great explanation and example, better than mine.
- Covering indexes: http://msdn.microsoft.com/en-us/library/ms190806.aspx
- Indexing tips: http://www.sql-server-performance.com/tips/optimizing_indexes_general_p1.aspx
- Why not to use GUIDs if you don't have to: http://www.sqlskills.com/BLOGS/KIMBERLY/post/GUIDs-as-PRIMARY-KEYs-andor-the-clustering-key.aspx
- SQL Server Storage Engine: http://blogs.msdn.com/sqlserverstorageengine/archive/2006/06/25/646865.aspx - Includes discussion of storage of included columns.
- SQL Server 2008 Internals: http://www.microsoft.com/learning/en/us/book.aspx?ID=12967&locale=en-us (Kalen Delaney, Paul S. Randal, Kimberly L. Tripp, Conor Cunningham, Adam Machanic, and Ben Nevarez)