I confess lack of depth in SQL Analysis Services (SSAS). I’ve read quite a bit about the capabilities including data mining, and played with a couple of the wizards, but just haven’t had time to immerse deeply into end-to-end scenarios. Based on that, I’m writing this post with a big disclaimer – You might be able to do some of what is in this post easier with Analysis Services – I don’t really know yet. You’ll have to forgive me for ignorance here.
Although SQL Analysis Services provides a lot of capability, it does seem that you will get better results if you "prep" the data before-hand in SQL, particularly the sources for the dimensions. Therefore this post may be useful even if you are highly leveraging SSAS. In my scenario, it would be useful to reduce the range of values associated with the variables, due to the sheer size of the data sets, the multiplicative effect with combining different variable permutations, and for simplicity of the analysis . Basically, this means to summarize ranges of values into different groups and then analyze correlations of the grouped values of the variables – hence the "NTILE" title of this article. Think of it as assigning grades rather than point scores, so you can identify the "A","B" etc students without getting immersed in the point distinctions. Once you have the students grouped by grade levels, then you can much easier do correlative analysis – i.e., how much time do "A" students spend on homework versus ""C" students, as opposed to analyzing the homework correlation with students with 93/100 do compared to 74/100 point-scaled graded students.
First a little bit about the application – My requirements include the ability to correlate various factors about an equity – it’s earnings per share (EPS), Price-earnings growth (PEG), Sector, period of performance etc. to indicate if any of these have to do with profitability of a stock purchased at a particular time and when it should be sold.
Hopefully, that provides enough background so some of this will make some sense.
So for my scenario, some what-if questions: Do certain sectors or industries perform better during certain periods in relation to other stocks? How does the EPS affect trading profitability? What about market capitalization? Do the combination of these have an impact? For example, is it more profitable to invest in stocks with small capitalization in some industries during certain times of the year?
Essentially, it would be nice to be able to take all of these factors, generate permutations, and then analyze simulations of stock trades for a period of time to see if any combinations are effective to predict how successful a strategy will perform. Now, if we were to just take all of the combinations of these, the multiplicative effect of the permutations would be unmanageable and more than the "fact" data since each the precision on many of these values is large enough to generate hundreds if not thousands of distinct values. "Permutating" all of this together produces a lot of interesting but meaningless data.
The NTILE feature allows assigning a group based on an interval to a set of data. One of the variables I want to NTILE is Earnings-Per-Share EPS). Another is market capitalization. Another could be Sector. And finally, the period. By NTILE, I mean grouping into categories, so for example, every stock would fall into just 1 of 5 different tiles or ratings for EPS, or for market cap, etc. You can see how by reducing these into groupings, we greatly simplify the analysis and reduce the number of permutations while still having enough granularity to differentiate the groupings. We just have to figure out a way to map the original "fact" data values to their groupings (grades). Once we have the mapping, we can join the tiled dimensional data back to the fact data. Yes, I know this is sounding more like a SSAS scenario, but remember I already gave my disclaimer… I’d welcome responses back from the BI experts on how to approach this scenario using SSAS.
So, here are some steps to accomplishing this:
1) Use computed fields to round the fact data into more discrete values, particularly if the original values are not precise, such as floating point. For example, if we are only need 5 tiles, we probably don’t need to map down to the .01 decimal precision. If we instead round down to just the .1 value, our tiling should still be fairly accurate. This reduces the amount of discrete values in the fact data that need to be tiled and reduces the size of the mapping table. Below is an example of an Equity Information table with the computed rounded fields
CREATE TABLE [dbo].[EquityInfo]( [TradingSymbol] [varchar](25) NOT NULL, [Description] [varchar](50) NULL, [IndustryId] [smallint] NULL, [SectorId] [smallint] NULL, [Shares] [bigint] NULL, [PERatio] [float] NULL, [EPS] [float] NULL, [DivYield] [float] NULL, [DivPerShare] [float] NULL, [PEG] [float] NULL, [PriceToSalesRatio] [float] NULL, [PriceToBookRatio] [float] NULL, [ExchangeName] [varchar](10) NULL, [EnabledFlag] [bit] NULL, [DateUpdated] [date] NULL, [DateCreated] [date] NULL, [SampleName] [varchar](50) NULL, [TotalBookValue] AS (case when [PriceToBookRatio]>(0) AND [Shares]>(0)
then [Shares]/[PriceToBookRatio] end), [TotalSalesValue] AS (case when [PriceToSalesRatio]>(0) AND [Shares]>(0)
then [Shares]/[PriceToSalesRatio] end), [PEG_Rounded] AS (CONVERT([smallmoney],ROUND([PEG],1)))
PERSISTED, [EPS_Rounded] AS (CONVERT([smallmoney],ROUND([EPS],1)))
PERSISTED, [PE_Rounded] AS (CONVERT([smallmoney],ROUND([PERatio],1)))
PERSISTED, [PTB_Rounded] AS (CONVERT([smallmoney],ROUND([PriceToBookRatio],1)))
PERSISTED, [PTS_Rounded] AS (CONVERT([smallmoney],ROUND([PriceToSalesRatio],1)))
PERSISTED, [Shortable] [bit] NULL, [HistoryReloadRequired] [bit] NULL, [HistoryReloadedDateTime] [datetime] NULL, CONSTRAINT [PK_EquityInfo] PRIMARY KEY CLUSTERED ( [TradingSymbol] ASC ))
2) Create a view that tiles the values for each of the dimensions of interest. For example, for EPS, we would have the following view:
CREATE VIEW [Olap].[Vdim_EPS_Tiled] AS SELECT NTILE(5) OVER (Order BY EPS_Rounded) as EPS_Tile, EPS_Rounded FROM dbo.EquityInfo WHERE EPS_Rounded Is NOT NULL
3) Generate a table from the view to contain cross-reference the tiles back to the source fact data. The table can be generated through a SELECT INTO and then maintained via a MERGE after we add a primary key. We need to use Select DISTINCT because our first view will contain duplicate EPS_Rounded values.
select distinct * into olap.tdim_EPS_Tiled from Olap.vdim_EPS_Tiled;
ALTER TABLE Olap.tdim_EPS_Tiled ADD CONSTRAINT PK_tdim_EPS_Tiled PRIMARY KEY CLUSTERED (EPS_Rounded)
CREATE PROCEDURE olap.Update_tdim_EPS_Tiled AS BEGIN MERGE INTO olap.tdim_EPS_Tiled AS T USING olap.vdim_eps_Tiled AS S ON S.EPS_Rounded = T.EPS_Rounded WHEN MATCHED THEN UPDATE SET EPS_Tile = S.EPS_Tile WHEN NOT MATCHED BY TARGET THEN INSERT (EPS_Tile, EPS_Rounded) VALUES (S.EPS_Tile, S.EPS_Rounded) WHEN NOT MATCHED BY SOURCE THEN DELETE; -- Might not want to do this if there is history involved END
Now, whenever I want to analyze performance of an equity related to a simulation, I can easily join to the tdim_EPS_Tiled table on the EPS_Rounded in order to derive the tile. I can then group these, average the profitability from the simulations and evaluate the degree of correlation. This should work well with a cube in SSAS as well to allow correlation and predictive data mining. I’ll be trying that out as soon as I get through the SSAS tutorial.
The below diagram illustrates the process flow including the automation aspect.
First we reduce the number of values to map, so that instead of mapping the entire universe, we map a rounded representation. We also store it as a precise data type (not floating). This makes our cross-reference tables smaller. This introduces some complexity when we want to map from the fact data back, but this is simply remedied by creating computed fields on the fact data that map back to the source.
Next, we create "Ntile" views of the grouped data. We can now join back from our source data to find the grade that a particular equity is associated with. This allows us to do iterative simulation. By iterative simulation, I mean taking the results of one pass of simulations as input into a higher layer of simulations. For example, if my simulations show a positive correlation for a particular EPS grade, then I can create another simulation that filters just on the EPS grade to drill-down and better explore the correlations from that. good article about iterative and non-iterative simulation algorithms on this link is http://www.stat.columbia.edu/~gelman/research/published/itsim_interface.pdf
You might be wondering why not just use indexed views – I tried that and couldn’t get it to work, there are limitations on what you can index when it comes to aggregation and apparently using a function like NTILE is one of them.
Let’s all the time I have for this today. You might be thinking this is neat, but what about all my "what-if" questions? How do we query our dimension data and link back to our fact to find the correlations. I have made some interesting correlation discoveries, but I still have a lot of work to do here. I guess this is where I really need to start using SSAS because the queries to actually find the correlations become very complex — Time to get out the manual and go through the tutorials… Time to start playing with the data mining wizards…