Why did we only use the MOLAP storage mode in Project REAL? There is also ROLAP and HOLAP storage, but they weren’t used at all. Also, why doesn’t Project REAL illustrate pro-active caching?
The Analysis Services 2005 Performance Guide contains a more complete discussion of the storage modes in Analysis Services in Appendix B. Because the question comes up from time to time regarding Project REAL, this write-up gives a summary.
Every partition in AS has an associated storage mode. MOLAP partitions store aggregations and a copy of the source data (fact and dimension data) in a multidimensional structure on the Analysis server. This is the optimal storage mode for several reasons:
- Compression. When relational data is processed into AS, the storage needed is typically 20% to 40% of the size of the un-indexed relational fact table. Less I/O means that the MOLAP data is faster to access.
- Multidimensional structure. The format in which AS data is stored is optimized for the kind of access that OLAP users do – access to the data using arbitrary dimensions and arbitrary slices of the dimension.
- Data local to AS. There is no need for AS to call out to another service to obtain data when MOLAP storage is used.
- Indexes are not needed on relational fact tables. When AS processes fact data, it generally can do a full table scan, so indexes are not needed. When there are multiple partitions in a measure group, there will be one join to the dimension that is used for partitioning, but that’s all. Then at query time, no requests are sent to the relational database, so no indexes are needed for querying. Given that indexes can multiply the size of data in the relational system, the space savings can be significant.
HOLAP partitions store aggregations in a multidimensional structure on the Analysis server, but do not copy fact data from the original relational database. As a result, whenever Analysis Services needs to resolve a query against fact data stored in a HOLAP partition, Analysis Services must query the relational database rather than using a multidimensional structure stored on the Analysis server. Some administrators choose HOLAP because HOLAP appears to require less total storage space while yielding high performance queries. However, the disadvantages almost always outweigh the advantages:
- When a query “misses” the aggregations AS will have to query the relational database for fact-level data. This is necessarily slower than getting data from the compressed and natively multidimensional MOLAP storage.
- Because of the above, many administrators will create more aggregations to decrease the number of “misses.” At some point the additional aggregations can become larger than the fact data would have been. But even if it doesn’t, more aggregations take more time to process.
- Processing of HOLAP partitions is not significantly faster than processing of MOLAP partitions. In both cases the entire fact table must be read into AS in order to create aggregations. The only difference is whether AS also saves the fact data. However if more aggregations were created because of the previous point, processing will be slower.
ROLAP partitions store aggregations in the same relational database that stores the fact data. ROLAP partitions take longer to process, and because all queries must be redirected to the relational database, ROLAP partitions are slower to query against as well.
Given all this, it sounds like only MOLAP should ever be used! For most implementations that is the case. The exception is when near-real-time access is needed to data that changes throughout the day. Even there, pro-active caching is the first technique of choice. However, if even pro-active caching does not allow frequent enough updates, then the use of one ROLAP partition in the measure group is a way to handle the incoming data.
The recommended way to handle near-real-time scenarios is to take advantage of the fact that the storage mode is set on a per-partition basis. A vast majority of data in an OLAP cube does not change throughout the day. That data should be stored in MOLAP partitions. One partition in the measure group is designated to receive updates as they come in. That way, only that partition needs to be updated. This partition can be updated using proactive caching or it can be defined as a ROLAP partition with zero aggregations. In the latter case, all queries to this partition will be sent to the relational source, and current data will always be as current as that table.
For more details about near-real-time data access, see the section “Near Real-Time Data Refreshes” in the Analysis Services 2005 Performance Guide.
An example of how such a partitioning scheme would be updated might look like this:
- Fact tables contain weekly partitions in the RDBMS, and there are corresponding weekly partitions in AS.
- Each night, data for the previous day is processed into the correct weekly partition.
- As new data comes in during the day, it flows into a relational partition for current updates. AS has a near-real-time partition that maps to that relational partition. Either using pro-active caching or a ROLAP partition with zero aggs, users get access to new data throughout the day.
- At night, the collected daily data is moved to the weekly partition and processed (either by full process or using incremental update). The near-real-time partition is now ready to collect the next day’s updates.
Returning to the original question: Why did we only use the MOLAP storage mode in Project REAL? The simple answer is that MOLAP storage performs the best. The slightly extended answer is that Project REAL is based on the scenario in the Barnes & Noble data warehouse. The business model centered on daily updates, not near-real-time access. Therefore, there was no need for proactive caching or ROLAP partitions.
– Len Wyatt