MBIC Second Day [Other sessions]


Yesterday I attended four more presentations besides the general session:

- Mastering Master Data Management
- Microsoft is Making a Difference At Schumacher Group (Hitachi)
- SSIS Best Practices
- Populating a Kimball method DW with SQL Server & SSIS 2008

Here are some takeaways I can share with you...

I have attended a few MDM-related sessions over the past two years and, honestly, they weren't very good or clear. Maybe I have also matured over those years, but this was the first time I really enjoyed a presentation on the subject. The first presenter, an IT veteran, showed clearly what MDM is supposed to be, how the organization should look at it, the many ways to think about it, and how to get there.
The second presenter talked a little about SHAREPOINT MASTER DATA SERVICES! Yes, you heard that right: they are working on making MDM easier and more feasible, all integrated with SharePoint. There was no public demonstration yet, but the aspects highlighted were really cool. This is one area that will evolve a lot in the coming years, changing not only our OLAP systems but our OLTP ones too (thinking it only touches OLAP is a common misconception). Stay tuned and read about it.

The Schumacher Group presentation was an interesting real-world example of how BI can help accelerate decision making and improve people's lives.

Kimball method and SSIS session: this presentation was a good refresher on some of the Kimball concepts I picked up when I read two of his books almost four years ago. During the presentation, Joy Mundy showed us how to do effective ETL with SSIS, highlighting some caveats we may find along the way and how to avoid them. A nice session filled with real-world experience.

SSIS Best Practices - This was an excellent presentation with deep technical content. I took a bunch of notes that I'll post here; I hope you enjoy them.
NOTE: this is a list of unorganized topics, so as you read, try to pick out whatever makes sense to you. :-)

Measure twice - cut once

Understand your system

  • CPU, memory, I/O and network
  • In-memory pipeline
    • You don't want to touch the file system - disk should only be touched at the extract and at the load
    • SQLIO to measure the disks / they heavily optimized the I/O to get their record numbers / know how your spindles map to LUNs
  • Measure the extract speed of the source
    • Better drivers, driver configuration, I/O and network utilization
    • Understand limits early in design
  • Measure the speed per connection (use the Integration Services log)
    • Rows/sec = row count / time (worked example after this list)
  • Use multiple NICs (a pain to set up)
  • If possible, measure both at the source and at package execution
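
To make the rows/sec math concrete, here is a tiny sketch of the calculation (the numbers are made up just to illustrate the idea):

    -- Hypothetical extract: 12,000,000 rows read in 240 seconds
    DECLARE @RowCount bigint = 12000000;  -- row count reported by the SSIS log
    DECLARE @Seconds  int    = 240;       -- elapsed time of the extract
    SELECT @RowCount / @Seconds AS RowsPerSec;  -- 12,000,000 / 240 = 50,000 rows/sec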

Baseline the package

  • CPU (DTExec.exe / sqlservr.exe)
  • SSIS Pipeline \ Buffers spooled counter (it tells you whenever a memory buffer has to be spooled to disk!)
    • Keep it at zero!
  • SSIS Temp Folder
  • Partition the data so you never see even a 1 in Buffers spooled (12 million rows per run instead of 20 million, for example - see the sketch after this list)
  • SSIS assumes transformations fit in memory!
  • How is the CPU distributed?
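
As a rough illustration of the partitioning point above (the table and column names are invented for the example), each data flow run can extract just a slice of the source so the working set stays in memory and Buffers spooled stays at zero:

    -- Hypothetical source dbo.FactSales sliced by an OrderDateKey column.
    -- Run the same data flow several times, one smaller slice per run,
    -- instead of a single 20-million-row pass that spools buffers to disk.
    SELECT SalesKey, CustomerKey, OrderDateKey, Amount
    FROM dbo.FactSales WITH (NOLOCK)
    WHERE OrderDateKey >= 20080101
      AND OrderDateKey <  20080201;  -- the next run takes the next slice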

Tuning the data flow

  • Optimize the data source
    • NOLOCK, minimize columns (see the source query sketch after this list)
  • Optimize the Lookup transformation (it builds a hash table in memory)
    • Use a query to control the data returned (no SELECT *), NOLOCK
    • Shared lookup cache (build the hash table only once)
    • Large lookups - consider the other caching options
  • Network tuning
    • Increase the network packet size from the default (4K) to 32767
    • Enable jumbo frames on the network (frame sizes of 5000 to 9000 bytes)
    • In their tests this took throughput from 32 MB/s to 46 MB/s
    • Shared memory vs TCP/IP
    • Win 2008 - Network affinity
  • Data types
    • Keep them narrow, avoid casting (do the conversions in the SQL statement), watch for precision issues
    • The many benefits of money (going from double to currency is faster)
    • Memory and CPU alignment issues (be careful with precision)
  • Optimize SQL Targets
    • OLE DB destination (or the SQL Server destination when IS runs on the same box)
    • A heap insert is typically faster than inserting into a clustered index
      • Use heaps for staging
    • Use partitions and SWITCH (see the SWITCH sketch after this list)
    • Set the commit size to 0 (if that's not possible, use the highest value you can)
  • Change the design
    • Don't sort in SSIS
    • Sometimes T-SQL is better (a set-based UPDATE instead of OLE DB row by row - see the UPDATE/MERGE sketch after this list)
      • Large aggregations (GROUP BY / SUM)
      • Use the new MERGE statement
    • Delta detection
      • Rule of thumb - if the delta is > 10%, just reload
    • Do minimally logged operations when possible
      • Is the data flow running in bulk mode?
      • TRUNCATE instead of DELETE
  • Avoid buffer copies
    • The width of a row determines the size of the buffer
      • Smaller rows mean more rows fit in memory
    • Memory copies are expensive
      • SSIS uses pointer magic to avoid them where it can
      • Multicast - instead of doing a full copy, it just references the same buffers
        • A Multicast followed by a row transformation is bad (it forces a copy); change the order and it's good
  • Transforms and buffers
    • Row-based transforms (the buffer is reused) - synchronous
    • Partially blocking (Merge, Merge Join, Union All) - data is copied to new buffers - asynchronous
    • Blocking (Aggregate, Sort, Pivot, Unpivot - do these in the database) - data is copied to new buffers
      • They can still be useful over a small sample of the data (cool)
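
To illustrate the "minimize columns + NOLOCK" advice for sources and lookups, a minimal sketch (the table and column names are invented):

    -- Source / lookup query: name only the columns the data flow needs,
    -- and read with NOLOCK if dirty reads are acceptable for the extract.
    SELECT CustomerKey, CustomerAlternateKey, Region
    FROM dbo.DimCustomer WITH (NOLOCK);
    -- Avoid SELECT * and avoid pulling whole wide tables into a lookup cache.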
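
For the "set-based update instead of OLE DB row by row" and MERGE points, this is roughly what that can look like in T-SQL (dbo.DimCustomer and staging.Customer are hypothetical tables):

    -- Set-based update of the dimension from a staging table, instead of
    -- an OLE DB Command component firing one UPDATE per row.
    UPDATE d
    SET    d.Region = s.Region
    FROM   dbo.DimCustomer AS d
    JOIN   staging.Customer AS s
           ON s.CustomerAlternateKey = d.CustomerAlternateKey;

    -- Or use the new SQL Server 2008 MERGE statement to insert and update
    -- in a single pass.
    MERGE dbo.DimCustomer AS d
    USING staging.Customer AS s
          ON s.CustomerAlternateKey = d.CustomerAlternateKey
    WHEN MATCHED THEN
         UPDATE SET d.Region = s.Region
    WHEN NOT MATCHED BY TARGET THEN
         INSERT (CustomerAlternateKey, Region)
         VALUES (s.CustomerAlternateKey, s.Region);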
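
And for "use partitions and SWITCH" plus "TRUNCATE instead of DELETE", a minimal sketch (the table names and partition number are made up; for SWITCH to work the staging table must match the fact table's structure, constraints and filegroup):

    -- Clear the staging heap before the next load cycle: TRUNCATE is
    -- minimally logged, a DELETE of every row is fully logged and slower.
    TRUNCATE TABLE staging.FactSales_Current;

    -- After the data flow fills the staging heap, switch it into the
    -- partitioned fact table as a metadata-only operation.
    ALTER TABLE staging.FactSales_Current
          SWITCH TO dbo.FactSales PARTITION 13;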

DESIGN FOR PARALLELISM

  • Tenets of scalable computing
    • Partition the problem
    • Eliminate the need for common resources (stateless design)
    • Schedule and distribute it correctly
  • Partition the problem
    • Hash partitions using modulo on an identity(1,1) column (see the sketch below)
    • Range partitions
  • Eliminate common resources
    • Don't land all connections on the same NUMA node
  • Schedule it correctly
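
A minimal sketch of the "modulo on identity(1,1)" idea: each of N parallel data flows extracts a different slice of the source (the table, columns and number of streams are hypothetical):

    -- Stream 0 of 4 parallel extracts; the other packages use % 4 = 1, 2 and 3.
    SELECT SalesID, CustomerKey, Amount
    FROM dbo.FactSales WITH (NOLOCK)
    WHERE SalesID % 4 = 0;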

(To be translated... Help?)

[]s
Luciano Caixeta Moreira
luciano.moreira@microsoft.com
===============================================
This post is provided "AS IS" and confers no rights
===============================================