Updated: 15 September 2016
This is designed as a high-level overview of the data that comes into SCOM and how to "tune" it. By tuning, I mean adjusting the volume and tailoring the data to only show what's relevant and valuable to YOU and YOUR ORGANIZATION.
So where does agent data go exactly....
When rules/monitors are run on an agent, any data returned first goes to the management server (MS) and is stored in its cache. The management server processes the data in the cache and writes it out to the databases (Operations DB or OpsDB and Data Warehouse or DW). The data is "tagged" so the MS knows where to write it. Data in the OpsDB is "short term" (7 days by default) and is used for displaying in the console. Data in the DW is for long-term reporting, dashboards, etc. and is stored 400 days by default. Tuning can impact what data is written to one or both DBs depending on how the rule/monitor is designed.
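If you want to check the retention (grooming) settings in your own environment, you can query the OpsDB directly. A sketch, assuming the PartitionAndGroomingSettings table that ships with the OperationsManager database (verify the name against your SCOM version):

```sql
-- Run against the OperationsManager (OpsDB) database.
-- DaysToKeep shows the retention period for each data type (alerts, events, perf, etc.).
select *
from PartitionAndGroomingSettings with (nolock)
```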
Where to Start....
Start by going thru the core data types and seeing what is gathered. Think about what data is valuable to you, not just "nice to have."
Alerts
This is what is most visible to the users. Every alert in the console is stored in both the OpsDB and DW. My guideline is to only show actionable, relevant alerts in the console. If you aren't monitoring SQL, don't alert on it. If you don't care that your disk is fragmented, don't alert on it. Alerts should equal problems, not just an annoyance to be closed. Most alerts can be easily tuned by an override, but be aware of "general" alerts (script failed to run) that actually indicate that another rule/monitor failed. I recommend creating a custom alert view in My Workspace that shows open and closed alerts. Group by alert name and see what the top "offenders" are...evaluate whether those alerts are adding value or just noise and then tune accordingly.
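You can also find the top offenders by querying the OpsDB directly. A sketch, assuming the Alertview view and column names commonly used in OpsDB tuning queries (double-check them against your version):

```sql
-- Top 20 alerts raised in the last 7 days, grouped by name.
select top 20 count(*) as AlertCount, AlertStringName, AlertStringDescription
from Alertview with (nolock)
where TimeRaised > getdate() - 7
group by AlertStringName, AlertStringDescription
order by AlertCount desc
```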
See Nathan Gau's post on Alert Management (http://blogs.technet.com/b/nathangau/archive/2016/02/04/the-anatomy-of-a-good-scom-alert-management-process-part-1-why-is-alert-management-necessary.aspx). SCOM is only useful if you view well-tuned alerts!
State Changes
This is the information about state changes (red/yellow/green). Generally this data is legit unless you're having issues with monitors flip-flopping. I suggest you read Kevin Holman's blog on this - http://blogs.technet.com/b/kevinholman/archive/2009/12/21/tuning-tip-do-you-have-monitors-constantly-flip-flopping.aspx
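To spot flip-flopping candidates yourself, you can count state changes per monitor in the OpsDB. A simplified sketch; the view and join names here are assumptions based on common OpsDB tuning queries, so verify them against the linked post and your SCOM version:

```sql
-- Top 20 monitors by number of state changes (noisiest "flip-flop" candidates).
select top 20 count(sce.StateId) as StateChanges,
       m.DisplayName as MonitorName
from StateChangeEvent sce with (nolock)
join State s with (nolock) on sce.StateId = s.StateId
join MonitorView m with (nolock) on s.MonitorId = m.Id
where m.IsUnitMonitor = 1
group by m.DisplayName
order by StateChanges desc
```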
Events
These are events scraped from Event Logs, generally by rules, and stored in SCOM. The easiest way to see what is coming in is to create a custom view in My Workspace. This data is separate from alerts! Look at the data and if it's not relevant (or is simply an informational event) override it to turn it off. One warning: some rules gather multiple event IDs. Check for this to be sure you are comfortable turning off event gathering for ALL those events before overriding the event rule.
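As with alerts, you can rank collected events straight from the OpsDB. A sketch, assuming the EventAllView view and its Number/PublisherName columns (verify against your version):

```sql
-- Top 20 events collected in the OpsDB, by event ID and source.
select top 20 Number as EventID, PublisherName as EventSource, count(*) as TotalEvents
from EventAllView with (nolock)
group by Number, PublisherName
order by TotalEvents desc
```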
Performance Data
This is the tricky one since it's the hardest to "see." In the console, performance data is visible in the performance view (graphs). This data set is sneaky since it's easy to gather too much data. For example, logical disk statistics. If you pull in data every 5min from every logical disk, just think about how much data that is! Do you really need that? Will you ever report on logical disks at that level of detail? If not, keep the rule/monitor but override it to a more realistic interval. I generally like every 6-12hr, but you'll have to make the decision based on your environment. Just dropping it from every 5min to every hour will reduce the data gathered by 92%!
Viewing this data is best done directly thru the OpsDB via queries. If you run this query against your OpsDB, it will give you the performance counters that are generating the most data.
select top 20 pcv.ObjectName, pcv.CounterName, r.RuleName, count(pcv.CounterName) as Total
from PerformanceDataAllView as pdv with (nolock)
join PerformanceCounterView as pcv with (nolock) on pdv.PerformanceSourceInternalId = pcv.PerformanceSourceInternalId
join Rules as r with (nolock) on r.RuleId = pcv.RuleId
group by pcv.ObjectName, pcv.CounterName, r.RuleName
order by count(pcv.CounterName) desc
The data sources I see most often are below:
| Object Name | Counter Name | Collection Rule |
|---|---|---|
| SQLSERVER : Database | DB Avg. Disk ms/Read | Microsoft.SQLServer.2008.Database.DiskReadLatency.Collection |
| SQLSERVER : Database | DB Avg. Disk ms/Write | Microsoft.SQLServer.2008.Database.DiskWriteLatency.Collection |
| Health Service | agent processor utilization | Microsoft.SystemCenter.HealthService.SCOMpercentageCPUTimeCollection |
| SQLSERVER : Database | DB Avg. Disk ms/Read | Microsoft.SQLServer.2012.Database.DiskReadLatency.Collection |
| SQLSERVER : Database | DB Avg. Disk ms/Write | Microsoft.SQLServer.2012.Database.DiskWriteLatency.Collection |
| SQLSERVER : Database : File Group : DB File | DB File Allocated Space Left (%) | Microsoft.SQLServer.2008.DBFile.FileAllocatedSpaceLeftPercent.Collection |
| SQLSERVER : Database : File Group : DB File | DB File Allocated Space Left (MB) | Microsoft.SQLServer.2008.DBFile.FileAllocatedSpaceLeftMB.Collection |
| SQLSERVER : Database : File Group : DB File | DB File Free Space (%) | Microsoft.SQLServer.2008.DBFile.SpaceFreePercent.Collection |
| SQLSERVER : Database : File Group : DB File | DB File Free Space (MB) | Microsoft.SQLServer.2008.DBFile.SpaceFreeMegabytes.Collection |
| Network Interface | Bytes Total/sec | Microsoft.Windows.Server.2008.NetworkAdapter.BytesTotalPerSec.Collection |
| SQLSERVER : Database : File Group | DB File Group Allocated Space Left (%) | Microsoft.SQLServer.2008.DBFileGroup.FileGroupAllocatedSpaceLeftPercent.Collection |
| SQLSERVER : Database : File Group | DB File Group Allocated Space Left (MB) | Microsoft.SQLServer.2008.DBFileGroup.FileGroupAllocatedSpaceLeftMB.Collection |
| SQLSERVER : Database : File Group | DB File Group Free Space (%) | Microsoft.SQLServer.2008.DBFileGroup.SpaceFreePercent.Collection |
| SQLSERVER : Database : File Group | DB File Group Free Space (MB) | Microsoft.SQLServer.2008.DBFileGroup.SpaceFreeMegabytes.Collection |
| SQLSERVER : Database | DB Active Connections | Microsoft.SQLServer.2008.Database.ActiveConnections.Collection |
| SQLSERVER : Database | DB Active Sessions | Microsoft.SQLServer.2008.Database.ActiveSessions.Collection |
| SQLSERVER : Database | DB Active Requests | Microsoft.SQLServer.2008.Database.ActiveRequests.Collection |
| SQLSERVER : Database | DB Allocated Size (MB) | Microsoft.SQLServer.2008.Database.DBSize.Collection |
| SQLSERVER : Database | DB Total Free Space (%) | Microsoft.SQLServer.2008.Database.DBSpaceFreePercent.Collection |
So now what?
Once you've done the first pass at tuning, let SCOM rest for a week or so and then repeat the process. Tuning is an incremental, iterative process. Even the best-maintained environments should do a quick evaluation on a regular basis....
*One note: If you've engaged Microsoft (or a consultant) to help tune SCOM, don't assume a few days of solid work will be the end of this item. Since the process is iterative, it's better to have multiple short sessions spaced a few days apart. Tune, wait, repeat! I often do ~2-3hr weekly calls for 3-4 weeks with customers. This isn't something you can "blitz" through...so set your expectations accordingly.