Starting with SQL 2012 you can easily trace Resource Waits using Extended Events. Unfortunately the duration for these waits is returned in milliseconds and not in microseconds. I see a problem here: What if the duration is below 0.5 ms it will be rounded down to 0. In an environment with a high frequency of tiny calls you might not see the bottleneck if you sum up the short duration of wait events.
For this reason I still use the following script to trace an entire system for its bottleneck. It shows the sum of resource_waits from the last 10 seconds.
Drop Table trace
select *,getdate() as 'timestamp',1 as T1 into trace from sys.dm_os_wait_stats
waitfor delay '00:00:10'
Insert into trace Select *,getdate(), 2 as T2 from sys.dm_os_wait_stats
Select t1.wait_type,t2.wait_time_ms-t1.wait_time_ms as 'OtherWait', t2.signal_wait_time_ms - t1.signal_wait_time_ms as 'CPUWait' from trace t1 join trace t2 on t1.T1 = t2.T1-1
and t1.wait_type = t2.wait_type
order by t2.wait_time_ms-t1.wait_time_ms desc
Btw. SOS_Scheduler_Wait shows no bottleneck. It tells you that the task steps back from the cpu when its quantum of time is expired to allow other tasks to use the cpu. That’s called preemptive scheduling. If you see this wait_type only it means you have a cpu intensive (lots of instructions) workload. In this case you can try to parallelize the workload by manually splitting the data you process into smaller sets and run your workload in parallel connections. (If you do ETL Tasks you should use Integration Services for that.) Or you can try to reduce the cost threshold for parallelism (but that’s a server wide sp_configure setting). Also make sure your statistics are up to date so that the optimizer can estimate the cost of a query correctly so that the engine can decide to parallelize your query.
(Singlethreaded Applications and the application architecture running many tiny queries instead of set based workload are the main reason for performance related problems nowadays. You can invest in expensive hardware (high frequency CPU Clock, large CPU Cache, infiniband network) or you can change your architecture to parallelize queries and run larger chunks of work.)
To reduce the number of instructions you can also use native compiled sql procedures and SQL2014 inmemory tables.