A couple of clarifications, including the potential for backup throughput on this hardware well beyond 3 GB/s, up to 9 GB/s:
1) Contrary to the video, 10 GbE is not the fastest practical Ethernet for server-to-server communication. 40 GbE network cards and switches are available and affordable. It is feasible to achieve 5 GB/s using a PCIE Gen 2 40 GbE card, or using a suitably configured 4-port 10 GbE adapter. Mellanox produces a Gen 2 PCIE card supporting 40 Gb/s Ethernet and 32 Gb/s (actual) QDR InfiniBand, as well as a Gen 3 version with FDR InfiniBand throughput of 56 Gb/s (54.54 Gb/s actual), providing close to 7 GB/s of bandwidth. See http://www.mellanox.com/content/pages.php?pg=infiniband_cards_overview&menu_section=41.
However, just three x8 Duo2s, which can be hosted in many servers as small as 2U, generate up to 9 GB/s of throughput. That saturates 56 Gb/s InfiniBand, the fastest network interface card with practical availability. With PCIE SSD adoption, the network rather than storage I/O is now the bottleneck for server-to-server data transfer.
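As a rough check on the figures in point 1, the Python sketch below converts the quoted link rates from Gb/s to GB/s and estimates how many cards it takes to fill each link. The per-card rate of 3 GB/s is an assumption derived from "three Duo2s generate up to 9 GB/s", and the function names are illustrative only. Protocol overhead beyond the quoted "actual" rates is ignored.

```python
import math

# Link rates quoted in point 1, in gigabits per second.
LINKS_GBIT = {
    "10 GbE": 10.0,
    "40 GbE": 40.0,
    "QDR InfiniBand (actual)": 32.0,
    "FDR InfiniBand (actual)": 54.54,
}

# Assumption: 9 GB/s spread evenly over three Duo2 cards.
DUO2_GBYTE_PER_S = 3.0


def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert Gb/s to GB/s (8 bits per byte, no protocol overhead)."""
    return gbit_per_s / 8.0


def cards_to_saturate(link_gbit: float,
                      card_gbyte: float = DUO2_GBYTE_PER_S) -> int:
    """Smallest number of cards whose combined rate meets or exceeds the link."""
    return math.ceil(gbit_to_gbyte(link_gbit) / card_gbyte)


for name, gbit in LINKS_GBIT.items():
    print(f"{name}: {gbit_to_gbyte(gbit):.2f} GB/s, "
          f"filled by {cards_to_saturate(gbit)} card(s)")
```

Under these assumptions, FDR InfiniBand works out to roughly 6.8 GB/s, so the third card is the one that pushes storage output past what the link can carry.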
2) Although backup compression is not used in the testing due to CPU limitations, database row/page compression is used, achieving about 40% space reduction. Every 3 GB/s of compressed pages backed up therefore represents about 3 / 0.6 = 5 GB/s of uncompressed data, which makes “real-world” effective backup throughput closer to 5 GB/s. Extrapolating that ratio pushes logical throughput into the 15 GB/s range (9 / 0.6) for the tested configuration if the current bottleneck (next item) is resolved.
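The arithmetic behind the 5 GB/s and 15 GB/s figures can be sketched as follows. The function name is hypothetical, and the model assumes the 40% space reduction applies uniformly to all data being backed up, which real databases will only approximate.

```python
def logical_throughput(physical_gb_s: float, space_reduction: float) -> float:
    """Uncompressed-equivalent rate for a given physical backup rate.

    space_reduction is the fraction of space saved by row/page
    compression, e.g. 0.40 for the roughly 40% quoted above.
    """
    if not 0.0 <= space_reduction < 1.0:
        raise ValueError("space_reduction must be in the range [0, 1)")
    return physical_gb_s / (1.0 - space_reduction)


# Observed physical rate today, and the projected rate once the
# bottleneck is resolved (both figures taken from the text above).
for physical in (3.0, 9.0):
    logical = logical_throughput(physical, 0.40)
    print(f"{physical:.0f} GB/s physical is about {logical:.0f} GB/s logical")
```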
3) CPUs are averaging under 50% utilization in my testing, so CPU does not seem to be the bottleneck. I suspect a QPI (QuickPath Interconnect) configuration issue. See http://www.qdpma.com/systemarchitecture/systemarchitecture_qpi.html. The aggregate bandwidth in a 2-way Nehalem architecture is 25.6 GB/s, accommodated through 4 x 6.4 GB/s QPI links. The QPI links comprise:
a. One between the two I/O hubs (IOH), each of which connects directly to half of the PCIE cards.
b. One between each IOH and a processor socket, i.e. NUMA node (two in total).
c. One between the sockets themselves.
As I understand it, if data transfer is monopolizing the IOH-to-IOH or CPU-to-CPU link rather than being affinitized between each processor and its directly connected IOH, throughput is limited to the individual QPI link speed of 6.4 GB/s. If QPI link saturation caused by the PCIE/NUMA configuration is the bottleneck, then resolving it should enable a SQL backup of a database stored on 6 Duos (1.5 GB/s read rate each) to achieve 9 GB/s, distributed among collectively adequate destination targets using a pair of 40 GbE network cards. Running a SQL backup to NULL can help verify this. I realize there are other possible culprits, e.g. inefficient cache use or SQL Server-specific NUMA configuration, and I am open to suggestions. The article http://www.micron.com/~/media/Documents/Products/White%20Paper/pcie_ssd_performance_hi.pdf provides recommendations for optimizing PCIE configuration for throughput.
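To make the hypothesis concrete, here is a deliberately simplified Python model of it. It treats the achievable backup rate as the minimum of storage read rate, network capacity, and the QPI capacity actually in use. The function name and the idea of counting "links in parallel" are illustrative assumptions, not a description of how the hardware schedules traffic, and the model does not try to explain the currently observed 3 GB/s.

```python
QPI_LINK_GB_S = 6.4  # per-link figure quoted in point 3


def achievable_backup_rate(storage_gb_s: float,
                           network_gb_s: float,
                           qpi_links_in_parallel: int,
                           qpi_link_gb_s: float = QPI_LINK_GB_S) -> float:
    """Ceiling on backup rate: the slowest of storage, network and QPI."""
    qpi_capacity = qpi_links_in_parallel * qpi_link_gb_s
    return min(storage_gb_s, network_gb_s, qpi_capacity)


storage = 6 * 1.5      # six Duos at 1.5 GB/s read rate each
network = 2 * 40 / 8   # a pair of 40 GbE cards, converted to GB/s

# Traffic funnelled through a single IOH-to-IOH or CPU-to-CPU link.
funnelled = achievable_backup_rate(storage, network, qpi_links_in_parallel=1)

# Traffic affinitized so each processor uses its own IOH link.
affinitized = achievable_backup_rate(storage, network, qpi_links_in_parallel=2)

print(f"single shared link: {funnelled:.1f} GB/s")
print(f"affinitized links:  {affinitized:.1f} GB/s")
```

In this model the single shared link caps the rate at the 6.4 GB/s link speed, while the affinitized case is limited by the 9 GB/s the storage can deliver, which is the outcome the paragraph above anticipates.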
Note: Due to an unexpected health crisis in our family, I need to re-prioritize and will be slow over the next several days in reviewing responses to the blog and associated video.