Alert when a VM I/O limit is reached
When a VM hits its VM-wide IOPS or bandwidth limit and throttling kicks in, provide a facility for creating an alert on that condition.
Currently, the only ways to detect I/O throttling rely on collecting guest OS statistics and comparing them to the limits specified by Azure. It would be a real benefit to determine at the hypervisor level when I/O throttling has taken place and to record this in customer-visible metrics. These metrics are already visible internally to Microsoft support, so the data is being collected; it is just not presented to customers.
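To illustrate the guest-side workaround described above, here is a minimal sketch of comparing one sampled reading against the published limits. The names `VmLimits`, `IoSample`, and `near_limit` are hypothetical, and the numbers in the usage lines are placeholders rather than real limits for any Azure VM size. Note that this can only infer that throttling is likely; it cannot confirm that the hypervisor actually throttled, which is the gap this request is about.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VmLimits:
    """VM-wide caps; fill in from the documentation for your VM size (assumed values)."""
    max_iops: float
    max_mbps: float


@dataclass(frozen=True)
class IoSample:
    """One reading taken from guest OS counters (collection method not shown)."""
    iops: float
    mbps: float


def near_limit(sample: IoSample, limits: VmLimits, threshold: float = 0.95) -> List[str]:
    """Return the names of the limits this sample is at or above `threshold` of."""
    hits = []
    if sample.iops >= threshold * limits.max_iops:
        hits.append("iops")
    if sample.mbps >= threshold * limits.max_mbps:
        hits.append("bandwidth")
    return hits


# Placeholder numbers for illustration only.
limits = VmLimits(max_iops=6400, max_mbps=96)
print(near_limit(IoSample(iops=6200, mbps=50), limits))
```

A guest-side check like this has to be deployed, fed, and kept in sync with the VM size on every machine, which is why a platform-level metric would be preferable.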
At a minimum, the disastrous situation where whole-VM I/O throttling has taken place should be able to generate an alert.
A good resource for describing I/O throttling is here:
A good quote from that article:
In order to prevent the blocking issues and to improve the performance, we need to:
Prevent VM level throttling at all cost.
I completely agree with this. Throttling is often not merely a slowdown; it can cause VMs to hang, crash, or stop servicing requests. This creates significant availability issues that are hard to detect by other means, since Azure heartbeats continue even while the business services are dead.
Beyond this basic "your server may be dead" alert for VM-wide I/O throttling, it would be great if we could also get per-disk alerts when throttling kicks in, or even (ideally) when some percentage of the maximum is sustained for a defined period of time.
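As a rough sketch of the "percentage of the maximum for a defined period" rule requested above, the check could look something like the following. The function name `sustained_breach` and its parameters are hypothetical, and it assumes samples arrive at a fixed interval, so a period is expressed as a count of consecutive samples. This is not an existing Azure alert rule, only an illustration of the intended semantics.

```python
from typing import Iterable


def sustained_breach(
    samples: Iterable[float],
    limit: float,
    fraction: float,
    min_consecutive: int,
) -> bool:
    """Return True if at least `min_consecutive` samples in a row are at or
    above `fraction` of `limit` (for example 0.9 for 90% of the disk maximum)."""
    run = 0
    for value in samples:
        if value >= fraction * limit:
            run += 1
            if run >= min_consecutive:
                return True
        else:
            run = 0  # any dip below the threshold resets the window
    return False


# Hypothetical per-disk IOPS readings against a placeholder limit of 1000.
readings = [100, 950, 960, 970, 200]
print(sustained_breach(readings, limit=1000, fraction=0.9, min_consecutive=3))
```

Requiring consecutive samples rather than a single spike keeps short bursts from firing the alert, while still catching a disk that sits near its ceiling for the whole window.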