Unexpected restart of Azure VMs caused by the Azure health monitoring system
Recently, in our deployment in the SE Asia (Singapore) DC, one of our VMs, a production application server, restarted unexpectedly. The database server was fine. The VM was back in 20 minutes, but those 20 minutes fell during peak hours for our customers, and we received quite a few brickbats about reliability. This is the third such instance, and it put us in a very tough position with the customers hosted on that server.
We don't have Availability Sets configured. We also understand that Azure's SLA of 99.95% applies only when Availability Sets are configured. The RCA seems to be an Azure health diagnostic tool that applied some kind of correction to unhealthy nodes, which caused the VMs on that node to restart.
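To put the 99.95% figure in perspective, here is a rough sketch of the downtime such an SLA would allow. It assumes a 30-day month and that availability is measured per month; both are our assumptions, and the function name is purely illustrative, not anything from Azure.

```python
def allowed_downtime_minutes(sla_percent: float, days_in_month: int = 30) -> float:
    """Minutes of downtime permitted in one month at the given SLA percentage.

    Assumption: availability is measured over a month of `days_in_month` days.
    """
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


budget = allowed_downtime_minutes(99.95)
print(f"Allowed downtime at 99.95%: {budget:.1f} minutes per month")
```

Under these assumptions the allowance works out to about 21.6 minutes a month, so a single 20-minute restart like ours would use up nearly all of it.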
If a diagnosis finds unhealthy nodes, then instead of healing them automatically, Azure could have informed us first, and we could have warned our customers about possible downtime. This took us totally by surprise.
If we had set up load-balanced VMs, we might have been saved. But for the database, we don't have SQL Enterprise with AlwaysOn. Because of our budget, we chose Standard and are trying to set up Mirroring to achieve something close to HA through manual switching. But if this kind of unscheduled restart happens, we don't even know how to detect it and perform the manual switch to the secondary server.
Reliability is becoming a big question mark. Not every Azure customer can afford SQL Enterprise; you have to consider the large segment of users on SQL Standard edition and design the DC accordingly.
