Live migrate the VMs during maintenance/hardware failures
Right now Azure VM customers experience very frequent restarts that impact the availability of services. As far as I know, these restarts are mostly caused by either host upgrades or hardware failures.
Developing an HA setup is pretty straightforward for some scenarios (web servers), complex for others (database back-ends, notably MySQL), and extremely difficult for the rest (back-end applications used by our accounting department, for instance, which were designed to run on one machine).
Of course, HA setups are necessary for any major application, but most of the setups we see have lots of single-instance roles (and for good reason most of the time).
Currently, if you don't have an HA setup, machines are restarted about once per month or even more frequently (some as often as weekly). If a restart occurs during business hours, it means downtime.
Furthermore, for services like MySQL or SQL Server, this also means that the entire buffer pool is emptied and the performance of the application is degraded until the buffer pool fills up again.
What we would love to have is something like Google Cloud's Live Migration - the VM would be moved to another host in case of hardware failures or host upgrades without restarts or downtime.
Azure has recently announced a number of improvements to maintenance reboot notifications. A lot of maintenance work is done silently behind the scenes. For those that require a reboot, you’ll now have a “planned maintenance” window where you can pro-actively reboot the machine at a time of your choosing.
This "implementation" becomes even more problematic with Azure Stack. I cannot really figure out the reason, though. All the needed technology has been available in Hyper-V for years.
Ziv Rafalovich commented
B. We have recently announced (currently still in preview) a new feature which enables applications to discover upcoming events from within the VM before they happen. With this new capability, you can improve the overall resiliency of your service (e.g. by proactively failing over to another node) or reduce the overall impact even in the case of a single node by performing a graceful shutdown. Many applications have some sort of support for graceful shutdown that will reduce the overall time to restore the application once the VM is recovered. Note that this feature supports Pause (used for in-place VM migration), Reboot (planned maintenance) and Redeploy (used for scenarios such as healing).
For more information: https://myignite.microsoft.com/videos/39458
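As a rough illustration of how an application inside the VM might act on such events, here is a minimal Python sketch. It is not an official sample. The endpoint URL, the API version, the `Metadata` header, and the JSON field names (`Events`, `EventId`, `EventType`, `Resources`) are assumptions based on the Azure Instance Metadata Service as I understand it, so please verify them against the current documentation. The `ACTIONS` mapping and the `plan_actions` helper are hypothetical names made up for this example. "Pause" is the term used above; "Freeze" is included because, as far as I know, that is what later documentation calls the same event.

```python
import json
import urllib.request

# Assumed endpoint and API version; verify against the current Azure documentation.
SCHEDULED_EVENTS_URL = (
    "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"
)

# Hypothetical mapping from event type to the action this application takes.
ACTIONS = {
    "Pause": "ride_out",           # term used in this thread for a short in-place pause
    "Freeze": "ride_out",          # assumed documentation name for the same event
    "Reboot": "graceful_shutdown",
    "Redeploy": "fail_over",
}


def fetch_scheduled_events(url=SCHEDULED_EVENTS_URL, timeout=5):
    """Fetch the scheduled-events document; only reachable from inside an Azure VM."""
    request = urllib.request.Request(url, headers={"Metadata": "true"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def plan_actions(document, vm_name):
    """Return (event_id, action) pairs for the events that affect vm_name."""
    plan = []
    for event in document.get("Events", []):
        if vm_name not in event.get("Resources", []):
            continue  # the event targets a different VM
        action = ACTIONS.get(event.get("EventType"), "ignore")
        plan.append((event.get("EventId"), action))
    return plan


if __name__ == "__main__":
    # Hand-written sample document so the sketch also runs outside Azure.
    sample = {
        "Events": [
            {"EventId": "e1", "EventType": "Reboot", "Resources": ["db-01"]},
            {"EventId": "e2", "EventType": "Freeze", "Resources": ["web-01"]},
        ]
    }
    print(plan_actions(sample, "db-01"))
```

In a real service you would call `fetch_scheduled_events()` on a timer and hand the result to `plan_actions`, then trigger your own failover or shutdown logic for each returned action.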
Ziv Rafalovich commented
We have made several improvements in recent months in order to improve overall availability of VMs in Azure:
A. In-place VM migration is a technology enabling us to update the Azure hosting environment without rebooting VMs. We can patch the underlying Host OS, various agents and even microcode (in some cases) while pausing VMs hosted on that node for less than 30 seconds. This technology enables us to roll out new features, improvements, bug fixes, as well as security patches with little to no impact on our customers. When pausing the VM for such a short time, memory, open files, and network connections are preserved. Most platforms are resilient to such a pause and will not cancel/rollback any transactions or jobs. For more information: https://azure.microsoft.com/en-us/updates/azure-in-place-virtual-machine-migration-eliminates-virtual-machine-reboots-during-critical-security-updates-for-host-os/
Sander Knijn commented
"We expect to make improvements in planned maintenance over the next few months." quoted on the 5th of January 2015. It's now almost September 2016...
Please give an update about this one!
It would be nice to be able to rent IaaS with a high SLA.
Just use Live Migration and some good hardware+storage boxes.
Would love an update/feedback on this - even just confirmation it is still coming would go a long way.
Azure supports high availability swing migration from OUTSIDE the Azure IaaS infrastructure, yet for some reason can't handle a node-to-node transfer to migrate around scheduled downtime. On-premises solutions can provide this quite simply with N+1 nodes and a rolling upgrade (even between HA cluster versions).
Samir FARHAT (MVP) commented
MS will bring 99.99% availability to single instances in the future. It's part of a project in which they are enhancing the maintenance process.
Lucian Daia commented
Uptime was a lot better in 2015, I'll give you that.
If it's too complex to give us live migrations, at least please add an SLA for single instance VMs. Right now we have to dodge the SLA question from a lot of customers who have a few services that can only run on a single instance.
This is a really important point to address in making the Azure IaaS offer more competitive. Running the solution on on-premises hardware, we simply use N+1 hosts to overcome planned and unplanned downtime. In Azure we are forced to use Availability Sets, which doubles the cost of the production environment. Especially for SMB markets this is, in my opinion, a show stopper, and it makes the Azure offer almost two times more expensive than an on-premises or private cloud offer.
Corey Sanders, when can we expect some changes in this setup?
Darian Miller commented
IaaS with Azure has been a non-starter. Yes, you can have redundant servers, but not everything can be completed asynchronously. Envision taking a credit card payment, and after it gets submitted to the processor, and before you get the transaction result, your VM is dropped. The transaction is completed on the receiving end, but you have not received confirmation. If you have dozens/hundreds of transactions in process simultaneously throughout the day, how often is this IaaS problem going to hit you? The 'bonus' is that many of these types of transactions require manual intervention to correct. Downtime is expected, but intentional/unannounced/frequent downtime certainly is not. I would not recommend that anyone with any sort of volume even bother with Azure IaaS. There is apparently some progress being made, but it's unusable except for low volume use cases until this issue is fully addressed.
Vlad Kosarev commented
Any news on this? This has been a major pain point for us since the beginning.
John Meyer commented
Corey Sanders stated back in January that there would be improvements to this process; it is now August. I'm a big proponent of Azure and would potentially recommend it as the solution for eventually migrating our entire infrastructure. However, I cannot fully do so until we have HA during maintenance windows that does not involve user configuration or intervention.
Can we please be provided an update as to the current status of the improvements to maintenance windows?
Mikael Lundh commented
Yes, we need this badly.
This capability needs implementing. We have 16 global sites connected to an application with a SQL back end and 4 interconnected servers providing this service, which also requires a boot sequence for it all to connect back properly. Not having the ability to reboot the services and servers in sequence causes us severe problems.
It would be such a relief if a live migration was possible during a routine Azure backend maintenance window!!
Neil Palmer commented
Any updates on this? Amazon is way ahead when it comes to VM uptime. I'd love to recommend Azure VMs to clients, but I can't when I know they have software that doesn't do HA and Azure will randomly reboot.
Any updates? The 48 Hour Maintenance Window makes VDI solutions on Azure problematic.
VDI requires persistent connections to Single VMs. Availability Sets do nothing to alleviate the problem of VDI VMs being rebooted during business hours.
It is not only about planned maintenance. Recently our VMs seem to end up on bad hosts pretty often, and I would love to have a command at my disposal that would let me migrate affected VMs to a new host. At the moment I have to change their size, but this is a workaround, not a solution. Even better if Azure could do that for me.
Thomas Brown commented
An absolute must against the competition; we cannot use Azure because of this missing feature. HA is not an option if it nearly doubles the licensing costs of third party software that is licensed by the number of front end servers. This should be the number one priority and would also enable SLAs for single machines, which every cloud provider is able to deliver. Except Azure :-(
Lucian Daia commented
Even though there haven't been any more significant outages in the past 4 months, reducing the impact of both planned maintenance and hardware failure is still a really important feature.
Any updates on what the Azure team will do to mitigate outages or frequent restarts in the future?