Add URL-based probe health checks to Web Apps
When I spin up multiple instances of a Web App and one of them becomes unhealthy, in the sense that it starts throwing HTTP 500 errors, I'd like the load balancer to take it out of rotation, ideally by using load balancer probing.
This is achieved through WebRoles as detailed here: https://msdn.microsoft.com/library/azure/jj151530.aspx, but this functionality is not available in Web Apps. It would really help soften the blow of instance-based outages.
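To make the request concrete, here is a minimal sketch of the kind of endpoint a URL-based probe could call. Everything in it is an assumption for illustration: the `/healthz` path, the `health_status` helper, and the shape of the `checks` dictionary are hypothetical names, not an App Service or WebRoles API. The idea is only that the instance answers 200 when its own checks pass and 503 when they don't, so a probing load balancer has something to act on.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def health_status(checks):
    """Run each check and return (status_code, body).

    `checks` maps a name to a zero-argument callable that returns True
    when that dependency is healthy. A check that raises counts as failed.
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    if failed:
        return 503, "unhealthy: " + ", ".join(sorted(failed))
    return 200, "ok"


class HealthHandler(BaseHTTPRequestHandler):
    # Hypothetical registry of checks; a real app would register its own.
    checks = {}

    def do_GET(self):
        # Only the (assumed) probe path is served by this sketch.
        if self.path != "/healthz":
            self.send_error(404)
            return
        code, body = health_status(self.checks)
        payload = body.encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Start the sketch server locally (blocks until interrupted)."""
    HTTPServer(("127.0.0.1", port), HealthHandler).serve_forever()
```

Calling `serve()` after assigning something like `HealthHandler.checks = {"storage": my_storage_check}` would expose the probe locally; `my_storage_check` is a placeholder for whatever the app considers a sign of health.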
Use Proactive Auto Heal to set up rules (such as for high HTTP 500 errors) and actions (such as restarting the instance).
Thank you for suggesting Proactive Auto Heal.
I need to know why the instance has become unhealthy. Can I automatically save a memory dump before an unhealthy instance is restarted by Proactive Auto Heal?
Microsoft support engineers say that if there is no memory dump, they can't investigate the cause of the unhealthy condition.
Nick Muller commented
Thanks for suggesting the Auto Heal feature. Didn't know it existed.
1. Can this feature be documented in the docs? (https://docs.microsoft.com/en-us/azure/app-service/)
2. I'd like to receive an (email) alert when a mitigation rule is triggered. Is this currently possible?
When one instance of a Web App has a problem, even being able to manually exclude it from the Web App's load balancing would help.
When one instance of a Web App has a problem, I'd like to keep customers from seeing error messages repeatedly. If I could temporarily exclude that instance from the Web App's load balancing, customers could connect to the other instances and continue their work.
Mike Kostuch commented
Make the probe use HTTPS if the web app is set to HTTPS, and HTTP if it's set to HTTP.
Do you have Always On enabled for the web app in Application Settings? If so, Azure sends a test request every 5 minutes to make sure the app is still in memory. It sends it as an HTTP request. If you add a rewrite rule to HTTPS, this fixes the HTTP 500 errors, if they are recurring evenly every 5 minutes. Not sure why yours are occurring, but this fixed our metrics when we saw them evenly every 5 minutes.
I've got a few on-prem web apps that I would like to migrate to Azure App Service. It seems the Azure Load Balancer service can do this, so why can't App Service? This is my greatest fear about switching. It almost seems that we'd get a faster solution to this by containerising and using Service Fabric Mesh to get the "serverless" experience, once that service reaches GA. That is a massive workaround that may or may not work in practice here, though.
Tom Wilson commented
This idea and https://feedback.azure.com/forums/169385-web-apps/suggestions/8894002-add-url-based-probe-health-checks-to-web-apps look pretty similar, should they be merged?
This idea and https://feedback.azure.com/forums/169385-web-apps/suggestions/9392763-application-instance-health-monitoring-and-recover look pretty similar, should they be merged?
This is absolutely crucial to have for a highly available production app
Josep Planells commented
It seems to me very difficult to achieve a truly high-availability scenario with minimal downtime with the current options. Is there any automated way to recover when a single instance of the app has crashed for any reason? We cannot see how. The LB does not help, since it continues delivering traffic to the crashed instance. Another option would be to detect the issue with an instance in another monitoring system and restart it. But Auto Healing restarts all the instances, so we'll have downtime. So, no matter how many instances you have, if one or more become unresponsive, there is no way to solve it.
Tore Lervik commented
Simple web-tests that the LB can use would fix a lot of cases, and it would make app services much more robust.
Yesterday a system went down because one of 4 instances crashed. The other 3 instances were working fine, but because of how the LB operates this didn't matter much.
Right now the only way to recover is to restart the whole app, or find the instance that is broken and kill the correct process on that host (which is a pain to get right).
It would be nice to have an easier way to restart just a single instance: https://feedback.azure.com/forums/169385-web-apps/suggestions/32127793-add-ability-to-restart-a-specific-instance
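For what it's worth, the behaviour being asked for with "simple web tests that the LB can use" can be sketched in a few lines. This is only an illustration of the general technique (eject an instance after a run of failed probes, readmit it after a run of successful ones); the `Rotation` class and the `eject_after` and `readmit_after` thresholds are hypothetical names and values, not how the App Service load balancer works today.

```python
from dataclasses import dataclass


@dataclass
class InstanceState:
    in_rotation: bool = True
    # Consecutive probe results that disagree with the current state.
    streak: int = 0


class Rotation:
    """Track which instances should receive traffic based on probe results."""

    def __init__(self, instances, eject_after=2, readmit_after=2):
        self.eject_after = eject_after
        self.readmit_after = readmit_after
        self.state = {name: InstanceState() for name in instances}

    def record_probe(self, name, ok):
        """Record one probe result (True = healthy) for an instance."""
        s = self.state[name]
        if ok == s.in_rotation:
            # Result agrees with the current state, so reset the streak.
            s.streak = 0
            return
        s.streak += 1
        threshold = self.eject_after if s.in_rotation else self.readmit_after
        if s.streak >= threshold:
            # Enough consecutive disagreeing probes: flip the state.
            s.in_rotation = not s.in_rotation
            s.streak = 0

    def healthy(self):
        """Return the instances that should currently receive traffic."""
        return [name for name, s in self.state.items() if s.in_rotation]
```

With four instances and one that keeps failing its probe, `healthy()` would return the other three after the second failed probe, which is exactly the case where the whole system should have stayed up.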
Tuukka P commented
Any updates on this? We have a fairly long web role instance startup as we need to load a large set of read-only data from blob storage into server memory, and during this time the node instance is not able to serve requests. A custom health probe allows us to direct requests to instances that are ready. We cannot move from web role instances to web app instances unless we have a mechanism to deal with this situation, and since classic cloud services do not support ARM, we are left with legacy solutions for deploying these services.
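The startup case above could look roughly like the sketch below on the application side: the probe answers 503 until the data load has finished, then 200. The `ReadinessGate` class, `start_loading` helper, and `load_data` callable are hypothetical names used for illustration, and the sketch assumes the platform would honour such a probe, which is the feature this thread is asking for.

```python
import threading


class ReadinessGate:
    """Report 'not ready' until the slow startup work has completed."""

    def __init__(self):
        self._ready = threading.Event()

    def mark_ready(self):
        self._ready.set()

    def probe_status(self):
        # 503 tells a probing load balancer to keep traffic away for now.
        return 200 if self._ready.is_set() else 503


def start_loading(gate, load_data):
    """Run `load_data` in a background thread and open the gate afterwards.

    If `load_data` raises, the gate is never opened and the instance keeps
    reporting 503.
    """

    def worker():
        load_data()
        gate.mark_ready()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

The probe handler would simply return `gate.probe_status()`, so requests only reach the instance once the in-memory data set is in place.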
James Reategui commented
You guys could get some ideas from Docker/Kubernetes as to how they handle Health Checks and killing pods in the set that go bad. If you guys can pull this off it would really take App Services to the next level. For us at least it would mean not having to work on migrating to Docker.
Having a defective load balancer like this means we can't keep using Web Apps in production.
We've encountered multi-hour outages as well, including some where an instance went bad because of a problem on Azure's side.
Advertising defective products as production-ready does real damage to businesses. How are other features considered higher up in the pipeline?
Thank you for your patience. We still have this planned though there is no timeline yet as there are a few items our engineering team needs to complete before this can be tackled. We will update here once there is more information to share.
+1. I have this need as well to have a webapp be able to specify a custom health check page.
Just curious if there's been any movement on this issue? We're trying to deploy a Spring-based web app to an App Service and are experiencing the same issues. When one instance crashes, the whole web app seems to go down. We would like to talk with somebody from the Azure product team to bounce ideas around and gain an understanding of how Java apps work under the hood in an App Service.
This is killing me. My worker web app takes a significant amount of time to start up in its WCF service constructor (reading data from SLOW d:\home Azure file share that all web app instances share). It seems the load balancer is directing traffic to it even before the constructor has finished. This is causing terrible latency when this instance is first rotated in. I'm now regretting not having gone with WebRoles. Please help!