Configurable back-end health check aggressiveness
Behind my Front Door sit two back-end pools, each consisting of a single web app.
For each back-end I have configured a health check with an interval of 120 seconds. My expectation was that this would lead to roughly 30 requests per hour.
In reality, Application Insights shows 64,000 requests in the past 24 hours; that's more than 40 requests per minute! A live traffic log confirms this: I see health check requests coming in almost every second...
With the current behavior, the actual request rate has hardly any correlation with the configured "Interval" setting.
It would be great if there were an option to tune down the health check aggressiveness for simple web applications.
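The gap between the configured interval and the observed rate is consistent with every Front Door point of presence (POP) probing the back-end independently. A rough back-of-the-envelope check (the ~100 POP count is an assumption for illustration, not a documented figure):

```python
# Observed: 64,000 health-probe requests in 24 hours.
observed_per_min = 64_000 / (24 * 60)

# Expected from the configured setting, if a single prober honored it:
# one probe every 120 seconds = 0.5 requests per minute per back-end.
configured_per_min = 60 / 120

# If instead ~100 POPs each probe every 120 seconds (assumed POP count):
pops = 100
fleet_per_min = pops * 60 / 120

print(round(observed_per_min, 1))  # ~44.4 observed per minute
print(configured_per_min)          # 0.5 suggested by the UI setting
print(fleet_per_min)               # 50.0 -- close to what is observed
```

In other words, the "Interval" setting appears to be per prober, not per back-end, which would explain an almost 100x difference between expectation and reality.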
You can now set the health probe method to HEAD instead of GET, and you can also disable probes for back-end pools with a single back-end.
This doesn't appear quite done yet: you can't actually disable the health probe in certain scenarios that the docs say should be possible:
Matt Pearse commented
It would be good if you could disable them entirely as well. There is a disable button on the portal... but the button is disabled haha.
I've set mine to the slowest possible setting (255 seconds), which still results in about 40-50 hits a minute...
I only have two back-ends, and one is disabled on purpose because we want to control failover manually. We implemented Front Door for the WAF, SSL termination, and the ability to reroute during a sustained regional outage. The health check serves no purpose in our scenario apart from costing us traffic. We have separate availability tests in Application Insights, which are much more configurable.
I completely agree. This is insane. My web logs are flooded, Application Insights is unusable, and my CPU and network bandwidth are being completely absorbed by 90 Front Door nodes hitting my servers 200 times a minute. Completely unmanageable. You should at least be able to have a small handful of probes share their results with the 90 Front Door nodes. This ALMOST makes Front Door unusable.
Sandeep Kumar commented
The probes are so aggressive because the back-ends are being hit by N POP servers located throughout the world (as described in the thread linked in the post). Why do all servers probe? So that each can determine its own latency to the back-ends. If only one of them probed, it would be hard to route requests efficiently from all parts of the world.
I would recommend splitting the probes into two categories:
Type 1: Probes whose purpose is to check whether the back-end is up and running.
- These probes can be more frequent, say every 5 seconds, but only one (or very few) POP servers should perform them.
Type 2: Probes that help calculate latency.
- These probes should be performed by all POP servers located throughout the world, and we should be able to set them to a much longer interval than the current maximum of 255 seconds. The documentation should also explicitly mention that all N POP servers (not just one) will hit the back-end every X seconds.
- If latency-based routing is not enabled, these probes shouldn't run at all.
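To see why this split helps, here is a rough comparison of aggregate probe load under the current scheme versus the proposed one. The POP count and the intervals are illustrative assumptions, not values from any documentation:

```python
pops = 100  # assumed fleet size, for illustration only

# Today: every POP probes at the configured 120 s interval.
current_per_min = pops * 60 / 120

# Proposed split:
liveness_per_min = 1 * 60 / 5        # Type 1: a single POP, every 5 seconds
latency_per_min = pops * 60 / 3600   # Type 2: all POPs, but only hourly
proposed_per_min = liveness_per_min + latency_per_min

print(current_per_min)            # 50.0 requests/minute today
print(round(proposed_per_min, 1)) # 13.7 requests/minute under the split
```

The back-end would still get fast up/down detection (one probe every 5 seconds) while the fleet-wide latency measurements, which dominate the traffic, drop to a configurable trickle.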
Same with me
Paul LeBlond commented
Additionally, I've got a web API built with Azure Functions. I cannot use Front Door with my API because the probes keep the functions running 24/7.
Just being able to set the health check to once a day would be fine for my functions. Even better: configure the health check rate based on the time of day (99% of our usage is 8 AM-5 PM). Better still: have Front Door look at usage over time and adjust health check rates dynamically. Put some of that machine learning to work!
Sebastian Groeneveld commented
What strikes me most is that with a configured health check interval of "120 seconds", your application gets health check requests every 1-2 seconds...
For my simple web app, the logging of the health checks has by far become the most expensive component...
I wouldn't consider entirely disabling Application Insights for health checks a solution, hence I posted this request:
I have deleted my previous comment and removed my 3 votes because, as it turns out, there is a perfectly acceptable way of dealing with this issue. For people having the same problem, here are the basics:
Application Insights allows you to mark requests from certain user agents as synthetic. In the full-framework version of the SDK this is done in the ApplicationInsights.config file (look for SyntheticUserAgentTelemetryInitializer; for more information about this file see https://docs.microsoft.com/en-us/azure/azure-monitor/app/configuration-with-applicationinsights-config). In the .NET Core version of the SDK you may need to write a simple custom TelemetryInitializer yourself that inspects the request's user agent and marks the telemetry accordingly.
Once everything synthetic is marked as such, you can add a simple custom ITelemetryProcessor implementation to filter out all synthetic requests. For an example, see: https://docs.microsoft.com/en-us/azure/azure-monitor/app/api-filtering-sampling
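For the full-framework case, the relevant piece of ApplicationInsights.config looks roughly like this. The initializer and its default filter list ship with the Web SDK; appending a token matching the Front Door probe's user agent is my assumption here — inspect your own probe traffic for the exact User-Agent string before relying on it:

```xml
<TelemetryInitializers>
  <!-- Marks any request whose User-Agent matches a filter as synthetic traffic. -->
  <Add Type="Microsoft.ApplicationInsights.Web.SyntheticUserAgentTelemetryInitializer, Microsoft.AI.Web">
    <!-- Default filters, plus an assumed token for the Front Door health probe. -->
    <Filters>search|spider|crawl|Bot|Monitor|AlwaysOn|Health Probe</Filters>
  </Add>
</TelemetryInitializers>
```

Once the probes are tagged as synthetic, a telemetry processor (per the filtering/sampling doc linked above) can drop them before they are ever sent, so they stop counting against Application Insights ingestion.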
Aleksander Pawlak commented
I've enabled Azure Front Door for a test/dev scenario on a web app that has Application Insights enabled. Costs started climbing fast because all of Front Door's heartbeat probes are logged, making Application Insights / Log Analytics pricey. There should be a way to disable that, shouldn't there?