Improve handling and prevention of unhealthy SQL gateway cluster
It seems that if a front-end gateway node in the SQL gateway cluster turns unhealthy, logins to the database can fail with "connection terminated unexpectedly" or "Communication link error" errors. These nodes provide the directory lookup services and reroute incoming connections to the intended back-end probes. This recently happened to a client of mine that uses Azure SQL Database / SQL Data Warehouse in the West Europe region.
There are several gateway nodes per region, but it seems that an ARM deployment removed an ACL rule on a resource required for internal communication between the gateways and the internal Directory Service. As a result, the gateway node and its in-memory cache of the Directory Service could not look up and reroute incoming connections to the back-end probes, which caused the login failures mentioned above. If this lasts for an extended period, it can also lead to resource starvation and further impact. Thankfully, the issue was resolved when engineers restarted the gateway processes, which restored the ACL permissions and thus the Directory Service and the internal communication.
But perhaps the SQL gateway service can be made even more resilient, e.g. by improving the health model of the service so that it automatically stops deployments that match the failure pattern described above. Further monitoring of deployment results could perhaps also be implemented, e.g. checking for persistent unexpected internal communication errors after a deployment. Or maybe the load balancing between the gateway nodes could be tuned further. I think actions like these would make the service even more resilient.
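To make the suggestion a bit more concrete, here is a minimal sketch of the kind of post-deployment check I have in mind. It is purely illustrative: I have no knowledge of how the gateway health model is actually implemented, and the names HealthSample and should_halt_rollout, the 5% threshold and the "three consecutive bad samples" rule are all my own assumptions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class HealthSample:
    """One post-deploy observation from a gateway node (hypothetical shape)."""
    internal_comm_errors: int  # failed gateway-to-Directory-Service calls
    total_lookups: int         # all directory lookups attempted in the interval


def should_halt_rollout(
    samples: Sequence[HealthSample],
    error_rate_threshold: float = 0.05,  # assumed value, not a real setting
    consecutive_bad: int = 3,            # assumed value, not a real setting
) -> bool:
    """Return True if internal communication errors persist after a deploy.

    A single bad sample is treated as transient. Only a run of
    `consecutive_bad` samples above the threshold halts the rollout.
    """
    streak = 0
    for sample in samples:
        if sample.total_lookups == 0:
            # No traffic in this interval: no evidence either way.
            continue
        rate = sample.internal_comm_errors / sample.total_lookups
        if rate > error_rate_threshold:
            streak += 1
            if streak >= consecutive_bad:
                return True
        else:
            streak = 0
    return False


if __name__ == "__main__":
    after_deploy = [
        HealthSample(1, 100),
        HealthSample(40, 100),
        HealthSample(55, 100),
        HealthSample(70, 100),
    ]
    if should_halt_rollout(after_deploy):
        print("Persistent internal communication errors: stop the deployment")
```

Requiring several consecutive bad samples is meant to avoid stopping a deployment on a short blip, while still catching a failure like the removed ACL rule, where the errors do not go away on their own.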