Better handling and visibility around 'Unusable' cluster nodes
From time to time, my clusters end up with unusable nodes. They become unusable either during preparation (e.g. with an InternalError while running dpkg/apt-get commands) or while a job is running.
Ideally we wouldn't encounter unusable nodes at all, but when we do, it would be good if they were removed from the cluster, any jobs running on them were resubmitted, and autoscaling reprovisioned nodes if required.
As it is, nodes that become unusable during preparation seem to be retried a couple of times and then abandoned, while nodes that become unusable mid-job stay unusable, with their jobs still shown as running, until someone intervenes manually.
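
To make the request a bit more concrete, here is a rough sketch of the recovery behaviour I have in mind. It is illustration only: `Node`, `plan_recovery`, the state strings and `target_size` are all hypothetical names I made up, not part of any existing API, and a real implementation would have to call the actual cluster service to act on the plan.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    # Hypothetical node record; field names are assumptions for illustration.
    node_id: str
    state: str  # e.g. "idle", "running", "unusable"
    running_jobs: List[str] = field(default_factory=list)


def plan_recovery(nodes: List[Node], target_size: int) -> Tuple[List[str], List[str], int]:
    """Work out what to do about unusable nodes.

    Returns (node ids to remove, job ids to resubmit, number of nodes to add).
    """
    unusable = [n for n in nodes if n.state == "unusable"]

    # Every unusable node is removed from the cluster.
    to_remove = [n.node_id for n in unusable]

    # Any job that was running on an unusable node is resubmitted.
    to_resubmit = [job for n in unusable for job in n.running_jobs]

    # Autoscaling tops the cluster back up to the target size if required.
    healthy_count = len(nodes) - len(unusable)
    to_add = max(0, target_size - healthy_count)

    return to_remove, to_resubmit, to_add


if __name__ == "__main__":
    cluster = [
        Node("a", "idle"),
        Node("b", "unusable", ["job-1"]),
        Node("c", "running", ["job-2"]),
    ]
    print(plan_recovery(cluster, target_size=3))
```

The point is only that the three steps (remove, resubmit, reprovision) could be derived from node state alone, without anyone having to step in manually.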