Batch AI

We’re here to make your deep learning and AI training more efficient, to help you run many experiments in parallel, at larger scale, and with lower overall costs. We’d love to hear your feedback on what you like, what’s not so good, and what features or other improvements we should make. All of your feedback here and in other forums will be monitored and reviewed by the Batch AI engineering team. By making suggestions and voting here, you will also be among the first to hear about new capabilities. This forum is just for feature suggestions. If you have technical questions, please see our documentation, MSDN, or Stack Overflow.

  1. Better handling and visibility around 'Unusable' cluster nodes

    From time to time, my clusters have unusable nodes. They either become unusable during preparation (e.g. with InternalError running dpkg/apt-get commands), or while a job is running.

    Ideally we wouldn't encounter unusable nodes at all, but when we do, it would be good if they were removed from the cluster, any jobs that were running on them were resubmitted, and autoscaling took care of reprovisioning nodes if required.

    As it is, nodes that become unusable during preparation seem to be retried a couple of times and then no more, while during a run the jobs just stay running and the nodes stay unusable,…
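
The behaviour requested above (drop unusable nodes, resubmit their jobs, let autoscaling refill the cluster) could be sketched roughly as follows. This is only an illustration of the wished-for policy: `Node`, `Cluster`, `reconcile`, and the state names are hypothetical and are not part of the Batch AI API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    node_id: str
    state: str                      # hypothetical states: "idle", "running", "unusable"
    job_id: Optional[str] = None    # job assigned to this node, if any


@dataclass
class Cluster:
    target_size: int                # node count that autoscaling should maintain
    nodes: List[Node] = field(default_factory=list)
    pending_jobs: List[str] = field(default_factory=list)


def reconcile(cluster: Cluster) -> int:
    """Run one pass of the requested policy.

    Removes unusable nodes, requeues any job that was assigned to them,
    and returns how many nodes would need to be reprovisioned.
    """
    healthy = []
    for node in cluster.nodes:
        if node.state == "unusable":
            if node.job_id is not None:
                # Resubmit the job instead of leaving it "running" forever.
                cluster.pending_jobs.append(node.job_id)
            continue  # drop the unusable node from the cluster
        healthy.append(node)
    cluster.nodes = healthy
    return max(0, cluster.target_size - len(healthy))


cluster = Cluster(
    target_size=3,
    nodes=[
        Node("n0", "running", "job-a"),
        Node("n1", "unusable", "job-b"),
        Node("n2", "unusable"),
    ],
)
to_provision = reconcile(cluster)
```

In this toy run, the two unusable nodes are dropped, `job-b` goes back onto the pending queue, and the function reports that two replacement nodes are needed.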

    8 votes
    0 comments  ·  Autoscaling