Custom job/node statistics/metrics
There's a vote for per-node resource statistics for general metrics that are always relevant - CPU, RAM usage, etc.
The ability to push custom statistics, probably through some kind of node SDK, would be great too. For example, if you're doing database work, you might want rows per second; if it's render work, pixels per second or an ETA for that section.
Ideally, these metrics would also be available for the entire job. In the database example above, the job would show the sum of every node's rows per second as the total rows per second. For ETA, it would be the largest ETA of any node (as the nodes are running in parallel).
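To make the roll-up concrete, here is a minimal sketch of how per-node values could be combined into job-level values. The names `NodeMetrics` and `aggregate_job_metrics` are hypothetical and only for illustration; nothing like this exists in the Batch SDK as far as this request is concerned.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class NodeMetrics:
    """Hypothetical per-node custom metrics pushed by a node."""
    node_id: str
    rows_per_second: float
    eta_seconds: float


def aggregate_job_metrics(nodes: Iterable[NodeMetrics]) -> Dict[str, float]:
    """Combine per-node metrics into job-level metrics.

    Throughput adds up across nodes; the job's ETA is the slowest
    node's ETA, because the nodes work in parallel.
    """
    nodes = list(nodes)
    if not nodes:
        return {"rows_per_second": 0.0, "eta_seconds": 0.0}
    return {
        "rows_per_second": sum(n.rows_per_second for n in nodes),
        "eta_seconds": max(n.eta_seconds for n in nodes),
    }
```

With two nodes reporting 100 and 250 rows per second and ETAs of 30 and 90 seconds, the job would show 350 rows per second and an ETA of 90 seconds.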
Adminbatchuv (Other, Microsoft Azure) commented
Leveraging Application Insights may be an option: https://docs.microsoft.com/azure/batch/monitor-application-insights
Peter Clotworthy commented
Regarding custom solutions:
I've spent a good deal of time looking into ways to do this myself, but they've always come off as clunky, and none of them will appear in the portal, which, if nothing else, provides an easy way to see what's going on in a pool.
I currently have each node write progress to a log, but the log grows pretty big. Once the log gets larger than 1 MB, you can't view it without downloading it. Even if you fetch it programmatically, each request takes longer and longer as the log grows.
Some potential ways to do this yourself:
- Provision a VM outside of the pool (i.e. a "normal", non-batch VM) and have the nodes connect to it using a Virtual Network
- Use multi-instance tasks to have the master node monitor metrics and schedule tasks. This is great in networking terms, because you get environment variables for the master IP address, but not so great in complexity, because you probably have to use MPI, which doesn't make sense if each node only does one thing over its life cycle.
- Add a fake metrics task to your application (dirty but simple) that makes its node the "sink". This is by far the easiest in theory, but each node then has to work out which node actually holds the metrics, e.g. through a discovery protocol or an IP address written to a file in the shared folder.
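
The "IP written to a file" idea from the last option could be sketched roughly as below. This assumes a directory that every node can see (for example a mounted network share; that is an assumption, not something Batch gives you by default) and that exclusive file creation is atomic on that file system, which is not guaranteed on every network share. The file name `metrics_sink.txt` and the function `claim_or_find_sink` are made up for illustration.

```python
import os
import time
from pathlib import Path
from typing import Tuple

SINK_FILE_NAME = "metrics_sink.txt"  # hypothetical file name


def claim_or_find_sink(shared_dir: str, my_address: str,
                       retries: int = 10, delay: float = 0.5) -> Tuple[str, bool]:
    """Return (sink_address, is_sink) for the calling node.

    The first node to create the file becomes the sink and writes its
    address there; every later node reads the address back.
    """
    path = Path(shared_dir) / SINK_FILE_NAME
    try:
        # O_EXCL makes creation fail if the file already exists,
        # so only one node can win the claim.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        # Another node is the sink; it may not have finished writing yet,
        # so retry while the file is still empty.
        for _ in range(retries):
            address = path.read_text().strip()
            if address:
                return address, False
            time.sleep(delay)
        raise RuntimeError("sink file exists but no address was written")
    with os.fdopen(fd, "w") as f:
        f.write(my_address)
    return my_address, True
```

Each node would call this once at start-up with its own address; the node that gets `True` back starts collecting metrics, and the rest send theirs to the returned address. It does nothing about the sink node going away, which is part of why this option is "dirty".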