Add "Staleness" Metric for EventHub Triggered Functions
TL;DR: Add offset-based and time-based "staleness" metrics for EventHub-triggered Functions
If one depends on EventHub's ordering guarantee, one must crash the process when a temporary error occurs so that the offset is not moved forward. (There should be a better mechanism; the retry policy seems to address this part of the problem.) Crashing the process causes a new function invocation to run from the same offset repeatedly until the temporary error is resolved. Regardless of how the offset is kept from advancing, this only temporarily avoids data loss: if the temporary error lasts longer than the retention period, data loss will occur.
Of course, the archive generated by a capture can be used to recover, but that introduces a lot of unnecessary complexity and coordination. Avoiding that work safely requires developer attention focused on whatever the source of the error is. A metric that measures the "staleness" of the data by time (lastEnqueuedTimeUtc - retrievalTime), by offset (lastEnqueuedOffset - offset), or by both could be alerted on to support safer operation of such a system.
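To make the request concrete, here is a rough sketch of the two values I have in mind. It is not tied to any real SDK: the `PartitionSnapshot` type, its field names, and the `should_alert` helper with its 50% threshold are all made up for illustration. It also assumes offsets can be compared as integers, and it measures the time lag against the enqueued time of the event currently being processed, which is only one reading of the formula above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class PartitionSnapshot:
    """Hypothetical view of one partition: the write head and the reader's position."""
    last_enqueued_offset: int            # offset of the newest event in the partition
    last_enqueued_time_utc: datetime     # when that newest event was enqueued
    current_offset: int                  # offset of the event the function is processing
    current_enqueued_time_utc: datetime  # when the event being processed was enqueued


def staleness(snapshot: PartitionSnapshot) -> tuple[int, timedelta]:
    """Return (offset lag, time lag) between the write head and the read head."""
    offset_lag = max(0, snapshot.last_enqueued_offset - snapshot.current_offset)
    time_lag = max(
        timedelta(0),
        snapshot.last_enqueued_time_utc - snapshot.current_enqueued_time_utc,
    )
    return offset_lag, time_lag


def should_alert(time_lag: timedelta, retention: timedelta,
                 threshold_fraction: float = 0.5) -> bool:
    """Alert once the reader has fallen behind by a chosen fraction of retention."""
    return time_lag >= retention * threshold_fraction


if __name__ == "__main__":
    snapshot = PartitionSnapshot(
        last_enqueued_offset=1500,
        last_enqueued_time_utc=datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc),
        current_offset=1000,
        current_enqueued_time_utc=datetime(2020, 1, 1, 6, 0, tzinfo=timezone.utc),
    )
    offset_lag, time_lag = staleness(snapshot)
    print(offset_lag, time_lag, should_alert(time_lag, retention=timedelta(days=1)))
```

An alert on the time lag as a fraction of the retention period would give a developer a warning well before stuck events start to expire, which is the failure this request is trying to avoid.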
Please add metrics to the Azure Functions trigger integration with EventHubs that show how far the read head is behind the write head, by offset and perhaps also by time.
Thanks for the feedback! Additional metrics are something we are looking into. Keep the votes coming!