How can we improve Azure Log Analytics?

A 6-hour SLA on indexing custom log data is far too long to alert on

According to this article, https://azure.microsoft.com/en-us/support/legal/sla/log-analytics/v1_1/, indexing log data might take up to 6 hours under the SLA. OMS has built-in alerting that lets you trigger actions within 5 minutes of data arrival. But if indexing takes more than 5 minutes, what is the point of creating an alert that might fire on something that is no longer a problem, or not fire at all when there is a real problem? What is the average data indexing time? Log Analytics would be much more useful and have many more real-world applications if that indexing time were much lower. A 6-hour worst case seems like a joke for any system meant to respond to problems in real time.

(I usually see a 5-second delay on indexing of custom logs, but a couple of days ago it spiked to 20 minutes for a couple of hours. As a result, Log Analytics, which is otherwise very nice, might not be used at all, especially after reading that indexing might actually take 6 hours.)
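
To illustrate the concern, here is a minimal Python sketch of a scheduled alert. It rests on an assumption that is not confirmed by the SLA page: each evaluation looks back over a fixed window by event timestamp and can only see records that have already been indexed. The function name, the schedule, and the timestamps are all hypothetical.

```python
from datetime import datetime, timedelta

def alert_fires(event_time, ingestion_delay, window, interval, start, end):
    """Return True if any scheduled evaluation between start and end sees the event.

    Assumed model (hypothetical): an evaluation at time t sees an event only if
    the event is already indexed at t and its timestamp lies in [t - window, t].
    """
    t = start
    while t <= end:
        indexed = event_time + ingestion_delay <= t
        in_window = t - window <= event_time <= t
        if indexed and in_window:
            return True
        t += interval
    return False

start = datetime(2018, 1, 1, 12, 0)
end = start + timedelta(hours=1)
event = start + timedelta(minutes=2)   # problem occurs at 12:02
five_min = timedelta(minutes=5)

# The two delays reported in this post: the usual 5 seconds and the 20-minute spike.
fast = alert_fires(event, timedelta(seconds=5), five_min, five_min, start, end)
slow = alert_fires(event, timedelta(minutes=20), five_min, five_min, start, end)
print(fast, slow)
```

Under these assumptions, the event delayed by 20 minutes is indexed only after every 5-minute window that covers its timestamp has already been evaluated, so the alert never fires for it.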

364 votes
Serg Salo shared this idea
started  ·  Azure Log Analytics (Admin, Microsoft Azure) responded

We have recently published an article, https://docs.microsoft.com/en-us/azure/log-analytics/log-analytics-data-ingestion-time, that details various aspects of data ingestion time for Log Analytics and clarifies the distinction between the financially backed SLA and our Service-Level Objectives. In fact, the typical latency to ingest data into Log Analytics is between 3 and 10 minutes, with 95% of data ingested in less than 7 minutes.

We are also actively working to bring this latency down even further. Many customers already report a significant improvement, and more is coming.
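
For readers who want to check figures like "95% of data ingested in less than 7 minutes" against their own workspace, the following is a minimal sketch of a nearest-rank percentile over measured latencies. The sample values are made up for illustration and are not Microsoft's data; how you collect the latencies (for example, by comparing each record's event time with the time it became queryable) is left as an assumption.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical ingestion latencies in minutes (illustrative only).
latencies_min = [3.1, 3.4, 3.8, 4.0, 4.2, 4.5, 4.9, 5.0, 5.3, 5.6,
                 5.8, 6.0, 6.1, 6.3, 6.4, 6.5, 6.6, 6.8, 6.9, 9.7]

p50 = percentile(latencies_min, 50)
p95 = percentile(latencies_min, 95)
worst = max(latencies_min)
print(f"p50={p50} min, p95={p95} min, worst={worst} min")
```

A percentile and a worst case answer different questions: in this made-up sample the 95th percentile stays under 7 minutes while the slowest record takes almost 10, which is the gap between a typical-latency objective and a guaranteed maximum.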

16 comments

  • Trey Morgan commented

    Thank you for the article! It's very informative and answered a lot of my questions about ingestion time. Thank you for attempting to bring the latency down even further.

  • Trey Morgan commented

    Hey Ketan, are you and the team still on track to publish an update to the SLA? I was just speaking with an internal customer about the amount of time it should take to complete the indexing. I'm assuming, based on this SLA, that 6 hours is the absolute maximum. What is the expected average or normal duration for ingestion and indexing?

  • Anonymous commented

    So what is the update to the SLA? 6 hours is just not cutting it for a monitoring solution.

  • Anonymous commented

    I just had to recommend another product for monitoring simply because of this SLA. If not for this 6-hour SLA, I would have recommended Microsoft OMS.

    The specific scenario was monitoring a hybrid environment. In this case the 'near real time reporting' is not applicable to the on-prem servers, and not being able to guarantee alert processing within a reasonable timeframe meant this was a non-starter.

  • Balasubramanian Murugesan commented

    Is there an update on indexing log data, which takes 6 hours in Azure Log Analytics/OMS? We are struggling, and it is making customers opt for other third-party products. Please advise whether any change is in progress in Azure LA/OMS.

  • Tore Groneng commented

    Hi, any update on the SLA review? We are struggling with random high ingestion latency of more than 15 minutes. The 6-hour SLA just does not cut it and makes the Alert feature useless, in my opinion. The new near-real-time alert metric only applies to Azure resources. Customers with on-prem servers cannot use it and are left with high ingestion latency from time to time.

  • Devin Lambert commented

    Any update on this? NRT Metrics is not sufficient; I am looking for reliable alerting for applications. The ability to do so through logging makes for a very compelling feature, as long as it can happen within the 5-minute time frame.

  • Brian commented

    Is there no further guidance from the Azure team on this? LA is a non-starter for alerting without timely data.

  • Anonymous commented

    Not sure if this is related, but we run an e2e test that uses the ARM API to validate that log statements posted through the HTTP Data Collector API actually arrive in Log Analytics.
    The log records are often still not visible after more than 15 minutes.

  • Chris Baird commented

    To add to this, there is also the latency of events being recorded before they even reach OMS. In some cases, for example Azure AD sign-ins or activity, it can be several hours before an event can be viewed in the Azure AD portal, via the API, or via a storage account. You then have a further 6 hours to wait for such an event to be indexed within OMS.

    Whilst the reality is that events often flow quicker, such SLAs could be a deal breaker for some large organisations where monitoring and alerting upon issues of security and compliance, threats and DLP is critical.

    Is there any plan to review holistically the SLAs around event delivery (both for OMS and native Azure/O365)?

  • Graham Powell commented

    Regulated businesses that have to know and prove they have timely security and operational controls in place simply can't accept such a long SLA or such real-world performance. They would ideally like typical performance in the 0-5 minute range and an SLA set at around 15 minutes. I'm currently working for Microsoft with a major financial institution looking at exactly this topic.
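
For anyone running an end-to-end check like the one described in the comments above (post a record, then poll until it is queryable), a minimal polling helper with a deadline might look like the sketch below. The `check` callable is a hypothetical stand-in for whatever query you run against the workspace; no real Azure API is called here, and the function and parameter names are illustrative only.

```python
import time

def wait_until_visible(check, timeout_s, interval_s,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True or the timeout would be exceeded.

    Returns the elapsed seconds at the first successful check, or None if the
    record never became visible within timeout_s. The clock and sleep functions
    are injectable so the helper can be tested without real waiting.
    """
    start = clock()
    while True:
        if check():
            return clock() - start
        # Stop if the next poll would land past the deadline.
        if clock() - start + interval_s > timeout_s:
            return None
        sleep(interval_s)

if __name__ == "__main__":
    # Hypothetical check: pretend the record becomes queryable after 2 seconds.
    ready_at = time.monotonic() + 2
    elapsed = wait_until_visible(lambda: time.monotonic() >= ready_at,
                                 timeout_s=10, interval_s=0.5)
    print("visible after", elapsed, "seconds")
```

Recording the returned latency on every test run, rather than only pass or fail, gives you your own distribution to compare against the published typical figures.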
