How can we improve Azure Search?

Make the blob indexer faster

In my tests, a single blob indexer on an S2 service with 1 partition and 1 replica can only process between 50,000 and 75,000 small Office documents (1 to 4 pages) in a 24-hour period.

The current workaround, restructuring millions of blobs into "directories" of at most 75,000 blobs each and running 12 indexers, is completely out of the question because of the insane pricing model and the time it would take both to update consumers with the new paths and to move the blobs into the new structure. Moving the blobs is very slow as well.

So if I want to index 5 million documents into a $1,200/month tier, which can hold 60,000,000 documents and 100 gigabytes, I would have to upgrade to a ~$12,000/month tier with 12 indexers and somehow figure out a way to restructure all my blobs to fit the 12-directory model... Even then it would take weeks to complete... How is that even considered by Microsoft?
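To put rough numbers on this, here is a back-of-the-envelope sketch in Python that uses only the throughput range measured above. It assumes throughput scales linearly with the number of indexers, which is an assumption rather than something the tests confirmed, and `days_to_index` is just an illustrative helper name.

```python
def days_to_index(total_docs: int, docs_per_indexer_per_day: int, indexers: int = 1) -> float:
    """Estimate days needed, assuming throughput scales linearly with indexer count."""
    return total_docs / (docs_per_indexer_per_day * indexers)


TOTAL_DOCS = 5_000_000  # corpus size from the post

# Observed range for one S2 blob indexer: 50,000 to 75,000 documents per day.
for rate in (50_000, 75_000):
    for indexers in (1, 12):
        days = days_to_index(TOTAL_DOCS, rate, indexers)
        print(f"{rate:>6} docs/day x {indexers:>2} indexer(s): {days:.1f} days")
```

Under these assumptions a single indexer needs roughly 67 to 100 days for 5 million documents, and 12 indexers still need about 6 to 8 days, before counting the time to move the blobs.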

At the very least, offer a way to pay for indexer throughput (or for the number of indexers we can have) without having to scale out partitions.

9 votes
    LpLp shared this idea

    1 comment

      • lpunderscorelpunderscore commented

        As I think about this, another solution would be to let us call a high-throughput API that extracts document content. We could then scale out with Azure Batch, for example, and call that API to extract content from documents at scale.

        Blob storage could be integrated by passing the blob storage URI and credentials to the service.

        Documents could also be on-prem, which would allow for more throughput than blob storage if needed...

        We could then index the content using the search indexing API (which could use more performance as well, imo).
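As a rough illustration of that last step, the sketch below groups already-extracted documents into push payloads for the search indexing API. The `id` and `content` field names and the `build_index_batches` helper are hypothetical, the batch size of 1000 is my understanding of the per-request document limit and should be checked against the service limits, and the HTTP call itself is left out.

```python
from typing import Dict, Iterable, Iterator, List

# Assumed maximum number of documents per indexing request.
MAX_BATCH_SIZE = 1000


def build_index_batches(
    docs: Iterable[Dict[str, str]], batch_size: int = MAX_BATCH_SIZE
) -> Iterator[Dict[str, List[Dict[str, str]]]]:
    """Yield request bodies of the form {"value": [...]} with an upload action per document."""
    batch: List[Dict[str, str]] = []
    for doc in docs:
        # "@search.action": "upload" asks the service to insert or replace the document.
        batch.append({"@search.action": "upload", **doc})
        if len(batch) == batch_size:
            yield {"value": batch}
            batch = []
    if batch:
        # Emit the final, partially filled batch.
        yield {"value": batch}


# Example: 2,500 extracted documents become three request bodies.
extracted = ({"id": str(i), "content": f"text of document {i}"} for i in range(2500))
payloads = list(build_index_batches(extracted))
print([len(p["value"]) for p in payloads])
```

Each payload would then be POSTed to the index's documents endpoint by whatever workers did the extraction, so the push rate is limited by the indexing API rather than by a single blob indexer.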
