Make the blob indexer faster
From the tests I've run, a single blob indexer on an S2 service with 1 partition and 1 replica can only process between 50,000 and 75,000 small office documents (1 to 4 pages) in a 24-hour period.
The suggested workaround, restructuring millions of blobs into "directories" of at most 75,000 blobs each and running 12 indexers, is completely out of the question: the pricing model is prohibitive, and it would take an enormous amount of time both to update consumers with the new paths and to move the blobs into the new structure (the move itself being very slow as well).
So to index 5 million documents on a ~$1,200/month tier that supports 60,000,000 documents and 100 gigabytes, I would have to upgrade to a ~$12,000/month tier with 12 indexers and somehow figure out a way to restructure all my blobs to fit the 12-directory model. Even then it would take weeks to complete. How is that even considered acceptable by Microsoft?
At least provide a way to pay for indexer throughput (or for the number of indexers we can run) without having to scale out partitions.
Graham Bunce commented
We gave up on the indexer. Far too slow and far too expensive to scale, so we update content ourselves via the API. More painful to write, but far superior performance.
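For anyone considering the same route: pushing content yourself boils down to batching documents into the search service's REST indexing endpoint instead of relying on the pull indexer. A minimal sketch in Python with only the standard library (the service name, index name, key, and document fields are placeholders; the endpoint path and `@search.action` value follow the public Azure Search REST API):

```python
import json
import urllib.request

SERVICE = "my-search-service"   # placeholder service name
INDEX = "documents"             # placeholder index name
API_KEY = "<admin-api-key>"     # placeholder admin key
API_VERSION = "2020-06-30"      # use the version your service supports

def build_batch(docs):
    """Wrap documents in the payload shape the docs/index endpoint
    expects; mergeOrUpload inserts or updates each document by key."""
    return {
        "value": [{"@search.action": "mergeOrUpload", **doc} for doc in docs]
    }

def push_batch(docs):
    """POST one batch of documents to the indexing endpoint."""
    url = (f"https://{SERVICE}.search.windows.net/indexes/{INDEX}"
           f"/docs/index?api-version={API_VERSION}")
    req = urllib.request.Request(
        url,
        data=json.dumps(build_batch(docs)).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": API_KEY},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example payload (no network call made here):
batch = build_batch([{"id": "1", "content": "extracted text ..."}])
print(json.dumps(batch, indent=2))
```

You control the extraction, the parallelism, and the retry logic yourself, which is exactly the pain (and the performance win) described above.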
As I think about this, another solution would be to expose a high-throughput API that can extract document content. We could then scale out workers (with Azure Batch services, for example) and call that API to extract content from documents at scale.
Blob storage could be integrated with that service (by passing a blob storage URI plus credentials). Documents could also live on-premises, which would allow more throughput than blob storage if needed. We could then index the extracted content using the search indexing API (which could use better performance as well, in my opinion).
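The scale-out step above can be sketched as a simple fan-out: chunk the extracted documents into batches no larger than the service's documented 1,000-documents-per-request cap, then push batches from several workers in parallel. A rough sketch, assuming some `push_batch` callable that POSTs a single batch to the indexing API (the names here are placeholders, not an existing library):

```python
from concurrent.futures import ThreadPoolExecutor

MAX_BATCH = 1000  # documented cap on documents per indexing request

def chunk(docs, size=MAX_BATCH):
    """Split a document list into batches the indexing API will accept."""
    return [docs[i:i + size] for i in range(0, len(docs), size)]

def index_all(docs, push_batch, workers=8):
    """Push batches concurrently; returns one result per batch.
    `push_batch` is a placeholder for whatever function POSTs a
    single batch to the search indexing endpoint."""
    batches = chunk(docs)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(push_batch, batches))

# Example with a stand-in push function that just counts documents:
docs = [{"id": str(i), "content": "..."} for i in range(2500)]
results = index_all(docs, push_batch=len)
print(results)  # prints [1000, 1000, 500]
```

The same chunk-and-fan-out shape works whether the workers run in Azure Batch, a VM scale set, or on-premises; only the `push_batch` implementation changes.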