How does the Azure Search Blob Indexer work with billions of blobs? The only feasible way would be for it to query the Blob Service for all blobs with a LastModified timestamp later than the indexer's last run. But the Blob Service does not support any such filter (grr, why not?!). So how does the Azure Search Blob Indexer work? The transaction cost of enumerating billions of blobs would be enormous.
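To make the problem concrete, here is a minimal sketch (Python, azure-storage-blob SDK; the connection string and container name are placeholders) of the only change-detection strategy the Blob Service actually exposes to a client: enumerate everything and filter locally.

```python
from datetime import datetime, timezone
from azure.storage.blob import BlobServiceClient

CONN_STR = "<your-storage-connection-string>"  # hypothetical placeholder
last_run = datetime(2017, 1, 1, tzinfo=timezone.utc)  # indexer's last run time

service = BlobServiceClient.from_connection_string(CONN_STR)
container = service.get_container_client("documents")  # hypothetical container

# list_blobs() pages through *every* blob, one transaction per page of
# up to 5000 results, because the service offers no server-side
# LastModified filter. Billions of blobs means hundreds of thousands of
# list transactions on every single run.
changed = [
    blob.name
    for blob in container.list_blobs()
    if blob.last_modified > last_run
]
print(f"{len(changed)} blobs changed since last run")
```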
Perhaps the Azure Search Blob Indexer relies upon that "hack" of using Blob Service logging to detect when blobs are updated. But logging is described in the documentation as a "best effort, potentially lossy" service, so this seems unlikely: it would put the indexer at risk of missing updates to blobs.
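For anyone curious what that hack looks like, here is a rough sketch (Python, azure-storage-blob SDK; the connection string and date prefix are placeholders, and field positions follow the documented Storage Analytics 1.0 log format): Storage Analytics writes semicolon-delimited log lines into a `$logs` container, which you can scan for write operations.

```python
from azure.storage.blob import BlobServiceClient

CONN_STR = "<your-storage-connection-string>"  # hypothetical placeholder
service = BlobServiceClient.from_connection_string(CONN_STR)
logs = service.get_container_client("$logs")

WRITE_OPS = {"PutBlob", "PutBlockList", "SetBlobMetadata", "DeleteBlob"}

updated = set()
# Log blobs are laid out as blob/YYYY/MM/DD/hhmm/NNNNNN.log
for log_blob in logs.list_blobs(name_starts_with="blob/2017/01/"):
    data = logs.download_blob(log_blob.name).readall().decode("utf-8")
    for line in data.splitlines():
        fields = line.split(";")
        # In the 1.0 format, fields[2] is operation-type and
        # fields[12] is the requested object key.
        if len(fields) > 12 and fields[2] in WRITE_OPS:
            updated.add(fields[12])

print(f"{len(updated)} blobs touched according to the logs")
```

But as noted above, the docs call logging best-effort, so anything built on this can silently miss writes.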
MSFT thoughts would be appreciated!
Please post questions about Azure Search on StackOverflow or our MSDN forum (https://social.msdn.microsoft.com/Forums/en-US/home?forum=azuresearch) to ensure a quick response.
A single blob indexer (or even a single search service) is not going to cope with billions of documents, due among other reasons to the scalability limits on enumerating blobs that you note.
Our advice for scaling blob indexing is to provision multiple datasource/indexer pairs that all write to the same index (with the datasources pointing to different storage containers or storage accounts); see the sketch below.
Make sure your search service is scaled to run multiple indexers concurrently; keep in mind that one search unit can run one indexer at a time.
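For example, here is a rough sketch of that fan-out against the REST API (Python with requests; the service name, keys, index name, container names, and api-version are placeholders, so adjust to whatever your service supports):

```python
import requests

SERVICE = "https://<your-service>.search.windows.net"  # placeholder
HEADERS = {"api-key": "<admin-key>", "Content-Type": "application/json"}
API = "api-version=2016-09-01"  # use the version your service supports
CONN = "<storage-connection-string>"  # placeholder

for i in range(12):  # one pair per storage container, e.g. docs-00 .. docs-11
    name = f"blobs-{i:02d}"
    # Create/update a datasource pointing at one container.
    requests.put(
        f"{SERVICE}/datasources/{name}?{API}", headers=HEADERS,
        json={
            "name": name,
            "type": "azureblob",
            "credentials": {"connectionString": CONN},
            "container": {"name": f"docs-{i:02d}"},
        },
    ).raise_for_status()
    # Create/update an indexer for that datasource; every indexer
    # targets the same index.
    requests.put(
        f"{SERVICE}/indexers/{name}?{API}", headers=HEADERS,
        json={
            "name": name,
            "dataSourceName": name,
            "targetIndexName": "documents",
            "schedule": {"interval": "PT1H"},
        },
    ).raise_for_status()
```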
Hope that helps!
Your Azure Search team.
For anyone else reading this: not only can the current blob indexer not handle billions of documents, it tops out at a few thousand small documents (1-2 pages each) during indexing. It is currently impossible to get millions of documents parsed by the blob indexers. Even if you scale to 12 partitions (which would cost a ****load of $$), you can only index 600 000 (12 × 50 000) documents a day, and that costs about $14 000 a month :S.
This limitation is even more ridiculous when you consider that the service itself can hold 12 × 60 000 000 documents at that tier... but you would never be able to get them in there (see the quick math below).
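To put the mismatch in numbers, a quick back-of-envelope calculation using only the figures quoted above (these are the limits cited in this thread, not official quotas):

```python
# All figures are the numbers quoted in this thread.
docs_per_indexer_per_day = 50_000
indexers = 12                       # one indexer per search unit
daily_throughput = indexers * docs_per_indexer_per_day   # 600 000/day

tier_capacity = indexers * 60_000_000                    # 720 000 000 docs

days_to_fill = tier_capacity / daily_throughput          # 1200 days
monthly_cost = 14_000  # USD, as quoted above
print(f"{days_to_fill:.0f} days to fill the tier")
print(f"~${days_to_fill / 30 * monthly_cost:,.0f} spent just on ingestion")
```

That works out to roughly 1200 days, i.e. over three years and more than half a million dollars, just to fill the tier at the quoted indexer throughput.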
Not even going to talk about how this whole 12-indexer solution also relies on having your documents neatly set up in 12 containers or virtual directories in blob storage...
I'm not sure whether to lol or facepalm at the architecture and pricing decisions of this service... "Look, at the S2 tier you can have 60 000 000 documents!" But getting them in there is another story...