Blob indexer should be able to skip unsupported content types instead of treating them as errors
Hi, this can now be done using new indexer configuration parameters to specify “include” and “exclude” file name extensions. For details, see https://azure.microsoft.com/en-us/documentation/articles/search-howto-indexing-azure-blob-storage/#using-indexer-parameters-to-control-document-extraction
Your Azure Search team
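For readers who want to see what the include/exclude settings look like in practice, here is a minimal sketch of an indexer definition built in Python. It assumes the `indexedFileNameExtensions` and `excludedFileNameExtensions` configuration keys described in the linked article, each taking a comma-separated string. The indexer, data source, and index names are hypothetical placeholders, and the helper function is only for illustration, not part of any SDK.

```python
import json


def build_indexer_definition(name, data_source, target_index,
                             include_exts=None, exclude_exts=None):
    """Build a blob indexer definition with optional extension filters.

    The two configuration keys below are assumed from the linked
    documentation; check the current article before relying on them.
    """
    configuration = {}
    if include_exts:
        # Only blobs with these extensions would be indexed.
        configuration["indexedFileNameExtensions"] = ",".join(include_exts)
    if exclude_exts:
        # Blobs with these extensions would be skipped.
        configuration["excludedFileNameExtensions"] = ",".join(exclude_exts)
    return {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": target_index,
        "parameters": {"configuration": configuration},
    }


# Hypothetical names, used only for illustration.
definition = build_indexer_definition(
    "blob-indexer",
    "blob-datasource",
    "docs-index",
    include_exts=[".pdf", ".docx"],
    exclude_exts=[".png", ".jpeg"],
)
print(json.dumps(definition, indent=2))
```

The resulting JSON is the kind of body you would send when creating or updating the indexer. Note that this only filters by file name extension. As the comments here point out, it does not help when a file with a supported extension turns out to be unreadable.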
I have added a new feature request....
I can confirm Nick Caruso's comment. Even with formats that should be supported, like docx and pdf, the indexer can and does fail. Some of the documents it fails on do not even appear to be corrupt, i.e. I can download the file from Azure Blob storage, open it in Word, and it seems like a valid document.
In any case, corrupt data is as inevitable as buggy code; resilience is the only workable solution. The indexer shouldn't give up on the first hiccup it encounters. I have grave doubts that Azure Search will be a viable solution without changing the behavior of the indexer to simply log the failure and continue indexing the remaining documents.
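To make the requested "log and continue" behavior concrete, here is a minimal sketch in Python. It is not how the Azure Search indexer is implemented. The `index_all` and `fake_extract` functions are hypothetical stand-ins that only illustrate the pattern of recording a failure and moving on to the next document.

```python
import logging

logger = logging.getLogger("indexer_sketch")


def index_all(documents, extract):
    """Try to extract every document; log failures instead of stopping."""
    indexed, failed = [], []
    for name, content in documents:
        try:
            indexed.append((name, extract(content)))
        except Exception as exc:
            # Record the failure and keep going with the remaining documents.
            logger.warning("Skipping %s: %s", name, exc)
            failed.append(name)
    return indexed, failed


def fake_extract(content):
    """Hypothetical extractor: fails on missing content, else upper-cases it."""
    if content is None:
        raise ValueError("unreadable content")
    return content.upper()


docs = [("a.pdf", "hello"), ("bad.pdf", None), ("c.docx", "world")]
indexed, failed = index_all(docs, fake_extract)
```

With this pattern, one unreadable file ends up in the `failed` list for later inspection while the other two documents are still indexed.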
Chris Lucas commented
The include/exclude setting does not make a difference if the error occurs in the indexer itself because of an invalid file or similar. My PDF indexing is stopping because a few of the PDF files are corrupt. It seems silly to have me hunt them all down.
Nick Caruso commented
I'm confirming that this is still happening. I don't have total control in my app as to the content type, and so when the indexer is running, it completely stops as soon as it sees an unsupported type.
Does this error stop the indexer from indexing other valid content?