Ignore thumbnails when indexing Word documents
Word can save a document with an embedded thumbnail that contains the text content of the first page. When Azure Search indexes a document, it also indexes anything embedded in it. This is generally useful behaviour, but in this case it means the first page's text is indexed twice.
We cannot control the documents our users provide, so Azure Search should handle this and ignore thumbnails when indexing Word documents.
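For anyone who is able to preprocess files before they reach the indexed data source, here is a rough sketch of a possible workaround. It assumes the thumbnail is stored as a separate part inside the .docx package and referenced from `_rels/.rels` with the Open Packaging Conventions thumbnail relationship type. The function name `strip_thumbnail` is made up for illustration, and I have not verified that this removes the duplicated text in Azure Search results.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

RELS_PATH = "_rels/.rels"
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
# Relationship type that the packaging spec uses for a package thumbnail.
THUMBNAIL_TYPE = REL_NS + "/metadata/thumbnail"


def strip_thumbnail(docx_bytes: bytes) -> bytes:
    """Return a copy of a .docx package without its thumbnail part.

    Hypothetical helper: drops the thumbnail relationship from
    _rels/.rels and skips the part it points to when rewriting the zip.
    """
    src = zipfile.ZipFile(io.BytesIO(docx_bytes))

    # Serialize the relationships file with a default namespace (no prefix).
    ET.register_namespace("", REL_NS)
    root = ET.fromstring(src.read(RELS_PATH))

    thumbnail_parts = set()
    for rel in list(root):
        if rel.get("Type") == THUMBNAIL_TYPE:
            # Targets are relative to the package root; tolerate a leading "/".
            thumbnail_parts.add(rel.get("Target", "").lstrip("/"))
            root.remove(rel)

    out = io.BytesIO()
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            if item.filename in thumbnail_parts:
                continue  # drop the thumbnail part itself
            if item.filename == RELS_PATH:
                data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
            else:
                data = src.read(item.filename)
            dst.writestr(item.filename, data)
    return out.getvalue()
```

Usage would be something like `cleaned = strip_thumbnail(open("report.docx", "rb").read())` before uploading `cleaned` to storage. This only helps when the upload path is under your control, which is exactly what this request is trying to avoid having to rely on.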
Thank you for your feedback. While it is unlikely we’ll address this suggestion in the near future, we’ll reassess based on the number of votes it receives.
Azure Search Product Team
Henry Ing-Simmons commented
"it is unlikely we’ll address this suggestion in the near future" - why not? This is a fundamental flaw in the indexing of documents and results in duplicated content in the output of Azure Search!