Support user-provided query in SQL indexer
Let's say I run an indexer to read a really large table, with a billion records, out of which I'm interested in only a few thousands. This is a database I do not own, and I have only read permissions to it.
In this case it would be better to have a simple filter on which data to keep in the indexer throw away the irrelevant data.
For eg: "Index this table where TeamName equals MyTeam"
Even though the indexer runs on a large set of data, my index would only contain a smaller subset of data.
Smaller the index, faster the search results.
Hope my explanation was clear.
Ad mentioned below, please create a View in SQL and then leverage this in the Indexer.
Where condition to be added to the data source e.g. SQL. Dropdown with table columns and value field to retrieve only records matching the required value rather than the whole table
Allow Indexer Data Source to be a Table Valued Function or Stored Procedure, with a @highWaterMarkColumnName_value parameter.
We currently have a view, used by Azure Search Service, that joins 6 tables, 3 of which have ~22 million rows and 1 of which has ~65 million rows.
When doing an index update request, Azure Search queries this view with a "Where > @highWaterMarkColumnName_value" clause.
Since this @highWaterMarkColumnName_value tracks changes made in the last 5 minutes or so, it will identify anywhere from a few hundred to a few thousand affected view rows.
But the current update query from Azure Search causes the Azure SQL DB will do a full table scan of each of the 6 underlying tables.
This forces it to processs more than 100 million records, from the underlying table, which takes anywhere from 3.5 to ~7 minutes on a P11 Azure SQL DB!
Whereas if the Indexer Data Source could be a Table Valued Function or Stored Procedure, with a @highWaterMarkColumnName_value parameter,
the view could be rewritten as a cascading CTE which queries each of the underlying tables with the @highWaterMarkColumnName_value, before joining them with the other base tables.
This allows the Azure SQL DB to process a few thousand rows as opposed to > 100m rows.
We've prototyped this type of query modification and it currently finishes in ~1 second.
Therfore if Azure Search would support the use a Table Valued Function or Stored Procedure, as the Indexer Data Source, it would give us the ability to control the efficiency of the Indexer update request!
And this would make Azure Search a viable option for us to use in Production.
Joseph Taber commented
Create a view that selects and filters from your large table, then create an indexer on the view. Combine with a rowversion column and the HighWaterMarkChangeDetectionPolicy and that should be what you're looking for!
Don't forget to create a sql index that makes querying the view fast. :)
Shriladha Balasubramanian [MSFT] commented
Scenarios I faced for querying SQL indexer in Azure Search:
1. Azure Search indexer does not accept empty values to be keys.
2. Enabling Track deletions feature expects the column to be string or Boolean. I have the column as smallint in the SQL Azure database.
If querying is enabled I can query for not nulls and convert smallints to bits for Deleted column.