Upvotes: 47
<=-=Jul 3 2014 10:03AM=-=>
This would definitely be a good build. I walked through an initial investigation of clustered columnstore on one of our data sets, and the number of reads was reduced by about 80% for some queries in the workload once I started paying careful attention to segments and loading data in a single-threaded manner to optimize segment elimination. However, this results in slower loading of data, and it would be great to be able to create columnstore indexes in order for segment elimination.
Having an optional ORDER BY clause for the columnstore initial creation, removing the need to first create a clustered index in order to control order and allowing parallelism without breaking order, would be particularly powerful.
<=-=Jul 3 2014 10:30AM=-=>
Neugebauer: thanks. you identified the issue correctly. This is something we are actively looking. One question
(1) once the index is build, the subsequent inserts may not be ordered, Is this a big issue?
thanks
Sunil
<=-=Jul 4 2014 8:40AM=-=>
Thank you for the commentary, Sunil.
In my personal opinion, the subsequent inserts can be unordered, since we can easily rebuild index/partition to get them back in order.
Clustered Columnstore Index is not a truly ordered index and hence there is no need to upheld the order.
It would be awesome to have it ordered one day, but I am not holding my breath:),
especially since one can get items order by Rebuilding CCI.
I like the suggestion by Geoff of eliminating by rowstore index creation before CCI build,
but given it’s impact (~0.7 times slower)
and syntax implications (column order specification) – I see this feature coming into a potential ordered version of the Columnstore.
<=-=Aug 6 2014 1:54PM=-=>
Sunil,
I also agree that the subsequent inserts can be unordered. Neugebauer’s logic makes sense to me. This is really about optimizing the initial creation of large columnstore tables. We would typically be loading new data only by building / rebuilding an entire new partition at a time, and it seems like that must be a common use case for many of the large data sets towards which columnstore is geared.
<=-=Nov 11 2014 3:31PM=-=>
Niko/Geoff:
Thanks for your suggestions. We are considering this for the next release
thanks
Sunil
<=-=Dec 8 2014 5:40PM=-=>
Niko: I am closign this as this is in the plans for next release. If this gets pushed out unepectedly, please feel free to re-activate
thanks
Sunil
<=-=Apr 5 2016 10:13AM=-=>
In SQL Server 2016, parallel index (re)builds of Column Store Indexes will not preserve ordering.
As noted, the workaround is to do the build with MAXDOP = 1, so it becomes a tradeoff between index build performance and query performance.
We clearly understand the problem space, and why this would be beneficial. Unfortunately, we haven’t solved some of the technical issues involved.
We will continue to look into this area.
Kevin
<=-=Jun 29 2016 9:59AM=-=>
Thanks for the update, Kevin.
I’d still love to see a version in the future where the Columnstore index build runs in parallel, builds segments in the desired order, and does not require creating an intermediate b-tree index first. But definitely understand there may be tough technical challenges involved.
<=-=May 16 2017 6:42AM=-=>
For pretty much every client that we work with, this keeps coming up as a pain point. We’re essentially telling clients to choose between much slower ETL processes or reduced DW query performance for certain query patterns that would benefit hugely from segment elimination, including many of the most common reporting and analytic queries they are performing. This feature would be hugely valuable in order to unlock both good ETL performance and good DW query performance, which are currently at odds with each other.
Our request would be two features:
(1) Allow optional control of the order in which new rows are packing into segments when using INSERT INTO … WITH TABLOCK to insert data into a clustered columnstore index in parallel
(2) Add optional control of the row order to the REBUILD command for columnstore indexes
The first of these two features is the critical one, as it unlocks the parallel load capability into a single columnstore table (which we would use as a staging table to swap into a partition of a much larger fact table). Even if explicit rebuild support is not present, a rebuild could still be performed by re-inserting all of the data using feature (1). Feature (2) is a nice to have that would be convenient, but we’d be satisfied with (1) even if (2) is not implemented or not implemented in the same release.
Controlling the order in which rows are distributed into row groups will yield huge DW query performance benefits for queries that can benefit from segment elimination by providing the ideal scenario: each segment has a disjoint range (almost, ignoring some possible ties) on the leading column of our ordering. However, it is understood that neither of these features would guarantee the order of future inserts, and neither would impact the delta store in cases where the number of rows inserted is small enough that a new segment is not created. Therefore, the query optimizer would continue to avoid making any assumptions about the order of data for a clustered columnstore index loaded using this feature.
Any updates on this? I don't see any documentation about it being included in SQL Server 2019 :(