[In Preview] Hash distribute fact table on multiple columns
A large fact table often has multiple join keys to multiple dimension table, so while it can be opitmised for one join it always creates data movement for other joins. The hash distribution should accept multiple columns so all joins are optimised and create no data movement for dimension joins.
This feature is currently in preview and will be available to all users in coming months.
Nick Galemmo commented
How is this supposed to work? While it sounds great, it seems like a feature very few users would want.
The only way colocation works is if the data in the two tables are distributed on the same key and you are performing an equijoin on that key. To support colocation on multiple different keys you must replicate the data to physically redistribute the data on those different key. Even if you do that, you cannot co-locate a join using multiple tables distributed on different keys.
As it has been implemented, use of multiple columns only creates a different hash key. All co-located joins must be between tables with the exact same distribution using ALL the columns of the distribution definition in the join.. The only reason to use multiple columns is if is necessary to get a even distribution.
In dimensional models, standard practice is to distribute dimensions on their primary key and facts on the foreign key to the largest dimension.
James Tran commented
could someone clarify whether this is a feature to distribute on different columns so that joins to different tables are more efficient or if it is a multi-column hashing option (essentially hashing the concatenation of multiple columns)?
I'm interested in the former and don't want to raise my hopes in case it's the latter...
Christopher Teljstedt commented
Is it possible to get access to this preview in any way?
Shall concatenating all SKs and using it in hash table column help taking the hash table advantage? I anyway has the Fact PK for defining this. For materialized views it is expecting to define distribution mode right. what should be effective for a fact table from performance point of view?
I agree that what is being requested doesn't work with MPP architecture. It sounds like MSFT is working on the feature to have a multi column HASH distribution which will not actually solve what is being requested. If you have a multi column HASH, data from other tables will only co-locate when their hash key is the same set of columns,
Definitely need this
Kin Tang commented
MPP Databases don't work like this. Other MPP database such as Netezza will not work if you have multiple columns HASH but only JOIN on one of the column. Do to the fact the data is distributed/stored using the HASH of all of columns for FACT while Dimension is distributed/stored using a different HASH of one column, the data ends up being not co-located.
In order get the benefit of the distribution, both table in the JOIN need to be co-located thus using the same HASH key for distribution. If Fact and Dimension table are distributed using different HASH columns. It will end up calling DMS every execution because the tables are not co-located.
Small Dimension table should be replicated to minimize data movement.
Reference; Designed systems using Netezza, AWS RedShift and Azure SQL Data Warehouse.
Seconded, I had assumed this functionality existed, although there is the work around of simply concatenating the feilds
Glen Swan commented