[In Preview] Hash distribute fact table on multiple columns
A large fact table often has multiple join keys to multiple dimension table, so while it can be opitmised for one join it always creates data movement for other joins. The hash distribution should accept multiple columns so all joins are optimised and create no data movement for dimension joins.
This feature is currently in preview and will be available to all users in coming months.
I agree that what is being requested doesn't work with MPP architecture. It sounds like MSFT is working on the feature to have a multi column HASH distribution which will not actually solve what is being requested. If you have a multi column HASH, data from other tables will only co-locate when their hash key is the same set of columns,
Definitely need this
Kin Tang commented
MPP Databases don't work like this. Other MPP database such as Netezza will not work if you have multiple columns HASH but only JOIN on one of the column. Do to the fact the data is distributed/stored using the HASH of all of columns for FACT while Dimension is distributed/stored using a different HASH of one column, the data ends up being not co-located.
In order get the benefit of the distribution, both table in the JOIN need to be co-located thus using the same HASH key for distribution. If Fact and Dimension table are distributed using different HASH columns. It will end up calling DMS every execution because the tables are not co-located.
Small Dimension table should be replicated to minimize data movement.
Reference; Designed systems using Netezza, AWS RedShift and Azure SQL Data Warehouse.
Seconded, I had assumed this functionality existed, although there is the work around of simply concatenating the feilds
Glen Swan commented