How can we improve Microsoft Azure SQL Data Warehouse?

Replicated Table Support

The feature would be similar to Redshift's DISTSTYLE ALL distribution style: a full copy of the table is stored on every node, which reduces data transfer between distributions/nodes and consequently improves join performance.
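
For comparison, Redshift expresses this today with DISTSTYLE ALL; below that is what the requested option might look like in Azure SQL DW. The Azure syntax is an assumption modeled on the existing DISTRIBUTION = HASH / ROUND_ROBIN options, and dim_date is an illustrative table name, since the feature has not shipped:

    -- Redshift: DISTSTYLE ALL stores a full copy of the table on every node
    CREATE TABLE dim_date (
        date_sk       INT  NOT NULL,
        calendar_date DATE NOT NULL
    )
    DISTSTYLE ALL;

    -- Hypothetical Azure SQL DW equivalent (syntax is an assumption)
    CREATE TABLE dim_date (
        date_sk       INT  NOT NULL,
        calendar_date DATE NOT NULL
    )
    WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);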

204 votes

    Mainul Hassan shared this idea

    18 comments

      • sdoubleday commented

        I'm glad to see that this was moved to "Started" in late March 2017. Is there a timeline you can share with us? Can we expect this feature in Q2 2017? By the end of 2017?

      • Anonymous commented

        It's been almost a year since you promised you were working on a design for replicated tables in ASDW. Can you PLEASE tell us when you expect to have this feature available?

      • Anonymous commented

        I am curious whether folks have actually tried distributing their DIMs on the surrogate key, running the same queries on an APS and on ASDW, and measuring performance. We have, and we do not see an urgent need to keep using replicated tables on either the APS or ASDW. In fact, we have been converting all but the smallest DIMs to distributed on the APS, and all of them on the ASDW. We do not use round-robin tables. Our performance has been the same, if not better. The Broadcast Move replicates the DIM tables quite quickly. The tables persist in memory unless there is memory contention, and the query completes the same or better than when we used replicated tables. (Note: Teradata does not support replicated tables and has functioned quite well for decades.)

        To be more specific: on our APS, we have historically replicated our DIM tables. On the ASDW, we distribute all DIMs on their surrogate key. When running the same queries on both systems, we have found no performance degradation using distributed DIMs on the ASDW versus replicated DIMs on the APS. To test this further, we switched all but the smallest replicated DIMs on the APS to distributed and reran the queries on the APS. Performance was never worse and often better. What I see is that the Broadcast Move operation 'replicates' the data into memory by reading from all distributions simultaneously. Unless you are memory-bound, these pages stay in memory for the duration of the query execution, and performance may be just fine or better; it certainly is in our case. I encourage you to test this and verify that you really are impacted by not being able to create replicated tables in ASDW. (A sketch of the surrogate-key distribution pattern is below.)

        DB statistics: DWU 2000. Largest fact table is 4 billion rows; largest DIM is 1 billion rows. Several joins go through many-to-many factless fact tables; the largest factless fact is 1.6 billion rows. The largest queries can include one fact table and up to 15 factless fact and DIM tables. Resource class: mediumrc.
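
        A minimal sketch of the pattern described in this comment. The table and column names (DimCustomer, CustomerSK, Stage_DimCustomer) are illustrative assumptions, not from the original post:

            -- Hash-distribute the dimension on its surrogate key instead of
            -- replicating it (CTAS with an explicit distribution choice).
            CREATE TABLE dbo.DimCustomer
            WITH
            (
                DISTRIBUTION = HASH(CustomerSK),
                CLUSTERED COLUMNSTORE INDEX
            )
            AS
            SELECT CustomerSK, CustomerName, Region
            FROM dbo.Stage_DimCustomer;

        Joins on CustomerSK are co-located only when the fact table is distributed on the same key; otherwise the optimizer typically broadcasts the smaller dimension at query time, which is the Broadcast Move behaviour described above.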

      • Peter Gadsby commented

        @Chris Aiken replicated tables are indeed supported on the on-premises APS; however, in the Azure version (Azure DW) they are not. The only options are round-robin and hash distribution, which means that joining dimensions to a fact table causes a broadcast move, which is obviously sub-optimal. You can see the broadcast in the distributed query plan, as sketched below.
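
        One way to confirm this behaviour: EXPLAIN in Azure SQL DW returns the distributed plan as XML, and a BROADCAST_MOVE operation in that plan indicates the dimension is being copied to every distribution at query time. The table and column names here are illustrative assumptions:

            -- Inspect the distributed plan for a dimension/fact join.
            -- A BROADCAST_MOVE operation in the XML output marks the data
            -- movement that replicated tables would avoid.
            EXPLAIN
            SELECT f.SaleAmount, d.CustomerName
            FROM dbo.FactSales AS f
            JOIN dbo.DimCustomer AS d
                ON f.CustomerSK = d.CustomerSK;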

      • Paul Schmidt commented

        This is a must-have feature. We currently have an Azure DWH in production, and query performance is less than optimal. We have implemented a workaround; however, replicated table support would make such workarounds unnecessary. When will this feature be available?

      • Darryll Petrancuri commented

        There's a workaround that came to me just yesterday that I believe may work. It's a hack, but considering the performance ramifications of keeping the DMS minimally engaged, I think it's worth a shot.

        Create a new column on any dimension needing to be replicated and call it ReplicationInstance or something to that effect. Use this column as the HASH distribution key.

        When performing any ETL of dimensional data into Azure SQL DW, you'll essentially need to 'duplicate' data 60 times, as 60 is the number of distributions within Azure SQL DW.

        I suspect this is what the REPLICATE keyword in APS actually does under the covers anyway. Yes, it complicates the ETL slightly and certainly increases storage requirements, but considering the relative cost of storage compared to compute, it seems to me an obvious choice, because you'll require less compute and the DMS won't need to be engaged.

        Of course, I believe this is all predicated on the D-SQL environment recognizing the local availability of the dimensional data on each distribution. A sketch of the approach is below.
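
        A rough sketch of the proposed hack, under stated assumptions: DimProduct and its columns are illustrative, and a helper table dbo.Numbers(n) holding the integers 1 through 60 is assumed to exist (60 being the fixed number of distributions in Azure SQL DW):

            -- Sketch only: store 60 copies of every dimension row, keyed by
            -- a ReplicationInstance column used as the HASH distribution key.
            CREATE TABLE dbo.DimProduct_Replicated
            WITH
            (
                DISTRIBUTION = HASH(ReplicationInstance),
                CLUSTERED COLUMNSTORE INDEX
            )
            AS
            SELECT n.n AS ReplicationInstance,
                   d.ProductSK,
                   d.ProductName
            FROM dbo.DimProduct AS d
            CROSS JOIN dbo.Numbers AS n;  -- duplicates each row 60 times

        One caveat worth noting: hashing the values 1 through 60 is not guaranteed to place exactly one copy on each distribution, so this only approximates true replication.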

      • Ikaros commented

        This is quite crucial to have in order to proceed with using Azure SQL Data Warehouse in production. Please provide an approximate timeframe for when you expect this functionality to be released.
