PolyBase: PARTITIONED BY functionality when creating external tables.
I have a large number of files which are read with Hive using a partitioning scheme. The PARTITIONED BY functionality, which is so commonly used in Hive, is missing from PolyBase.
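For reference, a minimal sketch of the Hive pattern being requested; the table and column names here are illustrative, not from the original post:

    -- HiveQL: date_col is part of the table schema, but its values
    -- live only in the folder names (.../date_col=2020-01-01/),
    -- not inside the Parquet files themselves.
    CREATE EXTERNAL TABLE events (
        event_id BIGINT,
        payload  STRING
    )
    PARTITIONED BY (date_col DATE)
    STORED AS PARQUET
    LOCATION '/data/events';

PolyBase's CREATE EXTERNAL TABLE has no equivalent clause, which is the gap this request is about.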
Thank you for your request. I would like to understand whether this request concerns the SQL DW or the SQL Server implementation of PolyBase.
Quite frustrating to develop and build out a modern data architecture, only to get to this end point and find that such a necessary feature is unavailable in Azure DW/Synapse. It's a massive failure not to have this available, and to never mention it in your list of unavailable features.
This is very much required. If you support the Parquet format, obviously partitioning also needs to be supported. I wonder why this still hasn't been implemented, even after 2.5 years.
Greg Galloway commented
I ran into a situation with a partitioned Hive table in Parquet format where the partition column wasn't among the actual Parquet file columns. Possibly this is a Hive optimization: when partitioned by day, the partition column appears only in the partition folder names ("date_col=2020-01-01", "date_col=2020-01-02") and not in the files themselves. Loading this with PolyBase is challenging, as you have to create one external table per folder. Adding partition comprehension to PolyBase or the COPY command would be a big help here, especially if a query against a partitioned external table could read only the folders matching a WHERE clause filter on the partition column.
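A minimal sketch of that one-table-per-folder workaround, assuming a pre-created data source and file format (MyDataLake and ParquetFormat are placeholder names, not from the original comment):

    -- One external table per partition folder; the partition value
    -- exists only in the folder name, so it cannot be a column here.
    CREATE EXTERNAL TABLE dbo.events_20200101 (
        event_id BIGINT,
        payload  NVARCHAR(4000)
    )
    WITH (
        LOCATION    = '/data/events/date_col=2020-01-01/',
        DATA_SOURCE = MyDataLake,     -- assumed, created beforehand
        FILE_FORMAT = ParquetFormat   -- assumed, created beforehand
    );
    -- ...and again for date_col=2020-01-02 and every other folder.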
Just putting half-cooked products out won't do.
This was an absolute no-brainer.
Brown, Jeremy (CW) commented
This is for Azure SQL DW. Yes, this is a much-needed feature. To get partition isolation you need a Hive-compatible interface; is it really necessary to stand up an HDInsight cluster to do this? That seems a very expensive way to get basic Hive capability that should be part of PolyBase, guys. This is a miss, feature-wise.
Jason Horner commented
Any update on this? It would be a major enhancement. We are currently forced to work around it with a complex set of dynamic SQL that repoints the data source to the current day's partition. Adding a hint to tell PolyBase the partitioning scheme would be extremely helpful to support predicate pushdown and partition elimination.
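A minimal sketch of that kind of dynamic-SQL workaround, with placeholder object names (dbo.events_today, MyDataLake, ParquetFormat); it drops and recreates the external table so its LOCATION points at the current day's partition folder:

    -- Build today's folder path, e.g. /data/events/date_col=2020-01-17/
    DECLARE @loc NVARCHAR(200) =
        N'/data/events/date_col=' + CONVERT(CHAR(10), GETDATE(), 23) + N'/';

    -- Drop yesterday's definition if it exists.
    IF OBJECT_ID(N'dbo.events_today') IS NOT NULL
        DROP EXTERNAL TABLE dbo.events_today;

    -- Recreate the external table over today's partition folder only.
    DECLARE @sql NVARCHAR(MAX) = N'
    CREATE EXTERNAL TABLE dbo.events_today (
        event_id BIGINT,
        payload  NVARCHAR(4000)
    )
    WITH (
        LOCATION    = ''' + @loc + N''',
        DATA_SOURCE = MyDataLake,
        FILE_FORMAT = ParquetFormat
    );';

    EXEC sp_executesql @sql;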
Tom Petz commented
We use this capability in ADLA and need it in SQL DW as well. Without it, scalable loading and unloading of data to and from a data lake is difficult.
While we need both, reading by partition is more important than writing.
Scott Person commented
Why is this still marked as "NEED-FEEDBACK"? It's in the SDW section and was indicated as being SDW in May (see below).
In a recent web talk, Kuber Sharma recommended using PolyBase as a data lake interface. That recommendation should be reconsidered: without partitioning, it isn't viable. All data lakes partition.
Scott Person commented
Even though Al was a bit harsh (understandably so if he's blocked), I agree. This is basic functionality for anyone using data stored in Hadoop, or anything even Hadoop-like. You all should take a look at what AWS is doing with Glue. It's a shared metastore: you create a table once and you can read it in several different DB engines. A table in Glue can be queried in Redshift (SQL DW), EMR (HDInsight), and Athena (Azure has nothing even close). When implemented well, you wouldn't even need to create the external tables in SQL DW; they'd already be there when you cranked up the cluster.
Hugo PT commented
In our corp this is a must-have feature that PolyBase is missing. For large result sets we write additional code to generate one table per partition, with a view that unions the results and adds the partition fields back as hardcoded values (see the sketch below). It's extra work, extra risk of failure, etc. For smaller result sets we just use SSIS over Hive: less work and safer.
This concerns SQL DW.
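A minimal sketch of that union-view workaround, again with placeholder names; each branch hardcodes the partition value that only exists in the folder name:

    -- dbo.events_20200101 / dbo.events_20200102 are per-partition
    -- external tables, one per folder, as sketched earlier.
    CREATE VIEW dbo.events AS
    SELECT CAST('2020-01-01' AS DATE) AS date_col, event_id, payload
    FROM dbo.events_20200101
    UNION ALL
    SELECT CAST('2020-01-02' AS DATE) AS date_col, event_id, payload
    FROM dbo.events_20200102;
    -- ...one branch per partition, regenerated whenever folders change.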
Rob Poidomani commented
I would vote for SQL DW. Having PolyBase able to eliminate partitions of data without having to deal with multiple external tables, etc. would make life so much easier and would avoid performance issues on large, append-only historical datasets. Either that, or make it able to connect to a Hive metastore that can.
al turker commented
I can't believe they create a utility claiming Hadoop support and leave out the partition function.
Congrats Windows, you did it again. Because your utility lacks the most common functionality that any Hadoop utility must have, I have wasted two days and my project is now roadblocked. Bravo.