Add support for "Delta Lake" file format in Azure Data Lake Store / HDFS
Today we can query data stored in parquet files on ADLS. It would be fantastic to extend this to support the new "Delta Lake" file format recently open-sourced by the Databricks team (see https://delta.io).
This would allow us to take advantage of ACID guarantees that the delta format brings to the data lake.
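For reference, reading the same data through the open-source Delta Lake reader in Spark looks roughly like the sketch below today; the storage path and session setup are illustrative, not part of the request.

```python
# Minimal sketch: reading a Delta table from ADLS with open-source Delta Lake
# (the delta-spark package) in Spark. The storage path below is hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-read-example")
    # Register the open-source Delta Lake extensions.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Unlike a raw parquet read, the Delta reader consults the _delta_log
# transaction log, so only the current (ACID-committed) snapshot is returned.
df = spark.read.format("delta").load("abfss://data@myaccount.dfs.core.windows.net/events")
df.show()
```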
Suresh Kumar Pathak commented
This feature would be a great help and would make for simpler pipelines. Eagerly waiting for its implementation.
Julian W commented
I managed to use the ADF mapping data flow in Synapse Studio to load some data into a Delta-format file in Azure Data Lake, and then updated one record twice.
But when I query the Delta file in Synapse on-demand, that record returns 3 rows, each coming from a separate file.
I know Delta Lake supports time travel; it looks like the 3 rows are just different versions of the same record.
My question now is how to write the query so it returns only the latest version of each record. Of course, I could add a timestamp to each row, but there should be an out-of-the-box solution, right?
Another question: if the Delta file does keep all versions of the same record, how can I remove the historical versions? I know that in the mapping data flow I can set the vacuum days on the sink; by default it is 0, which means 30 days. The minimum I can set is 1 day, which means that if I update the record multiple times in the same day, the historical versions will not be removed.
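In case it helps anyone: when writing from a Spark session rather than the data flow sink, the open-source delta-spark package exposes vacuum directly. A rough sketch (the table path is illustrative):

```python
# Sketch: removing historical file versions with open-source Delta Lake's VACUUM.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# Assumes a SparkSession already configured with the Delta extensions,
# as in the earlier sketch.
spark = SparkSession.builder.getOrCreate()

table = DeltaTable.forPath(spark, "abfss://data@myaccount.dfs.core.windows.net/events")

# OSS Delta's default retention is 168 hours (7 days). Retaining less than
# that requires explicitly disabling the safety check, at the risk of
# breaking readers of older snapshots.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
table.vacuum(24)  # delete files no longer referenced by snapshots newer than 24 hours
```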
OK, so the Delta table does contain the historical versions of a record. The issue is that Synapse does not understand the Delta format; it just reads all the records in the underlying parquet files, which is why it returns the historical versions as well.
We are using Synapse for the Power BI reports that read data from Azure Data Lake. It is quite disappointing that the ADF mapping data flow supports the Delta format but Synapse does not.
Technically, we could get the latest version of each record ourselves, but the potential issue is performance; a sketch of what I mean follows.
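A rough sketch of that manual fallback, assuming a business key column id and a timestamp column last_updated (both just placeholders):

```python
# Sketch of the manual fallback: read the raw parquet files and keep only the
# newest version per key. Column names "id" and "last_updated" are hypothetical.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

raw = spark.read.parquet("abfss://data@myaccount.dfs.core.windows.net/events")

# Rank each record's versions newest-first, then keep rank 1.
w = Window.partitionBy("id").orderBy(F.col("last_updated").desc())
latest = (
    raw.withColumn("rn", F.row_number().over(w))
       .filter("rn = 1")
       .drop("rn")
)
# The window shuffle over the full history is the performance concern mentioned above.
latest.show()
```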
GH LG commented
This feature would also be of great interest in the serverless option of Synapse.
Simply defining views on top of the Delta Lake storage would make it easy to integrate everything with SQL-based applications.
Steffen Mangold commented
Highly needed; otherwise there is no chance of replacing Databricks.
Pedro Caeiro commented
Eagerly waiting for news on this. Given how widespread Delta Lake is, and the fact that it is an open-source library, I would expect Microsoft to implement this feature ASAP to compete directly with Databricks; otherwise Spark pools aren't nearly as relevant or useful.
Snata Ghosh commented
It would be really interesting to query data stored in the Delta Lake file format in ADLS; looking forward to using ACID transactions soon.
Andy Steinke commented
PolyBase is a critical method for importing data into the warehouse and accessing data through a SQL interface, and its lack of Delta Lake support is a major hindrance to fully utilizing Delta Lake files. Thank you, and we appreciate your prioritizing this critical effort!
Prasanta B commented
It would be great to have Delta Lake support in ADLS, considering Synapse as an integrated solution. Our target architecture landscape would then be simpler and Azure-based.
Please give us an update on this or share any workaround. This is a most-needed feature for us.
Lakshmi Sankarappan commented
Any updates? This is a most-needed feature; we are very much looking forward to it.
This is super important to get in place!
Felipe Rosa commented
+1, this would be hugely helpful to us.
Euan Garden commented
(Linux Foundation) Delta Lake in Spark has been supported since the Private Preview release in November 2019. We are working on support in the other compute engines.
Bear in mind we are currently constrained by the feature gaps in the OSS version.
From my point of view, to guarantee data consistency, time travel, and ACID transactions, use the Delta table format.
So it makes sense for Synapse to support reading the Delta table "format" (the JSON log files indicate which parquet data files are valid).
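To make that parenthetical concrete, here is a rough sketch of replaying the _delta_log JSON commits to find the currently valid parquet files. Real readers also handle checkpoints, protocol versions, and so on; this only illustrates the idea.

```python
# Rough sketch: derive the live parquet files of a Delta table by replaying
# the JSON commit files in _delta_log, in order. Simplified: ignores
# checkpoint files, protocol actions, and remote (non-local) storage.
import json
import os

def live_files(table_path):
    log_dir = os.path.join(table_path, "_delta_log")
    live = set()
    # Commit files are named 00000000000000000000.json, ...00001.json, etc.,
    # so lexicographic order matches commit order.
    for name in sorted(f for f in os.listdir(log_dir) if f.endswith(".json")):
        with open(os.path.join(log_dir, name)) as fh:
            for line in fh:  # one action per line (newline-delimited JSON)
                action = json.loads(line)
                if "add" in action:
                    live.add(action["add"]["path"])
                elif "remove" in action:
                    live.discard(action["remove"]["path"])
    return live
```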
Delta Lake management in Synapse and Data Factory is a highly anticipated development on our side as well.
The Azure team needs to be swift in their response to this issue. It is very much a good practice in a Databricks environment to use the Delta format.
This should be supported by SQL Data Warehouse (or Azure Synapse Analytics) as an external file format.
Thanks, hoping for a swift reply and a solution.
PETRANCURI, DARRYLL commented
This is hugely important, and frankly with all the work that has been done on integration with Apache Spark, I'm really surprised this isn't in the roadmap at this time. It's not enough to be able to perform direct queries against Parquet. If you consider all the power and capabilities that Delta provides for a simplified data lake and data lake ETL pipeline, it's a must have.
Along with this I feel it's critical to add a Delta Sink to Azure Data Factory as well.
Saumyakumar Suhagiya commented
Any update on this?