Extractor that pulls from Excel Worksheets in a Workbook!
When using the DocumentFormat.OpenXml library, I am running into out-of-memory errors on particularly large files. I have a spreadsheet with over 260K rows. Is there a way to deal with these errors?
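One possible way around memory errors of this kind is to read the worksheet as a forward-only stream instead of loading the whole document tree. The sketch below illustrates the idea in Python using only the standard library; it is not the extractor's code. The helper names `iter_rows` and `make_sample` are hypothetical, and it assumes the sheet of interest is stored at `xl/worksheets/sheet1.xml`. The Open XML SDK offers a comparable forward-only reader, `OpenXmlReader`, which may be the closer fit inside a C# extractor.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by SpreadsheetML worksheet parts.
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def iter_rows(xlsx_file, sheet_part="xl/worksheets/sheet1.xml"):
    """Yield each row as a list of raw cell values, one row at a time.

    Values are the raw <v> text; for shared-string cells that is an index
    into xl/sharedStrings.xml, which this sketch does not resolve.
    """
    with zipfile.ZipFile(xlsx_file) as zf:
        with zf.open(sheet_part) as sheet:
            for _event, elem in ET.iterparse(sheet, events=("end",)):
                if elem.tag == NS + "row":
                    yield [c.findtext(NS + "v") for c in elem.iter(NS + "c")]
                    # Drop the row's cells so they can be garbage collected.
                    # An empty <row> stub still remains in the tree.
                    elem.clear()


def make_sample(n_rows):
    """Build a tiny in-memory, workbook-like archive (illustration only)."""
    rows = "".join(
        f'<row r="{i}"><c r="A{i}"><v>{i}</v></c><c r="B{i}"><v>{i * 2}</v></c></row>'
        for i in range(1, n_rows + 1)
    )
    xml = (
        '<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
        f"<sheetData>{rows}</sheetData></worksheet>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("xl/worksheets/sheet1.xml", xml)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for row in iter_rows(make_sample(3)):
        print(row)
```

Because rows are yielded one at a time and cleared afterwards, peak memory depends mostly on the widest row rather than on the total row count.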
Aaaaaaand...please disregard previous e-mail.
It can be added through NuGet Package Manager.
PM>Install-Package DocumentFormat.OpenXml -Version 2.8.1
Is there a chance we can get the class files for DocumentFormat.OpenXml, or at least the compiled .dll? The extractor won't work without it.
Pradeep Raghunath commented
It would make sense to provide built-in extractors for the XLSX, JSON, and XML file types, which are very common.
Michael Rys commented
Note that there is a community contributed Excel extractor available at https://github.com/Azure/AzureDataLake/tree/master/Samples/ExcelExtractor
Kory Skistad commented
I've worked in the financial/mortgage industry for nearly 20 years, and if there is one constant, it's Excel. Moving from DTS to SSIS and on to various other ETL technologies has always presented challenges in bringing Excel data into databases in order to turn "desktop" data into "enterprise" data. It still baffles me why this persists as a challenge for Microsoft to address. Why are we still using the Jet driver? Excel has a very logical hierarchy: Workbook->Sheet->Range. That should be enough to create a driver that can navigate this hierarchy and allow tools like SSIS, U-SQL, and any other MS product to target the area we need to pull data from in an Excel document. Named ranges make it even more intuitive.
I started using Excel at version 3, back around 1991. I worked for a bank at the time. Now, almost 30 years later, I am still trying to find a simple way to deal with this data. Face it, Microsoft: Excel is not going away... let's make a killer driver that can be used with your other tools and technologies to tap this valuable resource.
Michael Rys commented
Note that U-SQL can read most CSV and TSV files generated by Excel (without a header row and with no CR/LF in the content). XLSX files are harder to support: each one is a compressed archive of XML files, which makes it rather difficult to provide well-performing processing.
We will look into it, though, if there are enough votes.
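To make the "compressed archive of XML files" point concrete, here is a minimal Python sketch that opens such an archive and lists its XML parts. The helper names `build_minimal_xlsx` and `list_xml_parts` are hypothetical, and the archive built here is a simplified stand-in with placeholder XML, not a workbook Excel would open. A real XLSX file contains more parts than these three.

```python
import io
import zipfile


def build_minimal_xlsx():
    """Build a tiny XLSX-like ZIP archive in memory (illustration only)."""
    parts = {
        "[Content_Types].xml": "<Types/>",
        "xl/workbook.xml": "<workbook/>",
        "xl/worksheets/sheet1.xml": "<worksheet><sheetData/></worksheet>",
    }
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, xml in parts.items():
            zf.writestr(name, xml)
    return buf.getvalue()


def list_xml_parts(xlsx_bytes):
    """Return the names of the XML parts stored inside the archive."""
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as zf:
        return [name for name in zf.namelist() if name.endswith(".xml")]


if __name__ == "__main__":
    for name in list_xml_parts(build_minimal_xlsx()):
        print(name)
```

A reader has to decompress the archive and parse the XML before it sees any cell values, which is part of why XLSX is harder to process efficiently than CSV or TSV.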