U-SQL string data type has a size limit of 128 KB
The U-SQL string column data type has a size limit of 128 KB. This prevents uploading or processing text data larger than 128 KB through a U-SQL job. For example, if a text column in SQL holds XML content whose size is greater than 300 KB, uploading or processing it with U-SQL fails. Can we increase the string data type size?
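As a rough illustration of working within the limit, the sketch below measures a value's encoded size and splits oversized text into pieces that each fit. It is written in Python only to show the idea; it is not U-SQL. It assumes the 128 KB limit applies to the UTF-8 encoded size, and the names `LIMIT_BYTES`, `utf8_size`, and `split_utf8` are hypothetical.

```python
from typing import List

# Assumption: the limit is 128 KB of UTF-8 encoded data.
LIMIT_BYTES = 128 * 1024


def utf8_size(text: str) -> int:
    """Return the size of `text` in bytes when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def split_utf8(text: str, limit: int = LIMIT_BYTES) -> List[str]:
    """Split `text` into pieces whose UTF-8 encoding is at most `limit` bytes.

    Cuts are moved backwards so that no multi-byte character is split.
    """
    if limit < 4:
        # A single UTF-8 character can need up to 4 bytes.
        raise ValueError("limit must be at least 4 bytes")
    data = text.encode("utf-8")
    pieces = []
    start = 0
    while start < len(data):
        end = min(start + limit, len(data))
        # Continuation bytes have the bit pattern 10xxxxxx; if the cut
        # would land on one, back up to the start of that character.
        while end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        pieces.append(data[start:end].decode("utf-8"))
        start = end
    return pieces
```

Each piece could then be stored in its own row or column and concatenated again downstream. Whether that is practical depends on the job, and the separate per-row size limit still applies.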
J Meade commented
We want to use U-SQL for data prep, NLP processing, and/or merging multiple smaller data files into larger consolidated files. We're running into issues when reading any rows containing fields that exceed this limit. It would be really helpful if we could work under the same constraints as .NET and/or T-SQL.
Using the R.Reducer, I want to save some R objects that took me a loooooong time to calculate. They are quite large (10-50 MB), but I need to be able to get them out of R, save them to ADLS, and send them back to ADLA+R when doing some other computations.
We need to be able to pass single objects larger than 128 KB in and out of R. (Yeah, I know I can use DEPLOY RESOURCE, but it does not have the dynamics of passing things around using the Reducer.)
Paul Andrew commented
For info, this relates to this SO question/answer: https://stackoverflow.com/questions/44631022/value-too-long-failure-when-attempting-to-convert-column-data
Alex Kyllo commented
Cosmos strings don't seem to have this limitation, and a string in C# .NET can be up to 2GB in size, and in T-SQL an nvarchar(max) field can also be up to 2 GB. So it's very strange that this limitation exists in U-SQL especially since it is marketed as a big data processing language. Also the 4MB per row limitation is too restrictive for data such as JSON, XML, and unstructured text. Please consider improving the U-SQL runtime to remove these limitations.
Carolus Holman commented
When trying to access an array of elements, the Extractor fails when the JSON array object is typed as a string. If I load the array directly using the JSONPath in the extractor, e.g. JsonExtractor(body.telemetry.sensors[*]), I can load the fragment, but with this method I cannot get to the other parts of the JSON object, such as the header. I have scoured the internet but cannot find a solution.
The workaround of reading it as byte[] does not work when dealing with Gzip-compressed files.
In a lot of cases, there is a need to read a larger string and then parse out a smaller portion of it.
Please add support for reading larger string values, similar to Hive.
Rukmani Gopalan commented
While a byte[] works fine, it is restrictive in that string operations cannot be used on it. E.g. think of a scenario of mining long error messages (with a huge error stack) where you are looking for a specific tag: a substring function would be handy. Another scenario is working with string encodings of huge objects, such as images, to perform operations on them.
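For the error-message scenario, one option that stays within a byte array is to search the raw bytes for the tag and decode only a small window around the match, so that the resulting string stays well under the limit. The sketch below shows the idea in Python rather than U-SQL. The function name `extract_around_tag` and the `window` parameter are hypothetical, and it assumes the payload's encoding is known (UTF-8 here).

```python
from typing import Optional


def extract_around_tag(payload: bytes, tag: str, window: int = 200,
                       encoding: str = "utf-8") -> Optional[str]:
    """Find `tag` in raw bytes and decode only the text around it.

    Returns None when the tag is not present.
    """
    needle = tag.encode(encoding)
    pos = payload.find(needle)
    if pos < 0:
        return None
    start = max(0, pos - window)
    end = min(len(payload), pos + len(needle) + window)
    # errors="ignore" drops any character that the window cut in half.
    return payload[start:end].decode(encoding, errors="ignore")


# Usage: a ~1 MB message where only the text near the tag is decoded.
message = ("x" * 500000 + "<ErrorCode>42</ErrorCode>" + "y" * 500000).encode("utf-8")
snippet = extract_around_tag(message, "<ErrorCode>", window=20)
```

This covers simple substring-style lookups only; it does not replace general string operations on the whole value.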
Michael Rys commented
Thanks for filing. Currently the recommended workaround is to put the data into a byte array (byte[]). Note that for XML documents, which may carry their own encoding declaration, that may be the better approach anyway.
What is the expectation for such a type? Do you still want to dot into it and have it type compatible with the core string type? Or do you want it as a different type?