Add row position column to STRING_SPLIT
The new string splitter function in SQL Server 2016 is a good addition but it needs an extra column, a ListOrder column which denotes the order of the splitted values.
Upvotes: 72<=-=Mar 10 2016 11:26AM=-=>
It’s a shame that this was submitted as just a “suggestion”. It should actually be listed as a “bug” because there’s only a comparatively small set of use cases where enumeration of the result set of elements is not important.<=-=Mar 11 2016 12:47PM=-=>
I agree that an order column is required; one example use case is where two lists are passed in, and ordinal positions in one list correspond to positions in the other.<=-=Mar 11 2016 3:12PM=-=>
Please see the related suggestion: STRING_SPLIT needs “RemoveEmptyEntries” option, like String.Split in .NET ( https://connect.microsoft.com/SQLServer/feedback/details/2462002/ ).<=-=Mar 12 2016 12:02PM=-=>
This kind of function is primarily needed for de-serializing previously serialized arrays of values of any type format-able as text.
I therefore recommend to have the result set of this function work excellent with this use-case.
With de-serialized arrays there is a need to have precisely defined ordinal order as an additional result column.
This T-SQL splitter function would have even more usability if it also adds the following:
1. A result column containing the string position within the serialized text.
2. A result column containing the length of the identified value within the serialized text.
Finally allow for splitting values longer then one character and even zero characters.
With zero characters, just treat each character position as a new element ( “ABC” becomes “A” “B” “C” ).
I strongly second Peter de Heer’s suggestion that if an empty string is provided as a delimiter, that the function should split the string at the character level. This adds great utility to the function while still keeping it simple, lean, and fast.
As for adding other options, that’s fine as long as it doesn’t interfere with the notion of “simple, lean, and fast”. For example, I don’t believe that most people need the length of each element to be returned and if doing so slows the function down, I’d leave that out. We do, however, definitely do need for the ordinal position of each element to be returned because it’s not possible to get that information using a workaround because sort order is not otherwise guaranteed.<=-=Mar 16 2016 5:45PM=-=>
I’m very pleased to see a native inline table-valued function to split strings. However, it’s incomplete with an “Item Number” column being included in the return table. It’s a very useful column to have and and is absolutely necessary much of the time. Kudos for creating the function. Please consider adding the missing piece soon.<=-=Mar 22 2016 11:05PM=-=>
The new OPENJSON function returns the element’s ordinal position when working with an array. To be consistent, this function should also.<=-=May 5 2017 5:56AM=-=>
Due to some possible parallelism behaviour, the workaround solution proposed by Solomon Rutzky cannot be systematically used on big tables, even in case of MAXDOP 0… We needs to know which algorithm has been used…<=-=May 5 2017 6:38AM=-=>
Curiously the function name “split_string” is case sensitive (I ever use case sensitive instance to quickly retrieve systems queries that does’nt works on every installations). “split_string” must be expressed in lower case to be used in every cases. Usually and since the beggining all integrated SQL functions are expressed in UPPERCASE like SYSTEM_USER.<=-=May 13 2017 1:56PM=-=>
@SQL_pro Re: your comment of “Due to some possible parallelism behavior, the workaround solution proposed by Solomon Rutzky cannot be systematically used on big tables”: is this behavior you have observed and can reproduce, or is it theoretical? If it has been observed then there is nothing to argue with. But if it is only theoretical, then to me the proposed workaround is still an option due to the very low probability that this operation could ever be handled in parallel. The very nature of a split operation is that you don’t know where the splits occur without scanning the entire string, and one cannot simply start in the middle in the hopes of dividing the work in half as that could be in the middle of an element. And while that issue of dividing the work up can possibly be corrected for by starting in the middle and scanning until the next delimiter (not a pure 50/50 split but still dividing and conquering), that requires scanning the string which is not even guaranteed to have a delimiter, hence in many cases it would be doing double work (wasting time) just to ensure that it didn’t break in its effort to save time. So parallelism (here) is more error-prone and only sometimes reducing time but more often than not increasing operational time.
However, to be fair, it was mentioned in a thread on SqlServerCentral.com that due to the documentation not guaranteeing the order, that it was not safe to assume consistency of ordering because it could somehow change in the future, and I get that, but still ;-).<=-=May 14 2017 9:00AM=-=>
@SQL_pro Re: your comment of "Curiously the function name ‘split_string’ is case sensitive ": What version of SQL Server are you testing on? And what Collation(s)? I tested on SQL Server 2016 Express (RTM and SP1) with an instance default Collation of Hebrew_100_CS_AS_KS_WS_SC. I tested in [master] (having the same Collation as the Instance), and two other DBs with Collations of Latin1_General_100_CI_AS and Latin1_General_100_BIN2. I also tested on SQL Server 2016 Developer Edition (SP1) with a database default Collation of Latin1_General_100_BIN2. In all cases, the following worked without error: “SELECT * FROM StRInG_SpLiT(‘h’, ‘,’);”.
The only way I am able to produce an error is to specify “SELECT Value FROM StRInG_SpLiT(‘h’, ‘,’);” instead of “SELECT value FROM StRInG_SpLiT(‘h’, ‘,’);”; the only difference being the case of the returned column “value”. The databases with default Collations being case-sensitive or binary would produce the error. It should be noted that using “Value” did not error in databases with a case-insensitive default Collation, even if the Instance default Collation was case-sensitive.
In the future, if you find a problem, please report it as a separate bug ticket, not as a comment in some other ticket. Please also include details to help reproduce the error, such as which edition and version and service pack level you are on, and which Collation(s) you are using.<=-=May 14 2017 9:07AM=-=>
@SwePeso: this ticket might benefit from having a title more reflective of the specific issue (i.e. “STRING_SPLIT missing RowID / ElementID / ListOrder / etc result column”). I mention this after noticing a more recent duplicate request ( https://connect.microsoft.com/SQLServer/feedback/details/3130245/ ) that has a response from Microsoft whereas they have yet to comment here. It does make it easier to know at a glance what the request is for when scrolling through the results list :).<=-=Jun 24 2017 9:07AM=-=>
Just to throw in my full wish list for an intrinsic splitter function in T-SQL…
Here’s my wish list for what should have done with the function. I consider items 1 thru 5 to be essential. The rest would be really nice to have but are not essential (IMHO). Of course, performance is always on the wish list.
1. Permanently or optionally return the ordinal position of each split-out element.
2. Always return something. If you pass it a NULL, return a NULL. If you pass it a single element empty string, return an empty string. If you pass it a one or more spaces, return the spaces (unless spaces are the delimiter, of course, and should be carefully considered because THAT’S a very common use case).
3. If you don’t pass a delimiter or any other option other than the ordinal position option (if it is optional), split at the character level.
4. Have a “True CSV/TSV” option.
5. Allow for multi-character delimiters.
6. Have an option to return either NULLs or empty strings for adjacent delimiters. Delimiters with even a single space between them would not be considered as adjacent.
7. Have an option for a “word” split where adjacent non alpha-numeric characters are treated as one and auto-magically used as the delimiter possibly including an exclusion list for characters like dashes, underscores, etc.
8. Have an “array” option where you can identify the delimiters for up to 3 levels (think flatted array or cube being passed) that would include the proper ordinals for each level.
I agree that this should be considered a defect. Without any guarantees in the documentation that row_number guarantees the order I can’t reliably use this function for anything other than grabbing all values and must assume a random order can be returned at any time. Just include the index as an additional column!
I believe you need an ordinal column (or some related column with a name for each row) in order to be able to PIVOT the results back to columns, which is kind of what we often need in the first place.
Ronen Ariely (pituach) commented
I totally agree with this feedback and I voted for it. With that being said, I cannot agree with phrases like "we can't get the order reliably". In fact I hate to hear "can't", and in this case we can Guarantee the order using STRING_SPLIT, as I show in my blog. To clarify! I am not saying that we should use this solution in production, and I clearly states in the blog that "This is only a theoretical discussion and not a solution for production", but it is very simple to implement. In fact, If you don't have duplicates (like the first scenario in the blog), and maybe even when there can be only twice or less (as the second scenario), then this approach might bring good performance in some cases. It is when we have multiple duplicates like scenario 3, that everything we gain by using STRING_SPLIT is just thrown away. http://ariely.info/Blog/tabid/83/EntryId/223/T-SQL-Playing-with-STRING_SPLIT-function.aspx
Maurice Pelchat commented
Adding a column returning ordinal order of elements is a must. It has many useful applications. Since it is a table result, it would not create problems for existing code, which cannot actually refer to this column. Many other related items can be added manually by the developer, but since we can't get the order reliably, SQL Server should do it.
With an empty delimiter string, splitting at the character level would be very useful also.
Jeff Moden commented
There are a couple of suggestions on this thread that state that people want it to automatically remove "empty" elements like it some similar function does in .Net. There's also a suggestion or two to have it return the length of each ordinal. While that may seem useful, the use cases for it are not that frequent. I wouldn't do anything to it that would cause a slowdown other than adding the requested ordinal position, which shouldn't slow it down in any measurable fashion. The extra functionality can be handled post execution by the user code.
It would, indeed, be nice if it could handle multi-character delimiters but, again, don't do it if it causes any measurable slowdown. More than 90% of the time (IMHO), string splitters are only needed to work against a single character delimiter.
Pretty useless without ordinal column, really.
This really needs an ordinal column, or there needs to be an option that guarantees to return the elements in the original order. As the function returns results in original order in nearly all cases, it's highly likely that new T-SQL code being written will assume original order, and bug reports will be filed in the occasional cases when the original order is not observed.
Justin Terry commented
Most real use cases for this function are totally excluded without an ordinal column or at least a guarantee that order is preserved. Please add this functionality.
Mark Iannucci commented
If you're looking for the duplicate suggestion, it is here:
Visakh Murukesan commented
Just now came across a use case where the ListOrder would have benefited. There are many cases where the order of values which are split from the original string are important. A typical case is where you've a concatenated column whose values needs to split to multiple columns where each part of the string has to go to a different column based on its position. As of now we've to again utilize UDF based or XML based method itself to ensure the order
It would be really handy if the STRING_SPLIT function had an option/switch to return the row number in order to be able to use other windowing functions on the resultset.
I found I still had to result to a CTE effectively making my own string splitting function which included the row number to then be able to apply other windowing functions such as a running total.
This code helps to highlight the issue. In trying to apply a running total the results have to be ordered and the position of Susan and Michael is switched. Had a Row Number field existed, ordering by that would have maintained the original order:
DECLARE @str NVARCHAR(MAX) = 'Ben|10, Bob|325, Susan|7, Michael|99'
RunningTotal = SUM ( N.Amount ) OVER ( ORDER BY N.PersonName ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW )
PersonName = LEFT ( TRIM ( Value ), CHARINDEX ( '|', TRIM ( Value ) ) - 1 ),
Amount = CAST ( RIGHT ( TRIM ( Value ), CHARINDEX ( '|', REVERSE ( TRIM ( Value ) ) ) - 1 ) AS int )
FROM STRING_SPLIT ( @str, ',' )