Add row position column to STRING_SPLIT
The new string splitter function in SQL Server 2016 is a good addition, but it needs an extra column: a ListOrder column that denotes the order of the split values.
Upvotes: 72<=-=Mar 10 2016 11:26AM=-=>
It’s a shame that this was submitted as just a “suggestion”. It should actually be listed as a “bug” because there’s only a comparatively small set of use cases where enumeration of the result set of elements is not important.<=-=Mar 11 2016 12:47PM=-=>
I agree that an order column is required; one example use case is where two lists are passed in, and ordinal positions in one list correspond to positions in the other.<=-=Mar 11 2016 3:12PM=-=>
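For illustration, here is one sketch of that two-parallel-lists use case using OPENJSON as a stand-in splitter, since its [key] column does carry the ordinal. The variable names are invented, and the trick assumes the values contain no double quotes or backslashes:

```sql
-- Two parallel lists whose ordinal positions correspond to each other
DECLARE @ids   nvarchar(100) = N'7,12,31';
DECLARE @names nvarchar(100) = N'Ann,Bob,Cid';

-- Wrap each list as a JSON array, split with OPENJSON, and join on the array index
SELECT i.[key] AS ordinal, i.[value] AS id, n.[value] AS name
FROM OPENJSON(N'["' + REPLACE(@ids,   N',', N'","') + N'"]') AS i
INNER JOIN OPENJSON(N'["' + REPLACE(@names, N',', N'","') + N'"]') AS n
        ON n.[key] = i.[key];
```

With STRING_SPLIT alone there is no documented way to build this join, which is the point of this request.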
Please see the related suggestion: STRING_SPLIT needs “RemoveEmptyEntries” option, like String.Split in .NET ( https://connect.microsoft.com/SQLServer/feedback/details/2462002/ ).<=-=Mar 12 2016 12:02PM=-=>
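Until such an option exists, the .NET RemoveEmptyEntries behavior can be approximated with a simple filter (a sketch, not an official equivalent):

```sql
DECLARE @s varchar(100) = 'a,,b,,,c';

-- Emulate String.Split(..., StringSplitOptions.RemoveEmptyEntries)
-- by discarding the empty elements produced by adjacent delimiters
SELECT value
FROM STRING_SPLIT(@s, ',')
WHERE value <> '';
```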
This kind of function is primarily needed for de-serializing previously serialized arrays of values of any type that can be formatted as text.
I therefore recommend making the result set of this function work well with that use case.
With de-serialized arrays there is a need for a precisely defined ordinal order as an additional result column.
This T-SQL splitter function would have even more usability if it also adds the following:
1. A result column containing the string position within the serialized text.
2. A result column containing the length of the identified value within the serialized text.
Finally, allow for splitting on delimiters longer than one character, and even zero characters.
With zero characters, just treat each character position as a new element ( “ABC” becomes “A” “B” “C” ).
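A rough sketch of what that zero-character (character-level) split could return, emulated today with an inline tally of positions; the names are illustrative:

```sql
DECLARE @s nvarchar(100) = N'ABC';

-- Generate positions 1..LEN(@s) and take one character per position,
-- so N'ABC' yields the rows A, B, C with their ordinals
SELECT n.pos, SUBSTRING(@s, n.pos, 1) AS element
FROM (SELECT TOP (LEN(@s)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
      FROM sys.all_objects) AS n(pos)
ORDER BY n.pos;
```

sys.all_objects is used here only as a convenient row source for short strings.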
I strongly second Peter de Heer’s suggestion that if an empty string is provided as a delimiter, that the function should split the string at the character level. This adds great utility to the function while still keeping it simple, lean, and fast.
As for adding other options, that’s fine as long as it doesn’t interfere with the notion of “simple, lean, and fast”. For example, I don’t believe that most people need the length of each element to be returned, and if doing so slows the function down, I’d leave that out. We do, however, definitely need the ordinal position of each element to be returned, because it’s not possible to get that information through a workaround: the sort order is not otherwise guaranteed.<=-=Mar 16 2016 5:45PM=-=>
I’m very pleased to see a native inline table-valued function to split strings. However, it’s incomplete without an “Item Number” column in the return table. That’s a very useful column to have and is absolutely necessary much of the time. Kudos for creating the function. Please consider adding the missing piece soon.<=-=Mar 22 2016 11:05PM=-=>
The new OPENJSON function returns the element’s ordinal position when working with an array. To be consistent, this function should also.<=-=May 5 2017 5:56AM=-=>
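For comparison, a minimal example of the OPENJSON behavior referenced above:

```sql
-- [key] carries the zero-based array index (as nvarchar);
-- STRING_SPLIT returns no comparable column
SELECT [key] AS ordinal, [value]
FROM OPENJSON(N'["a","b","c"]');
```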
Due to possible parallelism behaviour, the workaround solution proposed by Solomon Rutzky cannot be used systematically on big tables, even with MAXDOP 0… We need to know which algorithm has been used…<=-=May 5 2017 6:38AM=-=>
Curiously, the function name “split_string” is case sensitive (I always use a case-sensitive instance to quickly catch system queries that don’t work on every installation). “split_string” must be written in lower case to work in every case. Usually, and since the beginning, all built-in SQL functions have been expressed in UPPERCASE, like SYSTEM_USER.<=-=May 13 2017 1:56PM=-=>
@SQL_pro Re: your comment of “Due to some possible parallelism behavior, the workaround solution proposed by Solomon Rutzky cannot be systematically used on big tables”: is this behavior you have observed and can reproduce, or is it theoretical? If it has been observed then there is nothing to argue with. But if it is only theoretical, then to me the proposed workaround is still an option due to the very low probability that this operation could ever be handled in parallel. The very nature of a split operation is that you don’t know where the splits occur without scanning the entire string, and one cannot simply start in the middle in the hope of dividing the work in half, as that could land in the middle of an element. That could possibly be corrected for by starting in the middle and scanning forward to the next delimiter (not a pure 50/50 split, but still dividing and conquering), but that requires scanning a string that is not even guaranteed to contain a delimiter, so in many cases it would do double work (wasting time) just to ensure that it didn’t break in its effort to save time. So parallelism here is more error-prone, and it would only sometimes reduce time while more often than not increasing it.
However, to be fair, it was mentioned in a thread on SqlServerCentral.com that due to the documentation not guaranteeing the order, that it was not safe to assume consistency of ordering because it could somehow change in the future, and I get that, but still ;-).<=-=May 14 2017 9:00AM=-=>
@SQL_pro Re: your comment of "Curiously the function name ‘split_string’ is case sensitive ": What version of SQL Server are you testing on? And what Collation(s)? I tested on SQL Server 2016 Express (RTM and SP1) with an instance default Collation of Hebrew_100_CS_AS_KS_WS_SC. I tested in [master] (having the same Collation as the Instance), and two other DBs with Collations of Latin1_General_100_CI_AS and Latin1_General_100_BIN2. I also tested on SQL Server 2016 Developer Edition (SP1) with a database default Collation of Latin1_General_100_BIN2. In all cases, the following worked without error: “SELECT * FROM StRInG_SpLiT(‘h’, ‘,’);”.
The only way I am able to produce an error is to specify “SELECT Value FROM StRInG_SpLiT(‘h’, ‘,’);” instead of “SELECT value FROM StRInG_SpLiT(‘h’, ‘,’);”; the only difference being the case of the returned column “value”. The databases with default Collations being case-sensitive or binary would produce the error. It should be noted that using “Value” did not error in databases with a case-insensitive default Collation, even if the Instance default Collation was case-sensitive.
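A minimal illustration of the behavior described above (the database setup is assumed, not shown): in a database whose default collation is case-sensitive or binary,

```sql
SELECT value FROM STRING_SPLIT('h', ',');  -- succeeds: the returned column is named "value"
SELECT Value FROM STRING_SPLIT('h', ',');  -- fails: "Value" does not match under a case-sensitive collation
```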
In the future, if you find a problem, please report it as a separate bug ticket, not as a comment in some other ticket. Please also include details to help reproduce the error, such as which edition and version and service pack level you are on, and which Collation(s) you are using.<=-=May 14 2017 9:07AM=-=>
@SwePeso: this ticket might benefit from having a title more reflective of the specific issue (i.e. “STRING_SPLIT missing RowID / ElementID / ListOrder / etc result column”). I mention this after noticing a more recent duplicate request ( https://connect.microsoft.com/SQLServer/feedback/details/3130245/ ) that has a response from Microsoft whereas they have yet to comment here. It does make it easier to know at a glance what the request is for when scrolling through the results list :).<=-=Jun 24 2017 9:07AM=-=>
Just to throw in my full wish list for an intrinsic splitter function in T-SQL…
Here’s my wish list for what should have done with the function. I consider items 1 thru 5 to be essential. The rest would be really nice to have but are not essential (IMHO). Of course, performance is always on the wish list.
1. Permanently or optionally return the ordinal position of each split-out element.
2. Always return something. If you pass it a NULL, return a NULL. If you pass it a single-element empty string, return an empty string. If you pass it one or more spaces, return the spaces (unless spaces are the delimiter, of course; that case should be carefully considered because THAT’S a very common use case).
3. If you don’t pass a delimiter or any other option other than the ordinal position option (if it is optional), split at the character level.
4. Have a “True CSV/TSV” option.
5. Allow for multi-character delimiters.
6. Have an option to return either NULLs or empty strings for adjacent delimiters. Delimiters with even a single space between them would not be considered as adjacent.
7. Have an option for a “word” split where adjacent non-alphanumeric characters are treated as one and auto-magically used as the delimiter, possibly with an exclusion list for characters like dashes, underscores, etc.
8. Have an “array” option where you can identify the delimiters for up to 3 levels (think flattened array or cube being passed) that would include the proper ordinals for each level.
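As an interim workaround for item 5, a multi-character delimiter can be collapsed to a single placeholder character before calling STRING_SPLIT. This is only a sketch; it assumes the placeholder character never appears in the data:

```sql
DECLARE @s   nvarchar(100) = N'one<=>two<=>three';
DECLARE @dlm nvarchar(10)  = N'<=>';
DECLARE @ph  nchar(1)      = NCHAR(1);  -- placeholder; assumed never to appear in the data

-- Replace the multi-character delimiter with the single-character
-- placeholder, then split on the placeholder
SELECT value
FROM STRING_SPLIT(REPLACE(@s, @dlm, @ph), @ph);
```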
I agree that this should be considered a defect. Without any guarantee in the documentation that ROW_NUMBER will reflect the order, I can’t reliably use this function for anything other than grabbing all values, and must assume a random order can be returned at any time. Just include the index as an additional column!
I totally agree with this feedback and I voted for it
declare @c varchar(50)
set @c = 'a_b_c_d'
;WITH t AS
(
    select value,
           row_number() over (order by (select 1)) as id
    from string_split(@c, '_')
)
select value from t where id = 2
Tim Lehner commented
To make this thing truly usable (a game-changer for many of us) and not break any existing code, the engine team could simply guarantee the original string's sort order when combined with an otherwise meaningless order or ranking function (e.g. select *, row_number() over (order by @@rowcount) from string_split('a,b,c,,d', ',')), and also allow for wider separators (e.g. nvarchar(128), or even an empty string for character-level splitting). Without a way to guarantee the sort order according to the original string, this function is effectively relegated to replacing a single LIKE wildcard concatenation anti-pattern, and it isn't even consistently faster than that.
Jason Horner commented
Another point and perhaps the larger issue here is that with the removal of CLR support for Azure SQL DB, Users no longer have the means to custom implement this type of functionality via CLR routines.
The SQL Server feature surface area sadly lacks several features that are common in other platforms, and Azure SQL DB gives developers no viable options to extend that functionality themselves:
Advanced JSON Parsing and manipulation
POSIX regular expression support
Advanced spatial operations
Databases like PostgreSQL and Oracle are light years ahead of SQL Server in these areas, and to be honest the other engines have had these features for several years. It's time for the SQL Server team to get serious about improving the developer experience with efficient operators for these types of workloads.
Jeff Moden commented
I absolutely agree about the 80/20 thing. Unfortunately, in its current form, this function doesn't even cover 20% of what I need such a function for. Just like they finally fixed some of the wonderful windowing functions to be able to do running totals without crippling those functions, so can they fix this function to do a much better job. I absolutely love it when something that doesn't meet even 20% usefulness suddenly becomes 100% useful. If it's 100% useful for someone as it is, that's great, but for a lot of us it's virtually useless because of missing functionality that should have been included from the get-go, and it shouldn't be difficult to add now.
If you're concerned with backwards compatibility, then simply make a String_Split2() function.
Simon, this request is not about performance!
It's about feature set, about what should have been delivered to begin with, as what was delivered is often useless.
.NET CLR is not an answer as it does not exist in Azure!
Happy that it works for you, but there are many more cases, beyond trivial ones, where it does not work.
Simon Richardson commented
Please do not implement this idea. Why? Personally I am in the camp of the 80/20 rule: we waited a long time to get this option and it performs very well in most cases. Don't get me wrong, I know there are cases where developers won't / can't change the input code and the DBAs do their best to accommodate, and unfortunately because of this those cases get more 'coverage'. Few talk about the hundreds of things that work well, but we all like to shout about the stuff that is tough or even possibly wrong. So if the .NET String.Split works so well, use that, or implement a CLR function to cover what this does not do, or clean your inputs before the DB, or indeed ask MS for a new string_to_array option. This works and works well; please do not cripple it for the exceptions.
Stuart Davis commented
We deal with product data feed files from many different sources, and it would be extremely beneficial to be able to use string_split; however, that's often not possible, as many files are never simple or clear-cut. I'd need to be able to define several delimiters, e.g. tab, pipe, and comma, to apply collectively to a single string.
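One interim approach for the multiple-delimiter case (a sketch; it is only safe if none of the delimiter characters can appear inside a value) is to normalize all delimiters to one character first:

```sql
-- Sample string mixing pipe, tab (NCHAR(9)), and comma delimiters
DECLARE @s nvarchar(100) = N'red|green' + NCHAR(9) + N'blue,black';

-- Normalize tab and pipe to a comma, then split once
SELECT value
FROM STRING_SPLIT(REPLACE(REPLACE(@s, NCHAR(9), N','), N'|', N','), N',');
```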
Aaron Bertrand commented
Jason Horner raised a great point on Twitter.
Let's say you currently have a CLR function to handle certain aspects (like a three-character delimiter). You'd love to move to STRING_SPLIT, but this blocks your migration to Azure SQL DB because (a) you can't use CLR there and (b) you will need to spend time re-creating your CLR function, using a T-SQL replacement, losing performance and time.
Dan Obermiller commented
Every time I try to use STRING_SPLIT I eventually need the order option. Then I inevitably roll my own solution so I can output the actual order of elements when that is important to my use case (e.g. get the Nth piece of the output to identify what to do with the other pieces).
The other issue I regularly find myself having to deal with is either quoting delimiters CSV style, e.g.
a,b,"c,1,2,3",d - this has 4 values
Or escaping delimiters, e.g.
a,b,c\,1\,2\,3,d - this has 4 values
This one is harder to do, and I usually give up on doing it on the database because the solution is so messy.
Having a good way to do both of these would let me take a lot of work off of my application, where it doesn't really belong.
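For the escaped-delimiter case, one placeholder-based sketch (assuming the placeholder character never occurs in the data):

```sql
DECLARE @s  nvarchar(100) = N'a,b,c\,1\,2\,3,d';
DECLARE @ph nchar(1)      = NCHAR(1);  -- placeholder; assumed never to appear in the data

-- Hide the escaped commas, split on the remaining commas, then restore them
-- yields 4 items: a, b, c,1,2,3, d
SELECT REPLACE(value, @ph, N',') AS item
FROM STRING_SPLIT(REPLACE(@s, N'\,', @ph), N',');
```

The quoted-CSV variant is genuinely harder and is not covered by this trick.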
Solomon Rutzky commented
Regarding the comment by @Anonymous (immediately below):
NO!!! Please do 𝘯𝘰𝘵 use a multi-statement TVF (especially using a WHILE loop). There have been numerous articles / posts / etc comparing various string splitting methods in SQL Server, and the WHILE loop is one of the slowest, if not 𝘵𝘩𝘦 slowest.
If you can't or won't use the built-in STRING_SPLIT, then use one of the following two choices:
1) Pure T-SQL: "Tally OH! An Improved SQL 8K “CSV Splitter” Function" (be sure to see related discussion) ( https://www.sqlservercentral.com/articles/tally-oh-an-improved-sql-8k-%e2%80%9ccsv-splitter%e2%80%9d-function )
2) SQLCLR: there are many examples if you want to code your own, or you can simply download/install SQL# ( https://SQLsharp.com?ref=fa_32902852 ), which I wrote, as the Free version contains: String_Split and String_SplitInts. Use "SplitInts" if you are parsing a list of INTs as it is optimized for that. And, if your list is guaranteed to be <= 4000 characters (I am simplifying here for practicality), then use the "4k" version as that is also optimized for non-MAX data.
RUN THIS INSTEAD OF STRING_SPLIT. IT'S FAST ENOUGH AND PROVIDES WHAT YOU NEED:
CREATE function [dbo].[Split]
-- parameter declarations reconstructed; original types unknown
(@String varchar(8000), @Delimiter varchar(10))
RETURNS @Results TABLE (f_id INT, item varchar(4000))
AS
BEGIN
    DECLARE @spacerepl char(1) = char(254)
    DECLARE @i int = 1, @f_id int = 1, @j int
    IF ISNULL(@String, '') = '' RETURN
    -- LEN() ignores trailing spaces, so substitute a placeholder when splitting on spaces
    IF @Delimiter = ' '
        SELECT @String = REPLACE(@String, ' ', @spacerepl), @Delimiter = @spacerepl
    WHILE @i <= LEN(@String)
    BEGIN
        SELECT @j = CHARINDEX(@Delimiter, @String, @i)
        IF @j = 0 SELECT @j = LEN(@String) + 1
        INSERT @Results SELECT @f_id, RTRIM(SUBSTRING(@String, @i, @j - @i))
        SELECT @i = @j + LEN(@Delimiter)
        SELECT @f_id = @f_id + 1
    END
    RETURN
END
DECLARE @str nvarchar(100) = N'a,a,b,b,c,d,e'

SELECT [value]
,ROW_NUMBER()OVER(ORDER BY (SELECT NULL) DESC) Ix
FROM string_split(@str, N',')

SELECT [value]
,ROW_NUMBER()OVER(ORDER BY (SELECT NULL) ASC) Ix
FROM string_split(@str, N',')
both results are the same, despite ASC/DESC request
What happened to the "guarantee" here?
Is it a bug? Probably not: without an explicit ORDER BY there is no guarantee of any order; the results are undefined. At least that is how it is in T-SQL.
So we do need an ordinal index/key.
Therefore, string_split is feature deficient, 1/2 baked. PERIOD!
Microsoft, do the right thing, add a key/index.
Chojrak, what if one needs the ordinal index?
A guarantee is meaningless without ORDER BY. This is not a guarantee; it just happens to work this way.
DECLARE @str nvarchar(100) = N'a,a,b,b,c,d,e'
SELECT [value]
,ROW_NUMBER()OVER(ORDER BY (SELECT NULL)) Ix
FROM string_split(@str, N',')
This is what a "guarantee" looks like to me, where I get a key sequence:
SELECT [key], [value]
FROM OPENJSON(CONCAT('["', REPLACE(@str, ',', '","'), '"]'), '$')
If I need a descending sequence, I can build it with an explicit guarantee now:
SELECT [key], [value]
,ROW_NUMBER()OVER(ORDER BY [key] DESC) Ix
FROM OPENJSON(CONCAT('["', REPLACE(@str, ',', '","'), '"]'), '$')
We don't need an ordinal column, just a guarantee that it returns data in the order it appears in the string being split, just like .NET String.Split works. You can rely on that.
Alexandro Ferreira dos Santos commented
We need an ordinal column, or an option that guarantees returning the elements in their original order.
L. Szozda commented
I strongly agree that SQL Server should support this feature (it is available in PostgreSQL via UNNEST() WITH ORDINALITY).
As a workaround, OPENJSON and its [key] column can be used:
DECLARE @str NVARCHAR(MAX) = 'Ben|10, Bob|325, Susan|7, Michael|99';
SELECT s.col, PersonName, Amount, rn,
    RunningTotal = SUM(Amount) OVER(ORDER BY rn ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM (SELECT @str) s(col)
CROSS APPLY (SELECT
     PersonName = LEFT(TRIM([value]), CHARINDEX('|', TRIM([value])) - 1)
    ,Amount = CAST(RIGHT(TRIM([value]), CHARINDEX('|', REVERSE(TRIM([value]))) - 1) AS INT)
    ,rn = [key]
    FROM OPENJSON(REPLACE(CONCAT('["', s.col, '"]'), ', ', '","'))
) AS j;
I believe you need an ordinal column (or some related column with a name for each row) in order to be able to PIVOT the results back to columns, which is kind of what we often need in the first place.
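For illustration, a PIVOT over an OPENJSON-based split, which works precisely because [key] provides a name for each row; the variable and output column names are invented:

```sql
DECLARE @csv nvarchar(100) = N'2024,Q3,Widgets';

-- Split via OPENJSON so each element carries its array index in [key],
-- then pivot the indices 0..2 back into columns
SELECT p.[0] AS col1, p.[1] AS col2, p.[2] AS col3
FROM (SELECT [key], [value]
      FROM OPENJSON(N'["' + REPLACE(@csv, N',', N'","') + N'"]')) AS j
PIVOT (MAX([value]) FOR [key] IN ([0], [1], [2])) AS p;
```

With STRING_SPLIT there is no column to put in the FOR clause, so this pattern is impossible.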
Ronen Ariely (pituach) commented
I totally agree with this feedback and I voted for it. With that being said, I cannot agree with phrases like "we can't get the order reliably". In fact I hate to hear "can't", and in this case we can guarantee the order using STRING_SPLIT, as I show in my blog. To clarify: I am not saying that we should use this solution in production, and I clearly state in the blog that "This is only a theoretical discussion and not a solution for production", but it is very simple to implement. In fact, if you don't have duplicates (like the first scenario in the blog), and maybe even when each value appears at most twice (as in the second scenario), this approach might bring good performance in some cases. It is when we have multiple duplicates, like scenario 3, that everything we gain by using STRING_SPLIT is thrown away. http://ariely.info/Blog/tabid/83/EntryId/223/T-SQL-Playing-with-STRING_SPLIT-function.aspx
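A sketch of that idea (my paraphrase, not the blog's exact code, and valid only when the list has no duplicate elements): order by each element's first occurrence in the original string.

```sql
DECLARE @s nvarchar(100) = N'gamma,alpha,beta';

-- Wrap both the element and the list in delimiters so CHARINDEX
-- matches whole elements, then rank by position of first occurrence
SELECT [value],
       ROW_NUMBER() OVER (
           ORDER BY CHARINDEX(N',' + [value] + N',', N',' + @s + N',')
       ) AS pos
FROM STRING_SPLIT(@s, N',');
```

With duplicates, every copy of a value gets the same CHARINDEX, which is exactly the scenario-3 breakdown described above.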
Maurice Pelchat commented
Adding a column returning the ordinal order of elements is a must; it has many useful applications. Since the result is a table, it would not create problems for existing code, which cannot currently refer to such a column. Many other related items can be added manually by the developer, but since we can't get the order reliably, SQL Server should provide it.
With an empty delimiter string, splitting at the character level would be very useful also.