Do you have a comment or suggestion to improve SQL Server? We’d love to hear it!

UTF-8 in SQL 2019: Collation Precedence allows for Unicode character data loss concatenating UTF8 with higher precedence non-UTF8 VARCHAR

In SQL Server 2019 CTP 2, if you have UTF-8 data and concatenate that with another VARCHAR value of a higher Collation Precedence, and having a non-UTF-8 Collation, the Code Page will change to that of the other Collation. And since a) that other Collation is a non-UTF-8 Collation, and b) the datatype is still VARCHAR, any characters that do not have a best-fit mapping in the destination Code Page will be converted into one or two "?"s (depending on if the character is BMP or Supplementary).

To be fair, this scenario already existed, to a degree, when doing similar concatenations with VARCHAR data of differing Code Pages. And in those cases, the total number of characters that could potentially change was only 128 (Code Points 128 - 255), though realistically not the full 128 due to there usually being at least some minimal overlap of characters between the various Code Pages.

The difference now, with UTF-8, is that previously it was never possible to lose Unicode characters since they were always contained in a datatype that had a higher precedence (NVARCHAR) and so it wouldn't matter if the Code Page changed as the Code Page was no longer being used.

The most likely scenario is concatenating UTF-8 literals and/or variables with non-UTF-8 VARCHAR columns.

If this is considered expected behavior and not a bug, OR if it is too difficult to adjust how Collation Precedence and/or Datatype Precedence works such that if a UTF-8 Collation is being used, it is either preserved or the resulting expression is implicitly converted to NVARCHAR, then there will definitely need to be some clear guidance on this to help reduce confusion and set expectations.

-----------------------------------------------------------------------
IF (DB_ID(N'UTF8') IS NULL)
BEGIN
CREATE DATABASE [UTF8] COLLATE Latin1_General_100_CI_AS_SC_UTF8;
ALTER DATABASE [UTF8] SET RECOVERY SIMPLE;
END;
GO

USE [UTF8];

DECLARE @Test TABLE
(
[Value] VARCHAR(50) COLLATE Latin1_General_100_CI_AS -- Code Page 1252
);

INSERT INTO @Test ([Value]) VALUES ('a');
SELECT * FROM @Test;
-- a

DECLARE @LowerPrecedence VARCHAR(50) = NCHAR(165239); -- Code Page 65001
SELECT @LowerPrecedence AS [TheCharacter],
DATALENGTH(@LowerPrecedence) AS [Numbytes],
CONVERT(VARBINARY(4), @LowerPrecedence) AS [UTF-8_bytes];
/*
TheCharacter Numbytes UTF-8_bytes
𨕷 4 0xF0A895B7
*/

SELECT SQL_VARIANT_PROPERTY(@LowerPrecedence, N'Collation') AS [VariableCollation];
-- Latin1_General_100_CI_AS_SC_UTF8

-- Data-loss here is expected as the container cannot hold the value:
INSERT INTO @Test ([Value]) VALUES (@LowerPrecedence);
SELECT * FROM @Test;
/*
Value
a
??
*/

SELECT [Value] + @LowerPrecedence AS [Concatenation],
SQL_VARIANT_PROPERTY([Value] + @LowerPrecedence, N'Collation') AS [ExpressionCollation]
FROM @Test;
/*
Concatenation ExpressionCollation
a?? Latin1_General_100_CI_AS
???? Latin1_General_100_CI_AS
*/
-----------------------------------------------------------------------

P.S. Please see "Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?" ( https://sqlquantumleap.com/2018/09/28/native-utf-8-support-in-sql-server-2019-savior-false-prophet-or-both/ ) for a detailed analysis of the new UTF-8 feature.

2 votes
Sign in
Check!
(thinking…)
Reset
or sign in with
    Password icon
    Signed in as (Sign out)

    We’ll send you updates on this idea

    Solomon Rutzky shared this idea  ·   ·  Flag idea as inappropriate…  ·  Admin →

    5 comments

    Sign in
    Check!
    (thinking…)
    Reset
    or sign in with
      Password icon
      Signed in as (Sign out)
      Submitting...
      • Solomon Rutzky commented  ·   ·  Flag as inappropriate

        Hello @pmasl : Please take a closer look at my description and example. It is clear from the "??" returned that both A) the value was converted into UTF-16, hence the double question mark instead of a single question mark as it is a Supplementary Character, and B) the value was then converted back to Code Page 1252, thanks to Collation Precedence, hence the question marks instead of the original character.

        However, if you prefer to see it differently, as an actual comparison, then execute the following (the results are included in comments). There is no way that the same UTF-8 character should match both "??" in Code Page 1252, AND "𨕷" in UTF-8.

        -------------------------
        USE [UTF8];

        DECLARE @Test2 TABLE
        (
        [CP1252] VARCHAR(50) COLLATE Latin1_General_100_CI_AS,
        [UTF8] VARCHAR(50) COLLATE Latin1_General_100_CI_AS_SC_UTF8
        );

        INSERT INTO @Test2 ([CP1252], [UTF8]) VALUES ('??', '𨕷');
        SELECT * FROM @Test2;
        -- ?? 𨕷

        SELECT * FROM @Test2 WHERE [CP1252] = '𨕷';
        -- ?? 𨕷

        SELECT * FROM @Test2 WHERE [UTF8] = '𨕷';
        -- ?? 𨕷
        -------------------------

      • Pedro Lopes commented  ·   ·  Flag as inappropriate

        Thank you for taking the time to work with this preview feature. As it stands, comparison between a UTF8 and non-Unicode data is done by converting both to UTF-16. We will investigate further as we continue to develop this preview feature..

      Feedback and Knowledge Base