Do you have a comment or suggestion to improve SQL Server? We’d love to hear it!

UTF-8 in SQL 2019: Collation Precedence allows for silent truncation (data loss) implicitly converting Extended ASCII characters into UTF8

In SQL Server 2019 CTP 2.1, VARCHAR data being implicitly converted from a non-UTF-8 Collation into a UTF-8 Collation can be silently truncated when Extended ASCII characters in being used (i.e. characters with values between 128 and 255). Standard ASCII characters (values 0 - 127) are not affected since they are 1 byte whether they are in a UTF-8 Collation or not. But Extended ASCII characters (i.e. those that change per each Code Page) are never 1 byte: they are usually 2 bytes (and possibly 3, though I'm not sure about this and haven't tested all of them).

The problem is that the conversion process uses the existing "max_length" of the non-UTF-8 data for the result of the converted string. Characters requiring more than a single byte will push characters off the end of the string. Meaning, a string literal of '§' in a database using any [SQL_]Latin1_General_* Collation is parsed as being VARCHAR(1). However, in UTF-8, that character requires 2 bytes. Since both bytes cannot fit into VARCHAR(1), neither byte will exist after the conversion.

For example:

---------------------------------------------------------------------------------
-- Try this in any DB with a non-UTF8 [SQL_]Latin1_General default Collation:
SELECT CHAR(167), CHAR(170), CHAR(174), CHAR(191), CHAR(198);
-- § ª ® ¿ Æ

-- Convert to UTF-8:
SELECT '§' COLLATE Latin1_General_100_CI_AS_SC_UTF8;
-- {empty string}

SELECT SQL_VARIANT_PROPERTY('§' COLLATE Latin1_General_100_CI_AS_SC_UTF8, 'MaxLength');
-- 1 (but it should be 2)

SELECT CONVERT(VARBINARY(4), '§' COLLATE Latin1_General_100_CI_AS_SC_UTF8) AS [Bytes],
DATALENGTH('§' COLLATE Latin1_General_100_CI_AS_SC_UTF8) AS [NumBytes];
-- 0x 0

-- Convert to UTF-8 again, but add a character to make it 2 bytes:
SELECT '§ª®¿Æ' COLLATE Latin1_General_100_CI_AS_SC_UTF8;
-- §ª (the "®¿Æ" is missing)

SELECT SQL_VARIANT_PROPERTY('§ª®¿Æ' COLLATE Latin1_General_100_CI_AS_SC_UTF8, 'MaxLength');
-- 5 (but it should be 10)

SELECT CONVERT(VARBINARY(4), '§ª®¿Æ' COLLATE Latin1_General_100_CI_AS_SC_UTF8) AS [Bytes],
DATALENGTH('§ª®¿Æ' COLLATE Latin1_General_100_CI_AS_SC_UTF8) AS [NumBytes];
-- 0xC2A7C2AA 4
---------------------------------------------------------------------------------

A more thorough test script, with additional scenarios, can be found on PasteBin.com at: https://pastebin.com/td4tgmzb

This bug is very similar to UTF-8 in SQL 2019: Collation Precedence allows for Unicode character data loss concatenating UTF8 with higher precedence non-UTF8 VARCHAR ( https://feedback.azure.com/forums/908035-sql-server/suggestions/35604949-utf-8-in-sql-2019-collation-precedence-allows-for ). The difference is: that other bug deals with the non-UTF-8 Collation being the higher precedence and thus the conversion goes in the wrong direction (i.e. into Extended ASCII). Here the Code Page conversion goes in the correct direction, but there is possibly not enough space to fit the result.

P.S. Please see "Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?" ( https://sqlquantumleap.com/2018/09/28/native-utf-8-support-in-sql-server-2019-savior-false-prophet-or-both/ ) for a detailed analysis of the new UTF-8 feature.

2 votes
Sign in
Check!
(thinking…)
Reset
or sign in with
    Password icon
    Signed in as (Sign out)

    We’ll send you updates on this idea

    Solomon Rutzky shared this idea  ·   ·  Flag idea as inappropriate…  ·  Admin →

    3 comments

    Sign in
    Check!
    (thinking…)
    Reset
    or sign in with
      Password icon
      Signed in as (Sign out)
      Submitting...
      • Solomon Rutzky commented  ·   ·  Flag as inappropriate

        To be clear, the truncation doesn't affect only Extended ASCII characters, it chops off whatever is on the right-side of the string (i.e. the end of the string). It's just that it is the Extended ASCII characters which cause there to be truncation in the first place. For example:

        -------------------------------------------------------------------------------
        -- Try this in any DB with a non-UTF8 [SQL_]Latin1_General default Collation:

        -- Convert to UTF-8:
        SELECT '§x' COLLATE Latin1_General_100_CI_AS_SC_UTF8;
        -- § (2-byte character fits in the two allocated bytes, leaving no room for char #2)

        SELECT 'x§' COLLATE Latin1_General_100_CI_AS_SC_UTF8;
        -- x (1-byte character fits in the two allocated bytes, leaving only 1 byte left;
        -- not enough room for 2-byte character so 2nd byte is left empty)
        -------------------------------------------------------------------------------

      Feedback and Knowledge Base