Do you have a comment or suggestion to improve SQL Server? We’d love to hear it!

UTF-8 in SQL 2019: Inconsistent handling of invalid UTF-8 byte sequences

In SQL Server 2019 CTP 2, when inserting invalid UTF-8 byte sequences into VARCHAR types that are using a "_UTF8" Collation, you will sometimes get an error, and at other times that operation will succeed and you will have a string showing "unknown" characters.

For example, inserting 0x80 by itself fails. But inserting something along with it can work (anything that is not more of 0x80). And, inserting 0x90 by itself, also invalid, does not fail.

This might be related to another bug I just filed: Invalid UTF-8 bytes get (Msg 682, Level 22, State 148) Internal error. Buffer provided to read column value is too small. ( https://feedback.azure.com/forums/908035-sql-server/suggestions/35587279-utf-8-in-sql-2019-invalid-utf-8-bytes-get-msg-68 ).

-----------------------------------------------------------
CREATE DATABASE [UTF8] COLLATE Latin1_General_100_CI_AS_SC_UTF8;

USE [UTF8];

SELECT CONVERT(VARCHAR(4), 0x80);
/*
Msg 9833, Level 16, State 2, Line XXXXX
Invalid data for UTF8-encoded characters
*/

-- The following do not fail, yet are likewise invalid:
SELECT CONVERT(VARCHAR(4), 0x90);
SELECT CONVERT(VARCHAR(4), 0x90508050);
SELECT CONVERT(VARCHAR(4), 0x8050)
SELECT CONVERT(VARCHAR(4), 0x5080)

---

DECLARE @Test TABLE ([Character] VARCHAR(4)
COLLATE Latin1_General_100_CI_AS_SC_UTF8 NOT NULL);

INSERT INTO @Test ([Character]) VALUES (0x807F);
SELECT * FROM @Test;

INSERT INTO @Test ([Character]) VALUES (0x8080);
/*
Msg 9833, Level 16, State 2, Line XXXXX
Invalid data for UTF8-encoded characters
*/
-----------------------------------------------------------

P.S. Please see "Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?" ( https://sqlquantumleap.com/2018/09/28/native-utf-8-support-in-sql-server-2019-savior-false-prophet-or-both/ ) for a detailed analysis of the new UTF-8 feature.

1 vote
Sign in
(thinking…)
Sign in with: Microsoft
Signed in as (Sign out)

We’ll send you updates on this idea

Solomon Rutzky shared this idea  ·   ·  Flag idea as inappropriate…  ·  Admin →

2 comments

Sign in
(thinking…)
Sign in with: Microsoft
Signed in as (Sign out)
Submitting...
  • Solomon Rutzky commented  ·   ·  Flag as inappropriate

    I have run these tests on CTP 2.1 and this issue _might_ be resolved. Things are definitely improved. The following test now fails (as it should):

    SELECT CONVERT(VARCHAR(4), 0x90);

    And, according to the Unicode Standard (v11), in Chapter 3 ( http://www.unicode.org/versions/Unicode11.0.0/ch03.pdf ): Conformance, section 3.9 "Encoding Forms" (page 56 of the PDF -- page 127 in the PDF):

    * the sequence <C2 41 42> must return as U+FFFD, U+0041, U+0042 and it does:
    SELECT CONVERT(VARCHAR(14), 0xC24142); -- �AB

    * (on the next page) the sequence <F0 80 80 41> is only required to return the 41 but can handle the previous, erroneous sequence in a variety of ways. SQL Server returns:

    SELECT CONVERT(VARCHAR(14), 0xF0808041); -- ��A

    Not sure if that is how it should be. Nor are the following entirely conclusive:

    SELECT CONVERT(VARCHAR(14), 0xF08080); -- {nothing}
    SELECT CONVERT(VARCHAR(14), 0xF0808080); -- ���
    SELECT CONVERT(VARCHAR(14), 0xF080808080); -- ERROR: Invalid data for UTF8-encoded characters

  • Pedro Lopes commented  ·   ·  Flag as inappropriate

    Thank you for taking the time to work with this preview feature. We will investigate.

Feedback and Knowledge Base