
UTF-8 in SQL 2019: Invalid UTF-8 bytes get (Msg 682, Level 22, State 148) Internal error. Buffer provided to read column value is too small.

When inserting some invalid UTF-8 byte sequences, you can get either a mysterious (and changing) extra byte, or in some cases you can get the following fatal error (and are then disconnected):

Msg 682, Level 22, State 148, Line XXXXX
Internal error. Buffer provided to read column value is too small. Run DBCC CHECKDB to check for any corruption.

For these tests, the database does not need to have a "_UTF8" Collation as its default Collation.
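For context on why these byte sequences are invalid: 0xE4 and 0xE0 are lead bytes of 3-byte UTF-8 sequences, so a lone trailing byte leaves the sequence incomplete. A quick check outside of SQL Server (a Python sketch, not part of the original repro) confirms this:

```python
# Verify that the 2-byte values used in these tests are not valid UTF-8.
# 0xE4 and 0xE0 are lead bytes of 3-byte sequences, so a single trailing
# byte leaves the sequence incomplete (or outright malformed).
for pair in (b"\xE4\xA9", b"\xE4\xAA", b"\xE0\xCC"):
    try:
        pair.decode("utf-8")
        print(pair.hex().upper(), "decoded OK (unexpected)")
    except UnicodeDecodeError as err:
        print(pair.hex().upper(), "is invalid UTF-8:", err.reason)
```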

The first test shows that inserting a value of 0xE4A9, a partial / invalid UTF-8 sequence, into a VARCHAR(2) via a CTE works just fine:

--------------------------------------------------------------------
SELECT CONVERT(VARBINARY(2), 58537); -- 0xE4A9

DECLARE @UnicodeCodePoints TABLE
(
    [Character] VARCHAR(2) COLLATE Latin1_General_100_CI_AS_SC_UTF8 NOT NULL
);

;WITH nums AS
(
    SELECT 58537 AS [CodePoint] -- 57537 works ( 57538 - 57586 break )
)
INSERT INTO @UnicodeCodePoints ([Character])
SELECT CONVERT(VARBINARY(2), nums.[CodePoint])
FROM nums;

GO
--------------------------------------------------------------------
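For reference, CONVERT(VARBINARY(2), 58537) maps the integer to its big-endian 2-byte representation. The same mapping, for values that fit in two bytes, can be sketched outside of SQL Server in Python (illustrative only):

```python
# CONVERT(VARBINARY(2), <int>) keeps the low-order bytes of the integer in
# big-endian order; for values under 65536, int.to_bytes(2, "big") matches.
assert (58537).to_bytes(2, "big") == b"\xE4\xA9"  # matches 0xE4A9 above
assert (58538).to_bytes(2, "big") == b"\xE4\xAA"
assert (57548).to_bytes(2, "big") == b"\xE0\xCC"
print("big-endian byte mapping matches the CONVERT results")
```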

However, as this next test demonstrates, inserting 0xE4AA, a similarly partial / invalid sequence, into a VARCHAR(2) via a CTE raises a fatal error and the connection is terminated. Yet inserting that same value into the same column outside of the CTE does not error, and it successfully inserts the invalid byte sequence. Several combinations of VARCHAR size and VARBINARY size determine whether this succeeds or fails; they are detailed in comments below the queries.

--------------------------------------------------------------------
SELECT CONVERT(VARBINARY(2), 58538); -- 0xE4AA

DECLARE @UnicodeCodePoints TABLE
(
    [Character] VARCHAR(2) COLLATE Latin1_General_100_CI_AS_SC_UTF8 NOT NULL
);

INSERT INTO @UnicodeCodePoints ([Character]) VALUES (CONVERT(VARBINARY(2), 57538)); -- works
SELECT *, CONVERT(VARBINARY(4), [Character]) FROM @UnicodeCodePoints;

;WITH nums AS
(
    SELECT 57538 AS [CodePoint] -- 57537 works ( 57538 - 57586 break )
)
INSERT INTO @UnicodeCodePoints ([Character])
SELECT CONVERT(VARBINARY(2), nums.[CodePoint])
FROM nums;
/*
Msg 682, Level 22, State 148, Line XXXXX
Internal error. Buffer provided to read column value is too small. Run DBCC CHECKDB to check for any corruption.
(disconnected)
*/

-- Error when (VARCHAR size, VARBINARY size) = (2, 2) OR (3, 3) OR (4, >= 4)

-- Column size > SELECT expression size  =  Success
-- Column size = SELECT expression size  =  Internal error. Buffer provided to read column value is too small.
-- Column size < SELECT expression size  =  (when Column = 2 or 3) String or binary data would be truncated.
--                                          (when Column = 4)      Internal error. Buffer provided to read column value is too small.

-- Error also happens if this is a local temp table (and possibly also global temp and permanent tables)

GO
--------------------------------------------------------------------
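One notable detail about the breaking range noted in the comments (57538 through 57586): none of the 2-byte values in that neighborhood are valid UTF-8 -- the value that "works" (57537, i.e. 0xE0C1) is just as invalid as the ones that break. A Python sketch (illustrative, not part of the repro) confirms that the pass/fail boundary does not line up with UTF-8 validity, which hints that the difference lies in the server's internal buffer handling rather than in the encoding itself:

```python
# Every value from 57537 (reported to work) through 57586 (reported to break)
# produces the byte pair 0xE0 followed by 0xC1..0xF2 -- and those are never
# valid continuation bytes (continuations are 0x80..0xBF), so all of these
# pairs are equally invalid UTF-8.
for n in range(57537, 57587):
    pair = n.to_bytes(2, "big")
    try:
        pair.decode("utf-8")
        raise AssertionError(f"0x{pair.hex().upper()} unexpectedly decoded")
    except UnicodeDecodeError:
        pass
print("57537-57586: all 50 two-byte values are invalid UTF-8")
```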

Finally, the following test shows that if there is enough space in the destination VARCHAR, then inserting the invalid 2-byte sequence includes an additional byte, the value of which often changes between executions (if it appears to stay the same over several executions, wait a minute or so and try again; I think it has to do with other threads executing).

--------------------------------------------------------------------
SELECT CONVERT(VARBINARY(2), 57548); -- 0xE0CC

DECLARE @UnicodeCodePoints TABLE
(
    [ID] INT IDENTITY(1, 1) NOT NULL,
    [Character] VARCHAR(30) COLLATE Latin1_General_100_CI_AS_SC_UTF8 NOT NULL
);

;WITH nums AS
(
    SELECT 57548 AS [CodePoint] -- 57548 works now (or does it? ;-)
)
INSERT INTO @UnicodeCodePoints ([Character])
SELECT CONVERT(VARBINARY(2), nums.[CodePoint])
FROM nums;

INSERT INTO @UnicodeCodePoints ([Character]) VALUES (CONVERT(VARBINARY(2), 57548));

SELECT *, CONVERT(VARBINARY(4), [Character]) FROM @UnicodeCodePoints;
-- Mysterious, changing 3rd byte (keep hitting F5)
--------------------------------------------------------------------
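Whatever the mysterious third byte turns out to be, it cannot repair the sequence: after a 0xE0 lead byte, UTF-8 only allows a second byte in the range 0xA0-0xBF, so 0xE0CC followed by any third byte is still invalid. A quick exhaustive check in Python (illustrative only):

```python
# After 0xE0, UTF-8 only allows a second byte in 0xA0..0xBF, so 0xE0CC can
# never be completed into a valid 3-byte character -- no third byte helps.
for third in range(256):
    seq = b"\xE0\xCC" + bytes([third])
    try:
        seq.decode("utf-8")
        raise AssertionError(f"0x{seq.hex().upper()} unexpectedly decoded")
    except UnicodeDecodeError:
        pass
print("0xE0CC plus any third byte: still invalid UTF-8")
```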

P.S. Please see "Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?" ( https://sqlquantumleap.com/2018/09/28/native-utf-8-support-in-sql-server-2019-savior-false-prophet-or-both/ ) for a detailed analysis of the new UTF-8 feature.


Solomon Rutzky shared this idea

2 comments

  • Solomon Rutzky commented

    I have run these tests on CTP 2.1 and can confirm that this issue is now resolved. Using the invalid UTF-8 sequences now gets the following error:

    Msg 9833, Level 16, State 2, Line XXXXX
    Invalid data for UTF8-encoded characters

  • Pedro Lopes commented

    Thank you for taking the time to work with this preview feature. We will investigate.
