UTF-8 in SQL 2019: Collation Precedence allows for Unicode character data loss concatenating UTF8 with higher precedence non-UTF8 VARCHAR
In SQL Server 2019 CTP 2, if you concatenate UTF-8 data with another VARCHAR value that has both a higher Collation Precedence and a non-UTF-8 Collation, the Code Page changes to that of the other Collation. And since a) that other Collation is a non-UTF-8 Collation, and b) the datatype is still VARCHAR, any characters that do not have a best-fit mapping in the destination Code Page are converted into one or two "?"s (depending on whether the character is a BMP or Supplementary Character).
To be fair, this scenario already existed, to a degree, when doing similar concatenations with VARCHAR data of differing Code Pages. And in those cases, the total number of characters that could potentially change was only 128 (Code Points 128 - 255), though realistically not the full 128 due to there usually being at least some minimal overlap of characters between the various Code Pages.
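A minimal sketch of that pre-existing behavior between two non-UTF-8 Code Pages (the object names and sample character are illustrative, and this assumes being run in a database whose default Collation uses the Greek Code Page 1253 so that the character survives the initial assignment):

```sql
-- Assumes a database whose default Collation is Greek (Code Page 1253).
DECLARE @OldStyle TABLE
(
    [Latin1] VARCHAR(50) COLLATE Latin1_General_100_CI_AS -- Code Page 1252
);
INSERT INTO @OldStyle ([Latin1]) VALUES ('a');

DECLARE @Greek VARCHAR(50) = NCHAR(0x03A3); -- "Σ" exists in 1253 but not in 1252

-- The column's implicit Collation outranks the variable's, so the expression
-- uses Code Page 1252; "Σ" does not exist there and gets replaced (typically
-- with "?", unless a best-fit mapping applies):
SELECT [Latin1] + @Greek AS [Concatenation]
FROM   @OldStyle;
```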
The difference now, with UTF-8, is that previously it was never possible to lose Unicode characters: they were always contained in a datatype that had a higher Datatype Precedence (NVARCHAR), so it didn't matter if the Code Page changed because the Code Page was no longer being used.
The most likely scenario is concatenating UTF-8 literals and/or variables with non-UTF-8 VARCHAR columns.
If this is considered expected behavior and not a bug, or if it is too difficult to adjust how Collation Precedence and/or Datatype Precedence work such that a UTF-8 Collation is either preserved or the resulting expression is implicitly converted to NVARCHAR, then there will definitely need to be some clear guidance on this to help reduce confusion and set expectations.
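In the meantime, the loss can be avoided manually by forcing the expression out of the non-UTF-8 Code Page before the concatenation happens. The following is only a sketch of two possible workarounds, not official guidance (object names are illustrative, and it assumes being run in a database whose default Collation is a UTF-8 Collation, such as the repro database below):

```sql
-- Assumes a database whose default Collation is a UTF-8 Collation.
DECLARE @Safe TABLE
(
    [Value] VARCHAR(50) COLLATE Latin1_General_100_CI_AS -- Code Page 1252
);
INSERT INTO @Safe ([Value]) VALUES ('a');

DECLARE @Utf8Value VARCHAR(50) = NCHAR(165239); -- Supplementary Character

-- Option 1: convert the non-UTF-8 side to NVARCHAR first; Datatype Precedence
-- then makes the whole expression UTF-16, so no characters are lost:
SELECT CONVERT(NVARCHAR(50), [Value]) + @Utf8Value AS [Preserved1]
FROM   @Safe;

-- Option 2: override Collation Precedence with an explicit COLLATE clause so
-- the expression stays in the UTF-8 Collation (Code Page 1252 characters all
-- convert to UTF-8 losslessly):
SELECT [Value] COLLATE Latin1_General_100_CI_AS_SC_UTF8 + @Utf8Value
           AS [Preserved2]
FROM   @Safe;
```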
IF (DB_ID(N'UTF8') IS NULL)
    CREATE DATABASE [UTF8] COLLATE Latin1_General_100_CI_AS_SC_UTF8;
ALTER DATABASE [UTF8] SET RECOVERY SIMPLE;
GO
USE [UTF8]; -- the repro below relies on the UTF-8 default Collation
DECLARE @Test TABLE
(
    [Value] VARCHAR(50) COLLATE Latin1_General_100_CI_AS -- Code Page 1252
);
INSERT INTO @Test ([Value]) VALUES ('a');
SELECT * FROM @Test;
DECLARE @LowerPrecedence VARCHAR(50) = NCHAR(165239); -- Code Page 65001
SELECT @LowerPrecedence AS [TheCharacter],
DATALENGTH(@LowerPrecedence) AS [Numbytes],
CONVERT(VARBINARY(4), @LowerPrecedence) AS [UTF-8_bytes];
TheCharacter Numbytes UTF-8_bytes
𨕷 4 0xF0A895B7
SELECT SQL_VARIANT_PROPERTY(@LowerPrecedence, N'Collation') AS [VariableCollation];
-- Data-loss here is expected as the container cannot hold the value:
INSERT INTO @Test ([Value]) VALUES (@LowerPrecedence);
SELECT * FROM @Test;
SELECT [Value] + @LowerPrecedence AS [Concatenation],
       SQL_VARIANT_PROPERTY([Value] + @LowerPrecedence, N'Collation') AS [ExpressionCollation]
FROM   @Test;
P.S. Please see "Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?" ( https://sqlquantumleap.com/2018/09/28/native-utf-8-support-in-sql-server-2019-savior-false-prophet-or-both/ ) for a detailed analysis of the new UTF-8 feature.
This behavior is by design and predates the UTF-8 implementation. Any comparison between non-UTF-8 and UTF-8 data always gets converted to UTF-16. A design change would have to be evaluated for a future version.
I just ran these tests on CTP 2.4 and the problem still exists.
I just ran these tests on CTP 2.3 and the problem still exists.
I just ran these tests on CTP 2.2 and the problem still exists.
I just ran these tests on CTP 2.1 and the problem still exists.
Hello @pmasl : Please take a closer look at my description and example. It is clear from the "??" returned that both A) the value was converted into UTF-16, hence the double question mark instead of a single question mark as it is a Supplementary Character, and B) the value was then converted back to Code Page 1252, thanks to Collation Precedence, hence the question marks instead of the original character.
However, if you prefer to see it differently, as an actual comparison, then execute the following (the results are included in comments). There is no way that the same UTF-8 character should match both "??" in Code Page 1252, AND "𨕷" in UTF-8.
DECLARE @Test2 TABLE
(
    [CP1252] VARCHAR(50) COLLATE Latin1_General_100_CI_AS,
    [UTF8]   VARCHAR(50) COLLATE Latin1_General_100_CI_AS_SC_UTF8
);
INSERT INTO @Test2 ([CP1252], [UTF8]) VALUES ('??', '𨕷');
SELECT * FROM @Test2;
-- ?? 𨕷
SELECT * FROM @Test2 WHERE [CP1252] = '𨕷';
-- ?? 𨕷
SELECT * FROM @Test2 WHERE [UTF8] = '𨕷';
-- ?? 𨕷
Pedro Lopes commented
Thank you for taking the time to work with this preview feature. As it stands, comparison between UTF-8 and non-Unicode data is done by converting both to UTF-16. We will investigate further as we continue to develop this preview feature.