SqlClient incorrectly encrypts and decrypts Always Encrypted data for UTF-8 collations
The attached script demonstrates some bugs with the new UTF-8 collations and Always Encrypted. The script assumes that AE with enclaves is used, but that is only for convenience; my interpretation is that the issue lies entirely at the client level.
The script first creates an unencrypted table with three text columns: one nvarchar column with a binary collation, one nvarchar column with a plain CS_AS collation, and one varchar column using a UTF-8 collation. The script then adds three rows with the same text in all columns. The first is a Swedish word, the next is the English translation of that word, and the last is a major city in Poland whose name has three of four characters outside the ASCII range.
Next, the columns are altered to be encrypted in place. A SELECT now displays the two nvarchar columns correctly, but the UTF-8 column displays Chinese characters. If you query the table from SQLCMD, you will find that the UTF-8 data is displayed correctly.
The next part of the script adds the same words again, taking advantage of the automatic parameterisation. A new SELECT on the table reveals more Chinese characters, but if you look closely, they are not the same as before. If you query the table from SQLCMD, you find that räksmörgås displays as r?ksm?rg?s and Łódź displays as L?dz.
Next, the script alters the table to remove encryption, and a final SELECT displays the first three rows correctly, while the UTF-8 data is mutilated for the rows that were entered while encryption was active.
My conclusion is that both decryption and encryption are broken in the .NET API. How it arrives at the Chinese text is beyond me, but I can see what happens when it encrypts: it converts the data to the 1252 code page, which explains why Ł in Łódź becomes L, and the same thing happens to ź. (These characters are not in Latin-1.) Interestingly enough, this happened even when I set the system locale in Windows to use UTF-8.
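A rough Python sketch of the mechanism I believe is at work (an illustration, not the actual SqlClient code): the parameter value is encoded with code page 1252 on the way in, and the stored bytes are later read back as UTF-8. The `mangle` helper below is hypothetical; note that Python's cp1252 codec has no Windows-style best-fit mapping, so the sketch approximates it for Ł and ź:

```python
def mangle(text: str) -> str:
    """Approximate the observed corruption: encode as code page 1252,
    then decode the stored bytes as UTF-8 (hypothetical helper)."""
    # Windows' best-fit conversion maps Ł -> L and ź -> z; Python's
    # cp1252 codec does not do best-fit, so approximate it here.
    best_fit = text.translate(str.maketrans("Łź", "Lz"))
    cp1252_bytes = best_fit.encode("cp1252", errors="replace")
    # Any byte >= 0x80 from cp1252 is an invalid standalone UTF-8
    # sequence, so it comes back as the replacement character,
    # which SQLCMD renders as '?'.
    return cp1252_bytes.decode("utf-8", errors="replace").replace("\ufffd", "?")

print(mangle("räksmörgås"))  # r?ksm?rg?s
print(mangle("Łódź"))        # L?dz
```

This reproduces exactly the r?ksm?rg?s and L?dz output seen from SQLCMD: the ASCII letters survive, while every byte outside ASCII is invalid UTF-8 on the way back out.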
Solomon Rutzky commented
Here are some examples showing different ways that can produce what you are seeing:
SELECT CONVERT(VARBINARY(20), N'Łódź') AS [UTF-16];
-- 0x4101F30064007A01 (UTF-16)
SELECT CONVERT(VARBINARY(20), 'Łódź' COLLATE Latin1_General_100_CI_AS_SC_UTF8) AS [UTF-8];
-- 0xC581C3B364C5BA (UTF-8)
SELECT CONVERT(NVARCHAR(20), 0xC581C3B364C5BA) AS [UTF-8bytesAsUTF-16];
SELECT CONVERT(VARBINARY(20), 'L?dz' COLLATE Latin1_General_100_CI_AS_SC_UTF8) AS [UTF-8];
-- 0x4C3F647A (UTF-8)
SELECT CONVERT(NVARCHAR(20), 0x4C3F647A) AS [UTF-8bytesAsUTF-16];
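For anyone without a SQL Server handy, the byte values in the queries above can be reproduced with a short Python sketch (Python is used here purely for illustration; the bug itself is in the .NET client):

```python
# The hex values above are simply the UTF-16LE and UTF-8
# encodings of the same string.
word = "Łódź"

print(word.encode("utf-16-le").hex().upper())  # 4101F30064007A01
print(word.encode("utf-8").hex().upper())      # C581C3B364C5BA

# Interpreting the UTF-8 bytes as UTF-16LE code units yields the
# "Chinese-looking" mojibake; the odd trailing byte is dropped here.
utf8 = word.encode("utf-8")
mojibake = utf8[: len(utf8) // 2 * 2].decode("utf-16-le")
print([hex(ord(c)) for c in mojibake])  # ['0x81c5', '0xb3c3', '0xc564']
```

The three resulting code points fall in the CJK ideograph and Hangul syllable blocks, which is why the corrupted output looks "Chinese".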
DECLARE @CP1252bytes VARBINARY(20) = 0x655DE18593E7A9A433E39398;
DECLARE @ViewTheBytes TABLE
(
    CP1252 VARCHAR(20) COLLATE Latin1_General_100_CI_AS_SC,
    UTF8 VARCHAR(20) COLLATE Latin1_General_100_CI_AS_SC_UTF8
);
INSERT INTO @ViewTheBytes VALUES (@CP1252bytes, @CP1252bytes);
SELECT * FROM @ViewTheBytes;
-- CP1252: e]á…“ç©¤3ã“˜
-- UTF8: e]ᅓ穤3㓘
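The same dual interpretation can be reproduced in Python (again purely illustrative): decoding one byte sequence under both code pages gives the two results shown above.

```python
# The same 12 bytes decoded two ways, mirroring the table variable above.
raw = bytes.fromhex("655DE18593E7A9A433E39398")

print(raw.decode("cp1252"))  # e]á…“ç©¤3ã“˜
print(raw.decode("utf-8"))   # e]ᅓ穤3㓘
```

The bytes happen to form valid UTF-8 sequences here, so no replacement characters appear; whether mojibake looks "almost right" or completely foreign depends entirely on which byte patterns the wrong decoder accepts.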
Solomon Rutzky commented
Hi Erland. I don't see an attached repro script so I can't see exactly what you are seeing (if you can send it to me that would be great: https://sqlquantumleap.com/contact ).
Question: is the database's default collation UTF8 or non-UTF8? That _might_ have some impact here.
The Asian characters (not always Chinese, and not even always Asian) are most likely the result of the UTF-8 bytes being interpreted as if they were UTF-16. If you provide the characters that you are seeing and also what you expected to see, I can confirm or clarify. It is also possible to get mangled text by reading Extended ASCII code page data as if it were UTF-8, which would give a mix of characters that look correct and potentially some that don't. Whatever combination of bytes and incorrect interpretation is causing this, it is a phenomenon commonly referred to as mojibake ( https://en.wikipedia.org/wiki/Mojibake ).
Also, keep in mind that the font being used by the command prompt window in which you are running SQLCMD plays a role in how any Unicode output is displayed. Different fonts display different ranges of characters, and even for the characters that a particular font does have a mapping for, there is no guarantee that BIDI and/or combining properties are implemented correctly (or at all).
Unfortunately, this issue is yet another example of the huge potential for unintended consequences that comes from sticking 1- to 4-byte Unicode data into a datatype that previously only allowed non-Unicode, 1- to 2-byte data. Rather than spending more developer time destabilizing the SQL Server code base in order to get this to work, the UTF-8 collations should be fully removed and any additional time devoted to implementing Unicode Compression for the NVARCHAR(MAX) datatype ( https://feedback.azure.com/forums/908035-sql-server/suggestions/32686957-unicode-compression-nvarchar-max ). That would provide far greater benefit to many more customers, and with fewer drawbacks.
I have added this issue to the CTP 2.2 Update section of my UTF-8 analysis: https://sqlquantumleap.com/2018/09/28/native-utf-8-support-in-sql-server-2019-savior-false-prophet-or-both/#UpdateCTP22