UTF-8 in SQL 2019: CHAR() function broken in databases with "_UTF8" default Collation

In SQL Server 2019 CTP 2, when executing a query in a database that is using a "_UTF8" Collation as its default Collation, the CHAR() function no longer correctly returns characters above Code Point 127 (0x7F). Instead, it returns NULL for any value above 127.

Not only should it return the correct values for the remaining 8-bit Code Points (i.e. values 128 - 255 / 0x80 - 0xFF), but it really should return the intended character for all valid Unicode Code Points.

The NCHAR() function only handles Code Points 0 - 65535 / 0x0000 - 0xFFFF (i.e. the BMP / UCS-2 characters) when the current database is not Supplementary Character-Aware, otherwise it can return all Unicode characters outside of the BMP (i.e. Supplementary Characters). Likewise, the CHAR() function should return characters for Code Points above 255 / 0xFF when the current database is using a "_UTF8" Collation as its default Collation.
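The mechanics behind this can be illustrated outside of SQL Server. Below is a Python sketch (not T-SQL, and not how SQL Server is implemented) showing that UTF-8 uses a single byte only for Code Points 0 - 127; everything above that requires two to four bytes, which is why a function built around returning a single byte cannot cover them:

```python
# Python illustration (not T-SQL): UTF-8 byte lengths per Code Point.
# Code Points 0-127 fit in one byte; everything above needs 2-4 bytes,
# so a function that returns a single byte cannot represent them.
for cp in (65, 128, 232, 27581, 0x13333):
    encoded = chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
```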

------------------------------------------------------
CREATE DATABASE [UTF8] COLLATE Latin1_General_100_CI_AS_SC_UTF8;
ALTER DATABASE [UTF8] SET RECOVERY SIMPLE;
GO

USE [UTF8];

SELECT NCHAR(27581), -- 殽
CHAR(27581), -- NULL
CHAR(128), -- NULL
CONVERT(VARCHAR(3), NCHAR(27581)); -- 殽

SELECT CHAR(140), CHAR(189), CHAR(220);
-- NULL NULL NULL

-- Ideally, the following would return the same character
-- in a database using a "_UTF8" Collation:
SELECT CHAR(0x013333), NCHAR(0x013333);
-- NULL 𓌳

GO

CREATE DATABASE [CP1252] COLLATE Latin1_General_100_CI_AS_SC;
ALTER DATABASE [CP1252] SET RECOVERY SIMPLE;
GO

USE [CP1252];

SELECT CHAR(140), CHAR(189), CHAR(220);
-- Œ ½ Ü

------------------------------------------------------

P.S. Please see "Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?" ( https://sqlquantumleap.com/2018/09/28/native-utf-8-support-in-sql-server-2019-savior-false-prophet-or-both/ ) for a detailed analysis of the new UTF-8 feature.

Solomon Rutzky shared this idea

7 comments

  • Aaron Hope commented:

    If CHAR() is to convert a codepoint value of a given character set in the range 0-255 to the corresponding CHAR(1) value, then the docs should say as much. If you're going to mention ASCII, it should be to say that 8-bit character encoding values generally match ASCII values in the range 0-127.

    Saying "This function converts an int ASCII code to a character value" is not consistent with "An integer from 0 through 255".

    Given that adjusting the return type of NCHAR() to cope with UTF-16 supplementary characters set a precedent, I'd say expecting CHAR() to follow suit is reasonable. If it is to remain bound to an 8-bit range, perhaps it's worth giving consideration to the UTF-8 case in the documentation.

    SELECT CHAR(9835) -- NULL
    SELECT NCHAR(9835) COLLATE Latin1_General_100_CI_AI_UTF8 -- ♫

  • Solomon Rutzky commented:

    @pmasl : Ok, I just tested in DBs using the Japanese (932) and Korean (949) Code Pages, and in those cases the CHAR() function does appear to behave consistently, displaying only the single-byte characters in those character sets / encodings.

    So I guess I was not correct after all ;-). This item can be closed. It looks like this is really an issue to clear up in the documentation. I will do that when I get time.

  • Pedro Lopes commented:

    Hello, so upon further investigation, the CHAR function returns bytes (1, to be exact), not Code Points. Therefore, for characters that take more than 1 byte, this function returns NULL by design.
    Refer to https://en.wikipedia.org/wiki/Shift_JIS and see the Shift JIS byte map. The CHAR function is consistent with that: if the input of CHAR points to the 1st byte of a double-byte JIS X 0208 character, then it must return NULL.
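The lead-byte behavior described above can be sketched in Python (an illustration of Shift JIS decoding, not the SQL Server implementation): in code page 932, byte 0x81 is the first byte of a double-byte sequence, so on its own it is not a complete character.

```python
# Python illustration: in Shift JIS (code page 932), 0x81-0x9F and
# 0xE0-0xFC are lead bytes of double-byte characters. A lone lead
# byte is not a complete character, mirroring CHAR() returning NULL.
print(bytes([0x41]).decode("cp932"))      # single-byte 'A' decodes fine
try:
    bytes([0x81]).decode("cp932")         # lead byte alone is invalid
except UnicodeDecodeError:
    print("0x81 alone is not a valid cp932 character")
```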

  • Solomon Rutzky commented:

    Hello again. Yes, I will email you later about this and the binary Collation issue.

    But for now, I thought about this particular issue more and I am correct: the CHAR() function _does_ need to return Unicode characters. You said that it returns ASCII values 0 - 255, but only values 0 - 127 are standard across code pages. Values 128 - 255 are code page-dependent, which is why the CHAR() function is sensitive to the database in which it is executing.

    This is also why you currently get NULL back from CHAR() for values 128 - 255 when using a "_UTF8" Collation: it is trying to work with Code Page 65001, where Code Points above 127 take at least 2 bytes instead of the single byte the CHAR() function is likely expecting to work with. But, since the CHAR() function is code page-sensitive, it does need to work with Code Page 65001 to be consistent with how it works in all non-UTF8 databases. Code Page 1252 would be the closest choice if you had to pick an 8-bit Code Page, but I think there are 10 characters that are different. To see what I mean, try the following (continuing from my test code shown in the main post):

    ---------------------------------------------
    USE [CP1252];
    SELECT CHAR(232);
    -- Latin1_General (CP 1252) = è

    CREATE DATABASE [CP1255] COLLATE Hebrew_100_CI_AS;
    ALTER DATABASE [CP1255] SET RECOVERY SIMPLE;
    GO

    USE [CP1255];
    SELECT CHAR(232);
    -- Hebrew (CP 1255) = ט
    ---------------------------------------------
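    The same byte-to-character mapping can be checked outside of SQL Server. This Python sketch (assuming the standard cp1252 / cp1255 codecs match the databases' code pages) decodes byte 232 under each one:

```python
# Python illustration: byte 0xE8 (232) maps to a different character
# under each code page, matching the T-SQL results above.
b = bytes([232])
print(b.decode("cp1252"))  # Latin1_General (CP 1252)
print(b.decode("cp1255"))  # Hebrew (CP 1255)
```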

  • Pedro Lopes commented:

    Got it, we'll investigate using your repro.

    About the conversions: implicit or explicit, there is a penalty, I grant you that. However, as far as I can gather, it is an unlikely query pattern, and in either case non-SARGable. I'm curious to know more about real-world uses of these functions inline in queries. Can you please email me at utf8team@microsoft.com with more such details, if any?
    Thank you!

  • Solomon Rutzky commented:

    Hello @pmasl. Please look at and run my test code shown in the issue above. Even if you discount the idea of returning Unicode characters, it is still broken because it does not return anything but NULL for values 128 - 255:

    SELECT CHAR(140), CHAR(189), CHAR(220);
    -- NULL NULL NULL

    NULL is not the correct value to return for any of those. So it certainly appears to be affected by the "_UTF8" Collation, since it does work properly in a database that isn't using one.

    And regarding the idea of not updating this, won't forcing us to use NCHAR() for values 128+ result in a lot of implicit conversions since that returns NVARCHAR? Or will doing that require explicitly CONVERTing the result back into VARCHAR?

  • Pedro Lopes commented:

    Hello, as documented, the CHAR function only handles ASCII codes (an int from 0 to 255). This is kept for backwards-compatibility purposes and its behavior is consistent from that perspective. Nothing was changed or broken in this function's behavior. To get a Unicode value, please use the NCHAR function.
