Do you have a comment or suggestion to improve SQL Server? We’d love to hear it!

UTF-8 in SQL 2019: Binary UTF-8 Collations are missing

In SQL Server 2019, CPT 2 there are no binary versions of the new UTF-8 Collations. Why not? There does not seem to be any good reason for this omission.

1) It doesn't matter that the Code Page is the same for all UTF-8 Collations (Code Page = 65001): the same is true for all Unicode-only Collations (Code Page = 0) and there are multiples of those (not sure why though; it would make more sense to have a single "Common" / "Invariant" for "UnicodeOnly_BIN2" and "UTF8_BIN2").

2) It doesn't matter that binary Collations don't truly support Supplementary Characters: they don't for NVARCHAR, not even for Japanese_XJIS_140_BIN2 (remember, version 140 Collations implicitly support Supplementary Characters, except for the binary Collations). Just try this:

SELECT CONVERT(TEXT, N'a' COLLATE Japanese_XJIS_140_BIN2);
-- a

You don't get the error saying that the TEXT datatype can't be used with Collations that support Supplementary Characters (like you do if attempting to use "Japanese_XJIS_140_CI_AS").

3) Not having a binary Collation for the UTF8 Collations is inconsistent with all other Collations (even 2 of the SQL Server Collations have "_BIN2" versions!).

No need to add "_BIN" versions as that is just a waste of time and energy. But the "_BIN2" versions (at least) need to exist.

P.S. Please see "Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?" ( https://sqlquantumleap.com/2018/09/28/native-utf-8-support-in-sql-server-2019-savior-false-prophet-or-both/ ) for a detailed analysis of the new UTF-8 feature.

2 votes
Sign in
(thinking…)
Sign in with: oidc
Signed in as (Sign out)

We’ll send you updates on this idea

Solomon Rutzky shared this idea  ·   ·  Flag idea as inappropriate…  ·  Admin →

7 comments

Sign in
(thinking…)
Sign in with: oidc
Signed in as (Sign out)
Submitting...
  • Solomon Rutzky commented  ·   ·  Flag as inappropriate

    I did some research and found that this is actually a little worse than I had originally thought. It seems that the version 100 collations added 438 uppercase and lowercase mappings that were missing in the version 80 and 90 collations. BUT, then the version 140 collations added another 211 mappings that were missing from the version 100 collations, for a total of 649 mappings that are missing from the version 80 and 90 collations.

    Hence, we really do need a 140_UTF8_BIN2 collation (or "Invariant_140_UTF8_BIN2", or whatever).

    To see how I came up with that list of missing mappings, please see the "Different Versions" sections of the following post:

    Differences Between the Various Binary Collations (Cultures, Versions, and BIN vs BIN2) ( https://sqlquantumleap.wordpress.com/2019/03/13/differences-between-the-various-binary-collations-cultures-versions-and-bin-vs-bin2/ )

  • Solomon Rutzky commented  ·   ·  Flag as inappropriate

    Ooops, almost forgot to include the UTF8_BIN2 behavior when using the actual UTF-8 encoding. The test query should be:

    -------------------------------------------------------------------------------------
    SELECT LOWER(NCHAR(504) COLLATE Latin1_General_BIN2) AS [Version80],
    LOWER(NCHAR(504) COLLATE Chinese_Taiwan_Stroke_90_BIN2) AS [Version90],
    LOWER(NCHAR(504) COLLATE Latin1_General_100_BIN2) AS [Version100],
    LOWER(NCHAR(504) COLLATE Japanese_XJIS_140_BIN2) AS [Version140],
    LOWER(NCHAR(504) COLLATE UTF8_BIN2) AS [UTF8 is Version 80 😿],
    CONVERT(VARCHAR(2), LOWER(NCHAR(504) COLLATE UTF8_BIN2)) AS [UTF8 is Version 80 😿];
    /*
    Version80 Version90 Version100 Version140 UTF8 is Version 80 😿 UTF8 is Version 80 😿
    Ǹ Ǹ ǹ ǹ Ǹ Ǹ
    */
    -------------------------------------------------------------------------------------

    So, if there is to be only one UTF8_BIN2 collation, then it needs to be at least version 100. It can be any of the following:

    Invariant_100_UTF8_BIN2
    General_100_UTF8_BIN2
    Latin1_General_100_UTF8_BIN2 (since it is currently using LCID 1033)

    Using Invariant or General works better for future deprecation of the redundant _BIN and _BIN2 variations of the Unicode-only collations since those also only truly need a single binary collation (the Unicode-only collations already have both _BIN and _BIN2, so I assume we will also need the _BIN variations):

    Invariant_90_BIN
    Invariant_90_BIN2
    Invariant_100_BIN
    Invariant_100_BIN2

    Please note: there are no Unicode-only collations at version 80 or 140, only 90 and 100.

  • Solomon Rutzky commented  ·   ·  Flag as inappropriate

    Actually, it seems that point #2 in my review of the new UTF8_BIN2 collation (previous comment) has far greater impact than I had originally thought. By going with a non-versioned / version 80 (or version 0 if going by the values returned by COLLATIONPROPERTY) collation, while on the surface there is no difference because the sorting and comparison doesn't change, there are still properties of the code points that do change, such as lowercase / uppercase mappings. And the collations of versions prior to 100 (or 2 going by COLLATIONPROPERTY) are missing A LOT of mappings, version 80 / 0 missing the most. For example:

    --------------------------------------------------------------
    SELECT LOWER(NCHAR(504) COLLATE Latin1_General_BIN2) AS [Version80],
    LOWER(NCHAR(504) COLLATE Chinese_Taiwan_Stroke_90_BIN2) AS [Version90],
    LOWER(NCHAR(504) COLLATE Latin1_General_100_BIN2) AS [Version100],
    LOWER(NCHAR(504) COLLATE Japanese_XJIS_140_BIN2) AS [Version140],
    LOWER(NCHAR(504) COLLATE UTF8_BIN2) AS [UTF8 is Version 80 😿];
    /*
    Version80 Version90 Version100 Version140 UTF8 is Version 80 😿
    Ǹ Ǹ ǹ ǹ Ǹ
    */
    --------------------------------------------------------------

    This is something that MUST be addressed before this goes RTM! Thanks, and thanks again for including a single, BIN2 collation 😺.

  • Solomon Rutzky commented  ·   ·  Flag as inappropriate

    I see that a "UTF8_BIN2" collation was added in CTP 2.3. Thanks for doing that.

    However, there are a few concerns with it:

    1) The name, being just UTF8_BIN2, is very much outside of the standard format that all other collation names fit into. I'm not entirely sure if this is truly a problem or not, but it just seems like something that lends itself well to the "law of unintended consequences", and that there might be some down-stream negative impacts, especially for any code that parses the collation name.

    2) Related to #1, version # is not part of the name. But, looking at the version number, it is reported as being version 0, which should be the collations that came with SQL Server 2000. Given the complexities of updating the TDS specification, and that there are really no other new collations since the UTF8 collations have thus far been attached to previously existing collations, it makes sense to not create / waste a whole new version number for just one single collation. But I think there are better choices than version 0, as no other UTF8 collation is at version 0. I think it would be fine to use version 3 (i.e. _140 in the name) since collations of that version were the first to inherently support supplementary characters and no longer need a set of _SC collations. I realize that a binary collation does not deal with supplementary characters, but so far the rule has been stated as UTF8 collations apply to collations that support supplementary characters, which means those ending in _SC or those with _140 in their names. Using _140 (i.e. version 3) would be consistent with that, and would still make sense as that is the most recent version.

    3) Related to #1, culture name / collation designator is not part of the name. I think "Invariant" would be fine here. It's terminology used in .NET so it will already make sense to some folks. Or, it could again be "General". It can even be "Latin1_General" seeing as the LCID is 1033. It just needs to be something.

    4) Speaking of LCID, I assume that 1033 is being used because it's the common / popular / widely-used one? Is it possible to use 0, or whatever is used for the invariant culture? Or is 1033 really the best choice?

  • Solomon Rutzky commented  ·   ·  Flag as inappropriate

    Hi Pedro. Yes, I will email you. But, to have it stated for other readers, the use cases would be (generally):

    1) whatever the current use cases are for anyone using a binary Collation for anything (getting string functions to work with characters that don't equate to anything, like CHAR(0) or some others, columns of alphanumeric codes that don't need linguistic rules and would benefit from the better performance of a binary Collation, etc)

    2) Now that CTP 2.1 adds the ability to select UTF-8 Collations as the Instance-level Collation, having the ability to continue using a binary Collation while also making use of the new UTF-8 encoding (for whatever questionable benefit it might provide). Point being, folks do use binary Collations at the Instance-level. If any of those folks want to continue doing so but also want to use UTF-8 for VARCHAR data, then they cannot do both at the same time, and for no stated (let alone "good") reason.

Feedback and Knowledge Base