Do you have a comment or suggestion to improve SQL Server? We’d love to hear it!

SQL Server 2019 Ukrainian_100_CI_AS_SC_UTF8 with "ьЬ" simmbol ( char 252 and 220 in Ukrainian codepage)

The last two lines will not return anything. For all other letters of the Cyrillic alphabet, this is not the case.
This is an old discrepancy in the Ukrainian_100_CI_AS encoding and similar.
Perhaps in UTF8 it is worth changing the behavior as in SQL_Ukrainian_CP1251_CI_AS?


select serverproperty('Collation')
exec sp_helpsort

create table #x (
a varchar(10) collate SQL_Ukrainian_CP1251_CI_AS
, b varchar(10) collate Ukrainian_100_CI_AS_SC_UTF8
, c varchar(10) collate Ukrainian_100_CI_AS
)

declare @x varchar(2) = char(252) + char(252) --ьь
declare @y varchar(2) = char(252) + char(220) --ьЬ

select @x, @y

insert into #x select 'ьЬ', 'ьЬ', 'ьЬ'
select * from #x where a like '%ьь%'
select * from #x where b like '%ьь%'
select * from #x where c like '%ьь%'

drop table #x

------ RESULT:

--------------------
SQL_Ukrainian_CP1251_CI_AS

(1 row affected)

Server default collation

--------------------

Ukrainian, case-insensitive, accent-sensitive, kanatype-insensitive, width-insensitive for Unicode Data, SQL Server Sort Order 108 on Code Page 1251 for non-Unicode Data

---- ----
ьь ьЬ

(1 row affected)

(1 row affected)

a b c
---------- ---------- ----------
ьЬ ьЬ ьЬ

(1 row affected)

a b c
---------- ---------- ----------

(0 rows affected)

a b c
---------- ---------- ----------

(0 rows affected)

7 votes
Sign in
(thinking…)
Sign in with: Microsoft
Signed in as (Sign out)

We’ll send you updates on this idea

pkjqybr shared this idea  ·   ·  Flag idea as inappropriate…  ·  Admin →

1 comment

Sign in
(thinking…)
Sign in with: Microsoft
Signed in as (Sign out)
Submitting...
  • Solomon Rutzky commented  ·   ·  Flag as inappropriate

    This issue has nothing to do with UTF-8. It is a general Windows Collation issue. Initially these two characters did equate when compared as case-insensitive. But then at some point, they were deliberately specified as being separate characters that could not equate even if using both case-insensitive and accent-insensitive. The Windows Server 2008 sort weights (if nothing else, at least very close to the definitions used for the version 100 Collations) define the two characters as initially being the same except for the difference in case:

    -----------------------------------------------
    DEFAULT 58093 ; Characters in this table are sorted by: SM, AW, DW, CW, Codepoint #
    ;CP SCRIPT ALPHA DIACRITIC CASING COMMENT
    ...
    0x044c 16 150 2 2 ;Cyrillic Small Soft Sign
    0x042c 16 150 2 18 ;Cyrillic Capital Soft Sign
    ...
    -----------------------------------------------

    The first value is the Code Point (CP). The "16" is the script, the "150" is the character within the script (i.e. the ALPHAbetic weight), the "2" after the "150" is the diacritic weight, and the "2" or "18" is the case weight. As we can see in the chart above, the two characters only differ in the "case weight" column (i.e. 2 vs 18). BUT, that is just the default sort weight. Culture/Locale-specific modifications are made depending on what culture / locale is being used. Looking at the "Ukrainian"-specific modifications, they are:

    -----------------------------------------------------------
    LCID 0x00000422 2 ;Ukranian (Cyrillic Script)

    0x044c 16 168 2 2 ;Cyrillic Small Soft Sign
    0x042c 16 188 2 18 ;Cyrillic Capital Soft Sign
    -----------------------------------------------------------

    As you can see, the values in the "Alphabetic Weight" column are now different (i.e. 168 and 188). This means two things:

    1) This behavior, even if incorrect, is intentional
    2) It is only the Ukrainian locale that redefines these two characters, so all other locales/cultures (i.e. Collations) will still equate them, assuming compared as case-insensitive. The following test proves this:

    -----------------------------------------------------------------------
    -- Windows Collation that came with SQL Server 2000:
    SELECT 1 WHERE N'ь' = N'Ь' COLLATE Ukrainian_CI_AS; -- 1

    -- Windows Collation that came with SQL Server 2008:
    SELECT 2 WHERE N'ь' = N'Ь' COLLATE Ukrainian_100_CI_AS; -- nothing
    SELECT 3 WHERE N'ь' = N'Ь' COLLATE Ukrainian_100_CI_AS_SC_UTF8; -- nothing

    -- Try again, but also with being Accent Insensitive:
    SELECT 4 WHERE N'ь' = N'Ь' COLLATE Ukrainian_100_CI_AI; -- nothing
    SELECT 5 WHERE N'ь' = N'Ь' COLLATE Ukrainian_100_CI_AI_SC_UTF8; -- nothing

    -- Try again (with accent-sensitive), but use another culture / locale:
    SELECT 6 WHERE N'ь' = N'Ь' COLLATE Hebrew_100_CI_AS; -- 6
    SELECT 7 WHERE N'ь' = N'Ь' COLLATE Hebrew_100_CI_AS_SC_UTF8; -- 7

    SELECT 8 WHERE N'ь' = N'Ь' COLLATE Latin1_General_100_CI_AS; -- 8
    SELECT 9 WHERE N'ь' = N'Ь' COLLATE Latin1_General_100_CI_AS_SC_UTF8; -- 9
    -----------------------------------------------------------------------

    Now, it should be noted that the Windows 10 sort weights (the newest definitions) DO go back to the previous behavior of keeping these two characters as the same except for case, even when using a Ukrainian Collation:

    -----------------------------------------------------------
    SORTGUID 00000040-57EE-1E5C-00B4-D0000BB1E11E 10
    LOCALENAME uk-UA ;Ukranian (Cyrillic Script)
    LOCALENAME uk ;Ukrainian - uk

    ...
    0x044c 16 137 2 2 ;Cyrillic Small Soft Sign
    0x042c 16 137 2 18 ;Cyrillic Capital Soft Sign
    -----------------------------------------------------------

    As you can see, the last two characters are both "137" for the Alphabetic Weight, and only differ by the "Case Weight".

    What does this Windows 10 info mean? It means that the only way to get this behavior in SQL Server is for the Collations team to issue a new Collation, based on these new definitions. If it was to be done in SQL Server 2019, it would be a version 150 series of Collations, such as:

    Ukrainian_150_CI_AI
    Ukrainian_150_CI_AI_UTF8
    Ukrainian_150_CI_AS
    Ukrainian_150_CI_AS_UTF8
    Ukrainian_150_CS_AS
    Ukrainian_150_CS_AS_UTF8
    ...and so on through the other _KS, _WS variations...

    Notice that there is no "_SC" option on those. That is because all new Collations inherently support Supplementary Characters (for details, please see: https://sqlquantumleap.com/2017/10/16/all-new-collations-in-sql-server-2017-implicitly-support-supplementary-characters/ ).

Feedback and Knowledge Base