Do you have a comment or suggestion to improve SQL Server? We’d love to hear it!

SQL Server searching with CJK (Ext A)

It is understandable for the case of
select 1 where ' '=''
select 1 where 'abc'='abc ' /* Not match for 'abc'=' abc' */

But the following shouldn't be matched
http://www.unicode.org/charts/PDF/U3300.pdf
/* No problem for U+3300 - U+3357 */
select 1 where NCHAR(0x3358)=''
select 1 where NCHAR(0x33ff)=''

http://www.unicode.org/charts/PDF/U3400.pdf
select 1 where NCHAR(0x3400)=''
select 1 where NCHAR(0x4DFF)=''

This bug affects searching CJK (Ext. A) characters, non-relevant results will be returned.
It needs to be matched by using Latin1_General_BIN in order to get the expected results.
How about if the field is mixed with English and searching by CI_AS is needed.

1 vote
Sign in
(thinking…)
Password icon
Signed in as (Sign out)

We’ll send you updates on this idea

Patrick shared this idea  ·   ·  Flag idea as inappropriate…  ·  Admin →

4 comments

Sign in
(thinking…)
Password icon
Signed in as (Sign out)
Submitting...
  • Solomon Rutzky commented  ·   ·  Flag as inappropriate

    Hi Patrick. I never said that the version 100 Collations fix everything, just that they were much improved over the previous versions. You could try the newest set, the version 140 Collations, that are only for Japanese. The version 100 Collations find 5840 characters matching nothing / empty string, yet Japanese_XJIS_140_CI_AS only found 3375.

  • Patrick commented  ·   ·  Flag as inappropriate

    Thanks for your suggestion by using the new collation (version 100) explicitly (instead of default version 90) as described in
    https://docs.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-2017

    However, I have enumerated all the characters (except combining high/low-surrogate characters) that should be treated as standalone character

    DECLARE @idx INT = 1;
    DECLARE @result INT;
    DECLARE @emptychar TABLE
    (
    codepoint nvarchar(5),
    character nvarchar(1)
    )

    WHILE @idx <= 0xffff
    BEGIN
    SET @result = 0
    SELECT @result = 1 where nchar(@idx)='' COLLATE Latin1_General_100_CI_AS
    IF @result = 1
    Insert into @emptychar(codepoint,character) values (CONVERT(VARCHAR(10), CONVERT(VARBINARY(2), @idx), 2), NCHAR(@idx))
    SET @idx = @idx + 1;
    END;

    select * from @emptychar

    The range (U+A500 - U+A62B), (U+0514 - U+052F) and (U+0D70 - U+0D7F) characters should also be handled though I may not use it.
    http://www.unicode.org/charts/PDF/UA500.pdf
    http://www.unicode.org/charts/PDF/U0500.pdf
    http://www.unicode.org/charts/PDF/U0D00.pdf

  • Solomon Rutzky commented  ·   ·  Flag as inappropriate

    Hi Patrick. There is actually no bug here, at least not anymore. This issue was solved when SQL Server 2008 was released, which introduced the version 100 collations (e.g. Latin1_General_100_CI_AS). Those collations added sort weights to A LOT of characters that previously had no sort weight assigned to them.

    The following examples prove this:

    -----------------------------------------------------------
    SELECT 1 WHERE NCHAR(0x3358)='' COLLATE Korean_Wansung_CI_AS; -- 1
    SELECT 2 WHERE NCHAR(0x33ff)='' COLLATE Korean_Wansung_CI_AS; -- 2
    SELECT 3 WHERE NCHAR(0x3400)='' COLLATE Korean_Wansung_CI_AS; -- 3
    SELECT 4 WHERE NCHAR(0x4DFF)='' COLLATE Korean_Wansung_CI_AS; -- 4

    SELECT 5 WHERE NCHAR(0x3358)='' COLLATE Korean_90_CI_AS; -- 5
    SELECT 6 WHERE NCHAR(0x33ff)='' COLLATE Korean_90_CI_AS; -- 6
    SELECT 7 WHERE NCHAR(0x3400)='' COLLATE Korean_90_CI_AS; -- nothing!!
    SELECT 8 WHERE NCHAR(0x4DFF)='' COLLATE Korean_90_CI_AS; -- 8

    SELECT 9 WHERE NCHAR(0x3358)='' COLLATE Korean_100_CI_AS; -- nothing!!
    SELECT 10 WHERE NCHAR(0x33ff)='' COLLATE Korean_100_CI_AS; -- nothing!!
    SELECT 11 WHERE NCHAR(0x3400)='' COLLATE Korean_100_CI_AS; -- nothing!!
    SELECT 12 WHERE NCHAR(0x4DFF)='' COLLATE Korean_100_CI_AS; -- nothing!!

    SELECT 13 WHERE NCHAR(0x3358)='' COLLATE Latin1_General_100_CI_AS; -- nothing!!
    SELECT 14 WHERE NCHAR(0x33ff)='' COLLATE Latin1_General_100_CI_AS; -- nothing!!
    SELECT 15 WHERE NCHAR(0x3400)='' COLLATE Latin1_General_100_CI_AS; -- nothing!!
    SELECT 16 WHERE NCHAR(0x4DFF)='' COLLATE Latin1_General_100_CI_AS; -- nothing!!
    -----------------------------------------------------------

    Some notes:
    1) If you are concerned with CJK characters, then it is probably best to use a Chinese/Japanese/Korean collation that would have more accurate linguistic rules for those characters.

    2) It is best to use the newest version of whatever collation you are wanting. The newest, as of SQL Server 2017, are the version 140 collations, but those are only the Japanese collations. Most others have a highest version of 100.

    3) It is best to NOT use the SQL Server collations (names starting with "SQL_") as they are mostly hold-overs from pre-SQL Server 2000 which introduced the non-versioned Windows collations. Another disadvantage of the SQL_ collations is how indexes on VARCHAR columns are invalidated when filtered with NVARCHAR data (please see: "Impact on Indexes When Mixing VARCHAR and NVARCHAR Types" -- https://sqlquantumleap.wordpress.com/2017/07/10/impact-on-indexes-when-mixing-varchar-and-nvarchar-types/ ).

    4) Don't use _BIN collations as they became effectively obsolete with the release of SQL Server 2005. Use _BIN2 instead as they provide more accurate sorting.

    This issue can be closed.

  • Patrick commented  ·   ·  Flag as inappropriate

    More supplementary example is provided.

    DECLARE @testchar TABLE
    (
    idx int,
    val nvarchar(10) COLLATE SQL_Latin1_General_CP1_CI_AS
    )

    insert into @testchar (idx,val) values (1,N'a㘭b')
    insert into @testchar (idx,val) values (2,N'ab')
    insert into @testchar (idx,val) values (3,N'C')
    insert into @testchar (idx,val) values (4,N'c')

    select * from @testchar where val like N'%㘭%' -- Result: 1,2,3,4; Correct: 1
    select * from @testchar where val like N'%a㘭b%' -- Result: 1,2; Correct 1
    select * from @testchar where val like N'%c%' -- Result: 3,4; Correct 3,4

Feedback and Knowledge Base