Do you have a comment or suggestion to improve SQL Server? We’d love to hear it!

CHARINDEX function wrong with COLLATE Vietnamese_CI_AS

declare @child varchar(50)='h',@mother varchar(50)='thanh'
SELECT CHARINDEX(@child COLLATE Vietnamese_CI_AS, @mother COLLATE Vietnamese_CI_AS)

Wrong result: 0 (Correct: 2)

1 vote
Sign in
(thinking…)
Sign in with: oidc
Signed in as (Sign out)

We’ll send you updates on this idea

AdminMicrosoft SQL Server (Product Manager, Microsoft Azure) shared this idea  ·   ·  Flag idea as inappropriate…  ·  Admin →

2 comments

Sign in
(thinking…)
Sign in with: oidc
Signed in as (Sign out)
Submitting...
  • Solomon Rutzky commented  ·   ·  Flag as inappropriate

    Actually, the CHARINDEX function, when used with a Vietnamese collation, is returning the correct value. The example code given in the ticket description even shows the correct behavior happening. The problem is merely misunderstanding the linguistic rules of the Vietnamese collation.

    Even though the variables are VARCHAR, the comparison rules are still the Unicode rules due to using a Windows collation (and one of the reasons why everyone should be using Windows collations and not collations with names starting with "SQL_"). And, one of the awesome things that Unicode can do is recognize certain multi-character sequences as being a single unit that behaves differently than the individual characters that make it up. The example string of "thanh" has two instances of this: "th" and "nh". To get an individual "h", it either needs to be by itself, or you need to use a binary collation. Or, you can search for one of those combinations. The following demonstrates all of this:

    --------------------------------------------------------------------------
    SELECT CHARINDEX('h' COLLATE Vietnamese_100_CI_AS, 'thanh');
    -- 0

    SELECT CHARINDEX('th' COLLATE Vietnamese_100_CI_AS, 'thanh') AS [th],
    CHARINDEX('nh' COLLATE Vietnamese_100_CI_AS, 'thanh') AS [nh];
    -- 1 4

    -- "h" by itself:
    SELECT CHARINDEX('h' COLLATE Vietnamese_100_CI_AS, 'greyhound');
    -- 5

    -- no linguistic rules applied to binary collations:
    SELECT CHARINDEX('h' COLLATE Vietnamese_100_BIN2, 'thanh');
    -- 2
    --------------------------------------------------------------------------

    To be clear, all of this is expected behavior for Vietnamese collations. While the Unicode definition files are not the direct source of sorting data in Windows, it's still pretty close. So look at the following file, under the "traditional" section, and you can see the various characters and character combinations:

    https://github.com/unicode-org/cldr/blob/master/common/collation/vi.xml

    In there you can see a combination for "QU". If we try that out, we see the same behavior as we did with "th" and "nh":

    --------------------------------------------------------------------------
    SELECT CHARINDEX('q' COLLATE Vietnamese_100_CI_AS, 'racquet') AS [q],
    CHARINDEX('qu' COLLATE Vietnamese_100_CI_AS, 'racquet') AS [qu];
    -- 0 4
    --------------------------------------------------------------------------

    For more info on working with collations, Unicode, encodings, etc, please visit: https://Collations.Info/

Feedback and Knowledge Base