Invalid T-SQL REPLACE function result for CH, Ch using czech/slovak non-binary collations
The REPLACE function returns invalid result when using CzechCIAI (and similar + Slovak ones) collations for the
SELECT
REPLACE(N'Chachach' COLLATE Latin1GeneralBIN, N'ጠ', N'#'),
REPLACE(N'Chachach' COLLATE CzechCIAI, N'ጠ', N'#')
The result:
Chachach #hachach
The starting C(h) character (regardless if in upper/lower case) is matched with the character N'ጠ' (the unicode 0x1320).
In Czech CH is considered as single letter even though coded as 4 bytes, sorting is different for C and CH.
This REPLACE function behaviour seems to be related to the specific czech collation(s), not to SQL SERVER version. I have tested this on
Microsoft SQL Server 2017 (RTM-GDR) (KB4293803) - 14.0.2002.14
Microsoft SQL Server 2016 (SP2-GDR) (KB4293802) - 13.0.5081.1
Microsoft SQL Server 2014 (SP2-GDR) (KB4019093) - 12.0.5207.0
The workaround is clear, in fact, that's a minor bug, if ever.
My testing code:
declare
@inputString nvarchar(max),
@replacedResult nvarchar(max),
@charToReplace nvarchar(1),
@unicodeCode int,
@lastInvalidReplacementUnicode int
SET @inputString = N'chChCHch';
SET @unicodeCode = 1;
SET @lastInvalidReplacementUnicode = 1;
WHILE @unicodeCode <= 0xFFFF BEGIN
SET @charToReplace = NCHAR(@unicodeCode);
SET @replacedResult = REPLACE(@inputString COLLATE Slovak_CS_AS, @charToReplace, N'#')
IF (@replacedResult <> @inputString) BEGIN
IF (@lastInvalidReplacementUnicode+1 <> @unicodeCode) BEGIN
PRINT 'Previous was OK'
END
PRINT @charToReplace + '(0x' + FORMAT(@unicodeCode, 'x4') + '): ' + @inputString + '-> ' + @replacedResult;
SET @lastInvalidReplacementUnicode = @unicodeCode
END
SET @unicodeCode = @unicodeCode + 1;
END

1 comment
-
Solomon Rutzky commented
Hi there. I believe this is just an issue with very old versions of collations. You are using the version 80 collations that came with SQL Server 2000. Unless needed for backwards compatibility, anyone using SQL Server 2008 and newer should be using _at least_ the version 100 collations, and for Japanese, the version 140 collations if using SQL Server 2017 or newer.
The query below, adapted from your original query, shows that the issue was fixed in the version 100 collations:
-----------------------------------------------------------------
SELECT
REPLACE(N'Chachach' COLLATE Czech_CS_AS, N'ጠ', N'#') AS [Czech_CS_AS],
REPLACE(N'Chachach' COLLATE Czech_100_CI_AI, N'ጠ', N'#') AS [Czech_100_CI_AI],
REPLACE(N'Chachach' COLLATE Slovak_CS_AS, N'ጠ', N'#') AS [Slovak_CS_AS],
REPLACE(N'Chachach' COLLATE Slovak_100_CI_AI, N'ጠ', N'#') AS [Slovak_100_CI_AI];
/*
Czech_CS_AS | Czech_100_CI_AI | Slovak_CS_AS | Slovak_100_CI_AI
#hachach | Chachach | #hachach | Chachach
*/
-----------------------------------------------------------------No need to show the binary collation as that will always match code points to only themselves. Also, it is best to use the "_BIN2" collations instead of the older and not entirely correct for sorting "_BIN" collations.
Please see the following post of mine as it has info relevant to this issue that might help explain some things, especially sections 2 (Different Versions) and 3 (Different Binary Comparison Types):
"Differences Between the Various Binary Collations (Cultures, Versions, and BIN vs BIN2)" ( https://sqlquantumleap.com/2019/03/13/differences-between-the-various-binary-collations-cultures-versions-and-bin-vs-bin2/ )
This bug report can probably be closed.
Take care,
Solomon...
https://SqlQuantumLift.com/
https://SqlQuantumLeap.com/
https://SQLsharp.com/P.S. I'm only voting for this suggestion to get notifications because it's the only mechanism that this horrible UserHasNoVoice platform provides.