When using UTF-8, varchar(10) should mean 10 characters not 10 bytes
When using UTF-8 in SQL Server 2019 and later, specifying a column as varchar(10) means that it can store up to 10 bytes internally within the database, not up to 10 characters. Because UTF-8 characters can take up to 4 bytes each, that column might hold as few as 2 characters.
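A quick way to see this (a minimal sketch; the table name is just for the demo, assuming SQL Server 2019+ with a UTF-8 collation available):

```sql
CREATE TABLE dbo.Utf8Demo
(
    Val varchar(10) COLLATE Latin1_General_100_CI_AS_SC_UTF8
);

-- 10 ASCII characters = 10 bytes: fits.
INSERT INTO dbo.Utf8Demo (Val) VALUES ('abcdefghij');

-- 10 accented characters = 20 bytes in UTF-8: fails with
-- "String or binary data would be truncated" (under default ANSI_WARNINGS).
INSERT INTO dbo.Utf8Demo (Val) VALUES (N'éééééééééé');

-- 2 emoji (supplementary characters) = 8 bytes: fits, even though it is only 2 characters.
INSERT INTO dbo.Utf8Demo (Val) VALUES (N'😀😀');
```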
Developers need to be able to specify a maximum number of characters per column. There is basically no use in limiting the number of storage bytes within the database. What possible functionality could that provide?
And more importantly, how are we expecting developers to limit the number of characters in a column? Adding check constraints to every string column?
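A sketch of what that workaround might look like today (the table, constraint name, and 10-character limit are illustrative): size the column for the 4-bytes-per-character worst case and enforce the character count with a CHECK constraint.

```sql
CREATE TABLE dbo.Customer
(
    Nickname varchar(40) COLLATE Latin1_General_100_CI_AS_SC_UTF8
        -- LEN counts characters (not bytes), so this limits the column to 10 characters
        CONSTRAINT CK_Customer_Nickname_MaxChars CHECK (LEN(Nickname) <= 10)
);
```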
Greg Low commented
I understand how it works; that's not the issue. The question is "what would be useful to developers?", not "what would be easiest to implement in the SQL Server engine?". Defining a column as varchar(10) and knowing that it can store anywhere from 2 to 10 characters isn't particularly useful to anyone developing code to work with SQL Server. The focus here is entirely the wrong way around: it is on storage, not on usefulness. At least pgSQL and others have it right, even though they are UTF-8 based.
Solomon Rutzky commented
Hello Greg. Regarding your statement of:
> When using UTF-8 in SQL Server 2019 and later, specifying a column as varchar(10) means that it can store up to 10 bytes internally within the database, not up to 10 characters.
That is not exactly correct. The `n` in `VARCHAR(n)` has always referred to "bytes", not "characters". The confusion is most likely due to one "character" always being one "byte" prior to SQL Server 2000. Starting in SQL Server 2000, Windows collations were added, some of which use Double-Byte Character Set (DBCS) code pages that store some characters as 1 byte and many others as 2 bytes. Many of us (especially in the "West") don't encounter these code pages / character sets very often, but the DBCS code pages include Chinese, Japanese, and Korean, so their potential usage is far from insignificant.
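A rough illustration of that pre-UTF-8 behavior, using a DBCS collation (code page 936); the names here are just for the demo:

```sql
CREATE TABLE dbo.DbcsDemo
(
    Val varchar(10) COLLATE Chinese_PRC_CI_AS
);

-- These 5 Chinese characters take 2 bytes each in code page 936, so they
-- exactly fill varchar(10) even though it is only 5 characters:
INSERT INTO dbo.DbcsDemo (Val) VALUES (N'中文字符串');

SELECT Val, DATALENGTH(Val) AS Bytes, LEN(Val) AS Chars
FROM dbo.DbcsDemo;   -- Bytes = 10, Chars = 5
```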
Microsoft has an incredible amount of documentation, so sometimes certain technical details such as "characters" vs "bytes" don't get updated.
For a complete description of "characters" vs "bytes" across the various string datatypes, please see:
"How Many Bytes Per Character in SQL Server: a Completely Complete Guide" ( https://sqlquantumleap.com/2019/11/22/how-many-bytes-per-character-in-sql-server-a-completely-complete-guide/ )
And, for a thorough analysis of the new UTF-8 collations, please see:
"Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?" ( https://sqlquantumleap.com/2018/09/28/native-utf-8-support-in-sql-server-2019-savior-false-prophet-or-both/ )
P.S. I am voting for this because this horrible UserVoice system provides no other mechanism to subscribe to comments. However, I am not sure that I support the idea, given the possible negative consequences for storage and query performance.

When defining a table, SQL Server needs to know the maximum number of bytes that could be stored in each column. That value is stored internally in the metadata for each column, and it is used when determining the max row size and when calculating memory grants for queries. Since UTF-8 can take up to 4 bytes per character (for all Supplementary Characters, same as with UTF-16 / NVARCHAR), should specifying 10 "characters" equate to 40 bytes? So then the max would be `VARCHAR(2000 characters)` to stay within the 8000-byte limit? Currently you can create a table with up to 8 `VARCHAR(1000)` columns, but you would then only be able to use at most 4 `VARCHAR(2000 characters)` columns, right? I guess that's not so bad.

But what about memory grants? I believe SQL Server estimates a 50% fill on variable-length datatypes, so this would impact performance on some systems. This part is a bit tricky, which is why it's problematic to use `VARCHAR(MAX)` and/or `NVARCHAR(MAX)` for all string columns, even if the data is always small enough to fit on the data page.

I'm not saying that this can't work; it's just not as simple as it appears to be at first glance. It gets even more complicated when dealing with languages that use multiple code points to form a single displayable character via combining characters. Either way, Oracle, which supports `(n BYTE)` and `(n CHAR)` for string type sizes, discusses some of this conflict here:
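On the point above about column metadata recording a byte count, a small hedged sketch (the table name is illustrative; assumes a UTF-8 collation is available):

```sql
CREATE TABLE dbo.MetaDemo
(
    Val varchar(10) COLLATE Latin1_General_100_CI_AS_SC_UTF8
);

-- max_length reports the declared size in bytes (10 here), regardless of encoding.
SELECT name, max_length
FROM sys.columns
WHERE object_id = OBJECT_ID(N'dbo.MetaDemo');
```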
Ralf Dietrich commented
For some more details about UTF-8: https://www.mssqltips.com/sqlservertip/6168/impact-of-utf8-support-in-sql-server-2019/