nvarchar allocates only 2 bytes per character; LEN() reports string lengths incorrectly
nvarchar(length) is supposed to store Unicode strings as UTF-8. UTF-8 characters can be up to 4 bytes long. Attempting to store 4-byte characters in an nvarchar column leaves only half the expected string length available, so strings can be truncated.
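The truncation can be reproduced with a short T-SQL sketch (a minimal illustration, assuming a SQL Server session; the emoji U+1F600 stands in for any 4-byte character):

```sql
-- U+1F600 is a supplementary character: in UTF-16 it is a surrogate pair,
-- i.e. two 2-byte code units, so it needs 4 bytes of storage.
DECLARE @one NVARCHAR(1) = N'😀';  -- only one 2-byte code unit fits

SELECT @one           AS stored_value,   -- a lone surrogate half, not the emoji
       DATALENGTH(@one) AS bytes_stored; -- 2 bytes: half of the 4-byte pair
```

Declaring the column as NVARCHAR(2) (or doubling the declared length generally) leaves room for the full surrogate pair.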
Similarly, the LEN() function returns twice the actual character count for such strings.
LEN returns the number of characters of the specified string expression, excluding trailing blanks. To get the size in encoded bytes, use the DATALENGTH function.
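The distinction between the two functions can be seen side by side (a hedged sketch; actual results assume a plain ASCII string in an NVARCHAR variable):

```sql
DECLARE @s NVARCHAR(10) = N'hello';

SELECT LEN(@s)        AS char_count,  -- 5: characters, trailing blanks excluded
       DATALENGTH(@s) AS byte_count;  -- 10: 2 bytes per UTF-16 code unit
```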
Solomon Rutzky commented:
This issue is just a misunderstanding of how SQL Server handles Unicode. Until _very_ recently, when SQL Server 2019 (currently only at CTP 2) added support for UTF-8, SQL Server only ever used UTF-16 LE, which is the encoding used by NCHAR / NVARCHAR / NTEXT and string data within the XML type. UTF-16 is a fixed 2-byte-per-code-unit encoding. Supplementary Characters are comprised of two of these 2-byte units, known as a Surrogate Pair, and as such require 4 bytes. Still, NVARCHAR(X) will allocate X * 2 bytes. If you need Supplementary Characters, you need to double X to reserve the required number of bytes. This makes sense, though, since a Supplementary Character _is_ two surrogate Code Points.

LEN will show the actual number of UCS-2 Code Points when not using a Supplementary Character-Aware Collation (meaning, it will only see the two individual surrogate Code Points, so the Supplementary Character will register a LEN of 2). When using a Supplementary Character-Aware Collation, LEN will report the number of UTF-16 Code Points (so a Supplementary Character will register a LEN of 1). DATALENGTH always shows the exact number of bytes being used.
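The collation-dependent behavior of LEN described above can be sketched as follows (assuming the database default is a non-Supplementary-Character-Aware collation; `Latin1_General_100_CI_AS_SC` is one example of an `_SC` collation):

```sql
-- One supplementary character (U+1F600) = one surrogate pair = 4 bytes.
DECLARE @emoji NVARCHAR(2) = N'😀';

SELECT LEN(@emoji)  AS len_default,  -- 2 under a non-SC collation:
                                     -- the two surrogate code points
       LEN(@emoji COLLATE Latin1_General_100_CI_AS_SC)
                    AS len_sc,       -- 1: counted as a single character
       DATALENGTH(@emoji)
                    AS byte_count;   -- 4: always the exact bytes stored
```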
This item is a non-issue and should be closed.