Contains routines used for UTF-16 character and string operations.

lazutf16.pas includes string routines which are based on UTF-16 implementations, although it might also include routines for other encodings.

A UTF-16-based implementation of LowerCase, for example, is faster for WideString and UnicodeString values than the default UTF-8 implementation.

Currently this unit includes only UTF8LowerCaseViaTables which is based on a UTF-16 table, but it might be extended to include various UTF-16 routines.

lazutf16.pas is part of the LazUtils package.

Gets the length of the UTF-16 character in the specified PWideChar value. Uses the endianness of the platform. Returns 0, 1, or 2.

Length of the UTF-16 character in the value, or 0 when Nil. PWideChar value examined in the routine.

Gets the length for the specified value in UTF-16 characters.

Copies a number of UTF-16 characters at the given character position in the specified value.

UnicodeString with the values copied in the routine. UnicodeString with the values examined in the routine. 1-based starting character (code point) position in the Unicode string. Number of characters (code points) copied in the routine.

PWideChar value with the values examined in the routine. Len is the length in words of P. CharIndex is the position of the desired UnicodeChar (starting at 0).

Pos implemented for UTF-16-encoded values.

UTF16Pos is a PtrInt function used to get the character index in SearchInText where the value in SearchForText is located. StartPos allows the search to begin at a specific character (code point).

The return value is the 1-based UTF-16 character index where the SearchForText starts in SearchInText, or 0 when not found.
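The difference between a character index and a word index can be illustrated with a minimal sketch. It is written in Python over lists of 16-bit code units purely for illustration; the names `char_len` and `utf16_pos` are hypothetical and this is not the unit's Pascal implementation. Returning 0 for an empty search value is an assumption of the sketch.

```python
def char_len(units, i):
    """Number of 16-bit words (1 or 2) used by the character starting at units[i]."""
    u = units[i]
    if 0xD800 <= u <= 0xDBFF and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
        return 2  # high surrogate followed by a low surrogate
    return 1


def utf16_pos(search_for, search_in, start_pos=1):
    """Return the 1-based character (code point) index of search_for, or 0."""
    if not search_for:
        return 0  # assumption: an empty search value is never found
    i = 0           # word index into search_in
    char_index = 1  # 1-based character index for word index i
    while i < len(search_in):
        if char_index >= start_pos and search_in[i:i + len(search_for)] == search_for:
            return char_index
        i += char_len(search_in, i)  # step over a whole character
        char_index += 1
    return 0


smile = [0xD83D, 0xDE00]                # U+1F600 as a surrogate pair
text = [0x61] + smile + [0x62, 0x61]    # "a", U+1F600, "b", "a"
print(utf16_pos([0x62], text))          # character index, not word index
```

In this sketch the surrogate pair counts as one character, so "b" is reported at character index 3 even though it sits at the fourth 16-bit word.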

Character index where the SearchForText starts in SearchInText, or 0 when not found. UTF-16-encoded value to locate in SearchInText. UTF-16-encoded value searched in the routine. Optional starting position (in UTF-16 code points, not in words).

Converts ordinal values for UTF-16 code units in p to their Unicode equivalent.

UTF16CharacterToUnicode converts 16-bit values in p to the equivalent Unicode value.

Unpaired surrogates are invalid in any UTF. These include any value in the range $D800..$DBFF not followed by a value in the range $DC00..$DFFF, or any value in the range $DC00..$DFFF not preceded by a value in the range $D800..$DBFF.

UTF16CharacterToUnicode ensures that ordinal values in the surrogate ranges are converted to the correct Unicode value. CharLen is updated to indicate whether the character in p is represented by a single UTF-16 code unit (1) or requires two code units for a surrogate pair (2). It is set to 0 when p contains an invalid UTF-16 sequence.

The return value contains the Cardinal value for the Unicode code point, or 0 when p contains an invalid UTF-16 sequence.
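The decoding rule described above can be sketched as follows. This is an illustrative Python version that works on a list of 16-bit code units; the function name and the `index` parameter are hypothetical, and it follows the behavior stated in this description (0 for both values on invalid input) rather than the unit's actual source.

```python
def utf16_char_to_unicode(units, index=0):
    """Return (code_point, char_len) for the character starting at units[index]."""
    if index >= len(units):
        return 0, 0
    w1 = units[index]
    if w1 < 0xD800 or w1 > 0xDFFF:
        return w1, 1  # ordinary BMP character: one code unit
    if w1 <= 0xDBFF and index + 1 < len(units):
        w2 = units[index + 1]
        if 0xDC00 <= w2 <= 0xDFFF:
            # Combine the 10 payload bits of each surrogate and add the offset.
            return 0x10000 + ((w1 - 0xD800) << 10) + (w2 - 0xDC00), 2
    return 0, 0  # unpaired surrogate


print(utf16_char_to_unicode([0x0041]))
print(utf16_char_to_unicode([0xD83D, 0xDE00]))
```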

Unicode code point for the values in p. UTF-16 code units examined and converted in the routine. Number of UTF-16 code units for the converted character.

Converts a Unicode character value to its UTF-16 equivalent as a WideString value.

Cardinal values below $10000 result in a single WideChar value for the code point. Values of $10000 and above result in two WideChar values (a surrogate pair) that together represent the code point.
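The split into one or two 16-bit values follows the standard UTF-16 encoding arithmetic, sketched here in Python for illustration. The function name is hypothetical, the result is a list of code units rather than a WideString, and no range checking is done for values above $10FFFF or inside the surrogate range (an assumption of the sketch).

```python
def unicode_to_utf16(u):
    """Encode code point u as a list of one or two 16-bit code units."""
    if u < 0x10000:
        return [u]  # fits in a single code unit
    v = u - 0x10000                 # 20-bit value to split across the pair
    high = 0xD800 + (v >> 10)       # upper 10 bits go into the high surrogate
    low = 0xDC00 + (v & 0x3FF)      # lower 10 bits go into the low surrogate
    return [high, low]


print([hex(w) for w in unicode_to_utf16(0x1F600)])
```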

WideString value with the UTF-16 encoding of the Unicode character. Unicode character value converted in the routine.

Based on the specification defined by the Unicode Consortium, at:

http://unicode.org/faq/utf_bom.html#utf16-7

Q: Are there any 16-bit values that are invalid?

A: Unpaired surrogates are invalid in UTFs. These include any value in the range D800 to DBFF not followed by a value in the range DC00 to DFFF, or any value in the range DC00 to DFFF not preceded by a value in the range D800 to DBFF. [AF]

Use ANextChar = #0 to indicate that there is no next char.

Determines if the specified WideString contains valid UTF-16 code points.

Examines the content in AWideStr for valid UTF-16 characters. Calls IsUTF16CharValid for consecutive code point pairs.

True if the specified WideString contains valid UTF-16 code points. WideString examined in the routine.
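A scan that applies the unpaired-surrogate rule to a whole string might look like the following Python sketch. The names are hypothetical, the input is a list of 16-bit code units, and the value 0 plays the role of "no next char" as in the convention described for ANextChar. How the real routine steps through the string is not shown in this description, so the loop below is an assumption.

```python
def is_high_surrogate(u):
    return 0xD800 <= u <= 0xDBFF


def is_low_surrogate(u):
    return 0xDC00 <= u <= 0xDFFF


def is_utf16_string_valid(units):
    """Return True when units contains no unpaired surrogates."""
    i = 0
    while i < len(units):
        u = units[i]
        if is_high_surrogate(u):
            # 0 stands for "there is no next code unit".
            nxt = units[i + 1] if i + 1 < len(units) else 0
            if not is_low_surrogate(nxt):
                return False  # high surrogate without its partner
            i += 2            # skip the complete pair
        elif is_low_surrogate(u):
            return False      # low surrogate not preceded by a high one
        else:
            i += 1
    return True


print(is_utf16_string_valid([0x41, 0xD83D, 0xDE00]))
print(is_utf16_string_valid([0xD83D, 0x41]))
```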

Same as SysUtils.StringReplace, but for WideString and UnicodeString values, since it is not available in FPC yet.

Converts a Unicode character value to its lowercase equivalent.

Uses internal tables to map Unicode character ranges common to both UTF-16 and UTF-32.

Cardinal value for the lowercase equivalent of u. Unicode character value converted to lowercase in the routine.

Converts a UTF-8-encoded string to lowercase Unicode values using internal case tables.

String with the lowercase Unicode values for s. String with UTF-8 values converted to lowercase Unicode in the routine.