Routines for managing UTF-8-encoded strings.

lazutf8.pas contains useful routines for managing UTF-8-encoded strings. All routines are thread-safe unless explicitly stated.

lazutf8.pas is part of the LazUtils package.

Indicates if the OS requires use of AnsiToUTF8 and UTF8ToAnsi for the RTL.

NeedRTLAnsi is a Boolean function that indicates if the OS requires use of AnsiToUTF8 and UTF8ToAnsi for the RTL. AnsiToUTF8 and UTF8ToAnsi need a widestring manager under Linux, BSD, and macOS. Normally these operating systems use UTF-8 as the system encoding, so the widestring manager is not needed.

For the Windows environment, NeedRTLAnsi is True if the default system code page is not CP_UTF8. For UNIX-like environments, NeedRTLAnsi is True when any of the LC_ALL, LC_MESSAGES, or LANG environment variables specifies an encoding other than UTF-8.

DefaultSystemCodePage
True when the system encoding is not UTF-8. Sets the value for the unit global variable. New value for the variable. Ensures UTF-8 characters (or format settings) are converted to the system code page.

UTF8ToSys is an overloaded function used to convert the specified string value (or format settings) to the system codepage for the platform. UTF8ToSys works like UTF8ToAnsi, but is more independent of WideStringManager. For platforms where UTF8_RTL is not defined, and NeedRTLAnsi returns True, UTF8ToAnsi is called to convert non-ASCII values in s. For platforms where UTF8_RTL is defined, the value in s is used without modification.

An overloaded variant of the function handles TFormatSettings for the platform. The return value for the function is the specified values in AFormatSettings after being updated to reflect the system codepage for the platform. For platforms where UTF8_RTL is not defined, the values in the following format settings are updated: CurrencyString, LongMonthNames, ShortMonthNames, LongDayNames, and ShortDayNames.

No actions are needed for platforms where UTF8_RTL is defined.

Utf8ToAnsi TFormatSettings
Value for the string after conversion. Value to examine in the function. Format settings to examine in the function. Converts strings (and format settings) from the system codepage to UTF-8.

SysToUTF8 is an overloaded function used to convert strings (and format settings) from the system codepage to UTF-8. SysToUTF8 works like AnsiToUTF8, but has no reliance on the widestring manager on platforms where UTF8_RTL is defined. For platforms where UTF8_RTL is not defined, and NeedRTLAnsi returns True, non-ASCII values are converted to UTF-8 by calling AnsiToUTF8.

An overloaded variant of the function handles TFormatSettings for the platform. The return value for the function is the values specified in AFormatSettings after conversion from the system codepage to UTF-8. The values in the following format settings are updated: CurrencyString, LongMonthNames, ShortMonthNames, LongDayNames, and ShortDayNames.

AnsiToUTF8 TFormatSettings
Values after conversion to UTF-8. Values to examine in the function. Format settings to examine in the function. Converts an OEM-encoded string to UTF8.

ConsoleToUTF8 is a String function used to convert an OEM-encoded string to UTF-8. The implementation of ConsoleToUTF8 is OS-specific, and essentially handles differences between various Windows platforms where use of OemToChar and WinCPToUTF8 is required. For UNIX-like environments, the value in s is converted by calling SysToUTF8.

ConsoleToUTF8 is used in the implementation of the GetEnvironmentStringUTF8 and GetEnvironmentVariableUTF8 functions.

UTF-8-encoded value for the specified string. Value to convert in the function. Converts a UTF-8-encoded string to console (OEM) encoding.

UTF8ToConsole converts a UTF-8-encoded string to console (OEM) encoding as used in Write and WriteLn. The implementation is platform specific.

For the Windows environment, either UTF8ToSys or UTF8ToWinCP is used to convert the value to the codepage or character set needed in RTL. The Windows CharToOem API is used to prepare the return value. In UNIX-like environments, UTF8ToSys is used to get the return value.

OEM-encoded value for the string. UTF-8-encoded input values. Converts the string from Windows code page to UTF-8.

Converts the string from Windows code page to UTF-8. Used with some Windows-specific functions. For all Windows versions supporting 8-bit codepages (but not WinCE).

UTF-8-encoded values for the string. Input values in Windows codepage encoding. Converts the UTF-8-encoded string to the Windows code page encoding.

Converts the UTF-8-encoded string to the Windows code page encoding. Used by Write and WriteLn.

UTF-8-encoded input values. Values in the Windows codepage encoding. Gets the specified command line parameter and converts it to a UTF-8-encoded string.

ParamStrUTF8 is a String function used to convert the specified command line parameter to a UTF-8-encoded string. The implementation for ParamStrUTF8 is OS- or platform-specific. For UNIX-like environments, SysToUTF8 is called to convert the value for the command line parameter at the position in Param. For Windows platforms, the stub for the Ansi or WideString version of ParamStrUTF8 is called. ParamStrUTF8 is the UTF-8-enabled counterpart to the ParamStr routine in RTL.

Index is the ordinal position for the requested parameter value. Index should be in the range 0..ParamCount. Values in Index outside this range cause an empty string ('') to be returned for a parameter value.

In most cases, the parameter at position 0 contains the name and optional path to the executable file for the application. For cross-platform compatibility, use the ExeName property in TCustomApplication to get the path and name for the binary instead. Subsequent index positions contain any command line arguments passed to the executable.

The return value contains the UTF-8-encoded string with the value for the parameter at the specified position, or an empty string when not present.

TCustomApplication.ExeName ParamStr
UTF-8-encoded value for the specified command line parameter. Ordinal position for the command line parameter retrieved in the routine. Gets the TFormatSettings for the platform.

GetFormatSettingsUTF8 is a procedure used to get the TFormatSettings for the Locale or Language Code for the platform. GetFormatSettingsUTF8 is defined for Windows environments only, and calls GetLocaleFormatSettingsUTF8 using the ThreadLocale or Language Code ID needed for the platform.

Gets format settings for a specific Language Code ID.

GetLocaleFormatSettingsUTF8 is a procedure used to get the TFormatSettings for the Locale or Language Code for the platform. GetLocaleFormatSettingsUTF8 is defined for Windows environments only.

GetLocaleFormatSettingsUTF8 ensures that values in the format settings use the Language Code ID for the platform. The following format settings are converted to their Locale-specific values:

  • ShortMonthNames
  • LongMonthNames
  • ShortDayNames
  • LongDayNames
  • DateSeparator
  • ShortDateFormat
  • LongDateFormat
  • TimeSeparator
  • TimeAMString
  • TimePMString
  • ShortTimeFormat
  • LongTimeFormat
  • CurrencyString
  • CurrencyFormat
  • NegCurrFormat
  • ThousandSeparator
  • DecimalSeparator
  • CurrencyDecimals
  • ListSeparator

In LCL version 3.0 or higher, LongTimeFormat and ShortTimeFormat can contain AM / PM format specifiers, for example 'hh:nn:ss AMPM'.

Modified in LCL version 3.0 to return 12-hour time formats using AM / PM in format settings. TFormatSettings
Language Code ID. The locale-specific format settings for the platform. Returns the number of system environment variables.

Returns the number of UTF-8-encoded system environment variables. Used together with GetEnvironmentStringUTF8.

Number of variables in the system environment. Returns a system environment string.

Returns a UTF-8-encoded system environment string stored at the specified position. The value in Index is in the range 1..GetEnvironmentVariableCountUTF8. For Unix and Windows the string normally is in the form 'name=value'. Beware that Windows knows some special formats, e.g. '=C:=SomePath'. Nota bene: Raymond Chen called these "bookkeeping variables" which emulate the MS-DOS tracking mechanism for the current directory on different drives.

Use GetEnvironmentVariableUTF8 to look up values for environment variables by name.
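The 'name=value' form and the Windows '=C:=SomePath' special case can be illustrated with a small sketch. SplitEnvString is a hypothetical helper, not part of LazUTF8, and treating everything before the first '=' found after position 1 as the name is an assumption made only for this illustration.

```pascal
program SplitEnvEntry;
{$mode objfpc}{$H+}

// Hypothetical helper (not in LazUTF8): splits an environment string at the
// first '=' found after position 1, so a leading '=' stays in the name.
procedure SplitEnvString(const Entry: string; out AName, AValue: string);
var
  i, P: Integer;
begin
  P := 0;
  for i := 2 to Length(Entry) do
    if Entry[i] = '=' then
    begin
      P := i;
      Break;
    end;
  if P = 0 then
  begin
    // No separator found: the whole entry is treated as the name
    AName := Entry;
    AValue := '';
  end
  else
  begin
    AName := Copy(Entry, 1, P - 1);
    AValue := Copy(Entry, P + 1, Length(Entry));
  end;
end;

var
  N, V: string;
begin
  SplitEnvString('PATH=/usr/bin', N, V);
  WriteLn(N, ' -> ', V);   // PATH -> /usr/bin
  SplitEnvString('=C:=C:\Work', N, V);
  WriteLn(N, ' -> ', V);   // =C: -> C:\Work
end.
```

In real code the entries would come from GetEnvironmentStringUTF8 with an index in the range 1..GetEnvironmentVariableCountUTF8.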

Value for the environment variable at the specified position. Position for the environment variable. Returns the value of a system environment variable.

Returns the value of an environment variable stored in the form 'EnvVar=value'. See GetEnvironmentStringUTF8 to retrieve the whole list of environment strings.

Value for the specified environment variable name. Environment variable with the value retrieved in the routine. Gets the UTF-8-encoded system error message for the specified error code.

SysErrorMessageUTF8 is used to get the UTF-8-encoded system error message for the specified error code. SysErrorMessageUTF8 calls the SysUtils.SysErrorMessage function and converts the error message using SysToUTF8.

UTF-8-encoded value for the system error message. Numeric system error code for the message. Returns the size of the UTF-8 codepoint in bytes.

Returns the size of the UTF-8 codepoint in bytes. The return value is for a single codepoint.

Number of bytes for the codepoint. UTF-8-encoded value to examine in the function. Fast version of UTF8CodepointSize.

Fast version of UTF8CodepointSize. Assumes the UTF-8 codepoint is valid. The return value is for a single codepoint.

Number of bytes for the codepoint. Encoded values to examine in the function. Returns the number of bytes needed for the UTF-8 codepoint starting at p. Deprecated. Use UTF8CodepointSize instead.

It returns 0 if p is nil. It returns 1 if p is a 1-byte UTF-8 codepoint or p is an invalid UTF-8 sequence. Otherwise it returns a number 2..4. It does not check for malicious codepoints like #$c0#$80, nor for undefined codepoints like #$f3#$a0#$87#$b9. Use UTF8CharacterLength to step through a string with a simple loop:

    while p^ <> #0 do
    begin
      inc(p, UTF8CharacterLength(p));
    end;

Even if p contains invalid UTF-8 codepoints it will run through the string without overflow.
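Why such a loop cannot run past the terminator is easier to see in a simplified sketch. SafeCodepointLength and CountSteps are hypothetical helpers written for this illustration, not the LazUTF8 implementation. The sketch assumes that a multi-byte length is accepted only when every following byte is a continuation byte, and since #0 is never a continuation byte, the step cannot jump over the end of the buffer.

```pascal
program StepThroughUtf8;
{$mode objfpc}{$H+}

// Hypothetical helper: length guessed from the lead byte, accepted only if
// all following bytes are continuation bytes (10xxxxxx); otherwise 1.
function SafeCodepointLength(p: PChar): Integer;
var
  Expected, i: Integer;
begin
  if p = nil then
    Exit(0);
  case Ord(p^) of
    $C0..$DF: Expected := 2;
    $E0..$EF: Expected := 3;
    $F0..$F7: Expected := 4;
  else
    Expected := 1; // ASCII, stray continuation byte, or invalid lead byte
  end;
  for i := 1 to Expected - 1 do
    if (Ord(p[i]) and $C0) <> $80 then
      Exit(1); // includes the #0 terminator, so it is never skipped
  Result := Expected;
end;

// Walks a null-terminated buffer and counts the steps taken.
function CountSteps(p: PChar): Integer;
begin
  Result := 0;
  if p = nil then
    Exit;
  while p^ <> #0 do
  begin
    Inc(p, SafeCodepointLength(p));
    Inc(Result);
  end;
end;

var
  S: string;
begin
  S := 'A' + #$C3#$A9 + #$E2#$82#$AC; // 'A', U+00E9, U+20AC
  WriteLn(CountSteps(PChar(S)));      // 3
  S := 'A' + #$E2;                    // truncated 3-byte sequence
  WriteLn(CountSteps(PChar(S)));      // 2
end.
```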

UTF8CharacterStrictLength
Number of bytes required for the UTF-8 codepoint, or 0 (zero). Pointer to the value examined in the routine. Gets the length of a UTF-8-encoded string in codepoints.

UTF8Length is a function used to get the character length for the specified UTF-8-encoded string. The return value contains the number of UTF-8-encoded characters (or codepoints) found in the byte values for the string.

An overloaded variant of the function is provided which uses the PChar type to specify the byte values in the string. Internally, the String variant casts its value to a PChar type and calls the overloaded variant.

UTF8Length iterates over the bytes in the UTF-8-encoded string data, and calls UTF8CodepointSize to determine the number of bytes needed for each codepoint. Use UTF8LengthFast for a version of the routine optimized for speed.

Number of codepoints in the byte values for the string. UTF-8-encoded string to examine in the function. Pointer to the UTF-8-encoded string to examine in the function. Number of byte values in the UTF-8-encoded string. Fast version of UTF8Length.

UTF8LengthFast is an overloaded PtrInt function used to get the length of a UTF-8-encoded string in codepoints. UTF8LengthFast is the fast version of UTF8Length. It does not call the UTF8CodepointSize function. The UTF-8-encoded string data is assumed to be valid. The native data size for the CPU is used to process blocks of UTF-8-encoded data. For a 64-bit CPU, this means that 8 bytes are read and processed at once.

The overloaded variants allow the UTF-8-encoded data to be specified as either a String type, or a null-terminated PChar type. Internally, the String-based variant casts its data to a PChar type and calls the overloaded variant.

UTF8LengthFast is a Free Pascal implementation of the C routine provided by Colin Percival:

Even faster UTF-8 character counting
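When the input is known to be valid UTF-8, the number of codepoints equals the number of bytes that are not continuation bytes. The sketch below shows only that counting rule, one byte at a time. It is not the block-wise routine described above, and CountCodepointsAssumingValid is a hypothetical name.

```pascal
program CountByLeadBytes;
{$mode objfpc}{$H+}

// Hypothetical helper: counts codepoints in valid UTF-8 by counting every
// byte that is not a continuation byte (bit pattern 10xxxxxx).
function CountCodepointsAssumingValid(const S: string): SizeInt;
var
  i: SizeInt;
begin
  Result := 0;
  for i := 1 to Length(S) do
    if (Ord(S[i]) and $C0) <> $80 then
      Inc(Result);
end;

var
  S: string;
begin
  // 1-byte, 2-byte, 3-byte and 4-byte sequences
  S := 'A' + #$C3#$A9 + #$E2#$82#$AC + #$F0#$9F#$98#$80;
  WriteLn(Length(S));                       // 10 (bytes)
  WriteLn(CountCodepointsAssumingValid(S)); // 4 (codepoints)
end.
```

A block-wise version applies the same test to several bytes at once, which is where the speed gain comes from.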

Number of codepoints in the string. String with UTF-8-encoded values. Pointer to the String with UTF-8-encoded values. Number of byte values in the UTF-8-encoded string. Gets the number of valid UTF-8 codepoints in the specified value.

UTF8CodepointCount is an overloaded SizeInt function used to determine the number of UTF-8 codepoints found in the specified value. It is similar to the UTF8Length routine, but excludes any invalid codepoints found in the input value from the count in the return value. The overloaded variants allow the input value to be specified using either the String or the PChar type.

UTF8CodepointCount iterates over the byte values in the s or p arguments, and increments the return value when a valid UTF-8 codepoint is found. UTF8CodepointLen (in system.pp) is called to get the size for each of the UTF-8 codepoints. Valid codepoints include those represented using combining characters. The process is repeated until all of the bytes in the input value have been examined, or until a codepoint with a length of zero (0) is encountered.

The return value is zero (0) if the s or p arguments are empty, or when the ByteCount argument is zero (0).

For example:

    // var
    //   Utf8Str, InvalidUtf8Str: String;
    //   Cnt, Len: Integer;

    {A macron (decomposed)}
    Utf8Str := 'A' + #$CC#$84;
    {invalid single byte UTF-8}
    InvalidUtf8Str := #$C0#$C1#$F5#$F6#$F7#$F8#$F9#$FA#$FB#$FC#$FD#$FE#$FF;

    Cnt := UTF8CodePointCount(Utf8Str);                  // Cnt = 2
    Len := UTF8Length(Utf8Str);                          // Len = 2

    Cnt := UTF8CodePointCount(InvalidUtf8Str);           // Cnt = 0
    Len := UTF8Length(InvalidUtf8Str);                   // Len = 13

    Cnt := UTF8CodePointCount(InvalidUtf8Str + Utf8Str); // Cnt = 2
    Len := UTF8Length(InvalidUtf8Str + Utf8Str);         // Len = 15
Added in LazUtils version 4.0. UTF8CodepointLen
Integer value with the number of valid codepoints including combining characters. String with the codepoints examined in the routine. PChar type with the codepoints examined in the routine. Number of bytes in the PChar value. Converts a UTF-8-encoded character to its unique Unicode U+XXXX character value.

UTF8CodepointToUnicode is a Cardinal function used to convert a UTF-8-encoded character to its representation as a unique Unicode U+XXXX hexadecimal character value. For example: The letter 'A' (Decimal 65) is expressed in Unicode as U+0041.

CodepointLen is an output variable used to store the number of UTF-8-encoded bytes needed for the codepoint. It will normally contain a value in the range 1..4 (the number of possible bytes used in the UTF-8 encoding scheme). It can contain 0 (zero) when p is an empty PChar value.

The return value for the function contains the hexadecimal Unicode character value as a Cardinal data type. It can contain 0 (zero) when the value in p is not a valid UTF-8-encoded character.

Use UTF8FixBroken to fix invalid UTF-8 encoding in the string.

Use UnicodeToUTF8 to convert a Unicode character value to its UTF-8-encoded value.

UTF8CodepointToUnicode does not check whether the codepoint is actually defined in Unicode tables.
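The mapping from UTF-8 bytes to a U+XXXX value can be sketched as follows. DecodeUtf8 is a hypothetical helper and not the LazUTF8 routine: it checks only the lead byte and the continuation bytes, and reporting a length of 1 for invalid input is an assumption made for this sketch.

```pascal
program DecodeOneCodepoint;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: decodes one UTF-8 sequence and reports its size.
// Invalid input yields 0 with Len = 1; a nil pointer yields 0 with Len = 0.
function DecodeUtf8(p: PChar; out Len: Integer): Cardinal;
var
  i: Integer;
begin
  Result := 0;
  Len := 0;
  if p = nil then
    Exit;
  Len := 1;
  case Ord(p^) of
    $00..$7F: Exit(Ord(p^));
    $C0..$DF: begin Len := 2; Result := Ord(p^) and $1F; end;
    $E0..$EF: begin Len := 3; Result := Ord(p^) and $0F; end;
    $F0..$F7: begin Len := 4; Result := Ord(p^) and $07; end;
  else
    Exit(0); // stray continuation byte or invalid lead byte
  end;
  for i := 1 to Len - 1 do
  begin
    if (Ord(p[i]) and $C0) <> $80 then
    begin
      Len := 1;
      Exit(0);
    end;
    // Each continuation byte contributes its low 6 bits
    Result := (Result shl 6) or Cardinal(Ord(p[i]) and $3F);
  end;
end;

var
  S: string;
  Len: Integer;
  U: Cardinal;
begin
  S := #$E2#$82#$AC; // U+20AC EURO SIGN
  U := DecodeUtf8(PChar(S), Len);
  WriteLn('U+', IntToHex(Int64(U), 4), ' in ', Len, ' bytes'); // U+20AC in 3 bytes
end.
```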
Unicode character value for the UTF-8 character. The UTF-8-encoded string value. Number of bytes needed for the codepoint. Returns the codepoint at p and the number of bytes to skip. Deprecated. Use UTF8CodepointToUnicode instead.

If p=nil then CharLen and result are 0 otherwise CharLen>0. If there is an encoding error the Result is 0 and CharLen=1 to skip forward. It is safe to do:

    var
      s: string;
      p, CharLen: Integer;
    begin
      p := 1;
      while p <= Length(s) do
      begin
        UTF8CharacterToUnicode(@s[p], CharLen);
        Inc(p, CharLen);
      end;
    end;

For speed reasons, this function only checks for 1, 2, 3, or 4 byte encoding errors. It does not check whether the codepoint is defined in the Unicode table.

Encodes the given code point as an UTF-8 sequence of 1 to 4 bytes.

UnicodeToUTF8 is an Integer function used to convert the Unicode character value in CodePoint to the sequence of bytes needed for the UTF-8 encoding. UnicodeToUTF8 stores the UTF-8-encoded byte values for the Unicode character in the Buf parameter.

The return value contains the number of bytes required for the UTF-8-encoded value (in the range 1..4). If it contains 0 (zero), the Unicode codepoint was invalid and an Exception is raised.

UnicodeToUTF8 does not process #0 byte values for the codepoint, as done for UTF-32.

Raises an Exception when CodePoint is an invalid Unicode character value. Raised with the message 'UnicodeToUTF8: invalid Unicode: XXXXXXXX'.
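The 1 to 4 byte layout produced when encoding a code point can be sketched as follows. EncodeUtf8 is a hypothetical helper, not the LazUTF8 routine. As an assumption for this sketch it returns an empty string for values above U+10FFFF instead of raising an exception, and it returns a String rather than writing to a buffer.

```pascal
program EncodeOneCodepoint;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: returns the UTF-8 bytes for CodePoint, or '' when
// the value is above U+10FFFF. Surrogate values are not rejected here.
function EncodeUtf8(CodePoint: Cardinal): string;
begin
  case CodePoint of
    0..$7F:
      Result := Chr(CodePoint);
    $80..$7FF:
      Result := Chr($C0 or (CodePoint shr 6)) +
                Chr($80 or (CodePoint and $3F));
    $800..$FFFF:
      Result := Chr($E0 or (CodePoint shr 12)) +
                Chr($80 or ((CodePoint shr 6) and $3F)) +
                Chr($80 or (CodePoint and $3F));
    $10000..$10FFFF:
      Result := Chr($F0 or (CodePoint shr 18)) +
                Chr($80 or ((CodePoint shr 12) and $3F)) +
                Chr($80 or ((CodePoint shr 6) and $3F)) +
                Chr($80 or (CodePoint and $3F));
  else
    Result := '';
  end;
end;

var
  S: string;
  i: Integer;
begin
  S := EncodeUtf8($20AC); // EURO SIGN
  for i := 1 to Length(S) do
    Write(IntToHex(Ord(S[i]), 2), ' '); // E2 82 AC
  WriteLn;
end.
```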

Number of bytes needed for the UTF-8-encoded value. Unicode character value to convert in the function. Stores the UTF-8-encoded byte values for the codepoint. Stores a single Unicode codepoint as a UTF-8-encoded value in the buffer.

UnicodeToUTF8SkipErrors is a simple and fast function used to write a single Unicode codepoint as a UTF-8-encoded value in Buf. It returns the number of bytes written. It does not append a terminating null (#0) character. It does not check if the codepoint actually exists in Unicode tables. It returns 0 if the codepoint can not be represented as a 1 to 4 byte UTF-8 sequence.

UTF-8-encoded value for the codepoint. Codepoint (Unicode character) to convert in the function. Buffer where the converted value is stored. Encodes the given code point as an UTF-8 sequence of 1 to 4 bytes.

UnicodeToUTF8Inline is an Integer function used to convert the Unicode character value in CodePoint to the sequence of bytes needed for the UTF-8 encoding. UnicodeToUTF8Inline stores the UTF-8-encoded byte values for the Unicode character in the Buf parameter.

The return value contains the number of bytes required for the UTF-8-encoded value (in the range 1..4).

Used in the implementation of UnicodeToUTF8 and UnicodeToUTF8SkipErrors.

UnicodeToUTF8Inline does not process #0 byte values for the codepoint, as done for UTF-32.
Number of bytes required for the UTF-8-encoded value. Unicode character value to convert. Destination where encoded byte values are stored. Converts UTF-8 values to their DBCS equivalent.

UTF8ToDoubleByteString is a String function used to convert UTF-8-encoded values to the representation used in Double Byte Character Sets (DBCS).

UTF8ToDoubleByteString calls UTF8Length to get the number of codepoints (or characters) in s, and calls UTF8ToDoubleByte to perform the conversion. Each codepoint is converted to Unicode by calling UTF8CodepointToUnicode.

The return value is a String type with the byte values from the conversion, or an empty string ('') when s does not contain a valid UTF-8-encoded string.

DBCS values for the specified codepoints. UTF-8-encoded values to convert in the function. Converts a UTF-8-encoded string to its DBCS representation.

UTF8ToDoubleByte is used to convert UTF-8-encoded values to the representation used in Double Byte Character Sets (DBCS). UTF8ToDoubleByte calls UTF8CodepointToUnicode to process each of the codepoints in UTF8Str.

The return value contains the byte values from the conversion.

Number of double bytes converted in the function. UTF-8-encoded values to convert in the function. Length of the UTF-8-encoded input values. Storage for the Double Byte values. Finds the start of the UTF-8 character at the specified position.

Finds the start of the UTF-8 character which contains BytePos. If BytePos is not part of a valid UTF-8 codepoint, the function returns BytePos. BytePos values start at position 0.

Len is the length in bytes.

Position where the next codepoint begins. Values to examine in the function. Length of the input values. Offset into UTF8Str for the initial byte value. Tries to find the start of a valid UTF-8 codepoint in a string.

Utf8TryFindCodepointStart is a Boolean function which tries to find the start of a valid UTF-8 codepoint at the specified position in AString.

The return value contains True if the bytes at the specified position are a valid UTF-8 codepoint (1 - 4 bytes). When the return value is True, the value in CurPos is updated to contain the position in AString where the UTF-8 codepoint begins. Otherwise, the value in CurPos is unchanged. Please note, when CurPos points beyond the end of AString you will get a crash!

UTF8CodepointStrictSize will NOT "look" beyond the terminating #0 in a PChar, so this is safe with AnsiString values.
True when the bytes at the specified position are a valid UTF-8 codepoint. Pointer to the string to examine in the function. Pointer to the first position in the string examined in the function. Number of bytes in the codepoint, or 0 when invalid. Initial position in the string examined in the function. Number of bytes required for the UTF-8 codepoint. Finds the n-th UTF-8 codepoint.

Finds the n-th UTF-8 codepoint, ignoring BIDI.

Len is the length in bytes for the values in UTF8Str. CodepointIndex is the position of the desired codepoint (starting at 0), in characters.

The return value contains the byte values for the codepoint, or Nil when a valid codepoint was not found.
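Locating the n-th codepoint amounts to skipping continuation bytes while counting lead bytes. The sketch below uses a hypothetical helper, NthCodepointOffset, which returns a zero-based byte offset instead of a pointer and assumes the input is valid UTF-8. It is not the LazUTF8 implementation.

```pascal
program FindNthCodepoint;
{$mode objfpc}{$H+}

// Hypothetical helper: byte offset (zero-based) of the codepoint with the
// given zero-based index, or -1 when the string has too few codepoints.
function NthCodepointOffset(const S: string; Index: SizeInt): SizeInt;
var
  i: SizeInt;
begin
  Result := -1;
  if Index < 0 then
    Exit;
  for i := 1 to Length(S) do
    if (Ord(S[i]) and $C0) <> $80 then // a byte that starts a codepoint
    begin
      if Index = 0 then
        Exit(i - 1);
      Dec(Index);
    end;
end;

var
  S: string;
begin
  S := 'A' + #$C3#$A9 + #$E2#$82#$AC; // 'A', U+00E9, U+20AC
  WriteLn(NthCodepointOffset(S, 0));  // 0
  WriteLn(NthCodepointOffset(S, 1));  // 1
  WriteLn(NthCodepointOffset(S, 2));  // 3
  WriteLn(NthCodepointOffset(S, 3));  // -1
end.
```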

Byte values for the codepoint, or Nil. Values to examine in the function. Length in bytes for the input values. Character position for the desired codepoint (zero-based). Deprecated. Use UTF8CodepointStart instead. Deprecated. Use UTF8CodepointStart instead. Finds the byte index of the n-th UTF-8 codepoint.

UTF8CodepointToByteIndex is a PtrInt function used to find the byte index in UTF8Str where the n-th UTF-8 codepoint is located. It calls UTF8CodepointStart to get a pointer to the requested codepoint position.

The return value contains the difference between the pointer offsets in each of the PChar values. The return value is -1 when a codepoint is not found at the specified position.

UTF8CodepointToByteIndex ignores BIDI mode.

Byte position where the requested UTF-8 codepoint is located, or -1 when a codepoint is not available for the index value. PChar with the multi-byte UTF-8-encoded values examined in the routine. Length of the PChar value in UTF8Str in bytes. Position of the codepoint requested in the routine. This is 1-based, like a character index in String. Deprecated. Use UTF8CodepointToByteIndex instead. Deprecated. Use UTF8CodepointToByteIndex instead. Replaces all invalid UTF-8 characters in a string with the specified character.

UTF8FixBroken is an overloaded routine used to replace all invalid UTF-8 characters in the specified value with a replacement character. The overloaded variants allow the UTF-8-encoded content to be specified using either a PChar or a String type.

ReplaceChar contains the character used to replace any invalid UTF-8 characters found in the input value. The default value for ReplaceChar is the Space character (Hex $20 Decimal 32).

The PChar variant examines the specified byte values to determine when an invalid UTF-8 codepoint is found. This includes 1, 2, or 3 byte values, those that fall outside of the ranges allowed in UTF-8, and common byte sequences used to inject XSS vulnerabilities. UTF8FixBroken stops processing at the first occurrence of the byte value #0 (Decimal 0). UTF-8 byte sequences updated in the routine are stored in the original PChar argument.

The String variant converts the input argument to a PChar type and calls FindInvalidUTF8Codepoint to locate invalid UTF-8 byte sequences. If invalid bytes are found, UniqueString is called to get a new reference-counted String for the return value generated by calling the overloaded PChar variant.

Modified in LazUtils version 4.0 to include the ReplaceChar argument. UniqueString
PChar with the UTF-8-encoded values examined in the routine. String with the UTF-8-encoded values examined in the routine. Character used to replace invalid codepoints in the input argument. The default value for the argument is the Space character (decimal 32 hex $20). Gets the number of bytes needed for the UTF-8 codepoint.

Gets the number of bytes needed for the UTF-8 codepoint in P. The return value contains the number of bytes needed for the codepoint (in the range 1..4), or 0 (zero) when P is not assigned or the codepoint is invalid.

UTF8CodepointStrictSize stops examining the byte values in P when #0 (Decimal 0) is encountered.
Number of bytes needed for the codepoint. UTF-8-encoded values to examine. Returns the length in bytes (1..4) for a valid UTF-8 character. Otherwise 0. Deprecated. Use UTF8CodepointStrictSize instead. Copies from a C-style string with UTF-8 encoding to UTF-8 string.

UTF8CStringToUTF8String is a String function used to copy the specified number of characters (codepoints) from a C-style string with UTF-8 encoding. The return value is a UTF-8-encoded string with C-style special characters converted to their common equivalents. The following C-style quoted characters are handled in the function:

\t
Converted to a Tab character (Decimal 9)
\"
Converted to a Double Quote character (Decimal 34)
\\
Converted to a Reverse Solidus character (Decimal 92)
\n
Converted to the LineEnding value for the OS or platform

The return value is a string which contains the number of codepoints in SourceStart specified in SourceLen, or an empty string ('') when SourceLen is 0 (zero).
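The escape handling listed above can be sketched with a small byte-oriented routine. UnescapeCStyle is a hypothetical helper, not the LazUTF8 function: it takes a String instead of a PChar with a length, and keeping unknown escape sequences unchanged is an assumption made for this sketch.

```pascal
program UnescapeDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: converts the C-style escapes \t, \", \\ and \n.
// Multi-byte UTF-8 sequences pass through untouched because none of their
// bytes can be equal to the backslash character.
function UnescapeCStyle(const Src: string): string;
var
  i: Integer;
begin
  Result := '';
  i := 1;
  while i <= Length(Src) do
  begin
    if (Src[i] = '\') and (i < Length(Src)) then
    begin
      Inc(i);
      case Src[i] of
        't': Result := Result + #9;
        '"': Result := Result + '"';
        '\': Result := Result + '\';
        'n': Result := Result + LineEnding;
      else
        // Assumption: unknown escapes are copied as-is
        Result := Result + '\' + Src[i];
      end;
    end
    else
      Result := Result + Src[i];
    Inc(i);
  end;
end;

begin
  // Prints: say "hi", a Tab character, then: now
  WriteLn(UnescapeCStyle('say \"hi\"\tnow'));
end.
```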

UTF-8-encoded string with C-style quoting removed. PChar with the UTF-8-encoded C-style string. Number of codepoints to copy in the method. Returns the character index where the search text starts in the string.

Returns the character index where SearchForText starts in SearchInText. An optional StartPos can be given to start searching at a given character index. StartPos starts at 1.

Returns 0 if the search text is not found in the string.

Character position where the search text was located. Value to locate in the string. String to search for the specified value. Returns a pointer to the position where SearchForText starts in SearchInText, or Nil when not found. Pointer to the character value where SearchForText was located in SearchInText, or Nil when not found. Pointer to the character(s) to locate in SearchInText. Number of bytes in SearchForText. Pointer to the character values examined in the routine. Number of bytes in SearchInText. Copies the specified number of codepoints from the UTF-8-encoded string.

UTF8Copy is a String function used to copy UTF-8-encoded values from s starting at the position in StartCharIndex. CharCount specifies the number of multi-byte characters (or codepoints) to include in the return value. The return value is an empty string ('') when s is not a valid UTF-8-encoded string.

UTF8Copy behaves like a substring function.
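The difference between a codepoint-based substring and the byte-based RTL Copy can be sketched as follows. CopyCodepoints and ByteIndexOfCodepoint are hypothetical helpers that assume valid UTF-8 input. They illustrate the index translation only and are not the LazUTF8 implementation.

```pascal
program CopyByCodepoint;
{$mode objfpc}{$H+}

// Hypothetical helper: 1-based byte index of the codepoint with the given
// 1-based character index, or Length(S) + 1 when the index is past the end.
function ByteIndexOfCodepoint(const S: string; CharIndex: SizeInt): SizeInt;
var
  i: SizeInt;
begin
  for i := 1 to Length(S) do
    if (Ord(S[i]) and $C0) <> $80 then
    begin
      Dec(CharIndex);
      if CharIndex = 0 then
        Exit(i);
    end;
  Result := Length(S) + 1;
end;

// Hypothetical helper: copies CharCount codepoints starting at the 1-based
// character position StartCharIndex.
function CopyCodepoints(const S: string; StartCharIndex, CharCount: SizeInt): string;
var
  First, Last: SizeInt;
begin
  if (StartCharIndex < 1) or (CharCount < 1) then
    Exit('');
  First := ByteIndexOfCodepoint(S, StartCharIndex);
  Last := ByteIndexOfCodepoint(S, StartCharIndex + CharCount);
  Result := Copy(S, First, Last - First);
end;

var
  S: string;
begin
  S := 'A' + #$C3#$A9 + #$E2#$82#$AC;       // 'A', U+00E9, U+20AC
  WriteLn(Length(CopyCodepoints(S, 2, 1))); // 2 (bytes for U+00E9)
  WriteLn(Length(CopyCodepoints(S, 2, 5))); // 5 (rest of the string)
end.
```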

String with codepoints copied from the specified source. String with values to copy in the function. Initial character position for the copy operation. Number of characters (codepoints) to copy in the function. Deletes characters (or codepoints) in a UTF-8-encoded string.

UTF8Delete is an overloaded procedure used to delete characters (or codepoints) in a UTF-8-encoded string starting at a given position.

StartCharIndex contains the character position in s where values will be removed. StartCharIndex refers to codepoints and not individual byte or character values. A single character can be expressed as 1-4 byte values in UTF-8 encoding. CharCount indicates the number of codepoints to remove in the function.

The value in s is updated directly in the function.

An overloaded variant of the procedure is provided for platforms where the Win1252 code page is used. On these platforms, raw byte values in s are converted to the UTF-8 code page prior to performing the delete operation.

String with values to delete in the procedure. Initial character position where values will be deleted. Number of characters (or codepoints) to remove in the procedure. Inserts the specified UTF-8 values into a string at the specified character position.

UTF8Insert inserts the specified values into a string at the specified character position. UTF8Insert is an overloaded procedure. The variants allow the string arguments to be specified as either String or UTF8String types.

source is the UTF-8-encoded values inserted in the routine.

s is the string where the values from source are inserted at the specified character position.

The value in StartCharIndex starts at 1, and represents the n-th codepoint (or character) in the destination string (s) where the values are inserted.

UTF8Insert calls UTF8CodepointStart to determine the position in s where the codepoint represented by StartCharIndex is located. No actions are performed in the routine if a valid codepoint is not found at the position specified in StartCharIndex. The RTL Insert routine is called to insert the UTF-8-encoded values from source into s.

String with the values inserted in the routine. String where the values from source are inserted. Starting character position (1-based) where the inserted values from source are stored in s. Replaces one or more values in a UTF-8-encoded string which match a given pattern.

UTF8StringReplace is an overloaded String function which replaces values in a string matching a given pattern.

S is the UTF-8-encoded string to update in the function. OldPattern is a pattern with the values to be replaced in S. NewPattern is the replacement value for OldPattern in S.

Flags contains TReplaceFlags values and controls the options enabled in the operation. rfIgnoreCase causes case-insensitive comparisons to be used for values in S and OldPattern; both values are converted to lowercase copies for the purpose. rfReplaceAll causes all occurrences of OldPattern to be replaced with NewPattern in S. If the flag is omitted, only the first occurrence of OldPattern in S is replaced in the routine.

ALanguage is the 2-letter ISO 639-1 Language Code, like 'es' or 'de', used when converting values to lowercase for case-insensitive search. The default value is an empty string ('') and offers maximum speed when the language is not significant.

Count is an output variable used to return the actual number of replacements performed in the function.

UTF8StringReplace provides support for UTF-8 codepoints which have different sizes (byte counts) for the uppercase and lowercase variants of patterns. It ensures that the return value is resized (when needed) to account for individual codepoint sizes altered in S due to case conversion.

The return value is a UTF-8-encoded string with the updated values from S following replacements.

No actions are performed in the routine if OldPattern is an empty string ('').

UTF-8-encoded values after the replace operation. Original UTF-8-encoded values to examine. Pattern to replace in the function. Replacement values for the operation. Replace options enabled in the function. Language Code used for locale-specific lowercase conversions. Number of times the search pattern was replaced in the string. Converts the specified string to lowercase using Unicode case mapping rules.

UTF8LowerCase is a String function used to convert the UTF-8-encoded value in AInStr to its lowercase equivalent. UTF8LowerCase uses the Unicode data defined on the Unicode.org website at ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt. The conversion is performed using the Case Mapping Rules formerly published at https://www.ksu.ru/eng/departments/ktk/test/perl/lib/unicode/UCDFF301.html#CaseMappings (the link is now dead).

ALanguage indicates the language code to use for the conversion. ALanguage should be specified using ISO 639-1 format, which uses 2 characters to represent each language. If the language has no code in ISO 639-1, then the 3-character code from ISO 639-2 should be used. For example: "tr" for the Turkish language locale. Special handling is provided in the function for Turkish ('tr') and Azeri ('az') language codes. ALanguage can be set to an empty string ('') for maximum speed in the conversion.

Lowercase values for the specified string. Values to convert in the function. Language code for the operation. Converts the specified string to lowercase using Unicode case mapping rules.

Calls UTF8LowerCase to get the return value for the function.

Lowercase values for the specified string. String value to convert in the function. Converts the specified string to uppercase using Unicode case mapping rules.

UTF8UpperCase is a String function used to convert the UTF-8-encoded value in AInStr to its uppercase equivalent. UTF8UpperCase uses Unicode Data as defined at the Unicode.org website (ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt, FTP link removed). The conversion is performed using the Case Mapping Rules defined at https://www.ksu.ru/eng/departments/ktk/test/perl/lib/unicode/UCDFF301.html#CaseMappings (dead link removed).

ALanguage indicates the language code to use for the conversion. ALanguage should be specified using ISO 639-1 format, which uses 2 characters to represent each language. If the language has no code in ISO 639-1, then the 3-character code from ISO 639-2 should be used. For example: "tr" for the Turkish language locale. Special handling is provided in the function for Turkish ('tr') and Azeri ('az') language codes. ALanguage can be set to an empty string ('') for maximum speed in the conversion.

Uppercase values for the specified string. Values to convert in the function. Language code for the operation. Inline variant of UTF8UpperCase. Uppercase values for the string. Values to convert in the function. Gets the uppercase value for the specified text. Optimized to improve speed for ASCII content in the argument.

UTF8UpperCaseFast examines the ordinal values for the characters in AText to determine how the return value is derived. It is optimized for ASCII content (byte values in the range 1..127). It converts individual characters in the range ['a'..'z'] by subtracting 32 from their ordinal values.

If a non-ASCII byte value is found in AText, the return value is derived by calling UTF8UpperCase with the value in AText as an argument.
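The ASCII-first strategy described above can be sketched as follows. UpperAsciiFirst is a hypothetical helper written for illustration only; it is not the LazUtils implementation and may differ from it in detail. It assumes the LazUtils package is available for the UTF8UpperCase fallback.

```pascal
program UpperFastSketch;
{$mode objfpc}{$H+}
uses
  LazUTF8;

// Hypothetical helper illustrating the technique; not the library routine.
function UpperAsciiFirst(const AText: String): String;
var
  i: Integer;
  c: Char;
begin
  Result := AText;
  for i := 1 to Length(Result) do
  begin
    c := Result[i];
    if Ord(c) >= 128 then
      // Non-ASCII byte found: let the full Unicode routine handle the input.
      Exit(UTF8UpperCase(AText));
    if c in ['a'..'z'] then
      // ASCII lowercase and uppercase letters differ by 32.
      Result[i] := Chr(Ord(c) - 32);
  end;
end;

begin
  WriteLn(UpperAsciiFirst('hello, world'));   // pure ASCII: HELLO, WORLD
  WriteLn(UpperAsciiFirst('caf'#$C3#$A9));    // falls back to UTF8UpperCase
end.
```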

Added in LazUtils version 4.0.
Uppercase value for the specified text. String with the content examined and converted to its uppercase representation. Gets the lowercase value for the specified text. Optimized to improve speed for ASCII content in the argument.

UTF8LowerCaseFast examines the ordinal values for the characters in AText to determine how the return value is derived. It is optimized for ASCII content (byte values in the range 1..127). It converts individual characters in the range ['A'..'Z'] by adding 32 to their ordinal values.

If a non-ASCII byte value is found in AText, the return value is derived by calling UTF8LowerCase with the value in AText as an argument.

Added in LazUtils version 4.0.
Lowercase value for the specified text. String with the content examined and converted to its lowercase representation. Provides a simplistic implementation of UTF8UpperCase and UTF8LowerCase.

UTF8SwapCase provides a "naive" implementation that uses UTF8UpperCase and UTF8LowerCase. Performance is acceptable for short and reasonably long strings, but the implementation is not optimized for speed or memory consumption.

AInStr contains a UTF-8-encoded string with the values to convert in the function. Each character in AInStr will have its case "toggled" in the function. In other words, an uppercase character is converted to lowercase, and vice versa.

ALanguage indicates the language code to use for the conversion. ALanguage should be specified using ISO 639-1 format, which uses 2 characters to represent each language. If the language has no code in ISO 639-1, then the 3-character code from ISO 639-2 should be used. For example: "tr" for the Turkish language locale. Special handling is provided in the function for Turkish ('tr') and Azeri ('az') language codes. ALanguage can be set to an empty string ('') for maximum speed in the conversion.

No actions are performed in the method when the number of bytes for the converted value differs from the number of bytes in the original value. In this case, the return value contains the unmodified string in AInStr. The return value is an empty string ('') when AInStr is an empty string ('').

String with the converted case values. Original values for the conversion. Language code for the locale used in the conversion. Capitalizes the first letter of each word in the string.

UTF8ProperCase is a String function used to capitalize the first letter of each word in the specified string. WordDelims is a set which contains the characters used as word boundaries in the string.

UTF8ProperCase converts all of the values in AInStr to their lowercase equivalents, before converting letters following a word delimiter to uppercase.

Converted values for the string. Values to convert in the function. Characters used as word delimiters. Finds the position where an invalid UTF-8 codepoint is found in the string.

FindInvalidUTF8Codepoint is a PtrInt function used to find the position where an invalid UTF-8 codepoint is located in the specified value. The return value contains -1 when none of the values in p are invalid, or the zero-based offset into p where the invalid encoding was located.

StopOnNonUTF8 indicates if the function should exit when an encoded value is found that is not defined for the UTF-8 encoding, or for single byte characters inserted in the middle of a UTF-8 encoding (used in XSS attacks).
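A minimal sketch of a validity check built on FindInvalidUTF8Codepoint is shown below. It assumes the LazUtils package is available, and that the arguments are passed in the order pointer, byte count, StopOnNonUTF8. Report is a hypothetical helper used only for this example.

```pascal
program ValidateDemo;
{$mode objfpc}{$H+}
uses
  LazUTF8;

// Hypothetical helper: prints whether S is well-formed UTF-8.
procedure Report(const S: String);
var
  Offset: PtrInt;
begin
  Offset := FindInvalidUTF8Codepoint(PChar(S), Length(S), True);
  if Offset < 0 then
    WriteLn('valid UTF-8')
  else
    WriteLn('invalid byte sequence at zero-based offset ', Offset);
end;

begin
  Report('caf'#$C3#$A9);   // well-formed two-byte sequence for U+00E9
  Report('caf'#$C3);       // lead byte with no continuation byte
end.
```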

Offset into the string for the error. Values to examine in the function. Length of the input values. True to exit on a malformed codepoint. Returns -1 if OK, otherwise byte index of invalid UTF-8 codepoint. Deprecated. Use FindInvalidUTF8Codepoint instead.

It always stops on irregular codepoints. For example, Codepoint 0 is normally encoded as #0, but it can also be encoded as the overlong sequence #192#128. Because most software does not check this, it can be exploited and is a security risk. If StopOnNonUTF8 is False, it will ignore undefined codes (for example #128). By default it stops on such codes.

Creates a string filled with the specified number of given codepoints.

UTF8StringOfChar is a function used to create a UTF-8-encoded string filled with the specified number of occurrences of the given codepoint. AUtf8Char is the UTF-8 codepoint to reproduce in the function. No actions are performed if AUtf8Char is an empty string (''), or contains a malformed UTF-8 codepoint.

The return value is filled with byte values for the codepoint (1 to 4 bytes as per the UTF-8 encoding). The process is repeated until the number of codepoints in N have been stored in the return value.

String with the specified number of occurrences of the codepoint. Codepoint to reproduce in the function. Number of occurrences to include in the return value. Adds the specified number of UTF-8 codepoints to a string.

UTF8AddChar is a String function used to add the specified number of UTF-8 codepoints to a string. AUtf8Char is the UTF-8-encoded codepoint to add to the string value in S. N indicates the number of times the codepoint should be added to the string.

No actions are performed in the function when AUtf8Char is an empty string (''), or contains a malformed UTF-8 codepoint.

Values added to the string in S are inserted at the beginning of the string (prepended).
Updated value for the string. Codepoint to prepend to the string value. Original values for the string. Number of codepoints to prepend to the string. Appends the specified number of UTF-8 codepoints to a string.

UTF8AddCharR is a String function used to append the specified number of UTF-8 codepoints to a string. AUtf8Char is the UTF-8-encoded codepoint to add to the string value in S. N indicates the number of times the codepoint should be appended to the string.

No actions are performed in the function when AUtf8Char is an empty string (''), or contains a malformed UTF-8 codepoint.

Updated value for the string. Codepoint to append to the string value. Original values for the string. Number of codepoints to append to the string. Adds the specified number of values in AUtf8Char to the beginning of a string.

UTF8PadLeft is used to add the specified number of values in AUtf8Char to the beginning of a string. The default value for AUtf8Char is #32 ([SPACE]), but can contain any valid UTF-8 codepoint (1 to 4 bytes). UTF8PadLeft calls Utf8AddChar to create the return value for the function.

Updated value for the string with characters inserted at the beginning. Original string value to modify in the function. Number of codepoints desired in the modified string. UTF-8 codepoint to insert into the string. Appends the specified number of UTF-8 codepoints to the end of a string.

UTF8PadRight is used to append the specified number of UTF-8 codepoints to the end of a string. The default value for AUtf8Char is #32 ([SPACE]), but can contain any valid UTF-8 codepoint (1 to 4 bytes). UTF8PadRight calls Utf8AddCharR to create the return value for the function.

Updated value for the string. Original string to modify in the function. Number of codepoints desired in the modified string. Codepoint to append to the string value. Center aligns a string to the specified length.

UTF8PadCenter is used to center align a string to the specified length (number of codepoints). N indicates the length of the modified string after padding on the left and right with the UTF-8 codepoint in AUtf8Char. The default value for AUtf8Char is #32 ([SPACE]), but can contain any valid UTF-8 codepoint (1 to 4 bytes).
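The three padding routines can be compared side by side with a short sketch such as the one below. It assumes the LazUtils package is available, and that each routine takes the string, the desired length in codepoints, and an optional padding codepoint, in that order. The sample value has 4 codepoints but occupies 5 bytes, which is the case where codepoint-based padding differs from byte-based padding.

```pascal
program PadDemo;
{$mode objfpc}{$H+}
uses
  LazUTF8;
const
  // 'caf' followed by U+00E9: 4 codepoints stored in 5 bytes.
  Sample = 'caf'#$C3#$A9;
begin
  // Default padding codepoint (#32) on the left.
  WriteLn('[', UTF8PadLeft(Sample, 8), ']');
  // Explicit padding codepoint on the right.
  WriteLn('[', UTF8PadRight(Sample, 8, '.'), ']');
  // Padding distributed on both sides.
  WriteLn('[', UTF8PadCenter(Sample, 8, '*'), ']');
end.
```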

Modified value for the string after center alignment. Original string value. Desired length for the string (in codepoints). UTF-8 codepoint used as a padding character. Gets the specified number of characters (codepoints) at the start of the string.

UTF8LeftStr is used to get the specified number of characters (codepoints) at the beginning of the UTF-8-encoded string. UTF8LeftStr calls Utf8Copy to get the return value for the function.

Values from the specified string. Original string to examine in the function. Number of characters (codepoints) to get from the string. Gets the specified number of characters (codepoints) at the end of the string.

UTF8RightStr is used to get the specified number of characters (codepoints) at the end of the UTF-8-encoded string. UTF8RightStr calls Utf8Copy to get the return value for the function.

Values from the string. Original string to examine in the function. Number of characters (codepoints) to get from the string. Performs safe quoting for the specified UTF-8-encoded string value.

UTF8QuotedStr is a String function used to double all occurrences of the byte sequence in the Quote argument. It works like the QuotedStr or AnsiQuotedStr routines from the RTL sysutils unit, but allows the Quote character to contain a valid multi-byte UTF-8 codepoint. Processing in the routine is halted when the #0 (Decimal 0) character is encountered.

Like its counterparts, UTF8QuotedStr encloses the return value with the character specified in the Quote argument.
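A short sketch of both the single-byte and the multi-byte quote cases follows. It assumes the LazUtils package is available, and that the arguments are the string followed by the quote sequence. The second call uses U+00AB (bytes #$C2#$AB) as a hypothetical multi-byte quote value chosen only for illustration.

```pascal
program QuoteDemo;
{$mode objfpc}{$H+}
uses
  LazUTF8;
begin
  // Single-byte quote: the embedded apostrophe is doubled.
  WriteLn(UTF8QuotedStr('It''s done', ''''));
  // Multi-byte quote: the embedded two-byte sequence is doubled as a unit.
  WriteLn(UTF8QuotedStr('a '#$C2#$AB' b', #$C2#$AB));
end.
```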

QuotedStr AnsiQuotedStr
Value in S after safe UTF-8 quoting has been applied. String with the values examined and quoted in the routine. Byte sequence with the quote character used in the routine. Determines if a string starts with the specified value.

UTF8StartsText determines if the value in AText begins with the value in ASubText. Both values can contain a valid UTF-8-encoded string. The return value is False when ASubText is an empty string (''), or ASubText contains more characters (codepoints) than the value in AText.

UTF8StartsText calls Utf8Copy and UTF8CompareLatinTextFast to perform a case-insensitive comparison between the values.

True when the string starts with the specified text. Value to locate at the start of the string. String to examine in the function. Determines if a string ends with the specified value.

UTF8EndsText determines if the value in AText ends with the value in ASubText. Both values can contain a valid UTF-8-encoded string. The return value is False when ASubText is an empty string (''), or ASubText contains more characters (codepoints) than the value in AText.

UTF8EndsText calls Utf8Copy and UTF8CompareLatinTextFast to perform a case-insensitive comparison between the values.

True when the string ends with the specified text. Value to locate at the end of the string. String to examine in the function. Reverses the order of codepoints in the specified string.

UTF8ReverseString is used to create a string with the specified content in reverse order. p contains the UTF-8-encoded values for the original string.

ByteCount indicates the total number of bytes needed to represent the codepoints in p.

UTF8ReverseString calls UTF8CodepointSize and moves the needed number of byte values in p to the return value for the function.

String with the reversed text values. PChar type with values reversed in the routine. Number of bytes reversed in the routine. String with the values reversed in the routine. Gets the right-most position in the Source string for the value in Substr. Pointer to the position in Source. Value to locate in Source. String with values examined in the routine. Creates a word-wrapped version of the specified string.

UTF8WrapText is an overloaded String function used to wrap lines of text in S at the number of characters (codepoints) specified in MaxCol.

The overloaded variant allows additional parameters to be provided with the EOL character sequence and a set of characters where a line break can be inserted. Default characters are used in BreakChars for the variant without a BreakChars argument. They include: ' ' (Space), '-' (Dash), and #9 (Tab). BreakStr contains the end-of-line sequence used to represent a line break inserted into the return value.

Use Indent to specify the number of Space (#32) characters inserted as indentation at the beginning of each word-wrapped line. The default value is 0 (zero) and omits indentation in the word-wrapped lines. A negative value in Indent causes the argument to be set to 0.

The Indent argument affects the number of UTF-8 characters allowed in each word-wrapped line. When set to a positive non-zero value, the maximum number of characters allowed per line is MaxCol - Indent.

No actions are performed in the function when S is an empty string (''), MaxCol is set to 0 (zero), or BreakChars is an empty set ([]).
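The two variants without an indentation argument can be exercised with a sketch like the one below. It assumes the LazUtils package is available, and that the longer variant takes the text, BreakStr, BreakChars, and MaxCol in that order. The marker placed before LineEnding in the second call is a hypothetical choice that makes the inserted breaks visible.

```pascal
program WrapDemo;
{$mode objfpc}{$H+}
uses
  SysUtils, LazUTF8;
const
  Sample = 'The quick brown fox jumps over the lazy dog near the riverbank.';
begin
  // Default end-of-line sequence and default break characters.
  WriteLn(UTF8WrapText(Sample, 20));
  WriteLn;
  // Explicit BreakStr and BreakChars; TSysCharSet comes from SysUtils.
  WriteLn(UTF8WrapText(Sample, ' <' + LineEnding, [' ', '-'], 20));
end.
```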

Modified in LazUtils 4.0 to include the overload with an indentation argument. BreakString
Word-wrapped version of the specified text. String with values word-wrapped in the routine. End-of-line sequence used in the routine. Set of characters where a line break can be inserted. Maximum line width in number of UTF-8 characters. Number of Space (#32) characters used to indent the individual lines of word-wrapped text. Determines whether the specified string contains only single-byte ASCII characters.

Used in the implementation of the TStringListUTF8Fast.InsertItem method.

Added in LazUtils version 3.2.
Returns True if all of the characters in S have a value of $7F or less. String with the characters examined in the method. Represents styles used to escape control characters.

TEscapeMode is an enumerated type with values that determine the output style for escaped characters in Utf8EscapeControlChars.

Pascal-style escape characters '#27' Pascal-style hexadecimal strings '#$1B' C-style hexadecimal strings '\0x1B' C-style strings '\e' ASCII-style control names '[ESC]' Translates control characters in a UTF-8-encoded string into human readable format.

Utf8EscapeControlChars translates control characters inside a UTF-8-encoded string into human readable format. Characters in the range #0..#31 are converted into the human-readable values for the control characters in the format specified by EscapeMode, including:

emPascal
Pascal-style escape characters '#27'
emHexPascal
Pascal-style hexadecimal strings '#$1B'
emHexC
C-style hexadecimal strings '\0x1B'
emC
C-style strings '\e'
emAsciiControlNames
ASCII-style control names '[ESC]'

Utf8EscapeControlChars calls FindInvalidUTF8Codepoint to see if S contains any invalid codepoints for the UTF-8 encoding. If any are found, UTF8FixBroken is called to repair the input value.

Utf8EscapeControlChars iterates over the characters in S, and converts any character value in the eligible range using an internal lookup table for the value in EscapeMode. All other character values (or values in multi-byte UTF-8 code points) are included in the return value in their unmodified form.

Mainly used as a diagnostic or logging tool.
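To compare the output styles, the sketch below runs the same input through every TEscapeMode value. It assumes the LazUtils package is available, and that the arguments are the string followed by the escape mode. The loop iterates over the enumeration bounds, so it does not depend on the individual value names.

```pascal
program EscapeDemo;
{$mode objfpc}{$H+}
uses
  LazUTF8;
const
  // Contains CR, LF, TAB and ESC control characters.
  Sample = 'line1'#13#10'tab'#9'esc'#27;
var
  Mode: TEscapeMode;
begin
  for Mode := Low(TEscapeMode) to High(TEscapeMode) do
    WriteLn(Ord(Mode), ': ', Utf8EscapeControlChars(Sample, Mode));
end.
```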

String with the escaped values for control characters in S. UTF-8 encoded string with values converted in the routine. Controls the human readable format for escaped characters. Controls trimming actions performed in UTF8Trim.

TUTF8TrimFlag is an enumerated type with values that control trimming actions performed in the UTF8Trim function.

Keeps leading whitespace. Keeps trailing whitespace. Keeps tab characters. Keeps line breaks. Keeps no-break space characters. Keeps control codes other than tabs and line breaks. Stores values from the TUTF8TrimFlag enumeration.

TUTF8TrimFlags is a set type used to store values from the TUTF8TrimFlag enumeration. TUTF8TrimFlags is the type passed in arguments to the UTF8Trim function.

Removes leading and trailing whitespace or control characters.

UTF8Trim removes spaces, tabs, line breaks and control characters at both the start and the end of the UTF-8-encoded value in s. Use Flags to delete at the start only or at the end only, or to not delete line breaks. Control characters are the Unicode sets C0 and C1, and the left-to-right and right-to-left marks.
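A minimal sketch of the Flags argument follows. It assumes the LazUtils package is available. The identifiers u8tKeepStart and u8tKeepEnd are assumptions about the TUTF8TrimFlag value names, which are not listed in this topic, and should be verified against the unit's interface section.

```pascal
program TrimDemo;
{$mode objfpc}{$H+}
uses
  LazUTF8;
const
  // Leading tab and spaces, trailing spaces and a CR/LF pair.
  Sample = #9'  padded text  '#13#10;
begin
  // Default (empty) flags: trim both ends.
  WriteLn('[', UTF8Trim(Sample), ']');
  // Assumed flag name: keep leading whitespace, trim the end only.
  WriteLn('[', UTF8Trim(Sample, [u8tKeepStart]), ']');
  // Assumed flag name: keep trailing whitespace, trim the start only.
  WriteLn('[', UTF8Trim(Sample, [u8tKeepEnd]), ']');
end.
```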

Trimmed values for the string. String with values to trim. Actions to perform in the function. Compares the UTF-8-encoded string values.

UTF8CompareStr is a function used to compare the specified UTF-8-encoded string values. The return value indicates the relative sort order for the compared values, and includes:

0
Values are the same
< 0
Value S1 comes before S2 in an alphabetic sort order
> 0
Value S1 comes after S2 in an alphabetic sort order

Internally, UTF8CompareStr calls WideCompareText using the values in S1 and S2 converted to UTF-16 code points.

Relative order for the compared values. First value for the comparison. Second value for the comparison. Length of the first value. Length of the second value. Compares the specified PChar values.

Calls UTF8CompareStr to get the return value for the function.

Relative order for the compared values. First PChar value for the comparison. Second PChar value for the comparison. Case-insensitive comparison of two UTF-8-encoded values.

UTF8CompareText is a function used to perform a case-insensitive comparison between the specified UTF-8-encoded values. The return value indicates the relative sort order for the compared values, and includes:

0
Values are the same
< 0
Value S1 comes before S2 in an alphabetic sort order
> 0
Value S1 comes after S2 in an alphabetic sort order

Internally, UTF8CompareText uses WideCompareText when multi-byte codepoints are found in the compared values. This function guarantees proper collation on all supported platforms.
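Callers should test only the sign of the result, not a specific magnitude. The sketch below shows one way to do that. It assumes the LazUtils package is available. Describe is a hypothetical helper written for this example.

```pascal
program CompareDemo;
{$mode objfpc}{$H+}
uses
  LazUTF8;

// Hypothetical helper: maps the sign of a comparison result to text.
function Describe(Cmp: PtrInt): String;
begin
  if Cmp < 0 then
    Result := 'S1 sorts before S2'
  else if Cmp > 0 then
    Result := 'S1 sorts after S2'
  else
    Result := 'S1 and S2 are the same';
end;

begin
  // Values that differ only in letter case.
  WriteLn(Describe(UTF8CompareText('lazarus', 'LAZARUS')));
  // Values that differ in content.
  WriteLn(Describe(UTF8CompareText('apple', 'Banana')));
end.
```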

Relative order for the compared values. First value for the comparison. Second value for the comparison. Performs a case-insensitive comparison for the specified UTF-8-encoded PChar values.

Converts values in S1 and S2 to UTF-16 encoding, and calls WideCompareText to get the return value for the case-insensitive comparison. The return value contains the relative difference between the compared values. For instance:

  • <0 when S1<S2.
  • 0 when S1=S2.
  • >0 when S1>S2.
Integer result for the case-insensitive comparison. PChar with the values used in the comparison. PChar with the values used in the comparison. Deprecated. Use UTF8CompareText or AnsiCompareText instead.

UTF8CompareLatinTextFast is like UTF8CompareText, but does not return strict alphabetical order. The order is deterministic and good for binary search and similar uses. It avoids the conversions from UTF-8 to UTF-16 needed to use WideCompareText.

UTF8CompareLatinTextFast optimizes the comparison of values using single-byte encoding by converting uppercase characters to lowercase characters for the comparison. Multi-byte portions (with character values larger than Decimal 127) are optimized to ignore leading bytes sequences common to both compared values.

Otherwise, the routine falls back to AnsiCompareText to compare lowercase ASCII values in S1 and S2.

The return value is a PtrInt (pointer-sized integer) value that contains the relative sort order for the compared values.

<0
S1 comes before S2 in the sort order.
>0
S1 comes after S2 in the sort order.
0 (zero)
S1 and S2 have the same value in the sort order.
Deprecated in LazUtils version 3.2 (Feb 2024). AnsiCompareText
PtrInt value with the relative order for the compared values. UTF-8-encoded String value used in the comparison. UTF-8-encoded String value used in the comparison. Deprecated. Use UTF8CompareStr instead.

UTF8CompareStrCollated is used to compare two strings using language-specific sorting. The return value contains the relative sort order for the compared values, as defined for UTF8CompareStr.

Deprecated in LazUtils version 3.2 (Feb 2024).
Relative order for the compared values. First string for the comparison. Second string for the comparison. Compares the specified lines of text in a TStringList.

CompareStrListUTF8LowerCase is an Integer function used to compare the specified lines of text in the TStringList argument. Index1 and Index2 contain the ordinal positions for the respective lines of text. CompareStrListUTF8LowerCase calls UTF8CompareText to perform a case-insensitive comparison between the values.

The return value contains the relative sort order for the compared values, as defined for UTF8CompareText.

Relative order for the compared values. TStringList with values for the comparison. Position of the first text line. Position of the second text line. Implements a string list using fast ASCII comparison functions when its data is pure ASCII.

When data is Unicode, it switches to slower AnsiCompare functions. The switch is managed by setting the UseLocale property/option and should not be changed by the user.
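Because the switch is handled internally, the class can be used like an ordinary TStringList. The sketch below is a minimal example under that assumption; it requires the LazUtils package and leaves the UseLocale property untouched, as recommended above.

```pascal
program FastListDemo;
{$mode objfpc}{$H+}
uses
  Classes, LazUTF8;
var
  List: TStringListUTF8Fast;
  i: Integer;
begin
  List := TStringListUTF8Fast.Create;
  try
    List.Add('pear');
    List.Add('Apple');
    // Non-ASCII entry ('caf' followed by U+00E9).
    List.Add('caf'#$C3#$A9);
    List.Sort;
    for i := 0 to List.Count - 1 do
      WriteLn(List[i]);
  finally
    List.Free;
  end;
end.
```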

TStringList
Ensures that the UseLocale property is enabled when a new line with non-ASCII data is stored in the string list. Ordinal position in the string list where the value in S is stored. String value examined and stored in the method. Constructor for the class instance.

Create is the overridden constructor for the class instance. It calls the inherited method on entry to set the default encoding and options used in the class instance. Create ensures that the UseLocale property is set to False to allow fast comparisons for ASCII data stored in the string list.

TStrings.UseLocale TStrings.DefaultEncoding
Indicates the result from UTF-8 <-> UTF-16 conversions.

TConvertResult is an enumeration type with values that indicate the result from ConvertUTF8ToUTF16 and ConvertUTF16ToUTF8 function calls.

No error in the conversion. Source value is null. Destination value is null. Destination value is too small for the converted value. An invalid encoding was found in the source value. An unfinished encoding was found in the source value. Indicates options enabled during UTF-8 <-> UTF-16 conversions.

TConvertOption is an enumeration type with values that indicate options enabled during UTF-8 <-> UTF-16 conversions.

Stop on invalid source char and report error. Replace invalid source chars with '?' Stop on unfinished source char and report error. Replace unfinished source char with '?' Stores values from the TConvertOption enumeration.

Stores values from the TConvertOption enumeration. Passed as an argument to ConvertUTF8ToUTF16 and ConvertUTF16ToUTF8.

Converts values from UTF-8 encoding to UTF-16 encoding.

ConvertUTF8ToUTF16 is used to convert the specified UTF-8-encoded string to UTF-16 encoding (system endian).

Options indicates the conversion options enabled in the function, and can include the following values:

toInvalidCharError
Stop on invalid source char and report error
toInvalidCharToSymbol
Replace invalid source chars with '?'
toUnfinishedCharError
Stop on unfinished source char and report error
toUnfinishedCharToSymbol
Replace unfinished source char with '?'

The return value is a value from the TConvertResult enumeration, including:

trNoError
The string was successfully converted without any error
trNullSrc
Pointer to source string is nil
trNullDest
Pointer to destination string is nil
trDestExhausted
Destination buffer size is not big enough to hold converted string
trInvalidChar
Invalid source char has occurred
trUnfinishedChar
Unfinished source char has occurred
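A minimal calling sketch with a fixed-size destination buffer is shown below. It assumes the LazUtils package is available, and that the arguments are passed in the order destination pointer, destination WideChar count, source pointer, source char count, options, and the output count. The buffer size of 64 WideChars is an arbitrary choice for this example.

```pascal
program ConvertDemo;
{$mode objfpc}{$H+}
uses
  LazUTF8;
var
  Src: String;
  Dest: array[0..63] of WideChar;
  Used: SizeUInt;
  Res: TConvertResult;
begin
  Src := 'caf'#$C3#$A9;
  // Replace invalid or unfinished sequences with '?' instead of stopping.
  Res := ConvertUTF8ToUTF16(@Dest[0], Length(Dest), PChar(Src), Length(Src),
    [toInvalidCharToSymbol, toUnfinishedCharToSymbol], Used);
  if Res = trNoError then
    WriteLn('Converted; WideChar count reported: ', Used)
  else if Res = trDestExhausted then
    WriteLn('Destination buffer is too small')
  else
    WriteLn('Conversion failed; result ordinal: ', Ord(Res));
end.
```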
Converted values from the function. Pointer to destination string. Wide char count allocated in destination string. Pointer to source string. Char count allocated in source string. Conversion options, if none is set, both invalid and unfinished source chars are skipped. Actual WideChar count used in the conversion. Converts values from UTF-16 encoding to UTF-8 encoding.

Converts the specified UTF-16 encoded string (system endian) to its UTF-8 encoding.

Options indicates the conversion options enabled in the function, and can include the following values:

toInvalidCharError
Stop on invalid source char and report error
toInvalidCharToSymbol
Replace invalid source chars with '?'
toUnfinishedCharError
Stop on unfinished source char and report error
toUnfinishedCharToSymbol
Replace unfinished source char with '?'

The return value is a value from the TConvertResult enumeration, including:

trNoError
The string was successfully converted without any error
trNullSrc
Pointer to source string is nil
trNullDest
Pointer to destination string is nil
trDestExhausted
Destination buffer size is not big enough to hold converted string
trInvalidChar
Invalid source char has occurred
trUnfinishedChar
Unfinished source char has occurred
Converted values from the function. Pointer to destination string. Char count allocated in destination string. Pointer to source string. Wide char count allocated in source string. Conversion options, if none is set, both invalid and unfinished source chars are skipped. Actual char count converted from source string to destination string. Converts the UTF-8 encoded string to UTF-16 encoding (system endian).

Converts the UTF-8 encoded string to UTF-16 encoding (system endian).

Converts a UTF-16-encoded string (system endian) to UTF-8 encoding.

UTF16ToUTF8 is a TConvertResult function used to convert the specified UTF-16-encoded string (system endian) to UTF-8 encoding.

The return value is a TConvertResult enumeration value, and includes:

trNoError
The string was successfully converted without any error
trNullSrc
Pointer to source string is Nil
trNullDest
Pointer to destination string is Nil
trDestExhausted
Destination buffer size is not big enough to hold converted string
trInvalidChar
Invalid source char has occurred
trUnfinishedChar
Unfinished source char has occurred
UTF-8-encoded string. Source UTF-16 string (system endian). Pointer to the Source UTF-16 string (system endian). Number of WideChar values in the source string. Deprecated. Use the GetLanguageID function from the translations.pas unit instead. Deprecated in LazUtils version 2.3.0. Deprecated. Use the GetLanguageID function from the translations.pas unit instead. Deprecated in LazUtils version 2.3.0. Contains uppercase characters for all values in the char type.

FPUpChars is an array of char type and uses the Lower and Upper bounds permitted for the char type. Values in FPUpChars are assigned in the initialization section for the lazutf8.pas unit, and contain the uppercase equivalent for all characters in the char type.

Gets the default system code page for the wide string manager.

UTF8GetStandardCodePage is a TSystemCodePage function used to get the default code page for strings in the Wide String manager. UTF8GetStandardCodePage is implemented for Windows platforms that use a UTF-8-enabled Run-time Library (RTL). It is assigned as the procedure used by the wide string manager for the platform.

stdcp contains the TStandardCodePageEnum enumeration value that identifies the default code page for the platform.

The return value is set to the CP_UTF8 constant.