Used to perform conversions between Unicode and System Code Page encodings.

lconvencoding.pas provides types, variables, constants, and routines used to perform conversions to and from Unicode and/or System Code Page encodings. Functions in this unit are thread-safe.

For environments where the RTL uses UTF-8 encoding, the UseSystemCPConv compiler define is enabled to include System Code Page conversions.

Represents the behaviors for an error occurring in an encoding conversion.

TConvertEncodingErrorMode is an enumerated type with values that represent the behavior applied when an encoding conversion error is detected. TConvertEncodingErrorMode is the type used for the ConvertEncodingErrorMode variable.

  • ceemSkip: Skip or ignore the encoding conversion error.
  • ceemException: Raise an EConvertError exception with a message relevant to the conversion error.
  • ceemReplace: Replace the suspect value with an error placeholder (usually the '?' character).
  • ceemReturnEmpty: Return an empty string ('') for the suspect value.

Error handling behavior for encoding conversion errors.

ConvertEncodingErrorMode is a unit global TConvertEncodingErrorMode variable which controls how encoding conversion errors are handled in the following conversions:

  • UTF-8 to single byte encoding
  • DBCS (Asian) encoding to UTF-8
  • UTF-8 to DBCS

The default value for the variable is ceemSkip, and indicates that an error is skipped or ignored when encountered in an encoding conversion.
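
The four error behaviors can be illustrated with a small Python sketch. It is not the Pascal implementation: the ErrorMode enumeration and the utf8_to_single_byte function are hypothetical names, Python's codec error handlers stand in for the unit's conversion routines, and returning an empty result for the whole value in the "return empty" case is an assumption.

```python
from enum import Enum


class ErrorMode(Enum):
    """Hypothetical stand-in for the four documented error behaviors."""
    SKIP = "skip"                  # drop characters that cannot be converted
    EXCEPTION = "exception"        # raise an error
    REPLACE = "replace"            # substitute a '?' placeholder
    RETURN_EMPTY = "return_empty"  # give up and return an empty value


def utf8_to_single_byte(data: bytes, codec: str,
                        mode: ErrorMode = ErrorMode.SKIP) -> bytes:
    """Convert UTF-8 bytes to a single-byte codec, applying the error mode."""
    text = data.decode("utf-8")
    if mode is ErrorMode.SKIP:
        return text.encode(codec, errors="ignore")
    if mode is ErrorMode.REPLACE:
        return text.encode(codec, errors="replace")
    try:
        return text.encode(codec, errors="strict")
    except UnicodeEncodeError:
        if mode is ErrorMode.RETURN_EMPTY:
            return b""  # assumption: the whole result becomes empty
        raise           # ErrorMode.EXCEPTION: propagate the error
```

For example, converting the UTF-8 bytes of "a€b" to Latin-1 (which has no euro sign) yields b"ab" when skipping and b"a?b" when replacing.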

  • Encoding name for UTF-8.
  • Encoding name for ANSI.
  • Encoding name for UTF-8 with a byte order mark.
  • Encoding name for UCS 2-byte Little Endian.
  • Encoding name for UCS 2-byte Big Endian.
  • Encoding name for Code Page 1250.
  • Encoding name for Code Page 1251.
  • Encoding name for Code Page 1252.
  • Encoding name for Code Page 1253.
  • Encoding name for Code Page 1254.
  • Encoding name for Code Page 1255.
  • Encoding name for Code Page 1256.
  • Encoding name for Code Page 1257.
  • Encoding name for Code Page 1258.
  • Encoding name for Code Page 437.
  • Encoding name for Code Page 850.
  • Encoding name for Code Page 852.
  • Encoding name for Code Page 866.
  • Encoding name for Code Page 874.
  • Encoding name for Code Page 932.
  • Encoding name for Code Page 936.
  • Encoding name for Code Page 949.
  • Encoding name for Code Page 950.
  • Encoding name for the Macintosh Code Page.
  • Encoding name for KOI-8R Code Page.
  • Encoding name for KOI-8U Code Page.
  • Encoding name for KOI-8RU Code Page.
  • Encoding name for ISO 8859-1 Code Page.
  • Encoding name for ISO 8859-2 Code Page.
  • Encoding name for ISO 8859-15 Code Page.
  • ANSI representation for the UTF-8 Byte Order Mark.
  • ANSI representation for the UTF-16 Big Endian Byte Order Mark.
  • ANSI representation for the UTF-16 Little Endian Byte Order Mark.
  • ANSI representation for the UTF-32 Big Endian Byte Order Mark.
  • ANSI representation for the UTF-32 Little Endian Byte Order Mark.

Tries to determine the encoding used for the specified value.

GuessEncoding is a String function which tries to determine the encoding used for the value specified in S. The return value contains an encoding name, like 'utf8' or 'ISO-8859-1'. It may contain an empty string ('') when S is also an empty string.

GuessEncoding checks S for various Byte Order Marks at the start of the value, including: UTF8BOM, UTF16LEBOM, and UTF16BEBOM. When present, the BOM determines the encoding used for the value.

Next, it checks for an explicit '{%encoding ' marker at the start of the value. When present, the value after the marker (up to the closing '}' character) is normalized and used as the return value.

Finally, it checks for a valid UTF-8 encoding (which includes ASCII characters). All characters in S are examined until a character whose UTF-8 code point is not valid is encountered.

When every character in S is valid UTF-8, EncodingUTF8 is used as the return value. Otherwise, the default text encoding for the platform is used. When that default is EncodingUTF8, the return value is changed to 'ISO-8859-1'. This is done because the system may use the UTF-8 encoding, but the value in S does not. ISO 8859-1 has a full mapping to Unicode, and this prevents data loss in encoding conversions.
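
The detection order described for GuessEncoding can be sketched in Python as follows. This is an illustration under stated assumptions, not the Pascal source: guess_encoding, normalize, and the default_encoding parameter are hypothetical names; the returned names 'utf8bom', 'ucs2le', and 'ucs2be' are assumed values; a value that is entirely valid UTF-8 is assumed to yield 'utf8'; and a marker without a closing '}' is assumed to be ignored.

```python
import codecs

MARKER = b"{%encoding "


def normalize(name: str) -> str:
    # Assumed normalization: lowercase the name and drop '-' characters.
    return name.lower().replace("-", "")


def guess_encoding(s: bytes, default_encoding: str = "utf8") -> str:
    """Guess the encoding of s using the documented order of checks."""
    if not s:
        return ""
    # 1. A Byte Order Mark at the start decides the encoding.
    if s.startswith(codecs.BOM_UTF8):
        return "utf8bom"   # assumed name
    if s.startswith(codecs.BOM_UTF16_LE):
        return "ucs2le"    # assumed name
    if s.startswith(codecs.BOM_UTF16_BE):
        return "ucs2be"    # assumed name
    # 2. An explicit '{%encoding name}' marker at the start.
    if s.startswith(MARKER):
        end = s.find(b"}", len(MARKER))
        if end != -1:
            raw = s[len(MARKER):end].decode("ascii", errors="replace")
            return normalize(raw.strip())
    # 3. Valid UTF-8 (which includes plain ASCII).
    try:
        s.decode("utf-8")
        return "utf8"
    except UnicodeDecodeError:
        pass
    # 4. Fall back to the platform default; when that default is UTF-8,
    #    the data evidently is not, so ISO-8859-1 is used instead.
    if normalize(default_encoding) == "utf8":
        return "ISO-8859-1"
    return default_encoding
```

With these assumptions, guess_encoding(b"{%encoding CP-1250}text") returns 'cp1250', and the Latin-1 bytes b"caf\xe9" yield 'ISO-8859-1' when the default encoding is UTF-8.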

Encoding name detected, or a default value.

String with the content examined in the routine.

Converts the encoded value from UTF-8 to the specified encoding.

Used in the implementation of the ConvertEncoding function.

Converts the specified value to the UTF-8 encoding.

Used in the implementation of the ConvertEncoding function.

Converts the encoding for a value from its source encoding to the target encoding.

This routine gets the default encoding name used for text from the DefaultTextEncoding unit variable. It contains the encoding used for AnsiString in the RTL.

Returns the value in the EncodingAnsi constant ('ansi').

Gets the encoding used in a console on the platform.

This routine returns the console text encoding, which might be different from the normal system encoding on some Windows systems. See http://mantis.freepascal.org/view.php?id=20552

Converts the specified encoding name to lowercase and removes '-' characters.

Specifies a function used to convert the encoding for a string value.

TConvertEncodingFunction is the type used for the ConvertAnsiToUTF8 variable.

Value after converting the value to the required encoding.

Value examined and converted in the routine.

Alias for the TCharToUTF8Table type in CodepagesCommon.

Routine used to convert from ANSI to UTF-8 encoding.

ConvertAnsiToUTF8 is a TConvertEncodingFunction variable which provides the routine used to convert a String value from its ANSI encoding to the UTF-8 encoding.

Routine used to convert from UTF-8 to ANSI encoding.

Removes the UTF-8 BOM from the UTF-8 encoded value.

The return value is the value in s after removing the Byte Order Mark in UTF8BOM from the start of the string. The return value is the same as the value in s when UTF8BOM is not found in the string.

No actions are performed in the routine when s is an empty string (''). The return value is also an empty string.
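
The BOM removal described above amounts to a prefix check. A minimal Python sketch follows; utf8bom_to_utf8 is a hypothetical name, and a bytes value stands in for the Pascal string.

```python
import codecs


def utf8bom_to_utf8(s: bytes) -> bytes:
    """Return s without a leading UTF-8 Byte Order Mark."""
    if not s:
        return s  # empty input: nothing to do
    if s.startswith(codecs.BOM_UTF8):  # b'\xef\xbb\xbf'
        return s[len(codecs.BOM_UTF8):]
    return s  # no BOM at the start: value is returned unchanged
```

Only a BOM at the very start is removed; the same three bytes appearing later in the value are left untouched.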

Value after removing the UTF-8 BOM.

UTF-8-encoded value examined in the routine.

  • Converts an ISO 8859-1-encoded value to UTF-8.
  • Converts an ISO 8859-2-encoded value to UTF-8.
  • Converts a Code Page 1250-encoded value to UTF-8.
  • Converts a Code Page 1251-encoded value to UTF-8.
  • Converts a Code Page 1252-encoded value to UTF-8.
  • Converts a Code Page 1253-encoded value to UTF-8.
  • Converts a Code Page 1254-encoded value to UTF-8.
  • Converts a Code Page 1255-encoded value to UTF-8.
  • Converts a Code Page 1256-encoded value to UTF-8.
  • Converts a Code Page 1257-encoded value to UTF-8.
  • Converts a Code Page 1258-encoded value to UTF-8.
  • Converts a Code Page 437-encoded value to UTF-8.
  • Converts a Code Page 850-encoded value to UTF-8.
  • Converts a Code Page 852-encoded value to UTF-8.
  • Converts a Code Page 866-encoded value to UTF-8.
  • Converts a Code Page 874-encoded value to UTF-8.
  • Converts a KOI-8R-encoded value to UTF-8.

Deprecated in Lazarus 2.2.0. Use KOI8RToUTF8 instead.

Converts a value encoded using the Macintosh Code Page to UTF-8.

Converts a string with single-byte values to UTF-8 using a character mapping table.

  • ArrayISO_8859_1ToUTF8
  • ArrayISO_8859_15ToUTF8
  • ArrayISO_8859_2ToUTF8
  • ArrayCP1250ToUTF8
  • ArrayCP1251ToUTF8
  • ArrayCP1252ToUTF8
  • ArrayCP1253ToUTF8
  • ArrayCP1254ToUTF8
  • ArrayCP1255ToUTF8
  • ArrayCP1256ToUTF8
  • ArrayCP1257ToUTF8
  • ArrayCP437ToUTF8
  • ArrayCP850ToUTF8
  • ArrayCP866ToUTF8
  • ArrayKOI8RToUTF8
  • ArrayKOI8UToUTF8
  • ArrayMacintoshToUTF8

String with the UTF-8-encoded value, or an empty string.

String with the single-byte values converted in the routine.

Table with Character to PChar mappings for the converted value.

  • Converts a value in UCS 2-byte LE encoding to UTF-8.
  • Converts a value in UCS 2-byte BE encoding to UTF-8.
  • Converts a value in UTF-8 encoding to UTF-8 with a Byte Order Mark.
  • Converts a value in UTF-8 encoding to ISO 8859-1.
  • Converts a value in UTF-8 encoding to ISO 8859-2.
  • Converts a value in UTF-8 encoding to Code Page 1250.
  • Converts a value in UTF-8 encoding to Code Page 1251.
  • Converts a value in UTF-8 encoding to Code Page 1252.
  • Converts a value in UTF-8 encoding to Code Page 1253.
  • Converts a value in UTF-8 encoding to Code Page 1254.
  • Converts a value in UTF-8 encoding to Code Page 1255.
  • Converts a value in UTF-8 encoding to Code Page 1256.
  • Converts a value in UTF-8 encoding to Code Page 1257.
  • Converts a value in UTF-8 encoding to Code Page 1258.
  • Converts a value in UTF-8 encoding to Code Page 437.
  • Converts a value in UTF-8 encoding to Code Page 850.
  • Converts a value in UTF-8 encoding to Code Page 852.
  • Converts a value in UTF-8 encoding to Code Page 866.
  • Converts a value in UTF-8 encoding to Code Page 874.
  • Converts a value in UTF-8 encoding to KOI-8R.
  • Converts a value in UTF-8 encoding to KOI-8U.
  • Converts a value in UTF-8 encoding to KOI-8RU.
  • Converts a value in UTF-8 encoding to the Macintosh Code Page.

Converts a UTF-8-encoded value to a single-byte character set using a conversion function.

  • Converts a UTF-8-encoded value to UCS 2-byte LE.
  • Converts a UTF-8-encoded value to UCS 2-byte BE.
  • Converts a value using Code Page 932 to UTF-8.
  • Converts a value using Code Page 936 to UTF-8.
  • Converts a value using Code Page 949 to UTF-8.
  • Converts a value using Code Page 950 to UTF-8.
  • Converts a value using UTF-8 encoding to Code Page 932.
  • Converts a value using UTF-8 encoding to Code Page 936.
  • Converts a value using UTF-8 encoding to Code Page 949.
  • Converts a value using UTF-8 encoding to Code Page 950.

Converts a value from UTF-8 encoding to Double Byte Character Set encoding.

Fills the specified string list with supported encoding names for the platform.

GetSupportedEncodings is a procedure used to get the names for the supported encodings on the platform or operating system. GetSupportedEncodings adds each of the encoding names to List; it does NOT remove any existing content in the string list.

GetSupportedEncodings stores the following values in List:

  • 'UTF-8'
  • 'UTF-8BOM'
  • 'Ansi'
  • EncodingCP1250
  • EncodingCP1251
  • EncodingCP1252
  • EncodingCP1253
  • EncodingCP1254
  • EncodingCP1255
  • EncodingCP1256
  • EncodingCP1257
  • EncodingCP1258
  • EncodingCP437
  • EncodingCP850
  • EncodingCP852
  • EncodingCP866
  • EncodingCP874

For platforms that support Asian code pages, the following encoding names are added to the list:

  • EncodingCP932
  • EncodingCP936
  • EncodingCP950
  • EncodingCP949

The following encoding names are added to the end of the list:

  • 'ISO-8859-1'
  • 'ISO-8859-2'
  • 'ISO-8859-15'
  • 'KOI8-R'
  • 'KOI8-U'
  • 'KOI8-RU'
  • 'Macintosh'
  • 'UCS-2LE'
  • 'UCS-2BE'

String list where supported encoding names are added.
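
The append-only behavior of GetSupportedEncodings can be illustrated with a short Python sketch. The function name, the asian_code_pages flag, and the 'cpNNNN' strings used for the EncodingCPxxxx constants are assumptions made for illustration; the other names and the ordering follow the lists above.

```python
from typing import List


def get_supported_encodings(target: List[str],
                            asian_code_pages: bool = True) -> None:
    """Append the supported encoding names to target without clearing it."""
    target.extend(["UTF-8", "UTF-8BOM", "Ansi"])
    # Assumption: each EncodingCPxxxx constant holds 'cp' plus its number.
    single_byte = (1250, 1251, 1252, 1253, 1254, 1255, 1256, 1257, 1258,
                   437, 850, 852, 866, 874)
    target.extend(f"cp{n}" for n in single_byte)
    if asian_code_pages:
        # Added only on platforms that support Asian code pages.
        target.extend(f"cp{n}" for n in (932, 936, 950, 949))
    target.extend(["ISO-8859-1", "ISO-8859-2", "ISO-8859-15",
                   "KOI8-R", "KOI8-U", "KOI8-RU",
                   "Macintosh", "UCS-2LE", "UCS-2BE"])
```

Because existing entries are kept, a caller that wants only the supported names would clear the list before the call.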