Docs: LazUtils/lazutf8. Updates UTF8CodepointCount content for changes in 27063a7d.

(cherry picked from commit 44c78718af)
2025-08-30 12:12:47 +02:00 · 2024-08-30 21:40:50 +01:00 · 2024-08-30 21:40:50 +01:00 · 22eb8c0f62
commit 22eb8c0f62
parent 4df9861768
1 changed files with 39 additions and 21 deletions
--- a/docs/xml/lazutils/lazutf8.xml
+++ b/docs/xml/lazutils/lazutf8.xml
@@ -634,42 +634,59 @@ Even faster UTF-8 character counting

 <element name="UTF8CodepointCount">
 <short>
-Gets the number of UTF-8-encoded codepoints in the specified value.
+Gets the number of valid UTF-8 codepoints in the specified value.
 </short>
 <descr>
 <p>
-<var>UTF8CodepointCount</var> is an overloaded <var>PtrInt</var> function used 
-to determine the number of UTF-8 codepoints found in the specified value. The 
-overloaded variants allow the value to be specified using either the String or 
-the PChar type.
+<var>UTF8CodepointCount</var> is an overloaded <var>SizeInt</var> function used 
+to determine the number of UTF-8 codepoints found in the specified value. It is 
+similar to the UTF8Length routine, but excludes any invalid codepoints found in 
+the input value from the count in the return value. The overloaded variants 
+allow the input value to be specified using either the String or the PChar type.
 </p>
 <p>
 UTF8CodepointCount iterates over the byte values in the s or p arguments, and 
-increments the return value when a valid UTF-8 codepoint is found. Valid 
-codepoints include those represented by combining character combinations. 
-UTF8CodepointLen (in system.pp) is called to the get the size for each 
-of the UTF-8 codepoints. The process is repeated until all of the bytes in the 
-input value have been examined, or a codepoint with a length of zero (0) is encountered.
+increments the return value when a valid UTF-8 codepoint is found. 
+UTF8CodepointLen (in system.pp) is called to get the size for each of the 
+UTF-8 codepoints. Valid codepoints include those represented using combining 
+characters. The process is repeated until all of the bytes in the input value 
+have been examined, or until a codepoint with a length of zero (0) is 
+encountered.
 </p>
 <p>
 The return value is zero (0) if the s or p arguments are empty, or when the 
 ByteCount argument is zero (0).
 </p>
+<p>
+For example:
+</p>
+<code>
+// var
+//   Utf8Str, InvalidUtf8Str: String;
+//   Cnt, Len: Integer;
+
+{A macron (decomposed)}
+Utf8Str := 'A' + #$CC#$84;
+
+{invalid single byte UTF-8}
+InvalidUtf8Str := #$C0#$C1#$F5#$F6#$F7#$F8#$F9#$FA#$FB#$FC#$FD#$FE#$FF;
+
+Cnt := UTF8CodePointCount(Utf8Str); // Cnt = 2
+Len := UTF8Length(Utf8Str); // Len = 2
+
+Cnt := UTF8CodePointCount(InvalidUtf8Str); // Cnt = 0
+Len := UTF8Length(InvalidUtf8Str); // Len = 13
+
+Cnt := UTF8CodePointCount(InvalidUtf8Str + Utf8Str); // Cnt = 2
+Len := UTF8Length(InvalidUtf8Str + Utf8Str); // Len = 15
+</code>
 </descr>
 <version>
-Added in LazUtils version 4.0. (c8a1f93a)
+Added in LazUtils version 4.0.
 </version>
-<notes>
-<note>
-I wrote a test application to compare the results for UTF8Length and 
-UTF8CodepointCount. They return exactly the same values for the UTF-8 strings I 
-cribbed from the Unicode web site. 
-So basically, why is this routine needed?
-</note>
-</notes>
 <seealso>
-<link id="UTF8CodepointSize"/>
 <link id="UTF8Length"/>
+<link id="UTF8CodepointSize"/>
 <link id="UTF8LengthFast"/>
 <link id="UTF8CharacterLength"/>
 <link id="#rtl.system.UTF8CodepointLen">UTF8CodepointLen</link>
@@ -677,7 +694,8 @@ So basically, why is this routine needed?
 </element>
 <element name="UTF8CodepointCount.Result">
 <short>
-Pointer to the Integer value with the number of codepoints including combining characters.
+Integer value with the number of valid codepoints including combining 
+characters.
 </short>
 </element>
 <element name="UTF8CodepointCount.s">