rtl: apply patch of Inoussa which describes the unicode collation algorithm data layout (mantis #0025240)

git-svn-id: trunk@26114 -
2025-04-08 09:28:19 +02:00 · 2013-11-20 11:41:50 +00:00 · 2013-11-20 11:41:50 +00:00 · f6e1c76aa8
commit f6e1c76aa8
parent 21e178d9d2
1 changed files with 63 additions and 0 deletions
--- a/rtl/objpas/unicodedata.pas
+++ b/rtl/objpas/unicodedata.pas
@@ -17,6 +17,69 @@
    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+
+-------------------------------------------------------------------------------
+
+  Overview of the Unicode Collation Algorithm(UCA) data layout :
+  ============================================================
+
+    The UCA data (see “TUCA_DataBook”) are organized into index data
+    (see the “TUCA_DataBook” fields “BMP_Table1”, “BMP_Table2”,
+    “OBMP_Table1” and “OBMP_Table2”) and actual properties data (see
+    the “Props” field of “TUCA_DataBook”). The index is a 3-level
+    table designed to minimize the overall data size. The
+    properties data contain the actual (used) UCA properties
+    for the customized code points (or sequences of code points),
+    see “TUCA_PropItemRec”.
+    To get the properties record of a code point, one goes
+    through the index data to get its offset into the “Props”
+    serialized data, see the “GetPropUCA” procedure.
+    The “TUCA_PropItemRec” record, which represents the actual
+    properties, contains a fixed part and a variable part. The
+    fixed part is directly expressed as fields of the record :
+      “WeightLength”, “ChildCount”, “Size”, “Flags”. The
+    variable part depends on some values of the fixed part. For
+    example, “WeightLength” specifies the number of weight[1] items,
+    which can be zero or more. The “Flags” field contains
+    bit states that indicate, for example, whether the record’s owner,
+    that is the target code point, is present (it is not always
+    necessary to store the code point, as you are required to have
+    this information in the first place in order to get the
+    “TUCA_PropItemRec” record).
+
+    The data, as it is organized now, is as follows for each code point :
+      * the fixed part is serialized,
+      * if there is a weight item array, it is serialized
+          (see the "WeightLength" field),
+      * the code point is serialized (if needed),
+      * the context[2] array is serialized,
+      * the children[3] records are serialized.
+
+    The “Size” field represents the size of the whole record, including
+    its children records (see [3]). “GetSelfOnlySize” returns the size
+    of the queried record, excluding the size of its children.
+
+
+    Notes :
+
+    [1] : A weight item is an array of 3 words. A code point or sequence of
+          code points may have zero or more items.
+
+    [2] :  There are characters (mostly Japanese ones) that do not have their
+           own weights; they inherit the weights of the preceding character
+           in the string that you will be evaluating.
+    [3] :  Some Unicode characters are expressed using more than one code point.
+           In that case the properties records are serialized as a trie. The
+           trie data structure is useful when many characters’ expressions
+           share the same starting code point(s).
+
+    [4] TUCA_PropItemRec serialization :
+            TUCA_PropItemRec :
+              WeightLength, ChildCount, Size, Flags [weight item array]
+              [Code Point] [Context data]
+              [Child 0] [Child 1] .. [Child n]
+
+        each [Child k] is a TUCA_PropItemRec.
 }

 unit unicodedata;
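
The serialization order described in the comment above can be illustrated with a small stand-alone sketch. It is not taken from unicodedata.pas: the field widths (a 5-byte header with a 2-byte little-endian "Size"), the 3-byte code point, the flag bit value and every routine name are assumptions chosen for illustration, and context data is assumed to be absent. Only the ordering (fixed part, weight items, optional code point, children) and the meaning of "Size" follow the comment.

```pascal
program ucalayoutsketch;
{$mode objfpc}

{ Hypothetical layout constants: NOT the real declarations of unicodedata.pas. }
const
  HEADER_SIZE      = 5;   // assumed: WeightLength(1) ChildCount(1) Size(2) Flags(1)
  WEIGHT_ITEM_SIZE = 6;   // a weight item is 3 words, see note [1]
  CODEPOINT_SIZE   = 3;   // assumed width of a stored code point
  FLAG_CODEPOINT   = $01; // assumed bit: the code point is stored in the record

type
  TBuffer = array of Byte;

{ Writes the fixed part of a record at AOffset (Size is stored little-endian). }
procedure PutHeader(var ABuf : TBuffer; AOffset : Integer;
  AWeightLength, AChildCount : Byte; ASize : Word; AFlags : Byte);
begin
  ABuf[AOffset]     := AWeightLength;
  ABuf[AOffset + 1] := AChildCount;
  ABuf[AOffset + 2] := Lo(ASize);
  ABuf[AOffset + 3] := Hi(ASize);
  ABuf[AOffset + 4] := AFlags;
end;

{ Reads the "Size" field: the whole record, children included. }
function TotalSize(const ABuf : TBuffer; AOffset : Integer) : Word;
begin
  Result := ABuf[AOffset + 2] or (Word(ABuf[AOffset + 3]) shl 8);
end;

{ Size of the record without its children; context data is assumed absent. }
function SelfOnlySize(const ABuf : TBuffer; AOffset : Integer) : Integer;
begin
  Result := HEADER_SIZE + ABuf[AOffset] * WEIGHT_ITEM_SIZE;
  if (ABuf[AOffset + 4] and FLAG_CODEPOINT) <> 0 then
    Result := Result + CODEPOINT_SIZE;
end;

{ Prints a record, then walks its children, which follow its own data. }
procedure DumpRecord(const ABuf : TBuffer; AOffset, ADepth : Integer);
var
  i, child : Integer;
begin
  WriteLn('':ADepth * 2, 'offset=', AOffset,
          ' size=', TotalSize(ABuf, AOffset),
          ' self-only=', SelfOnlySize(ABuf, AOffset));
  child := AOffset + SelfOnlySize(ABuf, AOffset);
  for i := 1 to ABuf[AOffset + 1] do
  begin
    DumpRecord(ABuf, child, ADepth + 1);
    child := child + TotalSize(ABuf, child); // skip the child's whole subtree
  end;
end;

var
  buf : TBuffer;
begin
  SetLength(buf, 45);                        // weight and code point bytes stay zero
  PutHeader(buf, 0, 1, 2, 45, 0);            // parent: 1 weight item, 2 children
  PutHeader(buf, 11, 1, 0, 14, FLAG_CODEPOINT); // child 0: 1 weight item + code point
  PutHeader(buf, 25, 2, 0, 20, FLAG_CODEPOINT); // child 1: 2 weight items + code point
  DumpRecord(buf, 0, 0);
end.
```

In this sketch the parent's own data takes 11 bytes (5 + 6) while its "Size" is 45, because the two children (14 and 20 bytes) are counted in it. Reading bytes by index instead of overlaying a packed record on the buffer avoids unaligned word accesses; the real unit may well do this differently.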