From f6e1c76aa8555963c50ec79297d31046ebed2133 Mon Sep 17 00:00:00 2001
From: paul
Date: Wed, 20 Nov 2013 11:41:50 +0000
Subject: [PATCH] rtl: apply patch of Inoussa which describes the unicode
 collation algorithm data layout (mantis #0025240)

git-svn-id: trunk@26114 -
---
 rtl/objpas/unicodedata.pas | 63 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 63 insertions(+)

diff --git a/rtl/objpas/unicodedata.pas b/rtl/objpas/unicodedata.pas
index fc39f0ad0a..4f9b907815 100644
--- a/rtl/objpas/unicodedata.pas
+++ b/rtl/objpas/unicodedata.pas
@@ -17,6 +17,69 @@
     This program is distributed in the hope that it will be useful,
     but WITHOUT ANY WARRANTY; without even the implied warranty of
     MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+
+-------------------------------------------------------------------------------
+
+    Overview of the Unicode Collation Algorithm (UCA) data layout :
+    =============================================================
+
+    The UCA data (see “TUCA_DataBook”) are organized into index data
+    (see the “TUCA_DataBook” fields “BMP_Table1”, “BMP_Table2”,
+    “OBMP_Table1” and “OBMP_Table2”) and actual properties data (see
+    the “Props” field of “TUCA_DataBook”). The index is a 3-level
+    table designed to minimize the overall data size. The
+    properties data contain the actual (used) UCA properties
+    of the customized code points (or sequences of code points);
+    see “TUCA_PropItemRec”.
+    To get the properties record of a code point, one goes
+    through the index data to get its offset into the “Props”
+    serialized data; see the “GetPropUCA” procedure.
+    The “TUCA_PropItemRec” record, which represents the actual
+    properties, contains a fixed part and a variable part. The
+    fixed part is directly expressed as fields of the record :
+    “WeightLength”, “ChildCount”, “Size”, “Flags”. The
+    variable part depends on some values of the fixed part. For
+    example, “WeightLength” specifies the number of weight[1] items,
+    which can be zero or more. The “Flags” field contains
+    bit states that indicate, for example, whether the record’s owner,
+    that is the target code point, is present (it is not always
+    necessary to store the code point, as you are required to have
+    this information in the first place in order to get the
+    “TUCA_PropItemRec” record).
+
+    The data, as currently organized, are laid out as follows for each code point :
+      * the fixed part is serialized;
+      * if there are weight items, they are serialized
+        (see “WeightLength”);
+      * the code point is serialized (if needed);
+      * the context[2] array is serialized;
+      * the children[3] records are serialized.
+
+    The “Size” field represents the size of the whole record, including its
+    child records (see [3]). “GetSelfOnlySize” returns the size
+    of the queried record, excluding the size of its children.
+
+
+    Notes :
+
+    [1] : A weight item is an array of 3 words. A code point or sequence of code
+          points may have zero or more weight items.
+
+    [2] : There are characters (mostly Japanese ones) that do not have their
+          own weights; they inherit the weights of the preceding character
+          in the string being evaluated.
+    [3] : Some Unicode characters are expressed using more than one code point.
+          In that case the properties records are serialized as a trie. The
+          trie data structure is useful when many characters’ expressions have
+          the same starting code point(s).
+
+    [4] : TUCA_PropItemRec serialization :
+          TUCA_PropItemRec :
+            WeightLength, ChildCount, Size, Flags [weight item array]
+            [Code Point] [Context data]
+            [Child 0] [Child 1] .. [Child n]
+
+          each [Child k] is a TUCA_PropItemRec.
 }
 
 unit unicodedata;
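
The serialization order described in the comment (fixed part, weight items, optional code point, context data, then children) can be illustrated with a small stand-alone Free Pascal sketch. It is not the code of the unit: the names TSketchHeader, FLAG_CODEPOINT, CODEPOINT_SIZE, SelfOnlySize and CountRecords are hypothetical, the field widths and the flag bit are assumptions, and context data is left out entirely. It only shows how a reader could step from a record to its children when "Size" includes the children and the self-only size does not.

```pascal
program uca_layout_sketch;
{$mode objfpc}

type
  // Hypothetical fixed part; the field widths are assumptions of this sketch.
  TSketchHeader = packed record
    WeightLength : Byte;  // number of weight items that follow the header
    ChildCount   : Byte;  // number of child records (trie children)
    Size         : Word;  // size of this record, children included
    Flags        : Byte;  // bit states, e.g. "the code point is stored"
  end;

const
  FLAG_CODEPOINT   = 1;                 // assumed bit: code point is serialized
  WEIGHT_ITEM_SIZE = 3 * SizeOf(Word);  // a weight item is an array of 3 words
  CODEPOINT_SIZE   = 3;                 // assumed: code point stored on 3 bytes

// Size of a record without its children (context data is ignored here).
function SelfOnlySize(const H : TSketchHeader) : Integer;
begin
  Result := SizeOf(TSketchHeader) + H.WeightLength * WEIGHT_ITEM_SIZE;
  if (H.Flags and FLAG_CODEPOINT) <> 0 then
    Result := Result + CODEPOINT_SIZE;
end;

// Write a header into the buffer; the variable part is left zero-filled.
procedure PutHeader(var Buf : array of Byte; Offset : Integer;
  AWeightLength, AChildCount : Byte; ASize : Word; AFlags : Byte);
var
  h : TSketchHeader;
begin
  h.WeightLength := AWeightLength;
  h.ChildCount   := AChildCount;
  h.Size         := ASize;
  h.Flags        := AFlags;
  Move(h, Buf[Offset], SizeOf(h));
end;

// Read the header stored at the given offset.
procedure GetHeader(const Buf : array of Byte; Offset : Integer;
  out H : TSketchHeader);
begin
  Move(Buf[Offset], H, SizeOf(H));
end;

// Count the records of the trie rooted at Offset. The first child starts
// right after the parent's own data; each following child starts after the
// whole previous child, that is after its "Size" bytes.
function CountRecords(const Buf : array of Byte; Offset : Integer) : Integer;
var
  h, c : TSketchHeader;
  i, childOffset : Integer;
begin
  GetHeader(Buf, Offset, h);
  Result := 1;
  childOffset := Offset + SelfOnlySize(h);
  for i := 1 to h.ChildCount do
  begin
    Result := Result + CountRecords(Buf, childOffset);
    GetHeader(Buf, childOffset, c);
    childOffset := childOffset + c.Size;
  end;
end;

var
  Data : array[0..63] of Byte;
  root : TSketchHeader;
begin
  FillChar(Data, SizeOf(Data), 0);
  // Root: 1 weight item, no stored code point, 2 children.
  PutHeader(Data, 0, 1, 2, 39, 0);
  // Two leaf children: 1 weight item and a stored code point each.
  PutHeader(Data, 11, 1, 0, 14, FLAG_CODEPOINT);
  PutHeader(Data, 25, 1, 0, 14, FLAG_CODEPOINT);
  GetHeader(Data, 0, root);
  WriteLn('self-only size : ', SelfOnlySize(root));
  WriteLn('total size     : ', root.Size);
  WriteLn('records in trie: ', CountRecords(Data, 0));
end.
```

In this sketch the root's "Size" is its self-only size plus the sizes of both children, which is the relationship the comment describes between "Size" and "GetSelfOnlySize".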