rtl: apply patch of Inoussa which describes the unicode collation algorithm data layout (mantis #0025240)

git-svn-id: trunk@26114 -
paul 2013-11-20 11:41:50 +00:00
parent 21e178d9d2
commit f6e1c76aa8

@@ -17,6 +17,69 @@
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
-------------------------------------------------------------------------------
Overview of the Unicode Collation Algorithm (UCA) data layout :
===============================================================
The UCA data (see “TUCA_DataBook”) are organized into index data
(see the “TUCA_DataBook” fields “BMP_Table1”, “BMP_Table2”,
“OBMP_Table1” and “OBMP_Table2”) and actual properties data (see
the “Props” field of “TUCA_DataBook”). The index is a three-level
table designed to minimize the overall data size. The
properties data contain the actual (used) UCA properties
for the customized code points (or sequences of code points);
see “TUCA_PropItemRec”.
To get the properties record of a code point, one goes
through the index data to get its offset into the “Props”
serialized data; see the “GetPropUCA” procedure.
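The lookup described above can be sketched as follows. This is a
minimal illustration and not the code of “GetPropUCA”: the table
sizes, the block size, the "offset + 1, 0 means no record" encoding
and the name “GetPropOffset” are all assumptions made for the
example, and only two table levels are shown.

```pascal
program IndexLookupSketch;

// Hypothetical miniature index: 16 code points in blocks of 4.
const
  BLOCK_SIZE = 4;

  // First table: one entry per block, naming the row of Table2 that
  // describes the block. Blocks 0, 1 and 3 hold no customized code
  // point, so they all share the empty row 0.
  Table1 : array[0..3] of Byte = (0, 0, 1, 0);

  // Second table: for each position in a block, the offset of the
  // properties record inside the serialized data, stored as
  // offset + 1 so that 0 can mean "no record" (an assumption).
  Table2 : array[0..1, 0..BLOCK_SIZE - 1] of Word =
    ((0, 0, 0, 0),
     (1, 0, 6, 0));

// Returns the offset of the record of ACodePoint, or -1 if none.
function GetPropOffset(ACodePoint : Cardinal) : LongInt;
var
  row : Byte;
begin
  row := Table1[ACodePoint div BLOCK_SIZE];
  GetPropOffset := LongInt(Table2[row, ACodePoint mod BLOCK_SIZE]) - 1;
end;

begin
  // Code points 8 and 10 have records; 1 and 13 fall in empty blocks.
  if (GetPropOffset(8) <> 0) or (GetPropOffset(10) <> 5) or
     (GetPropOffset(1) <> -1) or (GetPropOffset(13) <> -1) then
    Halt(1);
  WriteLn('offset of code point 10 = ', GetPropOffset(10));
end.
```

Sharing one empty row between all blocks without customized code
points is what keeps such an index small.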
The “TUCA_PropItemRec” record, which represents the actual
properties, contains a fixed part and a variable part. The
fixed part is directly expressed as fields of the record :
“WeightLength”, “ChildCount”, “Size”, “Flags”. The
variable part depends on some values of the fixed part. For
example, “WeightLength” specifies the number of weight[1] items,
which may be zero or more. The “Flags” field contains bit
flags that indicate, for example, whether the record's owner,
that is the target code point, is stored (it is not always
necessary to store the code point, since you must already have
this information in order to get the
“TUCA_PropItemRec” record).
The data, as currently organized, are laid out as follows for each
code point :
* the fixed part is serialized,
* if there are weight items, they are serialized
(see “WeightLength”),
* the code point is serialized (if needed),
* the context[2] array is serialized,
* the children[3] records are serialized.
The “Size” field represents the size of the whole record, including
its children records (see [3]). “GetSelfOnlySize” returns the size
of the queried record, excluding the size of its children.
Notes :
[1] : A weight item is an array of 3 words. A code point or sequence of
code points may have zero or more weight items.
[2] : There are characters (mostly Japanese ones) that do not have their
own weights; they inherit the weights of the preceding character
in the string being evaluated.
[3] : Some Unicode characters are expressed using more than one code point.
In that case the properties records are serialized as a trie. The
trie data structure is useful when many character expressions share
the same starting code point(s).
[4] : TUCA_PropItemRec serialization :
TUCA_PropItemRec :
WeightLength, ChildCount, Size, Flags [weight item array]
[Code Point] [Context data]
[Child 0] [Child 1] .. [Child n]
each [Child k] is a TUCA_PropItemRec.
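The relation between “Size”, the self-only size and the position of
the children can be sketched as follows. This is a simplified
illustration, not the unit's implementation: the byte order of
“Size”, the 3-byte width of a stored code point, the flag bit and
the names “TotalSize”, “SelfOnlySize” and “ChildOffset” are
assumptions, and the context data is left out.

```pascal
program PropRecordWalkSketch;

const
  HEADER_SIZE     = 5;  // WeightLength, ChildCount, Size (2 bytes), Flags
  WEIGHT_SIZE     = 6;  // one weight item = 3 words
  CODE_POINT_SIZE = 3;  // assumed width of a stored code point
  FLAG_HAS_CP     = 1;  // hypothetical "code point is stored" bit

  // One parent (1 weight item, 2 children) followed by its children
  // (no weights, stored code point).
  Data : array[0..26] of Byte = (
    1, 2, 27, 0, 0,          // parent header, Size = 27
    1, 0, 2, 0, 3, 0,        // parent weight item
    0, 0, 8, 0, 1,  1, 3, 0, // child 0: header, code point
    0, 0, 8, 0, 1,  8, 3, 0  // child 1: header, code point
  );

// Size of the record at AOffset including its children.
function TotalSize(AOffset : Integer) : Integer;
begin
  TotalSize := Data[AOffset + 2] + 256 * Data[AOffset + 3];
end;

// Size of the record at AOffset without its children.
function SelfOnlySize(AOffset : Integer) : Integer;
var
  n : Integer;
begin
  n := HEADER_SIZE + Data[AOffset] * WEIGHT_SIZE;
  if (Data[AOffset + 4] and FLAG_HAS_CP) <> 0 then
    n := n + CODE_POINT_SIZE;
  SelfOnlySize := n;
end;

// Offset of child number AIndex (0-based) of the record at AOffset.
function ChildOffset(AOffset, AIndex : Integer) : Integer;
var
  p, k : Integer;
begin
  p := AOffset + SelfOnlySize(AOffset);  // first child follows the parent
  for k := 1 to AIndex do
    p := p + TotalSize(p);               // skip a whole sibling subtree
  ChildOffset := p;
end;

begin
  if (SelfOnlySize(0) <> 11) or (TotalSize(0) <> 27) or
     (ChildOffset(0, 0) <> 11) or (ChildOffset(0, 1) <> 19) then
    Halt(1);
  WriteLn('children at offsets ', ChildOffset(0, 0), ' and ',
          ChildOffset(0, 1));
end.
```

Because each child's “Size” covers its own subtree, a sibling can be
skipped in one step without visiting its descendants.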
}
unit unicodedata;