mirror of
https://gitlab.com/freepascal.org/fpc/source.git
synced 2025-04-08 09:28:19 +02:00
rtl: apply patch of Inoussa which describes the unicode collation algorithm data layout (mantis #0025240)
git-svn-id: trunk@26114 -
parent 21e178d9d2
commit f6e1c76aa8
@ -17,6 +17,69 @@
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

-------------------------------------------------------------------------------

Overview of the Unicode Collation Algorithm (UCA) data layout:
==============================================================

The UCA data (see “TUCA_DataBook”) are organized into index data
(see the “TUCA_DataBook” fields “BMP_Table1”, “BMP_Table2”,
“OBMP_Table1” and “OBMP_Table2”) and actual properties data (see
the “Props” field of “TUCA_DataBook”). The index is a three-level
table designed to minimize the overall data size. The properties
data contain the UCA properties actually used for the customized
code points (or sequences of code points); see “TUCA_PropItemRec”.
To get the properties record of a code point, one goes through
the index data to get its offset into the serialized “Props”
data; see the “GetPropUCA” procedure.
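The lookup described above can be sketched as follows. This is a
simplified two-table illustration, not the actual “GetPropUCA” code:
the record “TSketchBook”, the function “LookupOffset”, the 256-entry
block size and the convention that a stored value of 0 means "no
entry" are all assumptions made for the example.

```pascal
program IndexLookupSketch;
{$mode objfpc}

const
  BLOCK_SIZE = 256; // assumption: one block per high byte

type
  // Hypothetical stand-in for the index and data fields of TUCA_DataBook.
  TSketchBook = record
    Table1 : array[0..255] of Byte; // high byte -> block number
    Table2 : array of Cardinal;     // block * 256 + low byte -> offset + 1 (0 = none)
    Props  : array of Byte;         // serialized property records
  end;

// Returns the offset of the code point's record in Props,
// or -1 when the code point has no entry.
function LookupOffset(const ABook : TSketchBook; const ACodePoint : Word) : Integer;
var
  blockIndex, slot : Cardinal;
begin
  blockIndex := ABook.Table1[Hi(ACodePoint)];
  slot := ABook.Table2[(blockIndex * BLOCK_SIZE) + Lo(ACodePoint)];
  if (slot = 0) then
    Result := -1
  else
    Result := Integer(slot) - 1;
end;

var
  book : TSketchBook;
  i : Integer;
begin
  // Block 0 is an all-empty block shared by every unused high byte.
  for i := 0 to 255 do
    book.Table1[i] := 0;
  book.Table1[$30] := 1;                  // only U+30xx gets its own block
  SetLength(book.Table2, 2 * BLOCK_SIZE); // SetLength zero-fills the new elements
  book.Table2[(1 * BLOCK_SIZE) + $42] := 10 + 1; // U+3042 -> offset 10
  SetLength(book.Props, 32);

  WriteLn(LookupOffset(book, $3042)); // prints 10
  WriteLn(LookupOffset(book, $0041)); // prints -1
end.
```

Sharing one empty block among all unused high bytes is what keeps
such an index small: only ranges that actually carry data cost a
full block in the second table.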
The “TUCA_PropItemRec” record, which represents the actual
properties, contains a fixed part and a variable part. The
fixed part is expressed directly as fields of the record:
“WeightLength”, “ChildCount”, “Size” and “Flags”. The variable
part depends on values in the fixed part. For example,
“WeightLength” specifies the number of weight[1] items, which
may be zero. The “Flags” field contains bit states that
indicate, for example, whether the record's owner (that is, the
target code point) is stored in the record. Storing the code
point is not always necessary, because you must already know it
in order to get the “TUCA_PropItemRec” record in the first place.

The data, as currently organized, are laid out as follows for each
code point:
* the fixed part is serialized;
* the weight item array, if any, is serialized
  (see “WeightLength”);
* the code point is serialized (if needed);
* the context[2] array is serialized;
* the children[3] records are serialized.

“Size” represents the size of the whole record, including its
child records (see [3]). “GetSelfOnlySize” returns the size of
the queried record, excluding the size of its children.
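The difference between the two sizes can be illustrated with a
small in-memory model. The names “TSketchNode”, “SelfOnlySize” and
“TotalSize”, the 5-byte fixed part and the 3-byte code point are
assumptions made for this sketch; only the 3-word weight item is
taken from note [1].

```pascal
program RecordSizeSketch;
{$mode objfpc}

const
  FIXED_PART_SIZE  = 5;                // assumption: 1 + 1 + 2 + 1 bytes
  WEIGHT_ITEM_SIZE = 3 * SizeOf(Word); // note [1]: a weight item is 3 words
  CODE_POINT_SIZE  = 3;                // assumption: 24-bit code point

type
  PSketchNode = ^TSketchNode;
  // Hypothetical in-memory model of one record and its children.
  TSketchNode = record
    WeightLength : Byte;
    HasCodePoint : Boolean;
    ContextSize  : Cardinal;            // size of the context data, 0 if none
    Children     : array of PSketchNode;
  end;

// Size of the record itself: fixed part, weights, code point, context.
function SelfOnlySize(const ANode : TSketchNode) : Cardinal;
begin
  Result := FIXED_PART_SIZE + (ANode.WeightLength * WEIGHT_ITEM_SIZE);
  if ANode.HasCodePoint then
    Result := Result + CODE_POINT_SIZE;
  Result := Result + ANode.ContextSize;
end;

// Size of the record plus all of its descendants (the role of "Size").
function TotalSize(const ANode : TSketchNode) : Cardinal;
var
  i : Integer;
begin
  Result := SelfOnlySize(ANode);
  for i := 0 to High(ANode.Children) do
    Result := Result + TotalSize(ANode.Children[i]^);
end;

var
  root, child : TSketchNode;
begin
  child.WeightLength := 1;
  child.HasCodePoint := True;
  child.ContextSize := 0;

  root.WeightLength := 2;
  root.HasCodePoint := False;
  root.ContextSize := 0;
  SetLength(root.Children, 1);
  root.Children[0] := @child;

  WriteLn(SelfOnlySize(child)); // prints 14 (5 + 6 + 3)
  WriteLn(SelfOnlySize(root));  // prints 17 (5 + 12)
  WriteLn(TotalSize(root));     // prints 31 (17 + 14)
end.
```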

Notes:

[1] : A weight item is an array of 3 words. A code point or sequence of
      code points may have zero or more weight items.

[2] : There are characters (mostly Japanese ones) that do not have their
      own weights; they inherit the weights of the preceding character
      in the string being evaluated.
[3] : Some Unicode characters are expressed using more than one code point.
      In that case the properties records are serialized as a trie. The
      trie data structure is useful when many characters' expressions
      share the same starting code point(s).

[4] : TUCA_PropItemRec serialization:
      TUCA_PropItemRec :
        WeightLength, ChildCount, Size, Flags [weight item array]
        [Code Point] [Context data]
        [Child 0] [Child 1] .. [Child n]

      each [Child k] is a TUCA_PropItemRec.
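A minimal sketch of walking the children of a serialized record,
following the layout in [4]. The byte encoding used here is
hypothetical: a 5-byte header (WeightLength, ChildCount, a
little-endian 16-bit Size, Flags), bit 0 of Flags meaning "code
point stored", 6-byte weight items, a 3-byte little-endian code
point, and no context data. All routine names are illustrative.

```pascal
program TrieWalkSketch;
{$mode objfpc}

const
  HEADER_SIZE      = 5; // assumption: WeightLength, ChildCount, Size (2), Flags
  WEIGHT_ITEM_SIZE = 6; // note [1]: 3 words
  CODE_POINT_SIZE  = 3; // assumption: 24-bit, little-endian
  FLAG_CODE_POINT  = 1; // assumption: bit 0 = code point is stored

type
  TByteBuffer = array of Byte;

function ReadSize(const AData : TByteBuffer; const AOffset : Integer) : Integer;
begin
  Result := AData[AOffset + 2] or (Integer(AData[AOffset + 3]) shl 8);
end;

function HasCodePoint(const AData : TByteBuffer; const AOffset : Integer) : Boolean;
begin
  Result := (AData[AOffset + 4] and FLAG_CODE_POINT) <> 0;
end;

function SelfOnlySize(const AData : TByteBuffer; const AOffset : Integer) : Integer;
begin
  Result := HEADER_SIZE + (AData[AOffset] * WEIGHT_ITEM_SIZE);
  if HasCodePoint(AData, AOffset) then
    Result := Result + CODE_POINT_SIZE;
end;

function ReadCodePoint(const AData : TByteBuffer; const AOffset : Integer) : Cardinal;
var
  p : Integer;
begin
  p := AOffset + HEADER_SIZE + (AData[AOffset] * WEIGHT_ITEM_SIZE);
  Result := AData[p] or (Cardinal(AData[p + 1]) shl 8) or (Cardinal(AData[p + 2]) shl 16);
end;

// Returns the offset of the child of AParent that stores ACodePoint, or -1.
function FindChild(const AData : TByteBuffer; const AParent : Integer;
  const ACodePoint : Cardinal) : Integer;
var
  i, child : Integer;
begin
  child := AParent + SelfOnlySize(AData, AParent); // first child follows the parent's own data
  for i := 1 to AData[AParent + 1] do begin
    if HasCodePoint(AData, child) and (ReadCodePoint(AData, child) = ACodePoint) then
      exit(child);
    child := child + ReadSize(AData, child);       // Size skips the child and its subtree
  end;
  Result := -1;
end;

// Test helper: appends one record with zeroed weights to the buffer.
procedure AppendRecord(var AData : TByteBuffer; const AWeightLength, AChildCount : Byte;
  const ASize : Word; const AHasCodePoint : Boolean; const ACodePoint : Cardinal);
var
  p, n : Integer;
begin
  p := Length(AData);
  n := HEADER_SIZE + (AWeightLength * WEIGHT_ITEM_SIZE);
  if AHasCodePoint then
    n := n + CODE_POINT_SIZE;
  SetLength(AData, p + n); // new bytes are zero-filled
  AData[p] := AWeightLength;
  AData[p + 1] := AChildCount;
  AData[p + 2] := Lo(ASize);
  AData[p + 3] := Hi(ASize);
  if AHasCodePoint then begin
    AData[p + 4] := FLAG_CODE_POINT;
    AData[p + n - 3] := ACodePoint and $FF;
    AData[p + n - 2] := (ACodePoint shr 8) and $FF;
    AData[p + n - 1] := (ACodePoint shr 16) and $FF;
  end;
end;

var
  data : TByteBuffer;
begin
  AppendRecord(data, 1, 2, 39, False, 0);    // parent: 11 bytes + two 14-byte children
  AppendRecord(data, 1, 0, 14, True, $0301); // child 0 at offset 11
  AppendRecord(data, 1, 0, 14, True, $0308); // child 1 at offset 25
  WriteLn(FindChild(data, 0, $0308)); // prints 25
  WriteLn(FindChild(data, 0, $0300)); // prints -1
end.
```

Because each child's “Size” covers its whole subtree, moving to the
next sibling is a single addition, which is what makes the trie
cheap to scan without parsing every descendant.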
}
unit unicodedata;