mirror of
https://gitlab.com/freepascal.org/fpc/source.git
synced 2025-04-05 04:58:00 +02:00
215 lines
7.7 KiB
Plaintext
215 lines
7.7 KiB
Plaintext
CONCISE REPORT ON THE ALGORITHMS IN SIM 970623
|
||
|
||
|
||
|
||
INTRODUCTION
|
||
|
||
The general outline of the similarity checker is as follows:
|
||
|
||
1. the files are read in (pass 1)
|
||
2. a forward-reference table is prepared
|
||
3. the set of interesting runs is determined
|
||
4. the line numbers of the runs are determined (pass 2)
|
||
5. the contents of the runs are printed in order (pass 3)
|
||
|
||
To keep the memory requirements (relatively) small, the exact positions
|
||
of the tokens are not recorded. This necessitates pass 2. See, however,
|
||
the pertinent chapter.
|
||
|
||
|
||
READING THE FILES
|
||
|
||
Each file is tokenized using an lex-generated scanner appropriate for
|
||
the input. Each token fits in one byte, possibly using all 8 bits. The
|
||
tokens are stored in the array TokenArray[], which is extended by
|
||
reallocation if it overflows. See tokenarray.c.
|
||
|
||
Also, to optimize away pass 2, an attempt is made to remember the token
|
||
positions of all beginnings of lines. The token-positions at BOL are
|
||
stored in the array nl_buff[], which is also extended by reallocation,
|
||
if needed. If the attempt fails due to lack of memory, nl_buff[] is
|
||
abandoned, and pass2 will read the files instead.
|
||
|
||
|
||
PREPARING THE FORWARD-REFERENCE TABLE
|
||
|
||
Text is compared by comparing every substring to all substrings
|
||
to the right of it; this process is in essence quadratic. However,
|
||
only substrings of length at least 'MinRunSize' are of interest,
|
||
which gives us the possibility to speed up this process by using
|
||
a hash table.
|
||
|
||
Once the entire text has been read in, a forward-reference table
|
||
forward_references[] is made (see hash.c).
|
||
For every position in the text, we construct an index which gives
|
||
the next position in the text where a run of MinRunSize tokens
|
||
starts that has the same hash code. If there is no such run, the
|
||
index is 0.
|
||
|
||
To fill in this array, we use a hash table last_index[], such that
|
||
last_index[i] is the index of the latest token with hash_code i, or 0 if
|
||
there is none. If at a given position p, we find that the text ahead of
|
||
us has hash code i, last_index[i] tells us which position in
|
||
forward_references[] will have to be updated to p.
|
||
See MakeForwardReferences().
|
||
|
||
For long text sequences (say hundreds of thousands of tokens), the
|
||
hashing is not really efficient any more since too many spurious matches
|
||
occur. Therefore, the forward reference table is scanned a second time,
|
||
eliminating from any chain all references to runs that do not start with
|
||
and end in the same token (actually this is a second hash code).
|
||
For the UNIX manuals this reduced the number of matches from 91.9% to 1.9%
|
||
(of which 0.06% was genuine).
|
||
|
||
DETERMINING THE SET OF INTERESTING RUNS
|
||
|
||
The overall structure of the routine Compare() (see compare.c) is:
|
||
|
||
for all new files
|
||
for all texts it must be compared to
|
||
for all positions in the new file
|
||
for all positions in the text
|
||
for ever increasing sizes
|
||
try to match and keep the best
|
||
|
||
If for a given position in the new file a good run (i.e. on of at least
|
||
minimum length) has been found, the run is registered using a call of
|
||
add_run(), the run is skipped in the new file and searching continues at
|
||
the position after it. This prevents duplicate reports of runs.
|
||
|
||
Add_run() allocates a struct run for the run (see sim.h)
|
||
which contains two struct chunks and a quality description. It fills
|
||
in the two chunks with the pertinent info, one for the first file and
|
||
one for the second (which may be the same, if the run relates two chunks
|
||
in the same file).
|
||
|
||
The run is then entered into the arbitrary-in-sorted-out store AISO (see
|
||
aiso.spc and aiso.bdy, a genuine generic abstract data type in C!), in
|
||
which it is inserted according to its quality. Both positions
|
||
(struct position) in both chunks in the run (so four in total) are each
|
||
entered in a linked list starting at the tx_pos field in the struct text
|
||
of the appropriate file.
|
||
|
||
When this is finished, the forward reference table can be deleted.
|
||
|
||
So the final results of this phase are visible both through the tx_pos
|
||
fields and through the aiso interface.
|
||
|
||
|
||
DETERMINING THE EXACT POSITION OF EACH RUN (PASS 2)
|
||
|
||
The purpose of this pass is to find for each chunk, which up to now is
|
||
known by token position only, its starting and ending line number (which
|
||
cannot be easily derived from the token position).
|
||
|
||
For each file that has a non-zero tx_pos field, ie. that has some
|
||
interesting chunks, the positions in the tx_pos list are sorted on
|
||
ascending line number (they have been found in essentially arbitrary
|
||
order) by sort_pos() in pass2.c.
|
||
|
||
Next we scan the pos list and the file in parallel, updating the info in
|
||
a position when we meet it. A position carries an indication whether it
|
||
is a starting or an ending position, since slightly differing
|
||
calculations have to be done in each case.
|
||
|
||
Actually, if the nl_buff[] data structure still exists, the file is not
|
||
accessed at all and the data from nl_buff[] is used instead. This is
|
||
done transparently in buff.c.
|
||
|
||
|
||
PRINTING THE CONTENTS OF THE RUNS (PASS 3)
|
||
|
||
Since each struct run has now been completely filled in, this is simple;
|
||
the hard work is calculating the page layout.
|
||
Pass3() accesses the aiso store and retrieves from it the runs in
|
||
descending order of importance. Show_run() opens both files, positions
|
||
them using the line numbers and prints the runs.
|
||
|
||
================================================================
|
||
CODE EXCERPT OF THE SOFTWARE SIMILARITY TESTER SIM (980222)
|
||
|
||
sim:
|
||
get command line options
|
||
check the options
|
||
|
||
init language, to precompute tables
|
||
|
||
pass1, read the files
|
||
# there is an array TokenArray[] that holds all input tokens
|
||
|
||
make forward reference table
|
||
# there is an array forward_references[], with one entry for
|
||
# each token in the input; forward_references[i] gives the
|
||
# token number where a token sequence starts with the same
|
||
# hash value as the one starting at i
|
||
|
||
compare various files to find runs
|
||
delete forward reference table
|
||
pass2, find newline positions of found similarities
|
||
pass3, print the similarities
|
||
|
||
|
||
|
||
pass1, read the files:
|
||
for each file
|
||
divide the text into tokens
|
||
store all tokens except newlines in TokenArray and try to
|
||
keep a record of the newline positions
|
||
|
||
|
||
|
||
make forward reference table:
|
||
# there are two independent hash functions, hash1() and hash2().
|
||
# hash1(i) gives the hash value of the token sequence starting at i
|
||
# likewise for hash2(i)
|
||
|
||
set up the forward references using the last_index table:
|
||
# there is an array last_index[], with one entry for each
|
||
# possible hash value; last_index[i] gives the position in
|
||
# forward_references[] at which i was most recently
|
||
# encountered as a hash value
|
||
for each file
|
||
for all positions in file except the last MinRunSize
|
||
set forward_references[] and update last_index[]
|
||
|
||
use hash2() to clean out matches:
|
||
for all tokens
|
||
find first token in chain with same hash2 code
|
||
short-circuit forward reference to it
|
||
|
||
|
||
|
||
compare:
|
||
for all new files
|
||
for all texts it must be compared to
|
||
for all positions in the new file
|
||
for all positions in the text
|
||
for ever increasing sizes
|
||
try to match and keep the best
|
||
try to match and keep the best:
|
||
# using forward_references[], we find a list of positions in
|
||
# which a matching token sequence will start;
|
||
# scanning this list, we measure the maximum length of the
|
||
# match and add the longest match to the run collection
|
||
|
||
|
||
|
||
pass2, find positions of found runs:
|
||
for all files:
|
||
sort the positions in the runs
|
||
|
||
# we scan the pos list and the file in parallel
|
||
for all positions inside this file
|
||
if it matches a token position in a run
|
||
record line number
|
||
|
||
|
||
|
||
pass3, print the similarities:
|
||
for all runs
|
||
# a run consists of two chunks
|
||
open the files that hold the chunks and position them
|
||
at the beginning of the chunk
|
||
display the chunks
|
||
|