+ Initial implementation

This commit is contained in:
michael 2000-01-12 23:31:23 +00:00
parent a1bfbbfeea
commit 78794fe79a
2 changed files with 1399 additions and 0 deletions

350
install/man/man1/plex.1 Normal file

@@ -0,0 +1,350 @@
.TH plex 1 "10 Jan 2000" FreePascal "Pascal lexical analyzer generator"
.SH NAME
plex - The Pascal Lex lexical analyzer generator.
.SH Usage
.B plex [options] lex-file[.l] [output-file[.pas]]
.SH Options
.TP
.B \-v
.I Verbose:
Lex generates a readable description of the generated
lexical analyzer, written to lex-file with new extension
.I .lst
.TP
.B \-o
.I Optimize:
Lex optimizes DFA tables to produce a minimal DFA.
.SH Description
TP Lex is a program generator: it produces the Turbo Pascal source
code for a lexical analyzer subroutine from the specification of an input
language by a regular expression grammar.
TP Lex parses the source grammar contained in lex-file (with default suffix
.l) and writes the constructed lexical analyzer subroutine to the specified
output-file (with default suffix .pas); if no output file is specified, output
goes to lex-file with new suffix .pas. If any errors are found during
compilation, error messages are written to the list file (lex-file with new
suffix .lst).
The generated output file contains a lexical analyzer routine, yylex,
implemented as:
.RS
function yylex : Integer;
.RE
This routine has to be called by your main program to execute the lexical
analyzer. The return value of the yylex routine usually denotes the number
of a token recognized by the lexical analyzer (see the return routine in the
LexLib unit). At end-of-file the yylex routine normally returns 0.
The code template for the yylex routine may be found in the yylex.cod
file. This file is needed by TP Lex when it constructs the output file. It
must be present either in the current directory or in the directory from which
TP Lex was executed (TP Lex searches these directories in the indicated
order). (NB: the Linux/Free Pascal version searches for the code template
in a directory defined at compile time, usually /usr/lib/fpc/lexyacc,
instead of the execution path.)
The TP Lex library (LexLib) unit is required by programs using Lex-generated
lexical analyzers; you will therefore have to put an appropriate uses clause
into your program or unit that contains the lexical analyzer routine. The
LexLib unit also provides various useful utility routines; see the file
lexlib.pas for further information.
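.LP
For illustration, a minimal driver program might look as shown below. This
is only a sketch: the program and file names are placeholders, and the way
the generated analyzer is brought into the program (here an include
directive) depends on how you organize your sources. The essential parts
are the uses clause and the loop around yylex.
.RS
.nf
program ScanDemo;
uses LexLib;          { required by the generated analyzer }

{$I scandemo.pas}     { hypothetical: the file generated by TP Lex }

begin
  { run the analyzer until it reports end-of-file (0) }
  while yylex <> 0 do
    ;
end.
.fi
.RE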
.SH Lex Source
A TP Lex program consists of three sections separated with the %% delimiter:
.LP
definitions
.LP
%%
.LP
rules
.LP
%%
.LP
auxiliary procedures
All sections may be empty. The TP Lex language is line-oriented; definitions
and rules are separated by line breaks. There is no special notation for
comments, but (Turbo Pascal style) comments may be included as Turbo Pascal
fragments (see below).
The definitions section may contain the following elements:
.TP
.B regular definitions in the format:
name substitution
which serve to abbreviate common subexpressions. The {name} notation
causes the corresponding substitution from the definitions section to
be inserted into a regular expression. The name must be a legal
identifier (letter followed by a sequence of letters and digits;
the underscore counts as a letter; upper- and lowercase are distinct).
Regular definitions must be non-recursive.
.TP
.B start state definitions in the format:
%start name ...
which are used in specifying start conditions on rules (described
below). The %start keyword may also be abbreviated as %s or %S.
.TP
.B Turbo Pascal declarations enclosed between %{ and %}.
These will be
inserted into the output file (at global scope). Also, any line that
does not look like a Lex definition (e.g., starts with blank or tab)
will be treated as Turbo Pascal code. (In particular, this also allows
you to include Turbo Pascal comments in your Lex program.)
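.LP
A small definitions section combining these elements might read as follows
(all names are illustrative):
.RS
.nf
D      [0-9]
L      [a-zA-Z_]
%start COMMENT
%{
var parenLevel : Integer;  { Turbo Pascal declaration, global scope }
%}
.fi
.RE
With these definitions, a pattern such as {L}({L}|{D})* in the rules
section is expanded to [a-zA-Z_]([a-zA-Z_]|[0-9])*.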
.SH Rules
The rules section of a TP Lex program contains the actual specification of
the lexical analyzer routine. It may be thought of as a big CASE statement
discriminating over the different patterns to be matched and listing the
corresponding statements (actions) to be executed. Each rule consists of a
regular expression describing the strings to be matched in the input, and a
corresponding action, a Turbo Pascal statement to be executed when the
expression matches. Expression and statement are delimited with whitespace
(blanks and/or tabs). Thus the format of a Lex grammar rule is:
expression statement;
Note that the action must be a single Turbo Pascal statement terminated
with a semicolon (use begin ... end for compound statements). The statement
may span multiple lines if the successor lines are indented with at least
one blank or tab. The action may also be replaced by the | character,
indicating that the action for this rule is the same as that for the next
one.
The TP Lex library unit provides various variables and routines which are
useful in the programming of actions. In particular, the yytext string
variable holds the text of the matched string, and the yyleng Byte variable
its length.
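.LP
For instance, the following rules recognize identifiers and skip blanks
and tabs; id_token stands for a hypothetical token code of your program:
.RS
.nf
[A-Za-z_][A-Za-z0-9_]*  begin
                          writeln(yytext, ' (', yyleng, ' chars)');
                          return(id_token);  { id_token: your constant }
                        end;
[ \\t]+                  ;
.fi
.RE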
Regular expressions are used to describe the strings to be matched in a
grammar rule. They are built from the usual constructs describing character
classes and sequences, and operators specifying repetitions and alternatives.
The precise format of regular expressions is described in the next section.
The rules section may also start with some Turbo Pascal declarations
(enclosed in %{ %}) which are treated as local declarations of the
actions routine.
Finally, the auxiliary procedures section may contain arbitrary Turbo
Pascal code (such as supporting routines or a main program) which is
simply tacked on to the end of the output file. The auxiliary procedures
section is optional.
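.LP
Putting the three sections together, here is a complete (if contrived)
TP Lex program which copies its input to its output while converting
lowercase letters to uppercase. It relies on the default action to copy
all other characters, and assumes that yyoutput is the LexLib output file:
.RS
.nf
%%
[a-z]   write(yyoutput, UpCase(yytext[1]));
%%
begin
  if yylex=0 then ;
end.
.fi
.RE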
.SH Regular Expressions
The following table summarizes the format of the regular expressions
recognized by TP Lex (also compare Aho, Sethi, Ullman 1986, fig. 3.48).
c stands for a single character, s for a string, r for a regular expression,
and n,m for nonnegative integers.
expression matches example
---------- ---------------------------- -------
c any non-operator character c a
\\c character c literally \\*
"s" string s literally "**"
. any character but newline a.*b
^ beginning of line ^abc
$ end of line abc$
[s] any character in s [abc]
[^s] any character not in s [^abc]
r* zero or more r's a*
r+ one or more r's a+
r? zero or one r a?
r{m,n} m to n occurrences of r a{1,5}
r{m} m occurrences of r a{5}
r1r2 r1 then r2 ab
r1|r2 r1 or r2 a|b
(r) r (a|b)
r1/r2 r1 when followed by r2 a/b
<x>r r when in start condition x <x>abc
---------------------------------------------------
The operators *, +, ? and {} have highest precedence, followed by
concatenation. The | operator has lowest precedence. Parentheses ()
may be used to group expressions and override default precedences.
The <> and / operators may only occur once in an expression.
The usual C-like escapes are recognized:
\\n denotes newline
\\r denotes carriage return
\\t denotes tab
\\b denotes backspace
\\f denotes form feed
\\NNN denotes character no. NNN in octal base
You can also use the \\ character to quote characters which would otherwise
be interpreted as operator symbols. In character classes, you may use
the - character to denote ranges of characters. For instance, [a-z]
denotes the class of all lowercase letters.
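.LP
For instance, the following rules use quoting, an escaped operator
character, a character range and the | action; the token codes addop and
hexnum are placeholders:
.RS
.nf
"+"        |
\\-         return(addop);
[a-f0-9]+  return(hexnum);
\\n         ;
.fi
.RE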
The expressions in a TP Lex program may be ambiguous, i.e., there may be inputs
which match more than one rule. In such a case, the lexical analyzer prefers
the longest match and, if it still has the choice between different rules,
it picks the first of these. If no rule matches, the lexical analyzer
executes a default action which consists of copying the input character
to the output unchanged. Thus, if the purpose of a lexical analyzer is
to translate some parts of the input, and leave the rest unchanged, you
only have to specify the patterns which have to be treated specially. If,
however, the lexical analyzer has to absorb its whole input, you will have
to provide rules that match everything. E.g., you might use the rules
. |
\\n ;
which match "any other character" (and ignore it).
Sometimes certain patterns have to be analyzed differently depending on some
amount of context in which the pattern appears. In such a case the / operator
is useful. For instance, the expression a/b matches a, but only if followed
by b. Note that the b does not belong to the match; rather, the lexical
analyzer, when matching an a, will look ahead in the input to see whether
it is followed by a b, before it declares that it has matched an a. Such
lookahead may be arbitrarily complex (up to the size of the LexLib input
buffer). E.g., the pattern a/.*b matches an a which is followed by a b
somewhere on the same input line. TP Lex also has a means to specify left
context which is described in the next section.
.SH Start Conditions
TP Lex provides some features which make it possible to handle left context.
The ^ character at the beginning of a regular expression may be used to
denote the beginning of the line. More distant left context can be described
conveniently by using start conditions on rules.
Any rule which is prefixed with the <> construct is only valid if the lexical
analyzer is in the denoted start state. For instance, the expression <x>a
can only be matched if the lexical analyzer is in start state x. You can have
multiple start states in a rule; e.g., <x,y>a can be matched in start states
x or y.
Start states have to be declared in the definitions section by means of
one or more start state definitions (see above). The lexical analyzer enters
a start state through a call to the LexLib routine start. E.g., you may
write:
%start x y
%%
<x>a start(y);
<y>b start(x);
%%
begin
start(x); if yylex=0 then ;
end.
Upon initialization, the lexical analyzer is put into state x. It then
proceeds in state x until it matches an a which puts it into state y.
In state y it may match a b which puts it into state x again, etc.
Start conditions are useful when certain constructs have to be analyzed
differently depending on some left context (such as a special character
at the beginning of the line), and if multiple lexical analyzers have to
work in concert. If a rule is not prefixed with a start condition, it is
valid in all user-defined start states, as well as in the lexical analyzer's
default start state.
.SH Lex Library
The TP Lex library (LexLib) unit provides various variables and routines
which are used by Lex-generated lexical analyzers and application programs.
It provides the input and output streams and other internal data structures
used by the lexical analyzer routine, and supplies some variables and utility
routines which may be used by actions and application programs. Refer to
the file lexlib.pas for a closer description.
You can also modify the Lex library unit (and/or the code template in the
yylex.cod file) to customize TP Lex to your target applications. E.g.,
you might wish to optimize the code of the lexical analyzer for some
special application, make the analyzer read from/write to memory instead
of files, etc.
.SH Implementation Restrictions
Internal table sizes and the main memory available limit the complexity of
source grammars that TP Lex can handle. There is currently no way to
change internal table sizes (apart from modifying the sources of TP Lex
itself), but the maximum table sizes provided by TP Lex seem to be large
enough to handle most realistic applications. The actual table sizes depend on
the particular implementation (they are much larger than the defaults if TP
Lex has been compiled with one of the 32 bit compilers such as Delphi 2 or
Free Pascal), and are shown in the statistics printed by TP Lex when a
compilation is finished. The units given there are "p" (positions, i.e. items
in the position table used to construct the DFA), "s" (DFA states) and "t"
(transitions of the generated DFA).
As implemented, the generated DFA table is stored as a typed array constant
which is inserted into the yylex.cod code template. The transitions in each
state are stored in order. Of course it would have been more efficient to
generate a big CASE statement instead, but I found that this may cause
problems with the encoding of large DFA tables because Turbo Pascal has
a quite rigid limit on the code size of individual procedures. I decided to
use a scheme in which transitions on different symbols to the same state are
merged into one single transition (specifying a character set and the
corresponding next state). This keeps the number of transitions in each state
quite small and still allows a fairly efficient access to the transition
table.
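.LP
The following declarations sketch one possible Pascal representation of
such a merged-transition table; they are only illustrative, the actual
declarations are found in the yylex.cod template:
.RS
.nf
type
  TransRec = record
    cc   : set of Char;  { characters labelling the transition }
    next : Integer;      { target DFA state }
  end;
const
  { hypothetical fragment: letters keep state 1, digits go to state 2 }
  yyTable : array [1..2] of TransRec = (
    (cc: ['A'..'Z','a'..'z']; next: 1),
    (cc: ['0'..'9'];          next: 2));
.fi
.RE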
The TP Lex program has an option (-o) to optimize DFA tables. This causes a
minimal DFA to be generated, using the algorithm described in Aho, Sethi,
Ullman (1986). Although the absolute limit on the number of DFA states that TP
Lex can handle is at least 300, TP Lex poses an additional restriction (100)
on the number of states in the initial partition of the DFA optimization
algorithm. Thus, you may get a fatal `integer set overflow' message when using
the -o option even when TP Lex is able to generate an unoptimized DFA. In such
cases you will just have to be content with the unoptimized DFA. (Hopefully,
this will be fixed in a future version. Anyhow, using the merged transitions
scheme described above, TP Lex usually constructs unoptimized DFAs which are
not far from being optimal, and thus in most cases DFA optimization won't have
a great impact on DFA table sizes.)
.SH Differences from UNIX Lex
Major differences between TP Lex and UNIX Lex are listed below.
.IP \-
TP Lex produces output code for Turbo Pascal, rather than for C.
.IP \-
Character tables (%T) are not supported; neither are any directives
to determine internal table sizes (%p, %n, etc.).
.IP \-
Library routines are named differently from the UNIX version (e.g.,
the `start' routine takes the place of the `BEGIN' macro of UNIX
Lex), and, of course, all macros of UNIX Lex (ECHO, REJECT, etc.) had
to be implemented as procedures.
.IP \-
The TP Lex library unit starts counting line numbers at 0, incrementing
the count BEFORE a line is read (in contrast, UNIX Lex initializes
yylineno to 1 and increments it AFTER the line end has been read). This
is motivated by the way in which TP Lex maintains the current line,
and will not affect your programs unless you explicitly reset the
yylineno value (e.g., when opening a new input file). In such a case
you should set yylineno to 0 rather than 1.
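.LP
For example, when switching the analyzer to a new input file you might
write (assuming yyinput is the LexLib input file variable; the file name
is a placeholder):
.RS
.nf
assign(yyinput, 'next.in');  { hypothetical input file }
reset(yyinput);
yylineno := 0;               { 0, not 1: the count is incremented
                               BEFORE each line is read }
.fi
.RE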

1049
install/man/man1/pyacc.1 Normal file

File diff suppressed because it is too large.