mirror of https://gitlab.com/freepascal.org/fpc/source.git
synced 2025-08-11 10:26:05 +02:00

+ Initial implementation
This commit is contained in:
parent a1bfbbfeea
commit 78794fe79a

install/man/man1/plex.1 (new file, 350 lines)
.TH plex 1 "10 Jan 2000" FreePascal "Pascal lexical analyzer generator"
.SH NAME
plex - The Pascal Lex lexical analyzer generator.

.SH Usage
.B plex [options] lex-file[.l] [output-file[.pas]]

.SH Options
.TP
.B \-v
.I Verbose:
Lex generates a readable description of the generated
lexical analyzer, written to lex-file with new extension
.I .lst
.TP
.B \-o
.I Optimize:
Lex optimizes DFA tables to produce a minimal DFA.

.SH Description
TP Lex is a program generator that is used to generate the Turbo Pascal source
code for a lexical analyzer subroutine from the specification of an input
language by a regular expression grammar.

TP Lex parses the source grammar contained in lex-file (with default
suffix .l) and writes the constructed lexical analyzer subroutine to the
specified output-file (with default suffix .pas); if no output file is
specified, output goes to lex-file with new suffix .pas. If any errors are
found during compilation, error messages are written to the list file
(lex-file with new suffix .lst).
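
For example, assuming a grammar file named scan.l in the current directory
(the file name is illustrative), a typical invocation might be:

.RS
.nf
plex -v scan.l
.fi
.RE

This writes the generated analyzer to scan.pas and, because of \-v, a
readable description of the constructed analyzer to scan.lst.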

The generated output file contains a lexical analyzer routine, yylex,
implemented as:

.RS
function yylex : Integer;
.RE

This routine has to be called by your main program to execute the lexical
analyzer. The return value of the yylex routine usually denotes the number
of a token recognized by the lexical analyzer (see the return routine in the
LexLib unit). At end-of-file the yylex routine normally returns 0.
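
A minimal driver might look like the following sketch, which assumes the
generated analyzer is included into the program with an $I directive and
simply prints each token number until end-of-file (the file name scan.pas
and the use of an include directive are illustrative):

.RS
.nf
uses LexLib;

{$I scan.pas}  { the Lex-generated yylex routine }

var tok : Integer;
begin
  tok := yylex;
  while tok <> 0 do
  begin
    writeln('token: ', tok);
    tok := yylex;
  end;
end.
.fi
.RE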

The code template for the yylex routine may be found in the yylex.cod
file. This file is needed by TP Lex when it constructs the output file. It
must be present either in the current directory or in the directory from which
TP Lex was executed (TP Lex searches these directories in the indicated
order). (NB: For the Linux/Free Pascal version, the code template is searched
for in a directory defined at compile time instead of along the execution
path, usually /usr/lib/fpc/lexyacc.)

The TP Lex library (LexLib) unit is required by programs using Lex-generated
lexical analyzers; you will therefore have to put an appropriate uses clause
into the program or unit that contains the lexical analyzer routine. The
LexLib unit also provides various useful utility routines; see the file
lexlib.pas for further information.

.SH Lex Source
A TP Lex program consists of three sections separated with the %% delimiter:

definitions
.LP
%%
.LP
rules
.LP
%%
.LP
auxiliary procedures

All sections may be empty. The TP Lex language is line-oriented; definitions
and rules are separated by line breaks. There is no special notation for
comments, but (Turbo Pascal style) comments may be included as Turbo Pascal
fragments (see below).

The definitions section may contain the following elements:

.TP
.B regular definitions in the format:

name substitution

which serve to abbreviate common subexpressions. The {name} notation
causes the corresponding substitution from the definitions section to
be inserted into a regular expression. The name must be a legal
identifier (letter followed by a sequence of letters and digits;
the underscore counts as a letter; upper- and lowercase are distinct).
Regular definitions must be non-recursive.
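
For instance, a definition of the form shown below lets the rules section
match digit sequences without spelling out the character class each time
(the name digit and the token code num are illustrative; num stands for a
constant supplied by your program, and return is the LexLib routine
mentioned above):

.RS
.nf
digit   [0-9]
%%
{digit}+   return(num);
.fi
.RE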

.TP
.B start state definitions in the format:

%start name ...

which are used in specifying start conditions on rules (described
below). The %start keyword may also be abbreviated as %s or %S.

.TP
.B Turbo Pascal declarations enclosed between %{ and %}.

These will be
inserted into the output file (at global scope). Also, any line that
does not look like a Lex definition (e.g., starts with blank or tab)
will be treated as Turbo Pascal code. (In particular, this also allows
you to include Turbo Pascal comments in your Lex program.)

.SH Rules
The rules section of a TP Lex program contains the actual specification of
the lexical analyzer routine. It may be thought of as a big CASE statement
discriminating over the different patterns to be matched and listing the
corresponding statements (actions) to be executed. Each rule consists of a
regular expression describing the strings to be matched in the input, and a
corresponding action, a Turbo Pascal statement to be executed when the
expression matches. Expression and statement are delimited with whitespace
(blanks and/or tabs). Thus the format of a Lex grammar rule is:

.RS
expression	statement;
.RE

Note that the action must be a single Turbo Pascal statement terminated
with a semicolon (use begin ... end for compound statements). The statement
may span multiple lines if the successor lines are indented with at least
one blank or tab. The action may also be replaced by the | character,
indicating that the action for this rule is the same as that for the next
one.
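
For instance, the following sketch gives two patterns the same action (the
token code num again stands for a constant supplied by your program):

.RS
.nf
[0-9]+           |
[0-9]+\\.[0-9]+   return(num);
.fi
.RE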

The TP Lex library unit provides various variables and routines which are
useful in the programming of actions. In particular, the yytext string
variable holds the text of the matched string, and the yyleng Byte variable
its length.
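
For example, an action can inspect the matched text through these variables
(the identifier pattern is illustrative):

.RS
.nf
[a-zA-Z_][a-zA-Z_0-9]*   writeln('id of length ', yyleng, ': ', yytext);
.fi
.RE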

Regular expressions are used to describe the strings to be matched in a
grammar rule. They are built from the usual constructs describing character
classes and sequences, and operators specifying repetitions and alternatives.
The precise format of regular expressions is described in the next section.

The rules section may also start with some Turbo Pascal declarations
(enclosed in %{ %}) which are treated as local declarations of the
actions routine.

Finally, the auxiliary procedures section may contain arbitrary Turbo
Pascal code (such as supporting routines or a main program) which is
simply tacked on to the end of the output file. The auxiliary procedures
section is optional.

.SH Regular Expressions
The following table summarizes the format of the regular expressions
recognized by TP Lex (also compare Aho, Sethi, Ullman 1986, fig. 3.48).
c stands for a single character, s for a string, r for a regular expression,
and n,m for nonnegative integers.

.nf
expression  matches                        example
----------  -----------------------------  -------
c           any non-operator character c   a
\\c          character c literally          \\*
"s"         string s literally             "**"
.           any character but newline      a.*b
^           beginning of line              ^abc
$           end of line                    abc$
[s]         any character in s             [abc]
[^s]        any character not in s         [^abc]
r*          zero or more r's               a*
r+          one or more r's                a+
r?          zero or one r                  a?
r{m,n}      m to n occurrences of r        a{1,5}
r{m}        m occurrences of r             a{5}
r1r2        r1 then r2                     ab
r1|r2       r1 or r2                       a|b
(r)         r                              (a|b)
r1/r2       r1 when followed by r2         a/b
<x>r        r when in start condition x    <x>abc
---------------------------------------------------
.fi

The operators *, +, ? and {} have highest precedence, followed by
concatenation. The | operator has lowest precedence. Parentheses ()
may be used to group expressions and override the default precedences.
The <> and / operators may only occur once in an expression.

The usual C-like escapes are recognized:

.nf
\\n    denotes newline
\\r    denotes carriage return
\\t    denotes tab
\\b    denotes backspace
\\f    denotes form feed
\\NNN  denotes character no. NNN in octal base
.fi

You can also use the \\ character to quote characters which would otherwise
be interpreted as operator symbols. In character classes, you may use
the - character to denote ranges of characters. For instance, [a-z]
denotes the class of all lowercase letters.

The expressions in a TP Lex program may be ambiguous, i.e. there may be inputs
which match more than one rule. In such a case, the lexical analyzer prefers
the longest match and, if it still has the choice between different rules,
it picks the first of these. If no rule matches, the lexical analyzer
executes a default action which consists of copying the input character
to the output unchanged. Thus, if the purpose of a lexical analyzer is
to translate some parts of the input, and leave the rest unchanged, you
only have to specify the patterns which have to be treated specially. If,
however, the lexical analyzer has to absorb its whole input, you will have
to provide rules that match everything. E.g., you might use the rules

.RS
.nf
\&.   |
\\n  ;
.fi
.RE

which match "any other character" (and ignore it).

Sometimes certain patterns have to be analyzed differently depending on some
amount of context in which the pattern appears. In such a case the / operator
is useful. For instance, the expression a/b matches a, but only if followed
by b. Note that the b does not belong to the match; rather, the lexical
analyzer, when matching an a, will look ahead in the input to see whether
it is followed by a b, before it declares that it has matched an a. Such
lookahead may be arbitrarily complex (up to the size of the LexLib input
buffer). E.g., the pattern a/.*b matches an a which is followed by a b
somewhere on the same input line. TP Lex also has a means to specify left
context which is described in the next section.

.SH Start Conditions
TP Lex provides some features which make it possible to handle left context.
The ^ character at the beginning of a regular expression may be used to
denote the beginning of the line. More distant left context can be described
conveniently by using start conditions on rules.

Any rule which is prefixed with the <> construct is only valid if the lexical
analyzer is in the denoted start state. For instance, the expression <x>a
can only be matched if the lexical analyzer is in start state x. You can have
multiple start states in a rule; e.g., <x,y>a can be matched in start states
x or y.

Start states have to be declared in the definitions section by means of
one or more start state definitions (see above). The lexical analyzer enters
a start state through a call to the LexLib routine start. E.g., you may
write:

.RS
.nf
%start x y
%%
<x>a    start(y);
<y>b    start(x);
%%
begin
  start(x); if yylex=0 then ;
end.
.fi
.RE

Upon initialization, the lexical analyzer is put into state x. It then
proceeds in state x until it matches an a which puts it into state y.
In state y it may match a b which puts it into state x again, etc.

Start conditions are useful when certain constructs have to be analyzed
differently depending on some left context (such as a special character
at the beginning of the line), and if multiple lexical analyzers have to
work in concert. If a rule is not prefixed with a start condition, it is
valid in all user-defined start states, as well as in the lexical analyzer's
default start state.

.SH Lex Library
The TP Lex library (LexLib) unit provides various variables and routines
which are used by Lex-generated lexical analyzers and application programs.
It provides the input and output streams and other internal data structures
used by the lexical analyzer routine, and supplies some variables and utility
routines which may be used by actions and application programs. Refer to
the file lexlib.pas for a closer description.

You can also modify the Lex library unit (and/or the code template in the
yylex.cod file) to customize TP Lex to your target applications. E.g.,
you might wish to optimize the code of the lexical analyzer for some
special application, make the analyzer read from/write to memory instead
of files, etc.

.SH Implementation Restrictions
Internal table sizes and the main memory available limit the complexity of
source grammars that TP Lex can handle. There is currently no possibility to
change internal table sizes (apart from modifying the sources of TP Lex
itself), but the maximum table sizes provided by TP Lex seem to be large
enough to handle most realistic applications. The actual table sizes depend on
the particular implementation (they are much larger than the defaults if TP
Lex has been compiled with one of the 32 bit compilers such as Delphi 2 or
Free Pascal), and are shown in the statistics printed by TP Lex when a
compilation is finished. The units given there are "p" (positions, i.e. items
in the position table used to construct the DFA), "s" (DFA states) and "t"
(transitions of the generated DFA).

As implemented, the generated DFA table is stored as a typed array constant
which is inserted into the yylex.cod code template. The transitions in each
state are stored in order. Of course it would have been more efficient to
generate a big CASE statement instead, but I found that this may cause
problems with the encoding of large DFA tables because Turbo Pascal has
a quite rigid limit on the code size of individual procedures. I decided to
use a scheme in which transitions on different symbols to the same state are
merged into one single transition (specifying a character set and the
corresponding next state). This keeps the number of transitions in each state
quite small and still allows fairly efficient access to the transition
table.

The TP Lex program has an option (-o) to optimize DFA tables. This causes a
minimal DFA to be generated, using the algorithm described in Aho, Sethi,
Ullman (1986). Although the absolute limit on the number of DFA states that TP
Lex can handle is at least 300, TP Lex poses an additional restriction (100)
on the number of states in the initial partition of the DFA optimization
algorithm. Thus, you may get a fatal `integer set overflow' message when using
the -o option even when TP Lex is able to generate an unoptimized DFA. In such
cases you will just have to be content with the unoptimized DFA. (Hopefully,
this will be fixed in a future version. Anyhow, using the merged transitions
scheme described above, TP Lex usually constructs unoptimized DFAs which are
not far from being optimal, and thus in most cases DFA optimization won't have
a great impact on DFA table sizes.)

.SH Differences from UNIX Lex
Major differences between TP Lex and UNIX Lex are listed below.

TP Lex produces output code for Turbo Pascal, rather than for C.

Character tables (%T) are not supported; neither are any directives
to determine internal table sizes (%p, %n, etc.).

Library routines are named differently from the UNIX version (e.g.,
the `start' routine takes the place of the `BEGIN' macro of UNIX
Lex), and, of course, all macros of UNIX Lex (ECHO, REJECT, etc.) had
to be implemented as procedures.

The TP Lex library unit starts counting line numbers at 0, incrementing
the count BEFORE a line is read (in contrast, UNIX Lex initializes
yylineno to 1 and increments it AFTER the line end has been read). This
is motivated by the way in which TP Lex maintains the current line,
and will not affect your programs unless you explicitly reset the
yylineno value (e.g., when opening a new input file). In such a case
you should set yylineno to 0 rather than 1.