.TH plex 1 "10 Jan 2000" FreePascal "Pascal lexical analyzer generator"
.SH NAME
plex \- the Pascal Lex lexical analyzer generator
.SH Usage
.B plex [options] lex-file[.l] [output-file[.pas]]
.SH Options
.TP
.B \-v
.I Verbose:
Lex generates a readable description of the generated
lexical analyzer, written to lex-file with the new extension
.I .lst
.TP
.B \-o
.I Optimize:
Lex optimizes DFA tables to produce a minimal DFA.
.SH Description
TP Lex is a program generator that is used to generate the Turbo Pascal
source code for a lexical analyzer subroutine from the specification of
an input language by a regular expression grammar.

TP Lex parses the source grammar contained in lex-file (with default
suffix .l) and writes the constructed lexical analyzer subroutine to the
specified output-file (with default suffix .pas); if no output file is
specified, output goes to lex-file with the new suffix .pas. If any
errors are found during compilation, error messages are written to the
list file (lex-file with the new suffix .lst).

The generated output file contains a lexical analyzer routine, yylex,
implemented as:

.RS
function yylex : Integer;
.RE

This routine has to be called by your main program to execute the
lexical analyzer. The return value of the yylex routine usually denotes
the number of a token recognized by the lexical analyzer (see the return
routine in the LexLib unit). At end-of-file the yylex routine normally
returns 0.

The code template for the yylex routine may be found in the file
yylex.cod. This file is needed by TP Lex when it constructs the output
file. It must be present either in the current directory or in the
directory from which TP Lex was executed (TP Lex searches these
directories in the indicated order). (NB: for the Linux/Free Pascal
version, the code template is searched for in a directory defined at
compile time instead of the execution path, usually
/usr/lib/fpc/lexyacc.)

The TP Lex library (LexLib) unit is required by programs using
Lex-generated lexical analyzers; you will therefore have to put an
appropriate uses clause into the program or unit that contains the
lexical analyzer routine. The LexLib unit also provides various useful
utility routines; see the file lexlib.pas for further information.
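As a minimal sketch of the calling convention (Process and the token
handling are purely illustrative and not part of LexLib), a driver
fragment might look like this:

.RS
.nf
{ hypothetical token loop; only yylex itself comes from the
  generated analyzer and LexLib }
var tok : Integer;
begin
  tok := yylex;
  while tok <> 0 do    { yylex returns 0 at end-of-file }
  begin
    Process(tok);      { act on the token number }
    tok := yylex;
  end;
end;
.fi
.RE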
.SH Lex Source
A TP Lex program consists of three sections separated with the %%
delimiter:

.LP
definitions
.LP
%%
.LP
rules
.LP
%%
.LP
auxiliary procedures

All sections may be empty. The TP Lex language is line-oriented;
definitions and rules are separated by line breaks. There is no special
notation for comments, but (Turbo Pascal style) comments may be included
as Turbo Pascal fragments (see below).

The definitions section may contain the following elements:

.TP
.B regular definitions in the format:

name substitution

which serve to abbreviate common subexpressions. The {name} notation
causes the corresponding substitution from the definitions section to
be inserted into a regular expression. The name must be a legal
identifier (a letter followed by a sequence of letters and digits;
the underscore counts as a letter; upper- and lowercase are distinct).
Regular definitions must be non-recursive.
.TP
.B start state definitions in the format:

%start name ...

which are used in specifying start conditions on rules (described
below). The %start keyword may also be abbreviated as %s or %S.
.TP
.B Turbo Pascal declarations enclosed between %{ and %}.
These will be inserted into the output file (at global scope). Also,
any line that does not look like a Lex definition (e.g., one that
starts with a blank or tab) will be treated as Turbo Pascal code. (In
particular, this also allows you to include Turbo Pascal comments in
your Lex program.)
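By way of illustration (these particular names are invented for the
example), a definitions section combining all three kinds of elements
might look like this:

.RS
.nf
%{
var count : Integer;    { Turbo Pascal declaration, copied verbatim }
%}
%start COMMENT
digit    [0-9]
letter   [A-Za-z_]
.fi
.RE

A rule may then refer to {digit} or {letter}, and %start COMMENT makes
the start state COMMENT available for use in rules.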
.SH Rules
The rules section of a TP Lex program contains the actual specification
of the lexical analyzer routine. It may be thought of as a big CASE
statement discriminating over the different patterns to be matched and
listing the corresponding statements (actions) to be executed. Each rule
consists of a regular expression describing the strings to be matched in
the input, and a corresponding action, a Turbo Pascal statement to be
executed when the expression matches. Expression and statement are
delimited with whitespace (blanks and/or tabs). Thus the format of a Lex
grammar rule is:

.RS
expression statement;
.RE

Note that the action must be a single Turbo Pascal statement terminated
with a semicolon (use begin ... end for compound statements). The
statement may span multiple lines if the successor lines are indented
with at least one blank or tab. The action may also be replaced by the |
character, indicating that the action for this rule is the same as that
for the next one.

The TP Lex library unit provides various variables and routines which
are useful in the programming of actions. In particular, the yytext
string variable holds the text of the matched string, and the yyleng
Byte variable its length.

Regular expressions are used to describe the strings to be matched in a
grammar rule. They are built from the usual constructs describing
character classes and sequences, and operators specifying repetitions
and alternatives. The precise format of regular expressions is described
in the next section.

The rules section may also start with some Turbo Pascal declarations
(enclosed in %{ %}) which are treated as local declarations of the
actions routine.

Finally, the auxiliary procedures section may contain arbitrary Turbo
Pascal code (such as supporting routines or a main program) which is
simply tacked on to the end of the output file. The auxiliary procedures
section is optional.
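A small sketch of a rules section (the token codes are invented for the
example; return is the LexLib routine mentioned above):

.RS
.nf
%%
[0-9]+      return(257);    { e.g. a NUMBER token code }
"begin"     |
"BEGIN"     return(258);    { shared action via the | notation }
[ \\t]       ;               { skip blanks and tabs }
\&.           writeln('unexpected: ', yytext);
.fi
.RE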

.SH Regular Expressions
The following table summarizes the format of the regular expressions
recognized by TP Lex (also compare Aho, Sethi, Ullman 1986, fig. 3.48).
c stands for a single character, s for a string, r for a regular
expression, and n,m for nonnegative integers.

.RS
.nf
expression  matches                        example
----------  -----------------------------  -------
c           any non-operator character c   a
\\c          character c literally          \\*
"s"         string s literally             "**"
\&.           any character but newline      a.*b
^           beginning of line              ^abc
$           end of line                    abc$
[s]         any character in s             [abc]
[^s]        any character not in s         [^abc]
r*          zero or more r's               a*
r+          one or more r's                a+
r?          zero or one r                  a?
r{m,n}      m to n occurrences of r        a{1,5}
r{m}        m occurrences of r             a{5}
r1r2        r1 then r2                     ab
r1|r2       r1 or r2                       a|b
(r)         r                              (a|b)
r1/r2       r1 when followed by r2         a/b
<x>r        r when in start condition x    <x>abc
.fi
.RE

The operators *, +, ? and {} have highest precedence, followed by
concatenation. The | operator has lowest precedence. Parentheses ()
may be used to group expressions and override the default precedences.
The <> and / operators may occur only once in an expression.

The usual C-like escapes are recognized:

.RS
.nf
\\n    denotes newline
\\r    denotes carriage return
\\t    denotes tab
\\b    denotes backspace
\\f    denotes form feed
\\NNN  denotes the character with octal code NNN
.fi
.RE

You can also use the \\ character to quote characters which would
otherwise be interpreted as operator symbols. In character classes, you
may use the - character to denote ranges of characters. For instance,
[a-z] denotes the class of all lowercase letters.
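For instance (an illustrative pattern, not taken from the original
text), Pascal-style identifiers can be matched with character ranges,
while an operator character such as * must be quoted:

.RS
.nf
[A-Za-z_][A-Za-z0-9_]*   return(260);        { identifier; code assumed }
"*"                      return(Ord('*'));   { quoted operator symbol }
.fi
.RE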
The expressions in a TP Lex program may be ambiguous, i.e. there may be
inputs which match more than one rule. In such a case, the lexical
analyzer prefers the longest match and, if it still has a choice between
different rules, it picks the first of these. If no rule matches, the
lexical analyzer executes a default action which consists of copying the
input character to the output unchanged. Thus, if the purpose of a
lexical analyzer is to translate some parts of the input and leave the
rest unchanged, you only have to specify the patterns which have to be
treated specially. If, however, the lexical analyzer has to absorb its
whole input, you will have to provide rules that match everything. E.g.,
you might use the rules

.RS
.nf
\&.     |
\\n     ;
.fi
.RE

which match "any other character" (and ignore it).

Sometimes certain patterns have to be analyzed differently depending on
some amount of context in which the pattern appears. In such a case the
/ operator is useful. For instance, the expression a/b matches a, but
only if followed by b. Note that the b does not belong to the match;
rather, the lexical analyzer, when matching an a, will look ahead in the
input to see whether it is followed by a b before it declares that it
has matched an a. Such lookahead may be arbitrarily complex (up to the
size of the LexLib input buffer). E.g., the pattern a/.*b matches an a
which is followed by a b somewhere on the same input line. TP Lex also
has a means to specify left context, which is described in the next
section.
.SH Start Conditions
TP Lex provides some features which make it possible to handle left
context. The ^ character at the beginning of a regular expression may be
used to denote the beginning of the line. More distant left context can
be described conveniently by using start conditions on rules.

Any rule which is prefixed with the <> construct is only valid if the
lexical analyzer is in the denoted start state. For instance, the
expression <x>a can only be matched if the lexical analyzer is in start
state x. You can have multiple start states in a rule; e.g., <x,y>a can
be matched in start states x or y.

Start states have to be declared in the definitions section by means of
one or more start state definitions (see above). The lexical analyzer
enters a start state through a call to the LexLib routine start. E.g.,
you may write:

.RS
.nf
%start x y
%%
<x>a    start(y);
<y>b    start(x);
%%
begin
  start(x); if yylex=0 then ;
end.
.fi
.RE

Upon initialization, the lexical analyzer is put into state x. It then
proceeds in state x until it matches an a, which puts it into state y.
In state y it may match a b, which puts it into state x again, etc.

Start conditions are useful when certain constructs have to be analyzed
differently depending on some left context (such as a special character
at the beginning of the line), and if multiple lexical analyzers have to
work in concert. If a rule is not prefixed with a start condition, it is
valid in all user-defined start states, as well as in the lexical
analyzer's default start state.

.SH Lex Library
The TP Lex library (LexLib) unit provides various variables and routines
which are used by Lex-generated lexical analyzers and application
programs. It provides the input and output streams and other internal
data structures used by the lexical analyzer routine, and supplies some
variables and utility routines which may be used by actions and
application programs. Refer to the file lexlib.pas for a closer
description.

You can also modify the Lex library unit (and/or the code template in
the yylex.cod file) to customize TP Lex to your target applications.
E.g., you might wish to optimize the code of the lexical analyzer for
some special application, make the analyzer read from/write to memory
instead of files, etc.
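As a small illustration of LexLib routines in actions (return and start
are documented in this manual; echo is assumed here to be the procedure
counterpart of UNIX Lex's ECHO macro, so verify the name in lexlib.pas):

.RS
.nf
%%
[0-9]+    return(257);   { pass a token code to the caller of yylex }
\\n        echo;          { copy the matched text to the output }
\&.         echo;
.fi
.RE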

.SH Implementation Restrictions
Internal table sizes and the main memory available limit the complexity
of source grammars that TP Lex can handle. There is currently no way to
change the internal table sizes (apart from modifying the sources of TP
Lex itself), but the maximum table sizes provided by TP Lex seem to be
large enough to handle most realistic applications. The actual table
sizes depend on the particular implementation (they are much larger than
the defaults if TP Lex has been compiled with one of the 32 bit
compilers such as Delphi 2 or Free Pascal), and are shown in the
statistics printed by TP Lex when a compilation is finished. The units
given there are "p" (positions, i.e. items in the position table used to
construct the DFA), "s" (DFA states) and "t" (transitions of the
generated DFA).

As implemented, the generated DFA table is stored as a typed array
constant which is inserted into the yylex.cod code template. The
transitions in each state are stored in order. Of course it would have
been more efficient to generate a big CASE statement instead, but I
found that this may cause problems with the encoding of large DFA tables
because Turbo Pascal has a quite rigid limit on the code size of
individual procedures. I decided to use a scheme in which transitions on
different symbols to the same state are merged into one single
transition (specifying a character set and the corresponding next
state). This keeps the number of transitions in each state quite small
and still allows fairly efficient access to the transition table.

The TP Lex program has an option (-o) to optimize DFA tables. This
causes a minimal DFA to be generated, using the algorithm described in
Aho, Sethi, Ullman (1986). Although the absolute limit on the number of
DFA states that TP Lex can handle is at least 300, TP Lex poses an
additional restriction (100) on the number of states in the initial
partition of the DFA optimization algorithm. Thus, you may get a fatal
`integer set overflow' message when using the -o option even when TP Lex
is able to generate an unoptimized DFA. In such cases you will just have
to be content with the unoptimized DFA. (Hopefully, this will be fixed
in a future version. Anyhow, using the merged transitions scheme
described above, TP Lex usually constructs unoptimized DFAs which are
not far from optimal, so in most cases DFA optimization won't have a
great impact on DFA table sizes.)

.SH Differences from UNIX Lex
Major differences between TP Lex and UNIX Lex are listed below.

TP Lex produces output code for Turbo Pascal, rather than for C.

Character tables (%T) are not supported; neither are any directives
to determine internal table sizes (%p, %n, etc.).

Library routines are named differently from the UNIX version (e.g.,
the `start' routine takes the place of the `BEGIN' macro of UNIX
Lex), and, of course, all macros of UNIX Lex (ECHO, REJECT, etc.) had
to be implemented as procedures.

The TP Lex library unit starts counting line numbers at 0, incrementing
the count BEFORE a line is read (in contrast, UNIX Lex initializes
yylineno to 1 and increments it AFTER the line end has been read). This
is motivated by the way in which TP Lex maintains the current line, and
will not affect your programs unless you explicitly reset the yylineno
value (e.g., when opening a new input file). In such a case you should
set yylineno to 0 rather than 1.
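For example, when switching to a new input file you would reset the
counter as follows (the name yyinput for LexLib's input file variable is
an assumption here; check lexlib.pas for the actual identifier):

.RS
.nf
assign(yyinput, 'next.l');   { yyinput: LexLib input file (assumed) }
reset(yyinput);
yylineno := 0;   { not 1: TP Lex increments BEFORE reading a line }
.fi
.RE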