fpc/packages/symbolic/doc/symbolic.txt

Unit Symbolic
------------

Unit Symbolic is something I wrote to take care of all my needs in the fields
of simple expression parsing and evaluating, and to act as base for more
complex manipulation.

This doc originally was a notes file for myself. If it is unreadable, then
sorry. Rewrite of docs will have to wait until FCL doc-making practices
are clear.

Author:
-------

Marco van de Voort (Marco@freepascal.org)

Target/Compiler:
------

FreePascal 1.1 (post 1.0 development). www.freepascal.org

Should run on Delphi 4 with minimal changes. (and any Delphi that supports
overloading). If you remove the overloading it should run on D2..D5. I never
programmed 16-bit Object Pascal, so I don't know the D1 status

I tested with D4, but forgot to merge all changes.
I fixed the more difficult Delphi problems see the ifdef near
the pvliwevalword definition) Probably replacing all Upcase() functions with
ansiuppercase and commenting the runerror msgs should get it to compile under
Delphi.

Key features:
--------------
(for the meaning of abbreviations, see the glossary
               at the end of this document)

General:
      - 32-bit. Probably close to being 64-bit clean. (no integer <->
         pointer casts). D1 status unknown, since I never used it, and can't
         tell easily. Biggest problem for ports is probably that it doesn't
         account for aligned arrays. It also assumes pointer arithmic.
      - OOP interface design, but sometimes uses procedures internally for
         speed.
      - Doesn't use (co)processor dependant features atm. An alternate method
         in TEvaluator will have to take care of that.
      - Optimised (algorithm) with high speed repeated evaluations in mind.
         Parsing is NOT optimised, but not particulary dumb either.
         If parsing is a speed problem, one should eliminate the parsetree
         generation and conversion to back VLIWRPN, and generate VLIWRPN
         directly

- Expression parsing and conversion:
      - Infix to RPN
      - infix to Parsetree
      - Parsetree to infix
      - Parsetree to RPN

- Symbolic Expression handling.
   - Simple operators on expressions + / * - ^
   - Derivation of simple functions (all operators + most functions in math
     unit)
   - taylor polynomal.
   - Precalculate Newton. (almost non-feature :-)
   - Primitives for rearranging
                - Identifying of terms.
                - Simple simplying (2*2*x -> 4*x)
                - (de)factoring (*)
                - Rearrange so that when converted to RPN, maximal stack depth
                  for evaluation is 4. This also needs a detector routine
                  (determine required RPN stack room)
   - Operator overloading possible?

- High speed evaluating. (parse once, evaluate often principle)
   - Infinite variables
   - Infinite (symbolic) constants.
   - Fast (hopefully)
   - Structure designed so that a lowlevel (processor dependant) version of
      the core evaluating routine is possible.

TO DO, problems, missing features.
------

The biggest feature missing for me (at the moment) is the possibility to use
user defined (other TExpression) functions in my expressions. Only built in
functions are allowed. A procedure variable system as seen in some freeware
examples could be done too. Procedure variables would be faster. However they
would be compiletime (while texpressions can be changed runtime)
(one can workaround this for the evaluator by applying some substitutions)

Another problem can be seen both as bug and as missing feature: 5+x+7 doesn't
get simplified to x+13 or 13+x. Only 5+7+x gets optimised. This also goes for
the other infix operators.

- (Symbolic) Integration. At least the parts that *can* be done. Hard, there is
  no foolproof approach, and even determining *if* integration is possible is
  hard.
  User assisted? (e.g. let the user identify the partial integration terms)
  Integration further opens the door to Laplace and Fourier.
- Equation support? Or Equation is an equity operator and 2 TExpressions?
- Other mathematical symbolic functions.
- The RPNCalc example is 90% of a simple (symbolic!) RPN calculator. It looks
  and feels awfull, but the base functionality is all there, and more important
  easy to use and extend.
  Maybe for the GUI freaks it is nice to have some GUI RPNcalc widget. Same for
   TUI (TV/FV/IDE)
- Polynomal to (Jedi's or other) vector/Matrix type conversion.
   Would create entanglement between the units though. Maybe better via
   ^array of arbfloat. Could need an import method in target unit somewhere.
- Rearranging of the parsetree so that it requires maximally 4 stack
   positions to evaluate the expression (which according to RPN theory
   is possible?)
  This would allow to run the evaluator purely on the i386 coprocessor
   stack, which probably would mean an enormous speed increase.
- As first step: inline math functions in assembler in evaluator
  (with conditional)
- Other "smart" rearranging of expressions. Since this is not always possible
   without user intervention, this will boil down in creating the support
   methods for user assisted symbolic rearraning.
- Clean up, real docs. I waited with real docs because Jedi and FPC use
  different document formats and philosophies with it. Personally I prefer the
  FPC way of course. A PDF loads as fast as such a html-hlp, and looks ten
  times better. AND can generate html if necessary, and get used for real books.
- Complex?
- Predefined symbolic constants? pi, G, h, e(logaritm), e(elementary charge)
   (comment: Essentially not necessary anymore.)
- Some more experienced classes programmer must decide which methods to make
   virtual, and maybe rework the current inheritance between the classes.
- Support in TEvaluator for constant substitution followed by an
   TExpression.Simplify BEFORE vliwarr generation. This to avoid recalculating
   things like h/(2*pi) in each evaluation. Will need to copy exprtree for
   this?
- Changing parser to allow foreign characters. (anything not in a certain
   set is appended to token). Important for people using other codepages.
- Common subexpression elimination? (probably needed in some form for some
   rearrangements)
- XML / HTML 4.0 / \Latex formatted output expression :-)
- (Delphi) Controls that allow you to enter mathematical expressions in
  numerical fields?
- Graphical viewing of expressions? How to do it graph library (graph,
  graphiX,GTK,QT,directx etc etc) independant?
  (I have some idea for an algoritm for this from a LaTeX tutorial. Basically
    parse the tree and assign all end nodes a size. Parents size can be
    calculated from children. Then another pass for rendering to coordinates,
    followed by the plot. Will have to be parameterized and with callbacks for
    flexibility and system independance)
- Doesn't check for bounderies. (treats e.g. x/x=1 but if x=0 then officially
   it isn't defined) I hope to implement a TExpression method for this
   someday. (that checks a function for continuety problem spots)

Class overview.
-------------

1. TBaseExpression.                 Very basic support for creating parsetrees.
2. TBaseExprParser(TBaseExpression) Parser class. Most basic conversion
                                    between the three expression types
                                    (infix, RPN, parsetree)
3. TExpression(TBaseExprParser)     Main class. An expression and the operations
                                    you can do on them.
                                    Can do some simple evaluation.
4. TEvaluator                       Plugin evaluation class. Operates
                                    on a parsetree.


Evaluating an expression.
-------------------------

There are two ways of evaluating a simple expression, the method
TExpression.SimplifyConstants and the class TEvaluator. The differences are:

- TExpression.SimplifyConstants is actually not written for evaluations but
   for flexible combining constants after derivation. (  deriv(2x^2) returns
   2*2*x, calling SimplifyConstants changes it to 4*x)
   It can be used for simple evaluations too, but it is probably too slow for
   repeated iterations. So in case of repeated iterations use TEvaluator.
   For one simple evaluation: use simplify, unless you have symbolic
   constants.

- TEvaluator is written for speed. More specifically for high speed *repeated*
   evaluations. So setting up the evaluation (creating the TEvaluator class),
   is a parsing process and relatively slow. Each iteration after that however
   is about as fast as I can imagine without using processor specific lowlevel
   features in a HLL. (like internal compilation, FP assembler etc)

- TEvaluator requires you to subst all values for symbolic constants/variables.
  Simplify doesn't allow to subst values for symbolic constants/variables.

TEvaluator algoritm and internals.
--------------------

TEvaluator had two design requirements:
1 High speed for repeated evaluations of the same function with slightly
  different values. (read: plot a graph reasonably fast)
2 Must be usable to evaluate TExpressions, but not inherit directly from
  TExpression. Since TEvaluate only needs the parsetree from TExpression,
  this was easily possible.

The reason for requirement 1 is that on modern computers the application's
speed won't be affected by a little more parsing overhead for a single
evaluation, while repeated evaluations can still slow down any system.
(people who object to this, please calculate the Wave function for all known
organic compounds:-)
This is implemented by moving as much as possible to the (single) parsing
stage, and keeping the repeated part as lean and fast as possible.

As an application for the repeated evaluations I mainly had numerical searching
for roots and drawing graphs in mind.


The TEvaluator class generates something what I named VLIW-RPN array.
- RPN because the array's elemental operations are equivalent to RPN stack
  operations (very comparable to a Hewlett Packard RPN calculator).
  This is mainly important because RPN is
       - parsed linearly, and
       - each operation is very simple, which is both fast.
- VLIW because all operations are of uniform size. This makes sure that
  finding the next item is equivalent to one pointer addition instruction.
  Also looking ahead and back is easy. Contrary to "real" VLIW, only one
  instruction per word exists.
- Array vs linked list or OOP thingy like tlist: Same reasons as VLIW.

In TEvaluator, symbolic values are subdivided into symbolic constants and
variables. There is no mathematical difference (you define what a constant,
and what is a variable. If you choose "wrong", there is a speed penalty, but
no failure). The difference between constants and variables is that constants
are embedded in the VLIW-RPN array before each evaluation, while variables are
passed as parameters to each evaluation.
Constants can be changed between each evaluation. If a variable only changes
each 50 or more evaluations, make it a constant, and change it after 50
evaluations.

Example:

 somefunc(x,y,pi,h)=h/(2*pi)*x^2+5*x+y

Obviously, it is smart to choose pi and h for constants, since they won't
change each evaluation again. (even smarter would be to evaluate h/2*pi  :-)


A VLIW record basically can be 4 or 5 things atm:

- a floating point value.
- an integer value.
- a RPN operator or function (which isn't a difference in RPN), though
   this routine makes a difference between one and two parameter
   functions/operators for speed. Two types:
       - An operator or a function which takes two arguments. (there is no
          difference in RPN, an operator is a function and vice versa)
       - A function that takes one argument.
- (administrative value, no mathematical meaning) placeholder for a symbolic
  constant, to be able to to detect a constant/variable which wasn't given a
  value, and raise an exception.

- Symbolic variables. The variables in the expression are identified by an
   integer sequential value (first var=1, second 2 etc). Variable values ARE
   looked up each occurance during evaluation, and the only data used from
   outside the RPN-VLIW array in a standard evaluation.

The symbolic constants initially get the "placeholder" value, and when the
user sets the constants via the SetConstant method, it gets a "floating point
value" or "integer value" type.
The class stores all references to all occurances of a constant in the VLIW
array in a tlist.

The Parser
----------

The parser is based on a PD stack RPN based non symbolic constant evaluator, I
found in SWAG. It is practically rewritten, and only the elegant principle
stands. The parser is NOT recursive-descent. It purely parses from left to
right and creates for each token it finds a parsetree record.
Parsetree records that can't be added to the parsetree yet, are pushed on an
argument or separate operator stack.
When an operator is found, then the operator stack is evaluated (popping arguments
of the argument stack) until an operator with higher precendence than the new
one is found. Only then the new operator is pushed on the operator stack.

I don't know if this is the fastest way, but it is simple, quite elegant and
probably not very bug-sensitive. If somebody has sensible reasons to change it
to recur. descent, please mail me.

Exceptions
-------------

I'm still working on the errorhandling (exceptions) of the classes.
Besides some more specific cases, there are two or three basic exception groups:

- (RPN)Stack under/overflow exceptions. This is not necessarily a fault
  in the program, but more likely a fault in the passed (infix) expression.
  (specially if they occur in the parser). Smartest is to log the expression
  passed to parser somewhere in such cases.
  Note: These signal problems with internal RPN stacks,
  not the processor stack. Do not mix these up. (by reraising a processor
  stack exception). The fault is not necessarily in the program.

- Internal errors. (IE) These are mosttimes problems in the class, and logging
  the "message" gives some information about the location of the problem.
  Most IE's are ELSE clauses that shouldn't occur, or datacorruption that
  is not acceptable.  Probably they only occur if one part of the package
  is out of sync with another part, with dangling pointers etc.
  E.g. Parser is updated to support function "x", but TEvaluator.Evaluate
  wasn't. The CASE node for "x" doesn't exist, so it ends up in the ELSE
  clause which triggers the exception.
  If you use FPC, and your application is compiled with -gl, you'll probably
  get a nice backtrace with sourcename and line of the exception.

- Division by zero might be the third. This is NOT the processor division
   trap by zero, but a RPN stack one.

Glossary
---------
Some often used abbreviations and terms:

FPC  : Free Pascal Compiler, the one that I use. Has a 32-bit Delphi mode.
       Misses dynamic arrays, interfaces, and nearly the entire VCL in
       production version 1.0.x. (Meanwhile, most of the language is
       already in 1.1.x development version)
        http://www.freepascal.org

IE   : Internal error. Under FPC we try to append an ELSE clause to all
       CASE statements, even if the ELSE shouldn't occur. In such CASE
       statement the ELSE calls an internal error procedure.
       This is also used for other important decisions with situations that
       shouldn't occur. (e.g. enumerations with values that aren't defined,
       placed there by casts, circular references in linked lists etc.)
       I use the same system in these units, but with Exceptions.
       See "Exceptions" paragraph for more information about IEs.
       A good generic IE routine should be able to obtain the name of the class
       in string form.

Infix: The way poeple usually write down expressions. An operator between its
       operands. (5+2 operates on 5 and 2. Could also be written as add(5,2)
       or 5 2 +
       Has some advantages, but is complicated and slow to parse. However
       users(except some Hewlett Packard calculator users like me) seem to
       prefer it.

RPN : Reverse Polish Notation, an alternate notation of expression.
       Any operator appears AFTER its operands.
       e.g. 1+2+3*sin(4)  could be written as 1 2 + 3 4 sin * +
       Biggest advantage: Linear parsing from left to right.
       Being able to convert a parsetree to RPN is also a good debugging aid.
       (since it can be simply printed instead of having to browse a
       parsetree runtime)
       You can also think of it as replacing the infix operators in an infix
       expression by functions (so add(x,y) instead of x+y), and then parse
       from end to start (the "Reverse" of RPN)
       This also displays another feature of RPN: There is no difference between
       operators and functions. There are only functions that take different
       amounts of parameters.

Parsetree:
      The way an expression (or even an entire program) is stored
       after parsing in compilers. Often, the main type is a variant record
       (see treenode, pnode in the source) in which an operator or a function
       has pointers to each operand. Parsetrees are often visualised as below.
       Each operation, function or constant is a record, the lines made with
       slashes are the pointers between the records. (so the top "+" has a
       pointer to  another "+" record, and one to a "*" record)


                    +
                   / \
                  +   \
                 / \   \
                1   2   \
                         *
                        / \
                       3   SIN
                             \
                              4

                        Fig 1.  1+2+3*sin(4)

       Parsetrees are the easiest way to operate on (transform, derive etc)
        expressions. Mainly because you don't have to move much data to move one
	part of the expression to another place. Parsetrees are kinda slow though)
	(compared to RPN), or VLIWRPN

VLIW:  Very Large Instruction Word. Acronym from the RISC world that simply
          boils down to "a linear sequence (array,stream) of uniform sized
          "items" is the simplest and fastest way to parse something."
          The RISC people are of course talking about instructions to process
          and schedule. I'm using the analogy to evaluate an array of
          elementary RPN instructions.
          This principle is used to get the expression evaluator fast per
          iteration. The main difference is that in VLIW processors more than
          one operation can be packed in a VLI-Word. (which must be independant
          then). This unit doesn't :-)