Contains an HTML-to-Text renderer

html2textrender.pas contains an HTML-to-Text renderer. It converts HTML into plain text by converting tags and their attributes to a representation as plain text.

THTML2TextRenderer is an HTML-to-Text renderer. It converts HTML into plain text by stripping tags and their attributes. Converted text includes configurable indentation for HTML tags that affect the indentation level. The following HTML tags include special processing in the renderer:

  • HTML
  • BODY
  • P
  • BR
  • HR
  • OL
  • UL
  • LI
  • DIV CLASS="TITLE" (forces title mark output)

The following named character entities are converted to their plain text equivalent:

&nbsp;
' '
&lt;
'<'
&gt;
'>'
&amp;
'&'

Other named character entities or numeric character entities are included verbatim in the plain text output.

A UTF-8 Byte Order Mark in the HTML is ignored.

Set values for properties in the class instance to customize the content and formatting produced in the output. Use the Render method to parse and process the HTML content passed to the constructor, and generate the output for the class instance.
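The workflow described above can be sketched as a small console program. This is a minimal example, not code taken from the unit: the HTML string and the property values are made up for illustration, and it assumes the LazUtils package (which provides html2textrender) is on the unit search path.

```pascal
program RenderDemo;

{$mode objfpc}{$H+}

uses
  html2textrender; // assumption: LazUtils is on the unit search path

var
  Renderer: THTML2TextRenderer;
begin
  // The HTML passed here is illustrative sample input.
  Renderer := THTML2TextRenderer.Create(
    '<html><body><p>Hello &amp; welcome</p>' +
    '<ul><li>One</li><li>Two</li></ul></body></html>');
  try
    // Optional: override the defaults before calling Render.
    Renderer.ListItemMark := '* ';
    Renderer.IndentStep := 4;
    WriteLn(Renderer.Render());
  finally
    Renderer.Free;
  end;
end.
```

Properties are set before Render is called because the marks and the indentation step are applied while the output is generated.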

  • HTML content examined in the class
  • Output value without HTML tags and attributes
  • Maximum number of lines allowed in the output from the class
  • End of line marker, by default standard LineEnding
  • Markup used at the start/end of title text
  • Markup used for an HR Tag
  • Markup used at the start of an Anchor Tag
  • Markup used at the end of an Anchor Tag
  • Markup used for a list item tag
  • Text added when there are too many lines
  • Flag used to suppress output of line breaks in the output
  • Flag used to indicate that a DIV tag with CLASS="Title" is being processed
  • Flag used to indicate that a space character needs to be added at the end of a wrapped line
  • Indicates a line break needs to be appended in the output
  • Increment (in spaces) for each nested HTML level
  • The current indentation level for the renderer
  • Number of lines added to the output for the class
  • Length of the HTML examined in the class
  • Current character position in the HTML
  • Sets a pending line break to be added later
  • Sets a maximum of one pending line break to be added later

Appends text to the plain text output for the renderer

AddOutput is a Boolean function used to append the value specified in aText to the output for the renderer.

AddOutput ensures that a space character is included for wrapped lines in the HTML when there are no pending new lines. Otherwise, the required number of line ending sequences is appended to the output for the renderer and the line count is increased accordingly. If the line count exceeds the maximum number allowed in Render, the value in MoreMark is appended to the output.

Pending new line(s) also cause required indentation spaces to be appended to the output.

The value in aText is appended to the output for the renderer prior to exiting from the method.

AddOutput is used in the implementation of the Render, HtmlTag, and HtmlEntity methods.
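The steps described for AddOutput can be modelled with a short standalone program. This is a sketch of the documented behavior only, not the source of the unit: the TSketchState record, its field names, and SketchAddOutput are hypothetical names introduced for illustration.

```pascal
program AddOutputSketch;

{$mode objfpc}{$H+}

type
  // Hypothetical stand-in for the renderer state; names are illustrative.
  TSketchState = record
    Output, LineEnd, MoreMark: string;
    PendingSpace: Boolean;
    PendingNewLines: Integer;
    Indent, IndentStep: Integer;
    LineCount, MaxLines: Integer;
  end;

function SketchAddOutput(var S: TSketchState; const aText: string): Boolean;
var
  i: Integer;
begin
  Result := True;
  // A wrapped source line yields one space, unless a line break is pending.
  if S.PendingSpace and (S.PendingNewLines = 0) then
    S.Output := S.Output + ' ';
  S.PendingSpace := False;
  // Emit the pending line endings and count each one against the limit.
  for i := 1 to S.PendingNewLines do
  begin
    S.Output := S.Output + S.LineEnd;
    Inc(S.LineCount);
    if S.LineCount > S.MaxLines then
    begin
      S.Output := S.Output + S.MoreMark;
      Exit(False);
    end;
  end;
  // Indentation is only written after a line break was emitted.
  if S.PendingNewLines > 0 then
  begin
    S.Output := S.Output + StringOfChar(' ', S.Indent * S.IndentStep);
    S.PendingNewLines := 0;
  end;
  S.Output := S.Output + aText;
end;

var
  St: TSketchState;
begin
  St.Output := '';
  St.LineEnd := LineEnding;
  St.MoreMark := '...';
  St.PendingSpace := False;
  St.PendingNewLines := 0;
  St.Indent := 0;
  St.IndentStep := 2;
  St.LineCount := 0;
  St.MaxLines := MaxInt;

  SketchAddOutput(St, 'First');
  St.PendingNewLines := 1; // simulate a pending line break
  St.Indent := 1;          // simulate one nesting level
  SketchAddOutput(St, 'Second');
  WriteLn(St.Output);
end.
```

Checking the line limit inside the loop means the sketch stops as soon as the limit is passed, which is why the text in aText is not appended when the function returns False.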

aText
Text value appended to the output for the renderer.
Return value
True when the value was added to the output; False when the maximum number of lines is exceeded.

Handles an HTML tag and its attribute values

HtmlTag is a Boolean function used to locate and process an HTML start or end tag, and any attribute name/value pairs present in the tag. HtmlTag handles the following HTML tag and attribute/value names:

HTML
Sets the FInHeader flag to indicate that the content is for a whole page.
BODY
Calls Reset to initialize the renderer.
P, /P, BR, /UL
Adds a new line sequence to the output.
DIV CLASS="Title"
Sets the FInDivTitle flag, and adds a NewLine and a TitleMark to the output. When the CLASS attribute is omitted or has a different value, only a NewLine sequence is appended.
/DIV
Appends a trailing TitleMark, resets the FInDivTitle flag, appends a NewLine sequence, and decrements the indentation level.
LI
Increments the indentation level and adds a single NewLine prior to adding the content in the list.
/LI
Decrements the indentation level.
A
Appends a Space character and the LinkBeginMark sequence to the output.
/A
Appends the LinkEndMark sequence and a Space character to the output.
HR
Adds a single NewLine and the content in HorzLineMark to the output.

All other tag names are ignored in the method.

The return value contains True when the HTML content is successfully added by calling AddOutput. The return value is False when the maximum number of lines specified in the Render method is exceeded.

The method does not appear to recognize HTML5 empty attributes (attribute names with no value assignment).

Return value
True when content is successfully added to the output; False when the maximum number of lines is exceeded.

Handles an HTML character entity

HtmlEntity is a Boolean function used to convert common character entities in HTML to their plain text equivalent. The following named character entities are converted to their plain text equivalent:

&nbsp;
' '
&lt;
'<'
&gt;
'>'
&amp;
'&'

Other named character entities or numeric character entities are included verbatim in the plain text output.

The return value is the result from the AddOutput method, and contains False when the maximum number of lines has been exceeded in the renderer.
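The conversion table above can be expressed as a small lookup function. This is a hypothetical helper written for illustration, not the implementation in the unit: the name SketchEntityToText is invented here, and the case-sensitive comparison of entity names is an assumption.

```pascal
program EntitySketch;

{$mode objfpc}{$H+}

// Hypothetical helper: aName is the entity name without '&' and ';'.
// Assumption: entity names are compared case-sensitively.
function SketchEntityToText(const aName: string): string;
begin
  if aName = 'nbsp' then
    Result := ' '
  else if aName = 'lt' then
    Result := '<'
  else if aName = 'gt' then
    Result := '>'
  else if aName = 'amp' then
    Result := '&'
  else
    // Other named or numeric entities are passed through verbatim.
    Result := '&' + aName + ';';
end;

begin
  WriteLn(SketchEntityToText('lt'));
  WriteLn(SketchEntityToText('amp'));
  WriteLn(SketchEntityToText('copy'));
end.
```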

Return value
True on success, False when the maximum number of lines is exceeded.

Resets the state and output for the renderer

Reset is a procedure used to reset the state and output for the renderer. Reset sets values for internal flags used in the class, and clears any content stored in the renderer output.

Constructor for the class instance

Create is the overloaded constructor for the class instance. An argument passed to the method contains the HTML content examined in the class as either a String value or a TStream instance. The Stream-based variant reads the content in Stream into a String variable for processing. Create does not explicitly set the position in the stream prior to or after reading its content.

Create stores the HTML content in aHTML to an internal member used when parsing and processing using methods in the class. A UTF-8 Byte Order Mark (BOM) at the start of the HTML content is removed prior to processing.
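The stream-based variant might be used as in the following sketch. The sample HTML is made up, and TStringStream is used only as a convenient TStream descendant. Because the constructor does not position the stream itself, the example sets Position explicitly before passing the stream in.

```pascal
program StreamDemo;

{$mode objfpc}{$H+}

uses
  Classes,
  html2textrender; // assumption: LazUtils is on the unit search path

var
  Source: TStringStream;
  Renderer: THTML2TextRenderer;
begin
  Source := TStringStream.Create('<p>Read from a stream</p>');
  try
    // The caller positions the stream; the constructor reads from here on.
    Source.Position := 0;
    Renderer := THTML2TextRenderer.Create(Source);
    try
      WriteLn(Renderer.Render());
    finally
      Renderer.Free;
    end;
  finally
    Source.Free;
  end;
end.
```

The stream can be freed independently of the renderer, since its content is copied into a String during construction.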

Create sets the default values for the following properties:

LineEndMark
Set to the value in the LineEnding constant for the platform or OS.
TitleMark
Set to the UTF-8 character '◈' (#9672 or #x25C8)
HorzLineMark
Set to the UTF-8 characters '——————————————————'.
LinkBeginMark
Set to the character '_'.
LinkEndMark
Set to the character '_'.
ListItemMark
Set to the UTF-8 characters '✶ ' (Hex #$2736).
MoreMark
Set to the characters '...' (Three Period characters - not an Ellipsis character).
IndentStep
Set to 2.
aHTML
String with the HTML content examined in the class.
Stream
TStream instance with the HTML content examined in the class.

Frees the class instance

Destroy is the overridden destructor for the class instance. Destroy calls the inherited destructor.

Parses the HTML and renders the plain text output

Render is a String function used to parse the HTML passed as an argument to the constructor, and to render the plain text output in the return value. The output is limited to the number of lines specified in the aMaxLines argument. The default value for the argument is the MaxInt constant.

Please note: AddOutput, HtmlTag, and HtmlEntity return False if aMaxLines was exceeded.

Render calls the Reset method to set the initial values for members and flags used in the class instance. The parsing mechanism looks for HTML tags and character entities/references, processes their content, and calls the AddOutput method. Whitespace (characters #32, #9, #10, and #13) between tags and entities is always normalized into a single space character.

Render calls the HtmlTag, HtmlEntity, and AddOutput methods to process the HTML content passed to the method.
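Limiting the output with aMaxLines could look like the sketch below, for example to build a short preview of a longer document. The input HTML, the limit of 2, and the custom MoreMark value are illustrative choices, not values taken from the unit.

```pascal
program TruncateDemo;

{$mode objfpc}{$H+}

uses
  html2textrender; // assumption: LazUtils is on the unit search path

var
  Renderer: THTML2TextRenderer;
  Preview: string;
begin
  Renderer := THTML2TextRenderer.Create(
    '<p>One</p><p>Two</p><p>Three</p><p>Four</p>');
  try
    // Text appended when the line limit is exceeded.
    Renderer.MoreMark := ' [more]';
    // Stop rendering once more than two lines have been produced.
    Preview := Renderer.Render(2);
    WriteLn(Preview);
  finally
    Renderer.Free;
  end;
end.
```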

aMaxLines
Maximum number of lines to process in the method.
Return value
String with the plain text content extracted from the HTML.

Defines the end-of-line character sequence

LineEndMark is a String property which contains the end-of-line character sequence inserted in the plain text output for the renderer. By convention, the default value for the property is the value from the LineEnding constant defined for the platform or OS. The value is inserted in the renderer output in the AddOutput method.

Defines the character used to delimit a title or header

TitleMark is inserted both prior to and following a title/header found in the HTML content in the HtmlTag method. The default value is the UTF-8 character '◈' (Decimal #9672 or Hex #x25C8).

Represents an HR tag in the plain text output

HorzLineMark is used in the implementation of the HtmlTag method when a HR tag is encountered in the HTML content. The default value for the property is the UTF-8 characters '——————————————————' (Eighteen Hex #$2013 characters).

Represents an A start tag in the plain text output

LinkBeginMark is a String property used to represent the start of the plain text output for an HTML A tag. LinkEndMark is used to represent the end of the anchor. The value is added to the plain text output for the renderer in the HtmlTag method.

Represents an A end tag in the plain text output

LinkEndMark is a String property used to represent the end of the plain text output for an HTML A tag. LinkBeginMark is used to represent the start of the anchor. The value is added to the plain text output for the renderer in the HtmlTag method.

Represents a list item in the plain text output

ListItemMark is a String property which contains the character(s) inserted before the content of an HTML LI tag. The value is added to the plain text output for the renderer in the HtmlTag method.

Indicates that the plain text output is truncated due to a line limit restriction

MoreMark is a String property which contains the text appended when the output is truncated. The default value for the property is three (3) Period ('.') characters - NOT an Ellipsis character. The value is added to the plain text output for the renderer in the AddOutput method when the maximum number of lines has been exceeded.

Number of space characters used for each indentation level in the plain text output

IndentStep is an Integer property used to indicate the number of space characters generated for each indentation level in the plain text output for the renderer. The default value for the property is 2. The value is used in the implementation of the AddOutput method.