Contains an HTML-to-Text renderer.

html2textrender.pas contains an HTML-to-Text renderer. It converts HTML into plain text by converting tags and their attributes to a representation as plain text.

html2textrender.pas is part of the lazutils package.

Implements an HTML-to-Text renderer.

THTML2TextRenderer is an HTML-to-Text renderer. It converts HTML into plain text by stripping tags and their attributes. Converted text includes configurable indentation for HTML tags that affect the indentation level. The following HTML tags include special processing in the renderer:

  • HTML
  • BODY
  • P
  • BR
  • HR
  • OL
  • UL
  • LI
  • DIV CLASS="TITLE" (forces title mark output)

The following named character entities are converted to their plain text equivalent:

&nbsp;
' '
&lt;
'<'
&gt;
'>'
&amp;
'&'

Other named character entities or numeric character entities are included verbatim in the plain text output.

A UTF-8 Byte Order Mark in the HTML is ignored.

Set property values in the class instance to customize the content and formatting produced in the output. Use the Render method to parse and process the HTML content passed to the constructor, and generate the output for the class instance.
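The workflow described above can be sketched as follows. This is a minimal example, assuming the lazutils package is on the unit search path; the HTML string and the property values chosen here are illustrative only and are not taken from the unit.

```pascal
program RenderDemo;

{$mode objfpc}{$H+}

uses
  html2textrender; // unit from the lazutils package

var
  Renderer: THTML2TextRenderer;
  PlainText: String;
begin
  // The HTML content is passed to the constructor.
  Renderer := THTML2TextRenderer.Create(
    '<ul><li>First</li><li>Second <a href="#">link</a></li></ul><hr>');
  try
    // Customize the formatting before rendering (illustrative values).
    Renderer.IndentStep := 4;
    Renderer.ListItemMark := '* ';
    Renderer.LinkBeginMark := '[';
    Renderer.LinkEndMark := ']';
    // Render parses the HTML and returns the plain text.
    PlainText := Renderer.Render;
  finally
    Renderer.Free;
  end;
  WriteLn(PlainText);
end.
```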

  • HTML content examined in the class.
  • Output value without HTML tags and attributes.
  • Maximum number of lines allowed in the output from the class.
  • End of line marker, by default standard LineEnding.
  • Markup used at the start/end of title text.
  • Markup used for an HR tag.
  • Markup used at the start of an anchor tag.
  • Markup used at the end of an anchor tag.
  • Markup used for a list item tag.
  • Text added when there are too many lines.
  • Flag used to suppress output of line breaks in the output.
  • Flag used to indicate that a DIV tag with a CLASS="TITLE" attribute is being processed.
  • Flag used to indicate that a space character needs to be added to the end of a wrapped line.
  • Indicates a line break needs to be appended in the output.
  • Increment (in spaces) for each nested HTML level.
  • The current indentation level for the renderer.
  • Number of lines added to the output for the class.
  • Length of the HTML examined in the class.
  • Current character position in the HTML.
  • Sets a pending line break to be added later.
  • Sets a maximum of one pending line break to be added later.

Appends text to the plain text output for the renderer.

AddOutput is a Boolean function used to append the value specified in aText to the output for the renderer.

AddOutput ensures that a space character is included for wrapped lines in the HTML when there are no pending new lines. Otherwise, the required number of line ending sequences are appended to the output for the renderer and the line count is increased accordingly. If the line count exceeds the maximum number allowed in Render, the value in MoreMark is appended to the output.

Pending new line(s) also cause indentation spaces to be appended to the output.

The value in aText is appended to the output for the renderer prior to exiting from the method.

AddOutput is used in the implementation of the Render, HtmlTag, and HtmlEntity methods.

Text value appended to the output for the renderer.

True when the value was added to the output; False when the maximum number of lines is exceeded.

Handles an HTML tag and its attribute values.

HtmlTag is a Boolean function used to locate and process an HTML start or end tag, and any attribute name/value pairs present in the tag. HtmlTag handles the following HTML tag and attribute/value names:

HTML
Sets the FInHeader flag to indicate that the content is for a whole page.
BODY
Calls Reset to initialize the renderer.
P, /P, BR, /UL
Adds a new line sequence to the output.
DIV CLASS="Title"
Sets the FInDivTitle flag, and adds a NewLine and a TitleMark to the output. When the CLASS attribute is omitted or has a different value, only a NewLine sequence is appended.
/DIV
Appends a trailing TitleMark, resets the FInDivTitle flag, and appends a NewLine sequence and decrements the indentation level.
LI
Increments the indentation level and adds a single NewLine prior to adding the content in the list.
/LI
Decrements the indentation level.
A
Appends a Space character and the LinkBeginMark sequence to the output.
/A
Appends the LinkEndMark sequence and a Space character to the output.
HR
Adds a single NewLine and the content in HorzLineMark to the output.

All other tag names are ignored in the method.

The return value is True when the HTML content is successfully added by calling AddOutput. The return value is False when the maximum number of lines specified in the Render method is exceeded.

True when output is successfully added; False when the maximum number of lines is exceeded.

Handles an HTML character entity.

HtmlEntity is a Boolean function used to convert common character entities in HTML to their plain text equivalent. The following named character entities are converted to their plain text equivalent:

&nbsp;
' '
&lt;
'<'
&gt;
'>'
&amp;
'&'

Other named character entities or numeric character entities are included verbatim in the plain text output.
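The mapping described above can be illustrated with a small standalone sketch. DecodeBasicEntity is a hypothetical helper written for this example; it is not part of html2textrender, and the case-sensitive comparison it uses is an assumption rather than documented behavior of HtmlEntity.

```pascal
program EntityDemo;

{$mode objfpc}{$H+}

// Hypothetical helper (not part of html2textrender): maps the four
// supported named entities to plain text and returns anything else
// unchanged, mirroring the "included verbatim" rule.
function DecodeBasicEntity(const aEntity: String): String;
begin
  if aEntity = '&nbsp;' then
    Result := ' '
  else if aEntity = '&lt;' then
    Result := '<'
  else if aEntity = '&gt;' then
    Result := '>'
  else if aEntity = '&amp;' then
    Result := '&'
  else
    Result := aEntity; // other named or numeric entities stay as-is
end;

begin
  WriteLn(DecodeBasicEntity('&lt;'));
  WriteLn(DecodeBasicEntity('&amp;'));
  WriteLn(DecodeBasicEntity('&copy;'));  // not in the table, kept verbatim
  WriteLn(DecodeBasicEntity('&#8364;')); // numeric reference, kept verbatim
end.
```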

The return value is the result from the AddOutput method, and contains False when the maximum number of lines has been exceeded in the renderer.

True on success, False when the maximum number of lines is exceeded.

Resets the state and output for the renderer.

Reset is a procedure used to reset the state and output for the renderer. Reset sets values for internal flags used in the class, and clears any content stored in the render output.

Constructor for the class instance.

Create is the overloaded constructor for the class instance. HTML content is passed as an argument to the method using either a String value or a TStream instance. The Stream-based variant reads the content in Stream into a String variable for processing. The position in the stream is not changed prior to or after reading its content.

Create stores the HTML content in aHTML to an internal member used when parsing and processing using methods in the class. A UTF-8 Byte Order Mark (BOM) at the start of the HTML content is removed prior to processing.
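As a rough illustration of the BOM handling described above, the following standalone sketch removes a leading UTF-8 BOM (the bytes $EF $BB $BF) from a string. StripUtf8Bom is a hypothetical helper for this example only; how the constructor actually performs the removal is not shown in this documentation.

```pascal
program BomDemo;

{$mode objfpc}{$H+}

// Hypothetical helper (not part of html2textrender): returns S without
// a leading UTF-8 Byte Order Mark, or S unchanged when none is present.
function StripUtf8Bom(const S: String): String;
begin
  if (Length(S) >= 3) and (S[1] = #$EF) and (S[2] = #$BB) and (S[3] = #$BF) then
    Result := Copy(S, 4, Length(S) - 3)
  else
    Result := S;
end;

var
  WithBom: String;
begin
  WithBom := #$EF#$BB#$BF + '<p>Hello</p>';
  // The three BOM bytes are dropped; the markup itself is untouched.
  WriteLn(Length(WithBom), ' -> ', Length(StripUtf8Bom(WithBom)));
  WriteLn(StripUtf8Bom('<p>No BOM</p>'));
end.
```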

Create sets the default values for the following properties:

LineEndMark
Set to the value in the LineEnding constant for the platform or OS.
TitleMark
Set to the UTF-8 character '◈' (#9672 or #x25C8)
HorzLineMark
Set to the UTF-8 characters '——————————————————'.
LinkBeginMark
Set to the character '_'.
LinkEndMark
Set to the character '_'.
ListItemMark
Set to the UTF-8 characters '✶ ' (Hex #$2736).
MoreMark
Set to the characters '...' (Three Period characters - not an Ellipsis character).
IndentStep
Set to 2.
String with the HTML content examined in the class.

TStream instance with the HTML content examined in the class.

Frees the class instance.

Destroy is the overridden destructor for the class instance. Destroy calls the inherited destructor.

Parses the HTML and renders the plain text output.

Render is a String function used to parse the HTML passed as an argument to the constructor, and to render the plain text output in the return value. The output is limited to the number of lines specified in the aMaxLines argument. The default value for the argument is the MaxInt constant.
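A short sketch of limiting the output with aMaxLines, assuming the lazutils package is available. The HTML string and the MoreMark value are illustrative choices; the exact text produced depends on how the renderer counts lines, so no expected output is shown.

```pascal
program MaxLinesDemo;

{$mode objfpc}{$H+}

uses
  html2textrender; // unit from the lazutils package

var
  Renderer: THTML2TextRenderer;
begin
  Renderer := THTML2TextRenderer.Create(
    '<p>One</p><p>Two</p><p>Three</p><p>Four</p>');
  try
    // Appended when the line limit is exceeded (illustrative value).
    Renderer.MoreMark := '[more]';
    // Limit the output to two lines instead of the MaxInt default.
    WriteLn(Renderer.Render(2));
  finally
    Renderer.Free;
  end;
end.
```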

AddOutput, HtmlTag, and HtmlEntity return False if aMaxLines was exceeded.

Render calls the Reset method to set the initial values for members and flags used in the class instance. The parsing mechanism looks for HTML tags and character entities/references, processes their content, and calls the AddOutput method. Whitespace (characters #32, #9, #10, and #13) between tags and entities is always normalized into a single space character.
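The whitespace rule can be pictured with the following standalone sketch. CollapseWhitespace is a hypothetical helper, not the parser used in Render; it only shows the idea of replacing each run of #32, #9, #10, and #13 with a single space.

```pascal
program WhitespaceDemo;

{$mode objfpc}{$H+}

// Hypothetical helper (not part of html2textrender): replaces every run
// of space, tab, LF, and CR characters with one space character.
function CollapseWhitespace(const S: String): String;
var
  i: Integer;
  InRun: Boolean;
begin
  Result := '';
  InRun := False;
  for i := 1 to Length(S) do
    if S[i] in [#32, #9, #10, #13] then
    begin
      // Emit a space only for the first character of a run.
      if not InRun then
        Result := Result + ' ';
      InRun := True;
    end
    else
    begin
      Result := Result + S[i];
      InRun := False;
    end;
end;

begin
  WriteLn('[', CollapseWhitespace('Hello' + #13#10 + #9 + '  world'), ']');
end.
```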

Render calls the HtmlTag, HtmlEntity, and AddOutput methods to process the HTML content passed to the method.

Maximum number of lines to process in the method.

String with the plain text content extracted from the HTML.

Defines the end-of-line character sequence.

LineEndMark is a String property which contains the end-of-line character sequence inserted in the plain text output for the renderer. By convention, the default value for the property is the value from the LineEnding constant defined for the platform or OS. The value is inserted in the renderer output in the AddOutput method.

LineEnding
Defines the character used to delimit a title or header.

TitleMark is inserted both prior to and following a title/header found in the HTML content in the HtmlTag method. The default value is the UTF-8 character '◈' (Decimal #9672 or Hex #x25C8).

Represents an HR tag in the plain text output.

HorzLineMark is used in the implementation of the HtmlTag method when a HR tag is encountered in the HTML content. The default value for the property is the UTF-8 characters '——————————————————' (Eighteen Hex #$2013 characters).

Represents an A start tag in the plain text output.

LinkBeginMark is a String property used to represent the start of the plain text output for an HTML A tag. LinkEndMark is used to represent the end of the anchor. The value is added to the plain text output for the renderer in the HtmlTag method.

Represents an A end tag in the plain text output.

LinkEndMark is a String property used to represent the end of the plain text output for an HTML A tag. LinkBeginMark is used to represent the start of the anchor. The value is added to the plain text output for the renderer in the HtmlTag method.

Represents a list item in the plain text output.

ListItemMark is a String property which contains the character(s) inserted before the content of an HTML LI tag. The value is added to the plain text output for the renderer in the HtmlTag method.

Indicates that the plain text output is truncated due to a line limit restriction.

The default value for the property is three (3) Period ('.') characters - NOT an Ellipsis character. The value is added to the plain text output for the renderer when the maximum number of lines has been exceeded in the AddOutput method.

Number of space characters used for each indentation level in the plain text output.

IndentStep is an Integer property used to indicate the number of space characters generated for each indentation level in the plain text output for the renderer. The default value for the property is 2. The value is used in the implementation of the AddOutput method.

Converts the specified HTML content to a plain text value.

RenderHTML2Text is a String function used to convert the HTML content specified in AHTML to a string with the plain text for the content. RenderHTML2Text creates a temporary THTML2TextRenderer instance (using its default configuration values) to remove any HTML mark-up found in the AHTML argument by calling its Render method.

RenderHTML2Text is a convenience routine; use a THTML2TextRenderer instance when the HTML content is stored in a TStream instance, or to override the default configuration settings for the class instance.
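The two approaches can be compared in a short sketch, assuming the lazutils package is available. The HTML strings are illustrative; TStringStream stands in for any TStream source and is positioned at its start when created from a string.

```pascal
program ConvenienceDemo;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils,
  html2textrender; // unit from the lazutils package

var
  Stream: TStringStream;
  Renderer: THTML2TextRenderer;
begin
  // One-call conversion using the default configuration values.
  WriteLn(RenderHTML2Text('<p>Plain <a href="#">text</a></p>'));

  // Class-based conversion for HTML content stored in a stream.
  Stream := TStringStream.Create('<p>From a stream</p>');
  try
    Renderer := THTML2TextRenderer.Create(Stream);
    try
      WriteLn(Renderer.Render);
    finally
      Renderer.Free;
    end;
  finally
    Stream.Free;
  end;
end.
```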

String with the plain text value for the specified HTML content.

String with the HTML content converted in the routine.