htmllib
The classes in this module are derived from and extend the class SGMLParser defined in module sgmllib. The following is a summary of the interface defined by sgmllib.SGMLParser:
Data is fed to a parser instance through the feed() method, which takes a string argument. This can be called with as little or as much text at a time as desired; p.feed(a); p.feed(b) has the same effect as p.feed(a+b). When the data contains complete HTML elements, these are processed immediately; incomplete elements are saved in a buffer. To force processing of all unprocessed data, call the close() method.
Example: to parse the entire contents of a file, do parser.feed(open(file).read()); parser.close().
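The buffering rule described for feed() and close() can be illustrated without sgmllib itself. The following is only a sketch under stated assumptions: TagBuffer, its events list and parse_in_chunks are hypothetical names, and the toy recognizes nothing beyond plain text and <...> tags.

```python
class TagBuffer:
    """Toy stand-in for an incremental parser (hypothetical, not sgmllib)."""

    def __init__(self):
        self.rawdata = ""   # unprocessed input carried over between feed() calls
        self.events = []    # ("tag", "<h1>") or ("text", "Title") in input order

    def feed(self, data):
        self.rawdata += data
        self._process(final=False)

    def close(self):
        # Force processing of whatever is still buffered.
        self._process(final=True)

    def _process(self, final):
        raw, i = self.rawdata, 0
        while i < len(raw):
            if raw[i] == "<":
                j = raw.find(">", i)
                kind, end = "tag", j + 1
            else:
                j = raw.find("<", i)
                kind, end = "text", j
            if j < 0:
                if not final:
                    break                     # incomplete element: keep it buffered
                kind, end = "text", len(raw)  # at close(), flush the rest as text
            self.events.append((kind, raw[i:end]))
            i = end
        self.rawdata = raw[i:]


def parse_in_chunks(*chunks):
    """Feed each chunk in turn, close, and return the recorded events."""
    parser = TagBuffer()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.events
```

With this sketch, parse_in_chunks("<h1>Ti", "tle</h1>") and parse_in_chunks("<h1>Title</h1>") return the same list, because the unfinished text "Ti" stays in the buffer until more data or close() arrives.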
Semantics for HTML tags are defined by deriving a class and defining methods called start_tag(), end_tag(), or do_tag(). The parser will call these at appropriate moments: start_tag or do_tag is called when an opening tag of the form <tag ...> is encountered; end_tag is called when a closing tag of the form </tag> is encountered. If an opening tag requires a corresponding closing tag, like <H1> ... </H1>, the class should define the start_tag method; if a tag requires no closing tag, like <P>, the class should define the do_tag method.
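The naming scheme can be sketched with a small dispatcher. This is an illustration, not the sgmllib implementation: TagDispatcher, run() and HeadingLogger are hypothetical, the handlers take no arguments (a simplification), and it assumes, as in sgmllib, that "tag" in the method names stands for the lower-cased tag name.

```python
import re


class TagDispatcher:
    """Toy dispatcher (hypothetical): maps tags to start_/end_/do_ methods."""

    _tag = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

    def run(self, html):
        for closing, name in self._tag.findall(html):
            name = name.lower()
            if closing:
                handler = getattr(self, "end_" + name, None)
            else:
                # Prefer start_<tag>; fall back to do_<tag> for tags
                # that need no closing tag.
                handler = (getattr(self, "start_" + name, None)
                           or getattr(self, "do_" + name, None))
            if handler is not None:
                handler()


class HeadingLogger(TagDispatcher):
    def __init__(self):
        self.log = []

    def start_h1(self):          # <H1> requires a closing tag
        self.log.append("start h1")

    def end_h1(self):
        self.log.append("end h1")

    def do_p(self):              # <P> requires no closing tag
        self.log.append("paragraph break")


logger = HeadingLogger()
logger.run("<H1>Title</H1><P>Body text")
```

After the call, logger.log holds the three handler invocations in document order; tags without a matching method are silently skipped in this sketch.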
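The module defines the following classes:
HTMLParser: The most basic HTML parser class. In addition to what it inherits from the SGMLParser base class, it defines an entity for the bullet character (•). It also defines handlers for the following tags: <LISTING>...</LISTING>, <XMP>...</XMP>, and <PLAINTEXT> (the latter is terminated only by end of file).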
CollectingParser: This class, derived from HTMLParser, collects various useful bits of information from the HTML text. To this end it defines additional handlers for the following tags: <A>...</A>, <HEAD>...</HEAD>, <BODY>...</BODY>, <TITLE>...</TITLE>, <NEXTID>, and <ISINDEX>.
FormattingParser: This class, derived from CollectingParser, interprets a wide selection of HTML tags so it can produce formatted output from the parsed data. It is initialized with two objects: a formatter, which should define a number of methods to format text into paragraphs, and a stylesheet, which defines a number of static parameters for the formatting process. Formatters and style sheets are documented later in this section.
AnchoringParser: This class, derived from FormattingParser, extends the handling of the <A>...</A> tag pair to call the formatter's bgn_anchor() and end_anchor() methods. This allows the formatter to display the anchor in a different font or color, etc.
Instances of CollectingParser (and thus also instances of FormattingParser and AnchoringParser) have the following instance variables:
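anchornames: A list of the values of the NAME attributes of the <A> tags encountered.
anchors: A list of the values of the HREF attributes of the <A> tags encountered.
anchortypes: A list of the values of the TYPE attributes of the <A> tags encountered.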
inanchor: Outside an <A>...</A> tag pair, this is zero. Inside such a pair, it is a unique integer, which is positive if the anchor has an HREF attribute, negative if it hasn't. Its absolute value is one more than the index of the anchor in the anchors, anchornames and anchortypes lists.
Further variables record whether the <ISINDEX> tag has been encountered (true if so); the attributes of the <NEXTID> tag encountered, or an empty list if none; and the text of the <TITLE>...</TITLE> tag pair, or '' if no title has been encountered yet.
The anchors, anchornames and anchortypes lists are "parallel arrays": items in these lists with the same index pertain to the same anchor. Missing attributes default to the empty string. Anchors with neither an HREF nor a NAME attribute are not entered in these lists at all.
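The bookkeeping for the parallel lists and the sign convention of inanchor can be sketched as follows. This is a minimal illustration, not the module's code: collect_anchors is a hypothetical helper, attributes are passed as plain dicts, and the value reported for anchors that are not entered in the lists (None here) is an assumption, since the text does not specify it.

```python
def collect_anchors(anchor_attrs):
    """anchor_attrs: one dict of attributes per <A> tag, in document order."""
    anchors, anchornames, anchortypes = [], [], []
    ids = []  # the inanchor value each <A> would produce
    for attrs in anchor_attrs:
        href = attrs.get("href", "")
        name = attrs.get("name", "")
        if not href and not name:
            ids.append(None)   # assumption: not entered, no id recorded
            continue
        anchors.append(href)
        anchornames.append(name)
        anchortypes.append(attrs.get("type", ""))  # missing -> empty string
        index = len(anchors) - 1
        # Positive with an HREF, negative without; magnitude is index + 1.
        ids.append(index + 1 if href else -(index + 1))
    return anchors, anchornames, anchortypes, ids


anchors, anchornames, anchortypes, ids = collect_anchors([
    {"href": "intro.html"},
    {"name": "top"},
    {},
    {"href": "ref.html", "name": "ref", "type": "text/html"},
])

# Recover the list index of the last anchor from its id.
last_index = abs(ids[3]) - 1
```

Here anchors[last_index], anchornames[last_index] and anchortypes[last_index] all describe the same anchor, which is what makes the three lists parallel.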
The module also defines a number of style sheet classes. These should never be instantiated; their class variables are the only behavior required. Note that style sheets are specifically designed for a particular formatter implementation. The currently defined style sheets are:
The style sheet for use with the stdwin module, which is an alias for either X11Stylesheet or MacStylesheet.
A style sheet for use with the SGI-specific modules gl and fm.
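Style sheet class variables supply the font definitions that are passed to the formatter's setfont() method, including those for headers (text inside <H1>...</H1> tag pairs etc.), and the indentation used for <DD> tags, for <UL> tags, and for literal text (<PRE>...</PRE> and similar tag pairs).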
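The idea of a style sheet as a class that is never instantiated can be sketched like this. All names below (PlainStylesheet, WideStylesheet, DefaultStylesheet, the *_indent variables and indent_for_list) are hypothetical and are not the module's own; how the real alias is selected is also an assumption.

```python
import sys


class PlainStylesheet:
    """Hypothetical style sheet: only class variables, never instantiated."""
    body_indent = 0
    list_indent = 2
    literal_indent = 4


class WideStylesheet(PlainStylesheet):
    """Hypothetical variant overriding a single parameter."""
    list_indent = 8


# An alias chosen once at import time (the platform test is an assumption).
DefaultStylesheet = WideStylesheet if sys.platform == "darwin" else PlainStylesheet


def indent_for_list(stylesheet, depth):
    # The class object itself is passed around; no instance is created.
    return stylesheet.body_indent + depth * stylesheet.list_indent
```

Because the parameters are read straight off the class, a variant only needs to subclass and override the values that differ.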
The FormattingParser class assumes that formatters have a certain interface. This interface requires the following methods:
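flush() and addword are among them. A flag associated with addword should be set to false after a non-empty word has been added.
The justification is given as 'c' (center), 'l' (left justified), 'r' (right justified) or 'lr' (left and right justified).
bgn_anchor() and end_anchor() each refer to the parser's inanchor attribute.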
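What the four justification codes could mean for a single line of words is sketched below. The function justify() and its parameters are hypothetical and are not part of the formatter interface; for 'lr' the sketch assumes that spare space is spread over the gaps between words, leftmost gaps first.

```python
def justify(words, width, just):
    """Lay out words on one line of the given width ('c', 'l', 'r' or 'lr')."""
    line = " ".join(words)
    if just == "l":
        return line.ljust(width)
    if just == "r":
        return line.rjust(width)
    if just == "c":
        return line.center(width)
    if just == "lr":
        gaps = len(words) - 1
        if gaps < 1 or len(line) > width:
            return line.ljust(width)   # nothing to stretch
        spare = width - sum(len(w) for w in words)
        base, extra = divmod(spare, gaps)
        # The leftmost `extra` gaps get one additional space.
        parts = []
        for i, word in enumerate(words[:-1]):
            parts.append(word + " " * (base + (1 if i < extra else 0)))
        return "".join(parts) + words[-1]
    raise ValueError("unknown justification: %r" % (just,))
```

For example, justify(["a", "bb", "ccc"], 10, "lr") stretches both gaps to two spaces so that the line touches both margins.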
A sample formatter implementation can be found in the module fmt, which in turn uses the module Para. These modules are not intended as standard library modules; they are available as an example of how to write a formatter.
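A far smaller illustration of a formatter than fmt might look like the sketch below. It is not taken from fmt or Para: ListFormatter is a hypothetical class, the method signatures (a single word for addword, a single id for the anchor methods) are assumptions, and so is the reading of flush() as "finish the current line".

```python
class ListFormatter:
    """Hypothetical minimal formatter; method signatures are assumptions."""

    def __init__(self):
        self.lines = []   # finished lines
        self.words = []   # words of the line being built

    def addword(self, word):
        if word:
            self.words.append(word)

    def flush(self):
        # Assumed meaning: finish the current line, if there is one.
        if self.words:
            self.lines.append(" ".join(self.words))
            self.words = []

    def bgn_anchor(self, anchor_id):
        # Mark the start of an anchor instead of switching font or color.
        self.words.append("[%d:" % anchor_id)

    def end_anchor(self, anchor_id):
        self.words.append("]")


formatter = ListFormatter()
for word in ["See", "the"]:
    formatter.addword(word)
formatter.bgn_anchor(1)
formatter.addword("manual")
formatter.end_anchor(1)
formatter.flush()
```

After the final flush(), formatter.lines contains one line in which the anchor text is bracketed and tagged with its id.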