StructuredTextNG - Format


Scope

This document is being written as a contribution to the work for PEP 216 (Docstring Format).

It is an attempt to describe the format of StructuredTextNG more precisely than at StructuredTextNGRules.

Note that it is not an official statement, but is derived from trying to deduce what is meant, and also from looking at the source code to see what is (or might be) done.

NB: see NGDocumentation for how one uses the StructuredTextNG system to process text into structured text.

Text highlighted:

like this

is quoted from StructuredTextNGRules.


Changes

Changes from 2000-11-15


Still to think about

My "todo list" (or "things not to forget to think about"):


Paragraphs

A structured string consists of a sequence of paragraphs separated by one or more blank lines. Each paragraph has a level which is defined as the minimum indentation of the paragraph. A paragraph is a sub-paragraph of another paragraph if the other paragraph is the last preceding paragraph that has a lower level.

The basic input to StructuredTextNG is a string representing the text to be processed. The output is a structure representing a hierarchy of paragraphs (specifically, a list of StructuredTextParagraph objects for the "top level" paragraphs, each of which may have a list of sub-paragraphs, each of which...).

Splitting into paragraphs
The first step of structuring is to split the list of lines into a list of paragraphs, where paragraphs are separated by one or more blank lines.

The effect is as if the following Python were obeyed (actually, this essentially is the Python that is obeyed, but rendered more verbosely):

    import re
    # `text` is the input string to be processed.
    # Remove any confusing tabbing - follow Python rules!
    text = ("\n\n" + text + "\n\n").expandtabs()
    # Split into paragraphs
    blankline = re.compile(r"\n\s*\n")
    paragraphs = blankline.split(text)

Any "empty" paragraphs (i.e., those which consist solely of whitespace) are discarded.

We now have a list of "paragraphs" - note that these are not lists of lines, but still just "chunks of text", with embedded NEWLINE characters.
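
The whole splitting step, including the discarding of empty paragraphs, can be sketched as one function. This is an illustration only and not the STNG source; the name split_paragraphs and the sample text are made up for the purpose.

```python
import re

def split_paragraphs(text):
    """Split text into non-empty paragraph chunks (a sketch, not STNG's own code)."""
    # Expand tabs first so that indentation can later be measured in spaces.
    text = ("\n\n" + text + "\n\n").expandtabs()
    # One or more blank (whitespace-only) lines separate paragraphs.
    chunks = re.split(r"\n\s*\n", text)
    # Discard "empty" paragraphs, i.e. those that are only whitespace.
    return [chunk for chunk in chunks if chunk.strip()]

sample = "Title\n\n  First child\n  continues here\n\n\n  Second child\n"
paragraphs = split_paragraphs(sample)
```

Each resulting chunk keeps its leading spaces and its embedded NEWLINE characters, which is what the next step relies on.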

Calculating the indentation
For each paragraph, its indentation is calculated - this is done simply by counting the spaces at its start (remember, tabs were converted to spaces earlier on).

It is important to note that the indentation of a paragraph depends only on the indentation of its "first line" (i.e., its start). Any internal indentation is invisible to the processor, since it does not look at the innards of the paragraph.

Thus:

    This is a single line
	

and:

    * This is a list element
      with continuation that is indented the way I would do it
	

and

    * This is a list element
    with continuation that is indented rather less
	

are all considered to have the same paragraph indentation.
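
A minimal sketch of that calculation, assuming tabs have already been expanded (the function name paragraph_indent is hypothetical), shows the three examples above coming out the same:

```python
def paragraph_indent(paragraph):
    """Count the spaces at the start of a paragraph chunk.

    Only the start of the chunk is examined, so any indentation on
    continuation lines has no effect on the result.
    """
    return len(paragraph) - len(paragraph.lstrip(" "))

single = "    This is a single line"
neat = "    * This is a list element\n      with continuation indented more"
ragged = "    * This is a list element\n    with continuation indented less"

indents = [paragraph_indent(p) for p in (single, neat, ragged)]
```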

Structuring according to indentation levels
Given a list of paragraphs with known indentations, it is then simple to assign a level to each paragraph. If it has no indentation, it is at level 0. If it has the same indentation as the preceding paragraph, it has the same level as that paragraph. If it has greater indentation than the preceding paragraph, it is at a level one (or more) higher, depending on its actual indentation. These rules should be familiar, as they are essentially those used by Python itself.

Given this structure, one can then form a list of level 0 paragraphs, with the level 1 paragraphs as "children" of the level 0 paragraphs, and so on.

This is the basic structure used by StructuredTextNG.
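
One way to picture the nesting is with a stack of "open" paragraphs. The following is a sketch under my own assumptions: each node is a plain (text, children) pair rather than a StructuredTextParagraph, and build_tree is a made-up name.

```python
def build_tree(paragraphs):
    """Nest paragraph chunks by indentation; return the top-level nodes.

    Each node is a (text, children) pair. This shape is chosen for
    illustration and is not the class used by STNG.
    """
    top = []
    stack = []  # (indent, children list) for each currently open ancestor
    for para in paragraphs:
        indent = len(para) - len(para.lstrip(" "))
        # Close every open paragraph that is not less indented than this one.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        children = []
        node = (para.strip(), children)
        # Attach to the nearest less-indented paragraph, or to the top level.
        (stack[-1][1] if stack else top).append(node)
        stack.append((indent, children))
    return top

tree = build_tree(["Title", "  First child", "    Grandchild",
                   "  Second child", "Another top"])
```

Here "Title" ends up with two children, the first of which has one child of its own, and "Another top" is a second top-level paragraph.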

Notes:


Terminology

In the following, I use these terms with specific meaning:

ASTERISK-The character "*"
COLON-The character ":"
COMMA-The character ","
DOT-The character "."
DOUBLEQUOTE-The character """
EXCLAMATION-The character "!"
HYPHEN-The character "-"
NEWLINE-The character "\n". Possibly also the sequence carriage-return, newline ("\r\n")
SLASH-The character "/"
QUOTE-The character "'"
SEMICOLON-The character ";"
SMALL-O-The character "o"
SPACE-The character " "
TILDE-The character "~"
QUERY-The character "?"
AT-The character "@"
L-BRACKET-The character "["
R-BRACKET-The character "]"
whitespace-any of SPACE, NEWLINE
letter-any of a-z or A-Z
digit-any of 0-9
roman-digit-any of ivxlcdmIVXLCDM
alphanumeric-any letter or digit
enumeration-either a single letter or one or more digits or one or more roman-digits
punctuation-any of ,.;:!?
URL-A URL (e.g., http://www.zope.org/) - the exact form is left unspecified for the moment...
not(FRED)-Any characters except that (or those) specified - for instance, not(QUOTE) would mean any character except QUOTE.
begpara-This is intended to match the beginning of a paragraph (which is, of course, just the start of the paragraph's "string").
endpara-This is intended to match the end of a paragraph (which is, of course, just the end of the paragraph's "string").
begline-This is intended to match either begpara or NEWLINE

Items within "[" and "]" are optional. Items on either side of "|" are alternatives (and "|" only applies to the items immediately adjacent to it).


Special constructs (Colorizing)

Once one has the text organised as hierarchical paragraphs, one can start to process the result to obtain "proper" structured text. This is done by the code in DocumentClass.py, and is termed "colourising" (well, OK, I guess it's termed "colorizing", but I tend to spell that wrong).

StructuredTextNGRules presents the rules for colorizing StructuredText in an informal manner. The following is my interpretation of a stricter version of these rules, presented slightly more formally. Note that it does not necessarily reflect anyone else's actual intent.

Note that it is not clear to me whether/how colorized regions can be embedded within each other. My assumption (or, putting it another way, what I would like) is that it works so that:

A single-line paragraph with subparagraphs is treated as a header.

This is perfectly clear.

A paragraph that begins with a -, *, or o is treated as an unordered list (bullet) element.

Any of the sequences:

is treated as the start of an unordered (bullet) list element. Which type of "bullet" was used will be remembered.

Note: I would like to replace begpara with begline.

A paragraph that begins with a sequence of digits followed by a white-space character is treated as an ordered list element.

A paragraph that begins with a sequence of sequences, where each sequence is a sequence of digits or a sequence of letters followed by a period, is treated as an ordered list element.

Any of the sequences:

is treated as the start of an ordered list element. The enumerations for the list element will be remembered.

Note: I would like to replace begpara with begline.

A paragraph with a first line that contains some text, followed by some white-space and -- is treated as a descriptive list element. The leading text is treated as the element title.

The sequence:

is treated as the start of a descriptive list element. The not(NEWLINE) text is treated as the element title, and may not be of zero length.

Note: I would like to replace begpara with begline.

Sub-paragraphs of a paragraph that ends in the word example or the word examples, or :: is treated as example code and is output as in::
   For instance ::

    <table border=0>
      <tr>
        <td> Foo 
    </table>
	    

If a paragraph with subparagraphs ends with COLON COLON, those subparagraphs will be treated as "example" text, and will not be colourised.

(Note that the words example or examples are not, in fact, treated specially at the end of a paragraph in STNG.)

Note that the COLON COLON will be kept as such, so will appear in the output, unless the formatting tool treats them specially.

For example:
    Code::

        We can have *fred* here if we like...
	    

might appear as:

Code::

        We can have *fred* here if we like...
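
The COLON COLON test itself is easy to sketch. The code below assumes a paragraph is held as a (text, children) pair, which is my own representation and not an STNG class; example_children is likewise a made-up name.

```python
def example_children(node):
    """Return the sub-paragraphs to leave uncolorized, or [] if there are none.

    `node` is a hypothetical (text, children) pair. The words "example"
    and "examples" are deliberately not treated specially.
    """
    text, children = node
    if children and text.rstrip().endswith("::"):
        return children
    return []

code = ("Code::", [("We can have *fred* here if we like...", [])])
literal = example_children(code)
```

The paragraph text is returned untouched, so the trailing COLON COLON stays available to whatever formats the output.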
	    

Text enclosed in single quotes (with white-space to the left of the first quote and whitespace or punctuation to the right of the second quote) is treated as example code::
    For example: '<dtml-var foo>'.
	    

In-line example text is represented as:

The text within QUOTEs will not be colorized.

Text surrounded by * characters (with white-space to the left of the first * and whitespace or punctuation to the right of the second *) is emphasized.

Emphasized text is represented as:

The text within ASTERISKs is emphasized.

Text surrounded by ** characters (with white-space to the left of the first ** and whitespace or punctuation to the right of the second **) is made strong.

Strong text is represented as:

The text within double ASTERISKs is strong.

Note that the sequence **Some *bold* text** is allowed, but the sequence **Some *bold text*** may or may not be allowed (I'd like it if it were).
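
As a rough illustration of the three in-line rules (literal, emphasized, strong), here are regular expressions of my own devising; they are not the expressions used in DocumentClass.py. I assume that the start of the text counts as "white-space to the left", that the end of the text counts as "whitespace to the right", and that punctuation means any of ,.;:!? as in the terminology above. Nested markup such as the strong-within-emphasis case is not handled.

```python
import re

LEFT = r"(?<!\S)"             # start of text, or preceded by whitespace
RIGHT = r"(?=\s|[,.;:!?]|$)"  # followed by whitespace, punctuation or end

LITERAL = re.compile(LEFT + r"'([^'\n]+)'" + RIGHT)
STRONG = re.compile(LEFT + r"\*\*([^*\n]+)\*\*" + RIGHT)
EMPHASIS = re.compile(LEFT + r"\*([^*\n]+)\*" + RIGHT)

text = "Use *care* with **strong** words and 'code', please."
found = {
    "literal": LITERAL.findall(text),
    "strong": STRONG.findall(text),
    "emphasis": EMPHASIS.findall(text),
}
```

Because an opening delimiter needs whitespace (or nothing) to its left, something like 2*3*4 or the apostrophe in "don't" is left alone.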

Text enclosed by double quotes followed by a colon, a URL, and concluded by punctuation plus white space, or just white space, is treated as a hyper link.
For example, "Zope":http://www.zope.org/ is interpreted as Zope
Note: This works for relative as well as absolute URLs.

Hyperlinks may be represented in several ways. The first is:

The text "Zope":http://www.zope.org/ will be represented as Zope.

This mechanism is designed to work for both relative and absolute URLs.

Text enclosed by double quotes followed by a comma, one or more spaces, an absolute URL and concluded by punctuation plus white space, or just white space, is treated as a hyper link.
For example: "mail me", mailto:amos@digicool.com is interpreted as mail me

An alternative representation for URLs is:

The text "Zope", http://www.zope.org/ will be represented as Zope.

This mechanism is only designed to work for absolute URLs.
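
Both hyperlink forms can be sketched with regular expressions. Since the exact form of a URL is left unspecified, the code below makes two assumptions of its own: a URL is any run of non-whitespace characters (less one trailing punctuation mark), and an "absolute" URL is one that starts with a scheme name and a colon. The names find_links, COLON_LINK and COMMA_LINK are made up.

```python
import re

# The URL stops before optional punctuation that is followed by whitespace
# or by the end of the text.
END = r"(?=[,.;:!?]?(?:\s|$))"
COLON_LINK = re.compile(r'"([^"\n]+)":(\S+?)' + END)
COMMA_LINK = re.compile(r'"([^"\n]+)",[ ]+([A-Za-z][A-Za-z0-9+.-]*:\S+?)' + END)

def find_links(text):
    """Return (label, url) pairs for both link forms, in order of appearance."""
    hits = [(m.start(), m.group(1), m.group(2))
            for pattern in (COLON_LINK, COMMA_LINK)
            for m in pattern.finditer(text)]
    return [(label, url) for _, label, url in sorted(hits)]

links = find_links('See "Zope":http://www.zope.org/ or '
                   '"mail me", mailto:amos@digicool.com.')
```

Note that the final DOT of the sample sentence is not taken to be part of the mailto URL.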

SGML text is ignored and output as is.

This is implicit in that "SGML" text is not mentioned in the other paragraphs.


The actual rules

Inspection of the source code for (a recent version of) STNG, dated about 2000-11-09, appears to show the following implementation details. These are, of course, not necessarily final, but since I assume each implementation "oddity" has a sensible reason for having come into being, they are worth studying - they may be what STNG finally ends up providing...


Other notes


Proposals

Out-of-line URLs
A proposed representation for out-of-line URLs is:

The first item occurs in the body of the text, and the second item occurs on a line by itself, and provides the URL.

Thus "Zope"[zp] would be represented as Zope if there was a later line of the form:

.. [zp] http://www.zope.org/

I believe this to be (more or less) the same as the mechanism currently implemented in StructuredTextClassic, and the same as the proposal at OutOfLineURL.
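
A sketch of how such references might be resolved follows. It assumes the key contains no whitespace or R-BRACKET and that the definition line has exactly the shape shown above; resolve_links is a hypothetical name, and this is not the StructuredTextClassic code.

```python
import re

REFERENCE = re.compile(r'"([^"\n]+)"\[([^\]\s]+)\]')
DEFINITION = re.compile(r"^\.\. \[([^\]\s]+)\] (\S+)\s*$", re.MULTILINE)

def resolve_links(text):
    """Pair each "label"[key] reference with the URL from its '.. [key] URL' line."""
    targets = dict(DEFINITION.findall(text))
    # References whose key has no definition line are simply left out.
    return [(label, targets[key])
            for label, key in REFERENCE.findall(text)
            if key in targets]

text = 'Visit "Zope"[zp] today, or "nowhere"[nw].\n\n.. [zp] http://www.zope.org/\n'
resolved = resolve_links(text)
```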

(NB: also see DefaultURLInterpretation for another relevant suggestion, although one of less interest to us, I think.)

In-line Python code fragments
The 'text' mechanism for escaping literal text is considered by some to be unnatural for Python code because of Python's tendency to use ' as a string delimiter.

(The same argument doesn't apply to " because it has no meaning for a "bare" string, only for strings used as URL descriptors.)

Can we do without this? (No, say I and some others.)

The main proposal that resulted was to enclose Python code fragments within # (HASH) delimiters, as follows:

(Note that Python comments within example code would still be safe, since example code is not colourised.)

Some discussion ensued about how that related to the next item, Python code references.

Python code references
The docstring processor needs to be able to create "references" (links) from the names of Python modules, classes, functions/methods, values and so on to the relevant Python entity.

Two forms of this are wanted:

  1. "guessing"
  2. "explicit"

The "guessing" technique takes the highlighted entity, and attempts to deduce what it is (a class name, method name, function name, etc.) by reference to the module including the docstring, using the normal Python rules for resolving a name. It then creates a link to the answer (or, perhaps, to the most likely answer).

The "explicit" technique needs to allow the docstring author to indicate what type of Python entity is being linked to - this can become necessary, for instance, if one has a class and method of the same name - from within the class's docstring, it is not clear which is required, and it may be undesirable to say #self.name#.

Note that a Python programmer will be willing to lose a little "naturalness" in the text for the sake of achieving this linkage.
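
To make the "guessing" technique a little more concrete, here is a much simplified sketch. It looks a dotted name up in a module's namespace and then in the builtins, which is only part of the normal Python rules (no class or local scopes), and it classifies the result rather than building a link. The function name guess_reference and the kind labels are my own; the standard string module is used purely as something to look names up in.

```python
import builtins
import inspect
import string  # used only as a convenient module to look names up in

def guess_reference(name, module):
    """Return (kind, object) for a dotted name, or None if it cannot be found."""
    first, *rest = name.split(".")
    # Look in the module's namespace first, then fall back to the builtins.
    for namespace in (module, builtins):
        if hasattr(namespace, first):
            obj = getattr(namespace, first)
            break
    else:
        return None
    # Follow any remaining attributes, e.g. "Class.method".
    for attr in rest:
        if not hasattr(obj, attr):
            return None
        obj = getattr(obj, attr)
    if inspect.ismodule(obj):
        kind = "module"
    elif inspect.isclass(obj):
        kind = "class"
    elif inspect.isroutine(obj):
        kind = "function"
    else:
        kind = "value"
    return kind, obj

guess = guess_reference("Template.substitute", string)
```

The "explicit" form would bypass the classification at the end and take the kind from the docstring author instead.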

The main proposals for delimiting the entity name were:

Proposed means of indicating what type of thing is to be referenced still to be annotated, but I can't help thinking that something along the lines of
    #name#:Class

(in pseudo-parallel with URL references) might work...

Test code detection
Tim Peters' doctest utility will run Python code examples embedded in docstrings (for testing purposes). To do this, it needs to identify said code examples. It has been proposed that the docstring utility should have the same functionality, and treat such code examples as examples (without the need for the preceding paragraph to end in ::).

As an example, consider the following:

    """A simple docstring for a simple function.

    Arguments
      arg1 -- an integer
      arg2 -- another integer, to be added to the first

    In normal usage, one just adds numbers together:

        >>> add(1,3)
        4
        >>> add(-1,-2)
        -3

    Unfortunately, we don't cope with real numbers:

        >>> add(1.0,2.0)
        ERROR: can't add floating point numbers
    """
	

I'm not sure if this is in Python 2.0 as standard, but see (for instance) FAQTS - Docstring-driven testing for details.

Note: I regard this as a very important thing to include in docstring support - to my mind it is more important than the issue of references into the code (above).

It is worth noting that by Tim's rules about what can be considered as test text (i.e., that any blank lines must be absent) this already counts as a paragraph in the STNG sense, so should be very easy to detect...
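
A sketch of that detection, assuming a test paragraph is simply one whose first non-blank text is the interpreter prompt (the function name doctest_paragraphs is made up, and this is not how doctest itself finds its examples):

```python
import re

def doctest_paragraphs(docstring):
    """Return the paragraph chunks that begin with the '>>>' prompt."""
    chunks = re.split(r"\n\s*\n", docstring.expandtabs())
    return [chunk for chunk in chunks if chunk.lstrip().startswith(">>>")]

doc = """A simple docstring for a simple function.

    In normal usage, one just adds numbers together:

        >>> add(1,3)
        4
        >>> add(-1,-2)
        -3
    """

found = doctest_paragraphs(doc)
```

Both interactions land in a single chunk, because there is no blank line between them.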

Escaping characters
I can't find the Wiki page for this item, although I'm sure there is one.

In normal use of ST[NG], there is almost no need to be able to "escape" a character. Given this, and given the requirement for StructuredText to read naturally of itself, Jim Fulton is reluctant to introduce general "quoting" of special characters, unless it can be justified very carefully.

When is it actually needed? In traditional ST usage, I suspect only in the context of 'literal text', when one might want to insert a ', and this is both relatively unlikely, and probably possible to write "around".

In a docstring environment, however, it is a certainty that one day someone will want to do the equivalent of

    'And the single quote is "'"'

If we have HASH quoting for Python code fragments, the same problem would arise with:

    #print "#",#

A suggestion was made of using !, since it is used to escape Zwiki names, but this doesn't look very good, and Jim Fulton doesn't like it. He has come up with some suggestions for alternative (balanced) quoting schemes that I don't entirely follow, but the text is on the Zwiki page I can't find. I don't believe it addressed the Python-specific problems, anyway.

I have in the past suggested the TILDE character for such escaping (doubling it to get the actual character), but that was more in a sense of desperation than anything else. We need to consider what characters are rarely, if ever, used in Python, and how they look (for instance, AT would be a bad choice visually).

It would be nice (although not essential) if we also solved the problem for "mainstream" STNG...

Tagged paragraphs
An earlier attempt at resolving the docstring problem made extensive use of markup by "tagged paragraphs", allowing one to say, for instance:

Author: Guido van Rossum

or

History:
    0.1: June 1920
    0.2: July 1919

or even

Arguments:
    name -- The person's full name
    age -- Their age

The : at the end of the "tag" was not finalised - in context of the time :: was also considered (I think a single COLON would do, actually, and we can't use :: without causing some confusion!).

The idea also made heavy use of the ability not to require "spurious" blank lines between content lines, and it was assumed that some of the parsing of the content would be dependent upon the actual tag.

A list of tags would need to be made - proposals made included:
Argument[s], Returns, Exception[s], Author, History, Version, and maybe Code (this may be redundant) and Links (intended to introduce the links section of an OutOfLineURL, but probably redundant).
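
Purely as an illustration of how a tagged paragraph might be recognised, here is a sketch that assumes a single COLON ends the tag and that only the first six proposed tags (with their plural forms) are known. The names parse_tagged and KNOWN_TAGS are hypothetical, and no tag-specific parsing of the content is attempted.

```python
import re

KNOWN_TAGS = {"Argument", "Arguments", "Returns", "Exception", "Exceptions",
              "Author", "History", "Version"}
TAGGED = re.compile(r"^\s*([A-Z][A-Za-z]*):[ ]*(.*)$", re.DOTALL)

def parse_tagged(paragraph):
    """Return (tag, content lines) for a tagged paragraph, else None."""
    match = TAGGED.match(paragraph)
    if not match or match.group(1) not in KNOWN_TAGS:
        return None
    # Content may follow the tag on the same line or on indented lines below.
    lines = [line.strip() for line in match.group(2).splitlines()]
    return match.group(1), [line for line in lines if line]

author = parse_tagged("Author: Guido van Rossum")
arguments = parse_tagged("Arguments:\n    name -- The person's full name\n"
                         "    age -- Their age")
```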

My gut feeling is that this is a Really Good Idea, and that we should leave it alone for the moment.

Reference: see David Ascher's email of 1999-11-29 in the DocSIG, [Doc-SIG] docstring grammar

Tables
A mechanism for representing tables would be good. Of course, we're not alone in thinking this, and indeed a mechanism was provided in StructuredTextClassic, but it does not read naturally in the "bare", and Jim Fulton has dropped it from STNG (it was apparently snuck in to STClassic when he wasn't looking).

(The Wiki page for discussing the proposal to reintroduce that representation, or indeed an alternative, is StructuredTextTables.
My personal feeling is that this is likely to take a long while to find a solution for, since tables are difficult to describe with ASCII text in a way that everyone finds palatable.)


Using StructuredTextNG

The StructuredText module appears (ignoring the residual StructuredTextClassic content) to work as follows:

STDOM.py
Provides DOM functionality (for reading only).
ST.py
Imports STDOM
Provides a function StructuredText() which takes a list of lines and returns a StructuredTextDocument instance. The StructuredTextDocument class is a subclass of StructuredTextParagraph, which is in turn a subclass of STDOM.Element.
DocumentClass.py
Imports ST and STDOM
Provides functionality which takes a StructuredTextDocument instance (for instance) and processes its innards according to the rules of STNG.
HTMLClass.py
Imports ST
Provides functionality to take a StructuredTextDocument instance (for instance) and emit HTML.
DocBookClass.py
Provides functionality to take a StructuredTextDocument instance (for instance) and emit DocBook text.

The __init__.py for the StructuredText module provides the following (again, ignoring Classic stuff):

.Basic()
This is a pseudonym for the StructuredText function.
.HTML()
This is the interface to the conversion to HTML.
.Document
This is the interface to the Document Class functionality.
.DocBookBook
This creates an instance of a DocBook book.
.DocBookChapter and .DocBookArticle
These are the interface to the DocBookClass "conversion" functionality.


Tibs (tibs@tibsnjoan.co.uk)

Last modified: Thu Nov 16 12:00:47 GMT Standard Time 2000