Structured Text for Python docstrings

STpy documentation release 0.2.2/"merry christmas and a happy..."
("blockquote" support added; amendment to Python indentation)
(various typos fixed, and some reorganisation)


Scope

This document is being written as a contribution to the work for PEP-0216 (Docstring Format).

It aims to document what the Python docutils package should support, with a particular emphasis on documentation strings ("docstrings"). It does not (yet) document what the code I have been writing does support, because I haven't finished that code yet.


What's in a name, then?

Obvious names for this thing are (in parallel with STNG) either STpy (not easy to say, but follows the same rules), or pyST (easier to say, and sillier - does that make it more likely to be adopted?).

(Hmm - is STpy too easily confused with STk?)

(OK - the pronunciation of pyST is obvious (no, I wasn't thinking of "like those snowy Alp things you ski down"). I vote, however, that STpy be pronounced to rhyme with "guppy", 'cos that's easy.)


docutils

The Python package docutils is intended as a repository for code that reads and manipulates the Python specific variant of Structured Text.

Structured text is ultimately based on setext (see, for example, Setext information and samples).

In the early days of the Doc-SIG, suggestions were made that a variant of setext be adopted for use in Python docstrings.

Digicool did adopt an improved variant of setext for use in Zope, and called it StructuredText. The term is used for both the format and the software used to read that format and produce (typically) HTML as output. StructuredText is used widely, notably within Zwiki pages, and in the preparation of The Zope Book.

The Zope world is currently working on the specification of StructuredTextNG (ST Next Generation, of course).

STpy is perhaps best seen as a sibling format - although it would almost certainly be possible to implement the processor as a subclassed offspring of the Zope processor, so perhaps "child" would be more appropriate.

It is the intention of this author, at least, that STpy remain closely linked with the syntax of STNG, not least because of the large amount of experience that the STNG maintainers and developers have acquired in what actually works.

I previously wrote a document attempting to define STNG slightly more formally than is currently done on the "official" pages: STNG-format.html

The goals of Structured Text are (amalgamating different sources wildly):

That last item is fundamental - note that the table format in ClassicStructuredText is considered to break this rule (I agree!) - it was snuck in when Jim Fulton was looking elsewhere, and will not (in that form) be in STNG.

Put another way, structured text is (plain) text that uses indentation and simple symbology to indicate the structure of a document.


stpy.py


Summary

stpy.py is a docutils utility. If used as an independent program, it finds all of the docstrings in a module or package, and produces XML from them.


Command line usage

to be written

Intended inputs include:

Searching needs to be optionally along the PYTHONPATH

Intended outputs include:

Only the first will be available at first...

Intended qualifiers include:


Importing docutils

to be written

Obviously, one might also want to import docutils, and then perform any or all of the phases "by hand" - for instance, IDLE might want to use the formatter followed by output of HTML to a browser to display documentation for a particular function...


How it works

stpy.py has three phases:

  1. Find -- Parse the Python file or files (using the parser implementation from Tools/compiler, in Python 2.0 and subsequent - if this is not present a local copy will be used), and locate the docstrings.
    (This phase is obviously omitted if a Structured Text file is being processed.)
  2. Parse -- Parse each docstring, and produce a DOM datastructure to represent them (note that a single datastructure is used to represent all of the docstrings found - see STpy DTD below for details).
  3. Output -- Output the resultant DOM tree in whatever format is chosen (initially, only XML will be available, but HTML and some form of LaTeX seem like probable later additions).
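
As a rough illustration of the "Find" phase only, here is a minimal sketch that uses the standard "ast" module as a modern stand-in for the Tools/compiler parser mentioned above. The function name find_docstrings is hypothetical, and this is not the stpy.py implementation (which is unfinished); it only looks at definitions that are direct children of a module, class or function body.

```python
import ast

def find_docstrings(source):
    """Return a list of (qualified_name, docstring) pairs (hypothetical helper)."""
    found = []

    def visit(node, prefix):
        # Only direct children are examined, so definitions nested inside
        # "if" or "try" blocks are missed by this sketch.
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                name = prefix + child.name
                doc = ast.get_docstring(child, clean=False)
                if doc is not None:
                    found.append((name, doc))
                visit(child, name + ".")

    tree = ast.parse(source)
    module_doc = ast.get_docstring(tree, clean=False)
    if module_doc is not None:
        found.append(("<module>", module_doc))
    visit(tree, "")
    return found
```

The docstrings are returned uncleaned (clean=False), so that indentation handling can be left to the later phases.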


STpy syntax


Overview

A structured text (of whatever type) may be considered as a text string to be split into paragraphs and then "colourised" (or marked up). Thus this section describes how the text is manipulated in those two phases (even if the implementation need not necessarily keep them so separate).


Terminology

The following "terms" are used in this document:

newline -- \n and/or \r. Note that end of paragraph is considered to count as newline as well...
whitespace -- one or more spaces
optional whitespace -- zero or more spaces
start of line -- the start of a line, followed by optional whitespace
start of paragraph -- the start of a paragraph, followed by optional whitespace
end of paragraph -- the end of a paragraph.
beginning punctuation -- any of ('".:
ending punctuation -- any of .:,;!?'")
some text -- one or more characters, the first of which may not be whitespace or newline
a URL -- a valid URL, with spaces "escaped" (this to be more precisely defined later on)

In addition, "enumerator sequence" and "enumerator" are defined in Ordered list items below.


Text representation

An important principle of StructuredText is that text is rendered by the formatter as the user typed it, with the exception of indentation and other formatting controls (as discussed below).

In particular, this means that given the text:

    This is some >>odd<< $text$
    

then an HTML formatter must produce:

    This is some &gt;&gt;odd&lt;&lt; $text$
    

and a LaTeX formatter must produce:

    This is some >>odd<< \$text\$
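
A minimal sketch of what "render as typed" could mean for the two formatters, assuming hypothetical helper names to_html and to_latex. The LaTeX table is deliberately tiny and is an assumption of mine; a real formatter would need more cases.

```python
import html

def to_html(text):
    # quote=False leaves " and ' alone; only &, < and > are replaced.
    return html.escape(text, quote=False)

# Assumed, incomplete table of characters that need a backslash in LaTeX.
LATEX_SPECIALS = {"$": r"\$", "%": r"\%", "&": r"\&", "#": r"\#", "_": r"\_"}

def to_latex(text):
    # Replace each special character, leaving everything else untouched.
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)
```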
    

This is an incompatibility with the (current) spec of STNG, which claims that "SGML text is ignored and outputed as is". Given that we wish to support (potentially) multiple output formatters, probably including some form of TeX/LaTeX, I cannot see how this would be feasible. I also think it is wrong - it means that sometimes the text I am writing will be mangled by the processor - for instance, if I use <item> type terms in a BNF diagram. Historically, allowing this sort of thing is known to cause problems - I cite the case of LaTeX users breaking things inadvertently by using raw TeX, or JavaDoc users who insist on inserting <h1> at some random point in the text.

If people really feel the need to insert random HTML, I would argue for a specific way of supporting it - probably via an html: clause (see Suggestions below, when I get round to writing the relevant bit).

By the way, I suspect the main things people would do are:

  • Want to use <br> - well, I think we may need a way of doing that.
  • Want to use &nbsp; - ditto.
  • Want to define tables - push for table support in ST!
  • Want to insert non-ASCII characters - use Python 2.0 and Unicode strings...


Docstring indentation

The Python tutorial says:

The Python parser does not strip indentation from multi-line string literals in Python, so tools that process documentation have to strip indentation if desired. This is done using the following convention. The first non-blank line after the first line of the string determines the amount of indentation for the entire documentation string. (We can't use the first line since it is generally adjacent to the string's opening quotes so its indentation is not apparent in the string literal.) Whitespace "equivalent" to this indentation is then stripped from the start of all lines of the string. Lines that are indented less should not occur, but if they occur all their leading whitespace should be stripped. Equivalence of whitespace should be tested after expansion of tabs (to 8 spaces, normally).

It may be assumed that this process will be performed on a docstring before it is passed to the paragraphing code...
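
The convention quoted above can be sketched as follows. The function name strip_docstring_indent is hypothetical, and this follows my reading of the tutorial text rather than any particular tool.

```python
def strip_docstring_indent(docstring):
    """Strip indentation using the first non-blank line after the first line."""
    lines = docstring.expandtabs(8).split("\n")
    indent = 0
    for line in lines[1:]:
        if line.strip():
            indent = len(line) - len(line.lstrip(" "))
            break
    result = [lines[0].lstrip()]
    for line in lines[1:]:
        leading = len(line) - len(line.lstrip(" "))
        # Lines indented less than `indent` lose all their leading whitespace.
        result.append(line[min(indent, leading):])
    return "\n".join(result)
```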


Splitting into paragraphs

The first step of structuring is to split the list of lines into a list of paragraphs.

First, all tabs are converted to spaces (using string.expandtabs). This means that structured text should not rely on tabs for any alignment purposes.

The structured text string is then split into lines, and these lines are used to assemble the paragraphs.

A new paragraph is considered to start at a line which follows one or more blank lines (lines containing nothing or just spaces), or at a line which appears to be the start of a list item (this will be explained later on!). For pedants, the first line of the text also starts a paragraph. Thus given the text:

    This is a line followed by a blank line.           [1]

    This is thus a new paragraph.                      [2]
    1. And so is this, because it starts a list.       [3]
       - And this is a sublist of that list,           [4]
         which continues onto another line.
    

we would expect to end up with four paragraphs. Each paragraph is flagged with the number of the line at which it starts (that is, the first line of a text is line 1).

For each paragraph, its indentation is then calculated - this is done simply by counting the spaces at its start (remember, tabs were converted to spaces earlier on).

It is important to note that the indentation of a paragraph depends only on the indentation of its "first line" (i.e., its start). Any internal indentation is invisible to the processor, since it does not look at the innards of the paragraph.
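
A minimal sketch of the splitting step, under stated assumptions: the LIST_START pattern is only a rough guess at "appears to be the start of a list item" (bullets and dotted enumerators, no descriptive items), and the name split_paragraphs is hypothetical.

```python
import re

# Assumed pattern: a bullet (o, - or *) or a dotted enumerator sequence,
# followed by whitespace and some text.
LIST_START = re.compile(
    r"\s*(?:[o*-]|(?:(?:[0-9]+|[A-Za-z]|[ivxlcdmIVXLCDM]+)\.)+)\s+\S")

def split_paragraphs(text):
    """Return a list of (start_line, indentation, lines) tuples."""
    paragraphs = []
    current = None
    for number, line in enumerate(text.expandtabs().split("\n"), start=1):
        if not line.strip():
            current = None              # a blank line ends the paragraph
            continue
        if current is None or LIST_START.match(line):
            indent = len(line) - len(line.lstrip(" "))
            current = (number, indent, [line])
            paragraphs.append(current)
        else:
            current[2].append(line)     # continuation of the same paragraph
    return paragraphs
```

Note that only the first line of each paragraph contributes to the recorded indentation, as described above.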

Given the indentation of the paragraphs, they can be arranged into a hierarchical structure (in much the same way that a Python program is organised by its indentation).

Specifically, each paragraph is assigned a level (the first paragraph arbitrarily gets level 0).

Given a paragraph P at indentation I and level L:
  • If the following paragraph has the same indentation, then it is at the same level, and is a sibling of P.
  • If the following paragraph has more indentation, then it is at the next level (L+1), and is considered a subparagraph of P. The indentation corresponding to the new level is remembered.
  • If the following paragraph has less indentation, then the corresponding level is looked up. If there is no corresponding level, then either an error is given (the current behaviour), or else an appropriate level has to be "guessed". The paragraph is added to the hierarchical structure as a sibling of the previous paragraph with the new level, and the indentations corresponding to all greater levels are forgotten.

Looking at an example:

    The first paragraph is at level 0.                 [1]

    This second paragraph is also at level 0.          [2]
       1. This list item has indentation 3 and is      [3]
          at level 1.

          This paragraph has indentation 6 and is      [4]
          thus at level 2.

       This paragraph is back at level 1. The          [5]
       previous indentation for level 2 goes.

            This paragraph has indentation 8, but      [6]
            is at level 2 again.

          This paragraph has indentation 6, which      [7]
          is a mistake...
    

As we process the paragraphs we find:

    paragraph  indentation  level    {level:indent}
    1          0            0        {0:0}
    2          0            0        {0:0}
    3          3            1        {0:0, 1:3}
    4          6            2        {0:0, 1:3, 2:6}
    5          3            1        {0:0, 1:3}
    6          8            2        {0:0, 1:3, 2:8}
    7          6            invalid  {0:0, 1:3, 2:8}

Paragraph 7 is a problem because its indentation does not match any appropriate preceding indentation (it can't "see" paragraph 4). This mimics the way that Python requires indentation to be consistent.
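
The level assignment can be sketched like this, assuming the "give an error" behaviour for an unmatched dedent. The name assign_levels is hypothetical; the dictionary plays the role of the {level:indent} column.

```python
def assign_levels(indents):
    """Map a sequence of paragraph indentations to levels.

    Raises ValueError when a paragraph dedents to an indentation that
    matches no remembered level.
    """
    levels = []
    remembered = {}                       # level -> indentation
    for position, indent in enumerate(indents, start=1):
        if not remembered:
            level = 0                     # the first paragraph gets level 0
        else:
            current = max(remembered)
            if indent == remembered[current]:
                level = current
            elif indent > remembered[current]:
                level = current + 1
            else:
                matches = [lv for lv, ind in remembered.items() if ind == indent]
                if not matches:
                    raise ValueError("paragraph %d: indentation %d matches no level"
                                     % (position, indent))
                level = matches[0]
                # Forget the indentations of all deeper levels.
                for deeper in [lv for lv in remembered if lv > level]:
                    del remembered[deeper]
        remembered[level] = indent
        levels.append(level)
    return levels
```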

Thus we now have a tree structure where the level 0 paragraphs are the children of the "document" as a whole, the level 1 paragraphs are their children, and so on.

On learning that paragraph relationships are determined by relative indentation, some people begin to worry that they will end up having to type most of their text preceded by a large number of spaces. I would say two things to this: firstly, they haven't lived if they haven't had to indent program code by hand in an editor without support for such (paraphrased: "tough luck"), and more importantly, in fact it isn't that bad, for reasons which will become apparent as more of the syntax is explained.


Initial paragraph specialisation

Once the basic paragraph tree has been obtained, there are some specialised "optimisations" which are conveniently done.

Callable objects

For a docstring associated with a "callable" object (that is, a function or method), the start of the text should provide a signature and a summary.

The signature is intended as a "wordy" description of how to call the function or method, suitable for use as a pop-up in an IDE (this is how, for instance, IDLE uses it).

It is made up of the name of the callable item, followed by a description of its argument list, and if it returns a value, followed by whitespace, followed by "-->", followed by whitespace, followed by a description of its return value.

Examples might be:

    len(object) --> integer
    List.pop([index]) --> item
	  

The summary is intended as a short summary of what the function or method does. It should fit on a single line.

Examples might be:

    Return the "length" of the object
    Remove the "top" item from the stack and return it.
	  

The signature and summary may be specified together on the first line of the docstring, as:

start of paragraph followed by signature followed by whitespace followed by -- followed by whitespace followed by summary followed by end of paragraph

for example:

    len(object) --> integer -- Return the "length" of the object.
		

or the first line of the docstring must be a paragraph consisting of:


signature

and the next paragraph must be a single line containing the summary - for example:

    len(object) --> integer

    Return the length of the object, in some sense.
	  

Note that the signature and summary do not get rendered as <p> and <dl> elements in the DOM tree, as one might expect. Instead, they are treated as <signature> and <summary> elements. The example would thus be treated as:

    <signature>
        len(object) --> integer
    </signature>
    <summary>
        Return the length of the object, in some sense.
    </summary>
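
A minimal sketch of separating the two forms, assuming the paragraphs have already been split; the helper name split_signature is hypothetical. Note that the "-->" inside a signature is not mistaken for the "--" separator, because the separator must have whitespace on both sides.

```python
import re

SEPARATOR = re.compile(r"\s+--\s+")

def split_signature(first_paragraph, second_paragraph=None):
    """Return (signature, summary); summary is None if none was found."""
    parts = SEPARATOR.split(first_paragraph.strip(), maxsplit=1)
    if len(parts) == 2:
        # Both were given together on the first line.
        return parts[0], parts[1]
    summary = second_paragraph.strip() if second_paragraph else None
    return parts[0], summary
```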
      

Literal paragraphs

If a non-literal paragraph ends with the characters "::", then any subparagraphs are literal.

The "::" will be rendered by the formatter as a single ":" (with the possible exception of languages where that is not a sensible convention, in which case a language-specific rendering should be adopted), unless there is no preceding text (i.e., the "::" is the entirety of the paragraph), in which case it will not be rendered.

This last means that:

    ::

        Literal text.
	  

is allowed, and gets rendered as if the "::" paragraph were "invisible".

The indentation of the lines within a literal paragraph is taken to be relative to the indentation of the parent paragraph. Thus if we have:

    This is the introduction, with indentation 4::

        This literal paragraph has absolute indentation 8,
        and so is rendered with indentation 4 (8-4) by the
        formatter.
    

Then we would expect the result (in HTML) to look something like:

    <p>This is the introduction, with indentation 4:

    <pre>
        This literal paragraph has absolute indentation 8,
        and so is rendered with indentation 4 (8-4) by the
        formatter.
    </pre>
    

Note that the content of literal paragraphs is not parsed, and will be passed through to the formatter without modification.
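
The two rules above (how "::" is rendered, and relative indentation) might be sketched as follows. Both helper names are hypothetical, and relative_indent assumes every non-blank literal line is indented at least as far as the parent.

```python
def render_intro(text):
    """Return (rendered_text, subparagraphs_are_literal)."""
    stripped = text.rstrip()
    if not stripped.endswith("::"):
        return text, False
    if stripped.strip() == "::":
        return "", True            # a bare "::" is not rendered at all
    return stripped[:-1], True     # "::" is rendered as a single ":"

def relative_indent(literal_lines, parent_indent):
    """Re-express literal lines relative to the parent's indentation."""
    return [line[parent_indent:] if line.strip() else ""
            for line in literal_lines]
```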

Python code

If a paragraph starts with (optional whitespace followed by) ">>>" then it is considered to be Python code, and flagged as such.

Python code paragraphs are a special case of literal paragraphs. They need not be a sub paragraph, but will still be rendered with indentation relative to the preceding paragraph. This means that if they are a subparagraph, then they will behave as for a "::" delimited paragraph, and if they are not, they will have whatever indentation the preceding paragraph had (typically zero).

See Tim Peters's docstring testing code (available here, but also see Cameron Laird's personal notes on doctest, which contains a mirror, and python.faqts 4276 - why isn't this in the Tools directory for Python 2.0?)

Note that doctest itself states that a blank line terminates the section of Python code, so we need not consider subparagraphs (if any).

Also, since Python code paragraphs are detected after "::" subparagraphs, it is possible to have a literal paragraph that starts with ">>>" and is not Python code - for instance:

    Here is some introductory text::

       >>> And this is NOT Python code
       >>> despite how it starts
    

Header paragraphs

If a non-literal paragraph has subparagraphs, and consists of a single line, with no trailing punctuation (that is, it ends with an alphanumeric character), then it is a header.

Thus, for instance:

    Chapter 1: The garden                              [1]

       In the garden there are plants. These include:  [2]

          - Primulas                                   [3]
          - Roses
    

Paragraph [1] is a header, since it does not end with punctuation. Paragraph [2] is not a header, as although it has a subparagraph, it ends with punctuation. Paragraph [3] is not a header, because it does not have a subparagraph.
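
The header rule as stated is simple enough to sketch directly; is_header is a hypothetical name, and "alphanumeric" is taken to mean whatever str.isalnum accepts.

```python
def is_header(paragraph_lines, has_subparagraphs, is_literal=False):
    """Apply the single-line, no-trailing-punctuation header rule."""
    if is_literal or not has_subparagraphs or len(paragraph_lines) != 1:
        return False
    text = paragraph_lines[0].rstrip()
    return bool(text) and text[-1].isalnum()
```

Applied to a one-line list item that has subparagraphs, such as "- Trees", this also returns True.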

This rule appears to work well in STClassic. I have some doubts in the context of docstrings, since (a) one line paragraphs are probably much commoner, and (b) we allow lists to start paragraphs, so we could envisage someone doing:

    - Trees
      * Cherry trees
      * Apple trees
      

Unfortunately, the first line is both a list item and a header, by the simple interpretation of the rules. It may be sufficient to modify the rule to "with subparagraphs that do not start with a list item", but life is beginning to get a bit too complicated.

Also, I know some people object to having to render headed sections with indentation. This makes some sense, as structurally neither the HTML-related family nor the LaTeX family encloses the headed text "within" the header. That is, for instance, it is:

    <h2>Header</h2>                    \section{Header}

    <p>Text</p>                        Text
      

and not:

    <h2>Header                         \begin{section}{Header}
       <p>Text</p>                        Text
    </h2>                              \end{section}
      

It may be that we should drop this rule, and consider adopting one of the "proposed" header rules below. (add reference)

Any remaining indentation

We have dealt with indented paragraphs in the following cases:

However, since we are building a DOM tree, and attempting to follow XML/XHTML type rules for the construction of that tree (didn't I say that? well, we are), then there is no real place for "nested" paragraphs. In STClassic and STNG, text of the form:

    This is the first paragraph.

        This is the second paragraph.
    

will lead to a datastructure of the (notional) form:

    <p>This is the first paragraph.
        <p>This is the second paragraph.</p>
    </p>
    

I don't see this as being very meaningful, since I can't see an obvious use for it, and it certainly doesn't match my (admittedly limited) understanding of how XML/XHTML trees are typically structured (does DocBook, for instance, admit of something like this?).

However, I think there is an obvious use for this "extra" indentation, and that is to allow the user to mean indentation. I would thus propose that indentation that is not being "used" for other purposes (and I think it is obvious enough when that is!) be used to produce "true" indentation. In this circumstance, the above example would (in HTML) become:

    <p>This is the first paragraph.</p>
    <blockquote>
        <p>This is the second paragraph.</p>
    </blockquote>
    


Lists

There are three forms of list item: descriptive, unordered and ordered. Descriptive list items are checked for first - the order of checking for the others is not defined.

Descriptive list items

A descriptive list item is represented as:

    title -- text
    

Specifically, a descriptive list item is recognised as:

start of line, followed by some text (which may specifically include the "literal markup" pattern), followed by whitespace, followed by --, followed by whitespace, followed by some text

The text before the -- is the item's "title", and the text after it is the item's "text". Both may include markup (see below).

There is a case for not allowing newline within the title (well, specifically, STNG doesn't). I think this is the correct choice, and aim to change the implementation to follow it.

Note that the case of:

    ' -- ' -- An awkward descriptive list item.
    

is specifically intended to be allowed.

Thus, for instance:

    Fred -- a descriptive item
    

might translate as the HTML:

    <dt> Fred
    <dd> a descriptive item
    

or the LaTeX:

     \item[Fred] a descriptive item
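
A rough sketch of recognising a descriptive list item on a single line. The pattern and the name parse_descriptive are my assumptions: a title that is wholly a literal is tried first (which is what lets the awkward ' -- ' case through), and a literal embedded in a longer title that itself contains " -- " is not handled.

```python
import re

DESCRIPTIVE = re.compile(
    r"^\s*(?P<title>'[^'\n]*'|\S.*?)\s+--\s+(?P<text>\S.*)$")

def parse_descriptive(line):
    """Return (title, text) for a descriptive list item, else None."""
    match = DESCRIPTIVE.match(line)
    if match is None:
        return None
    return match.group("title"), match.group("text")
```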
    

Unordered list items

An unordered list item is represented by a "bullet" character (currently one of o, - and *, but it has been pointed out the first can be a problem in Spanish, where "o" is a word, so + may get substituted), followed by whitespace and then some text.

Specifically, an unordered list item is recognised as:

start of line, followed by o or - or * (but see the caveat above), followed by whitespace, followed by some text

The actual bullet character used is remembered as an attribute of the list item, and may be taken as a "strong hint" to the formatter (although see the text about list representation in general below).

Thus, for instance:

    - a list item
    

might translate as the HTML "<li> a list item" or the LaTeX "\item a list item".

Ordered list items

An ordered list item is represented by an "enumerator sequence", followed by whitespace and then some text.

An enumerator sequence is simply one or more enumerators strung together without intervening whitespace, whilst an enumerator is:

One or more digits (0..9), or a letter (A..Z or a..z) or one or more Roman digits (ivxlcdmIVXLCDM), followed by a dot (.)

An ordered list item is then recognised as:

start of line, followed by an enumerator sequence, followed by whitespace, followed by some text

The actual enumerator sequence used is remembered as an attribute of the list item, and may be taken as a "strong hint" to the formatter (although see the text about list representation in general below).

Thus, for instance:

    1.a.iv. a list item
    

might translate as the HTML "<li> a list item" or the LaTeX "\item a list item".

Note that STNG does not require a trailing dot on an enumerator sequence. This has the disadvantage, when combined with our "recognise a list item as starting a paragraph" approach, of making the text:

        I am sure that you know that
        I prefer tea.
      

render as two list items... The solution (requiring the trailing dot) seems both obvious and allowable.
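
A minimal sketch of the enumerator rule with the trailing dot required. The names ENUMERATOR, ORDERED_ITEM and parse_ordered are hypothetical, and no attempt is made to check that Roman digits form a valid numeral.

```python
import re

# One enumerator: digits, Roman digits or a single letter, then a required dot.
ENUMERATOR = r"(?:[0-9]+|[ivxlcdmIVXLCDM]+|[A-Za-z])\."
ORDERED_ITEM = re.compile(
    r"^\s*(?P<enum>(?:%s)+)\s+(?P<text>\S.*)$" % ENUMERATOR)

def parse_ordered(line):
    """Return (enumerator_sequence, text) for an ordered list item, else None."""
    match = ORDERED_ITEM.match(line)
    if match is None:
        return None
    return match.group("enum"), match.group("text")
```

With the dot required, lines that merely begin with the word "I" are left alone.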

List representation

The representation of lists by the formatter is partly straightforward, and partly a matter of compromise.

Adjacent list items of the same type and indentation are grouped together into a list of the appropriate type. Thus:

    Fred -- Some text.
    1. A first item
       i. Some subtext
    2. A second item
    - Something else
    

would be treated as broadly equivalent to the HTML:

    <dl>
      <dt>Fred
      <dd>Some text.
    </dl>
    <ol type="1">
      <li>A first item
      <ol type="i">
       <li>Some subtext
      </ol>
      <li>A second item
    </ol>
    <ul>
      <li>Something else
    </ul>
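
Grouping adjacent sibling items by type could be sketched with itertools.groupby. This assumes the items have already been arranged in the paragraph tree, so that the function only ever sees siblings at one indentation; the (kind, text) tuples and the name group_siblings are hypothetical.

```python
from itertools import groupby

def group_siblings(items):
    """Group adjacent (kind, text) items of the same kind into lists."""
    return [(kind, [text for _, text in run])
            for kind, run in groupby(items, key=lambda item: item[0])]
```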
    

When formatting an ordered list, the formatter should (if possible in the output format) take the type of list from the first list item (thus, if the first list item has a (first) enumerator of type capital letter, then so should the output list in HTML or LaTeX, which both support that distinction).

When formatting an unordered list, the formatter should (again, if possible in the output format) take the type of bullet from the first list item. Here, it is not possible to specify the translation, but it should be consistent for the same bullet character, and should match "visually" as far as possible.

I am not entirely convinced by this "assume a list is continuous when the bullet/enumerator type changes" idea. Perhaps it would be better to split the list - how often does a user want to do, for instance:

    1. First item
    2. Second item
    A. Another first item
      

where they intended two separate lists? If this is required, the implementation described above would not allow a user to attain it...

Note that, in the current implementation, no account is taken of the number of enumerators in an enumerator sequence when "partitioning" list items into lists. Thus the sequence:

    1. A first item
    2.1. A second item
    

(where there is no relative indentation between the two list items) would be treated as broadly equivalent to the HTML:

    <ol>
      <li> A first item
      <li> A second item
    </ol>
    

I'm not sure that I would ever intend to make the parser "complicate" lists by following the "depth" of the enumerator sequence.

Of course, a formatter can look at the attribution of the list item if it wishes to do something more sophisticated, but this is neither required nor expected.


Colourisation / markup

Order of colourisation

Literal markup is done before any other form of markup. The order of the other colourisations is not defined, and may be changed without any warning.

Python literals

Python literals (#...#) are colourised first.

Specifically, a Python literal is recognised as:

start of paragraph, or whitespace, followed by #, followed by any text not including # or newline, followed by # and either whitespace, ending punctuation or newline.

This means that #'He said, "Hello"'# is a valid Python literal. Text within Python literals is not subsequently colourised.

Typical formatter results are HTML <code>...</code>, and LaTeX \verb#...#.

We could actually relax the before/after requirements, since about the only place a # occurs in "natural" text is as a "number sign" - for instance, as #22. Thus the "ending" of a Python literal could be # followed by, well, anything that isn't a digit. Certainly we could allow any non-alphanumeric as a pre- and post-cursor.

Literals

Literals ('...') are colourised second.

Specifically, a literal is recognised as:

start of paragraph, or whitespace, followed by ', followed by any text not including ' or newline, followed by ' and either whitespace, ending punctuation (but not including ') or newline.

This means that 'Emphasis is *text*' is a valid literal. Text within literals is not subsequently colourised.

Typical formatter results are HTML <code>...</code>, and LaTeX \verb'...'.

STNG allows various punctuation characters to come immediately after the closing quote of a literal, but not " or ). Disallowing the former means that one can type (for instance) "He said 'she said'" (although not "Did he say 'she said'?"). It seems to me that this is possibly a case for disallowing punctuation of any sort immediately after a closing literal quote...

It is necessary to disallow alphabetics after a closing literal quote so that one can do can't and 1950's without mistakenly getting literals.
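
The two literal forms (#...# and '...') might be recognised with patterns along these lines. This is a sketch under the rules as written above, not the STpy implementation; the names are hypothetical, and "start of paragraph" is taken to be the start of the string.

```python
import re

ENDING = re.escape(".:,;!?'\")")            # "ending punctuation"
ENDING_NO_QUOTE = re.escape(".:,;!?\")")    # the same, without the single quote

# Opening delimiter at the start of the string or after whitespace; closing
# delimiter followed by whitespace, ending punctuation or the end of the string.
PY_LITERAL = re.compile(r"(?:(?<=\s)|^)#([^#\n]+)#(?=[\s%s]|$)" % ENDING)
LITERAL = re.compile(r"(?:(?<=\s)|^)'([^'\n]+)'(?=[\s%s]|$)" % ENDING_NO_QUOTE)
```

Because the opening quote must follow whitespace, apostrophes inside words are never treated as the start of a literal.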

Emphasised text

Emphasised text is recognised as:

start of paragraph, or whitespace, followed by *, followed by any text not including a terminating *, followed by * and either whitespace, ending punctuation or end of paragraph.

Typical formatter results are HTML <em>...</em>, and LaTeX {\em ...}.

There is a case for not allowing whitespace after the opening "*" or before the closing "*" (well, specifically, STNG doesn't). I think this is the correct choice, and aim to change the implementation to follow it.

Strong text

Strong text is recognised as:

start of paragraph, or whitespace, followed by **, followed by any text not including a terminating **, followed by ** and either whitespace, ending punctuation or end of paragraph.

Typical formatter results are HTML <strong>...</strong>, and LaTeX {\bf ...}.

There is a case for not allowing whitespace after the opening "**" or before the closing "**" (well, specifically, STNG doesn't). I think this is the correct choice, and aim to change the implementation to follow it.
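
A sketch of emphasis and strong recognition, using the stricter STNG-style variant (no whitespace just inside the markers). Two things here are my assumptions rather than anything the text fixes: strong is handled before emphasis, and the output is unescaped HTML purely for illustration. The name colourise is hypothetical.

```python
import re

ENDING = re.escape(".:,;!?'\")")
HEAD = r"(?:(?<=\s)|^)"                 # start of string, or after whitespace
TAIL = r"(?=[\s%s]|$)" % ENDING         # whitespace, ending punctuation or end

STRONG = re.compile(HEAD + r"\*\*(?=\S)(.+?)(?<=\S)\*\*" + TAIL, re.DOTALL)
EMPHASIS = re.compile(HEAD + r"\*(?=\S)(.+?)(?<=\S)\*" + TAIL, re.DOTALL)

def colourise(text):
    """Mark up strong text first, then emphasised text."""
    text = STRONG.sub(r"<strong>\1</strong>", text)
    return EMPHASIS.sub(r"<em>\1</em>", text)
```

Disallowing whitespace inside the markers keeps arithmetic such as "2 * 3 * 4" from being read as emphasis.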

Hyperlinks - relative or absolute

Relative or absolute hyperlinks may be represented as:

start of paragraph, or whitespace, followed by a double quote ("), followed by some text not including a double quote, followed by a double quote and a colon (":), followed by a URL, followed by either ending punctuation followed by whitespace or by whitespace or end of paragraph

For example:

    One line leads to "Zope":http://www.zope.org/ and another
    leads to "the parent directory":http://..
    

which would be formatted (in HTML) as:

    One line leads to <a href="http://www.zope.org/">Zope</a> and another
    leads to <a href="http://..">the parent directory</a>
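
A rough sketch of the recognition rule, taking "a URL" to mean any run of non-whitespace characters (an assumption, since the precise definition is still to be written). The URL is matched lazily, so it stops at the first point where the ending rule is satisfied. The name LINK is hypothetical.

```python
import re

ENDING = re.escape(".:,;!?'\")")

# "text":URL, where the URL ends before (ending punctuation + whitespace),
# before whitespace, or at the end of the string.
LINK = re.compile(
    r'(?:(?<=\s)|^)"([^"]+)":(\S+?)(?=[%s]\s|\s|$)' % ENDING)
```

Taken literally, the rule means that punctuation at the very end of a paragraph is kept as part of the URL, which is why "http://.." survives intact.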
    

Note that the complex ending rules (have I got them right?) are to allow URLs of the form:

    "Zope":http://www.zope.org/?answer=fred and
    "the parent directory":http://..
    

Hyperlinks - absolute

Absolute hyperlinks may also be represented as:

start of paragraph, or whitespace, followed by a double quote ("), followed by some text not including a double quote, followed by a double quote and a comma (",), followed by whitespace, followed by a URL which is "absolute", followed by either ending punctuation followed by whitespace or by whitespace or end of paragraph

For example:

    "mail me", mailto:tibs@tibsnjoan.co.uk
    

which would be formatted (in HTML) as:

    <a href="mailto:tibs@tibsnjoan.co.uk">mail me</a>
    

Nesting of markup

In theory, emphasised and strong text may include other markup, so that one could do something like "*Emphasis containing a 'literal' and some **Strong***". In practice, this is not available in the initial implementation. It is intended that it should be available in a future version of the code.


The resultant DOM tree

To be completed

I want, ultimately, to put some explanation of the DOM tree in here, so that anyone trying to process it has some hope of predicting what they've got to do. For the moment, referring to STpy DTD below may be the best one can do...


Proposed extensions - headings

STClassic provides a means of indicating a heading, but it is perhaps error prone in an STpy context (see insert reference above), and it requires that the text "within" the header section be indented, which some people find awkward.

Also see Tagged paragraphs below.

Named headings

Coming from a TeX/LaTeX background originally, I was used to "named" sections (as opposed to the clunky numbered headings of HTML). For those who don't know, LaTeX provides the following pre-defined headings:

In "human readable text" these have some advantages over the HTML approach.

The proposal (which is obviously related to some of the others! - insert appropriate references) is to allow a paragraph of the form:

start of paragraph followed by one of the keywords "Chapter", "Section", "Subsection", "Subsubsection", "Clause", "Subclause" or "Appendix", optionally followed by whitespace and an enumerator sequence, followed by a ":", and optionally followed by whitespace and some text, followed by end of paragraph

Whether there is a required ordering of these, which of the others "Appendix" is a "synonym" for in that hierarchy, and whether we need "Subsubclause" and "Part", are left for later discussion.

If this mechanism were to be adopted, then we would need some means of specifying how to say these "words" in another language ("Chapter" would look silly in German, I imagine). I don't see that as a particular problem.

Examples would be:

    Chapter 1:
    Section 2: The garden
    Clause 5.4.3: Type II pedants.
    Appendix A: Surprise!
    

Note that, unlike STClassic header paragraphs, this does not require the "headed" text to be subparagraphs (i.e., it need not be indented), and also that trailing punctuation is allowed. It is my belief that texts matching this pattern are pretty unlikely to occur in normal text in any other meaning.
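
As a rough illustration, the pattern described above could be recognised with a regular expression along the following lines. This is only a sketch, not part of the proposal: the name "parse_named_heading" is hypothetical, and the definition of an "enumerator sequence" (dot-separated runs of letters or digits) is an assumption, since the text does not pin it down.

```python
import re

# Keywords taken from the proposal above.
KEYWORDS = ("Chapter", "Section", "Subsection", "Subsubsection",
            "Clause", "Subclause", "Appendix")

# Assumption: an enumerator sequence looks like "1", "A" or "5.4.3".
NAMED_HEADING = re.compile(
    r"^(?P<keyword>" + "|".join(KEYWORDS) + r")"
    r"(?:\s+(?P<enumerator>[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*))?"
    r":"
    r"(?:\s+(?P<title>.+?))?\s*$",
    re.DOTALL,
)


def parse_named_heading(paragraph):
    """Return (keyword, enumerator, title), or None if not a named heading.

    The enumerator and title are None when they are absent.
    """
    match = NAMED_HEADING.match(paragraph.strip())
    if match is None:
        return None
    return (match.group("keyword"),
            match.group("enumerator"),
            match.group("title"))


for example in ("Chapter 1:", "Section 2: The garden",
                "Clause 5.4.3: Type II pedants.", "Appendix A: Surprise!"):
    print(parse_named_heading(example))
```

Translating the keywords into another language would then amount to swapping the KEYWORDS tuple.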

Underlined headings

Another approach to avoiding the indentation of paragraphs after a heading (also based on observed practice) is to allow the underlining of headers.

This is specified a little differently:

If a paragraph has two lines, the first line is optional whitespace followed by some text, and the second line is composed of the same amount of optional whitespace, followed by one "underline" character repeated sufficient times to underline all of the some text, then this is a header paragraph, and the heading is the some text.

The obvious underline characters are "-", "=" and "~" (well, they're the ones I tend to use!).

Thus for example:

    This is a heading
    =================

        In the garden!
        ~~~~~~~~~~~~~~
    

It may be useful to assign some significance to which underline character is used - of the ones given, I would assume that "=" was something like a section, "-" a subsection, and "~" a subsubsection.
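
A minimal sketch of how such a paragraph might be detected follows. The function name "parse_underlined_heading" is hypothetical, the mapping of underline characters to levels simply follows the tentative assignment suggested above, and "the same amount of whitespace" is interpreted here as the same number of leading whitespace characters.

```python
# Tentative levels for each underline character (an assumption, as above).
UNDERLINE_LEVELS = {"=": "section", "-": "subsection", "~": "subsubsection"}


def parse_underlined_heading(paragraph):
    """Return (level, heading text), or None if not an underlined heading."""
    lines = [line.rstrip() for line in paragraph.rstrip().split("\n")]
    if len(lines) != 2:
        return None
    text_line, rule_line = lines
    text = text_line.lstrip()
    rule = rule_line.lstrip()
    if not text or not rule:
        return None
    # Both lines must be indented by the same amount.
    if len(text_line) - len(text) != len(rule_line) - len(rule):
        return None
    # The second line must be a single underline character, repeated.
    char = rule[0]
    if char not in UNDERLINE_LEVELS or rule != char * len(rule):
        return None
    # The underline must cover the whole of the heading text.
    if len(rule) < len(text):
        return None
    return UNDERLINE_LEVELS[char], text


print(parse_underlined_heading("This is a heading\n" + "=" * 17))
print(parse_underlined_heading("    In the garden!\n    " + "~" * 14))
```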

Personally, whilst I can see that this idea makes sense, and more than one person (heh, me included) has suggested it in the past, I'm not sure I really like it. But it warrants examination.


Other proposed extensions

Out-of-line URLs

A proposed representation for out-of-line URLs is:

A "reference", specified as:

start of line or whitespace, followed by a double quote ("), followed by some text not including a double quote, followed by the three characters double quote, colon and open square bracket, followed by an alphanumeric, followed by zero or more alphanumeric characters, "-" or "_", followed by "]", followed by ending punctuation or whitespace or end of paragraph

For instance:

    This text refers to "Fred":[1] and also to "Jim":[jim].
    

A "link", specified as:

start of line followed by "..[" followed by an alphanumeric, followed by zero or more alphanumeric characters, "-" or "_", followed by "]", followed by whitespace, followed by a URL, followed by end of line

For instance:

    ..[1]   http://www.fred.com/
    ..[jim] ftp://ftp.jim.org/
    

Note that allowing these not to be separated by blank lines would mean a slight adjustment to the paragraphing algorithm (well, not necessarily, but it might be friendly).

A switch would be required for stpy.py to indicate whether the formatted representation should "fold in" the URLs - so that the example would format (in HTML) as:

    This text refers to <a href="http://www.fred.com/">Fred</a> and
    also to <a href="ftp://ftp.jim.org/">Jim</a>.
    

or to leave them as "references", giving something more like:

    This text refers to <a href="#1">Fred</a> and also to
    <a href="#jim">Jim</a>.

    <p><a name="1"><a href="http://www.fred.com/">[1]
       http://www.fred.com/</a></a>
    <p><a name="jim"><a href="ftp://ftp.jim.org/">[jim]
       ftp://ftp.jim.org/</a></a>
    

I believe this to be (more or less) the same as the mechanism currently implemented in StructuredTextClassic, and the same as the proposal at OutOfLineURL.

(NB: also see DefaultURLInterpretation for another relevant suggestion, although one of less interest to us, I think.)
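
To make the "fold in" behaviour concrete, here is a small sketch of how references and links of the forms described above might be matched and combined. It is an illustration only, not the stpy.py implementation: the names "collect_links" and "fold_in" are hypothetical, single-character labels are accepted because the example uses "[1]", and the set of "ending punctuation" characters is an assumption.

```python
import re
from html import escape

# A label is an alphanumeric followed by alphanumerics, "-" or "_".
LABEL = r"[A-Za-z0-9][A-Za-z0-9_-]*"

# A reference: "some text":[label], preceded by whitespace or start of text
# and followed by punctuation, whitespace or the end of the text.
REFERENCE = re.compile(
    r'(?<!\S)"([^"]+)":\[(' + LABEL + r')\](?=[\s.,;:!?)]|$)')

# A link: a line of the form ..[label] URL
LINK = re.compile(r"^\.\.\[(" + LABEL + r")\]\s+(\S+)\s*$")


def collect_links(text):
    """Return (text without link lines, {label: url})."""
    links, body = {}, []
    for line in text.splitlines():
        match = LINK.match(line)
        if match:
            links[match.group(1)] = match.group(2)
        else:
            body.append(line)
    return "\n".join(body), links


def fold_in(text):
    """Replace each resolvable reference with an HTML anchor to its URL."""
    body, links = collect_links(text)

    def replace(match):
        url = links.get(match.group(2))
        if url is None:
            return match.group(0)  # leave unresolved references untouched
        return '<a href="%s">%s</a>' % (escape(url, quote=True),
                                        escape(match.group(1)))

    return REFERENCE.sub(replace, body)


SOURCE = '''This text refers to "Fred":[1] and also to "Jim":[jim].

..[1]   http://www.fred.com/
..[jim] ftp://ftp.jim.org/
'''
print(fold_in(SOURCE))
```

Leaving the links as "references" instead would use the same two patterns, but emit an anchor to "#label" for each reference and keep a rendered list of the links.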

Python code references

The docstring processor needs to be able to create "references" (links) from the names of Python modules, classes, functions/methods, values and so on to the relevant Python entity.

Two forms of this are wanted:

  1. "guessing"
  2. "explicit"

The "guessing" technique takes the highlighted entity, and attempts to deduce what it is (a class name, method name, function name, etc.) by reference to the module including the docstring, using the normal Python rules for resolving a name. It then creates a link to the answer (or, perhaps, to the most likely answer).

The "explicit" technique needs to allow the docstring author to indicate what type of Python entity is being linked to - this can become necessary, for instance, if one has a class and method of the same name - from within the class's docstring, it is not clear which is required, and it may be undesirable to say #self.name#.

Note that a Python programmer will be willing to lose a little "naturalness" in the text for the sake of achieving this linkage.
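
The "guessing" technique might be sketched as below. This is a simplified illustration under stated assumptions: the function name "guess_entity" and the four "kind" labels are hypothetical, and only module globals and builtins are consulted (a real implementation would also need to consider the enclosing class when the docstring belongs to a class or method).

```python
import builtins
import inspect


def guess_entity(name, module):
    """Guess what a (possibly dotted) name used in a docstring refers to.

    Returns (kind, object), or None if the name cannot be resolved.
    """
    first, *rest = name.split(".")
    # Module globals first, then builtins, as for a module-level name lookup.
    namespace = vars(module)
    if first in namespace:
        obj = namespace[first]
    elif hasattr(builtins, first):
        obj = getattr(builtins, first)
    else:
        return None
    # Follow any remaining dotted parts as attribute lookups.
    for attribute in rest:
        try:
            obj = getattr(obj, attribute)
        except AttributeError:
            return None
    if inspect.ismodule(obj):
        kind = "module"
    elif inspect.isclass(obj):
        kind = "class"
    elif inspect.isroutine(obj):
        kind = "function"
    else:
        kind = "value"
    return kind, obj


# Resolve some names as if they appeared in a docstring of "inspect" itself.
for name in ("Signature", "Signature.bind", "isclass", "len", "__name__"):
    print(name, "->", guess_entity(name, inspect)[0])
```

The "explicit" form would bypass the final classification step by letting the author state the kind directly.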

The main proposals for delimiting the entity name were:

The proposed means of indicating what type of thing is to be referenced are still to be annotated, but I can't help thinking that something along the lines of "#name#:Class" (in pseudo-parallel with URL references) might work...

Tagged paragraphs

An earlier attempt at resolving the docstring problem made extensive use of markup by "tagged paragraphs", allowing one to say, for instance:

    Author: Guido van Rossum
    

or

    History:
        0.1: June 1920
        0.2: July 1919
    

or even

    Arguments:
        name -- The person's full name
        age  -- Their age
    

The ":" at the end of the "tag" was not finalised - in the context of the time "::" was also considered (I think a single ":" would do, actually, and we can't use "::" without causing some confusion!).

The idea also made heavy use of the ability not to require "spurious" blank lines between content lines, and it was assumed that some of the parsing of the content would be dependent upon the actual tag.

A list of tags would need to be made - proposals made included:

Argument[s], Returns, Exception[s], Author, History, Version, and maybe Code (this may be redundant) and Links (intended to introduce the links section of an OutOfLineURL, but probably redundant).

Note that in the DOM tree (and any resultant XML output), these would map directly to element nodes - for example:

    Arguments:
        name -- The person's full name
        age  -- Their age
    

would map to:

    <Arguments>
      <dl>
        <dt>name</dt>
        <dd>The person's full name</dd>
        <dt>age</dt>
        <dd>Their age</dd>
      </dl>
    </Arguments>
    

My gut feeling is that this is a Really Good Idea, but that it should wait for the second phase of implementation.
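
As an illustration of the mapping shown above, the following sketch converts an "Arguments"-style tagged paragraph into an element tree using the standard library. It is not a proposed implementation: the function name "tagged_paragraph_to_element" is hypothetical, and it assumes that every content line has the form "name -- description", which would only be true for some tags.

```python
import xml.etree.ElementTree as ET


def tagged_paragraph_to_element(paragraph):
    """Convert a tagged paragraph of "name -- description" lines to XML."""
    lines = [line.strip() for line in paragraph.strip().splitlines()]
    # The first line is the tag, with its trailing ":" removed.
    root = ET.Element(lines[0].rstrip(":"))
    definitions = ET.SubElement(root, "dl")
    for line in lines[1:]:
        if not line:
            continue
        name, _, description = line.partition(" -- ")
        ET.SubElement(definitions, "dt").text = name.strip()
        ET.SubElement(definitions, "dd").text = description.strip()
    return root


SOURCE = """Arguments:
        name -- The person's full name
        age  -- Their age
"""
print(ET.tostring(tagged_paragraph_to_element(SOURCE), encoding="unicode"))
```

Parsing that depends on the actual tag (for instance, version and date pairs under "History") would need a per-tag handler in place of the single loop used here.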

Reference: see David Ascher's email of 1999-11-29 to the Doc-SIG, "[Doc-SIG] docstring grammar".

Escaping characters

See the Zwiki discussion page http://www.zope.org/Members/jim/StructuredTextWiki/NGEscaping for more information on this.

In normal use of ST[NG], there is almost no need to be able to "escape" a character.

Given this, and given the requirement for StructuredText to read naturally of itself, Jim Fulton is reluctant to introduce general "quoting" of special characters, unless it can be justified very carefully.

When is it actually needed? In traditional ST usage, I suspect only in the context of 'literal text', when one might want to insert a ', and this is both relatively unlikely, and probably possible to write "around".

In a docstring environment, however, it is a certainty that one day someone will want to do the equivalent of

    'And the single quote is "'"'

(hmm - actually, that's legal. Oh well.)

If we have HASH quoting for Python code fragments, the same problem would arise with:

    #print "#",#

(which I'm too lazy to parse for legality).

A suggestion was made of using !, since it is used to escape Zwiki names, but this doesn't look very good, and Jim Fulton doesn't like it. He has come up with some suggestions for alternative (balanced) quoting schemes that I don't entirely follow (see NGEscaping again). I don't believe they addressed the Python-specific problems, anyway.

I have in the past suggested the "~" character for such escaping (doubling it to get the actual character), but that was more in a sense of desperation than anything else, and I'd prefer to reserve tilde for non-breaking space. We need to consider what characters are rarely, if ever, used in Python, and how they look (for instance, "@" would be a bad choice visually).

It would be Very Nice (although I suppose not essential) if we also solved the problem for "mainstream" STNG. Regardless, this is another discussion I would encourage people to participate in on the Zwiki side.

Non-breaking spaces

That is, what would be represented as "~" in TeX, or "&nbsp;" in HTML.

One can't use a literal string, because of the rules of how they start and end (so "Number' '17" doesn't work).

Coming from a TeX background, I personally regard these as essential. If we don't end up using "~" as an escape character, I would propose it as a reasonable compromise, with precedent in the TeX world.

Enforced line break

The equivalent of "\newline" and variants in LaTeX, or "<br>" in HTML.

A new paragraph is not exactly the same; sometimes one wants a forced line break. I have no suggestions for syntax, and still need to come up with an example of why I think it is useful, so there isn't a very good case for this one as yet...

(Hmm - weakening my own case, I see I doubted this was necessary in STNG-format.html!)

Tables

A mechanism for representing tables would be good. Of course, we're not alone in thinking this, and indeed a mechanism was provided in StructuredTextClassic, but it does not read naturally in the "bare", and Jim Fulton has dropped it from STNG (it was apparently snuck into STClassic when he wasn't looking).

(The Wiki page for discussing the proposal to reintroduce that representation, or indeed an alternative, is StructuredTextTables. If you're interested, please participate in the discussion there - I think it is very important that the same solution be used for STNG as for STpy.)


STpy DTD

To be completed

Fred Drake proposed a "basic" DTD for StructuredTextNG of the form:

<!-- This DTD should provide enough to support everything available in
     StructuredTextNG.
  -->

<!ELEMENT document (section|title|p|ol|ul|pre)*>

<!ELEMENT section (title?, (section|p|ol|ul|pre)*)>
<!ELEMENT title (#PCDATA|a|code|em|strong|u)*>

<!ELEMENT ol (li+)>
<!ELEMENT ul (li+)>

<!ELEMENT li (p|ol|ul|pre|section)+>

<!ELEMENT p (#PCDATA|a|code|em|strong|u)*>

<!ELEMENT pre (#PCDATA)>
<!ATTLIST pre xml:space (default|preserve) #FIXED "preserve">

<!ELEMENT a (#PCDATA|code|em|strong|u)*>
<!ELEMENT code (#PCDATA|a|em|strong|u)*>
<!ELEMENT em (#PCDATA|a|code|strong|u)*>
<!ELEMENT strong (#PCDATA|a|code|em|u)*>
<!ELEMENT u (#PCDATA|a|code|em|strong)*>

<!ATTLIST a href CDATA #REQUIRED>
      

If I'm understanding how DTDs are specified, I think we need:

<!-- A document is a package, module or docstring.
     A package is one or more modules.
     A module is one or more docstrings.
     * means zero or more
     + means one or more -->

<!ELEMENT document (package|module|docstring)>
<!ELEMENT package (module+)>
<!ELEMENT module (docstring+)>

<!ELEMENT docstring (signature?, summary?, (section|p|pre|dl|ol|ul)*)>

<!ELEMENT signature (#PCDATA)>
<!ELEMENT summary  (#PCDATA|a|code|em|strong)*>

<!ELEMENT section (title?, (section|p|pre|dl|ol|ul)*)>
<!ELEMENT title (#PCDATA|a|code|em|strong)*>

<!ELEMENT dl (dt,dd)+>
<!ELEMENT ol (li+)>
<!ELEMENT ul (li+)>

<!ELEMENT dt (#PCDATA|a|code|em|strong)*>
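<!-- Mixed content has to be declared as (#PCDATA|a|b)*, so text and
     block elements may appear in dd in any order -->
<!ELEMENT dd (#PCDATA|section|p|pre|dl|ol|ul)*>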

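<!-- Mixed content has to be declared as (#PCDATA|a|b)*, so text and
     block elements may appear in li in any order -->
<!ELEMENT li (#PCDATA|section|p|pre|dl|ol|ul)*>
<!-- bullet is intended to be "o", "*" or "-", but "*" is not a legal
     name token, so it cannot be declared as an enumerated type -->
<!ATTLIST li bullet CDATA #IMPLIED>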
<!ATTLIST li sequence CDATA #IMPLIED>

<!ELEMENT p (#PCDATA|a|code|em|strong)*>

<!ELEMENT pre (#PCDATA)>
<!ATTLIST pre xml:space (default|preserve) #FIXED "preserve">
<!ATTLIST pre python (yes|no) "no">

<!ELEMENT code (#PCDATA)*>
<!ATTLIST code xml:space (default|preserve) #FIXED "preserve">
<!ATTLIST code python (yes|no) "no">

<!ELEMENT a (#PCDATA|code|em|strong)*>
<!ATTLIST a href CDATA #REQUIRED>
<!ATTLIST a id NMTOKEN #REQUIRED>

<!ELEMENT em (#PCDATA|a|code|strong)*>
<!ELEMENT strong (#PCDATA|a|code|em)*>
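
To show the shape of document that the second DTD is meant to describe, here is a small sketch that builds one with the standard library. The helper names "build_docstring" and "build_module_document" and the sample content are hypothetical, and note that xml.etree.ElementTree does not validate against a DTD, so this only illustrates the intended nesting (document, module, docstring, then signature, summary and block content).

```python
import xml.etree.ElementTree as ET


def build_docstring(signature, summary, paragraphs):
    """Build a <docstring>: signature, summary, then one <p> per paragraph."""
    docstring = ET.Element("docstring")
    ET.SubElement(docstring, "signature").text = signature
    ET.SubElement(docstring, "summary").text = summary
    for text in paragraphs:
        ET.SubElement(docstring, "p").text = text
    return docstring


def build_module_document(docstrings):
    """Wrap docstring elements as document -> module -> docstring+."""
    document = ET.Element("document")
    module = ET.SubElement(document, "module")
    module.extend(docstrings)
    return document


example = build_module_document([
    build_docstring("spam(eggs)", "Fry the eggs.", ["Returns the fried eggs."]),
])
print(ET.tostring(example, encoding="unicode"))
```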
      


Author: Tibs (tibs@tibsnjoan.co.uk)

Last modified: Wed Dec 20 12:13:58 GMT Standard Time 2000