This document is being written as a contribution to the work for PEP-0216 (Docstring Format).
It is an attempt to describe the format of StructuredTextNG more precisely than at StructuredTextNGRules.
Note that it is not an official statement, but is derived from trying to deduce what is meant, and also from looking at the source code to see what is (or might be) done.
NB: see NGDocumentation for how one uses the StructuredTextNG system to process text into structured text.
Text highlighted:
like this |
is quoted from StructuredTextNGRules.
\t
) need not be included, as tabs have already
been changed to spaces earlier in processing.
\f
) or vertical-tab
(\v
) within paragraphs - I occasionally use form-feed
for navigation within a document (since I use XEmacs this can be
useful - see this HTML text, for instance), but I don't see it as
having meaning in the actual text, so I think it should be
filtered out (or converted to a "page break" token) during
"paragraphisation".
\r
)? If we do, subsume it into the pattern for
NEWLINE (i.e., only allow it in places that would make sense in
combination with \n
) - I've tentatively done that.
My "todo list" (or "things not to forget to think about"):
--
text
A quick perusal of the code suggests that things denoted by regular expressions are found as many times as they occur, so unless there is a "start of string" clause in the RE (where the "string" is likely to be the individual paragraph, but not, I think, necessarily so), then multiple occurrences of such might be allowed. This is interesting to us because we would quite like to be able to write:
  arg1 -- The first argument
  arg2 -- The second argument
  arg3 -- The third argument

and have it work as one might expect. Since this block of text is only a single paragraph, and one doesn't really want to muck with the separation of determining paragraphs and "colourising" text, this is an important question.
!
" is a Zwiki escape, used to escape words that are
not Zwiki links (but look like them) (or something like
that).
A structured string consists of a sequence of paragraphs separated by one or more blank lines. Each paragraph has a level which is defined as the minimum indentation of the paragraph. A paragraph is a sub-paragraph of another paragraph if the other paragraph is the last preceding paragraph that has a lower level. |
The basic input to StructuredTextNG is a string representing the text to be processed. The output is a structure representing a hierarchy of paragraphs (specifically, a list of StructuredTextParagraph objects for the "top level" paragraphs, each of which may have a list of sub-paragraphs, each of which...).
The effect is as if the following Python were obeyed (actually, this essentially is the Python that is obeyed, but rendered more verbosely):
    import re

    # Remove any confusing tabbing - follow Python rules!
    text = ("\n\n" + text + "\n\n").expandtabs()

    # Split into paragraphs
    blankline = re.compile(r"\n\s*\n")
    paragraphs = blankline.split(text)
Any "empty" paragraphs (i.e., those which consist solely of whitespace) are discarded.
We now have a list of "paragraphs" - note that these are not lists of lines, but still just "chunks of text", with embedded NEWLINE characters.
It is important to note that the indentation of a paragraph depends only on the indentation of its "first line" (i.e., its start). Any internal indentation is invisible to the processor, since it does not look at the innards of the paragraph.
Thus:
This is a single line
and:
* This is a list element with continuation
  that is indented the way I would do it
and
* This is a list element with continuation
 that is indented rather less
are all considered to have the same paragraph indentation.
Given this structure, one can then form a list of level 0 paragraphs, with the level 1 paragraphs as "children" of the level 0 paragraphs, and so on.
This is the basic structure used by StructuredTextNG.
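The nesting just described can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, plain dictionaries stand in for StructuredTextParagraph objects, and the level of a paragraph is taken from its first line only, as argued above.

```python
import re

def split_paragraphs(text):
    """Split text on blank lines and discard whitespace-only chunks."""
    text = ("\n\n" + text + "\n\n").expandtabs()
    return [chunk for chunk in re.split(r"\n\s*\n", text) if chunk.strip()]

def first_line_indent(paragraph):
    """Indentation of a paragraph, taken from its first line only."""
    return len(paragraph) - len(paragraph.lstrip(" "))

def build_hierarchy(paragraphs):
    """Nest each paragraph under the last preceding one with lower indentation."""
    top_level = []
    stack = []  # (indent, node) pairs for the currently open ancestors
    for paragraph in paragraphs:
        indent = first_line_indent(paragraph)
        node = {"text": paragraph.strip(), "children": []}
        # Close any ancestors that are not strictly less indented.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        if stack:
            stack[-1][1]["children"].append(node)
        else:
            top_level.append(node)
        stack.append((indent, node))
    return top_level

SAMPLE = """Title

  First child
  continues here

    Grandchild

  Second child

Another top-level paragraph
"""

tree = build_hierarchy(split_paragraphs(SAMPLE))
```

Here `tree` holds two top-level nodes; the first has two children, and the first of those has one child of its own.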
Notes:
(Alternatively, it may be that colourisation of paragraphs works in a way which makes this issue moot - that would be a good thing if so.)
In the following, I use these terms with specific meaning:
ASTERISK - The character "*"
COLON - The character ":"
COMMA - The character ","
DOT - The character "."
DOUBLEQUOTE - The character """
EXCLAMATION - The character "!"
HYPHEN - The character "-"
NEWLINE - The character "\n". Possibly also the sequence carriage-return, newline ("\r\n")
SLASH - The character "/"
QUOTE - The character "'"
SEMICOLON - The character ";"
SMALL-O - The character "o"
SPACE - The character " "
TILDE - The character "~"
QUERY - The character "?"
AT - The character "@"
L-BRACKET - The character "["
R-BRACKET - The character "]"
whitespace - any of SPACE, NEWLINE
letter - any of a-z or A-Z
digit - any of 0-9
roman-digit - any of ivxlcdmIVXLCDM
alphanumeric - any letter or digit
enumeration - either a single letter or one or more digits or one or more roman-digits
punctuation - any of ,.;:!?
URL - A URL (e.g., http://www.zope.org/) - the exact specification of this is left unspecified for the moment...
not(FRED) - Any characters except that (or those) specified - for instance, not(QUOTE) would mean any character except QUOTE.
begpara - This is intended to match the beginning of a paragraph (which is, of course, just the start of the paragraph's "string").
endpara - This is intended to match the end of a paragraph (which is, of course, just the end of the paragraph's "string").
begline - This is intended to match either begpara or NEWLINE
Items within "[" and "]" are optional. Items on either side of "|" are alternatives (and "|" only applies to the items immediately adjacent to it).
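For readers who think in regular expressions, some of the terms above might be written as RE fragments along the following lines. This is a sketch only: the dictionary, the helper names and the reading of not(...) as "exactly one character" are my assumptions, not part of any STNG source.

```python
import re

# Hypothetical mapping from the terms above to regular-expression fragments.
FRAGMENTS = {
    "letter": r"[a-zA-Z]",
    "digit": r"[0-9]",
    "roman-digit": r"[ivxlcdmIVXLCDM]",
    "alphanumeric": r"[a-zA-Z0-9]",
    "punctuation": r"[,.;:!?]",
    "whitespace": r"[ \n]",
    # a single letter, or one or more digits, or one or more roman-digits
    "enumeration": r"(?:[a-zA-Z]|[0-9]+|[ivxlcdmIVXLCDM]+)",
}

def is_a(term, text):
    """True if the whole of text is an instance of the named term."""
    return re.fullmatch(FRAGMENTS[term], text) is not None

def not_(characters):
    """Fragment for not(...): any one character except those listed."""
    return "[^" + re.escape(characters) + "]"
```

For instance, `is_a("enumeration", "xiv")` holds, while `is_a("enumeration", "1a")` does not.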
Once one has the text organised as hierarchical paragraphs, one can start to process the result to obtain "proper" structured text. This is done by the code in DocumentClass.py, and is termed "colourising" (well, OK, I guess it's termed "colorizing", but I tend to spell that wrong).
StructuredTextNGRules presents the rules for colorizing StructuredText in an informal manner. The following is my interpretation of a stricter version of these rules, presented slightly more formally. Note that it does not necessarily reflect anyone else's actual intent.
Note that it is not clear to me whether/how colorized regions can be embedded within each other. My assumption (or, putting it another way, what I would like) is that it works so that:
- Anything within example paragraphs is not colorized.
- Text within QUOTE and QUOTE is not colorized.
- Otherwise, things may be nested arbitrarily, and in particular:
*title* -- text
and
**This is *nested* emphasis**
and
*This is **nested** emphasis*
are all legal. I am not sure if it is possible to take this all the way and allow:
**This is *nested emphasis***
(indeed, I'm not sure if the example before that is possible with REs), although I consider it desirable...
A single-line paragraph with subparagraphs is treated as a header. |
This is perfectly clear.
A paragraph that begins with a - , * , or
o is treated as an unordered list (bullet) element.
|
Any of the sequences:
is treated as the start of an unordered (bullet) list element. Which type of "bullet" was used will be remembered.
Note: I would like to replace begpara with begline.
A paragraph that begins with a sequence of digits followed by a
white-space character is treated as an ordered list element.
A paragraph that begins with a sequence of sequences, where each sequence is a sequence of digits or a sequence of letters followed by a period, is treated as an ordered list element. |
Any of the sequences:
is treated as the start of an ordered list element. The enumerations for the list element will be remembered.
Note: I would like to replace begpara with begline.
A paragraph with a first line that contains some text, followed by
some white-space and -- is treated as
a descriptive list element. The leading text is treated as the
element title.
|
The sequence:
NEWLINE
) whitespace
HYPHEN HYPHEN whitespace
is treated as the start of a descriptive list element. The
not(NEWLINE) text is treated as the element
title, and may not be of zero length.
Note: I would like to replace begpara with begline.
Sub-paragraphs of a paragraph that ends in the word
example or the word examples , or
:: is treated as example code and is output as in::
For instance :: <table border=0> <tr> <td> Foo </table> |
If a paragraph with subparagraphs ends with COLON COLON, those subparagraphs will be treated as "example" text, and will not be colourised.
(Note that the words example
or examples
are not, in fact, treated specially at the end of a paragraph in
STNG.)
Note that the COLON COLON will be kept as such, so will appear in the output, unless the formatting tool treats them specially.
For example:
Code:: We can have *fred* here if we like... |
might appear as:
Code:: We can have *fred* here if we like... |
Text enclosed in single quotes (with white-space to the left of the
first quote and whitespace or punctuation to the right of the second
quote) is treated as example code::
For example: '<dtml-var foo>'. |
In-line example text is represented as:
QUOTE,NEWLINE
) QUOTE
whitespace | punctuation | endpara
The text within QUOTEs will not be colorized.
Text surrounded by * characters (with white-space to
the left of the first * and whitespace or punctuation
to the right of the second * ) is emphasized.
|
Emphasized text is represented as:
ASTERISK
)
ASTERISK
whitespace | punctuation | endpara
The text within ASTERISKs is emphasized.
Text surrounded by ** characters (with white-space to
the left of the first ** and whitespace or punctuation
to the right of the second ** ) is made
strong.
|
Strong text is represented as:
double ASTERISK
)
ASTERISK ASTERISK whitespace |
punctuation | endpara
The text within double ASTERISKs is strong.
Note that the sequence
**Some *bold* text**
is allowed, but the sequence
**Some *bold text***
may or may not be allowed (I'd like
it if it were).
Text enclosed by double quotes followed by a colon, a URL, and
concluded by punctuation plus white space, or just white
space, is treated as a hyper link.
For example, "Zope":http://www.zope.org/ is interpreted as
Zope
Note: This works for relative as well as absolute URLs. |
Hyperlinks may be represented in several ways. The first is:
DOUBLEQUOTE
)
DOUBLEQUOTE COLON URL
[ punctuation ]
whitespace | endpara
The text
"Zope":http://www.zope.org/
will be represented as Zope.
This mechanism is designed to work for both relative and absolute URLs.
Text enclosed by double quotes followed by a comma, one or more
spaces, an absolute URL and concluded by punctuation plus white
space, or just white space, is treated as a hyper link.
For example: "mail me", mailto:amos@digicool.com is
interpreted as mail me
|
An alternative representation for URLs is:
DOUBLEQUOTE
)
DOUBLEQUOTE COMMA
whitespace URL
[ punctuation ] whitespace | endpara
The text
"Zope", http://www.zope.org/
will be represented as Zope.
This mechanism is only designed to work for absolute URLs.
SGML text is ignored and output as is. |
This is implicit in that "SGML" text is not mentioned in the other paragraphs.
Inspection of the source code for (a recent version of) STNG, dated about 2000-11-09, appears to show the following implementation details. These are, of course, not necessarily final, but since I assume each implementation "oddity" has a sensible reason for having come into being, they are worth studying - they may be what STNG finally ends up providing...
The actual regular expression used is '\s*[-*o]\s+'
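The behaviour of that expression can be tried out with a few lines of Python. The helper name is hypothetical, and I am assuming the pattern is applied at the start of the paragraph string; the STNG code may apply it differently.

```python
import re

# The bullet pattern quoted above, written as a raw string.
BULLET = re.compile(r'\s*[-*o]\s+')

def bullet_type(paragraph):
    """Return the bullet character used, or None if this is not a bullet."""
    match = BULLET.match(paragraph)
    if match is None:
        return None
    return match.group().strip()
```

Note that under this reading a paragraph such as "o dear me" would count as a bullet element, whereas "*emphasis* first" would not, since the asterisk is not followed by whitespace.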
The actual regular expression used is
'\s*'
'(([a-zA-Z]|[0-9]+|[ivxlcdmIVXLCDM]+)\.)*'
'([a-zA-Z]|[0-9]+|[ivxlcdmIVXLCDM]+)\.?'
'\s+'
The STNG descriptive text is a bit ambiguous as it stands, as the second clause talks about a sequence of letters, but the first one doesn't.
Observation in Classic StructuredText (i.e., via the Zwiki interface) shows that it accepts alphabetic sequencing, but always emits numeric sequencing. I haven't tried to figure out if STNG does this better (I think it does), but regardless that is an implementation issue.
Also note that it is not stated that StructuredTextNG actually has any concept of a list which contains the paragraphs so delimited. Given this absence, it is not unreasonable that it refrains from enforcing any order on the sequences of adjacent paragraphs at the same level. (In other words, do what one is told, without trying to be too clever!)
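As a rough check of what the ordered-list expression accepts, it can be assembled and applied to a few paragraph starts. The helper name is hypothetical and, again, matching at the start of the paragraph is my assumption.

```python
import re

# The ordered-list pattern quoted above, joined from its four pieces.
ORDERED = re.compile(
    r'\s*'
    r'(([a-zA-Z]|[0-9]+|[ivxlcdmIVXLCDM]+)\.)*'
    r'([a-zA-Z]|[0-9]+|[ivxlcdmIVXLCDM]+)\.?'
    r'\s+'
)

def enumeration_of(paragraph):
    """Return the leading enumeration text, or None if there is none."""
    match = ORDERED.match(paragraph)
    return match.group().strip() if match else None
```

One consequence worth noticing is that, because the final DOT is optional, a paragraph beginning with a one-letter word (for instance "A sentence") is matched just as "1. First item" is.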
The actual regular expression used is '\s+--\s+'
, i.e.,
There is some checking done on the element title - specifically:
r'[^\0- ]'
- I read that as null
bytes (surely not?), hyphens and spaces. The first seems pointless,
the second over-restrictive - have I misinterpreted it? The third
seems a bit odd to me, but I guess might be deliberate, although I
would argue against it, especially in a Python context.
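On the question of misinterpretation: inside a character class, a HYPHEN between two characters normally denotes a range, so the expression may instead mean "any character that is not in the range NUL to SPACE", that is, any character that is neither a control character nor a space. The sketch below checks this reading with the current Python re module (which may differ in detail from the one STNG was written against); the name TITLE_CHECK is hypothetical.

```python
import re

# The title check quoted above: a negated class over the range \0 .. SPACE.
TITLE_CHECK = re.compile(r'[^\0- ]')

hyphen_is_accepted = TITLE_CHECK.search("-") is not None
blank_is_rejected = TITLE_CHECK.search(" \t\n\0") is None
first_visible = TITLE_CHECK.search("  arg1").group()
```

If that reading is right, a title consisting of hyphens would pass the check, and only a title made up entirely of spaces and control characters would fail it.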
All paragraphs are checked to see if they end with
::
(no trailing spaces allowed), and if they do, then any
sub-paragraphs are examples, as indicated.
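In Python terms the test is about as simple as it could be. The function name here is hypothetical, and the sketch assumes the paragraph string is tested as it stands, without any stripping.

```python
def introduces_examples(paragraph):
    """True if the paragraph ends with COLON COLON, with nothing after it."""
    return paragraph.endswith("::")
```

So "Code::" qualifies, but "Code:: " (with a trailing SPACE) does not.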
I can't find anywhere in the STNG code where the word
example
or examples
is treated specially - I
assume that this has "gone away" and is a mistake in the documentation.
Even though I've argued elsewhere in this document that we should
allow things that traditionally start at begpara to start
instead at begline, I don't think I would make a similar
argument for ::
and end-of-paragraph/end-of-line. But
maybe I'm wrong...
The actual regular expression used is:
"(?:\s|^)'"
"([^ \t\n\r\f\v']|[^ \t\n\r\f\v'][^\n']*[^ \t\n\r\f\v'])"
"'(?:\s|[,.;:!?]|$)"
Why start with 0 or more whitespace, and not 1 or more (as advertised)? - presumably because it actually wants to be beginning of line and/or whitespace...
It is not clear to me why the text inside the example string is not
just '[^'\n\r\f\v]'
. I definitely think spaces should be
allowed (and converted to "non-breaking" spaces in the final output).
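The in-line example expression can be exercised as follows. The name INLINE is hypothetical, and using search over the paragraph is an assumption about how the pattern is applied.

```python
import re

# The in-line example pattern quoted above, joined from its three pieces.
INLINE = re.compile(
    r"(?:\s|^)'"
    r"([^ \t\n\r\f\v']|[^ \t\n\r\f\v'][^\n']*[^ \t\n\r\f\v'])"
    r"'(?:\s|[,.;:!?]|$)"
)

quoted = INLINE.search("For example: '<dtml-var foo>'.").group(1)
padded = INLINE.search("say ' padded ' here")
apostrophe = INLINE.search("we don't match this")
```

As transcribed, spaces are permitted in the middle of the quoted text but not immediately inside either QUOTE, and an apostrophe within a word is left alone.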
The actual regular expression used is
'\s*\*([ \na-zA-Z0-9.:/;,\'\"\?]+)\*(?!\*|-)'
Same comment about starting whitespace, and I don't see why the
enclosed text is so restricted - why not just
'[^*\n\r\f\v]'
.
Why may it not be followed by -
? And what about ending
at "end of string"?
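A few experiments with the emphasis expression may help with these questions. The name EMPH is hypothetical, and search is an assumption about how the pattern is applied.

```python
import re

# The emphasis pattern quoted above, as a raw string.
EMPH = re.compile(r'\s*\*([ \na-zA-Z0-9.:/;,\'\"\?]+)\*(?!\*|-)')

in_middle = EMPH.search("some *emphasis* here").group(1)
at_end = EMPH.search("ends with *emphasis*").group(1)
before_hyphen = EMPH.search("*title*- text")
no_spaces = EMPH.search("a*b*c").group(1)
with_hyphen_inside = EMPH.search("*well-known*")
```

Since the trailing condition is a negative lookahead, it is satisfied at the end of the string, so that case appears to work. The pattern also matches with no surrounding whitespace at all, and refuses text containing a HYPHEN or an EXCLAMATION.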
The actual regular expression used is '\s*\*\*([ \na-zA-Z0-9.:/;\-,!\?\'\"]+)\*\*'
Same comments as above, although I can't be bothered to write an RE that says "no vertical whitespace, and one asterisk is OK but two aren't".
Is it intentional that no requirement is made for what follows the
terminating **
? Surely not...
Also, can an RE allow one to allow "**This is *more***"?
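Trying the strong expression on the nested cases gives a partial answer, at least for the pattern as transcribed here. The name STRONG is hypothetical, and search is an assumption about how the pattern is applied.

```python
import re

# The strong pattern quoted above, as a raw string.
STRONG = re.compile(r'\s*\*\*([ \na-zA-Z0-9.:/;\-,!\?\'\"]+)\*\*')

plain = STRONG.search("**strong** text").group(1)
glued = STRONG.search("x**y**z").group(1)
nested_inside = STRONG.search("**Some *bold* text**")
nested_at_end = STRONG.search("**This is *more***")
```

Because ASTERISK is not in the set of permitted characters, neither nested form is recognised as strong text by this pattern on its own; whether some later pass rescues them is a separate question.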
The actual regular expression used is
'\"([ a-zA-Z0-9.:/;,\n\~]+)\"'
':'
'([a-zA-Z0-9.:/;,\n\~]+)(?=(\s+|\.|\!|\?))'
Again, I'm not clear on the RE used for the content of the
string. I believe it should be '[^"\r\f\v]'
(specifically, allowing NEWLINE).
I'll take the URL RE on trust, although not much trust (but that's because RE's are my tool-of-last-resort!).
Why is only that subset of punctuation chosen to end it? And what about "end of string"?
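The "end of string" question can be explored directly. The name LINK is hypothetical, and search is an assumption about how the pattern is applied.

```python
import re

# The hyperlink pattern quoted above, joined from its three pieces.
LINK = re.compile(
    r'\"([ a-zA-Z0-9.:/;,\n\~]+)\"'
    r':'
    r'([a-zA-Z0-9.:/;,\n\~]+)(?=(\s+|\.|\!|\?))'
)

in_sentence = LINK.search('See "Zope":http://www.zope.org/ for more')
at_end = LINK.search('"Zope":http://www.zope.org/')
before_dot = LINK.search('See "Zope":http://www.zope.org/.')
before_comma = LINK.search('"Zope":http://www.zope.org/, and more')
```

With this pattern a link at the very end of the string is still matched, but the URL is cut back to the last point that is followed by a DOT (here "http://www.zope"), and a COMMA after the URL is swallowed into it, since COMMA is one of the permitted URL characters.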
The actual regular expression used is
'\"([ a-zA-Z0-9./:]+)\"'
',\s+'
'([ a-zA-Z0-9@.:/;]+)(?=(\s+|\.|\!|\?))'
Same comments as above.
This is obviously meant to allow the user to add extra "formatting" at the expense of no longer using "pure" StructuredText - this is currently the only way of getting tables in, for instance.
In fact, what this means is that there is no special handling of "SGML" text at all, so it should get treated as normal text and passed through transparently. Personally, I'd have left this statement off, or added it as a "by the way...".
<br>
).
I assume that StructuredText users have avoided this problem by just inserting a new paragraph. That seems like the sensible option to me as well.
Perhaps the most obvious way this affects use is in enumerated lists. If a user has a list of the form:
  1. First item

  B. Second item

  Non-list paragraph.

  3. Third item?
then STNG has nothing to say about the meaning of "1", "B" and "3", nor does it care if their order makes sense. It is up to the final processor (or possibly to some "lintian" tool) to peruse the paragraph hierarchy and decide if it makes sense, and how to interpret it (one might imagine, for instance, looking at adjacent list paragraphs at the same level to see if they have ascending sequence numbers of the same type, and if so aggregating them together into a list).
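The sort of aggregation imagined above might look like the following sketch. Everything in it is hypothetical (names, the simplified enumerator pattern, the decision to ignore roman numerals and multi-part enumerations); it is meant only to show that the check belongs in a later pass, not in STNG itself.

```python
import re

# Simplified enumerator: digits or a single letter, followed by DOT.
ENUMERATOR = re.compile(r'\s*([0-9]+|[a-zA-Z])\.\s+')

def classify(paragraph):
    """Return (kind, value) for a list paragraph, or None for other text."""
    match = ENUMERATOR.match(paragraph)
    if match is None:
        return None
    label = match.group(1)
    if label.isdigit():
        return ("digit", int(label))
    return ("letter", ord(label.lower()) - ord("a") + 1)

def aggregate(paragraphs):
    """Group adjacent paragraphs whose enumerations share a kind and ascend."""
    groups = []
    previous = None
    for paragraph in paragraphs:
        current = classify(paragraph)
        if (current is not None and previous is not None
                and current[0] == previous[0] and current[1] > previous[1]):
            groups[-1].append(paragraph)
        else:
            groups.append([paragraph])
        previous = current
    return groups
```

Applied to the four paragraphs of the example, this yields four separate groups, whereas "1.", "2.", "3." in sequence would be gathered into one.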
Similarly, one should be careful about how much meaning one applies to the parent/child relationship between paragraphs - in:
  1. The first paragraph of a list item.

     The second, neatly aligned.
the obvious alignment of the second paragraph leads to it being a child of the first (which is a Good Thing, and obviously deliberate), but it doesn't really mean that the content of the first paragraph is any more special.
Or is this all very obvious to people used to handling hierarchical document structures? (I'm used to thinking in terms of the meaning of textual elements and their relationships being encoded, which was why I initially wanted to use a TeX or similar approach for docstrings.)
DOUBLEQUOTE
)
DOUBLEQUOTE L-BRACKET
not(R-BRACKET
)
R-BRACKET whitespace | punctuation
CLOSE-BRACKET
)
CLOSE-BRACKET whitespace URL
The first item occurs in the body of the text, and the second item occurs on a line by itself, and provides the URL.
Thus "Zope"[zp]
would be represented as
Zope if there was a later line
of the form:
.. [zp] http://www.zope.org/
I believe this to be (more or less) the same as the mechanism currently implemented in StructuredTextClassic, and the same as the proposal at OutOfLineURL.
(NB: also see DefaultURLInterpretation for another relevant suggestion, although one of less interest to us, I think.)
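To make the two-part mechanism concrete, here is a minimal sketch of how a processor might pair the in-text item with its out-of-line URL. The patterns and names are my own simplifications for illustration, and are not taken from StructuredTextClassic or from the OutOfLineURL proposal.

```python
import re

# In-text item: "label"[key]; out-of-line item: a line of the form .. [key] URL
REFERENCE = re.compile(r'"([^"]+)"\[([^\]]+)\]')
TARGET = re.compile(r'^\.\. \[([^\]]+)\]\s+(\S+)\s*$', re.MULTILINE)

def resolve(text):
    """Return (label, url) pairs; url is None if no target line is found."""
    targets = dict(TARGET.findall(text))
    return [(label, targets.get(key)) for label, key in REFERENCE.findall(text)]

SAMPLE = 'See "Zope"[zp] for details.\n\n.. [zp] http://www.zope.org/\n'
links = resolve(SAMPLE)
```

A reference with no matching target line is reported with a URL of None, leaving it to the caller to decide whether that is an error.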
'text'
mechanism for escaping literal text is
considered by some to be unnatural for Python code because of Python's
tendency to use '
as a string delimiter.
(The same argument doesn't apply to "
because it has no
meaning for a "bare" string, only for strings used as URL
descriptors.)
Can we do without this? (No, say I and some others.)
The main proposal that resulted was to enclose Python code fragments
within #
(HASH) delimiters, as follows:
HASH,NEWLINE
)
ASTERISK whitespace |
punctuation | endpara
(Note that Python comments within example code would still be safe, since example code is not colourised.)
Some discussion ensued about how that related to the next item, Python code references.
Two forms of this are wanted:
The "guessing" technique takes the highlighted entity, and attempts to deduce what it is (a class name, method name, function name, etc.) by reference to the module including the docstring, using the normal Python rules for resolving a name. It then creates a link to the answer (or, perhaps, to the most likely answer).
The "explicit" technique needs to allow the docstring author to indicate what type of Python entity is being linked to - this can become necessary, for instance, if one has a class and method of the same name - from within the class's docstring, it is not clear which is required, and it may be undesirable to say #self.name#.
Note that a Python programmer will be willing to lose a little "naturalness" in the text for the sake of achieving this linkage.
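A very rough sketch of the "guessing" technique, using facilities of the current Python standard library, might be as follows. The function name is hypothetical, the json module merely stands in for "the module including the docstring", and following attributes from the module is a simplification of the normal name resolution rules.

```python
import inspect
import json  # stand-in for the module that contains the docstring

_MISSING = object()

def guess_kind(module, dotted_name):
    """Guess what a dotted name refers to, starting from the given module."""
    target = module
    for part in dotted_name.split("."):
        target = getattr(target, part, _MISSING)
        if target is _MISSING:
            return None  # the name cannot be resolved from here
    if inspect.isclass(target):
        return "class"
    if inspect.isroutine(target):
        return "function or method"
    if inspect.ismodule(target):
        return "module"
    return "other"
```

The clash described above (a class and a method of the same name) is exactly where such guessing gives out, which is why the "explicit" technique is wanted as well.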
The main proposals for delimiting the entity name were:
#code fragment#
notation, and try to find
references if that contains a single name.
#code fragment#
notation, and try to find
references for any Python entities therein.
^name
(i.e., immediately precede it by a CARET). This
would not be allowed within #
delimiters (or if it
were, one would need a way of escaping ^
if it were
needed in its Python meaning - see the discussion on escaping
characters!).
Proposed means of indicating what type of thing is to be
referenced still to be annotated, but I can't help thinking that
something along the lines of
#name#:Class
(in pseudo-parallel with URL references) might work...
doctest
utility will run Python code
examples embedded in docstrings (for testing purposes). To do this, it
needs to identify said code examples. It has been proposed
that the docstring utility should have the same functionality, and
treat such code examples as examples (without the need for the
preceding paragraph to end in ::).
As an example, consider the following:
    """A simple docstring for a simple function.

    Arguments

        arg1 -- an integer
        arg2 -- another integer, to be added to the first

    In normal usage, one just adds numbers together:

        >>> add(1,3)
        4
        >>> add(-1,-2)
        -3

    Unfortunately, we don't cope with real numbers:

        >>> add(1.0,2.0)
        ERROR: can't add floating point numbers
    """
I'm not sure if this is in Python 2.0 as standard, but see (for instance) FAQTS - Docstring-driven testing for details.
Note: I regard this as a very important thing to include in docstring support - to my mind it is more important than the issue of references into the code (above).
It is worth noting that by Tim's rules about what can be considered as test text (i.e., that any blank lines must be absent) this already counts as a paragraph in the STNG sense, so should be very easy to detect...
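For comparison, the doctest module in the current Python standard library (which postdates this text, so this is an assumption about what a docstring tool could lean on today) will pick such examples out of a docstring directly. The sample text below is a cut-down version of the example above.

```python
import doctest

SAMPLE = """A simple docstring for a simple function.

In normal usage, one just adds numbers together:

    >>> add(1,3)
    4
    >>> add(-1,-2)
    -3
"""

# DocTestParser finds the interactive examples without running them.
examples = doctest.DocTestParser().get_examples(SAMPLE)
sources = [example.source for example in examples]
wanted = [example.want for example in examples]
```

Each example carries its source line and the expected output, with the common indentation removed, which is most of what would be needed to render the paragraph as example text.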
In normal use of ST[NG], there is almost no need to be able to "escape" a character. Given this, and given the requirement for StructuredText to read naturally of itself, Jim Fulton is reluctant to introduce general "quoting" of special characters, unless it can be justified very carefully.
When is it actually needed? In traditional ST usage, I suspect only
in the context of 'literal text'
, when one might want to
insert a '
, and this is both relatively unlikely, and
probably possible to write "around".
In a docstring environment, however, it is a certainty that one day someone will want to do the equivalent of
'And the single quote is "'"'
If we have HASH quoting for Python code fragments, the same problem would arise with:
#print "#",#
A suggestion was made of using !
, since it is used to
escape Zwiki names, but this doesn't look very good, and Jim Fulton
doesn't like it. He has come up with some suggestions for alternative
(balanced) quoting schemes that I don't entirely follow, but the text
is on the Zwiki page I can't find. I don't believe it addressed the
Python-specific problems, anyway.
I have in the past suggested the TILDE character for such escaping (doubling it to get the actual character), but that was more in a sense of desperation than anything else. We need to consider what characters are rarely, if ever, used in Python, and how they look (for instance, AT would be a bad choice visually).
It would be nice (although not essential) if we also solved the problem for "mainstream" STNG...
Author: Guido van Rossum
or
History:
0.1: June 1920
0.2: July 1919
or even
Arguments:
name -- The person's full name
age -- Their age
The :
at the end of the "tag" was not finalised - in
context of the time ::
was also considered (I think a
single COLON would do, actually, and we can't use ::
without causing some confusion!).
The idea also made heavy use of the ability not to require "spurious" blank lines between content lines, and it was assumed that some of the parsing of the content would be dependent upon the actual tag.
A list of tags would need to be made - proposals made included:
Argument[s], Returns, Exception[s], Author, History, Version, and
maybe Code (this may be redundant) and Links (intended to introduce
the links section of an OutOfLineURL, but probably redundant).
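Recognising such tags at the start of a paragraph would be straightforward. The sketch below assumes a single COLON ends the tag and uses only the less doubtful names from the list above; the function and constant names are hypothetical.

```python
import re

# Tag names taken from the proposals above; "Argument[s]" gives two spellings.
TAGS = ("Argument", "Arguments", "Returns", "Exception", "Exceptions",
        "Author", "History", "Version")

TAGGED = re.compile(
    r'\s*(%s):[ \t]*(.*)' % "|".join(sorted(TAGS, key=len, reverse=True)),
    re.DOTALL,
)

def split_tag(paragraph):
    """Return (tag, content), or (None, paragraph) if there is no known tag."""
    match = TAGGED.match(paragraph)
    if match is None:
        return (None, paragraph)
    return (match.group(1), match.group(2))
```

Parsing of the content would then be chosen according to the tag, as suggested above; an unknown tag such as "Note:" is simply left as ordinary text.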
My gut feeling is that this is a Really Good Idea, and that we should leave it alone for the moment.
Reference: see David Ascher's email of 1999-11-29 in the DocSIG, [Doc-SIG] docstring grammar
(The Wiki page for discussing the proposal to reintroduce that
representation, or indeed an alternative, is
StructuredTextTables.
My personal feeling is that this is likely to take a long while
to find a solution for, since tables are difficult to describe with
ASCII text in a way that everyone finds palatable.)
The StructuredText module appears (ignoring the residual StructuredTextClassic content) to work as follows:
The __init__.py for the StructuredText module provides the following (again, ignoring Classic stuff):