pytext-fat release 0.2/"onwards, ever onwards..."
(backquotes for literals, reduced context dependency)
The previous version of this text is in fat1.html. A version with amended text marked up like this is in fat1-diff.html.
For the purposes of this document, I will call the format that is being documented "pytext". This is not to be taken as a final name, but I need to have something to use as a hook herein.
Edward Loper is likely to be working on a "minimal" markup specification, which I will refer to as the "thin" spec. This is thus the "fat" spec. It tries to retain those things from STpy that I believe had been sought by previous turns round the Doc-SIG loop, and which seem (to me) worthwhile.
pytext is an attempt at a moderately minimal markup system for Python docstrings. This is something that has been vexing the Doc-SIG since 1997 or earlier. There is a definite demand for the ability to place some markup in docstrings. Some of this demand is for presentation markup (things like emphasis) and some for semantic markup (things like #python_code#).
Early on it was realised that most programmers will not use "heavyweight" markup (examples of this would be HTML, SGML or XML-derivatives, TeX variants, etc.), and Guido himself declared that this would be a bad idea [ref to be provided - email to Doc-SIG in 1997].
One alternative proposed was "setext" [ref to be provided], which was intended to be "markup for emails". From this, Jim Fulton and other Zopistas produced StructuredText (or ST).
ST has its problems - not least a very informal and imprecise definition, and an implementation that is (or so I am told) rather unpredictable. Nonetheless, it has been used to great success in a variety of Zope related arenas, by many people.
The Zopistas at Digicool have been working on a replacement, STNG, and a while back it looked as if the Doc-SIG was going to adopt a relative, developed "in-house", with a more precise definition, tentatively called STpy. Part of the motivation for this adoption was a perceived need for compatibility with what was going on within the Zope world.
Recently, however, Guido himself has intimated that he does not like ST (he appears to have been bitten by the implementation), but more importantly, that he does not regard it as necessary to maintain compatibility with the ST family. This generally throws the cat amongst the pigeons, but also makes it possible to consider some more radical changes that we had wished for.
So, on to the fat version of pytext...
pytext is primarily intended for use in Python docstrings. It may also be used for other short documents, but that is an incidental convenience. It is specifically not targeted at long documents, books, or even articles. It is intended to provide a minimal amount of markup, and relatively little control of presentation - if someone writing a docstring is worrying about presentation, they're probably worrying about the wrong thing.
pytext takes a docstring (or other short text) as input, and produces a DOM tree as output. The decision to output a DOM tree was taken so that simple processing of the output would be possible (for instance, to find all Python literals, for cross-referencing purposes) - Python itself provides both a minimal DOM implementation (minidom) and more advanced tools (via fourthought, etc. [ref]). Also, it seemed preferable to choose a well-known data structure rather than create yet another ad-hoc one. Finally, producing a DOM tree with a known DTD (specified informally in this document, and still to be finalised) means that one can mix and match pytext parsers and output formatters.
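As an illustration of the "simple processing" mentioned above, here is a minimal sketch using the standard library's minidom. The tree is built by hand, not by a pytext parser. The element names "pytext" and "para" are the ones used in this document, but "pyliteral" is a hypothetical name for the element holding a Python literal, chosen only for this example.

```python
from xml.dom.minidom import getDOMImplementation

# Build a tiny tree by hand, standing in for parser output.
# "pyliteral" is a hypothetical element name (an assumption).
impl = getDOMImplementation()
doc = impl.createDocument(None, "pytext", None)
root = doc.documentElement

para = doc.createElement("para")
para.appendChild(doc.createTextNode("Call "))
literal = doc.createElement("pyliteral")
literal.appendChild(doc.createTextNode("os.path.join"))
para.appendChild(literal)
para.appendChild(doc.createTextNode(" to build a path."))
root.appendChild(para)


def python_literals(document):
    """Return the text of every (hypothetical) pyliteral element."""
    return [node.firstChild.data
            for node in document.getElementsByTagName("pyliteral")]


found = python_literals(doc)
```

A cross-referencing tool could then look each entry of `found` up in whatever index of names it maintains.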
pytext carefully does not assume that it knows anything particular about the final formatter that will be used to represent its data. It is known that all of HTML, XHTML, XML, LaTeX, ST, pytext and PDF are likely to be produced in the future. It is quite possible that an application might never actually output a representation, but just use the DOM tree for some other purpose.
This decision is the other main reason for not making the markup too clever - allowing for all of the possible target systems would make this rather difficult.
As part of the work of developing this specification (and early versions, in the STpy family), I have produced a module called docutils, which provides parsing code for pytext.
Note: at the time of writing, docutils is still nearing its first alpha release, and still needs to be forced sideways from STpy to pytext - but for simplicity I will talk about it in the present tense rather than the near future...
It also supplies a simple command line driven frontend, which allows users to experiment with different possible choices that might be made as to the details of pytext (these will be described at appropriate places below), and a very simplistic (and horribly colourful) HTML formatter. It is important to remember that this utility is not a production tool, it is a testbed. Also note that if it's easy to program, it may have been left out (can I trust those on the Python lists to know what is easy to program and might thus have been left out? Let's assume so).
Note that despite the fact that pytext does not provide/require a formatter of its own, examples will be presented in this document to illustrate the sort of output that a user would be justified to expect.
The following characters or character sequences are special in pytext-fat:
<   - start delimiter for URIs
>   - end delimiter for URIs
..[ - start delimiter for anchors
[   - start delimiter for localrefs
]   - end delimiter for anchors and localrefs
#   - 'quote' symbol for Python literals
`   - 'quote' symbol for literals
*   - 'quote' symbol for emphasised text, and also one of the unordered list bullets
::  - marker for literal blocks
--- - marker for descriptive list items
@   - this character is reserved for future use as an "escape" character, just in case
(See also a possible use for ~ for delimiting non-descriptive list items, below.)
The above characters or character sequences may only appear in literal blocks (where they will not be interpreted as markup), in literals (ditto, but note a literal cannot contain `, a Python literal cannot contain # and a URI cannot contain >), and as markup signifiers in colourised text (save for *, which can also appear as a list bullet).
Spaces and newlines are discussed below. Tabs get converted to spaces. Other "non-printing" characters are undefined in their action, and may be flagged as "unexpected" by an implementation.
So, pytext takes as input a docstring or other relatively short and simple document. I shall use the term "document" for the text, for simplicity.
DOM The top element of the DOM tree representing a document has the tag "pytext". It may have multiple children, representing the top level blocks within the document.
The document is first processed to replace all tabs, using the "normal Python conventions" (i.e., the appropriate string routine). This means that we can assume that there are no tabs left in the document. It also means that the document cannot use tabs for any specialised purpose.
The document is considered to consist of a series of lines, each containing zero or more text characters followed by a newline (i.e., the normal Python convention). The last line of the document is treated as if it possessed a newline, whether this is actually true or not.
The behaviour of non-printing characters [define more carefully] is undefined - specifically, odd things like vertical tab and so on are not considered whitespace.
A line containing only spaces, or having zero length (ignoring the newline) is a blank line. All other lines are text lines.
Each text line has an indentation - this is simply the number of spaces at the beginning of the line. If the document is a docstring, and the first text line is followed immediately by a blank line, and has an indentation of zero, then it is given the same indentation as the following text line (this is the normal procedure for docstring indentation, but phrased slightly differently than normal).
Trailing spaces are not significant, and are removed from all lines.
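The line handling just described can be sketched as follows. This is an illustration only, not the docutils code: the function name, the (indentation, text) pair representation and the use of `None` to mark a blank line are all assumptions made for the example, as is the reading that the "first text line" is the first line of the document.

```python
def split_lines(document, is_docstring=True, tabsize=8):
    """Return a list of (indentation, text) pairs.

    Blank lines are returned as (None, "").
    """
    text = document.expandtabs(tabsize)      # no tabs survive this step
    if text.endswith("\n"):
        text = text[:-1]                     # last line "has" a newline either way
    lines = []
    for raw in text.split("\n"):
        stripped_right = raw.rstrip(" ")     # trailing spaces are not significant
        if not stripped_right:
            lines.append((None, ""))         # blank line
        else:
            body = stripped_right.lstrip(" ")
            lines.append((len(stripped_right) - len(body), body))

    # Docstring special case: an unindented first text line followed
    # immediately by a blank line borrows the next text line's indentation.
    if (is_docstring and len(lines) > 1
            and lines[0][0] == 0 and lines[1][0] is None):
        for indent, _ in lines[2:]:
            if indent is not None:
                lines[0] = (indent, lines[0][1])
                break
    return lines
```

Splitting on "\n" rather than using `splitlines()` is deliberate here, since `splitlines()` also breaks on characters such as vertical tab, which this document does not treat as whitespace.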
Text within the document is gathered into blocks. There are two forms of block - one form will be colourised (that is, it may contain markup characters and they will be interpreted), and the other form will not be colourised (that is, characters will not be interpreted as markup).
All characters which are not being interpreted as markup will be passed through untouched, and should be represented "as is" by the final formatter, with the following exceptions:
Note in particular that the characters &, <, > (all significant in HTML) are not special in pytext, and will represent as themselves.
The following rules are used to split the document into blocks (my apologies that these involve some forwards references):
In practice, this does much what you would expect. For instance:
    This is a block - it is a paragraph.

    This is another paragraph - it follows a blank line.

    2. This is a list item.
    * So is this - and thus it too starts a block, even though it does not have a blank line in front of it.
which might be formatted as:
This is a block - it is a paragraph. This is another paragraph - it follows a blank line.
Each block has an indentation associated with it. This is determined by the first line of the block. In colourised blocks, this is the only indentation that is deemed significant, and all other leading spaces are regarded as "soft" (i.e., they follow the "many spaces go to one" rule).
For non-colourised blocks, the indentation of each internal line is significant (although not quite directly), but the indentation of the block is still the indentation of the first line.
Conceptually (although not in the final DOM tree), blocks are formed into a hierarchical tree based upon their relative indentations. This happens in a manner which should be fairly familiar to a Python programmer...
Specifically, each block is assigned a level (the first paragraph arbitrarily gets level 0) as follows:
Given a block P at indentation I and level L:
- If the following block has the same indentation, then it is at the same level, and is a sibling of P.
- If the following block has more indentation, then it is at the next level (L+1), and is considered a child of P. The indentation corresponding to the new level is remembered.
- If the following block has less indentation, then the corresponding level is looked up.
If there is no corresponding level, then an error has occurred. An implementation may choose to continue after such an error, in which case the erroneous block shall be assigned the maximum level corresponding to an indentation that is less than the "bad" indentation.
The block is added to the hierarchical structure as a sibling of the previous block with the new level, and the indentations corresponding to all greater levels are forgotten.
This makes more sense when explained with an example:
    The first paragraph is at level 0.               [1]

    This second paragraph is also at level 0.        [2]

       1. This list item has indentation 3 and is    [3]
          at level 1.

          This paragraph has indentation 6 and is    [4]
          thus at level 2.

       This paragraph is back at level 1. The        [5]
       previous indentation for level 2 goes.

            This paragraph has indentation 8, but    [6]
            is at level 2 again.

          This paragraph has indentation 6, which    [7]
          is a mistake...
As we process the paragraphs we find:
paragraph | indentation | level | {level:indent} |
---|---|---|---|
1 | 0 | 0 | {0:0} |
2 | 0 | 0 | {0:0} |
3 | 3 | 1 | {0:0, 1:3} |
4 | 6 | 2 | {0:0, 1:3, 2:6} |
5 | 3 | 1 | {0:0, 1:3} |
6 | 8 | 2 | {0:0, 1:3, 2:8} |
7 | 6 | invalid | {0:0, 1:3, 2:8} |
Paragraph 7 is a problem because its indentation does not match any appropriate preceding indentation (it can't "see" paragraph 4). This mimics the way that Python requires indentation to be consistent.
One would expect an implementation to output some form of error for paragraph 7, and if it continued, to assign it level 1.
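The level assignment rules can be sketched in a few lines of Python. This is only an illustration of the rules as stated above (the function name and the `(level, ok)` return shape are my own choices for the example); it takes the indentation of each block in order and keeps the same {level:indent} mapping as the last column of the table.

```python
def assign_levels(indentations):
    """Return one (level, ok) pair per block; ok is False for an error."""
    indent_of = {}              # level -> indentation remembered for it
    level = 0
    results = []
    for indent in indentations:
        ok = True
        if not indent_of:
            level = 0                               # first block is level 0
        elif indent > indent_of[level]:
            level += 1                              # child of previous block
        elif indent < indent_of[level]:
            matches = [lv for lv, ind in indent_of.items() if ind == indent]
            if matches:
                level = matches[0]                  # back out to a known level
            else:
                # Error recovery: maximum level whose indentation is smaller.
                smaller = [lv for lv, ind in indent_of.items() if ind < indent]
                level = max(smaller, default=0)
                ok = False
        if ok:
            indent_of[level] = indent
            # Indentations for all greater levels are forgotten.
            for lv in [lv for lv in indent_of if lv > level]:
                del indent_of[lv]
        results.append((level, ok))
    return results


example = assign_levels([0, 0, 3, 6, 3, 8, 6])
```

For the seven paragraphs of the example this gives levels 0, 0, 1, 2, 1, 2 and then level 1 with the error flag set for paragraph 7. On an error the mapping is left unchanged, as in the last row of the table.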
docutils performs in this manner, and the example formatter produces some stunningly obvious (or perhaps obnoxious) representation around the offending block. "Real" utilities should preferably be subtler...
Thus we now have a tree structure where the level 0 blocks are children of the "document" as a whole, the level 1 blocks are their children, and so on.
Later in the document the terms "child" and "parent" will be used to identify the relationship of blocks within this hierarchy.
If you have been exposed to ST or STNG in the past, you may be worrying about the uses to which indentation is put within the document, and how important this hierarchy actually is. Please don't worry - it is used much less than in the ST family, and in a much simpler fashion.
A colourised block which is not a list item, header, anchor, label or other specialised structure is (sensibly enough) termed a paragraph.
DOM A simple paragraph is represented as a "para" element. It may not contain any other block elements.
There are three types of list in pytext - descriptive, ordered and unordered. When colourised blocks are being identified, text lines are checked for them in that order (so if there is ambiguity between a descriptive list item and an ordered list item, the list item will be descriptive).
List items do not occur in non-colourised blocks.
Note: an old proposal to make ordered and unordered list items less ambiguous (particularly the "this '1.' is part of the previous sentence" problem) has been re-suggested. This would prefix all list bullets and sequences with a specialised markup character, to unambiguously indicate their nature.
For instance, if
~
were chosen (which is fairly "low visibility", and not commonly used in English text), then non-descriptive lists would look like:
    The next line starts a list item.
    ~1. The first item
    ~2. The second item.
    ~a. Another
    ~* Heh, that was a bullet
    ~+ And another

I'm not convinced, personally, but the idea is there.
A descriptive list item is composed of a key (or title), followed by one or more spaces, followed by three hyphens (i.e., ---), and then optionally followed by one or more spaces and some text.
The key may not contain a newline, but is otherwise unconstrained. The key may contain markup characters.
The three hyphens (and their delimiting spaces) are not considered to be part of the key or the text, and are thus not retained in the DOM tree.
The sequence --- may not occur within colourised text other than in a descriptive list item. It may occur within a Python literal, a literal or a URI.
Any child paragraphs of the descriptive list item are considered to be part of that list item.
A contiguous sequence of descriptive list items (at the same level) will be aggregated together into a single descriptive list.
For example:
    Descriptive lists look like the following:

    Key --- and some text.

        And this paragraph is a child.

    Another key --- and some more text
    This key ---

        Has text only in a child paragraph.

    ` --- ` --- is what is used to delimit descriptive list items
which might be represented as:
Descriptive lists look like the following:
or perhaps as:
Descriptive lists look like the following:
Note: The ST family uses a double hyphen to delimit descriptive lists. Various people (including Guido and Edward Loper) dislike this because they use double hyphens within text -- like this. Three hyphens together is not a normal usage, so should be safe.
Are there other, better alternatives worth considering?
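To make the three-hyphen rule concrete, here is a rough sketch of recognising a descriptive list item with a regular expression. The pattern and function name are assumptions for illustration. Note that it is deliberately naive: it does not know about literals, so a key that itself contains a quoted ` --- ` would be split at the wrong place, and a real implementation would have to take markup into account.

```python
import re

# key, one or more spaces, three hyphens, then optional spaces and text.
DITEM = re.compile(r"^(?P<key>\S.*?) +---(?: +(?P<text>.*))?$")


def parse_ditem(line):
    """Return (key, text) for a descriptive list item, else None."""
    match = DITEM.match(line)
    if match is None:
        return None
    return match.group("key"), match.group("text") or ""
```

With this pattern a double hyphen in running text -- like this -- is left alone, which is the point of moving to three hyphens.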
DOM The key is rendered as a "key" element, and the text (if present) as a "paragraph". The example would be stored as follows:

    <dlist>
      <ditem>
        <key> Key
        <para> and some text
        <para> And this paragraph is a child
      <ditem>
        ...
An ordered list item is composed of an enumeration, optionally followed by one or more spaces and some text.
An enumeration sequence is a single letter (upper or lower case), or a number, followed by a dot.
Any child paragraphs of the ordered list item are considered to be part of that list item.
A contiguous sequence of ordered list items (at the same level, and of the same enumeration type) will be aggregated together into a single ordered list. Note that upper case and lower case letters are not considered of the same type.
For example:
    Ordered lists look like the following:

    1. This is the first item
    3. This is the second item (yes it is)
    a. This is a new list
    b. This list has disjoint text.

       A. and a sublist
which might be represented as:
Ordered lists look like the following:
DOM The enumeration is rendered as a "sequence" attribute, and the text (if present) as a "paragraph". The example would be stored as follows:

    <olist>
      <oitem sequence="1">
        <para> This is the first item
      <oitem sequence="3">
        ...
Note that allowing letters in enumerations means that ambiguity is possible - for instance:
    Who am
    I. Me.

which will be parsed as:
Who am
- Me.
(that is, the "I" is interpreted as introducing a list item).
So, should we only allow numbers? One can still have the same problem, of course!
On the whole, this is something people will need to learn to work around.
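The enumeration rule, and the ambiguity it causes, can be seen in a small sketch. The pattern, the function names and the three type names ("number", "upper", "lower") are assumptions made for this example; the document only says that upper and lower case letters are different enumeration types.

```python
import re

# A single letter or a number, then a dot, then optional spaces and text.
ENUM = re.compile(r"^(?P<seq>[A-Za-z]|[0-9]+)\.(?: +(?P<text>.*))?$")


def enumeration_type(seq):
    """Classify an enumeration so that lists of one type can be grouped."""
    if seq.isdigit():
        return "number"
    return "upper" if seq.isupper() else "lower"


def parse_oitem(line):
    """Return (sequence, type, text) for an ordered list item, else None."""
    match = ENUM.match(line.lstrip(" "))
    if match is None:
        return None
    seq = match.group("seq")
    return seq, enumeration_type(seq), match.group("text") or ""
```

Note that `parse_oitem("I. Me.")` succeeds, which is exactly the problem described above: nothing in the line itself says that the "I." was meant as the end of a sentence.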
An unordered list item is composed of a bullet, optionally followed by one or more spaces and some text.
A bullet is one of "*", "-" or "+".
It is a historical oddity that * may be a bullet character, but is also used to delimit emphasis. Unfortunately, its use for both purposes is firmly established in many people's fingers (including mine) and it seems a pity to waste that fact...
Any child paragraphs of the unordered list item are considered to be part of that list item.
A contiguous sequence of unordered list items (at the same level, and with the same bullet) will be aggregated together into a single unordered list.
For example:
    Unordered lists look like the following:

    * This is the first item
    * This is the second item
    - This is a new list
    - This list has disjoint text.

      * and a sublist
which might be represented as:
Unordered lists look like the following:
DOM The bullet is rendered as a "bullet" attribute (of the unordered list), and the text (if present) as a "paragraph". The example would be stored as follows:

    <ulist bullet="*">
      <uitem>
        <para> This is the first item
      <uitem>
        ...
Note that allowing plus and minus as bullets is potentially confusing if someone is doing lots of maths. I think this is an acceptable risk.
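The aggregation rule for unordered lists (contiguous items with the same bullet form one list) might be sketched like this. The function name and the `(bullet, [texts])` result shape are assumptions for the example, and it only looks at item lines that are already known to be at the same level.

```python
from itertools import groupby

BULLETS = ("*", "-", "+")


def group_ulists(item_lines):
    """Group contiguous list item lines into (bullet, [texts]) lists."""
    items = []
    for line in item_lines:
        bullet, _, text = line.strip().partition(" ")
        if bullet in BULLETS:
            items.append((bullet, text.strip()))
    # groupby only merges neighbours, so a change of bullet starts a new list.
    return [(bullet, [text for _, text in group])
            for bullet, group in groupby(items, key=lambda item: item[0])]


lists = group_ulists(["* first", "* second", "- new list", "- more"])
```

Here `lists` holds two lists, one with bullet "*" and one with bullet "-", which matches the "bullet" attribute living on the list rather than on each item.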
As one might hope, lists can be intermingled in the natural manner, with the obvious results.
It can be useful to split a document up into named sections. Three levels of section are provided (which should be more than enough).
A block is a heading if:
"=", "-" or "~".
"=", "-" or "~".
A level 1 heading looks like this:

    A title
    =======

A level 2 heading looks like this:

    A subtitle
    ----------

A level 3 heading looks like this:

    A subsubtitle
    ~~~~~~~~~~~~~
In each case, at least 2 of the "underlining" character must be present (ideally, the right number, of course, but it seems overly pedantic to check, and two seems like a reasonable compromise - it also stops the "-" case being misinterpreted as starting a list item).
A heading block starts a new section of the appropriate level, with the first line of the heading as its title. Level 1 is the "top" level. A section continues until another heading of the same level. The blocks within a section need not be indented more than the header block (but they should not be indented less).
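A possible check for heading blocks, based only on the three examples and the "at least 2" rule above, is sketched below. It assumes (and this is my assumption, not a stated rule) that a heading block is exactly two lines long and that the second line consists solely of one repeated underlining character.

```python
HEADING_LEVELS = {"=": 1, "-": 2, "~": 3}


def heading_level(block_lines):
    """Return (level, title) if the block looks like a heading, else None."""
    if len(block_lines) != 2:
        return None
    title = block_lines[0].strip()
    underline = block_lines[1].strip()
    char = underline[:1]
    if (char in HEADING_LEVELS
            and len(underline) >= 2                     # a lone "-" is a bullet
            and underline == char * len(underline)):    # one character repeated
        return HEADING_LEVELS[char], title
    return None
```

The length of the underline is not compared with the length of the title, in line with the decision above not to be pedantic about it.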
The representation of sections will normally only be evident in the representation of the section title (unless, for instance, sections were relatively indented in the output format). One might expect an HTML formatter to choose (for instance) <h3>, <h4> and <h5>.
DOM Sections are represented by "section1", "section2" and "section3" elements, which may (optionally) contain a "title" element, followed by the (top level) elements for that section. For instance:

    <section1>
      <title> A title
      <para> The first paragraph of that section.
If, for some reason, a user has indented a section, then that indentation should be taken as meaningful, and I think that this means that the first block of lesser indentation should end the section. Whether this is useful or not, I'm not entirely sure...
So, for instance, if the user types:
    Here is some text.

        Section 1
        ---------

        Section 1 text.

    This text is not in section 1.

then an implementation should treat this as:

    <para> Here is some text.
    <section2>
      <title> Section 1
      <para> Section 1 text.
    <para> This text is not in section 1.
If the user specifies a heading of level N before a heading for level N-1 has occurred, an untitled occurrence of a level N-1 section will be inserted "around" the level N section, just to keep the DOM tree pretty.
Headings and sections are optional within a document, and the first heading may occur at any point within the document.
Anchor blocks are colourisable blocks which start with an anchor definition.
An anchor definition starts with ..[ and ends with ]. The character ] may not occur within an anchor definition. The character sequence ..[ and the character ] may not occur in colourised text other than as the start or end of anchor definitions, or (in the case of ]) as the end of a local reference - they may occur within a Python literal, a literal or a URI (well, except that the relevant RFC doesn't allow them in the last).
The text between the [ and ] is termed an anchor.
An anchor is either a sequence of one or more characters, starting with a letter or an underline, and continuing with zero or more letters, digits, underlines, hyphens or ampersands, or a number.
For instance:
    ..[Tibs] My home page is <http://www.tibsnjoan.co.uk/>

    ..[K&R] Many people regard this as the standard reference for the C programming language.

    ..[3] Gosh, reference number 3.
Anchors should be represented "as is", but omitting the initial two dots - for instance:
    [Tibs] My home page is <http://www.tibsnjoan.co.uk/>

    [K&R] Many people regard this as the standard reference for the C programming language.

    [3] Gosh, reference number 3.
Anchors are the "far end" of local references, described below in the section on colourising.
DOM The anchor itself (without the square brackets) is held as an attribute on the "anchor" element. For instance:

    <anchor name="Tibs">
      <para> My home page is ...
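The definition of a valid anchor translates fairly directly into a regular expression. The following is a sketch under that definition; the pattern and function name are assumptions for illustration and ASCII letters are assumed.

```python
import re

# Either a name (letter or underline, then letters, digits, underlines,
# hyphens or ampersands) or a plain number, between "..[" and "]".
ANCHOR = re.compile(
    r"^\.\.\[(?P<name>[A-Za-z_][A-Za-z0-9_&-]*|[0-9]+)\](?P<rest>.*)$")


def parse_anchor(line):
    """Return (anchor, remaining text) for an anchor block's first line."""
    match = ANCHOR.match(line)
    if match is None:
        return None
    return match.group("name"), match.group("rest").strip()
```

The first element of the result is what would be stored in the "name" attribute of the "anchor" element.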
Question: Should anchors support the use of the "inline" element, like labels, so that the formatter can detect that the text for an anchor fitted on one line? I'm not sure...
Label blocks are colourisable blocks which start with a valid label, followed by a colon, optionally followed by spaces and some text.
A label is a sequence of one or more characters, starting with a letter or an underline, and continuing with zero or more letters, digits, underlines or hyphens. Labels are case-insensitive.
A label is only valid if it is in the current set of defined labels. This defaults to:
Question: do we need/want Author(s), Version and History, given the (possible) existence of __author__, __version__ and __history__ (I believe at least the first two are fairly standard)?
If pytext is primarily intended for use within a tool such as pydoc or HappyDoc, it would make sense to use the interrogatable variables rather than the embedded-in-text values.
If an implementation finds text that looks like a label (i.e., appropriate text followed by colon and space), but is not a valid label, then it should be able to provide the user with an appropriate warning.
For each valid label, three properties are defined.
Firstly, a label must state to what DOM element it translates. It is allowed for different labels (for instance, "Author" and "Authors") to translate to the same DOM element (for instance, "author").
Secondly, some labels may be presented in either of two forms:
    Authors: Guido van Rossum and Tim Peters

    Author:
        * Guido van Rossum
        * Tim Peters
In the first form, the label block must be one line long, must have text after the colon (and space) and may not have children. In the second form, the label may not have text after the colon, and must have child paragraphs.
DOM The examples above would be represented as:

    <author>
      <inline> Guido van Rossum and Tim Peters
    <author>
      <ulist bullet="*">
        <uitem> Guido van Rossum
        <uitem> Tim Peters

The "inline" element is used to allow the formatter to know that this label block was presented as a single line. This could equally be done by having an "inline" attribute on the label's tag, and using a "para" for the "inline" text, but that would make specifying whether a label could have inline data more complex (see below).
The implementation must provide a way of defining if the "one line" form is allowed for a particular label. It must also provide a way of specifying which elements (in the DOM sense) may be present as children of the label.
As an illustration, consider the docutils implementation. This has a dictionary which defines the valid labels and their translations:
label_dict = {"Author":"author", "Authors":"author", "Arguments":"arguments"}
and another dictionary which indicates the allowable forms and child elements:
label_children = {"Author":["inline","para"], "Authors":["inline","para","ulist","olist"], "Arguments":["dlist"]}
This is not a required or even recommended way of holding the data, it is merely intended as an illustration. One could imagine using a DTD for the same purpose.
An implementation should check that the child blocks of a label are valid according to their specification, and produce an appropriate warning if they are not.
Note that the child elements specified are the immediate child elements - a "dlist" may still contain "para" elements internally, for instance.
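Using the two illustrative dictionaries shown above, the check on a label and its immediate children might be sketched as follows. The function name, the warning wording and the return shape are assumptions for this example only; the dictionaries are repeated so that the fragment stands alone.

```python
label_dict = {"Author": "author",
              "Authors": "author",
              "Arguments": "arguments"}

label_children = {"Author": ["inline", "para"],
                  "Authors": ["inline", "para", "ulist", "olist"],
                  "Arguments": ["dlist"]}


def check_label(label, child_elements):
    """Return (DOM element name or None, list of warning strings)."""
    # Labels are case-insensitive, so look them up in lower case.
    canonical = {name.lower(): name for name in label_dict}
    name = canonical.get(label.lower())
    if name is None:
        return None, ["%r looks like a label but is not defined" % label]
    allowed = label_children[name]
    warnings = ["<%s> is not allowed inside %s" % (child, name)
                for child in child_elements if child not in allowed]
    return label_dict[name], warnings
```

Only the immediate children are checked, so a "dlist" under "Arguments" passes regardless of the "para" elements it contains.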
Why have this construct? The Doc-SIG perceived the need to allow semi-arbitrary DOM elements (well, they were talking SGML/XML at the time, but the principle stands), with some control over their content. Particular examples given were "author" and "arguments". The latter was felt to be especially important.
It would be possible to use headings instead, but the markup for headings is generally wrong for how people lay out these items, and also the constraint on content would be missing.
The matter of which characters are to be allowed in a valid label is still open for debate - there is a case for multinational labels, for instance.
Literal blocks are one of the two forms of non-colourisable block.
A literal block is started when:
The children form the literal block.
Specifically, if a colourisable block with indentation N ends with two colons, and the next block appears to be a child, then the literal block will extend until the start of a block which has indentation N or less.
It is an error for a colourisable block to end with two colons and not have any children.
Note that the two colons at the end of the parent block are replaced by a single colon after the literal block has been recognised.
Within a literal block, blank lines are retained (and the correct number of blank lines is retained). List items, anchors and labels are not recognised within a literal block.
Within a literal block, the indentation of each line is remembered.
When formatting a literal block, if a line has indentation L and the parent (colourisable) block has indentation N, then the line will be output with indentation 'L - N'.
For instance:
    This would be the paragraph introducing literal text::

         This is the first part of the literal block.


         This is another literal "paragraph" at the same indentation
         (of course, in practice it is actually still part of the *same*
         literal block, despite the blank lines above).

      *This* "paragraph" is not allowed by the Pythonic indentation
      rules, but is perfectly OK as a *literal* "paragraph", since
      it is still "under" the introductory ("main") paragraph.

    This paragraph is no longer literal, as it has the same indentation
    as the "main" paragraph.
which might be formatted as:
    This would be the paragraph introducing literal text:

         This is the first part of the literal block.


         This is another literal "paragraph" at the same indentation
         (of course, in practice it is actually still part of the *same*
         literal subparagraph, despite the blank line above).

      *This* "paragraph" is not allowed by the Pythonic indentation
      rules, but is perfectly OK as a *literal* "paragraph", since
      it is still "under" the introductory ("main") paragraph.

    This paragraph is no longer literal, as it has the same indentation
    as the "main" paragraph.
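The indentation rule for literal blocks can be sketched as below. I am reading the rule as "the line's own indentation minus the parent block's indentation", so that text further to the right of the parent stays further to the right in the output; the function name and the list-of-lines interface are assumptions for the example.

```python
def format_literal_block(lines, parent_indent):
    """Re-indent the lines of a literal block relative to its parent."""
    output = []
    for line in lines:
        if not line.strip():
            output.append("")              # blank lines are retained as-is
            continue
        body = line.lstrip(" ")
        indent = len(line) - len(body)
        # A negative difference would simply give no indentation here.
        output.append(" " * (indent - parent_indent) + body)
    return output


shown = format_literal_block(["      first", "", "    second"], 2)
```

The content of each line is otherwise untouched, since nothing inside a literal block is interpreted as markup.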
If the indentation of a line within the literal block extends to the left of the 'L - N' margin, then the literal block is invalid. An implementation should warn the user of the problem. A good implementation will still try to present the literal text in a suitable manner - perhaps by shifting the whole block rightwards by an appropriate number of spaces.
For instance:
    Here is some literal text::

          This is clearly the literal text, but
        its *internal* indentation is created
      to go too far to the left - naughty, naughty.
might be presented as if it were actually:
    Here is some literal text:

            This is clearly the literal text, but
          its *internal* indentation is created
        to go too far to the left - naughty, naughty.
There is no absolute requirement for a literal block's parent block to have any content. Specifically, it is allowed to write:
    1. This is a list item.

       ::

           This is a literal block.
which would produce something like:
This is (a) not worth preventing, and (b) occasionally useful ((explain why)).
DOM

    <para> This would be the paragraph introducing...
    <litblock> This is the first part of the literal...
Doctest blocks are one of the two forms of non-colourisable block.
A doctest block starts with the characters >>>, and continues until the next blank line (or end of file).
Within a doctest block, list items, anchors and labels are not recognised.
Doctest blocks are intended to be "tested" by the Python doctest utility, whose documentation (see the Python library documentation) describes their form and purpose in detail.
Since doctest blocks are clearly valuable and since their use is encouraged, it makes sense to recognise them so they can be presented nicely in documentation (it helps that they're easy to recognise, too).
Within a doctest block, the indentation of each line is remembered.
When formatting a doctest block, if a line has indentation L and the preceding (colourisable) block has indentation N, then the line will be output with indentation 'L - N'.
For example:
    The following block is a doctest block:

        >>> 1+1
        2

        But this block isn't...
which might be formatted as:
The following block is a doctest block: >>> 1+1 2
Note that literal blocks are detected before doctest blocks, so that:
    This is not a doctest block::

        >>> # maybe the bash shell?
can be written.
DOM

    <para> The following block is a doctest block
    <doctest> >>> 1+1
              2
    <indent>
      <para> But this block isn't
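Recognising doctest blocks really is easy, as this sketch suggests. It is a simplification: it works on a plain list of lines, treats any line whose first non-space characters are >>> as starting a doctest block, and does not know about literal blocks (which, as noted above, must be detected first). The function name is an assumption.

```python
def find_doctest_blocks(lines):
    """Return a list of doctest blocks, each one a list of lines."""
    blocks = []
    current = None
    for line in lines:
        if current is not None:
            if line.strip():
                current.append(line)       # still inside the doctest block
            else:
                blocks.append(current)     # a blank line ends the block
                current = None
        elif line.lstrip(" ").startswith(">>>"):
            current = [line]
    if current is not None:                # end of file also ends the block
        blocks.append(current)
    return blocks


sample = ["The following block is a doctest block:",
          "",
          "    >>> 1+1",
          "    2",
          "",
          "    But this block isn't..."]
doctests = find_doctest_blocks(sample)
```

Each block keeps its original lines, so the indentation needed for formatting is still available afterwards.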
We have seen how block indentation is used to produce a block hierarchy, which is used to identify parents and children in the context of list items, literal blocks and so on.
However, sometimes a user wishes to use indentation solely as indentation. The simplest rule that appears to work reasonably well is that if a "para" element has children, then they should be indented with respect to that "para". So, for instance:
    An ordinary paragraph.

        This is a child. It is indented.

        1. This list item is indented as well.
might be represented in the DOM tree as:
DOM

    <para> An ordinary paragraph.
    <indent>
      <para> This is a child...
      <olist>
        <oitem sequence="1"> This list item...
which might look like:
An ordinary paragraph.
The case where a paragraph is immediately followed by an indented list item is one that deserves consideration - I am not entirely convinced that:
    This is a paragraph.

       1. And a list item

should be formatted differently than:

    This is a paragraph.

    1. And a list item
On the other hand, if the "indent" element is present in the DOM tree, the formatter might be deemed at liberty to "optimise it out" in such circumstances.
After the document structure has been determined, the colourisable blocks are, well, colourised. Specifically, legal markup is located and converted to appropriate structures in the DOM tree.
Markup detection is done in the following order:
In the first release of pytext, markup may not be nested. Thus, whilst it is not forbidden to write:
This is an *emphasised `literal`* - also see "*Tibs*": <http://www.tibsnjoan.co.uk/>.
the result will be:
This is an *emphasised |
and a nice implementation should attempt to warn the author that something is probably amiss.
In future versions of pytext, nested markup may become available, and at that time the fragment above would be expected to be rendered as something like:
This is an emphasised |
(The provision of nested markup is not regarded as of high priority.)
Literal Python strings start and end with #. The character # may not occur within a Python literal. The character # may not occur in colourised text other than as the start or end of a Python literal - it may occur within a literal or URI.
Python literals may not contain newlines, or any other non-printing character. Spaces in Python literals are "hard" - that is, they may not be "stretched" or replaced by newlines when presented in formatted output.
For example:
#fred# and #`spam`#
might be presented as:
fred and `spam`
Note that a tool using pytext may be expected to try to derive cross-references (links) for Python literals.
Question: If referring to a callable (but not specifying arguments), do we have a preferred style between the following?
#callable#
#callable()#
(Personally, I like to add the () to indicate it is a callable, and of course this might be used by a tool doing cross-referencing as a disambiguation aid. But (a) should it be encouraged, or (b) should it be discouraged (or, of course, (c) should we just not mention the matter!).)
(Hmm - if it is an object (i.e., class instance), and I refer to it as a callable, what does one link to - its class's __call__ method?)
DOM
    <python> fred
    <text> and
    <python> `spam`
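As a rough illustration of the Python literal rules, here is a minimal sketch that splits a piece of colourisable text into "python" and "text" pieces. The function name and the tuple representation are my own invention, not part of pytext, and the sketch deliberately ignores the case where # appears inside a literal or a URI.

```python
import re

# '#', then one or more characters that are neither '#' nor
# non-printing (control) characters, then a closing '#'.
PYTHON_LITERAL = re.compile(r"#([^#\x00-\x1f\x7f]+)#")


def split_python_literals(text):
    """Split text into ("python", ...) and ("text", ...) pieces, in order."""
    pieces = []
    pos = 0
    for match in PYTHON_LITERAL.finditer(text):
        if match.start() > pos:
            pieces.append(("text", text[pos:match.start()]))
        pieces.append(("python", match.group(1)))
        pos = match.end()
    if pos < len(text):
        pieces.append(("text", text[pos:]))
    return pieces
```

Applied to the example above, this yields a "python" piece for fred, a "text" piece for " and ", and a "python" piece for `spam` (backquotes included), which is the same shape as the DOM fragment.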
Literals start and end with `. The character ` may not occur within a literal. The character ` may not occur in colourised text other than as the start or end of a literal - it may occur within a Python literal or URI.
Literals may not contain newlines, or any other non-printing character. Spaces in literals are "hard" - that is, they may not be "stretched" or replaced by newlines when presented in formatted output.
For example:
`fred` and `#spam`
might be presented as:
fred and #spam
Note that a "hard" space may be inserted in text using an appropriate literal - for instance:
This` `text` `contains` `hard` `spaces...
DOM
    <literal> fred
    <text> and
    <literal> #spam
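One way a formatter might honour "hard" spaces is sketched below. Mapping them onto the Unicode no-break space (U+00A0) is purely my assumption for plain-text or HTML-like output, and `render_literals` is a hypothetical name; a real tool would presumably build DOM nodes instead of returning a string.

```python
import re

# A backquote, one or more characters that are neither a backquote nor
# a non-printing (control) character, then a closing backquote.
LITERAL = re.compile(r"`([^`\x00-\x1f\x7f]+)`")

NBSP = "\u00a0"  # assumption: "hard" spaces become no-break spaces


def render_literals(text):
    """Drop the backquotes and make spaces inside each literal non-breaking."""
    return LITERAL.sub(lambda m: m.group(1).replace(" ", NBSP), text)
```

With this, the "hard spaces" example becomes a single run of words joined by no-break spaces, so a line-wrapping step that only breaks at ordinary spaces would leave it intact.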
Emphasised text starts and ends with *. The character * may not occur within emphasised text. The character * may not occur in colourised text other than as the start or end of emphasised text, or as a bullet for an unordered list item - it may occur within a Python literal, a literal or a URI.
Emphasised text may contain newlines, but not any other non-printing character.
For example:
One can emphasise *one* or *more than one*.
might be presented as:
One can emphasise one or more than one.
DOM
    <text> One can emphasise
    <emph> one
    <text> or
    <emph> more than one
    <text> .
Note: Within the context of docstrings, there seems little need for more than one form of emphasis.
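The emphasis rule lends itself to a very small sketch using `re.split`, which returns the captured groups interleaved with the surrounding text. The function name is hypothetical, and the sketch assumes that list bullets and the contents of literals and URIs have already been dealt with, so that any remaining * really is an emphasis delimiter.

```python
import re

# '*', then one or more characters that are not '*' and not a
# non-printing character (newlines are allowed), then a closing '*'.
EMPHASIS = re.compile(r"\*([^*\x00-\x09\x0b-\x1f\x7f]+)\*")


def split_emphasis(text):
    """Return ("text", ...) and ("emph", ...) nodes in document order."""
    nodes = []
    # With one capturing group, odd-numbered parts are the emphasised text.
    for index, part in enumerate(EMPHASIS.split(text)):
        if part:
            nodes.append(("emph" if index % 2 else "text", part))
    return nodes
```

For the example above this gives five nodes (text, emph, text, emph, text), matching the DOM fragment.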
URIs start with < and end with >. The character > may not occur within a URI. The characters < and > may not occur in colourised text other than as the start or end of a URI - they may occur within a Python literal or within a literal.
URIs may not contain newlines, or any other non-printing character. Spaces in URIs are "hard" - that is, they may not be "stretched" or replaced by newlines when presented in formatted output.
The use of < and > to delimit URIs is intended to be similar to the practice in email headers, and is preferred to trying to recognise URIs by themselves, which is quite difficult to do, especially in the presence of trailing punctuation. Note that the relevant RFCs have words to say about what characters are allowed in valid URIs.
URIs may be presented "bare":
See <http://www.tibsnjoan.co.uk/>.
or with a representation text:
See "Tibs-and-Joan": <http://www.tibsnjoan.co.uk/>.
The representation text form is written as a double quote, some text (anything except a double quote), a double quote, optional spaces and/or newlines, and a URI.
The examples should be presented as follows (if the format being output supports URI links, of course):
See http://www.tibsnjoan.co.uk/.
See Tibs-and-Joan.
Note that the < and > are not preserved, nor are the double quotes around the representation text.
An implementation should use the Python module urlparse.py to check that a URI is plausible, and warn the user if it does not appear so. The nature of URIs is sufficiently complex, however, that this warning should be optional, and the implementation should allow even a doubtful URI to be retained in the DOM tree.
DOM
The two examples would be represented as:
    <reference uri="http://www.tibsnjoan.co.uk/"> http://www.tibsnjoan.co.uk/
    <reference uri="http://www.tibsnjoan.co.uk/"> Tibs-and-Joan
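To make the URI rules concrete, here is a hedged sketch of how both forms might be recognised and checked. It uses `urllib.parse`, which is where the `urlparse` function lives in Python 3. The function names are hypothetical, the optional colon after the closing double quote is taken from the examples rather than from the prose rule, and the plausibility test (a scheme plus something after it) is only one possible heuristic.

```python
import re
from urllib.parse import urlparse

# Optional "representation text" (plus an optional colon and whitespace),
# then '<', the URI itself (no '>' or control characters), then '>'.
URI = re.compile(r'(?:"([^"]*)":?\s*)?<([^>\x00-\x1f\x7f]+)>')


def find_references(text):
    """Return (uri, display_text) pairs; a bare URI displays as itself."""
    references = []
    for match in URI.finditer(text):
        label, uri = match.group(1), match.group(2)
        references.append((uri, uri if label is None else label))
    return references


def looks_plausible(uri):
    """Heuristic check only; a doubtful URI should still be kept."""
    try:
        parts = urlparse(uri)
    except ValueError:
        return False
    return bool(parts.scheme) and bool(parts.netloc or parts.path)
```

A caller would typically warn (optionally) when `looks_plausible` returns false, but still place the reference in the DOM tree.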
Local references are the "other end" of anchors [reference]. They are intended to look somewhat like footnotes or citations.
A local reference starts with [ and ends with ]. The character ] may not occur within a local reference. The characters [ and ] may not occur in colourised text other than as the start or end of local references, or (within the) start or end of anchors - they may occur within a Python literal, a literal or a URI (well, except that the relevant RFC doesn't allow them in the last).
Local references may not contain newlines, or any other non-printing character. Their content follows the same rules as anchors [qv].
For example:
My name is [Tibs]. Personally, I'm not too keen on [K&R]. Note that local references can also be numbers [3].
might be presented as:
My name is [Tibs]. Personally, I'm not too keen on [K&R]. Note that local references can also be numbers [3].
DOM
The examples would be represented as:
    <localref anchor="Tibs"> [Tibs]
    <localref anchor="K&R"> [K&R]
    <localref anchor="3"> [3]
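Finally, a minimal sketch of local reference detection. The function name is hypothetical, and the sketch does not attempt to tell a local reference from an anchor, since the anchor syntax is described elsewhere; it simply collects every bracketed name that obeys the character rules given here.

```python
import re

# '[', then one or more characters that are not brackets and not
# non-printing (control) characters, then ']'.
LOCAL_REF = re.compile(r"\[([^\[\]\x00-\x1f\x7f]+)\]")


def find_local_refs(text):
    """Return the anchor names referred to in text, in order of appearance."""
    return [match.group(1) for match in LOCAL_REF.finditer(text)]
```

Each name returned would become the `anchor` value of a `localref` node, with the bracketed form kept as the displayed text.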
Author: Tibs (tibs@tibsnjoan.co.uk)
Last modified: Fri Mar 30 09:45:56 GMT Daylight Time 2001