A proposed meta-language for use with mxTextTools

Version 0.3 (tiepin) of 1999-11-15


Introduction

Marc-André Lemburg provides mxTextTools, a package of fast text manipulation tools for use with Python. The tagging engine within that package takes as input a set of tuples, the Tag Table, which tells it what to do. Although simple in format, these tuples aren't particularly readable, and indeed Marc-André himself says:

"I admit, these tables don't look very elegant. In fact I would much rather write them in some meta language that gets compiled into these tables instead of handcoding them. But I've never had time to do much research into this, so unless someone volunteers, it will stay like this for some time."

The following describes my implementation of a simple language which can be used to represent mxTextTools tag tables, which I believe to be easier to read and write than the underlying tuples, and which is transformable into said tuples by simple textual manipulations.

Note that the implementation is not yet complete, and whilst it is now possible to write all of the parsing I need for the translator in the language, there are still tuples that it is not possible to represent.

One outstanding matter is to decide on an extension for the text files which contain examples of this little language. For the moment, I'm using ".tag", although I can't help feeling that's probably used for other purposes elsewhere. ".mal" did occur to me, but is perhaps too obscure. ".mxtext" or ".mxtag" might have been better solutions, but are perhaps a bit overlong...



General aims

Informally, my aims are:



Basic description

For obvious reasons, I'm going to assume that you already understand how to write mxTextTools tag tables. This means that we can dive straight in with the first example from the mxTextTools homepage:
    tag_table = (('lowercase',AllIn,a2z,+1,+2),
                 ('upper',AllIn,A2Z,+1),
                 (None,AllIn,white+newline,+1),
                 (None,AllNotIn,alpha+white+newline,+1),
                 (None,EOF,Here,-4)) # EOF 
  

which translates to:
    tag_table = Table is:
        <top>
        'lowercase' = AllIn a2z F:next T:<ignore>
        'upper' = AllIn A2Z F:next
        <ignore>
        AllIn white+newline F:next
        AllNotIn alpha+white+newline F:next
        EOF Here F:<top>
  

The first line indicates that we are producing a tag table (as opposed to an individual tuple - this will be discussed later on) and assigning it to tag_table. The use of "is:" to delimit the start of a new block scope (i.e., a new table or tuple) seems to me to make it clearer what is going on, and the colon has the added advantage of meaning "new block" in Python as well (which also means that Python mode handles the subsequent indentation rather nicely).

The "<top>" is a label. I wanted to get rid of the "on success go to the tuple at position +5" sort of logic from the tag tables, since this is (a) difficult to work out, and (b) prone to error if one ever has to add new tuples to the table.

Offsets of plus one, minus one and zero are common enough that they have the special keywords next, previous and repeat available - having to use a label would be overkill in these cases, although it is, of course, still possible if the user deems it to be clearer.

(The label goes on a line by itself for the simple reason that
        <top> 'lowercase' = AllIn a2z F:next T:<ignore>

doesn't look nice when indented.)

The "F:" and "T:" conditions are translations of the jump_no_match and jump_match elements. Obviously, specifying a label means "jump to that label". Be aware that there is no whitespace between the "F:" or "T:" and its target, and that the order is required (i.e., the "false" target must come first, if present - this is the same ordering as in the underlying tuples).
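
The label and keyword handling described above can be sketched in a few lines of Python. This is only an illustration, not the translator's actual code: the function name resolve_jumps and the entry layout are assumptions made for the example, and commands and arguments are kept as plain strings so that it runs without mxTextTools installed.

```python
# Keyword targets and the relative offsets they stand for.
KEYWORD_OFFSETS = {"next": +1, "previous": -1, "repeat": 0}


def resolve_jumps(entries):
    """Turn labels and symbolic jump targets into tag-table tuples.

    Each entry is either a label string such as "<top>", or a tuple
    (tagobj, command, argument, f_target, t_target), where a target is
    None, a keyword, a label, or a name passed through as-is ("MatchOk").
    """
    # First pass: record the index of the tuple that follows each label.
    labels = {}
    index = 0
    for entry in entries:
        if isinstance(entry, str):
            labels[entry] = index
        else:
            index += 1

    def offset(target, here):
        if target in KEYWORD_OFFSETS:
            return KEYWORD_OFFSETS[target]
        if target.startswith("<"):
            return labels[target] - here
        return target  # e.g. "MatchOk" or "MatchFail"

    # Second pass: build the tuples, omitting jumps that were not given.
    table = []
    for entry in entries:
        if isinstance(entry, str):
            continue
        tagobj, command, argument, f_target, t_target = entry
        here = len(table)
        result = (tagobj, command, argument)
        if t_target is not None:
            jfalse = "MatchFail" if f_target is None else offset(f_target, here)
            result += (jfalse, offset(t_target, here))
        elif f_target is not None:
            result += (offset(f_target, here),)
        table.append(result)
    return tuple(table)


# The first example from this section, with names quoted as strings.
example = [
    "<top>",
    ("lowercase", "AllIn", "a2z", "next", "<ignore>"),
    ("upper", "AllIn", "A2Z", "next", None),
    "<ignore>",
    (None, "AllIn", "white+newline", "next", None),
    (None, "AllNotIn", "alpha+white+newline", "next", None),
    (None, "EOF", "Here", "<top>", None),
]
tag_table = resolve_jumps(example)
```

The sketch treats all labels as living in one flat table; it makes no attempt to handle nested tables or repeated label names.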

As our next example, consider:
    ctrlsymbol = Table is:
        Is "'":                             # hexquote
            IsInSet hex
            IsInSet hex
        IsInSet notalpha F:next T:MatchOk   # other

The general notation:
     <condition>:
         <suite>
     <statement>
(where <suite> is one or more statements) is exactly equivalent to:
     <condition> F:<labelN> [T:next]
         <suite>
     <labelN>
     <statement>

and is the preferred notation for that form. Note that technically there is absolutely no reason not to have labels in the <suite> which are jumped to from outside the <suite> (or vice versa) - an 'if' block does not introduce a new scope. However, such usage may make it harder to read the code, so think carefully before introducing it.
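
To make the equivalence concrete, here is a minimal sketch of how an 'if' block might be turned into tuples. The helper name expand_if_block is hypothetical, the suite is assumed to be a flat list of already-translated tuples, and commands are written as strings so the code runs on its own.

```python
def expand_if_block(condition, suite):
    """Return the tuples for '<condition>:' followed by an indented suite.

    condition is a 3-tuple (tagobj, command, argument); suite is a list of
    already-translated tuples.
    """
    jfalse = len(suite) + 1   # on failure, jump past the whole suite
    jtrue = +1                # on success, fall into the suite
    return [condition + (jfalse, jtrue)] + list(suite)


# The "hexquote" part of the ctrlsymbol example.
hexquote = expand_if_block(
    (None, "Is", "'"),
    [(None, "IsInSet", "hex"), (None, "IsInSet", "hex")],
)
```

With two statements in the suite, the condition tuple ends in +3,+1, which is the same pattern as the implicit <labelN> form.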

Sometimes it is useful to define a single tuple by itself - for instance:
    whitespace = (None,AllIn,' \t')
    opt_whitespace = whitespace + (MatchOk,)

This can be represented as follows:
    whitespace is:
        AllIn ' \t'
    opt_whitespace is:
        whitespace F:MatchOk
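
In Python terms this is just tuple concatenation. The following sketch shows the idea; the helper tuple_plus is a hypothetical name, and its refusal to extend a tuple that already has jump targets is an assumption of the example rather than something the translator is known to do.

```python
def tuple_plus(base, jfalse=None, jtrue=None):
    """Append jump targets to an existing 3-element tag tuple."""
    if len(base) != 3:
        raise ValueError("base tuple already has jump targets: %r" % (base,))
    if jtrue is not None:
        # A true jump on its own needs MatchFail as the false jump.
        return base + ("MatchFail" if jfalse is None else jfalse, jtrue)
    if jfalse is not None:
        return base + (jfalse,)
    return base


# Names are quoted as strings so the sketch runs without mxTextTools.
whitespace = (None, "AllIn", " \t")
opt_whitespace = tuple_plus(whitespace, jfalse="MatchOk")
```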

I find that this form is most commonly used for introducing tables that are used in more than one place (or conveniently separated out for clarity) - for instance:
    # Comments are fairly simple
    t_comment is:
        'comment' = Table is:
            Is '#'
            AllNotIn '\n\r' F:MatchOk

To illustrate most of the forms which can be used, here is a rather silly example:

The meta language (its translation follows):

# -*-python-*-
from TextTools import *

t_string is:
    'str' = Word "string"

t_whitespace is:
    AllIn ' \t'

# A very silly tag table
# Note that the two <label> lines are in different
# "scopes" (not a usage I tend to recommend!)
tagtable = Table is:
    <label>
    Word "infinite_loop" F:next T:repeat
    Word "back_to_start" F:next T:<label>

    Is "a":
        Skip back
    
    Table is:
        'fred' = IsIn "abc":
            <label>
            'jim' = Word "thingy" T:MatchOk
            Word "bingo" F:<label>
        'fred' = IsIn "b"


    'table' = Table is:
        t_string:
            t_whitespace
            t_string

The translation:

# -*-python-*-
from TextTools import *

t_string = \
    ('str',Word,"string")

t_whitespace = \
    (None,AllIn,' \t')

# A very silly tag table
# Note that the two <label> lines are in different
# "scopes" (not a usage I tend to recommend!)
tagtable = (
    # <label>
    (None,Word,"infinite_loop",+1,0),
    (None,Word,"back_to_start",+1,-1),

    (None,Is,"a",+2,+1),
        (None,Skip,-1),

    (None,Table,(
        ('fred',IsIn,"abc",+3,+1),
            # <label>
            ('jim',Word,"thingy",MatchFail,MatchOk),
            (None,Word,"bingo",-1),
        ('fred',IsIn,"b"),
    )),

    ('table',Table,(
        t_string + (+3,+1),
            t_whitespace,
            t_string,
    )),
)



Download and usage

Download

The following are available for download:

You will also require mxTextTools v1.1.1 or later.

Usage

The Translate module provides a class File which may be used to translate a tag table (parsed using the table t_file which is defined in Translate_tags.tag) into a text file. An example of its use is available in the main() function in Translate.py.

Used as a command line program, Translate.py can be used to translate tag tables written in the metalanguage into straight Python files. Its normal usage is explained by invoking the "-help" switch:
Usage: Translate [switches] <infile> [<outfile>]

        -stdout         -- write to standard output instead of a file
        -force          -- write to the <outfile> even if it already
                           exists (overwrite any existing file)

        -import         -- import tag table from Translate_tags.py,
                           instead of using the internal table

        -compare        -- compare the imported and internal tag tables
                           (development option!)

        -test           -- use our internal test data and write to stdout
        -pytag          -- use the interpreted tagging engine
        -debug          -- if -pytag, enable its debugger
        -diag           -- enable general debugging
                           Beware that this currently also writes line
                           numbers to the start of each line in the output,
                           so it doesn't emit legal Python...

        -help           -- show this text
        -history        -- show the module history
        -version        -- show the module version

If <outfile> is not specified, <infile> will be used with its extension
replaced by ".py".

The -import switch is used to prove that the translator can indeed "bootstrap" itself, and will become the normal mode of operation in the next release.

Python code

The translator attempts to determine which code "belongs" to it, and should thus be translated, and which code is Python code and should thus be transmitted untouched. This is so that one can do useful things like:
    chars = "abcdef"
    if morechars:
        chars = chars + "ABCDEF"
  

It also has the advantage of allowing one to mix tag tables with normal Python code, as one might naturally do if one were programming the tag tuples directly. Of course, it is slightly dangerous, as is any preprocessing of a file with mixed terminologies, and I don't guarantee to accept more than a subset of proper Python code within the final version of the translator.

(NB: random code is not allowed within tag tables themselves - it will be detected and commented out, with an appropriate error message.)

Appearance of the output

The output retains the indentation of the input, even though this may not conform (for instance) to the indentation that [X]Emacs Python mode would produce. This is mainly to make the equivalence of metalanguage and translation more obvious, which also aids in diagnosis of any translation problems.

Note that a comma is output after each entry within a tag table tuple, including the final entry - this follows the normal "safe" convention when writing lists, but can optionally be changed by changing the value of the variable WANT_LAST_COMMA in Translate.py (I would note that one of the major benefits of the language to me is that one does not need to remember those trailing commas after each tuple - leaving them out results in unobvious errors!).
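
One case of the "unobvious errors" can be shown with plain Python, using strings in place of the mxTextTools names. It concerns a table with a single entry:

```python
# With the trailing comma, a one-entry table is a tuple containing one tuple.
with_comma = (
    (None, "Is", "a"),
)

# Without it, the outer parentheses are only grouping, so the "table" is
# really the single inner tuple and has three elements instead of one.
without_comma = (
    (None, "Is", "a")
)

assert len(with_comma) == 1
assert len(without_comma) == 3
```

Neither form is a syntax error, which is why the mistake is easy to miss.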

When closing parentheses are written at the end of a table, they are written to a line by themselves, indented to match the indentation of the start of the tuple. This introduces an "extra" line (i.e., a line that does not correspond to anything in the input tag data), but on the whole seems clearer than writing extra parentheses to the preceding line.

Inline comments are retained and output. By default they are output at column 40 (variable COMMENT_COLUMN in Translate.py), unless there is already text there. This gives some chance that they will line up, but obviously I'm not going to try too hard. Future versions of the translator may make this more customisable.
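
A rough sketch of that placement rule follows. Only the name COMMENT_COLUMN comes from the text above; the function place_comment, the 0-based reading of the column, and the single-space fallback when the column is already occupied are all assumptions.

```python
COMMENT_COLUMN = 40  # treated here as a 0-based character index


def place_comment(code, comment, column=COMMENT_COLUMN):
    """Append an inline comment to a line of output, aligned where possible."""
    code = code.rstrip()
    if len(code) < column:
        # Pad with spaces so that the comment starts at the requested column.
        return code.ljust(column) + comment
    # The code already reaches the column: fall back to a single space.
    return code + " " + comment


short_line = place_comment('    (None,Is,"a",+2,+1),', "# hexquote")
long_line = place_comment("x" * 45, "# too long to align")
```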

The "end of line" character is output as "\n", and the output is opened in text mode, so the result will depend on the platform being used.

Error correction

Some attempt is made to cope with "common errors" (read: "things I find myself doing wrong too frequently"). Specifically:

Target MatchOK
The target "MatchOk" has a lower case letter at the end. I can see that this might be a pitfall, so the erroneous target "MatchOK" will generate a message, but will otherwise be treated as if it were spelt correctly.

Missing "=" in table block
I've been known to write:
        'fred' IsIn jim
	

instead of:
        'fred' = IsIn jim
	  

If the translator finds the former situation (and note that the tag object is optional, as normal), it will output an appropriate error message, but continue as if the "correct" code had been found.

Note that whilst I'm not entirely convinced that "=" is the best character to use here, I do feel the need for a delimiter here, and if one reads the normal Python "=" as "references", then one should read this operator as "tags".

Missing "is" before colon
I've been known to write:
        'fred' = Table:
	

instead of:
        'fred' = Table is:
	

If the translator finds the former situation (and note that the tag object is optional, as normal), it will output an appropriate error message, but continue as if the "correct" code had been found.

Note that the "is" is required to distinguish a table definition visually from an if block - basically, "is:" translates as "start a new tuple" whereas a bare ":" means "start an if block".
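
As an illustration only, two of these corrections (the "MatchOK" target and the missing "is") could be approximated with regular expressions as below. The function name and the message wording are invented for the example, and the real translator works from its own tag tables rather than from patterns like these.

```python
import re


def correct_common_errors(line):
    """Return (corrected_line, messages) for two of the slips described."""
    messages = []

    # "MatchOK" is accepted but reported; the real name ends in a small k.
    if re.search(r"\bMatchOK\b", line):
        line = re.sub(r"\bMatchOK\b", "MatchOk", line)
        messages.append('target "MatchOK" should be spelt "MatchOk"')

    # "Table:" at the end of a line is taken to mean "Table is:".
    if re.search(r"\bTable\s*:\s*(#.*)?$", line):
        line = re.sub(r"\bTable\s*:", "Table is:", line, count=1)
        messages.append('missing "is" before ":" in table definition')

    return line, messages


fixed, notes = correct_common_errors("        'fred' = Table:")
```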



Metalanguage summaries

The following table gives a brief summary of how the metalanguage works:

"Tuple" representation                    Metalanguage representation
----------------------------------------  -----------------------------------
tag_table = (tuples)                      tag_table = Table is:
                                              tuples

(tagobj,command,argument,jfalse,jtrue)    tagobj = command argument F:.. T:..
(None,command,argument,jfalse,jtrue)      command argument F:.. T:..
(None,command,argument,jfalse)            command argument F:..
(None,command,argument,MatchFail,jtrue)   command argument T:..
(None,command,argument)                   command argument
(None,Jump,To,jfalse)                     Jump To ..

(tagobj,Table,(tuples))                   tagobj = Table is:
                                              tuples

(None,command,argument,+1,+3),            command argument F:next T:<label>
tuple1,                                   tuple1
tuple2,                                   tuple2
                                          <label>
tuple3,                                   tuple3

(None,command,argument,+3,+1),            command argument:
tuple1,                                       tuple1
tuple2,                                       tuple2
tuple3,                                   tuple3

The following is an expansion of the "Tagging Commands" table in mxTextTools:

Command         Arguments that go with it  Metalanguage representation
--------------  -------------------------  -----------------------------------
Fail            Here                       Fail Here
Jump            To                         Jump To <label>
                                           (although the more 'literal'
                                           Jump To F:<label> will also work)
AllIn           string                     AllIn string
AllNotIn        string                     AllNotIn string
AllInSet        set                        AllInSet set
Is              character                  Is character
IsNot           character                  IsNot character
IsIn            string                     IsIn string
IsNotIn         string                     IsNotIn string
IsInSet         set                        IsInSet set
Word            string                     Word string
WordStart       string                     WordStart string
WordEnd         string                     WordEnd string
sWordStart      search-object              sWordStart search-object
sWordEnd        search-object              sWordEnd search-object
sFindWord       search-object              sFindWord search-object
Call            function                   Call function
CallArg         (function,[arg0,...])      CallArg(function,[arg0,...])
                                           (not yet available)
                                           Note that whitespace around the
                                           parentheses is optional.
Table           ThisTable or               Table ThisTable or
                tagtable-name or           Table tagtable-name or
                tagtable                   Table is:
                                               tagtable
SubTable        ThisTable or               SubTable ThisTable or
                tagtable-name or           SubTable tagtable-name or
                tagtable                   SubTable is:
                                               tagtable
TableInList     (list_of_tables,index)     TableInList(list_of_tables,index)
                                           (not yet available)
                                           Note that whitespace around the
                                           parentheses is optional.
                                           It is likely that only table names
                                           will be allowed in the list, and
                                           not embedded subtables
                                           (i.e., [Sub]Table is:).
SubTableInList  (list_of_tables,index)     SubTableInList (list_of_tables,index) or
                                           SubTableInList(list_of_tables,index)
                                           (not yet available)
                                           Note that whitespace around the
                                           parentheses is optional.
                                           It is likely that only table names
                                           will be allowed in the list, and
                                           not embedded subtables
                                           (i.e., [Sub]Table is:).
EOF             Here                       EOF Here
Skip            offset                     Skip offset or
                                           Skip back (equivalent to Skip -1)
Move            position                   Move position or
                                           Move ToEOF
Loop            count                      Loop count
LoopControl     Break/Reset                LoopControl Break or
                                           LoopControl Reset

Appending flags to the command (as in +<flag>, where <flag> is one of CallTag, AppendMatch, AppendToTagobj or AppendTagobj) is not yet supported.



Limitations and Restrictions

Restrictions

These are things that the user must bear in mind, which are not likely to change.

Identifiers
Identifiers follow the normal Python rules (i.e., must start with a letter or underline, and may continue with letters, digits or underlines). Case is significant. Python keywords and mxTextTools commands (e.g., "Word", etc.) are not allowed. Somewhat arbitrarily, things like "Here", "MatchFail" and so on are currently allowed, but this potential ambiguity will likely be removed in a future version (I'm minded to disallow all the mxTextTools "special" words, for simplicity).
Labels
The text within a label (i.e., between < and >) is an identifier, and thus follows the appropriate rules.
Indentation
When calculating indentation, the same assumption is made as within the Python language itself - i.e., that any tabs are to be interpreted as moving to the next multiple of eight (8) spaces. As with normal Python, it is thus recommended not to mix spaces and tabs, just in case.
No spaces after T: or F:
This is not a bug, but beware that there is no whitespace between the T: or F: and its target. This is purely because I think it looks better that way. Some future version of the translator may attempt to give a useful error message if the user gets this wrong.

(This is about the only place where I intend to restrict the addition of "non-significant" whitespace. In particular, whitespace around "+" signs is legal, and whitespace before and after "(" and ")" will be allowed when I support function calls, etc.)
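
The identifier, label and indentation rules above can be sketched as follows. The function names are made up for the example, COMMANDS holds only a handful of the command names rather than the full set, and Python's keyword module reflects the keywords of whichever Python runs the code.

```python
import keyword
import re

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
LABEL = re.compile(r"<([A-Za-z_][A-Za-z0-9_]*)>")
# A few command names from the tagging-commands table; not the full list.
COMMANDS = {"Word", "Is", "IsIn", "AllIn", "Table", "SubTable", "Skip", "Jump"}


def is_identifier(text):
    """True if text may be used as an identifier in the metalanguage."""
    if not IDENTIFIER.fullmatch(text):
        return False
    return not keyword.iskeyword(text) and text not in COMMANDS


def is_label(text):
    """True if text is '<' + identifier + '>'."""
    match = LABEL.fullmatch(text)
    return bool(match) and is_identifier(match.group(1))


def indentation(line, tabsize=8):
    """Width of leading whitespace, with tabs moving to the next tab stop."""
    width = 0
    for ch in line:
        if ch == " ":
            width += 1
        elif ch == "\t":
            width = (width // tabsize + 1) * tabsize
        else:
            break
    return width
```

Note how two spaces followed by a tab and a tab on its own both give a width of eight, which is exactly why mixing the two is best avoided.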

Limitations

These are likely to be resolved in future versions of the translator.

Strings
Strings are only supported in a simple manner. Specifically:

The first two limitations will be removed, probably in the next release of the software. The last probably will not (unless it turns out to be trivial).

Line continuation
The use of "\" to continue a line is not currently supported. Nor are parentheses recognised as introducing automatic line continuation. Thus all tuple definitions must fit on a single line.
Tag objects
At the moment, tag objects can only be identifiers, strings or (unsigned) integers.
Ultimately, most (if not all) object references should be allowable as tag objects (specifically, things like a[b]).
Tag table assignment
For the moment, tag tables (at the outer level) can only be assigned to identifiers. For example:
        fred = Table is:
           ...tag table definition...
       

The intention is ultimately to allow any Python entity that would be valid on the left hand side of an assignment (which is, of course, what we generate when we translate the above).

(Actually, the same parse is applied for "outer" tag tables as for inner, so one could parse an assignment to a string or integer. Of course, Python itself would then grumble later on. A future version of the translator will catch this potential error and grumble.)

Commands and arguments
There is almost no intelligence in dealing with commands and arguments as yet, save that the argument "back" gets translated to "-1" (this is for use with "Skip").

Arguments may currently be any of the words "Here", "ToEOF", "To", "ThisTable" or "back", or they may be a string, an (unsigned) integer or an identifier.

It is possible to "add" arguments with "+", so that 'abc' + "def" is a legitimate argument (note that this does not apply for "[Sub]Table is:" lines).

No check is made that any particular argument makes sense with its command.

Exceptions
There are still a few "error" conditions in translation that result in exceptions. The intention is that these will be removed and replaced by error messages, with the translator continuing "as best it can".
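
The current argument handling is simple enough to approximate in a few lines. In this sketch the function name translate_argument is hypothetical, strings are the simple kind only (no escaped quotes), and normalising the spacing around "+" is a choice made for the example rather than documented behaviour.

```python
import re

# A simple string, an unsigned integer, or a word (identifier or special).
TOKEN = re.compile(r"""\s*('[^']*'|"[^"]*"|\d+|[A-Za-z_]\w*)\s*""")


def translate_argument(text):
    """Translate a metalanguage argument, mapping "back" to "-1"."""
    parts = []
    pos = 0
    while True:
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError("cannot parse argument: %r" % text)
        token = match.group(1)
        parts.append("-1" if token == "back" else token)
        pos = match.end()
        if pos == len(text):
            return " + ".join(parts)
        if text[pos] != "+":
            raise ValueError("expected '+' in argument: %r" % text)
        pos += 1
```

As in the translator itself, no check is made here that the argument makes sense for any particular command.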



Still to do

Some of the following are also mentioned in Limitations above.

Handling tables with T/F jumps
When writing tag tuples directly, it is possible, for example, to do something like:
        ('fred',Table,(...),+3,MatchFail)
	

where false and true jumps are defined for an embedded Table. This is not currently possible with the metalanguage, because there is no obvious place to put the "F:" and "T:" clauses.

One might represent this "directly" as:
        'fred' = Table is:
            ...
        F:next T:<target>
	

but this would be breaking the Python convention that there are no block closing keywords (so, of course, Python mode also wouldn't support it, which is a Bad Thing), and anyway it looks really ugly.

Unfortunately, it looks like the "best" approach is just to move the true/false jumps to the start of the table definition:
	'fred' = Table F:next T:<target> is:
	    ...
	

I dither a little over whether a little more syntactic sugar would be nice here - for instance:
	'fred' = Table with F:next T:<target> is:
	    ...
	

but suspect in the end I'll err on the side of simplicity.

Note that there is still the issue of what one should do if the conditions are of the form "F:<label> T:next", which would normally be represented as the start of an if block - my inclination is to say at this point "well, that's getting too complex and we just won't cope", but I reserve the right to change my mind later on when I've worried about it a bit more...

Modified [Sub]Table commands
At the moment, it is possible to have "+" modifications for commands in a 'normal' tuple - for instance:
        'fred' = Table + CallTag
	

(although mostly by coincidence, if you actually read the tag tables!), but it is not possible to do this in an "embedded" table definition, such as:
        Table + CallArg is:
            ....
	
Function calls
It is not currently possible to recognise lines which include a function call as the argument to a command - for instance as in the following tuples:
        'text' = sWordStart BMS('-->') F:next
        Skip len('')
	

It is also not possible to translate CallArg, TableInList and SubTableInList for similar reasons.
Continuation lines
It is not possible to continue lines with "\" (as in normal Python).
Proper string handling
Strings with escaped quote characters, and triple-quoted strings, need to be implemented.
Additional checking of TuplePlus
When doing a TuplePlusLine, it would be nice to check if we knew the tuple we're adding to, and if so whether it already (i.e., by mistake) has a F or T condition. This would necessitate keeping a log of Tuples defined (but we couldn't guarantee to recognise a tuple by "name", of course - e.g., given "tuple[3] = ..." we would not know what to do if we later referred to that as "tuple[a]").

Also, if the user does:
        a = Table is:
            bbb
	

then it would be nice to check that "bbb" is a tuple (or tupleplus) if possible (this is actually almost the same point).

Distinguish assignables
It would be a Good Thing to distinguish assignables (e.g. for Tables) from tagobjs (i.e., as used in tuples) so we can grumble if the user uses the wrong sort at the wrong time.

The translator also needs to handle more valid Python targets for both assignment (of tag tables) and also for tag objects. I need to think about exactly how to do this - I don't really want to start having to be able to parse all legitimate Python source code explicitly "by hand".



Examples

Finally, I've translated all of the examples from the Examples directory in the mxTextTools distribution (version 1.0.2), with some added comments in appropriate places. They can be seen (colourised with py2html) here (or as straight tag table, with no colourisation, here), and the results of translation are here (or, again, as unvarnished Python, here).



Author: Tibs (tibs@tibsnjoan.co.uk)

Last textual change: Tue Nov 16 09:17:38 GMT Standard Time 1999
Last modified: Thu Jun 19 14:33:54 BST 2003