Version 0.3 (tiepin) of 1999-11-15
Marc-André Lemburg provides mxTextTools, a package of fast text manipulation tools for use with Python. The tagging engine within that package takes as input a set of tuples, the Tag Table, which tells it what to do. Although simple in format, these tuples aren't particularly readable, and indeed Marc-André himself says:
"I admit, these tables don't look very elegant. In fact I would much rather write them in some meta language that gets compiled into these tables instead of handcoding them. But I've never had time to do much research into this, so unless someone volunteers, it will stay like this for some time."
The following describes my implementation of a simple language which can be used to represent mxTextTools tag tables, which I believe to be easier to read and write than the underlying tuples, and which is transformable into said tuples by simple textual manipulations.
Note that the implementation is not yet complete, and whilst it is now possible to write all of the parsing I need for the translator in the language, there are still tuples that it is not possible to represent.
One outstanding matter is to decide on an extension for the text files which contain examples of this little language. For the moment, I'm using ".tag", although I can't help feeling that's probably used for other purposes elsewhere. ".mal" did occur to me, but is perhaps too obscure. ".mxtext" or ".mxtag" might have been better solutions, but are perhaps a bit overlong...
Back to Contents
Informally, my aims are:
For obvious reasons, I'm going to assume that you already understand how to write mxTextTools tag tables. This means that we can dive straight in with the first example from the mxTextTools homepage:
The first line indicates that we are producing a tag table (as opposed to an individual tuple - this will be discussed later on) and assigning it to tag_table. The use of "is:" to delimit the start of a new block scope (i.e., a new table or tuple) seems to me to make it clearer what is going on, and the colon has the added advantage of meaning "new block" in Python as well (which also means that Python mode handles the subsequent indentation rather nicely).
The "<top>" is a label. I wanted to get rid of the "on success go to the tuple at position +5" sort of logic from the tag tables, since this is (a) difficult to work out, and (b) prone to error if one ever has to add new tuples to the table.
Offsets of plus one, minus one and zero are common enough that they have the special keywords next, previous and repeat available - having to use a label would be overkill in these cases, although it is, of course, still possible if the user deems it to be clearer.
(The label goes on a line by itself for the simple reason that
The "F:" and "T:" conditions are translations of the jump_no_match and jump_match elements. Obviously, specifying a label means "jump to that label". Be aware that there is no whitespace between the "F:" or "T:" and its target, and that the order is required (i.e., the "false" target must come first, if present - this is the same ordering as in the underlying tuples).
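The way labels and the keywords next, previous and repeat turn back into relative offsets can be sketched in plain Python. This is only an illustration of the idea and is not taken from Translate.py: the function name resolve_jumps and the list-of-dicts layout are assumptions made for the example, and the commands themselves are left out so that it runs without mxTextTools.

```python
# Hypothetical sketch: turn label and keyword targets into relative offsets.
# The names and the data layout are invented for this illustration.

KEYWORD_OFFSETS = {"next": +1, "previous": -1, "repeat": 0}


def resolve_jumps(entries):
    """Resolve symbolic jump targets to relative integer offsets.

    `entries` is a list whose items are either a label string such as
    "<top>", or a dict with optional "F" and "T" targets (one per tuple).
    Returns one (jfalse, jtrue) pair per tuple; None means "not given".
    """
    # First pass: record which tuple index each label refers to.
    label_index = {}
    position = 0
    for entry in entries:
        if isinstance(entry, str):
            label_index[entry] = position  # a label marks the tuple after it
        else:
            position += 1

    # Second pass: express each target relative to the tuple that uses it.
    resolved = []
    position = 0
    for entry in entries:
        if isinstance(entry, str):
            continue
        pair = []
        for key in ("F", "T"):
            target = entry.get(key)
            if target is None:
                pair.append(None)
            elif target in KEYWORD_OFFSETS:
                pair.append(KEYWORD_OFFSETS[target])
            else:
                pair.append(label_index[target] - position)
        resolved.append(tuple(pair))
        position += 1
    return resolved


table = [
    "<label>",
    {"F": "next", "T": "repeat"},
    {"F": "next", "T": "<label>"},
]
print(resolve_jumps(table))  # [(1, 0), (1, -1)]
```

Because the offsets are recomputed from the labels each time, adding a tuple in the middle of a table does not require any of the jumps to be edited by hand, which is the point of using labels in the first place.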
As our next example, consider:
The general notation:
and is the preferred notation for that form. Note that technically there is absolutely no reason not to have labels in the <suite> which are jumped to from outside the <suite> (or vice versa) - an 'if' block does not introduce a new scope. However, such usage may make it harder to read the code, so think carefully before introducing it.
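The offsets that an if block stands for can be sketched as follows. This is a minimal illustration under my own assumptions (the helper name expand_if_block is hypothetical, and commands are stand-in strings so that the code runs without mxTextTools); the offsets themselves follow the pattern of the summary table further on, where the false jump skips the whole suite and the true jump falls into it.

```python
# Hypothetical sketch of what an if block means in terms of jump offsets.

def expand_if_block(condition, suite):
    """Flatten `condition:` followed by an indented suite of tuples.

    `condition` is a (tagobj, command, argument) triple and `suite` is a
    list of already-translated tuples.
    """
    jfalse = len(suite) + 1  # on failure, jump past every tuple in the suite
    jtrue = +1               # on success, carry on into the suite
    return [condition + (jfalse, jtrue)] + list(suite)


# Is "a":
#     Skip back
flat = expand_if_block((None, "Is", "a"), [(None, "Skip", -1)])
print(flat)  # [(None, 'Is', 'a', 2, 1), (None, 'Skip', -1)]
```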
Sometimes it is useful to define a single tuple by itself - for instance:
I find that this form is most commonly used for introducing tables that are used in more than one place (or conveniently separated out for clarity) - for instance:
To illustrate most of the forms which can be used, here is a rather silly example:
The meta language:

    # -*-python-*-
    from TextTools import *

    t_string is:
        'str' = Word "string"

    t_whitespace is:
        AllIn ' \t'

    # A very silly tag table
    # Note that the two <label> lines are in different
    # "scopes" (not a usage I tend to recommend!)
    tagtable = Table is:
        <label>
        Word "infinite_loop" F:next T:repeat
        Word "back_to_start" F:next T:<label>
        Is "a":
            Skip back
        Table is:
            'fred' = IsIn "abc":
                <label>
                'jim' = Word "thingy" T:MatchOk
                Word "bingo" F:<label>
            'fred' = IsIn "b"
        'table' = Table is:
            t_string:
                t_whitespace
                t_string

The translation:

    # -*-python-*-
    from TextTools import *

    t_string = \
        ('str',Word,"string")

    t_whitespace = \
        (None,AllIn,' \t')

    # A very silly tag table
    # Note that the two <label> lines are in different
    # "scopes" (not a usage I tend to recommend!)
    tagtable = (
        # <label>
        (None,Word,"infinite_loop",+1,0),
        (None,Word,"back_to_start",+1,-1),
        (None,Is,"a",+2,+1),
        (None,Skip,-1),
        (None,Table,(
            ('fred',IsIn,"abc",+3,+1),
            # <label>
            ('jim',Word,"thingy",MatchFail,MatchOk),
            (None,Word,"bingo",-1),
            ('fred',IsIn,"b"),
        )),
        ('table',Table,(
            t_string + (+3,+1),
            t_whitespace,
            t_string,
        )),
    )
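The entry t_string + (+3,+1) in the translation relies on nothing more than ordinary Python tuple concatenation: a previously defined three-element tuple has its jump offsets appended at the point of use. A small check of that, with the command written as a stand-in string (an assumption made so that it runs without mxTextTools):

```python
# "Word" is a placeholder string here; in real use it is the mxTextTools
# command constant of the same name.
t_string = ('str', "Word", "string")

# Appending the jumps turns the 3-tuple into a full 5-tuple.
with_jumps = t_string + (+3, +1)

print(with_jumps)       # ('str', 'Word', 'string', 3, 1)
print(len(with_jumps))  # 5
print(t_string)         # the original definition is left unchanged
```

Since tuples are immutable, the same t_string can be reused elsewhere with different jumps, or with none at all.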
The following are available for download:
.py file which can be imported by Translate.py (if the -import switch is chosen) for use as tag definitions for translating the metalanguage. I think that makes sense...
You will also require mxTextTools v1.1.1 or later.
The Translate module provides a class main()
function in Translate.py.
Used as a command line program, Translate.py can be used to translate tag tables written in the metalanguage into straight Python files. Its normal usage is explained by invoking the "-help" switch:
The -import switch is used to prove that the translator can indeed "bootstrap" itself, and will become the normal mode of operation in the next release.
The translator attempts to determine which code "belongs" to it, and should thus be translated, and which code is Python code and should thus be transmitted untouched. This is so that one can do useful things like:
It also has the advantage of allowing one to mix tag tables with normal Python code, as one might naturally do if one were programming the tag tuples directly. Of course, it is slightly dangerous, as is any preprocessing of a file with mixed terminologies, and I don't guarantee to accept more than a subset of proper Python code within the final version of the translator.
(NB: random code is not allowed within tag tables themselves - it will be detected and commented out, with an appropriate error message.)
The output retains the indentation of the input, even though this may not conform (for instance) to the indentation that [X]Emacs Python mode would produce. This is mainly to make the equivalence of metalanguage and translation more obvious, which also aids in diagnosis of any translation problems.
Note that a comma is output after each entry within a tag table tuple, including the final entry - this follows the normal "safe" convention when writing lists, but can optionally be changed by changing the value of the variable WANT_LAST_COMMA in Translate.py (I would note that one of the major benefits of the language to me is that one does not need to remember those trailing commas after each tuple - leaving them out results in unobvious errors!).
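One example of the kind of unobvious error a missing trailing comma can cause, shown with stand-in strings for the commands (an assumption, so that the snippet runs without mxTextTools): a table with a single entry silently stops being a table at all.

```python
# "Word" is a placeholder for the real mxTextTools command constant.
entry = (None, "Word", "bingo")

with_comma = (entry,)    # a table containing one entry
without_comma = (entry)  # just the entry itself: the parentheses do nothing

print(len(with_comma))         # 1
print(len(without_comma))      # 3
print(without_comma is entry)  # True
```

Python raises no complaint when the second form is written; the mistake only shows up later, when the result is used as a tag table.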
When closing parentheses are written at the end of a table, they are written to a line by themselves, indented to match the indentation of the start of the tuple. This introduces an "extra" line (i.e., a line that does not correspond to anything in the input tag data), but on the whole seems clearer than writing extra parentheses to the preceding line.
Inline comments are retained and output. By default they are output at column 40 (variable COMMENT_COLUMN in Translate.py), unless there is already text there. This gives some chance that they will line up, but obviously I'm not going to try too hard. Future versions of the translator may make this more customisable.
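The comment placement rule can be sketched roughly as below. This is not the code from Translate.py: the helper place_comment is hypothetical, and I am assuming that "column 40" counts from zero and that a comment which cannot go there is simply appended after a single space.

```python
COMMENT_COLUMN = 40  # the default mentioned in the text


def place_comment(code, comment, column=COMMENT_COLUMN):
    """Append `comment` to `code`, at `column` when that column is free.

    Hypothetical helper: the zero-based column and the single-space
    fallback are assumptions made for this sketch.
    """
    if len(code) < column:
        # Pad with spaces so that the comment starts exactly at `column`.
        return code.ljust(column) + comment
    # The code already reaches the comment column, so just follow it.
    return code + " " + comment


short = place_comment("    (None,Skip,-1),", "# go back one")
print(short)
print(short.index("#"))  # 40
```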
The "end of line" character is output as "\n", and the output is opened in text mode, so the result will depend on the platform being used.
Some attempt is made to cope with "common errors" (read: "things I find myself doing wrong too frequently"). Specifically:
MatchOK
"MatchOk" has a lower case letter at the end. I can see that this might be a pitfall, so the erroneous target "MatchOK" will generate a message, but will otherwise be treated as if it were spelt correctly.
"=" in table block
If the translator finds the former situation (and note that the tag object is optional, as normal), it will output an appropriate error message, but continue as if the "correct" code had been found.
Note that whilst I'm not entirely convinced that "=" is the best character to use here, I do feel the need for a delimiter here, and if one reads the normal Python "=" as "references", then one should read this operator as "tags".
"is" before colon
If the translator finds the former situation (and note that the tag object is optional, as normal), it will output an appropriate error message, but continue as if the "correct" code had been found.
Note that the "is" is required to distinguish a table definition visually from an if block - basically, "is:" translates as "start a new tuple" whereas a bare ":" means "start an if block".
The following table gives a brief summary of how the metalanguage works:
"Tuple" representation | Metalanguage representation |
tag_table = (tuples) | tag_table = Table is: tuples |
(tagobj,command,argument,jfalse,jtrue) | tagobj = command argument F:.. T:.. |
(None,command,argument,jfalse,jtrue) | command argument F:.. T:.. |
(None,command,argument,jfalse) | command argument F:.. |
(None,command,argument,MatchFail,jtrue) | command argument T:.. |
(None,command,argument) | command argument |
(None,Jump,To,jfalse) | Jump To .. |
(tagobj,Table,(tuples)) | tagobj = Table is: tuples |
(None,command,argument,+1,+3), tuple1, tuple2, tuple3, | command argument F:next T:<label> tuple1 tuple2 <label> tuple3 |
(None,command,argument,+3,+1), tuple1, tuple2, tuple3, | command argument: tuple1 tuple2 tuple3 |
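The simple one-line rows of the summary table can be mimicked with a short regular expression. This is a sketch under my own assumptions, not the parser used by Translate.py (which is itself written as tag tables): the names LINE, TARGETS and translate_line are hypothetical, labels are deliberately rejected because resolving them needs the whole table, and "Jump To", blocks and tag objects containing spaces are not handled.

```python
import re

# One metalanguage line: [tagobj =] command argument [F:target] [T:target]
LINE = re.compile(
    r"""^\s*
        (?:(?P<tagobj>\S+)\s*=\s*)?   # optional tag object and "="
        (?P<command>[A-Za-z_]\w*)\s+  # command name
        (?P<argument>.+?)             # argument, kept exactly as written
        (?:\s+F:(?P<jfalse>\S+))?     # optional false target, which comes first
        (?:\s+T:(?P<jtrue>\S+))?      # optional true target
        \s*$""",
    re.VERBOSE,
)

# Targets this sketch knows how to write out without further context.
TARGETS = {
    "next": "+1",
    "previous": "-1",
    "repeat": "0",
    "MatchOk": "MatchOk",
    "MatchFail": "MatchFail",
}


def translate_line(line):
    """Translate one simple metalanguage line into the text of a tuple."""
    match = LINE.match(line)
    if match is None:
        raise ValueError("not a simple tuple line: %r" % line)
    parts = [
        match.group("tagobj") or "None",
        match.group("command"),
        match.group("argument"),
    ]
    jfalse = match.group("jfalse")
    jtrue = match.group("jtrue")
    if jtrue is not None and jfalse is None:
        jfalse = "MatchFail"  # a lone T: still needs the false slot filled
    for target in (jfalse, jtrue):
        if target is None:
            continue
        if target not in TARGETS:
            raise ValueError("cannot resolve target %r in isolation" % target)
        parts.append(TARGETS[target])
    return "(" + ",".join(parts) + ")"


print(translate_line('Word "infinite_loop" F:next T:repeat'))
# (None,Word,"infinite_loop",+1,0)
print(translate_line("'jim' = Word \"thingy\" T:MatchOk"))
# ('jim',Word,"thingy",MatchFail,MatchOk)
```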
The following is an expansion of the "Tagging Commands" table in mxTextTools:
Command | Arguments that go with it | Metalanguage representation |
Fail | Here | Fail Here |
Jump | To | Jump To <label> (although the more 'literal' Jump To F:<label> will also work) |
AllIn | string | AllIn string |
AllNotIn | string | AllNotIn string |
AllInSet | set | AllInSet set |
Is | character | Is character |
IsNot | character | IsNot character |
IsIn | string | IsIn string |
IsNotIn | string | IsNotIn string |
IsInSet | set | IsInSet set |
Word | string | Word string |
WordStart | string | WordStart string |
WordEnd | string | WordEnd string |
sWordStart | search-object | sWordStart search-object |
sWordEnd | search-object | sWordEnd search-object |
sFindWord | search-object | sFindWord search-object |
Call | function | Call function |
CallArg | (function,[arg0,...]) | CallArg(function,[arg0,...]) (not yet available). Note that whitespace around the parentheses is optional. |
Table | ThisTable or tagtable-name or tagtable | Table ThisTable or Table tagtable-name or Table is: tagtable |
SubTable | ThisTable or tagtable-name or tagtable | SubTable ThisTable or SubTable tagtable-name or SubTable is: tagtable |
TableInList | (list_of_tables,index) | TableInList(list_of_tables,index) (not yet available) |
SubTableInList | (list_of_tables,index) | SubTableInList (list_of_tables,index) or SubTableInList(list_of_tables,index) (not yet available) |
EOF | Here | EOF Here |
Skip | offset | Skip offset or Skip back (equivalent to Skip -1) |
Move | position | Move position or Move ToEOF |
Loop | count | Loop count |
LoopControl | Break/Reset | LoopControl Break or LoopControl Reset |
Appending flags to the command (as in +<flag>, where <flag> is one of CallTag, AppendMatch, AppendToTagobj or AppendTagobj) is not yet supported.
These are things that the user must bear in mind, which are not likely to change.
< and >) is an identifier, and thus follows the appropriate rules.
T: or F:
T: or F: and its target. This is purely because I think it looks better that way. Some future version of the translator may attempt to give a useful error message if the user gets this wrong.
(This is about the only place where I intend to restrict the addition of "non-significant" whitespace. In particular, whitespace around "+" signs is legal, and whitespace before and after "(" and ")" will be allowed when I support function calls, etc.)
These are likely to be resolved in future versions of the translator.
"'", but not '\'' or "\"").
The first two limitations will be removed, probably in the next release of the software. The last probably will not (unless it turns out to be trivial).
a[b]).
The intention is ultimately to allow any Python entity that would be valid on the left hand side of an assignment (which is, of course, what we generate when we translate the above).
(Actually, the same parse is applied for "outer" tag tables as for inner, so one could parse an assignment to a string or integer. Of course, Python itself would then grumble later on. A future version of the translator will catch this potential error and grumble.)
"back" gets translated to "-1" (this is for use with "Skip").
Arguments may currently be any of the words "Here", "ToEOF", "To", "ThisTable" or "back", or they may be a string, an (unsigned) integer or an identifier.
It is possible to "add" arguments with "+", so that 'abc' + "def" is a legitimate argument (note that this does not apply for "[Sub]Table is:" lines).
No check is made that any particular argument makes sense with its command.
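The argument forms just listed can be illustrated with a small classifier. It is a sketch only: classify_argument and classify_part are hypothetical names, strings are recognised merely by their matching outer quotes, and the split on "+" is naive (it would mishandle a string that itself contains a plus sign). Like the translator, it makes no attempt to check that an argument suits its command.

```python
import re

SPECIAL_WORDS = {"Here", "ToEOF", "To", "ThisTable", "back"}


def classify_part(text):
    """Say which kind of argument a single piece of text looks like."""
    text = text.strip()
    if text in SPECIAL_WORDS:
        return "special"
    if len(text) >= 2 and text[0] in "'\"" and text[-1] == text[0]:
        return "string"
    if re.fullmatch(r"[0-9]+", text):
        return "integer"  # unsigned only, as described in the text
    if re.fullmatch(r"[A-Za-z_]\w*", text):
        return "identifier"
    raise ValueError("unrecognised argument: %r" % text)


def classify_argument(argument):
    """Classify every piece of an argument whose pieces are joined by "+"."""
    # Naive split: assumes no "+" appears inside a quoted string.
    return [classify_part(part) for part in argument.split("+")]


print(classify_argument("'abc' + \"def\""))  # ['string', 'string']
print(classify_argument("back"))             # ['special']
print(classify_argument("t_string"))         # ['identifier']
```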
Some of the following are also mentioned in Limitations above.
"F:" and "T:" clauses.
One might represent this "directly" as:
Unfortunately, it looks like the "best" approach is just to move the true/false jumps to the start of the table definition:
I dither a little over whether a little more syntactic sugar would be nice here - for instance:
Note that there is still the issue of what one should do if the conditions are of the form "F:<label> T:next", which would normally be represented as the start of an if block - my inclination is to say at this point "well, that's getting too complex and we just won't cope", but I reserve the right to change my mind later on when I've worried about it a bit more...
"+" modifications for commands in a 'normal' tuple - for instance:
CallArg, TableInList and SubTableInList for similar reasons.
"/" (as in normal Python).
F or T condition. This would necessitate keeping a log of Tuples defined (but we couldn't guarantee to recognise a tuple by "name", of course - e.g., given "tuple[3] = ..." we would not know what to do if we later referred to that as "tuple[a]").
Also, if the user does:
"bbb" is a tuple (or tupleplus) if possible (this is actually almost the same point).
Tables) from tagobjs (i.e., as used in tuples) so we can grumble if the user uses the wrong sort at the wrong time.
The translator also needs to handle more valid Python targets for both assignment (of tag tables) and also for tag objects. I need to think about exactly how to do this - I don't really want to start having to be able to parse all legitimate Python source code explicitly "by hand".
Finally, I've translated all of the examples from the Examples directory in the mxTextTools distribution (version 1.0.2), with some added comments in appropriate places. They can be seen (colourised with py2html) here (or as straight tag table, with no colourisation, here), and the results of translation are here (or, again, as unvarnished Python, here).
Author: Tibs (tibs@tibsnjoan.co.uk)
Last textual change: Tue Nov 16 09:17:38 GMT Standard Time 1999