Re: Parsing Question…

This is a very old article. It has been imported from older blogging software, and the formatting, images, etc may have been lost. Some links may be broken. Some of the information may no longer be correct. Opinions expressed in this article may no longer be held.

cjl wrote:

Short of writing a parser, which is clearly beyond me, what are some reasonable approaches to handling user input that will be executed?

Writing a parser is the best option in the long-run. If you were to attempt to interpret the user input some other way, like pure regular expressions, then you would fall into a lot of traps, and your interpreter would behave oddly in many cases.

A full parser is a much better option: it will behave far more reliably and would be a lot easier to extend, should you feel the need to add extra features to the language at a later date.

Although it’s a lot of work, there are some fairly well established methods on writing them. What you basically need to write is three fairly independent components: a tokeniser, a parser and an interpreter. None of these share any code in common, except for the definitions of a few constants and classes.

Firstly, a tokeniser, which reads the user input and splits it into a long list of tokens. Each token should have the form:

Such that when you tokenize the following code:

echo “Foo”;

You end up with something like this (though imagine the inner arrays are actually Token objects!):

Note the $line and $char which contain the line number and character number where this token was found? That helps when a later stage of your program needs to print an error message — it can inform the user of the exact location where the error occurred.

Writing a tokeniser is probably the easiest step. The only slightly difficult bits are things like \”dealing with strings that contain \\\”special\\\” characters\”, but even they are not too difficult!

Your tokeniser then passes this list over to the parser. The parser is probably the hardest part you have to write. You have to convert the stream of tokens into an “abstract syntax tree”.

First you need to define the classes you’ll build the AST out of. PHP 5’s object oriented features will be very useful here.

token = $t;
}
abstract public function evaluate($machine);
}

class AstNode_Script extents AstNode
{
public $statements;
public function evaluate($machine)
{
foreach ($this->statements as $s)
$s->evaluate($machine);
}
}

class AstNode_If extends AstNode
{
public $condition_expression;
public $execution_block;

public function evaluate()
{
if ($this->condition_expression->evaluate($machine))
$this->execution_block->evaluate($machine);
}
}

class AstNode_Constant_False extends AstNode
{
public function evaluate($machine) { return FALSE; }
}
// etc
1?>

Then write the parser itself, which takes the form:

tokens = $T;
else
throw new Exception(‘Argh!’);
}

public function next()
{
return array_shift($this->tokens);
}

public function peek()
{
return $this->tokens[0];
}

public function get($type, $hissy_fit=FALSE)
{
$next = $this->peek;
if ($next->token_type==$type)
return $this->next();
elseif ($hissy_fit)
throw new Exception(‘hissy fit’);
else
return FALSE;
}

public function parseScript()
{
$ast = new AstNode_Script($this->peek());
$ast->statements = $this->parseCommand();
while ($this->peek())
{
$ast->statements = $this->parseCommand();
}
return $ast;
}

// And then you write parseCommand, which in turn probably
// calls things like parseConditional, parseExpression,
// parseFunctionCall and so forth.
}
1?>

The third part of the job is interpreting the AST, but if you look at my AstNode_* classes above, you’ll see they have the logic built into them. All you then need to do is:

evaluate($machine); 1?>

Where machine is an object capable of keeping track of things like variable values, function definitions and so forth.

It’s quite a bit of work, but it’s certainly do-able. It helps if you have a good book on compilers — I’d recommend Watt & Brown “Programming Language Processors in Java”. As you might guess from the title, it teaches you to write parsers, compilers and interpreters in Java, but the same techniques can easily be applied to any object-oriented language, and with a little more imagination, to non-OO languages too.

A few months back, partly as an experiment, but partly because I thought it would be useful for a project of mine, I designed my own scripting language and wrote a tokeniser, parser and machine for it in PHP. It supports variables (numeric, string and multi-dimensional arrays), functions, comments, and has all the normal numeric, string and array operators built-in. Scalar (non-array) variables, are automatically typecast as arrays (such that they become single-element arrays) and array variables are automatically typecast as scalars (the first value in the array is used, the rest are discarded).

The reason I wrote it is that it would allow user-supplied code to run in a “sandbox” environment, so that if it crashed, or tampered with variables, or whatever, it wouldn’t cause any problems for the rest of the site.

It’s half-finished, the syntax is sort of crazy and it needs improving, which is why I’ve not foisted it upon the general public. But if you want a copy, I’d be happy to send you one, licensed under the GPL.

Here’s an example of using it: