Parsing Question...

cjl
As a learning exercise, I am trying to write a web-based version of
'drawbot' in PHP.

See: http://just.letterror.com/ltrwiki/DrawBot

I am interested in hearing ideas about how to approach the user input
parsing problem. I would like to allow people to type in simple code
and have it executed, but I need to limit the code they can write to a
few pre-defined drawing functions, as well as control structures like
loops, if thens, etc...

Short of writing a parser, which is clearly beyond me, what are some
reasonable approaches to handling user input that will be executed?

Thanks in advance,
-CJL

Feb 10 '07 #1
cjl wrote:
> Short of writing a parser, which is clearly beyond me, what are some
> reasonable approaches to handling user input that will be executed?

Writing a parser is the best option in the long run. If you were to
attempt to interpret the user input some other way, like pure regular
expressions, then you would fall into a lot of traps, and your interpreter
would behave oddly in many cases.

A full parser is a much better option: it will behave far more reliably
and would be a lot easier to extend, should you feel the need to add extra
features to the language at a later date.

Although it's a lot of work, there are some fairly well-established
methods for writing them. What you basically need to write are three fairly
independent components: a tokeniser, a parser and an interpreter. These
share no code with each other, except for the definitions of a few
constants and classes.

Firstly, a tokeniser, which reads the user input and splits it into a long
list of tokens. Each token should have the form:

class Token
{
    public $token_type;  // integer
    public $token_value; // mixed
    public $line;        // integer
    public $char;        // integer
}

So when you tokenise the following PHP:

echo "Foo";

You end up with something like this (though imagine the inner arrays are
actually Token objects!):

array(
    array(TOKEN_BUILTIN, "echo", 1, 1),
    array(TOKEN_STRING_DQUOTED, "Foo", 1, 6),
    array(TOKEN_TERMINATOR, NULL, 1, 11)
);

Note the $line and $char which contain the line number and character
number where this token was found? That helps when a later stage of your
program needs to print an error message -- it can inform the user of the
exact location where the error occurred.
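
To make the tokeniser step concrete, here is a minimal sketch, assuming a toy language with only echo/print, double-quoted strings, integers and ";". The TOKEN_NUMBER constant, the constructor on Token and the tokenise() function are inventions for this illustration, not part of the original post. It tracks $line and $char as it goes, so that an error can report its position.

```php
<?php
// Hypothetical token types for this sketch; the post only names three of them.
const TOKEN_BUILTIN        = 1;
const TOKEN_STRING_DQUOTED = 2;
const TOKEN_TERMINATOR     = 3;
const TOKEN_NUMBER         = 4;

class Token
{
    public $token_type;
    public $token_value;
    public $line;
    public $char;

    public function __construct($type, $value, $line, $char)
    {
        $this->token_type  = $type;
        $this->token_value = $value;
        $this->line        = $line;
        $this->char        = $char;
    }
}

function tokenise($source)
{
    // The "A" modifier anchors each pattern at the current offset.
    $patterns = array(
        TOKEN_BUILTIN        => '/(?:echo|print)\b/A',
        TOKEN_STRING_DQUOTED => '/"((?:[^"\\\\]|\\\\.)*)"/As',
        TOKEN_NUMBER         => '/\d+/A',
        TOKEN_TERMINATOR     => '/;/A',
    );
    $tokens = array();
    $offset = 0;
    $line   = 1;
    $char   = 1;
    $length = strlen($source);

    while ($offset < $length) {
        // Skip whitespace, keeping the line and column counters up to date.
        if (ctype_space($source[$offset])) {
            if ($source[$offset] === "\n") { $line++; $char = 1; } else { $char++; }
            $offset++;
            continue;
        }
        $matched = false;
        foreach ($patterns as $type => $pattern) {
            if (!preg_match($pattern, $source, $m, 0, $offset)) {
                continue;
            }
            $text = $m[0];
            if ($type === TOKEN_STRING_DQUOTED) {
                $value = stripcslashes($m[1]);   // turn \" back into "
            } elseif ($type === TOKEN_TERMINATOR) {
                $value = null;
            } else {
                $value = $text;
            }
            $tokens[] = new Token($type, $value, $line, $char);

            // Advance past the matched text; strings may span several lines.
            $newlines = substr_count($text, "\n");
            if ($newlines > 0) {
                $line += $newlines;
                $char  = strlen($text) - strrpos($text, "\n");
            } else {
                $char += strlen($text);
            }
            $offset += strlen($text);
            $matched = true;
            break;
        }
        if (!$matched) {
            throw new Exception("Unexpected character '{$source[$offset]}' at line $line, char $char");
        }
    }
    return $tokens;
}
```

Calling tokenise('echo "Foo";') on this sketch gives three Token objects at characters 1, 6 and 11 of line 1, matching the array shown above.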

Writing a tokeniser is probably the easiest step. The only slightly
difficult bits are things like "dealing with strings that contain
\"special\" characters", but even they are not too difficult!

Your tokeniser then passes this list over to the parser. The parser is
probably the hardest part you have to write. You have to convert the
stream of tokens into an "abstract syntax tree".

First you need to define the classes you'll build the AST out of. PHP 5's
object-oriented features will be very useful here.

abstract class AstNode
{
    public $token;

    final public function __construct($t)
    {
        $this->token = $t;
    }

    // Each node type knows how to evaluate itself against the machine.
    abstract public function evaluate($machine);
}

class AstNode_Script extends AstNode
{
    public $statements = array();

    public function evaluate($machine)
    {
        foreach ($this->statements as $s)
            $s->evaluate($machine);
    }
}

class AstNode_If extends AstNode
{
    public $condition_expression;
    public $execution_block;

    public function evaluate($machine)
    {
        if ($this->condition_expression->evaluate($machine))
            $this->execution_block->evaluate($machine);
    }
}

class AstNode_Constant_False extends AstNode
{
    public function evaluate($machine) { return FALSE; }
}
// etc

Then write the parser itself, which takes the form:

class Parser
{
    private $tokens;

    public function __construct($T)
    {
        if (is_array($T))
            $this->tokens = $T;
        else
            throw new Exception('Argh!');
    }

    // Remove and return the next token (NULL once the list is empty).
    public function next()
    {
        return array_shift($this->tokens);
    }

    // Look at the next token without consuming it.
    public function peek()
    {
        return isset($this->tokens[0]) ? $this->tokens[0] : NULL;
    }

    // Consume the next token if it has the given type.
    public function get($type, $hissy_fit=FALSE)
    {
        $next = $this->peek();
        if ($next && $next->token_type==$type)
            return $this->next();
        elseif ($hissy_fit)
            throw new Exception('hissy fit');
        else
            return FALSE;
    }

    public function parseScript()
    {
        $ast = new AstNode_Script($this->peek());
        // Append each parsed command to the script's statement list.
        $ast->statements[] = $this->parseCommand();
        while ($this->peek())
        {
            $ast->statements[] = $this->parseCommand();
        }
        return $ast;
    }

    // And then you write parseCommand, which in turn probably
    // calls things like parseConditional, parseExpression,
    // parseFunctionCall and so forth.
}
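
As a rough idea of what parseCommand might look like, here is a self-contained sketch for a deliberately tiny grammar in which every command is a drawing call such as circle(10, 20);. The token constants, the expect() helper and the AstNode_FunctionCall class are assumptions made for this example. A real parser would hand the arguments to parseExpression rather than accepting only numbers.

```php
<?php
// Hypothetical token types for a toy "drawing call" grammar.
const TOKEN_NAME       = 1;
const TOKEN_NUMBER     = 2;
const TOKEN_LPAREN     = 3;
const TOKEN_RPAREN     = 4;
const TOKEN_COMMA      = 5;
const TOKEN_TERMINATOR = 6;

class Token
{
    public $token_type, $token_value, $line, $char;

    public function __construct($type, $value = null, $line = 1, $char = 1)
    {
        $this->token_type  = $type;
        $this->token_value = $value;
        $this->line        = $line;
        $this->char        = $char;
    }
}

// AST node for one call: a function name plus literal numeric arguments.
class AstNode_FunctionCall
{
    public $name;
    public $arguments = array();

    public function __construct($name) { $this->name = $name; }
}

class Parser
{
    private $tokens;

    public function __construct(array $tokens) { $this->tokens = $tokens; }

    public function peek() { return isset($this->tokens[0]) ? $this->tokens[0] : null; }
    public function next() { return array_shift($this->tokens); }

    // Consume a token of the given type, or fail and report where.
    public function expect($type)
    {
        $next = $this->peek();
        if ($next === null) {
            throw new Exception('Unexpected end of input');
        }
        if ($next->token_type !== $type) {
            throw new Exception("Unexpected token at line {$next->line}, char {$next->char}");
        }
        return $this->next();
    }

    // command := NAME '(' [ NUMBER { ',' NUMBER } ] ')' ';'
    public function parseCommand()
    {
        $node = new AstNode_FunctionCall($this->expect(TOKEN_NAME)->token_value);
        $this->expect(TOKEN_LPAREN);
        if ($this->peek() !== null && $this->peek()->token_type !== TOKEN_RPAREN) {
            $node->arguments[] = $this->expect(TOKEN_NUMBER)->token_value;
            while ($this->peek() !== null && $this->peek()->token_type === TOKEN_COMMA) {
                $this->next(); // skip the comma
                $node->arguments[] = $this->expect(TOKEN_NUMBER)->token_value;
            }
        }
        $this->expect(TOKEN_RPAREN);
        $this->expect(TOKEN_TERMINATOR);
        return $node;
    }

    public function parseScript()
    {
        $statements = array();
        while ($this->peek() !== null) {
            $statements[] = $this->parseCommand();
        }
        return $statements;
    }
}

// Usage: the token list a tokeniser might produce for  circle(10, 20);
$parser = new Parser(array(
    new Token(TOKEN_NAME, 'circle', 1, 1),
    new Token(TOKEN_LPAREN, null, 1, 7),
    new Token(TOKEN_NUMBER, 10, 1, 8),
    new Token(TOKEN_COMMA, null, 1, 10),
    new Token(TOKEN_NUMBER, 20, 1, 12),
    new Token(TOKEN_RPAREN, null, 1, 14),
    new Token(TOKEN_TERMINATOR, null, 1, 15),
));
$script = $parser->parseScript();
```

Each parse method consumes exactly the tokens belonging to its own grammar rule, which is what lets parseConditional, parseExpression and friends call one another recursively.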

The third part of the job is interpreting the AST, but if you look at my
AstNode_* classes above, you'll see they have the logic built into them.
All you then need to do is:

$ast->evaluate($machine);

Where machine is an object capable of keeping track of things like
variable values, function definitions and so forth.
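
For illustration, a machine along those lines could be as small as the sketch below. The Machine class, its method names and the three node classes are hypothetical; the post does not show the real ones. The point is that user scripts can only reach functions that have been registered on the machine.

```php
<?php
// Hypothetical "machine": holds interpreter state between evaluate() calls.
class Machine
{
    private $variables = array();
    private $functions = array();

    public function setVariable($name, $value)
    {
        $this->variables[$name] = $value;
    }

    public function getVariable($name)
    {
        if (!array_key_exists($name, $this->variables)) {
            throw new Exception("Undefined variable: $name");
        }
        return $this->variables[$name];
    }

    // Only functions registered here can ever be called by a user script.
    public function defineFunction($name, $callable)
    {
        $this->functions[$name] = $callable;
    }

    public function callFunction($name, array $arguments)
    {
        if (!isset($this->functions[$name])) {
            throw new Exception("Undefined function: $name");
        }
        return call_user_func_array($this->functions[$name], $arguments);
    }
}

class AstNode_Number
{
    private $value;
    public function __construct($value) { $this->value = $value; }
    public function evaluate($machine) { return $this->value; }
}

class AstNode_Variable
{
    private $name;
    public function __construct($name) { $this->name = $name; }
    public function evaluate($machine) { return $machine->getVariable($this->name); }
}

class AstNode_FunctionCall
{
    private $name;
    private $arguments;

    public function __construct($name, array $arguments)
    {
        $this->name      = $name;
        $this->arguments = $arguments;
    }

    public function evaluate($machine)
    {
        // Evaluate each argument node first, then dispatch via the machine.
        $values = array();
        foreach ($this->arguments as $argument) {
            $values[] = $argument->evaluate($machine);
        }
        return $machine->callFunction($this->name, $values);
    }
}

// Usage: the AST for  circle(10, 20, radius)
$machine = new Machine();
$machine->setVariable('radius', 5);
$machine->defineFunction('circle', function ($x, $y, $r) {
    return "circle at ($x, $y) with radius $r";
});
$ast = new AstNode_FunctionCall('circle', array(
    new AstNode_Number(10),
    new AstNode_Number(20),
    new AstNode_Variable('radius'),
));
$result = $ast->evaluate($machine); // "circle at (10, 20) with radius 5"
```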

It's quite a bit of work, but it's certainly do-able. It helps if you have
a good book on compilers -- I'd recommend Watt & Brown "Programming
Language Processors in Java". As you might guess from the title, it
teaches you to write parsers, compilers and interpreters in Java, but the
same techniques can easily be applied to any object-oriented language, and
with a little more imagination, to non-OO languages too.

A few months back, partly as an experiment, but partly because I thought
it would be useful for a project of mine, I designed my own scripting
language and wrote a tokeniser, parser and machine for it in PHP. It
supports variables (numeric, string and multi-dimensional arrays),
functions, comments, and has all the normal numeric, string and array
operators built-in. Scalar (non-array) variables are automatically
typecast as arrays (such that they become single-element arrays) and array
variables are automatically typecast as scalars (the first value in the
array is used, the rest are discarded).

The reason I wrote it is that it would allow user-supplied code to run in a
"sandbox" environment, so that if it crashed, or tampered with variables,
or whatever, it wouldn't cause any problems for the rest of the site.

It's half-finished, the syntax is sort of crazy and it needs improving,
which is why I've not foisted it upon the general public. But if you want
a copy, I'd be happy to send you one, licensed under the GPL.

Here's an example of using it:

<?php
$p = <<<'PROG'

/* Function names can be arbitrary strings. No parentheses used. */
function "my concatenation function", $a, $b
{
/* Uses "let VAR := EXPR" for assignment. A bit like Turing. */
let $result := $a . $b;

/* Perlish syntax for returning function results. */
$result;
}

let $quux := call "my concatenation function", "foo", "bar";

/* Print automatically appends a new line, a la Python */
print $quux;

PROG;

$r = eval_programme($p);

?>

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact
Geek of ~ HTML/SQL/Perl/PHP/Python*/Apache/Linux

* = I'm getting there!
Feb 10 '07 #2
cjl
Toby:

That is the single best response ever given to a newsgroup post. Thank
you.

Obviously, I have got a lot of reading to do.

I'm wondering if there is a simpler approach... after all, I want the
user input to be valid PHP; I just want to limit what they can type to
a few functions I write (circle(), line(), etc.) and a few control
structures. Maybe I could create an object which includes member
functions which override all native PHP functions, and have the user
input actually be calls to that object's methods, and only pass through
the ones that I want to allow?

As far as the approach you are suggesting, some googling showed:
http://greg.chiaraquartet.net/archiv...Generator.html

Which maybe can help me?

Anyway, back to the drawing board.

-CJL

Feb 10 '07 #3
cjl wrote:
> I'm wondering if there is a simpler approach... after all, I want the
> user input to be valid PHP; I just want to limit what they can type to
> a few functions I write (circle(), line(), etc.) and a few control
> structures.

If the user input is to be valid PHP, the "obvious" solution is eval(),
but this will totally destroy your security. You could use regular
expressions to check for "naughty" functions (like SQL queries, file
system manipulation, TCP sockets, etc), but then you end up:

(a) playing catch-up with the features of PHP itself. As new
functions are added to the language, you'll need to evaluate
how naughty they are, and add them to the block list.

(b) naively blocking more innocent PHP like:
print "fopen";

(It's worth mentioning that you'll also need to include in your block
list "naughty" functions from any of your own or third-party libraries you
use.)
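
A small sketch of why the regular-expression route is so leaky. The looks_naughty() function and its four-entry block list are made up for this example. Besides the false positive from point (b), it also shows a false negative that is not listed above: PHP lets you call a function through a string variable, so the blocked name never has to appear in the source at all.

```php
<?php
// Hypothetical block list; a real one would need to be far longer.
function looks_naughty($source)
{
    $blocked = array('fopen', 'unlink', 'mysql_query', 'fsockopen');
    $pattern = '/\b(' . implode('|', $blocked) . ')\b/i';
    return preg_match($pattern, $source) === 1;
}

// False positive: this only prints a word, but the check rejects it.
$innocent = 'print "fopen";';

// False negative: the name is assembled at run time, so the check passes it.
$sneaky = '$f = "fo" . "pen"; $f("/etc/passwd", "r");';

$innocent_rejected = looks_naughty($innocent); // true
$sneaky_rejected   = looks_naughty($sneaky);   // false
```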

> Maybe I could create an object which includes member functions which
> override all native PHP functions, and have the user input actually be
> calls to that object's methods, and only pass through the ones that I
> want to allow?

Aye -- that is indeed the dream: the ability to have an eval() function
that works within a single object, such that any function calls are
silently re-written to "$this->function()", any globals to
"$this->variable" and any constants to "self::CONSTANT".

Although PHP doesn't have such a "safe eval" function built in, it
shouldn't be too difficult to build one. As your language follows PHP
syntax rules, you can use PHP's built-in tokeniser:

$tokens = token_get_all($source);

Then loop through that list and rewrite three kinds of token. Re-point
every T_VARIABLE at an object member. Re-point every T_STRING (which,
despite the name, is a "non-$ identifier" token, so could be either a
function call or a constant) heuristically at a class constant or an
object method: UPPERCASE is assumed to be a constant, MixedOr_lower_case
is assumed to be a method. Finally, replace every T_EVAL with T_ECHO. You
would then need to loop through the token list again and re-assemble it as
source code before passing it through an eval() function wrapper within
the object.

It sounds complicated, but it is simpler than implementing your own real
parser and interpreter, and could probably be done in fewer than 50 lines
of code.

I wouldn't be happy running it on a production system though without
substantial hack-testing!
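
A rough sketch of that rewriting idea follows. The rewrite_for_sandbox() function, the Sandbox class, its MAX_X constant and its circle() method are all invented for this example; token_get_all() and the T_* constants are the real tokeniser API. It is nowhere near hardened: it does nothing about include, exit, backticks, keywords such as true (which are T_STRING tokens too), or names that follow "->".

```php
<?php
// Rewrite user source so that identifiers resolve against a sandbox object.
function rewrite_for_sandbox($source)
{
    $out = '';
    // token_get_all() only tokenises PHP after an opening tag, so add one.
    foreach (token_get_all('<?php ' . $source) as $token) {
        if (!is_array($token)) {
            $out .= $token;        // single-character tokens such as ";" or "("
            continue;
        }
        list($id, $text) = $token;
        switch ($id) {
            case T_OPEN_TAG:
                break;             // drop the tag we added ourselves
            case T_VARIABLE:
                $out .= '$this->' . substr($text, 1);   // $foo => $this->foo
                break;
            case T_STRING:
                // UPPERCASE is assumed to be a constant, anything else a method.
                $out .= ($text === strtoupper($text))
                    ? 'self::' . $text
                    : '$this->' . $text;
                break;
            case T_EVAL:
                $out .= 'echo';    // neutralise nested eval()
                break;
            default:
                $out .= $text;
        }
    }
    return $out;
}

class Sandbox
{
    const MAX_X = 100;

    public $drawn = array();
    private $vars = array();

    // User variables live in $vars rather than as real properties.
    public function __set($name, $value) { $this->vars[$name] = $value; }
    public function __get($name) { return isset($this->vars[$name]) ? $this->vars[$name] : null; }

    // Any method not defined on this class is refused, not passed on to PHP.
    public function __call($name, $arguments)
    {
        throw new Exception("Function not allowed: $name");
    }

    public function circle($x, $r) { $this->drawn[] = "circle($x, $r)"; }

    public function run($source) { eval(rewrite_for_sandbox($source)); }
}

$rewritten = rewrite_for_sandbox('$r = 10; circle(MAX_X, $r);');

$sandbox = new Sandbox();
$sandbox->run('$r = 10; circle(MAX_X, $r);');
```

Here $rewritten comes out as $this->r = 10; $this->circle(self::MAX_X, $this->r); and a call such as unlink("/tmp/x") becomes $this->unlink(...), which lands in __call and throws instead of touching the file system.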

> As far as the approach you are suggesting, some googling showed:
> http://greg.chiaraquartet.net/archiv...Generator.html
> Which maybe can help me?

Quite possibly -- it does look quite good. If I'd known of its existence
when I started my scripting language, I might not have attempted to write
a scripting language. But I certainly learnt a lot -- especially about OO
PHP -- from doing so, so I don't regret it.

> That is the single best response ever given to a newsgroup post. Thank
> you.

No problem -- I'd guessed that nobody else in this group had been crazy
enough to attempt a scripting language parser and interpreter in PHP, so
if I didn't help you, nobody would!

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact
Geek of ~ HTML/SQL/Perl/PHP/Python*/Apache/Linux

* = I'm getting there!
Feb 12 '07 #4
cjl
Toby:

Once again, thanks.

After reading up on the (very) little information I could find about
the tokenizer, I am going to try and give it a shot by parsing the
tokens. However, instead of a blacklist, I was thinking about using a
whitelist, which should be slightly more hack-proof than a blacklist?
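
One way the whitelist idea could look, as a sketch only: accept a short list of token types and single characters, and allow a T_STRING only if it names one of the permitted drawing functions. The check_against_whitelist() function and the particular lists are assumptions for this example. String literals are deliberately left off the list, which also rules out building function names from strings.

```php
<?php
// Returns a list of problems; an empty list means the source passed the check.
function check_against_whitelist($source, array $allowed_functions)
{
    // Token types the toy language accepts; everything else is rejected.
    $allowed_tokens = array(
        T_OPEN_TAG, T_WHITESPACE, T_VARIABLE, T_LNUMBER, T_DNUMBER,
        T_STRING, T_IF, T_ELSE, T_FOR, T_WHILE,
        T_INC, T_DEC, T_IS_EQUAL, T_IS_SMALLER_OR_EQUAL, T_IS_GREATER_OR_EQUAL,
    );
    // Single-character tokens the toy language accepts.
    $allowed_chars = str_split('(){};,=+-*/<>');
    $problems = array();

    // token_get_all() needs an opening tag before it will tokenise PHP.
    foreach (token_get_all('<?php ' . $source) as $token) {
        if (!is_array($token)) {
            if (!in_array($token, $allowed_chars, true)) {
                $problems[] = "character $token is not allowed";
            }
            continue;
        }
        list($id, $text, $line) = $token;
        if (!in_array($id, $allowed_tokens, true)) {
            $problems[] = token_name($id) . " is not allowed (line $line)";
        } elseif ($id === T_STRING && !in_array($text, $allowed_functions, true)) {
            $problems[] = "function $text is not allowed (line $line)";
        }
    }
    return $problems;
}

$allowed = array('circle', 'line');

$ok  = check_against_whitelist('for ($i = 0; $i < 10; $i++) { circle($i, 5); }', $allowed);
$bad = check_against_whitelist('unlink("/tmp/x");', $allowed);
```

With these lists $ok is empty, while $bad reports both the unlink call and the string literal. Anything PHP adds later is rejected by default, which is the advantage over a blacklist.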

Anyway, back to the drawing board. If I end up with anything that
works, I'll repost to this thread.

-CJL

Feb 12 '07 #5
