469,290 Members | 1,866 Online
Bytes | Developer Community
New Post

Home Posts Topics Members FAQ

Post your question to a community of 469,290 developers. It's quick & easy.

Hierarchical query

Reposting with more clarification (as Jan asked).

Suppose I have a BNFgrammar and a source text parsed into a tree. How
would I query an
identifier declaration?

All the XQuery tutorials (where I went to gather some ideas) start
with simpleminded examples like browsing all the descendants of /
bookstore/book (and the bookstore XML design is such wrong idea to
boot).

Perhaps some example is needed. A simplified grammar:

statement_block:
'declare'
declaration_item_list
'begin'
statements
'end'
;

statements:
(statement_block | assignment) statements |
;

assignment:
identifier ':=' (identifier | number) ';'
;

declaration_item_list:
identifier 'integer' ';'
;

Suppose we parse the following text

declare -- token #1
i integer; -- tokens 2,3,4
begin
i := 1; -- tokens 6,7,8,9
end;

So that we get the derivation tree like this:

1 statement_block
1.1 'declare' (matches token #1)
1.2 declaration_item_list
1.2.1 identifier (matches token #2)
1.2.2 'integer' (matches token #3)
1.2.3 ';' (matches token #4)
1.3 'begin' (matches token #5)
1.4 statements
1.4.1 assignment
1.4.1.1 identifier
1.4.1.2 ':='
1.4.1.3 number
1.4.1.4 ';'
1.4.2 statements (matches empty token)
1.5 'end';
Now, given a derivation tree and the leaf node identifier 1.4.1.1
corresponding to the token #6 which is variable i, how do we find the
node 1.2.1 that declares it?

Jun 13 '07 #1
30 2147
On 13 jun, 18:11, Vadim Tropashko <vadimtro_inva...@yahoo.comwrote:
Reposting with more clarification (as Jan asked).

Suppose I have a BNFgrammar and a source text parsed into a tree. How
would I query an
identifier declaration?

All the XQuery tutorials (where I went to gather some ideas) start
with simpleminded examples like browsing all the descendants of /
bookstore/book (and the bookstore XML design is such wrong idea to
boot).

Perhaps some example is needed. A simplified grammar:

statement_block:
'declare'
declaration_item_list
'begin'
statements
'end'
;

statements:
(statement_block | assignment) statements |
;

assignment:
identifier ':=' (identifier | number) ';'
;

declaration_item_list:
identifier 'integer' ';'
;

Suppose we parse the following text

declare -- token #1
i integer; -- tokens 2,3,4
begin
i := 1; -- tokens 6,7,8,9
end;

So that we get the derivation tree like this:

1 statement_block
1.1 'declare' (matches token #1)
1.2 declaration_item_list
1.2.1 identifier (matches token #2)
1.2.2 'integer' (matches token #3)
1.2.3 ';' (matches token #4)
1.3 'begin' (matches token #5)
1.4 statements
1.4.1 assignment
1.4.1.1 identifier
1.4.1.2 ':='
1.4.1.3 number
1.4.1.4 ';'
1.4.2 statements (matches empty token)
1.5 'end';

Now, given a derivation tree and the leaf node identifier 1.4.1.1
corresponding to the token #6 which is variable i, how do we find the
node 1.2.1 that declares it?
Ok. I'm going to assume the following DTD (in a notation of my own
making to make it a bit more readable) for the syntax tree. (It uses
no attributes to keep things simple):

<stat_bl--<decl_kw<decl_item_list<begin_kw<statements>
<semicol_kw<end-kw>
<decl_kw--EMPTY
<begin_kw--EMPTY
<end_kw--EMPTY
<semicol_kw--EMPTY
<statements--( <stat_bl| <assignment)+
<assignment--<var<assign_kw( <var| <number) <semicol_kw>
<var--PCDATA
<assign_kw--EMPTY
<number--PCDATA
<decl_item_list--( <var<type<semicol_kw)+
<type--PCDATA

For starters I'll first do the reverse query, so I will assume there
is a variable $dvar that contains a <varelement that describes a
variable in a declaration. The XPath expression that walks to all the
<varnodes in an assignment that are in the scope of $dvar is as
follows:

(1) $dvar/(
(2) ../../statements//assignment/var[string() = $dvar/string()]
(3) minus
(4) ../../statements//stat_bl[decl_item_list/var/string() = $dvar/
string()]/statements//var
(5) )

The idea is quite simple: the path expression in line (2) walks to all
variables that are nested within the statement block of the
declaration $dvar, and the path expression in line (4) subtracts those
for which there is another intermediate declaration that supersedes
the one of $dvar.

Now the query that you asked for, which in some sense simply the
reverse. We assume now that $uvar contains the <varelement that is
used in an assignment and the path expression walks to the <var>
elements in a declaration that binds the <varelement in $uvar:

(1) $uvar/(
(2) ancestor::stat_bl/decl_item_list/var[string() = $uvar/
string()]
(3) minus
(4) ancestor::stat_bl[decl_item_list/var/string() = $uvar/
string()]/ancestor::stat_bl/decl_item/var
(5) )

Of course there are many many different ways of expressing these
queries, but these are the most compact ones I could think of. Note
that I actually didn't have to take into account the order of the
declarations, just how they are nested. But this is because of the way
you defined the syntax and scoping rules. However, if you would extend
the syntax and the scoping rules such that order also maters then
extending the queries to take that into account should not be a big
problem.

-- Jan Hidders

Jun 13 '07 #2
On Jun 13, 9:19 am, Jan Hidders <hidd...@gmail.comwrote:
Ok. I'm going to assume the following DTD (in a notation of my own
making to make it a bit more readable) for the syntax tree. (It uses
no attributes to keep things simple):

<stat_bl--<decl_kw<decl_item_list<begin_kw<statements>
<semicol_kw<end-kw>
<decl_kw--EMPTY
<begin_kw--EMPTY
<end_kw--EMPTY
<semicol_kw--EMPTY
<statements--( <stat_bl| <assignment)+
<assignment--<var<assign_kw( <var| <number) <semicol_kw>
<var--PCDATA
<assign_kw--EMPTY
<number--PCDATA
<decl_item_list--( <var<type<semicol_kw)+
<type--PCDATA
All right, the DTD is a bastardized [context free?] grammar describing
a language that XML document is an element of. Although the adjectives
"cumbersome" and "ugly" still apply, I grudgingly admit that the idea
that a grammar fits into DTD effortlessly is quite powerful (so I'm
removing the "XML sucks" image from my homepage:-)
For starters I'll first do the reverse query,
so I will assume there
is a variable $dvar that contains a <varelement that describes a
variable in a declaration. The XPath expression that walks to all the
<varnodes in an assignment that are in the scope of $dvar is as
follows:

(1) $dvar/(
(2) ../../statements//assignment/var[string() = $dvar/string()]
(3) minus
(4) ../../statements//stat_bl[decl_item_list/var/string() = $dvar/
string()]/statements//var
(5) )

The idea is quite simple: the path expression in line (2) walks to all
variables that are nested within the statement block of the
declaration $dvar, ...
I have trouble comprehending this line

(2) ../../statements//assignment/var[string() = $dvar/string()]

should I read it left to right? Then the ".." selects the parent of
the current node, and what is the current node? Or do we start from
some node that mathes "statements", then go two levels up? The
statement block <stat_blis only one level up!
Jun 13 '07 #3
On Jun 13, 9:19 am, Jan Hidders <hidd...@gmail.comwrote:
Ok. I'm going to assume the following DTD (in a notation of my own
making to make it a bit more readable) for the syntax tree. (It uses
no attributes to keep things simple):

<stat_bl--<decl_kw<decl_item_list<begin_kw<statements>
<semicol_kw<end-kw>
<decl_kw--EMPTY
<begin_kw--EMPTY
<end_kw--EMPTY
<semicol_kw--EMPTY
<statements--( <stat_bl| <assignment)+
<assignment--<var<assign_kw( <var| <number) <semicol_kw>
<var--PCDATA
<assign_kw--EMPTY
<number--PCDATA
<decl_item_list--( <var<type<semicol_kw)+
<type--PCDATA
All right, the DTD is a bastardized [context free?] grammar describing
a language that XML document is an element of. Although the adjectives
"cumbersome" and "ugly" still apply, I grudgingly admit that the idea
that a grammar fits into DTD effortlessly is quite powerful (so I'm
removing the "XML sucks" image from my homepage:-)
For starters I'll first do the reverse query,
so I will assume there
is a variable $dvar that contains a <varelement that describes a
variable in a declaration. The XPath expression that walks to all the
<varnodes in an assignment that are in the scope of $dvar is as
follows:

(1) $dvar/(
(2) ../../statements//assignment/var[string() = $dvar/string()]
(3) minus
(4) ../../statements//stat_bl[decl_item_list/var/string() = $dvar/
string()]/statements//var
(5) )

The idea is quite simple: the path expression in line (2) walks to all
variables that are nested within the statement block of the
declaration $dvar, ...
I have trouble comprehending this line

(2) ../../statements//assignment/var[string() = $dvar/string()]

should I read it left to right? Then the ".." selects the parent of
the current node, and what is the current node? Or do we start from
some node that mathes "statements", then go two levels up? The
statement block <stat_blis only one level up!

Jun 13 '07 #4
Vadim Tropashko wrote:
I have trouble comprehending this line

(2) ../../statements//assignment/var[string() = $dvar/string()]

should I read it left to right?

Find a good tutorial on XPath. Yes, it evaluates left to right. From the
context node, find its parent; find the parent of that; find its
<statementschildren; find their <assignmentdescendants (at any
depth), find their <varchildren whose string value is the same as ...
hmmm. actually I'm not sure what $dvar/string() was intended to mean,
and I don't have time right now to check that.
--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Jun 13 '07 #5
On Jun 13, 9:11 am, Vadim Tropashko <vadimtro_inva...@yahoo.com>
wrote:
1 statement_block
1.1 'declare' (matches token #1)
1.2 declaration_item_list
1.2.1 identifier (matches token #2)
1.2.2 'integer' (matches token #3)
1.2.3 ';' (matches token #4)
1.3 'begin' (matches token #5)
1.4 statements
1.4.1 assignment
1.4.1.1 identifier
1.4.1.2 ':='
1.4.1.3 number
1.4.1.4 ';'
1.4.2 statements (matches empty token)
1.5 'end';
Actually, here is quite simple method. Write down the parse tree as a
set of paths, using grammar varibles and constants (terminals and
nonterminals) as path elements. In the example above we get:

statement_block
statement_block.'declare'
statement_block.declaration_item_list
statement_block.declaration_item_list.identifier"i "
statement_block.declaration_item_list.'integer'
.... and so on ...

Query this set with regular expressions.

XML, good bye!

Jun 13 '07 #6
On 13 jun, 21:10, Vadim Tropashko <vadimtro_inva...@yahoo.comwrote:
On Jun 13, 9:19 am, Jan Hidders <hidd...@gmail.comwrote:
Ok. I'm going to assume the following DTD (in a notation of my own
making to make it a bit more readable) for the syntax tree. (It uses
no attributes to keep things simple):
<stat_bl--<decl_kw<decl_item_list<begin_kw<statements>
<semicol_kw<end-kw>
<decl_kw--EMPTY
<begin_kw--EMPTY
<end_kw--EMPTY
<semicol_kw--EMPTY
<statements--( <stat_bl| <assignment)+
<assignment--<var<assign_kw( <var| <number) <semicol_kw>
<var--PCDATA
<assign_kw--EMPTY
<number--PCDATA
<decl_item_list--( <var<type<semicol_kw)+
<type--PCDATA

All right, the DTD is a bastardized [context free?] grammar describing
a language that XML document is an element of. Although the adjectives
"cumbersome" and "ugly" still apply, I grudgingly admit that the idea
that a grammar fits into DTD effortlessly is quite powerful (so I'm
removing the "XML sucks" image from my homepage:-)
Of course, it combines ugliness with great power. Welcome to the Dark
Side. :-)
For starters I'll first do the reverse query,
so I will assume there
is a variable $dvar that contains a <varelement that describes a
variable in a declaration. The XPath expression that walks to all the
<varnodes in an assignment that are in the scope of $dvar is as
follows:
(1) $dvar/(
(2) ../../statements//assignment/var[string() = $dvar/string()]
(3) minus
(4) ../../statements//stat_bl[decl_item_list/var/string() = $dvar/
string()]/statements//var
(5) )
The idea is quite simple: the path expression in line (2) walks to all
variables that are nested within the statement block of the
declaration $dvar, ...

I have trouble comprehending this line

(2) ../../statements//assignment/var[string() = $dvar/string()]

should I read it left to right? Then the ".." selects the parent of
the current node, and what is the current node?
First a small correction, the 'minus' I used should actualy be
'except', which has the semantics of the set minus.

What is the current node? Note that the global structure is p_1/(p_2
except p_3) where (p_2 except p_3) means that from the context node
you evaluate p_2 which gives you a set (or rather sequence) of nodes
and from that you subtract the set of nodes that you get when you
evaluate p_3 starting from the context node. So in this case the
"current node" that you ask about is for both subexpressions the same,
namely the node in $dvar.

Btw., if you forget about sequence order, document order and value
comparisons for a moment, the core of XPath is actually a relatively
neat calculus of binary relations:
- p_1/p_2 is the concatenation of binary relations
- p_1[p_2] is the selection of pairs in p_2 whose right-hand side
matches a left-hand side in p_2 (semijoin anyone?)
- p_1 union p_2 is the set union of two binary relations
- p_1 intersect p_2 is the set intersection of two binary relations
- p_1 except p_2 is the set difference between binary relations

Did I mention Tarski already? :-) Of course we had to wait until
XPath2.0 until the three set operations were all allowed everywhere,
but we have them now.

-- Jan Hidders

Jun 13 '07 #7
On 13 jun, 21:37, Joe Kesselman <keshlam-nos...@comcast.netwrote:
Vadim Tropashko wrote:
I have trouble comprehending this line
(2) ../../statements//assignment/var[string() = $dvar/string()]
should I read it left to right?

Find a good tutorial on XPath. Yes, it evaluates left to right. From the
context node, find its parent; find the parent of that; find its
<statementschildren; find their <assignmentdescendants (at any
depth), find their <varchildren whose string value is the same as ...
hmmm. actually I'm not sure what $dvar/string() was intended to mean,
and I don't have time right now to check that.
It retrieves the text node children of the nodes in $dvar. Actually it
is unnecessary, in this case I might just have well have written the
predicate as [. = $dvar].

-- Jan Hidders

Jun 13 '07 #8
On 13 jun, 22:58, Vadim Tropashko <vadimtro_inva...@yahoo.comwrote:
On Jun 13, 9:11 am, Vadim Tropashko <vadimtro_inva...@yahoo.com>
wrote:
1 statement_block
1.1 'declare' (matches token #1)
1.2 declaration_item_list
1.2.1 identifier (matches token #2)
1.2.2 'integer' (matches token #3)
1.2.3 ';' (matches token #4)
1.3 'begin' (matches token #5)
1.4 statements
1.4.1 assignment
1.4.1.1 identifier
1.4.1.2 ':='
1.4.1.3 number
1.4.1.4 ';'
1.4.2 statements (matches empty token)
1.5 'end';

Actually, here is quite simple method. Write down the parse tree as a
set of paths, using grammar varibles and constants (terminals and
nonterminals) as path elements. In the example above we get:

statement_block
statement_block.'declare'
statement_block.declaration_item_list
statement_block.declaration_item_list.identifier"i "
statement_block.declaration_item_list.'integer'
... and so on ...

Query this set with regular expressions.

XML, good bye!
Not so fast. An enumeration of the paths in a tree does not always
give you enough information to reconstruct the tree, so you would be
losing expressive power. Never mind that you ignore the order, which
may be a factor for the scope of declarations.

-- Jan Hidders

Jun 13 '07 #9
On Jun 13, 2:44 pm, Jan Hidders <hidd...@gmail.comwrote:
On 13 jun, 22:58, Vadim Tropashko <vadimtro_inva...@yahoo.comwrote:


On Jun 13, 9:11 am, Vadim Tropashko <vadimtro_inva...@yahoo.com>
wrote:
1 statement_block
1.1 'declare' (matches token #1)
1.2 declaration_item_list
1.2.1 identifier (matches token #2)
1.2.2 'integer' (matches token #3)
1.2.3 ';' (matches token #4)
1.3 'begin' (matches token #5)
1.4 statements
1.4.1 assignment
1.4.1.1 identifier
1.4.1.2 ':='
1.4.1.3 number
1.4.1.4 ';'
1.4.2 statements (matches empty token)
1.5 'end';
Actually, here is quite simple method. Write down the parse tree as a
set of paths, using grammar varibles and constants (terminals and
nonterminals) as path elements. In the example above we get:
statement_block
statement_block.'declare'
statement_block.declaration_item_list
statement_block.declaration_item_list.identifier"i "
statement_block.declaration_item_list.'integer'
... and so on ...
Query this set with regular expressions.
XML, good bye!

Not so fast. An enumeration of the paths in a tree does not always
give you enough information to reconstruct the tree, so you would be
losing expressive power. Never mind that you ignore the order, which
may be a factor for the scope of declarations.
Derivation tree was the confusing part. When we query nodes on a tree
what exactly are we doing? Then, your reply was a critical for me
understanding that the tree structure is unnecessary, the derivation
is essentially a language -- a set of words (which includes both
terminals and nonterminals) and this set of words can be quieried
solely with the language theory means. Formally, a query is a language
intersection.

The order on the tree comes from the order in the Kleene algebra, and
of course the information about the order is still there in the
grammar rules (or should I say inequalities?).

The "reqular expression" exclamation in my previous is
oversimplification, of course.

Jun 13 '07 #10
On Jun 13, 2:25 pm, Jan Hidders <hidd...@gmail.comwrote:
Btw., if you forget about sequence order, document order and value
comparisons for a moment, the core of XPath is actually a relatively
neat calculus of binary relations:
- p_1/p_2 is the concatenation of binary relations
- p_1[p_2] is the selection of pairs in p_2 whose right-hand side
matches a left-hand side in p_2 (semijoin anyone?)
- p_1 union p_2 is the set union of two binary relations
- p_1 intersect p_2 is the set intersection of two binary relations
- p_1 except p_2 is the set difference between binary relations

Did I mention Tarski already? :-) Of course we had to wait until
XPath2.0 until the three set operations were all allowed everywhere,
but we have them now.
Yes, and you have been credited for that (by me removing that XML
banner:-).

Well, the question originated from the world of Kleene algebras (which
Tarski relation algebra is an instance of), idempotent semirings, all
the other language theory goodies. I was temorarily confused by the
concept of the "query". Now that it it no longer the case, and I can
operate back into the language theory space, what do I need XML for?

Jun 13 '07 #11
On 13 jun, 23:56, Vadim Tropashko <vadimtro_inva...@yahoo.comwrote:
>
[...] Then, your reply was a critical for me
understanding that the tree structure is unnecessary, the derivation
is essentially a language -- a set of words (which includes both
terminals and nonterminals) and this set of words can be quieried
solely with the language theory means. Formally, a query is a language
intersection.
Of course, all computation, including RDBMS and XML querying and
transformation, is ultimately just string manipulation. Doesn't mean
that Turing Machines are always the most appropriate formalism for
describing them, does it? :-)

-- Jan Hidders

Jun 13 '07 #12
On Jun 13, 3:47 pm, Jan Hidders <hidd...@gmail.comwrote:
On 13 jun, 23:56, Vadim Tropashko <vadimtro_inva...@yahoo.comwrote:
[...] Then, your reply was a critical for me
understanding that the tree structure is unnecessary, the derivation
is essentially a language -- a set of words (which includes both
terminals and nonterminals) and this set of words can be quieried
solely with the language theory means. Formally, a query is a language
intersection.

Of course, all computation, including RDBMS and XML querying and
transformation, is ultimately just string manipulation. Doesn't mean
that Turing Machines are always the most appropriate formalism for
describing them, does it? :-)
You seem to imply that language theory reduces to trivial string
manipulation. I expected you to request clarification why a query is a
language intersection; is there any nonvacuous idea behind this
statement? And insist on demonstating that the reg exp method does
work:-)

Consider a simple expression grammar

expr :
expr '+' expr
| '(' expr ')'
| '1'

and the string

(1 + 1) + 1

When we ask if this string belongs to the language defined by the
grammar, we are interested to know if the intersection of the infinite
language defined by the above expression grammar and finite language
coonsisting of a single string is empty.

The derivation that proves that a string belongs to a language is a
chain of inequalities (where inequality is understood to be the usual
language subset relation):

(1 + 1) + 1 < (expr + 1) + 1 < ... < (expr + expr) + expr < expr +
expr < expr

Now we have derivation tree:

1 expr
1.1 expr
1.1.1 (
1.1.2 expr
1.1.2.1 expr
1.1.2.1.1 1
1.1.2.2 +
1.1.2.3 expr
1.1.2.3.1 1
1.1.3 )
1.2 +
1.2 expr
1.2.1 1

Next if we encode tree nodes as paths of the grammar symbols, then not
only the order is lost (as you correctly observed), but the duplicate
paths are collapsed:

1 --expr
1.1 --expr expr
1.1.1 --expr expr (
1.1.2 --expr expr expr
1.1.2.1 --expr expr expr expr
1.1.2.1.1 --expr expr expr expr expr
1.1.2.2
1.1.2.3
1.1.2.3.1
1.1.3
1.2
1.2 --expr expr (same as 1.1!)
1.2.1



Jun 13 '07 #13
On Jun 13, 2:25 pm, Jan Hidders <hidd...@gmail.comwrote:
On 13 jun, 21:10, Vadim Tropashko <vadimtro_inva...@yahoo.comwrote:
On Jun 13, 9:19 am, Jan Hidders <hidd...@gmail.comwrote:
Ok. I'm going to assume the following DTD (in a notation of my own
making to make it a bit more readable) for the syntax tree. (It uses
no attributes to keep things simple):
<stat_bl--<decl_kw<decl_item_list<begin_kw<statements>
<semicol_kw<end-kw>
<decl_kw--EMPTY
<begin_kw--EMPTY
<end_kw--EMPTY
<semicol_kw--EMPTY
<statements--( <stat_bl| <assignment)+
<assignment--<var<assign_kw( <var| <number) <semicol_kw>
<var--PCDATA
<assign_kw--EMPTY
<number--PCDATA
<decl_item_list--( <var<type<semicol_kw)+
<type--PCDATA
All right, the DTD is a bastardized [context free?] grammar describing
a language that XML document is an element of. Although the adjectives
"cumbersome" and "ugly" still apply, I grudgingly admit that the idea
that a grammar fits into DTD effortlessly is quite powerful (so I'm
removing the "XML sucks" image from my homepage:-)

Of course, it combines ugliness with great power. Welcome to the
Dark Side. :-)
This raises some questions for me. Does the above mean we
should consider comparing XML and its associated ... whadayacallits?
XPath, XSL, etc.?, against more traditional parsing techniques?
*** What, if anything, would that leave out? ***

It seems clear enough, the idea of regarding an XML document
as a syntax tree, and the DTD as a grammar. In which case,
*XML*parsing is then the analog of ... what? Converting an
ascii version of a syntax tree into a binary syntax tree?
Sort of like parsing s-expressions?

Then XML querying and updating is comparable to operations
we would do once we had a syntax tree, which seems to
be multipass traversal and transformation. Think of an
interpreter, which traverses the tree, carrying out instructions
as it does so. The more complex case is a compiler
with multiple analysis, optimizer, and code generator passes.
These involve generating additional data structures which
may or may not be trees as well. A symbol table in a
lexically scoped language likely would be; an object file
likely would not be.

If that analysis is valid, then I'd be interested to continue to
a comparison with other techniques for accomplishing the
same thing, particularly from the standpoint of expressive
power, but also from the standpoint of convenience and
generality.

Other techniques I can think of are:

1) ad hoc traversal of an object graph in an OO language
2) pattern matching over abstract data types, as introduced
by SML. (The gold standard.)
3) "Tree walking" a la Terrance Parr / Antlr (the upstart.)

1) is what I am most familiar with. It isn't particularly pleasant,
but I suppose it's no more objectionable than anything else
in an OOPL. One ends up using the Visitor pattern a lot,
and one necessarily finds oneself having to defeat the
type system, since OOPLs are organized around extensible
structures w/ fixed operations, whereas trees work best
with fixed structures w/ extensible operations.

2) I would be very interested to hear a comparison with
pattern matching from someone who is familiar with both
techniques. One thing I can see immediately is that XPath
expressions allow one to address only some specific
part of a tree that one is interested in, whereas pattern
matching requires whole-tree analysis. How much of an
issue is this in practice?

3) is interesting. I admit that when I first heard about the
idea I was extremely skeptical. The idea that one would
write a grammar to abstract operating on a tree one had
generated by using a grammar to abstract over a language
seemed perverse and unnecessarily complex at first.
However in practice it's turned out to be the opposite:
it is extremely simple. Tree grammars are a fraction
of the size of the original grammar, possible similarly
to the size of an XPath expression to a DTD.

I suppose there are also term rewriting languages but I'm
not as familiar with them. I am also vaguely aware of
"strategies" which are abstractions of data structure
traversal techniques. I suppose further there are the
various tree encoding techniques of SQL, but I haven't found
these to be very exciting. Perhaps a more complete
relational language ...

Btw., if you forget about sequence order, document order and value
comparisons for a moment, the core of XPath is actually a relatively
neat calculus of binary relations:
- p_1/p_2 is the concatenation of binary relations
- p_1[p_2] is the selection of pairs in p_2 whose right-hand side
matches a left-hand side in p_2 (semijoin anyone?)
- p_1 union p_2 is the set union of two binary relations
- p_1 intersect p_2 is the set intersection of two binary relations
- p_1 except p_2 is the set difference between binary relations

Did I mention Tarski already? :-) Of course we had to wait until
XPath2.0 until the three set operations were all allowed everywhere,
but we have them now.
I guess we're talking about Relation Algebra here? I don't know
enough about it to know if this is just an interesting comparison
of if there's some specific result being alluded to.

My hope is that this post will stimulate an educated and informative
discussion of the relative merits of different approaches to
tree traversal, not neglecting to differentiate among the needs
of diverse use cases.

Hey, it could happen.
Marshall

Jun 13 '07 #14
On Jun 13, 4:47 pm, Marshall <marshall.spi...@gmail.comwrote:
3) is interesting. I admit that when I first heard about the
idea I was extremely skeptical. The idea that one would
write a grammar to abstract operating on a tree one had
generated by using a grammar to abstract over a language
seemed perverse and unnecessarily complex at first.
However in practice it's turned out to be the opposite:
it is extremely simple. Tree grammars are a fraction
of the size of the original grammar, possible similarly
to the size of an XPath expression to a DTD.
Do you have a reference?
http://antlr.org/article/1170602723163/treewalkers.html
insists that the "tree grammar" concept is not appeared to be coherent
to begin with...

Jun 14 '07 #15
On Jun 13, 5:35 pm, Vadim Tropashko <vadimtro_inva...@yahoo.com>
wrote:
On Jun 13, 4:47 pm, Marshall <marshall.spi...@gmail.comwrote:
3) is interesting. I admit that when I first heard about the
idea I was extremely skeptical. The idea that one would
write a grammar to abstract operating on a tree one had
generated by using a grammar to abstract over a language
seemed perverse and unnecessarily complex at first.
However in practice it's turned out to be the opposite:
it is extremely simple. Tree grammars are a fraction
of the size of the original grammar, possible similarly
to the size of an XPath expression to a DTD.

Do you have a reference?http://antlr.org/article/1170602723163/treewalkers.html
insists that the "tree grammar" concept is not appeared to be coherent
to begin with...
Well, that paper's on Terrence's site, so probably he doesn't
see it as *too* devastating. :-)

But really, I had all those arguments in my head, until I tried it.
It turned out really easy to do.

In something I did recently, a simple expression grammar,
all the precedence, etc, went in the "regular" grammar.
So the AST captured all of that. Then I needed to translate
the AST into custom Java objects. It was ridiculously easy;
the tree grammar was just:

prog: expr*;

expr: <node1 a, b{new Node1(a, b);}
| <node2{new Node2();}
| ...
;

Perhaps my next step is getting rid of the Java AST
objects and just using the Antlr AST object.

Anyway, I've been very impressed with ANTLR.

As an aside, another idea of Terrence's was an IDE for
Antrlr, which I also thought was ridiculous when I first heard
about it. But now v3.0 is out and I've used the IDE and
it's as much a win for parsers as it is for source code.
Being able to interactively tweak the grammar and
parse programs with each revision, having a CST and AST
view, being able to single step through the parse, etc.
Marshall

Jun 14 '07 #16
Jan Hidders wrote:
>hmmm. actually I'm not sure what $dvar/string() was intended to mean,
and I don't have time right now to check that.

It retrieves the text node children of the nodes in $dvar.
No, that would be $dvar/text().

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Jun 14 '07 #17
On 14 jun, 07:49, Joe Kesselman <keshlam-nos...@comcast.netwrote:
Jan Hidders wrote:
hmmm. actually I'm not sure what $dvar/string() was intended to mean,
and I don't have time right now to check that.
It retrieves the text node children of the nodes in $dvar.

No, that would be $dvar/text().
Oops, you are right of course. The function string() concatenates all
the string values of the descendant text nodes. So in this case it
simply retrieves the string value of the text node under the element
node that is in $dvar.

-- Jan Hidders

Jun 14 '07 #18
On 14 jun, 01:47, Marshall <marshall.spi...@gmail.comwrote:
>
It seems clear enough, the idea of regarding an XML document
as a syntax tree, and the DTD as a grammar. In which case,
*XML*parsing is then the analog of ... what? Converting an
ascii version of a syntax tree into a binary syntax tree?
Sort of like parsing s-expressions?
Sure, ordered node-labeled trees, S-expressions, XML documents,
recursively nested terms, et cetera, are all more or less the same
thing. Of course the typical use cases are different, which justifies
having diferent transformation / update / query / selection languages
and formalisms.
If that analysis is valid, then I'd be interested to continue to
a comparison with other techniques for accomplishing the
same thing, particularly from the standpoint of expressive
power, but also from the standpoint of convenience and
generality.

Other techniques I can think of are:

1) ad hoc traversal of an object graph in an OO language
2) pattern matching over abstract data types, as introduced
by SML. (The gold standard.)
3) "Tree walking" a la Terrance Parr / Antlr (the upstart.)
Yep, Tree automata, Tree transducers, Attribute grammars, XQuery,
XSLT... all have their typical use cases. Is that what you are
interested in?
2) I would be very interested to hear a comparison with
pattern matching from someone who is familiar with both
techniques. One thing I can see immediately is that XPath
expressions allow one to address only some specific
part of a tree that one is interested in, whereas pattern
matching requires whole-tree analysis. How much of an
issue is this in practice?
I think that question needs more context in order to be meaningful.
The pattern matching mechanism is usually embedded in another
languages (some functional programming language like Caml, XSLT or
XQuery). Depending upon that language it may or may not be a problem
if your pattern matching mechanism lacks certain expressive power.
Btw., if you forget about sequence order, document order and value
comparisons for a moment, the core of XPath is actually a relatively
neat calculus of binary relations:
- p_1/p_2 is the concatenation of binary relations
- p_1[p_2] is the selection of pairs in p_2 whose right-hand side
matches a left-hand side in p_2 (semijoin anyone?)
- p_1 union p_2 is the set union of two binary relations
- p_1 intersect p_2 is the set intersection of two binary relations
- p_1 except p_2 is the set difference between binary relations
Did I mention Tarski already? :-) Of course we had to wait until
XPath2.0 until the three set operations were all allowed everywhere,
but we have them now.

I guess we're talking about Relation Algebra here? I don't know
enough about it to know if this is just an interesting comparison
of if there's some specific result being alluded to.
Yep.

<http://staff.science.uva.nl/~marx/talks/2005/screen_icdt.pdf>

Page 12.

-- Jan Hidders

Jun 14 '07 #19
On 14 jun, 01:39, Vadim Tropashko <vadimtro_inva...@yahoo.comwrote:
On Jun 13, 3:47 pm, Jan Hidders <hidd...@gmail.comwrote:
Of course, all computation, including RDBMS and XML querying and
transformation, is ultimately just string manipulation. Doesn't mean
that Turing Machines are always the most appropriate formalism for
describing them, does it? :-)

You seem to imply that language theory reduces to trivial string
manipulation.
Trivial? You think that the string manipulations expressed by Turing
Machines are trivial?
I expected you to request clarification why a query is a
language intersection; is there any nonvacuous idea behind this
statement? And insist on demonstating that the reg exp method does
work:-)
I though insisting that it cannot work would accomplish the same. But
your explanation still does not seem complete. Also you are still a
bit vague on what you are claiming. Are you saying that anything for
which one would typically use XPath can be done just as well with
regular expressions assuming that the tree is somehow encoded in a
string? This is clearly not the case for the encoding you gave. Or are
you going to extend that? Perhaps also extend the regular expressions
a little? Or did you just have a particular query in mind? That would
probably not be a very interesting observation. So which is it?

-- Jan Hidders

Jun 14 '07 #20

"Jan Hidders" <hi*****@gmail.comwrote in message
news:11**********************@d30g2000prg.googlegr oups.com...
On 13 jun, 22:58, Vadim Tropashko <vadimtro_inva...@yahoo.comwrote:
On Jun 13, 9:11 am, Vadim Tropashko <vadimtro_inva...@yahoo.com>
wrote:
1 statement_block
1.1 'declare' (matches token #1)
1.2 declaration_item_list
1.2.1 identifier (matches token #2)
1.2.2 'integer' (matches token #3)
1.2.3 ';' (matches token #4)
1.3 'begin' (matches token #5)
1.4 statements
1.4.1 assignment
1.4.1.1 identifier
1.4.1.2 ':='
1.4.1.3 number
1.4.1.4 ';'
1.4.2 statements (matches empty token)
1.5 'end';
Actually, here is quite simple method. Write down the parse tree as a
set of paths, using grammar varibles and constants (terminals and
nonterminals) as path elements. In the example above we get:

statement_block
statement_block.'declare'
statement_block.declaration_item_list
statement_block.declaration_item_list.identifier"i "
statement_block.declaration_item_list.'integer'
... and so on ...

Query this set with regular expressions.

XML, good bye!

Not so fast. An enumeration of the paths in a tree does not always
give you enough information to reconstruct the tree, so you would be
losing expressive power. Never mind that you ignore the order, which
may be a factor for the scope of declarations.
I don't understand why enumerating the paths ignores the order.
-- Jan Hidders

Jun 14 '07 #21
On 14 jun, 12:47, "David Cressey" <cresse...@verizon.netwrote:
"Jan Hidders" <hidd...@gmail.comwrote in message

news:11**********************@d30g2000prg.googlegr oups.com...
On 13 jun, 22:58, Vadim Tropashko <vadimtro_inva...@yahoo.comwrote:
On Jun 13, 9:11 am, Vadim Tropashko <vadimtro_inva...@yahoo.com>
wrote:
Actually, here is quite simple method. Write down the parse tree as a
set of paths, using grammar varibles and constants (terminals and
nonterminals) as path elements. In the example above we get:
statement_block
statement_block.'declare'
statement_block.declaration_item_list
statement_block.declaration_item_list.identifier"i "
statement_block.declaration_item_list.'integer'
... and so on ...
Query this set with regular expressions.
XML, good bye!
Not so fast. An enumeration of the paths in a tree does not always
give you enough information to reconstruct the tree, so you would be
losing expressive power. Never mind that you ignore the order, which
may be a factor for the scope of declarations.

I don't understand why enumerating the paths ignores the order.
Take the trees a( b( c ), d ) and a( d, b( c) ). The *set* of paths is
in both cases: { a, a.b, a.b.c, a.d }.

Of course, making it a list and making sure you notice the end of
subtrees solves that problem. So the tree a( b( c ), b( c )) would
generate the list [a, a.b, a.b.c, a.b, a.b.c] and a( b( c, c)) would
generate the list [a, a.b, a.b.c, a.b.c].

-- Jan Hidders

Jun 14 '07 #22
On Jun 14, 2:47 am, Jan Hidders <hidd...@gmail.comwrote:
Are you saying that anything for
which one would typically use XPath can be done just as well with
regular expressions assuming that the tree is somehow encoded in a
string?
A set of strings
This is clearly not the case for the encoding you gave. Or are
you going to extend that? Perhaps also extend the regular expressions
a little?
I was going to define tree query in pure language settings, be it
regular languges, context free grammars, or else.
Or did you just have a particular query in mind?
Let's start from scratch again with simpler example. Given a grammar

expr :
expr '+' expr
| '(' expr ')'
| '1'

and derivation tree find all the '1's that are nested within a pair of
brackets. In the example

1 expr
1.1 expr
1.1.1 (
1.1.2 expr
1.1.2.1 expr
1.1.2.1.1 1
1.1.2.2 +
1.1.2.3 expr
1.1.2.3.1 1
1.1.3 )
1.2 +
1.2 expr
1.2.1 1

the query should return nodes 1.1.2.1.1 and 1.1.2.3.1

Jun 14 '07 #23
On Jun 14, 9:41 am, Vadim Tropashko <vadimtro_inva...@yahoo.com>
wrote:
On Jun 14, 2:47 am, Jan Hidders <hidd...@gmail.comwrote:
Are you saying that anything for
which one would typically use XPath can be done just as well with
regular expressions assuming that the tree is somehow encoded in a
string?

A set of strings
This is clearly not the case for the encoding you gave. Or are
you going to extend that? Perhaps also extend the regular expressions
a little?

I was going to define tree query in pure language settings, be it
regular languges, context free grammars, or else.
Or did you just have a particular query in mind?

Let's start from scratch again with simpler example. Given a grammar

expr :
expr '+' expr
| '(' expr ')'
| '1'

and derivation tree find all the '1's that are nested within a pair of
brackets. In the example

1 expr
1.1 expr
1.1.1 (
1.1.2 expr
1.1.2.1 expr
1.1.2.1.1 1
1.1.2.2 +
1.1.2.3 expr
1.1.2.3.1 1
1.1.3 )
1.2 +
1.2 expr
1.2.1 1

the query should return nodes 1.1.2.1.1 and 1.1.2.3.1
Actually, this query is easy in the relational world. For an open
parenthesis node, find a parent, and then all the descendants such
that ... One may use nested sets encoding, or nested intervals in
matrix encoding (where finding parent is much more transparent).

Alternatively, one can build transitively closed relationship of all
the token precedences, then finding all the tokens nested inside some
pair of parenthesis is expressed via 2 joins query.

So, I have hard time to define a concept of query in pure language
settings...
Jun 14 '07 #24
Reminder: The best comparison for a database query, in the XML world, is
XQuery -- a synthesis of XPath and XSLT. It's structural rather than
relational, but many of the concepts are the same and XQuery is probably
the best spec to study if you want to see how they map into each other.

(The XSLT 2.0 spec, by the way, is literally generated from the same
source files as the XQuery spec; it drops a few features, adds a few
features, and uses slightly different syntax but the underlying
semantics are intended to be identical.)

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Jun 14 '07 #25
On 14 jun, 18:41, Vadim Tropashko <vadimtro_inva...@yahoo.comwrote:
On Jun 14, 2:47 am, Jan Hidders <hidd...@gmail.comwrote:
Are you saying that anything for
which one would typically use XPath can be done just as well with
regular expressions assuming that the tree is somehow encoded in a
string?

A set of strings
This is clearly not the case for the encoding you gave. Or are
you going to extend that? Perhaps also extend the regular expressions
a little?

I was going to define tree query in pure language settings, be it
regular languges, context free grammars, or else.
?? How does one query a set of strings in a "pure language setting"?
Of course you might select certain strings from the set with something
that accepts strings from a certain language, but that is clearly
inadequate because you cannot take the "context" of the node into
account. So what did you have in mind?

-- Jan Hidders

Jun 14 '07 #26
On Jun 14, 10:35 am, Jan Hidders <hidd...@gmail.comwrote:
On 14 jun, 18:41, Vadim Tropashko <vadimtro_inva...@yahoo.comwrote:
I was going to define tree query in pure language settings, be it
regular languges, context free grammars, or else.

?? How does one query a set of strings in a "pure language setting"?
Of course you might select certain strings from the set with something
that accepts strings from a certain language, but that is clearly
inadequate because you cannot take the "context" of the node into
account.
And what the node's "context" would be? A set of attributes? If so,
then we are in pure relational world. I don't feel comfortable,
however, that we use the two completely different mechanics: languges
for parsing and building the derivation tree, and relations for
querying.

The question is if all these attributes are not redundant and can't be
collapsed into a single attribute. Then language approach would become
possible.
Jun 14 '07 #27
On 14 jun, 19:13, Vadim Tropashko <vadimtro_inva...@yahoo.comwrote:
On Jun 14, 9:41 am, Vadim Tropashko <vadimtro_inva...@yahoo.com>
wrote:
On Jun 14, 2:47 am, Jan Hidders <hidd...@gmail.comwrote:
Are you saying that anything for
which one would typically use XPath can be done just as well with
regular expressions assuming that the tree is somehow encoded in a
string?
A set of strings
This is clearly not the case for the encoding you gave. Or are
you going to extend that? Perhaps also extend the regular expressions
a little?
I was going to define tree query in pure language settings, be it
regular languges, context free grammars, or else.
Or did you just have a particular query in mind?
Let's start from scratch again with simpler example. Given a grammar
expr :
expr '+' expr
| '(' expr ')'
| '1'
and derivation tree find all the '1's that are nested within a pair of
brackets. In the example
1 expr
1.1 expr
1.1.1 (
1.1.2 expr
1.1.2.1 expr
1.1.2.1.1 1
1.1.2.2 +
1.1.2.3 expr
1.1.2.3.1 1
1.1.3 )
1.2 +
1.2 expr
1.2.1 1
the query should return nodes 1.1.2.1.1 and 1.1.2.3.1

Actually, this query is easy in the relational world. For an open
parenthesis node, find a parent, and then all the descendants such
that ... One may use nested sets encoding, or nested intervals in
matrix encoding (where finding parent is much more transparent).
Indeed. In fact, there are close and interesting connections between
XPath and first order logic, presuming that the ancestor-descendant
relation is given to you (and not only the parent-child relation) and
the sibling-order relation.
So, I have hard time to define a concept of query in pure language
settings...
It's probably not the direction where you wanted to go, but the real
connection with formal language theory lies in tree automata which are
a generalization of finite automata for trees. Strings are after all
just a special kind of node-labeled tree where each node has at most
one child.

-- Jan Hidders

Jun 14 '07 #28
On 14 jun, 19:54, Vadim Tropashko <vadimtro_inva...@yahoo.comwrote:
On Jun 14, 10:35 am, Jan Hidders <hidd...@gmail.comwrote:
On 14 jun, 18:41, Vadim Tropashko <vadimtro_inva...@yahoo.comwrote:
I was going to define tree query in pure language settings, be it
regular languges, context free grammars, or else.
?? How does one query a set of strings in a "pure language setting"?
Of course you might select certain strings from the set with something
that accepts strings from a certain language, but that is clearly
inadequate because you cannot take the "context" of the node into
account.

And what the node's "context" would be?
The nodes around it, the preceding siblings, the following sibling, et
cetera. How do I ask for an "a" node that is immediately preceded by a
"b" node?

-- Jan Hidders

Jun 14 '07 #29
On Jun 14, 10:14 am, Jan Hidders <hidd...@gmail.comwrote:
On 14 jun, 19:54, Vadim Tropashko <vadimtro_inva...@yahoo.comwrote:
On Jun 14, 10:35 am, Jan Hidders <hidd...@gmail.comwrote:
On 14 jun, 18:41, Vadim Tropashko <vadimtro_inva...@yahoo.comwrote:
I was going to define tree query in pure language settings, be it
regular languges, context free grammars, or else.
?? How does one query a set of strings in a "pure language setting"?
Of course you might select certain strings from the set with something
that accepts strings from a certain language, but that is clearly
inadequate because you cannot take the "context" of the node into
account.
And what the node's "context" would be?

The nodes around it, the preceding siblings, the following sibling, et
cetera. How do I ask for an "a" node that is immediately preceded by a
"b" node?
No you don't need that many, only two. One is the absolute position of
the node in the hierarchy, for example path, or 4 integers
corresponding to the matrix encoding. The second one is the node
payload -- grammar token in this case. They have completely different
structure, so I'm most likely wrong expecting to be able to operate a
single attribute.

Jun 14 '07 #30
On Jun 13, 4:47 pm, Marshall <marshall.spi...@gmail.comwrote:
>
My hope is that this post will stimulate an educated and informative
discussion of the relative merits of different approaches to
tree traversal, not neglecting to differentiate among the needs
of diverse use cases.

Hey, it could happen.
Or, you know, not.
Marshall
Jun 16 '07 #31

This discussion thread is closed

Replies have been disabled for this discussion.

Similar topics

3 posts views Thread by ClŠudia Morgado | last post: by
13 posts views Thread by Anton.Nikiforov | last post: by
1 post views Thread by News Server | last post: by
4 posts views Thread by Daisy | last post: by
5 posts views Thread by clintonG | last post: by
3 posts views Thread by Bennett Haselton | last post: by
1 post views Thread by CARIGAR | last post: by
reply views Thread by zhoujie | last post: by
reply views Thread by suresh191 | last post: by
By using this site, you agree to our Privacy Policy and Terms of Use.