469,608 Members | 1,508 Online
Bytes | Developer Community
New Post

Home Posts Topics Members FAQ

Post your question to a community of 469,608 developers. It's quick & easy.

Database design for tree structured data

I have a data set which I need to analyze but I am having a problem
figuring out a structure for the database - or whether there are better
ways of attacking the problem.

The base data set is a large number of replies to a survey
questionaire. I have adapted a NLP program to produce lexically annotated
structural trees which appear to reasonably accurately reduce the text to
descending trees. These trees consist of multiple nodes each of which
consists of sub-nodes to an indeterminate level (normally 1 to 7 subnodes)
wih each node containing 1-5 branches depending on the statistical
probability of the branch of the branch. Without getting pedantic, it
quickly becomes a very complex tree and manual interpretation of the 25k
or so sentences is impractical. What I am specifically looking for is a
data structure to contain the sentences, the lexical descriptions of
phrases within the sentence with a descent into each phrase that
classifies each part of the phrase until the entire entry decomposes into
individual words classified by lexical type. I can get the information to
populate the structure - I just can't figure out a way to store the
results for aggregate study.

Can someone suggest possible database designs for tree-strucured data such
as this or point me to references dealing with this type of analysis? I
cannot visualize a usable structure and "you can't get there from here"
would be just as appropriate an answer as any. Suggestions on
tackling aggregation fo this form of data would be greatly appreciated.

Nov 24 '06 #1
10 5767

Will Honea wrote:
I have a data set which I need to analyze but I am having a problem
figuring out a structure for the database - or whether there are better
ways of attacking the problem.

The base data set is a large number of replies to a survey
questionaire. I have adapted a NLP program to produce lexically annotated
structural trees which appear to reasonably accurately reduce the text to
descending trees. These trees consist of multiple nodes each of which
consists of sub-nodes to an indeterminate level (normally 1 to 7 subnodes)
wih each node containing 1-5 branches depending on the statistical
probability of the branch of the branch. Without getting pedantic, it
quickly becomes a very complex tree and manual interpretation of the 25k
or so sentences is impractical. What I am specifically looking for is a
data structure to contain the sentences, the lexical descriptions of
phrases within the sentence with a descent into each phrase that
classifies each part of the phrase until the entire entry decomposes into
individual words classified by lexical type. I can get the information to
populate the structure - I just can't figure out a way to store the
results for aggregate study.

Can someone suggest possible database designs for tree-strucured data such
as this or point me to references dealing with this type of analysis? I
cannot visualize a usable structure and "you can't get there from here"
would be just as appropriate an answer as any. Suggestions on
tackling aggregation fo this form of data would be greatly appreciated.
There are a number of ways to represent trees in a rdbms. Google for
transitive closure, nested set, adjancy list. Heres a link to my notes
from implementing a tree in db2.

http://fungus.teststation.com/~jon/t...eeHandling.htm

Very sloppy but it should give you some ideas. I implemented add, move
and delete operations in triggers. A typical tree contained of 10^5 -
10^6 nodes, and Transitive closure (described as Path in the link, I
werent familiar with the term then) prox 10 times bigger.

I think I still have the ddl lying around somewhere, so if it would be
of interest, drop me a note
/Lennart

Nov 24 '06 #2
On Thu, 23 Nov 2006 21:37:46 -0800, Lennart wrote:
>Can someone suggest possible database designs for tree-strucured data such
as this or point me to references dealing with this type of analysis? I
cannot visualize a usable structure and "you can't get there from here"
would be just as appropriate an answer as any. Suggestions on
tackling aggregation fo this form of data would be greatly appreciated.

There are a number of ways to represent trees in a rdbms. Google for
transitive closure, nested set, adjancy list. Heres a link to my notes
from implementing a tree in db2.

http://fungus.teststation.com/~jon/t...eeHandling.htm

Very sloppy but it should give you some ideas. I implemented add, move
and delete operations in triggers. A typical tree contained of 10^5 -
10^6 nodes, and Transitive closure (described as Path in the link, I
werent familiar with the term then) prox 10 times bigger.

I think I still have the ddl lying around somewhere, so if it would be
of interest, drop me a note
Interesting - I had never considered it from this perspective. What you
seem to be doing is implementing a balanced tree structure although my
first thought is that I need nodes considerably wider. Let me think on
this a bit more... I can see where this might well fit as it
potentially abstracts the lexical construct from the input form making
the storage/search issues much more tractable.
Nov 24 '06 #3

Will Honea wrote:
On Thu, 23 Nov 2006 21:37:46 -0800, Lennart wrote:
[...]
Interesting - I had never considered it from this perspective. What you
seem to be doing is implementing a balanced tree structure although my
first thought is that I need nodes considerably wider.
I'm not sure what you mean. Could you explain more in detail what you
mean by wider. Assume the following table:

create table tree (
node_id int not null primary key,
parent_id int not null references tree
)

insert into tree (node_id, parent_id) values (1,1);
insert into tree (node_id, parent_id)
with iter (n) as (values 1 union all select n+1 from iter where n<1000)

select n,1 from iter;

isnt that wide enough?

/Lennart

Nov 24 '06 #4
On Fri, 24 Nov 2006 08:44:00 -0800, Lennart wrote:
>
Will Honea wrote:
>On Thu, 23 Nov 2006 21:37:46 -0800, Lennart wrote:
[...]
>Interesting - I had never considered it from this perspective. What you
seem to be doing is implementing a balanced tree structure although my
first thought is that I need nodes considerably wider.

I'm not sure what you mean. Could you explain more in detail what you
mean by wider. Assume the following table:

create table tree (
node_id int not null primary key,
parent_id int not null references tree
)

insert into tree (node_id, parent_id) values (1,1);
insert into tree (node_id, parent_id)
with iter (n) as (values 1 union all select n+1 from iter where n<1000)

select n,1 from iter;

isnt that wide enough?
In the general case, nodes in a tree can split into two OR MORE leaf
branches. This is of particular interest in certain cases for
deterministic keys as node width can make drastic changes in search times
when tuned for a particular data format. In my case, it also makes
implementation of statistically differetiated cases fairly simple. For
example, take the case of a sentence containing words which may be either
nouns or verbs. Multiple interpretations of the initial split into
subject and verb are likely - English is not a structurally well formed
language, to say the least - so multiple parallel descents with cumluative
statistical likelihood computed for each feasible branch allow evaluation
of each without retracing/duplicating alternate paths when the probability
of a given branch drops below a preset minimum.

Binary nodes are much simpler to manipulate in the case of insertion or
deletion where node-splitting or removal occur frequently but wider nodes
can be effective for given applications where lookup is the primay
consideration.
Nov 24 '06 #5

Will Honea wrote:
[...]
In the general case, nodes in a tree can split into two OR MORE leaf
branches. This is of particular interest in certain cases for
deterministic keys as node width can make drastic changes in search times
when tuned for a particular data format. In my case, it also makes
implementation of statistically differetiated cases fairly simple. For
example, take the case of a sentence containing words which may be either
nouns or verbs. Multiple interpretations of the initial split into
subject and verb are likely - English is not a structurally well formed
language, to say the least - so multiple parallel descents with cumluative
statistical likelihood computed for each feasible branch allow evaluation
of each without retracing/duplicating alternate paths when the probability
of a given branch drops below a preset minimum.

Binary nodes are much simpler to manipulate in the case of insertion or
deletion where node-splitting or removal occur frequently but wider nodes
can be effective for given applications where lookup is the primay
consideration.
I'm way out of my leage here, but is you input something like:

{S - sentence
{NP - NounPhrase
{D The}
{ADJ New}
{N Dog}
}
{VP - Barked}
}

Each noun, adj etc has certain attributes (neutrum, plural, etc).
Correct so far?

/Lennart

Nov 25 '06 #6
On Fri, 24 Nov 2006 18:39:09 -0800, Lennart wrote:
>
I'm way out of my leage here, but is you input something like:

{S - sentence
{NP - NounPhrase
{D The}
{ADJ New}
{N Dog}
}
{VP - Barked}
}

Each noun, adj etc has certain attributes (neutrum, plural, etc).
Correct so far?
The problem is that real world text, especially in English, consists of
words whose classification can be ambiguous. We have also become a nation
of illiterate writers - ask a recent high school or even college graduate
to decompose a sentence into it's grammactical parts! The vocabulary
also contributes to the problem. As a common example, the word "fish" can
be a noun or a verb with multiple definitions in each usage. This results
in having to parse the entire expression for each viable usage and
meaning. The end result is that mechanical translations produce multiple
possible outputs with the "best" translation being chosen based upon
statistical evaluation of the likelihood given the entire context - which
is not known until the finest grain parse is revealed. It is still a best
guess. I would be overjoyed to get 85% correlation to the intended
meanings over a large sample population such as the content of a newspaper
page.
Nov 25 '06 #7
Will Honea wrote:
[...]
>
The problem is that real world text, especially in English, consists of
words whose classification can be ambiguous. We have also become a nation
of illiterate writers - ask a recent high school or even college graduate
to decompose a sentence into it's grammactical parts! The vocabulary
also contributes to the problem. As a common example, the word "fish" can
be a noun or a verb with multiple definitions in each usage. This results
in having to parse the entire expression for each viable usage and
meaning. The end result is that mechanical translations produce multiple
possible outputs with the "best" translation being chosen based upon
statistical evaluation of the likelihood given the entire context - which
is not known until the finest grain parse is revealed. It is still a best
guess. I would be overjoyed to get 85% correlation to the intended
meanings over a large sample population such as the content of a newspaper
page.
Could you provide a short sample of the output from your NLP program? I
dont get a good grip of your problem, but then I'm one of those
illiterate readers :-)

Nov 25 '06 #8
> Can someone suggest possible database designs for tree-strucured data such as this or point me to references dealing with this type of analysis? <<

Get a copy of TREES & HIERARCH8IES IN SQL from Amazon.

Nov 25 '06 #9
On Fri, 24 Nov 2006 23:56:18 -0800, Lennart wrote:
Could you provide a short sample of the output from your NLP program? I
dont get a good grip of your problem, but then I'm one of those
illiterate readers :-)
We seem to be holding a dialogue here so let's move it to email. I'll see
if I can garner some samples tonight and send them to you.
Nov 25 '06 #10
On Sat, 25 Nov 2006 05:15:27 -0800, --CELKO-- wrote:
>> Can someone suggest possible database designs for tree-strucured data such as this or point me to references dealing with this type of analysis? <<

Get a copy of TREES & HIERARCH8IES IN SQL from Amazon.
Thank you, that looks useful. I think I've been going at this from the
wrong angle all along.
Nov 25 '06 #11

This discussion thread is closed

Replies have been disabled for this discussion.

Similar topics

2 posts views Thread by hplloyd | last post: by
3 posts views Thread by Merlin | last post: by
2 posts views Thread by Kums | last post: by
1 post views Thread by Corinne | last post: by
44 posts views Thread by Mikito Harakiri | last post: by
reply views Thread by Solution2021 | last post: by
reply views Thread by devrayhaan | last post: by
By using this site, you agree to our Privacy Policy and Terms of Use.