I have a data set which I need to analyze, but I am having a problem
figuring out a structure for the database - or whether there are better
ways of attacking the problem.
The base data set is a large number of replies to a survey
questionnaire. I have adapted an NLP program to produce lexically annotated
structural trees which appear to reduce the text to descending trees
reasonably accurately. These trees consist of multiple nodes, each of which
consists of sub-nodes to an indeterminate level (normally 1 to 7 sub-nodes),
with each node containing 1-5 branches depending on the statistical
probability of the branch. Without getting pedantic, it
quickly becomes a very complex tree, and manual interpretation of the 25k
or so sentences is impractical. What I am specifically looking for is a
data structure to contain the sentences and the lexical descriptions of
phrases within each sentence, with a descent into each phrase that
classifies each part of the phrase until the entire entry decomposes into
individual words classified by lexical type. I can get the information to
populate the structure - I just can't figure out a way to store the
results for aggregate study.
Can someone suggest possible database designs for tree-structured data such
as this, or point me to references dealing with this type of analysis? I
cannot visualize a usable structure, and "you can't get there from here"
would be just as appropriate an answer as any. Suggestions on
tackling aggregation of this form of data would be greatly appreciated.
Will Honea wrote:
[...]
There are a number of ways to represent trees in an RDBMS. Google for
transitive closure, nested set, adjacency list. Here's a link to my notes
from implementing a tree in DB2: http://fungus.teststation.com/~jon/t...eeHandling.htm
Very sloppy, but it should give you some ideas. I implemented add, move
and delete operations in triggers. A typical tree contained 10^5 -
10^6 nodes, and the transitive closure (described as Path in the link - I
wasn't familiar with the term then) was approx 10 times bigger.
I think I still have the DDL lying around somewhere, so if it would be
of interest, drop me a note
/Lennart
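Lennart's adjacency-list-plus-transitive-closure ("Path") idea can be sketched as follows. This uses SQLite purely for illustration (his notes are for DB2), and the table and column names are assumptions, not his actual DDL; the closure maintenance he did in triggers is shown here as a plain function.

```python
# Sketch: adjacency list plus a closure ("Path") table, so descendants
# can be fetched without recursion. Names are illustrative only.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table tree (
        node_id   integer primary key,
        parent_id integer not null references tree
    );
    -- closure table: one row per (ancestor, descendant) pair
    create table path (
        ancestor   integer not null,
        descendant integer not null,
        depth      integer not null,
        primary key (ancestor, descendant)
    );
""")

def add_node(node_id, parent_id):
    """Insert a node and maintain the closure table (trigger analogue)."""
    con.execute("insert into tree values (?, ?)", (node_id, parent_id))
    con.execute("insert into path values (?, ?, 0)", (node_id, node_id))
    if node_id != parent_id:  # the root points at itself here
        con.execute("""
            insert into path (ancestor, descendant, depth)
            select ancestor, ?, depth + 1 from path where descendant = ?
        """, (node_id, parent_id))

add_node(1, 1)   # root
add_node(2, 1)   # two children of the root
add_node(3, 1)
add_node(4, 2)   # a grandchild

# all descendants of node 1, found with a single indexed lookup:
rows = con.execute(
    "select descendant from path where ancestor = 1 and depth > 0"
    " order by descendant"
).fetchall()
print([r[0] for r in rows])  # -> [2, 3, 4]
```

The trade-off matches what Lennart reports: the closure table runs roughly an order of magnitude larger than the tree itself, in exchange for cheap subtree queries.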
On Thu, 23 Nov 2006 21:37:46 -0800, Lennart wrote:
[...]
Interesting - I had never considered it from this perspective. What you
seem to be doing is implementing a balanced tree structure, although my
first thought is that I need nodes considerably wider. Let me think on
this a bit more... I can see where this might well fit, as it
potentially abstracts the lexical construct from the input form, making
the storage/search issues much more tractable.
Will Honea wrote:
On Thu, 23 Nov 2006 21:37:46 -0800, Lennart wrote:
[...]
Interesting - I had never considered it from this perspective. What you
seem to be doing is implementing a balanced tree structure although my
first thought is that I need nodes considerably wider.
I'm not sure what you mean. Could you explain in more detail what you
mean by wider? Assume the following table:
create table tree (
node_id int not null primary key,
parent_id int not null references tree
);
insert into tree (node_id, parent_id) values (1,1);
insert into tree (node_id, parent_id)
with iter (n) as (values 2 union all select n+1 from iter where n<1000)
select n, 1 from iter;
isn't that wide enough?
/Lennart
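Lennart's point is that an adjacency list imposes no fan-out limit at all: one parent can hold arbitrarily many children. His DB2 insert can be rerun against SQLite to show the shape of the table it produces (SQLite needs "with recursive" where DB2 accepts a plain "with"; the CTE here starts at 2 so the root row is not inserted twice):

```python
# Re-creating Lennart's wide tree in SQLite: one root with 999 children.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    create table tree (
        node_id   integer not null primary key,
        parent_id integer not null references tree
    )
""")
con.execute("insert into tree values (1, 1)")  # self-referencing root
con.execute("""
    insert into tree (node_id, parent_id)
    with recursive iter (n) as (
        select 2 union all select n + 1 from iter where n < 1000
    )
    select n, 1 from iter
""")

(count,) = con.execute(
    "select count(*) from tree where parent_id = 1 and node_id <> 1"
).fetchone()
print(count)  # -> 999 direct children of a single node
```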
On Fri, 24 Nov 2006 08:44:00 -0800, Lennart wrote:
[...]
In the general case, nodes in a tree can split into two OR MORE leaf
branches. This is of particular interest in certain cases for
deterministic keys, as node width can make drastic changes in search times
when tuned for a particular data format. In my case, it also makes
implementation of statistically differentiated cases fairly simple. For
example, take the case of a sentence containing words which may be either
nouns or verbs. Multiple interpretations of the initial split into
subject and verb are likely - English is not a structurally well-formed
language, to say the least - so multiple parallel descents, with a cumulative
statistical likelihood computed for each feasible branch, allow evaluation
of each without retracing/duplicating alternate paths when the probability
of a given branch drops below a preset minimum.
Binary nodes are much simpler to manipulate in the case of insertion or
deletion, where node-splitting or removal occur frequently, but wider nodes
can be effective for applications where lookup is the primary
consideration.
Will Honea wrote:
[...]
I'm way out of my league here, but is your input something like:
{S - sentence
{NP - NounPhrase
{D The}
{ADJ New}
{N Dog}
}
{VP - Barked}
}
Each noun, adj etc. has certain attributes (neuter, plural, etc.).
Correct so far?
/Lennart
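For what it's worth, a bracketed parse like Lennart's example maps naturally onto the adjacency-list table discussed earlier: each constituent becomes a row with a parent pointer plus a label column, and leaves also carry the word. A sketch in SQLite (the column names are illustrative only, not from the thread):

```python
# Storing {S {NP {D The}{ADJ New}{N Dog}} {VP Barked}} as tree rows.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    create table parse (
        node_id   integer primary key,
        parent_id integer not null references parse,
        label     text not null,   -- S, NP, VP, D, ADJ, N, ...
        word      text             -- null for non-leaf constituents
    )
""")
rows = [
    (1, 1, "S",   None),
    (2, 1, "NP",  None),
    (3, 2, "D",   "The"),
    (4, 2, "ADJ", "New"),
    (5, 2, "N",   "Dog"),
    (6, 1, "VP",  "Barked"),
]
con.executemany("insert into parse values (?, ?, ?, ?)", rows)

# e.g. the words of the noun phrase, in order:
np = con.execute(
    "select word from parse where parent_id = 2 and word is not null"
    " order by node_id"
).fetchall()
print([w[0] for w in np])  # -> ['The', 'New', 'Dog']
```

Aggregate study then becomes ordinary SQL over the label and word columns, e.g. counting how often a given label appears across all 25k sentences.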
On Fri, 24 Nov 2006 18:39:09 -0800, Lennart wrote:
[...]
The problem is that real-world text, especially in English, consists of
words whose classification can be ambiguous. We have also become a nation
of illiterate writers - ask a recent high school or even college graduate
to decompose a sentence into its grammatical parts! The vocabulary
also contributes to the problem. As a common example, the word "fish" can
be a noun or a verb, with multiple definitions in each usage. This results
in having to parse the entire expression for each viable usage and
meaning. The end result is that mechanical translations produce multiple
possible outputs, with the "best" translation being chosen based upon
statistical evaluation of the likelihood given the entire context - which
is not known until the finest-grain parse is revealed. It is still a best
guess. I would be overjoyed to get 85% correlation to the intended
meanings over a large sample population such as the content of a newspaper
page.
Will Honea wrote:
[...]
Could you provide a short sample of the output from your NLP program? I
don't get a good grip on your problem, but then I'm one of those
illiterate readers :-)
> Can someone suggest possible database designs for tree-structured data such as this or point me to references dealing with this type of analysis? <<
Get a copy of TREES & HIERARCHIES IN SQL from Amazon.
On Fri, 24 Nov 2006 23:56:18 -0800, Lennart wrote:
[...]
We seem to be holding a dialogue here so let's move it to email. I'll see
if I can garner some samples tonight and send them to you.
On Sat, 25 Nov 2006 05:15:27 -0800, --CELKO-- wrote:
[...]
Get a copy of TREES & HIERARCHIES IN SQL from Amazon.
Thank you, that looks useful. I think I've been going at this from the
wrong angle all along.