I have a data set which I need to analyze, but I am having a problem
figuring out a structure for the database - or whether there are better
ways of attacking the problem.
The base data set is a large number of replies to a survey
questionnaire. I have adapted an NLP program to produce lexically annotated
structural trees which appear to reduce the text to descending trees
reasonably accurately. These trees consist of multiple nodes, each of which
consists of sub-nodes to an indeterminate level (normally 1 to 7 sub-nodes),
with each node containing 1-5 branches depending on the statistical
probability of the branch. Without getting pedantic, it
quickly becomes a very complex tree, and manual interpretation of the 25k
or so sentences is impractical. What I am specifically looking for is a
data structure to contain the sentences and the lexical descriptions of
phrases within each sentence, with a descent into each phrase that
classifies each part of the phrase until the entire entry decomposes into
individual words classified by lexical type. I can get the information to
populate the structure - I just can't figure out a way to store the
results for aggregate study.
Can someone suggest possible database designs for tree-structured data such
as this, or point me to references dealing with this type of analysis? I
cannot visualize a usable structure, and "you can't get there from here"
would be just as appropriate an answer as any. Suggestions on
tackling aggregation of this form of data would be greatly appreciated.
Will Honea wrote:
[...]
There are a number of ways to represent trees in an RDBMS. Google for
transitive closure, nested set, adjacency list. Here's a link to my notes
from implementing a tree in DB2: http://fungus.teststation.com/~jon/t...eeHandling.htm
Very sloppy, but it should give you some ideas. I implemented add, move
and delete operations in triggers. A typical tree contained 10^5 -
10^6 nodes, and the transitive closure (described as Path in the link - I
wasn't familiar with the term then) was approx 10 times bigger.
I think I still have the DDL lying around somewhere, so if it would be
of interest, drop me a note
/Lennart
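Lennart's adjacency-list-plus-transitive-closure ("Path") idea can be sketched as follows. This uses SQLite purely for illustration (his notes are for DB2), and the table and column names are assumptions, not his actual DDL; the closure maintenance he did in triggers is shown here as a plain function.

```python
# Sketch: adjacency list plus a closure ("Path") table, so descendants
# can be fetched without recursion. Names are illustrative only.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table tree (
        node_id   integer primary key,
        parent_id integer not null references tree
    );
    -- closure table: one row per (ancestor, descendant) pair
    create table path (
        ancestor   integer not null,
        descendant integer not null,
        depth      integer not null,
        primary key (ancestor, descendant)
    );
""")

def add_node(node_id, parent_id):
    """Insert a node and maintain the closure table (trigger analogue)."""
    con.execute("insert into tree values (?, ?)", (node_id, parent_id))
    con.execute("insert into path values (?, ?, 0)", (node_id, node_id))
    if node_id != parent_id:  # the root points at itself here
        con.execute("""
            insert into path (ancestor, descendant, depth)
            select ancestor, ?, depth + 1 from path where descendant = ?
        """, (node_id, parent_id))

add_node(1, 1)   # root
add_node(2, 1)   # two children of the root
add_node(3, 1)
add_node(4, 2)   # a grandchild

# all descendants of node 1, found with a single indexed lookup:
rows = con.execute(
    "select descendant from path where ancestor = 1 and depth > 0"
    " order by descendant"
).fetchall()
print([r[0] for r in rows])  # -> [2, 3, 4]
```

The trade-off matches what Lennart reports: the closure table runs roughly an order of magnitude larger than the tree itself, in exchange for cheap subtree queries.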
On Thu, 23 Nov 2006 21:37:46 -0800, Lennart wrote:
[...]
Interesting - I had never considered it from this perspective. What you
seem to be doing is implementing a balanced tree structure, although my
first thought is that I need nodes considerably wider. Let me think on
this a bit more... I can see where this might well fit, as it
potentially abstracts the lexical construct from the input form, making
the storage/search issues much more tractable.
Will Honea wrote:
On Thu, 23 Nov 2006 21:37:46 -0800, Lennart wrote:
[...]
Interesting - I had never considered it from this perspective. What you
seem to be doing is implementing a balanced tree structure although my
first thought is that I need nodes considerably wider.
I'm not sure what you mean. Could you explain in more detail what you
mean by wider? Assume the following table:
create table tree (
node_id int not null primary key,
parent_id int not null references tree
);
insert into tree (node_id, parent_id) values (1,1);
insert into tree (node_id, parent_id)
with iter (n) as (values 2 union all select n+1 from iter where n<1000)
select n, 1 from iter;
isn't that wide enough?
/Lennart
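Lennart's point is that an adjacency list imposes no fan-out limit at all: one parent can hold arbitrarily many children. His DB2 insert can be rerun against SQLite to show the shape of the table it produces (SQLite needs "with recursive" where DB2 accepts a plain "with"; the CTE here starts at 2 so the root row is not inserted twice):

```python
# Re-creating Lennart's wide tree in SQLite: one root with 999 children.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    create table tree (
        node_id   integer not null primary key,
        parent_id integer not null references tree
    )
""")
con.execute("insert into tree values (1, 1)")  # self-referencing root
con.execute("""
    insert into tree (node_id, parent_id)
    with recursive iter (n) as (
        select 2 union all select n + 1 from iter where n < 1000
    )
    select n, 1 from iter
""")

(count,) = con.execute(
    "select count(*) from tree where parent_id = 1 and node_id <> 1"
).fetchone()
print(count)  # -> 999 direct children of a single node
```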
On Fri, 24 Nov 2006 08:44:00 -0800, Lennart wrote:
[...]
In the general case, nodes in a tree can split into two OR MORE leaf
branches. This is of particular interest in certain cases for
deterministic keys, as node width can make drastic changes in search times
when tuned for a particular data format. In my case, it also makes
implementation of statistically differentiated cases fairly simple. For
example, take the case of a sentence containing words which may be either
nouns or verbs. Multiple interpretations of the initial split into
subject and verb are likely - English is not a structurally well-formed
language, to say the least - so multiple parallel descents, with a cumulative
statistical likelihood computed for each feasible branch, allow evaluation
of each without retracing/duplicating alternate paths when the probability
of a given branch drops below a preset minimum.
Binary nodes are much simpler to manipulate in the case of insertion or
deletion, where node-splitting or removal occur frequently, but wider nodes
can be effective for applications where lookup is the primary
consideration.
Will Honea wrote:
[...]
I'm way out of my league here, but is your input something like:
{S - sentence
{NP - NounPhrase
{D The}
{ADJ New}
{N Dog}
}
{VP - Barked}
}
Each noun, adj etc. has certain attributes (neuter, plural, etc.).
Correct so far?
/Lennart
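For what it's worth, a bracketed parse like Lennart's example maps naturally onto the adjacency-list table discussed earlier: each constituent becomes a row with a parent pointer plus a label column, and leaves also carry the word. A sketch in SQLite (the column names are illustrative only, not from the thread):

```python
# Storing {S {NP {D The}{ADJ New}{N Dog}} {VP Barked}} as tree rows.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    create table parse (
        node_id   integer primary key,
        parent_id integer not null references parse,
        label     text not null,   -- S, NP, VP, D, ADJ, N, ...
        word      text             -- null for non-leaf constituents
    )
""")
rows = [
    (1, 1, "S",   None),
    (2, 1, "NP",  None),
    (3, 2, "D",   "The"),
    (4, 2, "ADJ", "New"),
    (5, 2, "N",   "Dog"),
    (6, 1, "VP",  "Barked"),
]
con.executemany("insert into parse values (?, ?, ?, ?)", rows)

# e.g. the words of the noun phrase, in order:
np = con.execute(
    "select word from parse where parent_id = 2 and word is not null"
    " order by node_id"
).fetchall()
print([w[0] for w in np])  # -> ['The', 'New', 'Dog']
```

Aggregate study then becomes ordinary SQL over the label and word columns, e.g. counting how often a given label appears across all 25k sentences.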
On Fri, 24 Nov 2006 18:39:09 -0800, Lennart wrote:
[...]
The problem is that real-world text, especially in English, consists of
words whose classification can be ambiguous. We have also become a nation
of illiterate writers - ask a recent high school or even college graduate
to decompose a sentence into its grammatical parts! The vocabulary
also contributes to the problem. As a common example, the word "fish" can
be a noun or a verb, with multiple definitions in each usage. This results
in having to parse the entire expression for each viable usage and
meaning. The end result is that mechanical translations produce multiple
possible outputs, with the "best" translation being chosen based upon
statistical evaluation of the likelihood given the entire context - which
is not known until the finest-grain parse is revealed. It is still a best
guess. I would be overjoyed to get 85% correlation to the intended
meanings over a large sample population such as the content of a newspaper
page.
Will Honea wrote:
[...]
Could you provide a short sample of the output from your NLP program? I
don't get a good grip on your problem, but then I'm one of those
illiterate readers :-)
> Can someone suggest possible database designs for tree-structured data such as this or point me to references dealing with this type of analysis? <<
Get a copy of TREES & HIERARCHIES IN SQL from Amazon.
On Fri, 24 Nov 2006 23:56:18 -0800, Lennart wrote:
[...]
We seem to be holding a dialogue here so let's move it to email. I'll see
if I can garner some samples tonight and send them to you.
On Sat, 25 Nov 2006 05:15:27 -0800, --CELKO-- wrote:
[...]
Get a copy of TREES & HIERARCHIES IN SQL from Amazon.
Thank you, that looks useful. I think I've been going at this from the
wrong angle all along.