I have a data set which I need to analyze but I am having a problem
figuring out a structure for the database - or whether there are better
ways of attacking the problem.
The base data set is a large number of replies to a survey
questionnaire. I have adapted an NLP program to produce lexically annotated
structural trees which appear to reasonably accurately reduce the text to
descending trees. These trees consist of multiple nodes, each of which
contains sub-nodes nested to an indeterminate depth (normally 1 to 7
sub-nodes), with each node having 1-5 branches depending on the statistical
probability of each branch. Without getting pedantic, it
quickly becomes a very complex tree and manual interpretation of the 25k
or so sentences is impractical. What I am specifically looking for is a
data structure to contain the sentences, the lexical descriptions of
phrases within the sentence with a descent into each phrase that
classifies each part of the phrase until the entire entry decomposes into
individual words classified by lexical type. I can get the information to
populate the structure - I just can't figure out a way to store the
results for aggregate study.
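
One possible shape for such a structure, offered only as a minimal sketch: an
adjacency-list layout, where every node is a row that points at its parent.
The example uses Python's standard sqlite3 module. The table and column names
(sentence, node, label, word, prob), the helper functions, and the nested
(label, probability, children-or-word) tuple format are all assumptions made
for illustration; they are not the output format of any particular parser.

```python
import sqlite3

# Hypothetical schema: one row per sentence, one row per tree node.
SCHEMA = """
CREATE TABLE sentence (id INTEGER PRIMARY KEY, text TEXT NOT NULL);
CREATE TABLE node (
    id          INTEGER PRIMARY KEY,
    sentence_id INTEGER NOT NULL REFERENCES sentence(id),
    parent_id   INTEGER REFERENCES node(id),  -- NULL for the root node
    depth       INTEGER NOT NULL,             -- 0 at the root
    label       TEXT NOT NULL,                -- phrase or lexical type
    word        TEXT,                         -- set only on leaf nodes
    prob        REAL                          -- branch probability
);
"""

def store_tree(conn, sentence_id, tree, parent_id=None, depth=0):
    """Insert a (label, prob, children_or_word) tuple tree, depth first."""
    label, prob, body = tree
    word = body if isinstance(body, str) else None
    cur = conn.execute(
        "INSERT INTO node (sentence_id, parent_id, depth, label, word, prob) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (sentence_id, parent_id, depth, label, word, prob),
    )
    node_id = cur.lastrowid
    if word is None:
        for child in body:
            store_tree(conn, sentence_id, child, node_id, depth + 1)
    return node_id

def label_counts(conn):
    """Aggregate across all sentences: how often each label occurs."""
    return dict(conn.execute("SELECT label, COUNT(*) FROM node GROUP BY label"))

def subtree_words(conn, node_id):
    """Words under a node, in insertion (left-to-right) order."""
    rows = conn.execute(
        """
        WITH RECURSIVE sub(id) AS (
            SELECT ?
            UNION ALL
            SELECT node.id FROM node JOIN sub ON node.parent_id = sub.id
        )
        SELECT node.word FROM node JOIN sub ON node.id = sub.id
        WHERE node.word IS NOT NULL
        ORDER BY node.id
        """,
        (node_id,),
    )
    return [row[0] for row in rows]

# Made-up example sentence and tree, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
sid = conn.execute(
    "INSERT INTO sentence (text) VALUES (?)", ("the survey was long",)
).lastrowid
tree = ("S", 1.0, [
    ("NP", 0.9, [("DT", 1.0, "the"), ("NN", 0.8, "survey")]),
    ("VP", 0.7, [("VBD", 1.0, "was"), ("JJ", 0.9, "long")]),
])
root = store_tree(conn, sid, tree)
```

With every node stored as a row, aggregate questions become ordinary GROUP BY
queries over the node table, and a recursive query recovers any phrase from
its parent links. Whether this scales comfortably to 25k sentences with
several alternative branches per node is untested here.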
Can someone suggest possible database designs for tree-structured data such
as this or point me to references dealing with this type of analysis? I
cannot visualize a usable structure and "you can't get there from here"
would be just as appropriate an answer as any. Suggestions on
tackling aggregation of this form of data would be greatly appreciated.