How to parse/format repeated strings in data?

List,

I call this a "Parsing Problem", but it could be called formatting or
regular expressions as well. I have a set of data that was formerly
processed on an OS/390 (hence a lot of horsepower). Now, it has been
moved to a database from where I can call it via a web service with a
C# client. The data looks like this:

ABLATION ENDOMETRIAL (HYSTEROSCOPIC)
ABLATION HEART (CONDUCTION DEFECT)
ABLATION HEART (CONDUCTION DEFECT) WITH CATHETER
ABLATION INNER EAR (CRYOSURGERY) (ULTRASOUND)
ABLATION INNER EAR (CRYOSURGERY) (ULTRASOUND) BY INJECTION
ABLATION LESION HEART BY PERIPHERALLY INSERTED CATHETER
ABLATION LESION HEART ENDOVASCULAR APPROACH
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) ENDOVASCULAR APPROACH
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) OPEN (TRANS-THORACIC) APPROACH
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) TRANS-THORACIC APPROACH
ABLATION PITUITARY
ABLATION PITUITARY BY COBALT-60
ABLATION PITUITARY BY IMPLANTATION (STRONTIUM-YTTRIUM) (Y) NEC
ABLATION PITUITARY BY PROTON BEAM (BRAGG PEAK)
ABLATION PROSTATE (ANAT = 59.02) BY LASER, TRANSURETHRAL
ABLATION PROSTATE (ANAT = 59.02) BY RADIOFREQUENCY THERMOTHERAPY
ABLATION PROSTATE (ANAT = 59.02) BY TRANSURETHRAL NEEDLE ABLATION (TUNA)
ABLATION PROSTATE (ANAT = 59.02) PERINEAL BY CRYOABLATION
ABLATION PROSTATE (ANAT = 59.02) PERINEAL BY RADICAL CRYOSURGICAL ABLATION (RCSA)
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL BY LASER
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL CRYOABLATION
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL RADICAL CRYOSURGICAL ABLATION (RCSA)
ABLATION TISSUE HEART - SEE ABLATION, LESION, HEART
ABLATION VESICLE NECK (ANAT = 60.02)

I need to format 24,000 lines of such data into a tree that looks like
what is below. The key is that whatever leading substring is repeated
across records becomes a heading. For example, ABLATION is common to
all the rows and so is the heading for all of them. HEART (CONDUCTION
DEFECT) is common to two rows and so is a heading with a sub-entry.
Any ideas on efficient approaches in C# would be greatly appreciated.

ABLATION
    ENDOMETRIAL (HYSTEROSCOPIC)
    HEART (CONDUCTION DEFECT)
        WITH CATHETER
    INNER EAR (CRYOSURGERY) (ULTRASOUND)
        BY INJECTION
    LESION HEART
        BY PERIPHERALLY INSERTED CATHETER
        ENDOVASCULAR APPROACH
        MAZE PROCEDURE (COX-MAZE)
            ENDOVASCULAR APPROACH
            OPEN (TRANS-THORACIC) APPROACH
            TRANS-THORACIC APPROACH
    PITUITARY
        BY
            COBALT-60
            IMPLANTATION (STRONTIUM-YTTRIUM) (Y) NEC
            PROTON BEAM (BRAGG PEAK)
    PROSTATE (ANAT = 59.02)
        BY
            LASER, TRANSURETHRAL
            RADIOFREQUENCY THERMOTHERAPY
            TRANSURETHRAL NEEDLE ABLATION (TUNA)
        PERINEAL BY
            CRYOABLATION
            RADICAL CRYOSURGICAL ABLATION (RCSA)
        TRANSURETHRAL
            BY LASER
            CRYOABLATION
            RADICAL CRYOSURGICAL ABLATION (RCSA)
    TISSUE HEART - SEE ABLATION, LESION, HEART
    VESICLE NECK (ANAT = 60.02)

Nov 16 '05 #1
Here's my simple solution:

Read a line.
Use String.Split(' ') to split each line into an array of words.
Put each word in its own node in a hierarchical tree collection.

When all lines are entered, scan the tree and, for each node with only
one child, combine parent and child.
The bad news is that .NET does not have a hierarchical tree collection
type.

The good news is that it actually does, sort-of....
The TreeView control (used to display trees like the left panel of
Windows Explorer) will work, and there's no reason why it needs to be
displayed. It could even be used in a console app.
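
For what it's worth, the same steps can also be sketched without a TreeView, using a small hand-rolled node class. This is only a sketch under assumptions: WordNode, WordTreeDemo, Render and the IsEntry flag are names made up for illustration, and the IsEntry flag is an addition to the steps above so that a heading that is also a complete record (such as ABLATION PITUITARY) is not merged into its only child.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical node type for this sketch (not a .NET class).
class WordNode
{
    public string Text;      // one word, or several words after merging
    public bool IsEntry;     // true if an input line ends at this node
    public SortedDictionary<string, WordNode> Children =
        new SortedDictionary<string, WordNode>(StringComparer.Ordinal);

    public WordNode(string text) { Text = text; }

    // Insert one line, one word per tree level.
    public void Add(string line)
    {
        WordNode node = this;
        foreach (string word in line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            WordNode child;
            if (!node.Children.TryGetValue(word, out child))
            {
                child = new WordNode(word);
                node.Children.Add(word, child);
            }
            node = child;
        }
        node.IsEntry = true;
    }

    // Bottom-up: a node that is not a record itself and has exactly
    // one child absorbs that child (text, flag and grandchildren).
    public void Merge()
    {
        foreach (WordNode child in Children.Values) child.Merge();
        if (!IsEntry && Children.Count == 1)
        {
            WordNode only = null;
            foreach (WordNode c in Children.Values) only = c;
            Text = Text + " " + only.Text;
            IsEntry = only.IsEntry;
            Children = only.Children;
        }
    }
}

static class WordTreeDemo
{
    // Build the tree, merge single-child chains, return an indented listing.
    public static string Render(IEnumerable<string> lines)
    {
        WordNode root = new WordNode("");   // unnamed container, never merged
        foreach (string line in lines) root.Add(line);
        foreach (WordNode top in root.Children.Values) top.Merge();

        StringBuilder sb = new StringBuilder();
        foreach (WordNode top in root.Children.Values) Write(top, 0, sb);
        return sb.ToString();
    }

    static void Write(WordNode node, int depth, StringBuilder sb)
    {
        sb.Append(' ', depth * 2).Append(node.Text).Append('\n');
        foreach (WordNode child in node.Children.Values) Write(child, depth + 1, sb);
    }

    static void Main()
    {
        Console.Write(Render(new[]
        {
            "ABLATION PITUITARY",
            "ABLATION PITUITARY BY COBALT-60",
            "ABLATION PITUITARY BY IMPLANTATION (STRONTIUM-YTTRIUM) (Y) NEC",
            "ABLATION PITUITARY BY PROTON BEAM (BRAGG PEAK)"
        }));
    }
}
```

For those four lines, Main prints ABLATION PITUITARY as the heading, BY beneath it, and the three remaining phrases beneath BY. Children are kept in a SortedDictionary keyed on their first word only so that the listing comes out in a stable order; that choice is an assumption of the sketch, not a requirement of the approach.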

Nov 16 '05 #2
James,

I will put this together tomorrow and see how it works. I had thought
of using Split, but your idea of scanning the tree for each node with
only one child and then combining the two had not occurred to me at all.
I'll fill you in on how it works.

Thanks,

Mike

James Curran wrote:
Here's my simple solution:

Read a line.
use String.Split(" ") to split each line up into a collection of words. put each word in separate nodes in a hierarchical-tree collection.
When all line are entered, scan the tree and for each node with only one child, combine parent & child.
The bad news is that .Net does not have a hierarchical-tree collection type.

The good news is that it actually does, sort-of....
The TreeView control (used to display trees like the left panel of Windows Explorer) will work, and there's no reason why it needs to be
displayed. It could even be used in a console app.

Nov 16 '05 #3
If you want to build your own data structure to hold this, a hash table
of hash tables (of hash tables of hash tables...) would probably be
best.

using System.Collections;

public class HashtableStringTree
{
    private string name;      // the word (or, after Collapse, the words) at this node
    private bool entry;       // true if at least one input string ends at this node
    private Hashtable table;  // child nodes, keyed by their name

    public HashtableStringTree(string name)
    {
        this.name = name;
        this.table = new Hashtable();
        this.entry = false;
    }

    // Read-only access for code that walks the tree.
    public string Name { get { return this.name; } }
    public bool Entry { get { return this.entry; } }
    public ICollection Children { get { return this.table.Values; } }

    public void Add(string[] splitString)
    {
        Add(splitString, 0);
    }

    public void Add(string[] splitString, int startIndex)
    {
        if (startIndex >= splitString.Length)
        {
            // No words left: a string terminates at this node.
            this.entry = true;
            return;
        }
        string word = splitString[startIndex];
        HashtableStringTree nextLevel = (HashtableStringTree)this.table[word];
        if (nextLevel == null)
        {
            nextLevel = new HashtableStringTree(word);
            this.table[word] = nextLevel;
        }
        nextLevel.Add(splitString, startIndex + 1);
    }

    public string Collapse()
    {
        // Collapse the children first. Their names may change, so
        // rebuild the table under the new names.
        HashtableStringTree nextLevel = null;
        Hashtable newTable = new Hashtable();
        foreach (DictionaryEntry de in this.table)
        {
            nextLevel = (HashtableStringTree)de.Value;
            newTable[nextLevel.Collapse()] = nextLevel;
        }
        this.table = newTable;

        // A node that is not an entry itself and has exactly one child
        // (nextLevel, in that case) is merged with that child.
        if (!this.entry && this.table.Count == 1)
        {
            this.name = string.IsNullOrEmpty(this.name)
                ? nextLevel.name
                : this.name + " " + nextLevel.name;
            this.entry = nextLevel.entry;
            this.table = nextLevel.table;
        }
        return this.name;
    }
}

You can then walk the tree recursively doing whatever you like. If a
node in the tree has an "entry" of true, it means that at least one
string terminated there. If it has an "entry" of false then it means
that all strings passing through it had more words in them. For example

ABLATION PITUITARY
ABLATION PITUITARY BY COBALT-60
ABLATION PITUITARY BY IMPLANTATION (STRONTIUM-YTTRIUM) (Y) NEC
ABLATION PITUITARY BY PROTON BEAM (BRAGG PEAK)

will result in a top-level node containing one item in the "table":
"ABLATION". The "entry" of the "ABLATION" node will be false because
there were no lines containing "ABLATION" and nothing more. Its "table"
will contain one item keyed on "PITUITARY", and that node will have an
"entry" of true, because there was one line that ended in the word
"PITUITARY" and went no further. The "PITUITARY" node's "table" will in
turn contain one item: "BY", which points to another
HashtableStringTree with a false "entry" and three items in the
"table": "COBALT-60", "IMPLANTATION", and "PROTON", each followed by
the remaining words of its line.

Calling Collapse() on the top-level node should combine the "ABLATION"
and "PITUITARY" nodes into a single "ABLATION PITUITARY" node, but it
also makes it impossible to add new entries correctly afterwards,
because the table keys are no longer single words.

[WARNING: I have not tested this code. This is "off the cuff", as it
were.]

Nov 16 '05 #4
Bruce,

I am very interested in your idea. Just trying to wrap my mind around
how it works. It may take me a few days to understand and implement
it. Thanks! I'll let you and James both know how this gets solved.

Nov 16 '05 #5
James,

In thinking about your suggestion some more, I find it difficult to see
how it will work correctly. In some cases the top-level node will have
no children, and in others the top-level node will be a string composed
of more than the top node and one child. For example, ABLATION actually
has no children, while in another example, from data I did not show, a
top-level node is ACROMIOPLASTY (ANAT = 44.05), which would be a node
with multiple children from the split. Any ideas on how to get around
these?

Nov 16 '05 #6
I just noticed a bug in the Collapse routine. It should read:

public string Collapse()
{
    // Collapse the children first and re-key them under their new names.
    HashtableStringTree nextLevel = null;
    Hashtable newTable = new Hashtable();
    foreach (DictionaryEntry de in this.table)
    {
        nextLevel = (HashtableStringTree)de.Value;
        newTable[nextLevel.Collapse()] = nextLevel;
    }
    this.table = newTable;

    // nextLevel stays null for a leaf (empty table), so check it first.
    // With exactly one child, nextLevel is that child: merge it into
    // this node unless this node is an entry in its own right.
    if (nextLevel != null && !this.entry && this.table.Count == 1)
    {
        this.name = string.IsNullOrEmpty(this.name)
            ? nextLevel.name
            : this.name + " " + nextLevel.name;
        this.entry = nextLevel.entry;
        this.table = nextLevel.table;
    }
    return this.name;
}

Nov 16 '05 #7
Do you have a standard format for your lines? If so, you can set up a
Regular Expression and Split them that way. Since you're using this with a
Web Service anyway, what about converting it all to XML upon splitting? I
don't see the specific rules to your splitting here, but you should
definitely make sure you know exactly what you want split and how you want
it split prior to jumping in and coding it...
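
As a rough illustration of that suggestion, here is a minimal sketch that splits each line with a regular expression and nests the words as XML. It assumes the rule is simply "split on runs of whitespace" and that blank lines have already been filtered out. The element and attribute names (index, node, text, entry) and the XmlSketch class are invented for the example and are not part of any existing schema.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

static class XmlSketch
{
    // Split one record on runs of whitespace (assumed splitting rule).
    public static string[] Words(string line)
    {
        return Regex.Split(line.Trim(), @"\s+");
    }

    // Nest the words of each line as <node text="..."> elements,
    // reusing a path when an earlier line already created it.
    public static XElement ToXml(string[] lines)
    {
        XElement root = new XElement("index");
        foreach (string line in lines)
        {
            XElement current = root;
            foreach (string word in Words(line))
            {
                XElement next = null;
                foreach (XElement e in current.Elements("node"))
                {
                    if ((string)e.Attribute("text") == word) { next = e; break; }
                }
                if (next == null)
                {
                    next = new XElement("node", new XAttribute("text", word));
                    current.Add(next);
                }
                current = next;
            }
            // Mark the element where a complete record ends.
            current.SetAttributeValue("entry", "true");
        }
        return root;
    }

    static void Main()
    {
        Console.WriteLine(ToXml(new[]
        {
            "ABLATION PITUITARY",
            "ABLATION PITUITARY BY COBALT-60"
        }));
    }
}
```

The entry attribute records where a full line ended, so a later pass can tell a heading that is also a record apart from one that only groups its children.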

Nov 16 '05 #8
