Bytes | Software Development & Data Engineering Community

How to parse/format repeated strings in data?

List,

I call this a "Parsing Problem", but it could be called formatting or
regular expressions as well. I have a set of data that was formerly
processed on an OS390 (hence a lot of horsepower). Now, it has been
moved to a database from where I can call it via a web service with a
C# client. The data looks like this:

ABLATION ENDOMETRIAL (HYSTEROSCOPIC)
ABLATION HEART (CONDUCTION DEFECT)
ABLATION HEART (CONDUCTION DEFECT) WITH CATHETER
ABLATION INNER EAR (CRYOSURGERY) (ULTRASOUND)
ABLATION INNER EAR (CRYOSURGERY) (ULTRASOUND) BY INJECTION
ABLATION LESION HEART BY PERIPHERALLY INSERTED CATHETER
ABLATION LESION HEART ENDOVASCULAR APPROACH
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) ENDOVASCULAR APPROACH
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) OPEN (TRANS-THORACIC) APPROACH
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) TRANS-THORACIC APPROACH
ABLATION PITUITARY
ABLATION PITUITARY BY COBALT-60
ABLATION PITUITARY BY IMPLANTATION (STRONTIUM-YTTRIUM) (Y) NEC
ABLATION PITUITARY BY PROTON BEAM (BRAGG PEAK)
ABLATION PROSTATE (ANAT = 59.02) BY LASER, TRANSURETHRAL
ABLATION PROSTATE (ANAT = 59.02) BY RADIOFREQUENCY THERMOTHERAPY
ABLATION PROSTATE (ANAT = 59.02) BY TRANSURETHRAL NEEDLE ABLATION (TUNA)
ABLATION PROSTATE (ANAT = 59.02) PERINEAL BY CRYOABLATION
ABLATION PROSTATE (ANAT = 59.02) PERINEAL BY RADICAL CRYOSURGICAL ABLATION (RCSA)
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL BY LASER
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL CRYOABLATION
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL RADICAL CRYOSURGICAL ABLATION (RCSA)
ABLATION TISSUE HEART - SEE ABLATION, LESION, HEART
ABLATION VESICLE NECK (ANAT = 60.02)

I need to format 24,000 lines of such data into a tree that looks like
the one below. The key is that whatever leading substring is repeated
across records becomes a heading. For example, ABLATION is common to
all the rows and so is the heading for all of them. HEART is common to
two rows and so becomes a heading with a sub-entry. Any ideas on
efficient approaches in C# would be greatly appreciated.

ABLATION
  ENDOMETRIAL (HYSTEROSCOPIC)
  HEART (CONDUCTION DEFECT)
    WITH CATHETER
  INNER EAR (CRYOSURGERY) (ULTRASOUND)
    BY INJECTION
  LESION HEART
    BY PERIPHERALLY INSERTED CATHETER
    ENDOVASCULAR APPROACH
    MAZE PROCEDURE (COX-MAZE)
      ENDOVASCULAR APPROACH
      OPEN (TRANS-THORACIC) APPROACH
      TRANS-THORACIC APPROACH
  PITUITARY
    BY
      COBALT-60
      IMPLANTATION (STRONTIUM-YTTRIUM) (Y) NEC
      PROTON BEAM (BRAGG PEAK)
  PROSTATE (ANAT = 59.02)
    BY
      LASER, TRANSURETHRAL
      RADIOFREQUENCY THERMOTHERAPY
      TRANSURETHRAL NEEDLE ABLATION (TUNA)
    PERINEAL BY
      CRYOABLATION
      RADICAL CRYOSURGICAL ABLATION (RCSA)
    TRANSURETHRAL
      BY LASER
      CRYOABLATION
      RADICAL CRYOSURGICAL ABLATION (RCSA)
  TISSUE HEART - SEE ABLATION, LESION, HEART
  VESICLE NECK (ANAT = 60.02)

Nov 16 '05 #1
Here's my simple solution:

Read a line.
Use String.Split(' ') to split the line into a collection of words.
Put each word in its own node in a hierarchical-tree collection, one
level per word.

When all lines have been entered, scan the tree and, for each node with
only one child, combine parent and child.

The bad news is that .NET does not have a hierarchical-tree collection
type.

The good news is that it actually does, sort of.
The TreeView control (used to display trees like the left panel of
Windows Explorer) will work, and there's no reason why it needs to be
displayed. It could even be used in a console app, provided the project
references System.Windows.Forms.
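
The steps above can be sketched roughly as follows. This is only an illustration under a few assumptions: WordNode, Add and Merge are made-up names, a generic SortedDictionary stands in for the hidden TreeView, the root is an invisible container whose Text is null, and a parent is left alone when a record ends exactly at it (otherwise "ABLATION PITUITARY" would be lost as a record of its own).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical node type: one word per node, or several words after merging.
public class WordNode
{
    public string Text;              // null marks the invisible root
    public bool EndsRecord;          // true if an input line ends exactly here
    public SortedDictionary<string, WordNode> Children =
        new SortedDictionary<string, WordNode>(StringComparer.Ordinal);

    public WordNode(string text) { Text = text; }

    // Split the line into words and walk/create one node per word.
    public void Add(string line)
    {
        WordNode node = this;
        foreach (string word in line.Split(new[] { ' ' },
                                           StringSplitOptions.RemoveEmptyEntries))
        {
            WordNode child;
            if (!node.Children.TryGetValue(word, out child))
            {
                child = new WordNode(word);
                node.Children.Add(word, child);
            }
            node = child;
        }
        node.EndsRecord = true;
    }

    // Fold every only-child into its parent, working from the leaves upwards.
    public void Merge()
    {
        var merged = new SortedDictionary<string, WordNode>(StringComparer.Ordinal);
        foreach (WordNode child in Children.Values)
        {
            child.Merge();
            merged[child.Text] = child;   // re-key under the merged text
        }
        Children = merged;

        if (Text != null && !EndsRecord && Children.Count == 1)
        {
            WordNode only = Children.Values.First();
            Text = Text + " " + only.Text;
            EndsRecord = only.EndsRecord;
            Children = only.Children;
        }
    }
}
```

To use it, create a root with new WordNode(null), call Add once per record, then call Merge once on the root. Printing each node's Text with indentation proportional to its depth then gives the outline.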

Nov 16 '05 #2
James,

I will put this together tomorrow and see how it works. I had thought
of using Split, but your idea of scanning the tree for each node with
only one child and then combining the two had not occurred to me at
all. I'll fill you in on how it works.

Thanks,

Mike

James Curran wrote:
Here's my simple solution:

Read a line.
use String.Split(" ") to split each line up into a collection of words.
put each word in separate nodes in a hierarchical-tree collection.

When all line are entered, scan the tree and for each node with only one child, combine parent & child.
The bad news is that .Net does not have a hierarchical-tree collection type.

The good news is that it actually does, sort-of....
The TreeView control (used to display trees like the left panel of Windows Explorer) will work, and there's no reason why it needs to be
displayed. It could even be used in a console app.

Nov 16 '05 #3
If you want to build your own data structure to hold this, a hash table
of hash tables (of hash tables of hash tables...) would probably be
best.

using System.Collections;

public class HashtableStringTree
{
    private string name;      // the word (or, after Collapse, the phrase) at this node
    private bool entry;       // true if at least one string ends at this node
    private Hashtable table;  // child nodes, keyed by name

    public HashtableStringTree(string name)
    {
        this.name = name;
        this.table = new Hashtable();
        this.entry = false;
    }

    public void Add(string[] splitString)
    {
        Add(splitString, 0);
    }

    public void Add(string[] splitString, int startIndex)
    {
        if (startIndex >= splitString.Length)
        {
            // Every word has been consumed, so a string ends at this node.
            this.entry = true;
            return;
        }

        string word = splitString[startIndex];
        HashtableStringTree nextLevel = (HashtableStringTree)this.table[word];
        if (nextLevel == null)
        {
            nextLevel = new HashtableStringTree(word);
            this.table[word] = nextLevel;
        }
        nextLevel.Add(splitString, startIndex + 1);
    }

    public string Collapse()
    {
        if (this.table != null)
        {
            // Collapse the children first and re-key them by their new names.
            HashtableStringTree nextLevel = null;
            Hashtable newTable = new Hashtable();
            foreach (DictionaryEntry de in this.table)
            {
                nextLevel = (HashtableStringTree)de.Value;
                newTable[nextLevel.Collapse()] = nextLevel;
            }
            this.table = newTable;

            // No string ends here and there is exactly one child: absorb it.
            if (!this.entry && this.table.Count == 1)
            {
                this.name = (this.name + " " + nextLevel.name).Trim();
                this.entry = nextLevel.entry;
                this.table = nextLevel.table;
            }
        }
        return this.name;
    }
}

You can then walk the tree recursively doing whatever you like. If a
node in the tree has "entry" set to true, it means that at least one
string terminated there. If "entry" is false, it means that all the
strings passing through that node had more words in them. For example

ABLATION PITUITARY
ABLATION PITUITARY BY COBALT-60
ABLATION PITUITARY BY IMPLANTATION (STRONTIUM-YTTRIUM) (Y) NEC
ABLATION PITUITARY BY PROTON BEAM (BRAGG PEAK)

will result in a top-level node containing one item in the "table":
"ABLATION". The "entry" of the "ABLATION" node will be false because
there were no strings containing "ABLATION" and nothing more. Its
"table" will contain one item keyed on "PITUITARY", whose "entry" is
true because there was one string that ended in the word "PITUITARY"
and went no further. The "table" of the "PITUITARY" node will in turn
contain one item, "BY", which points to another HashtableStringTree
with a false "entry" and three items in its "table": "COBALT-60",
"IMPLANTATION", and "PROTON", and so on down to the last word of each
string.

Calling Collapse() on the top-level node should combine the "ABLATION"
and "PITUITARY" nodes into a single "ABLATION PITUITARY" node, but it
also makes it impossible to add new entries correctly, because the
tables are then keyed on merged phrases rather than on single words.

[WARNING: I have not tested this code. This is "off the cuff", as it
were.]
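
As an illustration of the recursive walk mentioned above, here is a minimal sketch that prints a tree as an indented outline. OutlineNode and OutlinePrinter are hypothetical names used only for this example. A walk over HashtableStringTree would have the same shape, iterating over the values of its "table" instead of a list.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical, minimal tree node used only for this illustration.
public class OutlineNode
{
    public string Name;
    public List<OutlineNode> Children = new List<OutlineNode>();

    public OutlineNode(string name, params OutlineNode[] children)
    {
        Name = name;
        Children.AddRange(children);
    }
}

public static class OutlinePrinter
{
    // Depth-first walk: emit this node, then each child one level deeper.
    public static void Write(OutlineNode node, int depth, StringBuilder output)
    {
        output.Append(' ', depth * 2).Append(node.Name).Append('\n');
        foreach (OutlineNode child in node.Children)
        {
            Write(child, depth + 1, output);
        }
    }
}
```

Calling OutlinePrinter.Write(root, 0, builder) fills the StringBuilder with one line per node, indented two spaces per level.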

Nov 16 '05 #4
Bruce,

I am very interested in your idea. Just trying to wrap my mind around
how it works. It may take me a few days to understand and implement
it. Thanks! I'll let you and James both know how this gets solved.

Nov 16 '05 #5
James,

In thinking about your suggestion some more, I find it difficult to see
how it will work correctly. In some cases the top-level node will have
no children, and in others the top-level node will be a string composed
of more than the top word and one child. For example, ABLATION actually
has no children, while in data I did not show, a top-level node is
ACROMIOPLASTY (ANAT = 44.05), which the split would turn into a node
with multiple children. Any ideas on how to get around these?

Nov 16 '05 #6
I just noticed a bug in the Collapse routine. It should read:

public string Collapse()
{
    // Collapse the children first and re-key them by their new names.
    HashtableStringTree nextLevel = null;
    Hashtable newTable = new Hashtable();
    foreach (DictionaryEntry de in this.table)
    {
        nextLevel = (HashtableStringTree)de.Value;
        newTable[nextLevel.Collapse()] = nextLevel;
    }
    this.table = newTable;

    // nextLevel is still null when this node has no children (a leaf).
    if (nextLevel != null && !this.entry && this.table.Count == 1)
    {
        // No string ends here and there is exactly one child: absorb it.
        this.name = (this.name + " " + nextLevel.name).Trim();
        this.entry = nextLevel.entry;
        this.table = nextLevel.table;
    }
    return this.name;
}

Nov 16 '05 #7
Do you have a standard format for your lines? If so, you can set up a
Regular Expression and Split them that way. Since you're using this with a
Web Service anyway, what about converting it all to XML upon splitting? I
don't see the specific rules to your splitting here, but you should
definitely make sure you know exactly what you want split and how you want
it split prior to jumping in and coding it...
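
One possible rule, shown here only as a sketch, is to treat each parenthesised group as a single token so that "(ANAT = 44.05)" is not broken into three words. RecordTokenizer and the pattern itself are assumptions made for illustration. They are not taken from the original data specification, and nested parentheses are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RecordTokenizer
{
    // Assumed rule: a "(...)" group is one token; anything else splits on whitespace.
    private static readonly Regex Token = new Regex(@"\([^()]*\)|\S+");

    public static List<string> Tokenize(string line)
    {
        var tokens = new List<string>();
        foreach (Match m in Token.Matches(line))
        {
            tokens.Add(m.Value);
        }
        return tokens;
    }
}
```

With this rule, "ACROMIOPLASTY (ANAT = 44.05)" yields two tokens, "ACROMIOPLASTY" and "(ANAT = 44.05)", which can then be fed into whichever tree structure is used.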

Nov 16 '05 #8
