Get regular expression

Mike

I have a regular expression (^(.+)(?=\s*).*\1 ) that results in
matches. I would like to get what the actual regular expression is.
In other words, when I apply ^(.+)(?=\s*).*\1 to " HEART (CONDUCTION
DEFECT) 37.33/2 HEART (CONDUCTION DEFECT) WITH
CATHETER 37.34/2 " the expression is "HEART (CONDUCTION DEFECT)". How
do I gain access to the expression (not the matches) at runtime?

Thanks,

Mike

Jun 20 '06 #1


Xicheng Jia

Do you want to access the expression "HEART (CONDUCTION DEFECT)" or the
regex "^(.+)(?=\s*).*\1" at run time? I don't think you can get exactly
the latter, though. For the former, you can use a named capture,
like

^(?<expr>.+)(?=\s*).*\k<expr>

and access the variable "expr" at run time.

Xicheng

Jun 20 '06 #2

Mike

I want to access the expression "HEART (CONDUCTION DEFECT)". I'll try
your suggestion first off in the morning.

Thanks,

Jun 20 '06 #3

Xicheng Jia

Err, my bad. You need the match object's Groups property, like
Groups("expr") or Groups(1), to access the captured values (Groups[...]
for C#).
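
A minimal sketch of the named-capture idea, written in Python rather than .NET so it can be run standalone. The syntax differs: Python spells the named group `(?P<expr>...)` and the backreference `(?P=expr)`, where the .NET pattern above uses `(?<expr>...)` and `\k<expr>`. The sample string is the one from the original post, assumed here to sit on a single line.

```python
import re

# Python spelling of the named-group pattern; .NET would use (?<expr>...) and \k<expr>.
PATTERN = re.compile(r"^(?P<expr>.+)(?=\s*).*(?P=expr)")

# Sample text from the thread, assumed to contain no line breaks.
text = " HEART (CONDUCTION DEFECT) 37.33/2 HEART (CONDUCTION DEFECT) WITH CATHETER 37.34/2 "

match = PATTERN.match(text)
repeated = None
if match:
    # The captured text is read from the match object, not from the pattern itself.
    # strip() drops the surrounding spaces that the greedy group also picks up.
    repeated = match.group("expr").strip()

print(repeated)
```

In .NET the same value would be read through the match object's Groups property, as described above. The lookahead `(?=\s*)` can match the empty string, so it never rejects anything and could be dropped without changing the result.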

Xicheng

Jun 20 '06 #4

Kevin Spencer

> I want to access the expression "HEART (CONDUCTION DEFECT)". I'll try
> your suggestion first off in the morning.

First, "HEART (CONDUCTION DEFECT)" is not an expression. That is a
substring of the original string. The regular expression is the string
"^(.+)(?=\s*).*\1" that you are using to get your match. Assuming that
"HEART (CONDUCTION DEFECT)" is your match (which it is not), you could call
it a match for the regular expression (which may match more than once in a
string). But it is a substring of the original string. It may seem picky,
but in order to communicate effectively, one must use the right terms. As an
example, if I told you that I ate a car for breakfast, would you know that I
ate an apple?

Second, the string you posted contains 2 instances of the substring "HEART
(CONDUCTION DEFECT)". Do you want to get both of them? If so, what exactly
are your pattern-matching rules? A regular expression matches a pattern.
Obviously, not all of the strings you will be working with will be:

" HEART (CONDUCTION
DEFECT) 37.33/2 HEART (CONDUCTION DEFECT) WITH
CATHETER 37.34/2 "

In fact, since this is a newsgroup and I'm using a newsreader, I doubt
that the line breaks in the string are where they appear to be, if there
are any at all. And I have to wonder whether the string actually begins
and ends with a space.

In other words, you're going to be using a regular expression to isolate
substrings of various strings (most probably). A regular expression is
shorthand for a set of rules that defines a pattern you're looking for.
Whether the strings contain line breaks, for example, is important. Your
regular expression begins with the caret '^' character. This character can
indicate the beginning of a string, or the beginning of a line *or* a
string, depending upon what options you use. You didn't specify the
option(s) you're using, so we have no way to know.
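
To make the point about the caret concrete, here is a small sketch in Python (chosen only so it can be run standalone). Python's `re.MULTILINE` flag plays the role that `RegexOptions.Multiline` plays in .NET. The two-line sample text is an assumption, modelled on the wrapped string in the original post.

```python
import re

# Assumed sample: the start of the posted string with a line break in the middle.
text = "HEART (CONDUCTION\nDEFECT) 37.33/2"

# Default: ^ anchors only at the very start of the string.
default_hits = re.findall(r"^\S+", text)

# With MULTILINE: ^ also anchors right after each line break.
multiline_hits = re.findall(r"^\S+", text, flags=re.MULTILINE)

print(default_hits)    # ['HEART']
print(multiline_hits)  # ['HEART', 'DEFECT)']
```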

In addition, your pattern is not likely to work in the way you expect. For
example, the following would match:

THIS IS NOT WHAT THIS IS SUPPOSED TO BE. (Captures the phrase "THIS IS " in group 1)
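
The sentence above can be checked with a short sketch. It is written in Python for convenience; the pattern is the one from the original post, and the numbered group and `\1` backreference behave the same way here as in .NET.

```python
import re

pattern = re.compile(r"^(.+)(?=\s*).*\1")
sentence = "THIS IS NOT WHAT THIS IS SUPPOSED TO BE."

m = pattern.match(sentence)
captured = m.group(1)  # longest leading text that occurs again later in the string
whole = m.group(0)     # everything from the start through the repeated text

print(repr(captured))  # 'THIS IS '
print(repr(whole))     # 'THIS IS NOT WHAT THIS IS '
```

The pattern only asks for "some leading text that shows up again", so any accidental repetition qualifies, and the capture includes the trailing space.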

And in addition, if there are line breaks, like your example (as split by
the newsreader), the matching substring would be:

DEFECT) 37.33/2 HEART (CONDUCTION DEFECT)

So, can you explain what your rules are, and what you are trying to match
here? I'm just guessing that you're parsing medical transcriptions, but
beyond that, I'm stumped.

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Chicken Salad Alchemist

I recycle.
I send everything back to the planet it came from.

Jun 21 '06 #5

Mike

Must say I get burned in six different ways. Some groups I top post
and get scolded. On other groups other people top post and nobody
appears to have a problem. I'll top post here.

Given I've been asked for details I'll provide them, but typically
nobody wants to wade through them.

In the dark ages I had 24,000 lines of ICD9 index entries which got
appended with ICD9 codes and were processed one time per year into a
big paper report with a tree-like structure by an assembler program on
an OS390. An abbreviated example of the report is below for the
Ablation entry.

Ablation
    Endometrial (Hysteroscopic) 68.23
    Heart (Conduction Defect) 37.33/2
        With Catheter 37.34/2
    Inner Ear (Cryosurgery) (Ultrasound) 20.79/4
        By Injection 20.72
    Lesion Heart
        By Peripherally Inserted Catheter 37.34

Across my institution in the past there have been multiple "master"
copies of ICD9 codes and index entries. The order came down that
long-term we will work towards a single copy of ICD9 codes with index
entries that will be accessed via webservices. The structure of the
data in our old database was as follows (no line breaks -- each entry
was one line):

ABLATION ENDOMETRIAL (HYSTEROSCOPIC) 68.23
ABLATION HEART (CONDUCTION DEFECT) 37.33/2
ABLATION HEART (CONDUCTION DEFECT) WITH CATHETER 37.34/2
ABLATION INNER EAR (CRYOSURGERY) (ULTRASOUND) 20.79/4
ABLATION INNER EAR (CRYOSURGERY) (ULTRASOUND) BY INJECTION 20.72
ABLATION LESION HEART BY PERIPHERALLY INSERTED CATHETER 37.34
ABLATION LESION HEART ENDOVASCULAR APPROACH 37.34
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) ENDOVASCULAR APPROACH 37.34
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) OPEN (TRANS-THORACIC) APPROACH 37.33
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) TRANS-THORACIC APPROACH 37.33
ABLATION PITUITARY 7.69
ABLATION PITUITARY BY COBALT-60 92.32
ABLATION PITUITARY BY IMPLANTATION (STRONTIUM-YTTRIUM) (Y) NEC 92.39
ABLATION PITUITARY BY PROTON BEAM (BRAGG PEAK) 92.33
ABLATION PROSTATE (ANAT = 59.02) BY LASER, TRANSURETHRAL 60.21
ABLATION PROSTATE (ANAT = 59.02) BY RADIOFREQUENCY THERMOTHERAPY 60.97
ABLATION PROSTATE (ANAT = 59.02) BY TRANSURETHRAL NEEDLE ABLATION (TUNA) 60.97
ABLATION PROSTATE (ANAT = 59.02) PERINEAL BY CRYOABLATION 60.62
ABLATION PROSTATE (ANAT = 59.02) PERINEAL BY RADICAL CRYOSURGICAL ABLATION (RCSA) 60.62
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL BY LASER 60.21
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL CRYOABLATION 60.29
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL RADICAL CRYOSURGICAL ABLATION (RCSA) 60.29
ABLATION TISSUE HEART - SEE ABLATION, LESION, HEART 0
ABLATION VESICLE NECK (ANAT = 60.02) 57.91

The new webservices still have this same index structure except now,
for example, "Ablation Vesicle Neck (ANAT = 60.02)" is just a property
of code 57.91. The surgical coders still want to view the index
entries in a tree structure on demand. Without getting into
mind-numbing details, I can jump through some hoops and get back a set
of index entries that look like above for ABLATION but they are not
formatted in the way the surgical coders desire. I believe I have a
recursive algorithm that will work to format these into a tree
structure but this algorithm is predicated on being able to find the
nodes.

If you look carefully, the root node for the entire set of index entries
above is "ABLATION" (as that is what begins each entry and repeats
across all of them). Subsequently, Endometrial (Hysteroscopic) + code
is a child of ABLATION with no children of its own because it is not
repeated. Next, Heart (Conduction Defect) + code is a node with "With
Catheter + code" as a child of that node because "Heart (Conduction
Defect)" repeats across both those lines.

I have begged the group that now owns the webservice to allow me to
restructure the data but no go (they say that would be bastardizing the
concept of everything being 'code-centric'). I am stuck with this and
also with the demand by the coders that they get the formatted tree
structure to look at when they code.

In general, I think if I do the following I can figure out the nodes
and children:

1. Read index entries until the first word changes.
2. Get the substring that begins the string and is repeated elsewhere
in the string (this is the node).
3. Remove that node and keep processing until the base case is hit etc.
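
Steps 2 and 3 can be sketched without a regex by comparing entries word by word. This is only an illustration in Python under two assumptions that are not stated in the thread: the code is always the last whitespace-separated token of an entry, and a node is the run of leading words shared by a group of entries. The helper names `leading_words` and `shared_prefix` are made up for this sketch.

```python
def leading_words(line):
    """Words of an entry without its trailing code (assumed to be the last token)."""
    return line.split()[:-1]


def shared_prefix(word_lists):
    """Longest run of leading words common to every list."""
    prefix = []
    for column in zip(*word_lists):
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    return prefix


# Two entries taken from the listing above.
entries = [
    "ABLATION HEART (CONDUCTION DEFECT) 37.33/2",
    "ABLATION HEART (CONDUCTION DEFECT) WITH CATHETER 37.34/2",
]

word_lists = [leading_words(e) for e in entries]
node = shared_prefix(word_lists)                          # step 2: the node
remainders = [words[len(node):] for words in word_lists]  # step 3: remove it

print(" ".join(node))  # ABLATION HEART (CONDUCTION DEFECT)
print(remainders)      # [[], ['WITH', 'CATHETER']]
```

An empty remainder means the entry is the node itself; a non-empty one is a child to process in the next round.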

If anyone has any better ideas of how to deal with this I would be
thrilled to no end to hear them.

Thanks,

Mike

Jun 21 '06 #6

Kevin Spencer

Hi Mike,

As far as Top-Posting is concerned, AFAIK it's still a matter of debate, and
as we're talking about Netiquette, not ISO or W3C standards, my personal
feeling is that anyone who scolds one about top- or bottom-posting has a poor
sense of priority. After all, the purpose of groups such as this is
communication. I find it far more difficult to deal with poor communication
than with the format of a post, but that's just me! ;-)

In your case, you have done a pretty darned good job of communication, and I
appreciate that, so I will certainly do all I can to help out! I did have to
do a little research into ICD9, but that wasn't hard with Google.

It took me a few minutes of study to figure out (for the most part) what
your requirements are. Let me see if I can repeat them back to you in my own
words, and ask a couple of questions:

1. You have a set of data that is pure text, and is either stored in an
actual database, or in the text equivalent of a database as a multi-line
text document. I can't be exactly sure.
2. In any case, this data consists of multiple single-line entries of text.
3. The data is stored in such a way that the text represents a hierarchical
structure of nodes.
4. This is achieved by a top-level classification that is repeated in each
"record" (line) for every record that falls under it.
5. Sub-nodes are indicated in the same way by the first text that follows
the top-level node text.
6. The node identifier text in the sub-nodes can be identified by comparing
it with other records that are under the top-level node. There is no other
way to distinguish this text from any other text in the record, other than
by comparing it with other records.
7. Therefore, the structure of the hierarchy can be inferred by using a
recursive procedure that identifies increasingly "deep" sub-nodes within the
set of records.
8. (Now here's where I'm a bit fuzzy). Your task is to put all of this into
some form of data structure that can be used as an index, probably a
hierarchical structure such as a tree.

Question: Will these records be ordered in any way? IOW, for example, will
they be ordered alphabetically? If they are ordered alphabetically, the
structure is already present, by virtue of the rules as stated above.
Otherwise, it will be necessary to do some form of re-scanning of the data.
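
One way the hierarchy described in points 3 to 7 could be inferred is sketched below in Python. It is not the method used by the old assembler program; it is a swapped-in technique: build a trie keyed on words, then merge every chain of single-child nodes that carries no code into one label. It assumes the code is the last token of each entry. Because the trie groups entries by their leading words, the input does not need to be sorted. The function names are hypothetical.

```python
def build_trie(lines):
    """Insert each entry word by word; the code is stored on the last word's node."""
    root = {"children": {}, "code": None}
    for line in lines:
        *words, code = line.split()  # assumption: the code is the last token
        node = root
        for word in words:
            node = node["children"].setdefault(word, {"children": {}, "code": None})
        node["code"] = code
    return root


def compress(label, node):
    """Merge chains of single-child nodes that have no code of their own."""
    while node["code"] is None and len(node["children"]) == 1:
        (word, child), = node["children"].items()
        label = f"{label} {word}"
        node = child
    return {
        "label": label,
        "code": node["code"],
        "children": [compress(w, c) for w, c in node["children"].items()],
    }


def build_tree(lines):
    root = build_trie(lines)
    return [compress(w, c) for w, c in root["children"].items()]


entries = [
    "ABLATION ENDOMETRIAL (HYSTEROSCOPIC) 68.23",
    "ABLATION HEART (CONDUCTION DEFECT) 37.33/2",
    "ABLATION HEART (CONDUCTION DEFECT) WITH CATHETER 37.34/2",
]

tree = build_tree(entries)
ablation = tree[0]
print(ablation["label"], [child["label"] for child in ablation["children"]])
```

With the three sample entries, ABLATION ends up with two children, and "WITH CATHETER" hangs under "HEART (CONDUCTION DEFECT)". Terms such as "WITH" and "BY" may need special handling that this sketch does not attempt.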

Question: Can you tell me what sort of format the end result is supposed to
be in? Is it simply a data structure in memory? Or what?

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Chicken Salad Alchemist

I recycle.
I send everything back to the planet it came from.

Jun 21 '06 #7

Mike

Hi Kevin,

Since you appear to be rational about the actual objective of groups
(communication), I'll try to respond effectively.

> 1. You have a set of data that is pure text, and is either stored in an
> actual database, or in the text equivalent of a database as a multi-line
> text document. I can't be exactly sure.

Yes, the text is now stored in a database with a webservice front-end.

> 2. In any case, this data consists of multiple single-line entries of text.

More-or-less, yes. I can get the data as such after jumping through
some hoops (the user enters the first word of the index entry they want,
which retrieves all the codes that have an Index Entry property
starting with that word. Then, because each code
can have multiple index entries -- not necessarily all ones they want
-- I have to discard the ones that do not start with 'Ablation', for
example).

> 3. The data is stored in such a way that the text represents a hierarchical
> structure of nodes.

If I understand your comment, yes. Actually, sorted alphabetically,
ALL the index entries are in the order they would appear in a tree
except for "WITH, BY" and a couple other terms that can become nodes
themselves. The issue is that the surgical indexers are used to
viewing the data in a certain way and want to continue to view it
without all the duplicate text in surrounding index entries.

> 4. This is achieved by a top-level classification that is repeated in each
> "record" (line) for every record that falls under it.

Yes, I believe so.

> 5. Sub-nodes are indicated in the same way by the first text that follows
> the top-level node text.

Yes. Sub nodes are just the text that repeats across lines after the
repeated substring in a larger set of lines has been removed.

> 6. The node identifier text in the sub-nodes can be identified by comparing
> it with other records that are under the top-level node. There is no other
> way to distinguish this text from any other text in the record, other than
> by comparing it with other records.

Yes, because the actual text has no meta data to identify which is a
parent and which is a child.

> 7. Therefore, the structure of the hierarchy can be inferred by using a
> recursive procedure that identifies increasingly "deep" sub-nodes within the
> set of records.

Yes, I believe I can do this. On the old OS390 it was done kind of
like this with heavy parsing.

> 8. (Now here's where I'm a bit fuzzy). Your task is to put all of this into
> some form of data structure that can be used as an index, probably a
> hierarchical structure such as a tree.

Yes, exactly. I believe I can set up a recursive algorithm that will
populate a tree view control that will represent the codes in a way the
surgical coders are used to seeing. They will, for example, be able to
explode "ABLATION" and see subnodes of "ENDOMETRIAL (HYSTEROSCOPIC)
68.23" and "HEART (CONDUCTION DEFECT) 37.33/2".

Then, upon exploding those nodes any child nodes would be displayed,
etc.
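
As a rough illustration of that display, the sketch below renders a small nested structure as an indented outline, which is the text equivalent of expanding nodes in a tree view. It is Python and does not use any real tree view control; the `(label, code, children)` tuple layout and the `render` function are assumptions made for this example, and the codes are the ones from the flat listing earlier in the thread.

```python
# (label, code, children) tuples; a hypothetical layout for illustration only.
TREE = ("ABLATION", None, [
    ("ENDOMETRIAL (HYSTEROSCOPIC)", "68.23", []),
    ("HEART (CONDUCTION DEFECT)", "37.33/2", [
        ("WITH CATHETER", "37.34/2", []),
    ]),
])


def render(node, depth=0):
    """Return the node and all its descendants as indented lines."""
    label, code, children = node
    text = label if code is None else f"{label} {code}"
    lines = ["    " * depth + text]
    for child in children:
        lines.extend(render(child, depth + 1))
    return lines


print("\n".join(render(TREE)))
```

The same recursion would apply when filling a real control: add the node, then recurse into its children one level deeper.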

> Question: Will these records be ordered in any way? IOW, for example, will
> they be ordered alphabetically? If they are ordered alphabetically, the
> structure is already present, by virtue of the rules as stated above.
> Otherwise, it will be necessary to do some form of re-scanning of the data.

In my pre-processing I can sort these alphabetically. As always, the
question is really how to eliminate the duplicated text from
surrounding lines and correctly place children/parents in relation to
each other.

> Question: Can you tell me what sort of format the end result is supposed to
> be in? Is it simply a data structure in memory? Or what?

Simply a data structure in memory as the end users want to be able to
pull this up on demand as they are coding surgical cases.
Kevin Spencer wrote: Hi Mike,

As far as Top-Posting is concerned, AFAIK it's still a matter of debate, and
as we're talking about Netiquette, not ISO or W3C standards, my personal
feeling is that anyone who scolds one about top- or bottom-posting has poor
sense of priority. After all, the purpose of groups such as this is
communication. I find it far more difficult to deal with poor communication
than with the format of a post, but that's just me! ;-)

In your case, you have done a pretty darned good job of communication, and I
appreciate that, so I will certainly do all I can to help out! I did have to
do a little research into ICD9, but that wasn't hard with Google.

It took me a few minutes of study to figure out (for the most part) what
your requirements are. Let me see if I can repeat them back to you in my own
words, and ask a couple of questions:

1. You have a set of data that is pure text, and is either stored in an
actual database, or in the text equivalent of a database as a multi-line
text document. I can't be exactly sure.
2. In any case, this data consists multiple single-line entries of text.
3. The data is stored in such a way that the text represents a hierarchical
structure of nodes.
4. This is achieved by a top-level classification that is repeated in each
"record" (line) for every record that falls under it.
5. Sub-nodes are indicated in the same way by the first text that follows
the top-level node text.
6. The node identifier text in the sub-nodes can be identified by comparing
it with other records that are under the top-level node. There is no other
way to distinguish this text from any other text in the record, other than
by comparing it with other records.
7. Therefore, the structure of the hierarchy can be inferred by using a
recursive procedure that identifies increasingly "deep" sub-nodes within the
set of records.
8. (Now here's where I'm a bit fuzzy). Your task is to put all of this into
some form of data structure that can be used as an index, probably a
hierarchical structure such as a tree.

Question: Will these records be ordered in any way? IOW, for example, will
they be ordered alphabetically? If they are ordered alphabetically, the
structure is already present, by virtue of the rules as stated above.
Otherwise, it will be necessary to do some form of re-scanning of the data.

Question: Can you tell me what sort of format the end result is supposed to
be in? Is it simply a data structure in memory? Or what?

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Chicken Salad Alchemist

I recycle.
I send everything back to the planet it came from.

"Mike" <ms********@charter.net> wrote in message
news:11*********************@u72g2000cwu.googlegro ups.com...
Must say I get burned in six different ways. Some groups I top post
and get scolded. On other groups other people top post and nobody
appears to have a problem. I'll top post here.

Given I've been asked for details I'll provide them, but typically
nobody wants to wade through them.

In the dark ages I had 24,000 lines of ICD9 index entries which got
appended with ICD9 codes and were processed one time per year into a
big paper report with a tree-like structure by an assembler program on
an OS390. An abbreviated example of the report is below for the
Ablation entry.

Ablation
  Endometrial (Hysteroscopic) 68.23
  Heart (Conduction Defect) 37.33/2
    With Catheter 37.34/2
  Inner Ear (Cryosurgery) (Ultrasound) 20.79/4
    By Injection 20.72
  Lesion Heart
    By Peripherally Inserted Catheter 37.34

Across my institution in the past there have been multiple "master"
copies of ICD9 codes and index entries. The order came down that
long-term we will work towards a single copy of ICD9 codes with index
entries that will be accessed via webservices. The structure of the
data in our old database was as follows (no line breaks -- each entry
was one line):

ABLATION ENDOMETRIAL (HYSTEROSCOPIC) 68.23
ABLATION HEART (CONDUCTION DEFECT) 37.33/2
ABLATION HEART (CONDUCTION DEFECT) WITH CATHETER 37.34/2
ABLATION INNER EAR (CRYOSURGERY) (ULTRASOUND) 20.79/4
ABLATION INNER EAR (CRYOSURGERY) (ULTRASOUND) BY INJECTION 20.72
ABLATION LESION HEART BY PERIPHERALLY INSERTED CATHETER 37.34
ABLATION LESION HEART ENDOVASCULAR APPROACH 37.34
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) ENDOVASCULAR APPROACH 37.34
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) OPEN (TRANS-THORACIC) APPROACH 37.33
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) TRANS-THORACIC APPROACH 37.33
ABLATION PITUITARY 7.69
ABLATION PITUITARY BY COBALT-60 92.32
ABLATION PITUITARY BY IMPLANTATION (STRONTIUM-YTTRIUM) (Y) NEC 92.39
ABLATION PITUITARY BY PROTON BEAM (BRAGG PEAK) 92.33
ABLATION PROSTATE (ANAT = 59.02) BY LASER, TRANSURETHRAL 60.21
ABLATION PROSTATE (ANAT = 59.02) BY RADIOFREQUENCY THERMOTHERAPY 60.97
ABLATION PROSTATE (ANAT = 59.02) BY TRANSURETHRAL NEEDLE ABLATION (TUNA) 60.97
ABLATION PROSTATE (ANAT = 59.02) PERINEAL BY CRYOABLATION 60.62
ABLATION PROSTATE (ANAT = 59.02) PERINEAL BY RADICAL CRYOSURGICAL ABLATION (RCSA) 60.62
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL BY LASER 60.21
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL CRYOABLATION 60.29
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL RADICAL CRYOSURGICAL ABLATION (RCSA) 60.29
ABLATION TISSUE HEART - SEE ABLATION, LESION, HEART 0
ABLATION VESICLE NECK (ANAT = 60.02) 57.91

The new webservices still have this same index structure except now,
for example, "Ablation Vesicle Neck (ANAT = 60.02)" is just a property
of code 57.91. The surgical coders still want to view the index
entries in a tree structure on demand. Without getting into
mind-numbing details, I can jump through some hoops and get back a set
of index entries that look like above for ABLATION but they are not
formatted in the way the surgical coders desire. I believe I have a
recursive algorithm that will work to format these into a tree
structure but this algorithm is predicated on being able to find the
nodes.

If you look carefully, the root node for the entire set of index entries
above is "ABLATION" (as that is what begins each entry and repeats
across all of them). Subsequently, Endometrial (Hysteroscopic) + code
is a child of ABLATION with no children of its own because it is not
repeated. Next, Heart (Conduction Defect) + code is a node with "With
Catheter + code" as a child of that node because "Heart (Conduction
Defect)" repeats across both those lines.

I have begged the group that now owns the webservice to allow me to
restructure the data but no go (they say that would be bastardizing the
concept of everything being 'code-centric'). I am stuck with this and
also with the demand by the coders that they get the formatted tree
structure to look at when they code.

In general, I think if I do the following I can figure out the nodes
and children:

1. Read index entries until the first word changes.
2. Get the substring that begins the string and is repeated elsewhere
in the string (this is the node).
3. Remove that node and keep processing until the base case is hit etc.

If anyone has any better ideas of how to deal with this I would be
thrilled to no end to hear them.

Thanks,

Mike

Kevin Spencer wrote:
> I want to access the expression "HEART (CONDUCTION DEFECT)" I'll try
> your suggestion first off in the morning.

First, "HEART (CONDUCTION DEFECT)" is not an expression. That is a
substring of the original string. The regular expression is the string
"^(.+)(?=\s*).*\1" that you are using to get your match. Assuming that
"HEART (CONDUCTION DEFECT)" is your match (which it is not), you could
call it a match for the regular expression (which may match more than
once in a string). But it is a substring of the original string. It may
seem picky, but in order to communicate effectively, one must use the
right terms. As an example, if I told you that I ate a car for breakfast,
would you know that I ate an apple?

Second, the string you posted contains 2 instances of the substring
"HEART (CONDUCTION DEFECT)". Do you want to get both of them? If so, what
exactly are your pattern-matching rules? A regular expression matches a
pattern. Obviously, not all of the strings you will be working with will
be:

" HEART (CONDUCTION
DEFECT) 37.33/2 HEART (CONDUCTION DEFECT) WITH
CATHETER 37.34/2 "

In fact, probably due to this being a newsgroup, and my using a
newsreader, I would doubt that the line breaks in the string are where
they are, if they are. And I have to wonder whether the string actually
begins and ends with a space.

In other words, you're going to be using a regular expression to isolate
substrings of various strings (most probably). A regular expression is
shorthand for a set of rules that defines a pattern you're looking for.
Whether the strings contain line breaks, for example, is important. Your
regular expression begins with the caret '^' character. This character
can indicate the beginning of a string, or the beginning of a line *or* a
string, depending upon what options you use. You didn't specify the
option(s) you're using, so we have no way to know.

In addition, your pattern is not likely to work in the way you expect.
For example, the following would match:

THIS IS NOT WHAT THIS IS SUPPOSED TO BE. (Matches the phrase "THIS IS ")

And in addition, if there are line breaks, like your example (as split by
the newsreader), the matching substring would be:

DEFECT) 37.33/2 HEART (CONDUCTION DEFECT)

So, can you explain what your rules are, and what you are trying to match
here? I'm just guessing that you're parsing medical transcriptions, but
beyond that, I'm stumped.

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Chicken Salad Alchemist

I recycle.
I send everything back to the planet it came from.
"Mike" <ms********@charter.net> wrote in message
news:11**********************@p79g2000cwp.googlegr oups.com...
> Xicheng Jia wrote:
>> Mike wrote:
>> > I have a regular expression (^(.+)(?=\s*).*\1 ) that results in
>> > matches. I would like to get what the actual regular expression is.
>> > In other words, when I apply ^(.+)(?=\s*).*\1 to " HEART
>> > (CONDUCTION
>> > DEFECT) 37.33/2 HEART (CONDUCTION DEFECT) WITH
>> > CATHETER 37.34/2 " the expression is "HEART (CONDUCTION DEFECT)".
>> > How
>> > do I gain access to the expression (not the matches) at runtime?
>>
>> you want to access the expression "HEART (CONDUCTION DEFECT)" or the
>> regex "^(.+)(?=\s*).*\1" at run-time?? dont think you can get exactly
>> the latter one though, for the previous one, you can use named
>> capture,
>> like
>>
>> ^(?<expr>.+)(?=\s*).*\k<expr>
>>
>> and access the variable "expr" at run time?
>>
>> Xicheng
>
> I want to access the expression "HEART (CONDUCTION DEFECT)" I'll try
> your suggestion first off in the morning.
>
> Thanks,
>

Jun 21 '06 #8

Kevin Spencer

Hi Mike,

I fiddled with this problem using regular expressions for entirely too long
last night, and finally came to the conclusion that regular expressions
aren't going to provide what you need in this case. As you discovered, your
regular expression solution, which was about as close as one could get to
something that works with regular expressions, can't identify an unknown
pattern and then match that, which is essentially what you tried valiantly
to do. I have to give you credit for creativity!

Of course, this doesn't bring you any closer to a solution, so I gave that
some thought as well. It seems to me that you're looking for some sort of
recursive nested looping function. Once the data is sorted alphabetically,
it's basically a matter of comparing each line with the line that follows.
If you can be sure that the pattern will break on a word break (space), the
task becomes easier. I'll try to sketch something out along the lines of
what I'm thinking, and you can see what you think and perhaps flesh it out:

This is the comparison method. It does a char-by-char comparison of 2
strings, returning the number of chars that match from the beginning of the
first string. If you can be sure that your nodes will break on spaces, you
could optimize this by using a word-by-word comparison.

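// Needs "using System;" for Math.Min.
string[] items;

// Compare each char of a string in an array with
// each char of the next string in the array, and
// return the number of leading chars the two strings share.
// A negative length means "compare up to the full length of items[index]".
int Match(int index, int length)
{
    // The last item has no successor to compare against.
    if (index >= items.Length - 1) return 0;
    string current = items[index];
    string next = items[index + 1];
    int maxLength = (length < 0) ? current.Length : Math.Min(length, current.Length);
    // Never read past the end of the shorter string.
    maxLength = Math.Min(maxLength, next.Length);
    for (int i = 0; i < maxLength; i++)
    {
        // i chars matched before the first difference.
        if (current[i] != next[i]) return i;
    }
    return maxLength;
}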
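
As an aside, the word-by-word variant mentioned above might look something like the following. This is only a sketch under assumptions: the class and method names (PrefixMatcher, CommonWordCount, CommonWordPrefix) are made up for illustration, and it assumes that node text always breaks on single spaces.

```csharp
using System;

public static class PrefixMatcher
{
    // Returns the number of leading whole words shared by a and b.
    // Assumption: words are separated by single spaces.
    public static int CommonWordCount(string a, string b)
    {
        string[] wordsA = a.Split(' ');
        string[] wordsB = b.Split(' ');
        int max = Math.Min(wordsA.Length, wordsB.Length);
        int count = 0;
        while (count < max && wordsA[count] == wordsB[count]) count++;
        return count;
    }

    // Returns the shared leading words joined back into one string.
    public static string CommonWordPrefix(string a, string b)
    {
        int count = CommonWordCount(a, b);
        return string.Join(" ", a.Split(' '), 0, count);
    }
}
```

Comparing whole words avoids splitting a node in the middle of a word when two neighbouring entries happen to share a few leading letters.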

Now, what I would do with this is, since you want to create a hierarchical
tree, use the System.Xml Namespace, and an XmlDocument class to create your
in-memory structure. You could certainly create your own lightweight
hierarchical node type, but this way, if the need ever arises (and it
probably will) that you want to transform your data to another format, you
have the ability to use the XmlDocument class as an XML Document, and
transform it any way you like (including as pure XML text), one of the
beauties of XML.
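
To make the XmlDocument idea a little more concrete, here is a minimal sketch of adding nodes to such a document. The element name "node", the attribute name "text", and the IndexXml class with its AddNode and BuildSample methods are all arbitrary choices for this illustration, not anything prescribed by the data.

```csharp
using System;
using System.Xml;

public static class IndexXml
{
    // Appends a <node text="..."> element under parent and returns it.
    // parent may be the XmlDocument itself (for the root) or any element.
    public static XmlElement AddNode(XmlNode parent, string text)
    {
        // OwnerDocument is null when parent is the document itself.
        XmlDocument doc = (parent as XmlDocument) ?? parent.OwnerDocument;
        XmlElement element = doc.CreateElement("node");
        element.SetAttribute("text", text);
        parent.AppendChild(element);
        return element;
    }

    // Builds a tiny hand-made tree from a few of the ABLATION entries.
    public static XmlDocument BuildSample()
    {
        XmlDocument doc = new XmlDocument();
        XmlElement root = AddNode(doc, "ABLATION");
        AddNode(root, "ENDOMETRIAL (HYSTEROSCOPIC) 68.23");
        XmlElement heart = AddNode(root, "HEART (CONDUCTION DEFECT) 37.33/2");
        AddNode(heart, "WITH CATHETER 37.34/2");
        return doc;
    }
}
```

Once the nodes are in an XmlDocument, doc.OuterXml gives the plain XML text, and the same document can be walked to fill whatever display the coders need.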

Once you've created your root node, you loop through the "array", calling
the Match method for each item in the "array" until it returns 0. You
initialize it by passing -1 to it, which indicates that it compares the
entire length of the first string. After that, you pass the return value
from the first comparison, which gives you the length of the first child
node. At this point you have your first sub-grouping, and your first child
node, which is the substring of the first string having the length returned
by the first comparison.

If the number of iterations is less than the length of the "array" you start
again with the next item in the array, in the same manner as the first. Each
pass of this routine returns a "node" and the length of the node value.

You recursively repeat this process for each subset of each node, starting
with the length of the node value, and using the substring starting from
that point for each element in the subset. This adds a list of nodes to each
node, and recursively does the same for each child node of each node, and so
on. When you have reached the end of all the strings, you're done.
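
The recursive pass described above could be sketched roughly as follows. This is not a finished implementation: it compares whole words rather than chars, it assumes the entries are already sorted so that related lines sit next to each other, and the IndexNode and IndexTree names are invented for the example. A lightweight node class is used here to keep the sketch short; the same calls could just as well add elements to an XmlDocument.

```csharp
using System;
using System.Collections.Generic;

public class IndexNode
{
    public string Text;
    public List<IndexNode> Children = new List<IndexNode>();
    public IndexNode(string text) { Text = text; }
}

public static class IndexTree
{
    // Builds a tree from entries that are assumed to be sorted already.
    public static IndexNode Build(string[] sortedLines)
    {
        List<string[]> entries = new List<string[]>();
        foreach (string line in sortedLines)
            entries.Add(line.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));
        IndexNode root = new IndexNode("");
        AddChildren(root, entries, 0, entries.Count, 0);
        return root;
    }

    // Groups entries[start..end) by the word at position depth, extends each
    // group's label for as long as every member agrees, then recurses.
    static void AddChildren(IndexNode parent, List<string[]> entries, int start, int end, int depth)
    {
        int i = start;
        while (i < end)
        {
            // Nothing left of this entry below the current depth.
            if (entries[i].Length <= depth) { i++; continue; }

            // Find the run of consecutive entries sharing the word at depth.
            int j = i + 1;
            while (j < end && entries[j].Length > depth && entries[j][depth] == entries[i][depth]) j++;

            // Extend the label while all entries in the run share the next word.
            int len = 1;
            while (AllShareWord(entries, i, j, depth + len)) len++;

            IndexNode node = new IndexNode(string.Join(" ", entries[i], depth, len));
            parent.Children.Add(node);
            if (j - i > 1) AddChildren(node, entries, i, j, depth + len);
            i = j;
        }
    }

    // True when every entry in [start, end) has the same word at position.
    static bool AllShareWord(List<string[]> entries, int start, int end, int position)
    {
        for (int k = start; k < end; k++)
        {
            if (entries[k].Length <= position) return false;
            if (entries[k][position] != entries[start][position]) return false;
        }
        return true;
    }
}
```

One difference from the old paper report: where a node has children, its own code (for example 37.33/2 under "HEART (CONDUCTION DEFECT)") comes out as a separate child rather than on the node's line, so a small post-pass would be needed to fold it back in if that layout matters.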

This is about as elegant a solution as I can come up with. I'll be
interested to hear about your final solution.

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Chicken Salad Alchemist

I recycle.
I send everything back to the planet it came from.

"Mike" <ms********@charter.net> wrote in message
news:11**********************@m73g2000cwd.googlegr oups.com...

Hi Kevin,

Since you appear to be rational about the actual objective of groups
(communication), I'll try to respond effectively.

> 1. You have a set of data that is pure text, and is either stored in an
> actual database, or in the text equivalent of a database as a multi-line
> text document. I can't be exactly sure.

Yes, the text is now stored in a database with a webservice front-end.

> 2. In any case, this data consists of multiple single-line entries of text.

More-or-less, yes. I can get the data as such after jumping through
some hoops (user enters the first word of the index entry they want
which retrieves all the codes that have an index entry that has an
Index Entry property starting with that word. Then, because each code
can have multiple index entries -- not necessarily all ones they want
-- I have to discard the ones that do not start with 'Ablation' for
example).

> 3. The data is stored in such a way that the text represents a
> hierarchical structure of nodes.

If I understand your comment, yes. Actually, sorted alphabetically,
ALL the index entries are in the order they would appear in a tree
except for "WITH, BY" and a couple other terms that can become nodes
themselves. The issue is that the surgical indexers are used to
viewing the data in a certain way and want to continue to view it
without all the duplicate text in surrounding index entries.

> 4. This is achieved by a top-level classification that is repeated in
> each "record" (line) for every record that falls under it.

Yes, I believe so.

> 5. Sub-nodes are indicated in the same way by the first text that follows
> the top-level node text.

Yes. Sub nodes are just the text that repeats across lines after the
repeated substring in a larger set of lines has been removed.

> 6. The node identifier text in the sub-nodes can be identified by
> comparing it with other records that are under the top-level node. There
> is no other way to distinguish this text from any other text in the
> record, other than by comparing it with other records.

Yes, because the actual text has no meta data to identify which is a
parent and which is a child.

> 7. Therefore, the structure of the hierarchy can be inferred by using a
> recursive procedure that identifies increasingly "deep" sub-nodes within
> the set of records.

Yes, I believe I can do this. On the old OS390 it was done kind of
like this with heavy parsing.

> 8. (Now here's where I'm a bit fuzzy). Your task is to put all of this
> into some form of data structure that can be used as an index, probably a
> hierarchical structure such as a tree.

Yes, exactly, I believe I can set up a recursive algorithm that will
populate a tree view control that will represent the codes in a way the
surgical coders are used to seeing. They will, for example, be able to
explode "ABLATION" and see subnodes of "ENDOMETRIAL (HYSTEROSCOPIC)
68.23" and "Heart (Conduction Defect) 37.33/2".

Then, upon exploding those nodes any child nodes would be displayed,
etc.

> Question: Will these records be ordered in any way? IOW, for example,
> will they be ordered alphabetically? If they are ordered alphabetically,
> the structure is already present, by virtue of the rules as stated above.
> Otherwise, it will be necessary to do some form of re-scanning of the
> data.

In my pre-processing I can sort these alphabetically. As always, the
question is really how to eliminate the duplicated text from
surrounding lines and correctly place children/parents in relation to
each other.

> Question: Can you tell me what sort of format the end result is supposed
> to be in? Is it simply a data structure in memory? Or what?

Simply a data structure in memory as the end users want to be able to
pull this up on demand as they are coding surgical cases.

Jun 22 '06 #9

Mike

Hi Kevin,

I have been working on this problem on and off for maybe 9 months. New
ICD9 codes are released in October so I have a few months to get a
solution in place. I am going to see if there is some other way to get
the owners of the webservices to offer something different. I'd like
to get the data in a more structured form (i.e., something similar to this:
http://www.developerfusion.co.uk/show/4633/). Any idea if retrieving
data structured like in the article would fit well with the C# tree
structure?

I will also look at your suggestion, but I had previously tried to
implement something similar with a two-dimensional array and the
problem was with performance. Very slow for those entries with several
hundred rows such as "Biopsy"...

Mike

Kevin Spencer wrote:

Hi Mike,

I fiddled with this problem using regular expressions for entirely too long
last night, and finally came to the conclusion that regular expressions
aren't going to provide what you need in this case. As you discovered, your
regular expression solution, which was about as close as one could get to
something that works with regular expressions, can't identify an unknown
pattern and then match that, which is essentially what you tried valiantly
to do. I have to give you credit for creativity!

Of course, this doesn't bring you any closer to a solution, so I gave that
some thought as well. It seems to me that you're looking for some sort of
recursive nested looping function. Once the data is sorted alphabetically,
it's basically a matter of comparing each line with the line that follows.
If you can be sure that the pattern will break on a word break (space), the
task becomes easier. I'll try to sketch something out along the lines of
what I'm thinking, and you can see what you think and perhaps flesh it out:

This is the comparison method. It does a char-by-char comparison of 2
strings, returning the number of chars that match from the beginning of the
first string. If you can be sure that your nodes will break on spaces, you
could optimize this by using a word-by-word comparison.

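// Needs "using System;" for Math.Min.
string[] items;

// Compare each char of a string in an array with
// each char of the next string in the array, and
// return the number of leading chars the two strings share.
// A negative length means "compare up to the full length of items[index]".
int Match(int index, int length)
{
    // The last item has no successor to compare against.
    if (index >= items.Length - 1) return 0;
    string current = items[index];
    string next = items[index + 1];
    int maxLength = (length < 0) ? current.Length : Math.Min(length, current.Length);
    // Never read past the end of the shorter string.
    maxLength = Math.Min(maxLength, next.Length);
    for (int i = 0; i < maxLength; i++)
    {
        // i chars matched before the first difference.
        if (current[i] != next[i]) return i;
    }
    return maxLength;
}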

Now, what I would do with this is, since you want to create a hierarchical
tree, use the System.Xml Namespace, and an XmlDocument class to create your
in-memory structure. You could certainly create your own lightweight
hierarchical node type, but this way, if the need ever arises (and it
probably will) that you want to transform your data to another format, you
have the ability to use the XmlDocument class as an XML Document, and
transform it any way you like (including as pure XML text), one of the
beauties of XML.

Once you've created your root node, you loop through the "array", calling
the Match method for each item in the "array" until it returns 0. You
initialize it by passing -1 to it, which indicates that it compares the
entire length of the first string. After that, you pass the return value
from the first comparison, which gives you the length of the first child
node. At this point you have your first sub-grouping, and your first child
node, which is the substring of the first string having the length returned
by the first comparison.

If the number of iterations is less than the length of the "array" you start
again with the next item in the array, in the same manner as the first. Each
pass of this routine returns a "node" and the length of the node value.

You recursively repeat this process for each subset of each node, starting
with the length of the node value, and using the substring starting from
that point for each element in the subset. This adds a list of nodes to each
node, and recursively does the same for each child node of each node, and so
on. When you have reached the end of all the strings, you're done.

This is about as elegant a solution as I can come up with. I'll be
interested to hear about your final solution.

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Chicken Salad Alchemist

I recycle.
I send everything back to the planet it came from.

"Mike" <ms********@charter.net> wrote in message
news:11**********************@m73g2000cwd.googlegr oups.com...
Hi Kevin,

Since you appear to be rational about the actual objective of groups
(communication), I'll try to respond effectively.
1. You have a set of data that is pure text, and is either stored in an
actual database, or in the text equivalent of a database as a multi-line
text document. I can't be exactly sure.

Yes, the text is now stored in a database with a webservice front-end.
2. In any case, this data consists of multiple single-line entries of text.

More-or-less, yes. I can get the data as such after jumping through
some hoops (user enters the first word of the index entry they want
which retrieves all the codes that have an index entry that has an
Index Entry property starting with that word. Then, before each code
can have multiple index entries -- not necessarily all ones they want
-- I have to discard the ones that do not start with 'Ablation' for
example).
3. The data is stored in such a way that the text represents a
hierarchical
structure of nodes.

If I understand your comment, yes. Actually, sorted alphabetically,
ALL the index entries are in the order they would appear in a tree
except for "WITH, BY" and a couple other terms that can become nodes
themselves. The issues is that the surgical indexers are used to
viewing the data in a certain way and want to continue to view it
without all the duplicate text in surrounding index entries.
4. This is achieved by a top-level classification that is repeated in
each
"record" (line) for every record that falls under it.

Yes, I believe so.
5. Sub-nodes are indicated in the same way by the first text that follows
the top-level node text.

Yes. Sub nodes are just the text that repeats across lines after the
repeated substring in a larger set of lines has been removed.
6. The node identifier text in the sub-nodes can be identified by
comparing
it with other records that are under the top-level node. There is no
other
way to distinguish this text from any other text in the record, other
than
by comparing it with other records.

Yes, because the actual text has no meta data to identify which is a
parent and which is a child.
7. Therefore, the structure of the hierarchy can be inferred by using a
recursive procedure that identifies increasingly "deep" sub-nodes within
the
set of records.

Yes, I believe I can do this. On the old OS390 it was done kind of
like this with heavy parsing.
8. (Now here's where I'm a bit fuzzy). Your task is to put all of this
into
some form of data structure that can be used as an index, probably a
hierarchical structure such as a tree.

Yes, exactly, I believe I can setup a recursive algorithm that will
populate a tree view control that will represent the codes in a way the
surgical coders are used to seeing. They will, for example, be able to
explode "ABLATION" and see subnodes of "ENDOMETRIAL (HYSTEROSCOPIC)
68.23"
"Heart (Conduction Defect) 27.33/2"

Then, upon exploding those nodes any child nodes would be displayed,
etc.
Question: Will these records be ordered in any way? IOW, for example,
will
they be ordered alphabetically? If they are ordered alphabetically, the
structure is already present, by virtue of the rules as stated above.
Otherwise, it will be necessary to do some form of re-scanning of the
data.

In my pre-processing I can sort these alphabetically. As always, the
question is really how to eliminate the duplicated text from
surrounding lines and correctly place children/parents in relation to
each other.

Question: Can you tell me what sort of format the end result is supposed
to
be in? Is it simply a data structure in memory? Or what?

Simply a data structure in memory as the end users want to be able to
pull this up on demand as they are coding surgical cases.
Kevin Spencer wrote:
Hi Mike,

As far as Top-Posting is concerned, AFAIK it's still a matter of debate,
and as we're talking about Netiquette, not ISO or W3C standards, my
personal feeling is that anyone who scolds one about top- or
bottom-posting has a poor sense of priority. After all, the purpose of
groups such as this is communication. I find it far more difficult to
deal with poor communication than with the format of a post, but that's
just me! ;-)

In your case, you have done a pretty darned good job of communication,
and I appreciate that, so I will certainly do all I can to help out! I did
have to do a little research into ICD9, but that wasn't hard with Google.

It took me a few minutes of study to figure out (for the most part) what
your requirements are. Let me see if I can repeat them back to you in my
own words, and ask a couple of questions:

1. You have a set of data that is pure text, and is either stored in an
actual database, or in the text equivalent of a database as a multi-line
text document. I can't be exactly sure.
2. In any case, this data consists of multiple single-line entries of text.
3. The data is stored in such a way that the text represents a
hierarchical structure of nodes.
4. This is achieved by a top-level classification that is repeated in
each "record" (line) for every record that falls under it.
5. Sub-nodes are indicated in the same way by the first text that follows
the top-level node text.
6. The node identifier text in the sub-nodes can be identified by
comparing it with other records that are under the top-level node. There
is no other way to distinguish this text from any other text in the
record, other than by comparing it with other records.
7. Therefore, the structure of the hierarchy can be inferred by using a
recursive procedure that identifies increasingly "deep" sub-nodes within
the set of records.
8. (Now here's where I'm a bit fuzzy). Your task is to put all of this
into some form of data structure that can be used as an index, probably a
hierarchical structure such as a tree.

Question: Will these records be ordered in any way? IOW, for example,
will they be ordered alphabetically? If they are ordered alphabetically,
the structure is already present, by virtue of the rules as stated above.
Otherwise, it will be necessary to do some form of re-scanning of the
data.

Question: Can you tell me what sort of format the end result is supposed
to be in? Is it simply a data structure in memory? Or what?

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Chicken Salad Alchemist

I recycle.
I send everything back to the planet it came from.

"Mike" <ms********@charter.net> wrote in message
news:11*********************@u72g2000cwu.googlegroups.com...
> Must say I get burned in six different ways. Some groups I top post
> and get scolded. On other groups others people top post and nobody
> appears to have a problem. I'll top post here.
>
> Given I've been asked for details I'll provide them, but typically
> nobody wants to wade through them.
>
> In the dark ages I had 24,000 lines of ICD9 index entries which got
> appended with ICD9 codes and were processed one time per year into a
> big paper report with a tree-like structure by an assembler program on
> an OS390. An abbreviated example of the report is below for the
> Ablation entry.
>
> Ablation
> Endometrial (Hysteroscopic) 68.23
> Heart (Conduction Defect) 27.33/2
> With Catheter 37.34/2
> Inner Ear (Cryosurgery) (Ultrasound) 20.79/4
> By Injection 20.72
> Lesion Heart
> By Peripherally Inserted Catheter 37.34
>
> Across my institution in the past there have been multiple "master"
> copies of ICD9 codes and index entries. The order came down that
> long-term we will work towards a single copy of ICD9 codes with index
> entries that will be accessed via webservices. The structure of the
> data in our old database was as follows (no line breaks -- each entry
> was one line):
>
> ABLATION ENDOMETRIAL (HYSTEROSCOPIC) 68.23
> ABLATION HEART (CONDUCTION DEFECT) 37.33/2
> ABLATION HEART (CONDUCTION DEFECT) WITH CATHETER 37.34/2
> ABLATION INNER EAR (CRYOSURGERY) (ULTRASOUND) 20.79/4
> ABLATION INNER EAR (CRYOSURGERY) (ULTRASOUND) BY INJECTION 20.72
> ABLATION LESION HEART BY PERIPHERALLY INSERTED CATHETER 37.34
> ABLATION LESION HEART ENDOVASCULAR APPROACH 37.34
> ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) ENDOVASCULAR
> APPROACH 37.34
> ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) OPEN (TRANS-THORACIC)
> APPROACH 37.33
> ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) TRANS-THORACIC
> APPROACH 37.33
> ABLATION PITUITARY 7.69
> ABLATION PITUITARY BY COBALT-60 92.32
> ABLATION PITUITARY BY IMPLANTATION (STRONTIUM-YTTRIUM) (Y) NEC 92.39
> ABLATION PITUITARY BY PROTON BEAM (BRAGG PEAK) 92.33
> ABLATION PROSTATE (ANAT = 59.02) BY LASER, TRANSURETHRAL 60.21
> ABLATION PROSTATE (ANAT = 59.02) BY RADIOFREQUENCY THERMOTHERAPY
> 60.97
> ABLATION PROSTATE (ANAT = 59.02) BY TRANSURETHRAL NEEDLE ABLATION
> (TUNA) 60.97
> ABLATION PROSTATE (ANAT = 59.02) PERINEAL BY CRYOABLATION 60.62
> ABLATION PROSTATE (ANAT = 59.02) PERINEAL BY RADICAL CRYOSURGICAL
> ABLATION (RCSA) 60.62
> ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL BY LASER 60.21
> ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL CRYOABLATION 60.29
> ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL RADICAL CRYOSURGICAL
> ABLATION (RCSA) 60.29
> ABLATION TISSUE HEART - SEE ABLATION, LESION, HEART 0
> ABLATION VESICLE NECK (ANAT = 60.02) 57.91
>
>
> The new webservices still have this same index structure except now,
> for example, "Ablation Vesicle Neck (ANAT = 60.02)" is just a property
> of code 57.91. The surgical coders still want to view the index
> entries in a tree structure on demand. Without getting into
> mind-numbing details, I can jump through some hoops and get back a set
> of index entries that look like above for ABLATION but they are not
> formatted in the way the surgical coders desire. I believe I have a
> recursive algorithm that will work to format these into a tree
> structure but this algorithm is predicated on being able to find the
> nodes.
>
> If you look carefully, the root node for entire set of index entries
> above is "ABLATION" (as that is what begins each entry and repeats
> across all of them). Subsequently, Endometrial (Hysteroscopic) + code
> is a child of ABLATION with no children of its own because it is not
> repeated. Next, Heart (Conduction Defect) + code is a node with "With
> Catheter + code" as a child of that node because "Heart (Conduction
> Defect)" repeats across both those lines.
>
> I have begged the group that now owns the webservice to allow me to
> restructure the data but no go (they say that would be bastardizing the
> concept of everything being 'code-centric'). I am stuck with this and
> also with the demand by the coders that they get the formatted tree
> structure to look at when they code.
>
> In general, I think if I do the following I can figure out the nodes
> and children:
>
> 1. Read index entries until the first word changes.
> 2. Get the substring that begins the string and is repeated elsewhere
> in the string (this is the node).
> 3. Remove that node and keep processing until the base case is hit etc.
>
> If anyone has any better ideas of how to deal with this I would be
> thrilled to no end to hear them.
>
>
> Thanks,
>
> Mike
>
>
> Kevin Spencer wrote:
>> > I want to access the expression "HEART (CONDUCTION DEFECT)" I'll
>> > try
>> > your suggestion first off in the morning.
>>
>> First, "HEART (CONDUCTION DEFECT)" is not an expression. That is a
>> substring of the original string. The regular expression is the string
>> "^(.+)(?=\s*).*\1" that you are using to get your match. Assuming that
>> "HEART (CONDUCTION DEFECT)" is your match (which it is not), you
>> could
>> call
>> it a match for the regular expression (which may match more than once
>> in
>> a
>> string). But it is a substring of the original string. It may seem
>> picky,
>> but in order to communicate effectively, one must use the right terms.
>> As
>> an
>> example, if I told you that I ate a car for breakfast, would you know
>> that I
>> ate an apple?
>>
>> Second, the string you posted contains 2 instances of the substring
>> "HEART
>> (CONDUCTION DEFECT)". Do you want to get both of them? If so, what
>> exactly
>> are your pattern-matching rules? A regular expression matches a
>> pattern.
>> Obviously, not all of the strings you will be working with will be:
>>
>> " HEART (CONDUCTION
>> DEFECT) 37.33/2 HEART (CONDUCTION DEFECT) WITH
>> CATHETER 37.34/2 "
>>
>> In fact, probably due to this being a newsgroup, and my using a
>> newsreader,
>> I would doubt that the line breaks in the string are where they are,
>> if
>> they
>> are. And I have to wonder whether the string actually begins and ends
>> with a
>> space.
>>
>> In other words, you're going to be using a regular expression to
>> isolate
>> substrings of various strings (most probably). A regular expression is
>> shorthand for a set of rules that defines a pattern you're looking
>> for.
>> Whether the strings contain line breaks, for example, is important.
>> Your
>> regular expression begins with the caret '^' character. This character
>> can
>> indicate the beginning of a string, or the beginning of a line *or* a
>> string, depending upon what options you use. You didn't specify the
>> option(s) you're using, so we have no way to know.
>>
>> In addition, your pattern is not likely to work in the way you expect.
>> for
>> example, the following would match:
>>
>> THIS IS NOT WHAT THIS IS SUPPOSED TO BE. (Matches the phrase "THIS IS
>> ")
>>
>> And in addition, if there are line breaks, like your example (as split
>> by
>> the newsreader), the matching substring would be:
>>
>> DEFECT) 37.33/2 HEART (CONDUCTION DEFECT)
>>
>> So, can you explain what your rules are, and what you are trying to
>> match
>> here? I'm just guessing that you're parsing medical transcriptions,
>> but
>> beyond that, I'm stumped.
>>
>> --
>> HTH,
>>
>> Kevin Spencer
>> Microsoft MVP
>> Professional Chicken Salad Alchemist
>>
>> I recycle.
>> I send everything back to the planet it came from.
>>
>>
>> "Mike" <ms********@charter.net> wrote in message
>> news:11**********************@p79g2000cwp.googlegroups.com...
>> > Xicheng Jia wrote:
>> >> Mike wrote:
>> >> > I have a regular expression (^(.+)(?=\s*).*\1 ) that results in
>> >> > matches. I would like to get what the actual regular expression
>> >> > is.
>> >> > In other words, when I apply ^(.+)(?=\s*).*\1 to " HEART
>> >> > (CONDUCTION
>> >> > DEFECT) 37.33/2 HEART (CONDUCTION DEFECT) WITH
>> >> > CATHETER 37.34/2 " the expression is "HEART (CONDUCTION
>> >> > DEFECT)".
>> >> > How
>> >> > do I gain access to the expression (not the matches) at runtime?
>> >>
>> >> you want to access the expression "HEART (CONDUCTION DEFECT)" or
>> >> the
>> >> regex "^(.+)(?=\s*).*\1" at run-time?? dont think you can get
>> >> exactly
>> >> the latter one though, for the previous one, you can use named
>> >> capture,
>> >> like
>> >>
>> >> ^(?<expr>.+)(?=\s*).*\k<expr>
>> >>
>> >> and access the variable "expr" at run time?
>> >>
>> >> Xicheng
>> >
>> > I want to access the expression "HEART (CONDUCTION DEFECT)" I'll
>> > try
>> > your suggestion first off in the morning.
>> >
>> > Thanks,
>> >
>

Jun 22 '06 #10

chornbe

Mike wrote:

I have a regular expression (^(.+)(?=\s*).*\1 ) that results in
matches. I would like to get what the actual regular expression is.
In other words, when I apply ^(.+)(?=\s*).*\1 to " HEART (CONDUCTION
DEFECT) 37.33/2 HEART (CONDUCTION DEFECT) WITH
CATHETER 37.34/2 " the expression is "HEART (CONDUCTION DEFECT)". How
do I gain access to the expression (not the matches) at runtime?

Thanks,

Mike

Isn't the zero-th match (or group?) the string you're searching?
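
For what it's worth, here is a minimal sketch of what the .NET Match object exposes, assuming the sample string from the original post is really a single line with no leading or trailing space. Group 0 is the text the pattern matched rather than the whole input, the named group from Xicheng's suggestion holds the repeated leading substring, and Regex.ToString() gives back the pattern itself. The class name GroupDemo is only illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupDemo
{
    // Sample from the original post, assumed here to be one line.
    public const string Input =
        "HEART (CONDUCTION DEFECT) 37.33/2 HEART (CONDUCTION DEFECT) WITH CATHETER 37.34/2";

    // Named-capture variant of the pattern suggested earlier in the thread.
    public static readonly Regex Pattern = new Regex(@"^(?<expr>.+)(?=\s*).*\k<expr>");

    public static Match Run()
    {
        return Pattern.Match(Input);
    }

    static void Main()
    {
        Match m = Run();
        // The pattern text itself is available from the Regex object.
        Console.WriteLine("pattern: [{0}]", Pattern.ToString());
        // Group 0 is the whole match, which stops at the end of the
        // second occurrence of the repeated text, not at the end of the input.
        Console.WriteLine("group 0: [{0}]", m.Groups[0].Value);
        // The named group is the repeated leading substring. It keeps the
        // trailing space that follows it in the input, so trim before use.
        Console.WriteLine("expr:    [{0}]", m.Groups["expr"].Value.Trim());
    }
}
```

So the zero-th group is not the string being searched; the searched string is simply whatever was passed to Match.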

Jun 22 '06 #11

Kevin Spencer

Hi Mike,

The article looks very much like what I mentioned regarding creating your
own tree structure. The reason I suggested an XmlDocument is that it is also
a hierarchical tree, but can be transformed easily into virtually any other
format, including HTML, database, etc., and is also
cross-platform-compatible (pure text). The System.Xml namespace has plenty
of ready-made classes, such as XmlNode, XmlElement, XmlDocument, etc., which
carry a small performance penalty, due to their conformance to the XML
standard, but I would think the performance penalty was well worth it,
considering the extensibility of the result.
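
As a rough sketch of the XmlDocument idea, assuming a made-up element name "node" with "text" and "code" attributes (none of these names come from the webservice), a few of the Ablation entries could be held like this:

```csharp
using System;
using System.Xml;

static class XmlTreeDemo
{
    // Appends a <node> element to the parent. The element and attribute
    // names are illustrative choices, not part of any existing schema.
    public static XmlElement AddNode(XmlNode parent, string text, string code)
    {
        // OwnerDocument is null when the parent is the document itself.
        XmlDocument doc = parent.OwnerDocument ?? (XmlDocument)parent;
        XmlElement element = doc.CreateElement("node");
        element.SetAttribute("text", text);
        if (code != null) element.SetAttribute("code", code);
        parent.AppendChild(element);
        return element;
    }

    public static XmlDocument Build()
    {
        XmlDocument doc = new XmlDocument();
        XmlElement root = AddNode(doc, "ABLATION", null);
        AddNode(root, "ENDOMETRIAL (HYSTEROSCOPIC)", "68.23");
        XmlElement heart = AddNode(root, "HEART (CONDUCTION DEFECT)", "37.33/2");
        AddNode(heart, "WITH CATHETER", "37.34/2");
        return doc;
    }

    static void Main()
    {
        // The whole tree can be serialized as plain XML text.
        Console.WriteLine(Build().OuterXml);
    }
}
```

The same document could later be walked to fill a tree view, or queried with XPath through SelectNodes.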

If you were to use Regular Expressions, you would incur about the same
performance problem as my character-based comparison, since a Regular
Expression compares a string character-by-character, and even involves some
backtracking. And I just don't see Regular Expressions filling the bill here,
although I'll admit, it is possible that I'm wrong about that. As I said, if
you can be sure about word breaks, you can do a word-by-word comparison, but
of course, under the covers it always breaks down at some point to a
char-by-char comparison.
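
To illustrate the word-by-word comparison, here is a small sketch that assumes entries are separated by single spaces. The name SharedWordCount is made up for the example. Unlike a char-by-char comparison, it cannot split a node in the middle of a word (for instance at the shared "P" of PITUITARY and PROSTATE).

```csharp
using System;

static class WordPrefix
{
    // Returns how many leading words the two entries have in common.
    public static int SharedWordCount(string a, string b)
    {
        string[] wordsA = a.Split(' ');
        string[] wordsB = b.Split(' ');
        int limit = Math.Min(wordsA.Length, wordsB.Length);
        int count = 0;
        while (count < limit && wordsA[count] == wordsB[count]) count++;
        return count;
    }

    static void Main()
    {
        // Shares ABLATION, HEART, (CONDUCTION and DEFECT).
        Console.WriteLine(SharedWordCount(
            "ABLATION HEART (CONDUCTION DEFECT) 37.33/2",
            "ABLATION HEART (CONDUCTION DEFECT) WITH CATHETER 37.34/2"));
        // Shares only ABLATION, even though the next words both start with P.
        Console.WriteLine(SharedWordCount(
            "ABLATION PITUITARY 7.69",
            "ABLATION PROSTATE (ANAT = 59.02) BY LASER, TRANSURETHRAL 60.21"));
    }
}
```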

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Chicken Salad Alchemist

I recycle.
I send everything back to the planet it came from.

"Mike" <ms********@charter.net> wrote in message
news:11**********************@m73g2000cwd.googlegroups.com...

Hi Kevin,

I have been working on this problem on and off for maybe 9 months. New
ICD9 codes are released in October so I have a few months to get a
solution in place. I am going to see if there is some other way to get
the owners of the webservices to offer something different. I'd like
to get the data more structured (i.e., do something similar to this:
http://www.developerfusion.co.uk/show/4633/). Any idea if retrieving
data structured like in the article would fit well with the C# tree
structure?

I will also look at your suggestion, but I had previously tried to
implement something similar with a two-dimensional array and the
problem was with performance. Very slow for those entries with several
hundred rows such as "Biopsy"...

Mike

Kevin Spencer wrote:
Hi Mike,

I fiddled with this problem using regular expressions for entirely too
long last night, and finally came to the conclusion that regular
expressions aren't going to provide what you need in this case. As you
discovered, your regular expression solution, which was about as close as
one could get to something that works with regular expressions, can't
identify an unknown pattern and then match that, which is essentially what
you tried valiantly to do. I have to give you credit for creativity!

Of course, this doesn't bring you any closer to a solution, so I gave that
some thought as well. It seems to me that you're looking for some sort of
recursive nested looping function. Once the data is sorted alphabetically,
it's basically a matter of comparing each line with the line that follows.
If you can be sure that the pattern will break on a word break (space),
the task becomes easier. I'll try to sketch something out along the lines
of what I'm thinking, and you can see what you think and perhaps flesh it
out:

This is the comparison method. It does a char-by-char comparison of 2
strings, returning the number of chars that match from the beginning of
the first string. If you can be sure that your nodes will break on spaces,
you could optimize this by using a word-by-word comparison.

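string[] items;

// Compare each char of a string in the array with
// each char of the next string in the array, and
// return the length of the matched leading substring.
// Pass a negative length to compare the entire first string.
int Match(int index, int length)
{
    // The last item has no successor to compare against.
    if (index >= items.Length - 1) return 0;
    int maxLength = (length < 0 ? items[index].Length : length);
    // Never read past the end of either string.
    maxLength = System.Math.Min(maxLength,
        System.Math.Min(items[index].Length, items[index + 1].Length));
    for (int i = 0; i < maxLength; i++)
    {
        // Exactly i chars matched before the first mismatch.
        if (items[index][i] != items[index + 1][i]) return i;
    }
    // Every compared char matched.
    return maxLength;
}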

Now, what I would do with this is, since you want to create a hierarchical
tree, use the System.Xml Namespace, and an XmlDocument class to create
your in-memory structure. You could certainly create your own lightweight
hierarchical node type, but this way, if the need ever arises (and it
probably will) that you want to transform your data to another format, you
have the ability to use the XmlDocument class as an XML Document, and
transform it any way you like (including as pure XML text), one of the
beauties of XML.

Once you've created your root node, you loop through the "array", calling
the Match method for each item in the "array" until it returns 0. You
initialize it by passing -1 to it, which indicates that it compares the
entire length of the first string. After that, you pass the return value
from the first comparison, which gives you the length of the first child
node. At this point you have your first sub-grouping, and your first child
node, which is the substring of the first string having the length returned
by the first comparison.

If the number of iterations is less than the length of the "array" you start
again with the next item in the array, in the same manner as the first. Each
pass of this routine returns a "node" and the length of the node value.

You recursively repeat this process for each subset of each node, starting
with the length of the node value, and using the substring starting from
that point for each element in the subset. This adds a list of nodes to each
node, and recursively does the same for each child node of each node, and so
on. When you have reached the end of all the strings, you're done.
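
One way the recursion described above might be sketched is shown here. It is not the exact routine from this post: it compares word-by-word instead of calling Match, it uses a hypothetical lightweight IndexNode type rather than XmlDocument, and it assumes the entries are already sorted and that the last space-separated token of each entry is its code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical lightweight node type, used only for this sketch.
class IndexNode
{
    public string Text;
    public string Code; // stays null for nodes that only group children
    public List<IndexNode> Children = new List<IndexNode>();
}

static class IndexTree
{
    // Builds a tree from sorted entries of the form "WORD WORD ... CODE".
    public static IndexNode Build(string[] sortedEntries)
    {
        var words = new List<string[]>();
        var codes = new List<string>();
        foreach (string entry in sortedEntries)
        {
            string[] parts = entry.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            // Assumption: the last token of each non-empty entry is its code.
            codes.Add(parts[parts.Length - 1]);
            string[] text = new string[parts.Length - 1];
            Array.Copy(parts, text, text.Length);
            words.Add(text);
        }
        var root = new IndexNode { Text = "(root)" };
        AddChildren(root, words, codes, 0, words.Count, 0);
        return root;
    }

    // Handles entries [start, end), all of which share their first 'depth' words.
    static void AddChildren(IndexNode parent, List<string[]> words, List<string> codes,
                            int start, int end, int depth)
    {
        int i = start;
        while (i < end)
        {
            if (words[i].Length == depth)
            {
                // Nothing is left after the shared prefix: this entry is the parent itself.
                parent.Code = codes[i];
                i++;
                continue;
            }
            // Collect the run of entries that continue with the same next word.
            int j = i + 1;
            while (j < end && words[j].Length > depth && words[j][depth] == words[i][depth]) j++;

            // Extend the shared prefix as far as every entry in the run agrees.
            int shared = words[i].Length;
            for (int k = i + 1; k < j; k++)
            {
                int n = depth;
                while (n < shared && n < words[k].Length && words[k][n] == words[i][n]) n++;
                shared = n;
            }
            var node = new IndexNode { Text = string.Join(" ", words[i], depth, shared - depth) };
            parent.Children.Add(node);
            // Recurse into the run, now that 'shared' words are accounted for.
            AddChildren(node, words, codes, i, j, shared);
            i = j;
        }
    }
}
```

Calling IndexTree.Build on the sorted ABLATION lines gives a root with one ABLATION child, whose own children are the sub-nodes. Terms such as BY end up as grouping nodes of their own whenever several sibling entries start with them, which may or may not be what the coders expect.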

This is about as elegant a solution as I can come up with. I'll be
interested to hear about your final solution.

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Chicken Salad Alchemist

I recycle.
I send everything back to the planet it came from.

"Mike" <ms********@charter.net> wrote in message
news:11**********************@m73g2000cwd.googlegroups.com...
> Hi Kevin,
>
> Since you appear to be rational about the actual objective of groups
> (communication), I'll try to respond effectively.
>
>> 1. You have a set of data that is pure text, and is either stored in
>> an
>> actual database, or in the text equivalent of a database as a
>> multi-line
>> text document. I can't be exactly sure.
>
> Yes, the text is now stored in a database with a webservice front-end.
>
>> 2. In any case, this data consists of multiple single-line entries of
>> text.
>
> More-or-less, yes. I can get the data as such after jumping through
> some hoops (user enters the first word of the index entry they want
> which retrieves all the codes that have an index entry that has an
> Index Entry property starting with that word. Then, because each code
> can have multiple index entries -- not necessarily all ones they want
> -- I have to discard the ones that do not start with 'Ablation' for
> example).
>
>> 3. The data is stored in such a way that the text represents a
>> hierarchical
>> structure of nodes.
>
> If I understand your comment, yes. Actually, sorted alphabetically,
> ALL the index entries are in the order they would appear in a tree
> except for "WITH, BY" and a couple other terms that can become nodes
> themselves. The issue is that the surgical indexers are used to
> viewing the data in a certain way and want to continue to view it
> without all the duplicate text in surrounding index entries.
>
>> 4. This is achieved by a top-level classification that is repeated in
>> each
>> "record" (line) for every record that falls under it.
>
> Yes, I believe so.
>
>> 5. Sub-nodes are indicated in the same way by the first text that
>> follows
>> the top-level node text.
>
> Yes. Sub nodes are just the text that repeats across lines after the
> repeated substring in a larger set of lines has been removed.
>
>> 6. The node identifier text in the sub-nodes can be identified by
>> comparing
>> it with other records that are under the top-level node. There is no
>> other
>> way to distinguish this text from any other text in the record, other
>> than
>> by comparing it with other records.
>
> Yes, because the actual text has no meta data to identify which is a
> parent and which is a child.
>
>> 7. Therefore, the structure of the hierarchy can be inferred by using
>> a
>> recursive procedure that identifies increasingly "deep" sub-nodes
>> within
>> the
>> set of records.
>
> Yes, I believe I can do this. On the old OS390 it was done kind of
> like this with heavy parsing.
>
>> 8. (Now here's where I'm a bit fuzzy). Your task is to put all of this
>> into
>> some form of data structure that can be used as an index, probably a
>> hierarchical structure such as a tree.
>>
>
> Yes, exactly, I believe I can setup a recursive algorithm that will
> populate a tree view control that will represent the codes in a way the
> surgical coders are used to seeing. They will, for example, be able to
> explode "ABLATION" and see subnodes of "ENDOMETRIAL (HYSTEROSCOPIC)
> 68.23"
> "Heart (Conduction Defect) 27.33/2"
>
> Then, upon exploding those nodes any child nodes would be displayed,
> etc.
>
>> Question: Will these records be ordered in any way? IOW, for example,
>> will
>> they be ordered alphabetically? If they are ordered alphabetically,
>> the
>> structure is already present, by virtue of the rules as stated above.
>> Otherwise, it will be necessary to do some form of re-scanning of the
>> data.
>
> In my pre-processing I can sort these alphabetically. As always, the
> question is really how to eliminate the duplicated text from
> surrounding lines and correctly place children/parents in relation to
> each other.
>
>>
>> Question: Can you tell me what sort of format the end result is
>> supposed
>> to
>> be in? Is it simply a data structure in memory? Or what?
>
> Simply a data structure in memory as the end users want to be able to
> pull this up on demand as they are coding surgical cases.
>
>
> Kevin Spencer wrote:
>> Hi Mike,
>>
>> As far as Top-Posting is concerned, AFAIK it's still a matter of
>> debate,
>> and
>> as we're talking about Netiquette, not ISO or W3C standards, my
>> personal
>> feeling is that anyone who scolds one about top- or bottom-posting has
>> poor
>> sense of priority. After all, the purpose of groups such as this is
>> communication. I find it far more difficult to deal with poor
>> communication
>> than with the format of a post, but that's just me! ;-)
>>
>> In your case, you have done a pretty darned good job of communication,
>> and I
>> appreciate that, so I will certainly do all I can to help out! I did
>> have
>> to
>> do a little research into ICD9, but that wasn't hard with Google.
>>
>> It took me a few minutes of study to figure out (for the most part)
>> what
>> your requirements are. Let me see if I can repeat them back to you in
>> my
>> own
>> words, and ask a couple of questions:
>>
>> 1. You have a set of data that is pure text, and is either stored in
>> an
>> actual database, or in the text equivalent of a database as a
>> multi-line
>> text document. I can't be exactly sure.
>> 2. In any case, this data consists of multiple single-line entries of
>> text.
>> 3. The data is stored in such a way that the text represents a
>> hierarchical
>> structure of nodes.
>> 4. This is achieved by a top-level classification that is repeated in
>> each
>> "record" (line) for every record that falls under it.
>> 5. Sub-nodes are indicated in the same way by the first text that
>> follows
>> the top-level node text.
>> 6. The node identifier text in the sub-nodes can be identified by
>> comparing
>> it with other records that are under the top-level node. There is no
>> other
>> way to distinguish this text from any other text in the record, other
>> than
>> by comparing it with other records.
>> 7. Therefore, the structure of the hierarchy can be inferred by using
>> a
>> recursive procedure that identifies increasingly "deep" sub-nodes
>> within
>> the
>> set of records.
>> 8. (Now here's where I'm a bit fuzzy). Your task is to put all of this
>> into
>> some form of data structure that can be used as an index, probably a
>> hierarchical structure such as a tree.
>>
>> Question: Will these records be ordered in any way? IOW, for example,
>> will
>> they be ordered alphabetically? If they are ordered alphabetically,
>> the
>> structure is already present, by virtue of the rules as stated above.
>> Otherwise, it will be necessary to do some form of re-scanning of the
>> data.
>>
>> Question: Can you tell me what sort of format the end result is
>> supposed
>> to
>> be in? Is it simply a data structure in memory? Or what?
>>
>> --
>> HTH,
>>
>> Kevin Spencer
>> Microsoft MVP
>> Professional Chicken Salad Alchemist
>>
>> I recycle.
>> I send everything back to the planet it came from.
>>
>>
>>
>>
>> "Mike" <ms********@charter.net> wrote in message
>> news:11*********************@u72g2000cwu.googlegroups.com...
>> > Must say I get burned in six different ways. Some groups I top post
>> > and get scolded. On other groups others people top post and nobody
>> > appears to have a problem. I'll top post here.
>> >
>> > Given I've been asked for details I'll provide them, but typically
>> > nobody wants to wade through them.
>> >
>> > In the dark ages I had 24,000 lines of ICD9 index entries which got
>> > appended with ICD9 codes and were processed one time per year into a
>> > big paper report with a tree-like structure by an assembler program
>> > on
>> > an OS390. An abbreviated example of the report is below for the
>> > Ablation entry.
>> >
>> > Ablation
>> > Endometrial (Hysteroscopic) 68.23
>> > Heart (Conduction Defect) 27.33/2
>> > With Catheter 37.34/2
>> > Inner Ear (Cryosurgery) (Ultrasound) 20.79/4
>> > By Injection 20.72
>> > Lesion Heart
>> > By Peripherally Inserted Catheter 37.34
>> >
>> > Across my institution in the past there have been multiple "master"
>> > copies of ICD9 codes and index entries. The order came down that
>> > long-term we will work towards a single copy of ICD9 codes with
>> > index
>> > entries that will be accessed via webservices. The structure of the
>> > data in our old database was as follows (no line breaks -- each
>> > entry
>> > was one line):
>> >
>> > ABLATION ENDOMETRIAL (HYSTEROSCOPIC) 68.23
>> > ABLATION HEART (CONDUCTION DEFECT) 37.33/2
>> > ABLATION HEART (CONDUCTION DEFECT) WITH CATHETER 37.34/2
>> > ABLATION INNER EAR (CRYOSURGERY) (ULTRASOUND) 20.79/4
>> > ABLATION INNER EAR (CRYOSURGERY) (ULTRASOUND) BY INJECTION
>> > 20.72
>> > ABLATION LESION HEART BY PERIPHERALLY INSERTED CATHETER 37.34
>> > ABLATION LESION HEART ENDOVASCULAR APPROACH 37.34
>> > ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) ENDOVASCULAR
>> > APPROACH 37.34
>> > ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) OPEN
>> > (TRANS-THORACIC)
>> > APPROACH 37.33
>> > ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) TRANS-THORACIC
>> > APPROACH 37.33
>> > ABLATION PITUITARY 7.69
>> > ABLATION PITUITARY BY COBALT-60 92.32
>> > ABLATION PITUITARY BY IMPLANTATION (STRONTIUM-YTTRIUM) (Y) NEC
>> > 92.39
>> > ABLATION PITUITARY BY PROTON BEAM (BRAGG PEAK) 92.33
>> > ABLATION PROSTATE (ANAT = 59.02) BY LASER, TRANSURETHRAL
>> > 60.21
>> > ABLATION PROSTATE (ANAT = 59.02) BY RADIOFREQUENCY THERMOTHERAPY
>> > 60.97
>> > ABLATION PROSTATE (ANAT = 59.02) BY TRANSURETHRAL NEEDLE ABLATION
>> > (TUNA) 60.97
>> > ABLATION PROSTATE (ANAT = 59.02) PERINEAL BY CRYOABLATION
>> > 60.62
>> > ABLATION PROSTATE (ANAT = 59.02) PERINEAL BY RADICAL CRYOSURGICAL
>> > ABLATION (RCSA) 60.62
>> > ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL BY LASER 60.21
>> > ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL CRYOABLATION
>> > 60.29
>> > ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL RADICAL CRYOSURGICAL
>> > ABLATION (RCSA) 60.29
>> > ABLATION TISSUE HEART - SEE ABLATION, LESION, HEART 0
>> > ABLATION VESICLE NECK (ANAT = 60.02) 57.91
>> >
>> >
>> > The new webservices still have this same index structure except now,
>> > for example, "Ablation Vesicle Neck (ANAT = 60.02)" is just a property
>> > of code 57.91. The surgical coders still want to view the index
>> > entries in a tree structure on demand. Without getting into
>> > mind-numbing details, I can jump through some hoops and get back a set
>> > of index entries that look like above for ABLATION but they are not
>> > formatted in the way the surgical coders desire. I believe I have a
>> > recursive algorithm that will work to format these into a tree
>> > structure but this algorithm is predicated on being able to find the
>> > nodes.
>> >
>> > If you look carefully, the root node for entire set of index entries
>> > above is "ABLATION" (as that is what begins each entry and repeats
>> > across all of them). Subsequently, Endometrial (Hysteroscopic) + code
>> > is a child of ABLATION with no children of its own because it is not
>> > repeated. Next, Heart (Conduction Defect) + code is a node with "With
>> > Catheter + code" as a child of that node because "Heart (Conduction
>> > Defect)" repeats across both those lines.
>> >
>> > I have begged the group that now owns the webservice to allow me to
>> > restructure the data but no go (they say that would be bastardizing the
>> > concept of everything being 'code-centric'). I am stuck with this and
>> > also with the demand by the coders that they get the formatted tree
>> > structure to look at when they code.
>> >
>> > In general, I think if I do the following I can figure out the nodes
>> > and children:
>> >
>> > 1. Read index entries until the first word changes.
>> > 2. Get the substring that begins the string and is repeated elsewhere
>> > in the string (this is the node).
>> > 3. Remove that node and keep processing until the base case is hit
>> > etc.
>> >
>> > If anyone has any better ideas of how to deal with this I would be
>> > thrilled to no end to hear them.
>> >
>> >
>> > Thanks,
>> >
>> > Mike
>> >
>> >
>> > Kevin Spencer wrote:
>> >> > I want to access the expression "HEART (CONDUCTION DEFECT)" I'll try
>> >> > your suggestion first off in the morning.
>> >>
>> >> First, "HEART (CONDUCTION DEFECT)" is not an expression. That is a
>> >> substring of the original string. The regular expression is the string
>> >> "^(.+)(?=\s*).*\1" that you are using to get your match. Assuming that
>> >> "HEART (CONDUCTION DEFECT)" is your match (which it is not), you could
>> >> call it a match for the regular expression (which may match more than
>> >> once in a string). But it is a substring of the original string. It may
>> >> seem picky, but in order to communicate effectively, one must use the
>> >> right terms. As an example, if I told you that I ate a car for
>> >> breakfast, would you know that I ate an apple?
>> >>
>> >> Second, the string you posted contains 2 instances of the substring
>> >> "HEART (CONDUCTION DEFECT)". Do you want to get both of them? If so,
>> >> what exactly are your pattern-matching rules? A regular expression
>> >> matches a pattern. Obviously, not all of the strings you will be
>> >> working with will be:
>> >>
>> >> " HEART (CONDUCTION
>> >> DEFECT) 37.33/2 HEART (CONDUCTION DEFECT) WITH
>> >> CATHETER 37.34/2 "
>> >>
>> >> In fact, probably due to this being a newsgroup, and my using a
>> >> newsreader, I would doubt that the line breaks in the string are where
>> >> they are, if they are. And I have to wonder whether the string actually
>> >> begins and ends with a space.
>> >>
>> >> In other words, you're going to be using a regular expression to
>> >> isolate substrings of various strings (most probably). A regular
>> >> expression is shorthand for a set of rules that defines a pattern
>> >> you're looking for. Whether the strings contain line breaks, for
>> >> example, is important. Your regular expression begins with the caret
>> >> '^' character. This character can indicate the beginning of a string,
>> >> or the beginning of a line *or* a string, depending upon what options
>> >> you use. You didn't specify the option(s) you're using, so we have no
>> >> way to know.
>> >>
>> >> In addition, your pattern is not likely to work in the way you expect.
>> >> for example, the following would match:
>> >>
>> >> THIS IS NOT WHAT THIS IS SUPPOSED TO BE. (Matches the phrase "THIS IS ")
>> >>
>> >> And in addition, if there are line breaks, like your example (as split
>> >> by the newsreader), the matching substring would be:
>> >>
>> >> DEFECT) 37.33/2 HEART (CONDUCTION DEFECT)
>> >>
>> >> So, can you explain what your rules are, and what you are trying to
>> >> match here? I'm just guessing that you're parsing medical
>> >> transcriptions, but beyond that, I'm stumped.
>> >>
>> >> --
>> >> HTH,
>> >>
>> >> Kevin Spencer
>> >> Microsoft MVP
>> >> Professional Chicken Salad Alchemist
>> >>
>> >> I recycle.
>> >> I send everything back to the planet it came from.
>> >>
>> >>
>> >
>

Jun 23 '06 #12

Barry Kelly

Do you guys ever trim your quotes?

(I didn't download your messages because they were too big :)

-- Barry

--
http://barrkel.blogspot.com/

Jun 23 '06 #13

Bruce Wood

Mike,

A couple of questions:

1. Are you at liberty to read these codes once and store them in a
"nicer" form, for example in a database? Or would you prefer to (or
have no choice but to) read them from the Web service each time and
parse them each time you want to present them to the user? If you're
free to parse them once and store the results somewhere, this adds the
question of the best representation in which to store them. If you have
to read them each time, the representational problem is in-memory and
less challenging.

2. I gather that what you want is a WinForms (or WebForms) TreeView
control?

Mike wrote:

Hi Kevin,

I have been working on this problem on and off for maybe 9 months. New
ICD9 codes are released in October so I have a few months to get a
solution in place. I am going to see if there is some other way to get
the owners of the webservices to offer something different. I'd like
to get the data more structured (i.e., do something similar to this:
http://www.developerfusion.co.uk/show/4633/) Any idea if retrieving
data structured like in the article would fit well with the C# tree
structure?

I will also look at your suggestion, but I had previously tried to
implement something similar with a two-dimensional array and the
problem was with performance. Very slow for those entries with several
hundred rows such as "Biopsy"...

Mike

Jun 23 '06 #14

Kevin Spencer

> Do you guys ever trim your quotes?

Sometimes, yes. But I always post helpful information, major in the majors,
and avoid sweating the small stuff. Why do you ask?

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Chicken Salad Alchemist

I recycle.
I send everything back to the planet it came from.

Jun 23 '06 #15

Kevin Spencer

Hi Bruce,

Actually, I already asked these questions. He apparently has repeatedly
asked to have the data stored more efficiently, to no avail. And he needs to
store it in memory.

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Chicken Salad Alchemist

I recycle.
I send everything back to the planet it came from.

Jun 23 '06 #16

Barry Kelly

"Kevin Spencer" <uc*@ftc.gov> wrote:

>> Do you guys ever trim your quotes?
>
> Sometimes, yes. But I always post helpful information, major in the majors,
> and avoid sweating the small stuff. Why do you ask?

The whole message following is just redundant. With top-quoting being
the accepted netiquette style, one must scroll to the bottom of a
message to see if any extra stuff was added, and with a huge message
that's a lot of wasted effort for everyone involved.

Also, I (personally) simply don't download messages with more than 500
lines. But that's the least relevant issue.

-- Barry

--
http://barrkel.blogspot.com/

Jun 23 '06 #17

Mike

Hi Bruce,

Per question 1), I have been "encouraged" to use the webservices "as
are" and instead develop greater skill at advanced data structures that
can appropriately handle what they are giving me. That said, I think
that in the end I will have to do something to store them in a "nicer"
form, as you say, as 8-9 months of mulling this over along with at
least a couple of attempts at different implementations have left me
empty-handed (either huge performance problems or just 'undoable'). In
sum, I may end up storing them separately in my own database and make
the case that the structure of the data in the webservice is not
useful.

Per question 2), Any tree-like representation would be ok for the
users, I think. They just want to continue seeing them exactly like
they have in the past (and that was in a tree-like view). This is a
winforms application.
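
For what it's worth, here is a rough sketch of how a tree of entries might be copied into a WinForms TreeView. The TreeEntry and TreeViewSketch names are made up for illustration (any class with a text, a code and a list of children would do), and it assumes the hierarchy has already been worked out somewhere else.

```csharp
using System.Collections;
using System.Windows.Forms;

// Hypothetical node type: text, code ("citation") and child entries.
public class TreeEntry
{
    public string Text;
    public string Citation;
    public ArrayList Child = new ArrayList();

    public TreeEntry(string text, string citation)
    {
        Text = text;
        Citation = citation;
    }
}

public static class TreeViewSketch
{
    // Recursively copies entries into a TreeView's node collection.
    public static void Populate(TreeNodeCollection nodes, ArrayList entries)
    {
        foreach (TreeEntry e in entries)
        {
            // Show the code after the text only when the entry has one.
            string label = e.Citation.Length > 0
                ? e.Text + ": " + e.Citation
                : e.Text;
            TreeNode node = nodes.Add(label);   // Add(string) returns the new node
            Populate(node.Nodes, e.Child);
        }
    }
}
```

Calling TreeViewSketch.Populate(treeView1.Nodes, roots) from the form's Load handler would fill the control in one pass, where treeView1 and roots stand for whatever control and root list the form actually has.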

Thanks much,

Mike

Jun 23 '06 #18

Mike

Hi Kevin,

I am inquiring with our DBAs to find out what XML features are
enabled/available in our Sybase ASE 12.5.2 db. If I can figure out a
usable format for these ICD9 Index entries I can propose something to
the Bioinformatics group that hosts the webservices to see if they are
willing to accept my format and provide it to users across the
institution in that format. If they are not, I may have to just keep
mine separate and therefore break their model of one 'master' set of
codes (and associated properties) for the whole institution.

Any suggestions on documentation to get up to speed?

Thanks much,

Mike

Kevin Spencer wrote:

Hi Mike,

The article looks very much like what I mentioned regarding creating your
own tree structure. The reason I suggested an XmlDocument is that it is also
a hierarchical tree, but can be transformed easily into virtually any other
format, including HTML, database, etc., and is also
cross-platform-compatible (pure text). The System.Xml namespace has plenty
of ready-made classes, such as XmlNode, XmlElement, XmlDocument, etc., which
carry a small performance penalty, due to their conformance to the XML
standard, but I would think the performance penalty was well worth it,
considering the extensibility of the result.

If you were to use Regular Expressions, you would incur about the same
performance problem as my character-based comparison, since a Regular
Expression compares a string character-by-character, and even involves some
backtracking. And I just don't see Regular Expressions filling the bill here,
although I'll admit, it is possible that I'm wrong about that. As I said, if
you can be sure about word breaks, you can do a word-by-word comparison, but
of course, under the covers it always breaks down at some point to a
char-by-char comparison.
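
To illustrate the word-by-word comparison, here is a minimal sketch. It assumes words are separated by plain spaces; PrefixSketch and CommonWordPrefix are hypothetical names, not anything from the framework.

```csharp
using System;

public static class PrefixSketch
{
    // Returns the leading words that both strings share, joined by single spaces.
    public static string CommonWordPrefix(string a, string b)
    {
        char[] separators = { ' ' };
        string[] wordsA = a.Split(separators, StringSplitOptions.RemoveEmptyEntries);
        string[] wordsB = b.Split(separators, StringSplitOptions.RemoveEmptyEntries);

        // Count how many leading words are identical in both strings.
        int count = 0;
        while (count < wordsA.Length && count < wordsB.Length
               && wordsA[count] == wordsB[count])
        {
            count++;
        }

        // Re-join only the shared leading words.
        return string.Join(" ", wordsA, 0, count);
    }
}
```

Given "HEART (CONDUCTION DEFECT) 37.33/2" and "HEART (CONDUCTION DEFECT) WITH CATHETER 37.34/2" this returns "HEART (CONDUCTION DEFECT)", which is the node text. Each word comparison is still char-by-char under the covers, as noted above.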

--
HTH,

Jun 23 '06 #19

Mark Wilden

"Kevin Spencer" <uc*@ftc.gov> wrote in message
news:%2****************@TK2MSFTNGP03.phx.gbl...

>> Do you guys ever trim your quotes?
>
> Sometimes, yes. But I always post helpful information, major in the
> majors, and avoid sweating the small stuff. Why do you ask?

I think it was pretty obvious why he asked (so why do you ask why he
asked?), and I'm with him. Excessive quoting ticks me off. Pet peeve -- no
biggie.

///ark

Jun 23 '06 #20

Kevin Spencer

Hi Mike,

You might start out by pointing out that Web Services means "SOAP" services,
which means that the data is always serialized as XML when being
transferred. So, by creating a class that is serializable as XML, or using
native XML, you are already in the ballpark with regards to web services.
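
As a rough sketch of what "a class that is serializable as XML" could look like for these index entries: the IndexEntry type, its field names and the XML attribute names below are all made up for illustration, and only XmlSerializer and the attribute classes come from the framework.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

// Hypothetical type: one node of the index tree, with its code and children.
public class IndexEntry
{
    [XmlAttribute("text")]
    public string Text;

    [XmlAttribute("code")]
    public string Code;

    // Each child is written as a nested <Entry> element.
    [XmlElement("Entry")]
    public List<IndexEntry> Children = new List<IndexEntry>();

    // XmlSerializer requires a public parameterless constructor.
    public IndexEntry() { }

    public IndexEntry(string text, string code)
    {
        Text = text;
        Code = code;
    }
}

public static class XmlSketch
{
    // Serializes the tree rooted at 'root' to an XML string.
    public static string ToXml(IndexEntry root)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(IndexEntry));
        using (StringWriter writer = new StringWriter())
        {
            serializer.Serialize(writer, root);
            return writer.ToString();
        }
    }

    // Rebuilds the tree from XML produced by ToXml.
    public static IndexEntry FromXml(string xml)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(IndexEntry));
        using (StringReader reader = new StringReader(xml))
        {
            return (IndexEntry)serializer.Deserialize(reader);
        }
    }
}
```

A type shaped like this can be returned from a web method and round-tripped with ToXml and FromXml, so the hierarchy travels with the data instead of being rebuilt from flat strings on the client.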

Here are a couple of great overall references:

http://msdn.microsoft.com/library/en...asp?frame=true
http://msdn.microsoft.com/library/en...asp?frame=true

The second contains some excellent documentation of the W3C standards for
XML.

One of the many cool things about XML is its native
ability to be transformed into almost any other data format (if not any
other data format). The 'X' in XML means "eXtensible." It is
self-describing. And XSLT (Extensible Stylesheet Language
Transformations) is a "flavor" of XML that is used to transform XML
into other data formats.

So, if you're looking for a solution with "legs," one that can easily be
upgraded, manipulated, and extended, for many years to come, XML is probably
your best bet.

Microsoft has also been betting heavily on XML. In fact, the next version
of Office uses an XML document markup rather than a proprietary binary
format.

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Chicken Salad Alchemist

I recycle.
I send everything back to the planet it came from.

Jun 23 '06 #21

Bruce Wood

Just playing around, I put together this little mock-up program. It's
300 lines long, but it takes the entries you posted and formats them
both as a tree and as a list of keyed entries. Hope this helps.

using System;
using System.Collections;

namespace Namespace
{
class Program
{
static string[] TextEntries =
{
"ABLATION ENDOMETRIAL (HYSTEROSCOPIC) 68.23",
"ABLATION HEART (CONDUCTION DEFECT) 37.33/2",
"ABLATION HEART (CONDUCTION DEFECT) WITH CATHETER 37.34/2",
"ABLATION INNER EAR (CRYOSURGERY) (ULTRASOUND) 20.79/4",
"ABLATION INNER EAR (CRYOSURGERY) (ULTRASOUND) BY INJECTION 20.72",
"ABLATION LESION HEART BY PERIPHERALLY INSERTED CATHETER 37.34",
"ABLATION LESION HEART ENDOVASCULAR APPROACH 37.34",
"ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) ENDOVASCULAR APPROACH 37.34",
"ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) OPEN (TRANS-THORACIC) APPROACH 37.33",
"ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) TRANS-THORACIC APPROACH 37.33",
"ABLATION PITUITARY 7.69",
"ABLATION PITUITARY BY COBALT-60 92.32",
"ABLATION PITUITARY BY IMPLANTATION (STRONTIUM-YTTRIUM) (Y) NEC 92.39 ",
"ABLATION PITUITARY BY PROTON BEAM (BRAGG PEAK) 92.33 ",
"ABLATION PROSTATE (ANAT = 59.02) BY LASER, TRANSURETHRAL 60.21 ",
"ABLATION PROSTATE (ANAT = 59.02) BY RADIOFREQUENCY THERMOTHERAPY 60.97 ",
"ABLATION PROSTATE (ANAT = 59.02) BY TRANSURETHRAL NEEDLE ABLATION (TUNA) 60.97 ",
"ABLATION PROSTATE (ANAT = 59.02) PERINEAL BY CRYOABLATION 60.62 ",
"ABLATION PROSTATE (ANAT = 59.02) PERINEAL BY RADICAL CRYOSURGICAL ABLATION (RCSA) 60.62 ",
"ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL BY LASER 60.21 ",
"ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL CRYOABLATION 60.29 ",
"ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL RADICAL CRYOSURGICAL ABLATION (RCSA) 60.29 ",
"ABLATION TISSUE HEART - SEE ABLATION, LESION, HEART 0 ",
"ABLATION VESICLE NECK (ANAT = 60.02) 57.91 "
};

static string[] ExclusionList = { "BY", "WITH" };

public class Entry
{
private long _uid;
private string _text;
private string _citation;
private ArrayList _child;

public Entry(string text) : this(text, "", new ArrayList())
{ }

public Entry(string text, string citation) : this(text, citation,
new ArrayList())
{ }

public Entry(string text, string citation, ArrayList child)
{
this._uid = 0;
this._text = text;
this._citation = citation;
this._child = child;
}

public string Text
{
get { return this._text; }
set { this._text = value; }
}

public string Citation
{
get { return this._citation; }
set { this._citation = value; }
}

public ArrayList Child
{
get { return this._child; }
}

public long Uid
{
get { return this._uid; }
set { this._uid = value; }
}
}
static void Main(string[] args)
{
ArrayList list = new ArrayList();
foreach (string entry in TextEntries)
{
string text;
string citation;
SplitCitation(entry, out text, out citation);
AddToList(text, citation, list);
}
FoldInExcludedWords(list);
PrintList(list, 0);
long nextUid = 1;
AssignUids(list, ref nextUid);
PrintAsHashtable(list, 0);
Console.ReadLine();
}

private static void SplitCitation(string line, out string text, out
string citation)
{
// Could use Regex here, but it's probably faster to just do it the
// brain-dead way
int i = line.Length - 1;
int len = 0;
while (i >= 0 && Char.IsWhiteSpace(line[i]))
{
i -= 1;
}
while (i >= 0 && Char.IsDigit(line[i]))
{
i -= 1;
len += 1;
}
if (i >= 0 && line[i] == '/')
{
i -= 1;
len += 1;
}
while (i >= 0 && Char.IsDigit(line[i]))
{
i -= 1;
len += 1;
}
if (i >= 0 && line[i] == '.')
{
i -= 1;
len += 1;
}
while (i >= 0 && Char.IsDigit(line[i]))
{
i -= 1;
len += 1;
}
if (i >= 0 && Char.IsWhiteSpace(line[i]))
{
citation = line.Substring(i + 1, len);
}
else
{
citation = "";
}

while (i >= 0 && Char.IsWhiteSpace(line[i]))
{
i -= 1;
}
if (i >= 0)
{
text = line.Substring(0, i + 1);
}
else
{
text = "";
}
}

private static int InitialEqualStringLength(string text1, string
text2)
{
int i = 0;
while (i < text1.Length && i < text2.Length && text1[i] == text2[i])
{
i++;
}
if (i >= text1.Length && i >= text2.Length)
{
return i;
}
if (i >= text1.Length && Char.IsWhiteSpace(text2[i]))
{
return i;
}
if (i >= text2.Length && Char.IsWhiteSpace(text1[i]))
{
return i;
}
if (i < text1.Length && i < text2.Length &&
Char.IsWhiteSpace(text1[i]) && Char.IsWhiteSpace(text2[i]))
{
return i;
}
do
{
i -= 1;
} while (i > 0 && !Char.IsWhiteSpace(text1[i]));
return i;
}

private static bool ExcludedWord(string text)
{
foreach (string word in ExclusionList)
{
if (word == text)
{
return true;
}
}
return false;
}

public static void AddToList(string line, string citation, ArrayList
list)
{
for (int i = 0; i < list.Count; i++)
{
Entry e = (Entry)list[i];

int matchLen = InitialEqualStringLength(line, e.Text);
if (matchLen > 0)
{
if (line == e.Text)
{
if (e.Citation.Length == 0)
{
e.Citation = citation;
}
else if (e.Citation != citation)
{
// Error! Two matching lines with different citations
}
return;
}
else if (matchLen == e.Text.Length)
{
string newText = line.Substring(matchLen).Trim();
AddToList(newText, citation, e.Child);
return;
}
else if (matchLen == line.Length)
{
e.Text = e.Text.Substring(matchLen).Trim();
Entry newEntry = new Entry(line.Substring(0, matchLen).Trim(),
citation);
newEntry.Child.Add(e);
list[i] = newEntry;
return;
}
else
{
string sharedText = line.Substring(0, matchLen).Trim();
string newOriginalText = e.Text.Substring(matchLen).Trim();
string newEntryText = line.Substring(matchLen).Trim();
e.Text = newOriginalText;
Entry newEntry = new Entry(sharedText);
newEntry.Child.Add(e);
newEntry.Child.Add(new Entry(newEntryText, citation));
list[i] = newEntry;
return;
}
}
}

// No match found in list
Entry addEntry = new Entry(line, citation);
list.Add(addEntry);
}

public static void FoldInExcludedWords(ArrayList list)
{
for (int i = 0; i < list.Count; i++)
{
Entry e = (Entry)list[i];
FoldInExcludedWords(e.Child);
if (ExcludedWord(e.Text))
{
// Add the text and a space to all child nodes
list.RemoveAt(i);
for (int j = e.Child.Count - 1; j >= 0; j--)
{
Entry f = (Entry)e.Child[j];
f.Text = e.Text + " " + f.Text;
list.Insert(i, f);
}
}
}
}

public static void PrintList(ArrayList list, int indent)
{
foreach (Entry e in list)
{
string formatString = String.Format("{{0,{0}}}{{1}}: {{2}}",
indent);
Console.WriteLine(formatString, " ", e.Text, e.Citation);
PrintList(e.Child, indent + 4);
}
}

public static void AssignUids(ArrayList list, ref long nextUid)
{
foreach (Entry e in list)
{
e.Uid = nextUid;
nextUid++;
AssignUids(e.Child, ref nextUid);
}
}

public static void PrintAsHashtable(ArrayList list, long parentUid)
{
foreach (Entry e in list)
{
Console.WriteLine("UID:{0}, Parent UID:{1}, Text:{2}, Citation:{3}",
e.Uid, parentUid, e.Text, e.Citation);
PrintAsHashtable(e.Child, e.Uid);
}
}
}
}

Jun 23 '06 #22

Marcus Andrén

On Fri, 23 Jun 2006 15:23:01 +0100, Barry Kelly
<ba***********@gmail.com> wrote:

> The whole message following is just redundant. With top-quoting being
> the accepted netiquette style, one must scroll to the bottom of a
> message to see if any extra stuff was added, and with a huge message
> that's a lot of wasted effort for everyone involved.

Yup.

When it comes to replying to messages, the order of preferences is
very simple from the point of usability.

Inline quoting (below the relevant text) is the best, providing the
perfect context and overview for the reader. The best practice is to
only quote the relevant information, but it takes time to decide what
should be quoted so often some extra information gets included.

No quoting at all is basically the same as inline quoting, but the
poster decided that there wasn't anything relevant to be quoted.

Bottom posting is doable. It works with inline posting, but provides
only a single context, and the quoted text isn't trimmed. It does
however respect standard document flow from top to bottom. If you do
this, it usually looks better to at least strip the signature from the
quoted text.

The worst posting type by far is placing the quoted text below the
reply. This fails to respect the fact that text is read from top to
bottom. It completely fails to give the reader any context, because by
the time he reaches the quote, he has already read the reply. Any
quoted text that is after the reply is basically wasted from the
reader's point of view, so it is better to not include it at all.

The only real argument for top posting is that you don't have to
scroll down to read the reply, but that isn't actually an argument for
top posting, but instead an argument against excessive quoting.

Curse Microsoft Outlook into an eternity in hell for introducing top
posting to the internet masses.

--
Marcus Andrén

Jun 23 '06 #23

Barry Kelly

Marcus Andrén <a@b.c> wrote:

> Curse Microsoft Outlook into an eternity in hell for introducing top
> posting to the internet masses.

Hail, fellow Agent user!

-- Barry

--
http://barrkel.blogspot.com/

Jun 23 '06 #24

Bruce Wood

Marcus Andrén wrote:

> Curse Microsoft Outlook into an eternity in hell for introducing top
> posting to the internet masses.

I curse those who bring the silly top-posting / bottom-posting debate
into newsgroups in which it was formerly unknown. I file such debates
in the same bin as I file the "where should the curly braces go"
debates.

A pox on the whole stupid debate!

Jun 24 '06 #25

Kevin Spencer

Agreed. These sorts of Lilliputian disputes betray a lack of sense of
priority on the part of the debater(s). Inferential logic dictates that if
one lacks a sense of priority in one area, it is likely that one will also
lack a sense of priority in other areas. Prioritization is a critical skill
to the practice of application development. Comments which enhance the
development skills of the community are useful; comments which focus on
trivialities are not.

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Chicken Salad Alchemist

I recycle.
I send everything back to the planet it came from.

Jun 24 '06 #26
