
Regex help, please - Recognize quoted strings?

Dave
Guest
 
Posts: n/a
#1: Nov 16 '05
I'm struggling with something that should be fairly simple. I just don't
know regex syntax very well, unfortunately.

I'd like to parse words out of what is basically a boolean search string.
It's actually the input string into a Microsoft Index Server search.

The string will consist of words, perhaps enclosed in quotes or parentheses.
I'd like to use Regex to pull out the words, or the phrases if the words are
enclosed in quotes. Examples:

The string: asdf or qwer or hjkl
should yield three results: asdf, qwer, hjkl

and:

"two words" and asdf
should yield two results: "two words", and "asdf"

There's the added complexity that the strings may have groups of words
surrounded by parentheses, but I think I can figure that out if I solve the
quoted strings problem.

I've tried a few things, but I can't manage to come up with something that
isn't returning the quotes in the return values.

Here's some code:

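// requires: using System.Text.RegularExpressions;
Regex regEx = new Regex("([\"][^\"]+[\"]|\\S+)");

string searchText = "\"two words\" and asdf";
foreach (Match m in regEx.Matches(searchText))
{
    // m.ToString() is the whole match, so quoted phrases keep their quotes
    string text = m.ToString();

    MessageBox.Show(text);
}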

In the above code, it will pull out the words, but the text pulled out
includes the quotes in "two words".
I tried to tell it to match but ignore the quotes, using:
Regex regEx = new Regex("(?:(\"){1}[^\"]?:(\"){1}|\\S)+");

but that doesn't work either. Obviously I don't know what I'm doing.

Please help!

- Dave


Wes
Guest
 
Posts: n/a
#2: Nov 16 '05

Hello Dave,

With "(?:(\"){1}[^\"]?:(\"){1}|\\S)+" the outer '(?:' tells the engine not to capture the whole thing, but each (\") still captures a quote individually.

Try (?:"\"([^\"]+)\"|\\S+)

This should capture only the text between the quotes, excluding the quotes themselves.

HTH
Wes Haggard
http://weblogs.asp.net/whaggard/


Dave
Guest
 
Posts: n/a
#3: Nov 16 '05

Thanks, Wes!

The regex string you gave me now solves the big problem - it returns the
entire "phrase" inside the quotes. It still does return the quotes
themselves, though. I can strip those out with a call to Trim(), but that's
a little bit of a hack. Can you figure out how to tell it to strip the
quotes for me?

- Dave


Wes
Guest
 
Posts: n/a
#4: Nov 16 '05

Hello Dave,

It looks like I had a typo in my regular expression (an extra quote). Here is the corrected version:
(?:\"([^\"]+)\"|\\S+)
But that isn't your problem.

From your example, it looks like you are getting the value from m.ToString(). That returns the value of group 0 (m.Groups[0].Value), which by default is the entire substring that was matched, quotes included. Try m.Groups[1].Value instead; that will give you the string without quotes.
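
To make the difference between group 0 and group 1 concrete, here is a minimal sketch using the corrected pattern. The class and method names (GroupDemo, FirstMatch) are invented for illustration and are not from the thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // Returns { whole match (group 0), first capture group } for the first match.
    // GroupDemo and FirstMatch are hypothetical names used only in this sketch.
    public static string[] FirstMatch(string input)
    {
        Match m = Regex.Match(input, "(?:\"([^\"]+)\"|\\S+)");
        return new[] { m.Groups[0].Value, m.Groups[1].Value };
    }

    public static void Main()
    {
        string[] parts = FirstMatch("\"two words\" and asdf");
        Console.WriteLine(parts[0]); // "two words"  (quotes included)
        Console.WriteLine(parts[1]); // two words    (quotes excluded)
    }
}
```

Note that group 1 only participates when the quoted alternative matches; for an unquoted word it is empty, so the caller still has to fall back to the whole match in that case.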

I just dug up a regular expression I used in the past to split a string at any whitespace but not split if the string is within quotes.

string searchText = "\"two words\" and asdf";
string[] split = Regex.Split(searchText, @"(?<!""\b[^""]*)\s+(?![^""]*\b"")");
foreach (string s in split)
    Console.WriteLine(s.Trim('"'));

// Output
two words
and
asdf

It does leave the quotes on the string, but Trim takes care of that. I think this may make your job a little easier (as long as you don't try to figure out exactly what that regular expression is doing; I still have trouble with it when I haven't looked at it for a while ;)

HTH
Wes Haggard
http://weblogs.asp.net/whaggard/

Dave
Guest
 
Posts: n/a
#5: Nov 16 '05

Wes:

Unfortunately, the new string doesn't work at all. Also, m.Groups[0].Value
still returns the string in quotes (using the original string you gave me).
I did try to figure out what that pattern is doing - whew! It uses a
character that isn't even documented in the doc I've been using - the "<"
char? I'm going by what's at:
http://msdn.microsoft.com/library/de...gexpsyntax.asp

At this point, this is mostly an intellectual exercise - I have it working
by trimming out the surrounding quotes. Just a little bit of a hack. If
you have something else for me to try, I'd love to try it. I used to be
competent with this stuff, in my old sed, awk, and lex days. But, it's been
a while. If you'd prefer to punt, that's fine, and thanks for all your help
so far.

- Dave


Wes
Guest
 
Posts: n/a
#6: Nov 16 '05

Hello Dave,
Comments inline.

> Wes:
>
> Unfortunately, the new string doesn't work at all.

Really? I tested it on the string you gave me and it worked for me; at least, it matched quoted strings.
Anyway, here is a complete sample that matches both quoted and non-quoted strings.

string searchText = "\"two words\" and asdf";
Regex regEx = new Regex("(?:\"([^\"]+)\"|(\\S+))");
foreach (Match m in regEx.Matches(searchText))
{
    // If quoted string
    string text = m.Groups[1].Value;

    // If non-quoted string
    if (text == string.Empty)
        text = m.Groups[2].Value;

    Console.WriteLine(text);
}

// Output
two words
and
asdf
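
As a variation on the sample above, the following sketch uses named groups and Group.Success instead of comparing against an empty string to decide which alternative matched. The group names (quoted, bare) and the QueryTokenizer/Tokenize helper are assumptions made up for this example, not something from the thread.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QueryTokenizer
{
    // "quoted" captures the text between double quotes;
    // "bare" captures an unquoted run of non-whitespace characters.
    private static readonly Regex TokenPattern =
        new Regex("\"(?<quoted>[^\"]+)\"|(?<bare>\\S+)");

    // Hypothetical helper: returns phrases without their quotes, plus bare words.
    public static List<string> Tokenize(string searchText)
    {
        var tokens = new List<string>();
        foreach (Match m in TokenPattern.Matches(searchText))
        {
            Group quoted = m.Groups["quoted"];
            // Success reports whether this group took part in the match.
            tokens.Add(quoted.Success ? quoted.Value : m.Groups["bare"].Value);
        }
        return tokens;
    }

    public static void Main()
    {
        foreach (string token in Tokenize("\"two words\" and asdf"))
            Console.WriteLine(token);
    }
}
```

For the thread's test string this prints the same three lines as the sample above (two words, and, asdf). Filtering out operators such as "and" and "or", or handling parentheses, would still be up to the caller.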

> Also,
> m.Groups[0].Value still returns the string in quotes (using the
> original string you gave me).

m.Groups[1].Value should be the one with no quotes.

> I did try to figure out what that
> pattern is doing - whew! It uses a character that isn't even
> documented in the doc I've been using - the "<" char? I'm going by
> what's at:
> http://msdn.microsoft.com/library/de...ry/en-us/script56/html/jsgrpregexpsyntax.asp

FYI: the link you gave is for the VBScript regular expression syntax, which is not exactly the same as the .NET regular expression syntax. The .NET reference can be found at
http://msdn.microsoft.com/library/de...geElements.asp (the (?<! ) construct is listed under the grouping constructs section; it is a negative lookbehind).
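
For anyone who has not met a negative lookbehind before, here is a tiny sketch unrelated to the quoted-string pattern. The sample text, the pattern, and the LookbehindDemo/UnpricedNumbers names are invented for illustration: the pattern matches a number only when it is not immediately preceded by a dollar sign.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookbehindDemo
{
    // (?<!\$) asserts that the character just before the match is not '$'.
    // \b keeps the match from starting in the middle of a number.
    public static string[] UnpricedNumbers(string input)
    {
        MatchCollection matches = Regex.Matches(input, @"(?<!\$)\b\d+");
        var result = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
            result[i] = matches[i].Value;
        return result;
    }

    public static void Main()
    {
        // "15" is skipped because it follows '$'; only "20" is printed.
        Console.WriteLine(string.Join(", ", UnpricedNumbers("cost $15 for 20 items")));
    }
}
```

The lookbehind consumes no characters; it only checks what comes before the current position, which is why it can be used in Regex.Split to decide where a separator is allowed.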

> At this point, this is mostly an intellectual exercise - I have it
> working by trimming out the surrounding quotes.

I know, but part of the reason I help people with issues like this is so that I can stay intellectually sharp. ;) Plus I hate giving up before the objective is attained.

> Just a little bit of
> a hack. If you have something else for me to try, I'd love to try it.
> I used to be competent with this stuff, in my old sed, awk, and lex
> days. But, it's been a while. If you'd prefer to punt, that's fine,
> and thanks for all your help so far.
>
> - Dave

I hope this is what you are looking for.

Wes Haggard
http://weblogs.asp.net/whaggard/


Dave
Guest
 
Posts: n/a
#7: Nov 16 '05

Wes:

That one almost works. It was the @"(?<!""\b[^""]*)\s+(?![^""]*\b"")" one
that I was referring to that didn't work.

The new one works.

Thanks!

