Finding position of a RegExp subexpression

Csaba Gabor
#1: Apr 21 '06
I need to come up with a function
function regExpPos (text, re, parenNum) { ... }
that will return the position within text of RegExp.$parenNum if there
is a match, and -1 otherwise.

For example:
var re = /some(thing|or other)?.*(n(est)(?:ed)?.*(parens) )/
var text = "There were some nesting parens in the test";
alert (regExpPos (text, re, 3));

should show 17


Would anyone have one of these?
Csaba Gabor from Vienna

Randy Webb
#2: Apr 21 '06

Csaba Gabor said the following on 4/21/2006 1:23 PM:
> I need to come up with a function
> function regExpPos (text, re, parenNum) { ... }
> that will return the position within text of RegExp.$parenNum if there
> is a match, and -1 otherwise.

There is one already. indexOf :)
Never tried it with RegExp's though :)
--
Randy
comp.lang.javascript FAQ - http://jibbering.com/faq & newsgroup weekly
Javascript Best Practices - http://www.JavascriptToolbox.com/bestpractices/
Csaba Gabor
#3: Apr 21 '06

Randy Webb wrote:
> Csaba Gabor said the following on 4/21/2006 1:23 PM:
> > I need to come up with a function
> > function regExpPos (text, re, parenNum) { ... }
> > that will return the position within text of RegExp.$parenNum if there
> > is a match, and -1 otherwise.
>
> There is one already. indexOf :)
> Never tried it with RegExp's though :)

The problem with
function regExpPos (text, re, parenNum) {
if (!text.match(re)) return -1;
return text.indexOf(RegExp['$'+parenNum], RegExp.leftContext.length)
}

is that RegExp['$'+parenNum] may not be unique within text (though it
is in the example that I gave). So if I change text to
var text = "There were some questionable nesting parens in the test";
regExpPos (text, re, 3) would return 18 instead of the correct 30.
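
For anyone who wants to reproduce the two numbers above, here is a minimal sketch of the same indexOf idea. It reads the match start and the captured text from the array returned by exec instead of the RegExp.leftContext and RegExp.$n statics, and the name naiveRegExpPos is made up for this illustration.

```javascript
// Naive approach: look for the captured text, starting at the match start.
// naiveRegExpPos is a hypothetical name used only for this illustration.
function naiveRegExpPos(text, re, parenNum) {
  var m = re.exec(text);
  if (!m || m[parenNum] === undefined) return -1;
  return text.indexOf(m[parenNum], m.index);
}

var re = /some(thing|or other)?.*(n(est)(?:ed)?.*(parens) )/;
var text1 = "There were some nesting parens in the test";
var text2 = "There were some questionable nesting parens in the test";

var pos1 = naiveRegExpPos(text1, re, 3);  // 17, the right answer
var pos2 = naiveRegExpPos(text2, re, 3);  // 18 ("est" in "questionable"), not 30
```

The second call goes wrong because "est" already occurs inside "questionable", before the place where the third group actually matched.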

Csaba

By the way, thanks for that ear piercing demo in the other thread. :)



In short: text.indexOf(RegExp['$'+parenNum], pos) will find the
position of a substring within the string, but that substring may
not be unique within the string.

Randy Webb
#4: Apr 21 '06

Csaba Gabor said the following on 4/21/2006 2:48 PM:
> Randy Webb wrote:
>> Csaba Gabor said the following on 4/21/2006 1:23 PM:
>>> I need to come up with a function
>>> function regExpPos (text, re, parenNum) { ... }
>>> that will return the position within text of RegExp.$parenNum if there
>>> is a match, and -1 otherwise.
>> There is one already. indexOf :)
>> Never tried it with RegExp's though :)
>
> The problem with
> function regExpPos (text, re, parenNum) {
> if (!text.match(re)) return -1;
> return text.indexOf(RegExp['$'+parenNum], RegExp.leftContext.length)
> }
>
> is that RegExp['$'+parenNum] may not be unique within text (though it
> is in the example that I gave). So if I change text to
> var text = "There were some questionable nesting parens in the test";
> regExpPos (text, re, 3) would return 18 instead of the correct 30.

My knowledge of RegExps may not be good enough to follow this, so I
may be reading it wrong, but if you want the last match, then
lastIndexOf gives it. -1 if no match.
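
As a quick check against the original example, assuming the captured text is "est", the sketch below compares indexOf and lastIndexOf. It only illustrates that lastIndexOf searches for a substring in the same way and can therefore land on a different occurrence than the one the group matched.

```javascript
var text = "There were some nesting parens in the test";

var first = text.indexOf("est");      // 17, inside "nesting"
var last = text.lastIndexOf("est");   // 39, inside "test"
```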

> Csaba
>
> By the way, thanks for that ear piercing demo in the other thread. :)

It does a better job than coffee at 5 am :)

--
Randy
Dr John Stockton
#5: Apr 21 '06

JRS: In article <1145640221.825938.207830@j33g2000cwa.googlegroups.com>,
dated Fri, 21 Apr 2006 10:23:41 remote, seen in
news:comp.lang.javascript, Csaba Gabor <danswer@gmail.com> posted :
>I need to come up with a function
>function regExpPos (text, re, parenNum) { ... }
>that will return the position within text of RegExp.$parenNum if there
>is a match, and -1 otherwise.
>
>For example:
>var re = /some(thing|or other)?.*(n(est)(?:ed)?.*(parens) )/
>var text = "There were some nesting parens in the test";
>alert (regExpPos (text, re, 3));
>
>should show 17

If you can alter the RegExp by inserting extra parentheses so that
everything is matched, then you could sum the lengths of all lower
matches.
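
A minimal sketch of that summing idea, assuming the example RegExp has been altered by hand so that everything before "(est)" is captured as well. The names reAll and posOfEst are made up for this illustration, and the group numbers in the comments refer to the altered RegExp, not the original one.

```javascript
// Hand-altered version of the example RegExp: the text preceding "(est)"
// inside the match is now covered by capturing groups.
//   group 1: everything from "some" up to the outer group
//   group 2: (thing|or other)      group 3: the original outer group
//   group 4: the "n" before "est"  group 5: "est"
var reAll = /(some(thing|or other)?.*)((n)(est)(?:ed)?.*(parens) )/;

// posOfEst is a hypothetical helper: match start plus the lengths of the
// groups that lie entirely before "est".
function posOfEst(text) {
  var m = reAll.exec(text);
  if (!m) return -1;
  return m.index + m[1].length + m[4].length;
}

var p1 = posOfEst("There were some nesting parens in the test");               // 17
var p2 = posOfEst("There were some questionable nesting parens in the test");  // 30
```

The hard part, doing the alteration automatically for an arbitrary RegExp, is not addressed by this sketch.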

Or you could then, with .replace, substitute all lower matches to "",
and see by how much the length has changed.

But I don't know whether that would always work with sufficiently
complex RegExps.

You could .replace the parameter in question with an Unreasonable String
(it is, after all, Unicode) and then do indexOf(that US).

Note: if the original string is less than 2^16 characters long, there
must be at least one "16-bit" Unicode character that it does not
contain. So to find a one-character US, start searching for each
possible character in turn (starting with the least plausible) until you
find one that is not there.
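
The search for a one-character US could be sketched as below. The name findUnusedChar is made up for this illustration, and starting at the top of the 16-bit range is only a guess at what "least plausible" means.

```javascript
// Returns a single 16-bit character that does not occur in text,
// or null if every one of the 65536 code units is present.
// findUnusedChar is a hypothetical name used only for this illustration.
function findUnusedChar(text) {
  for (var code = 0xFFFF; code >= 0; --code) {
    var ch = String.fromCharCode(code);
    if (text.indexOf(ch) === -1) return ch;
  }
  return null;
}

var us = findUnusedChar("There were some nesting parens in the test");
```

This covers only the choice of the Unreasonable String, not the step of getting it into the right place with .replace.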

Untested.

--
© John Stockton, Surrey, UK. ?@merlyn.demon.co.uk Turnpike v4.00 IE 4 ©
<URL:http://www.jibbering.com/faq/> JL/RC: FAQ of news:comp.lang.javascript
<URL:http://www.merlyn.demon.co.uk/js-index.htm> jscr maths, dates, sources.
<URL:http://www.merlyn.demon.co.uk/> TP/BP/Delphi/jscr/&c, FAQ items, links.
Csaba Gabor
#6: Apr 22 '06

Dr John Stockton wrote:
> JRS: In article <1145640221.825938.207830@j33g2000cwa.googlegroups.com>,
> dated Fri, 21 Apr 2006 10:23:41 remote, seen in
> news:comp.lang.javascript, Csaba Gabor <danswer@gmail.com> posted :
> >I need to come up with a function
> >function regExpPos (text, re, parenNum) { ... }
> >that will return the position within text of RegExp.$parenNum if there
> >is a match, and -1 otherwise.
> >
> >For example:
> >var re = /some(thing|or other)?.*(n(est)(?:ed)?.*(parens) )/
> >var text = "There were some nesting parens in the test";
> >alert (regExpPos (text, re, 3));
> >
> >should show 17
>
> If you can alter the RegExp by inserting extra parentheses so that
> everything is matched, then you could sum the lengths of all lower
> matches.

This is, in effect, what I have done, code provided below. However, it
is a non trivial process that must account for nested parentheses
(...(...()...()...)...(...()...)...), back references (\#), and non
capturing subexpressions (?:...).

> Or you could then, with .replace, substitute all lower matches to "",
> and see by how much the length has changed.
>
> But I don't know whether that would always work with sufficiently
> complex RegExps.
>
> You could .replace the parameter in question with an Unreasonable String
> (it is, after all, Unicode) and then do indexOf(that US).

I appreciate the brainstorming. Back references render the remaining
above ideas unworkable, as far as I can tell. Below is a function I
coded up which does the job. It works by introducing parens ending at
the start of the specified capturing parens [those are parens that
don't start with (?:] and stretching back to the start of the
containing capturing parens. Of course the containing paren's position
must be identified, too, so you get the idea this is recursive. The
complete listing of the function in all its gory glory follows (not
extensively tested).

Csaba Gabor from Vienna


function regExpPos (text, re, parenNum) {
  // returns the starting position of the parenNum-th capturing parens
  // of the RegExp, re, when matching text; -1 if not successful
  // (recursive calls pass the table of paren positions as a 4th argument)
  if (!parenNum) {                      // terminating case
    if (!text.match(re)) return -1;
    return RegExp.leftContext.length; }
  var i, j, code, chr, aParen, src=re.source;
  var mode = 0;                         // 0 => normal, 1 => character []
  if (arguments.length<4) { // initial entry - this section determines
    // opening and closing positions of all capturing parens
    // (only (?: is recognized as non capturing; lookaheads are not handled)
    var aOpen = [];         // stack of open parens; 0 marks a (?: one
    aParen = [[0, src.length]];
    for (i=0;i<src.length;++i) {
      if ((chr=src.charAt(i))=="\\") { ++i; continue; }
      if (mode) { if (chr=="]") mode = 0; continue; }
      if (chr=="[") { mode = 1; continue; }
      if (chr=="(") {
        if (src.substr(i+1,2)=="?:") aOpen.push(0);
        else { aParen.push([i, -1]); aOpen.push(aParen.length-1); } }
      else if (chr==")") {
        // a ")" closes the most recently opened paren; its position is
        // recorded only when that paren is a capturing one
        j = aOpen.pop();
        if (j) aParen[j][1] = i; }
    }
    if (parenNum>=aParen.length) {
      if (!text.match(re)) return -1;
      return (RegExp.leftContext.length + RegExp.lastMatch.length); }
  } else aParen = arguments[3];

  // step 1 - find the containing parens (cP, aCP)
  var aTP = aParen[parenNum];      // parenNum's start, end position
  for (var cP=parenNum;cP--;) if (aParen[cP][1]>aTP[1]) break;
  var res, aP2, aCP = aParen[cP];  // containing paren's start, end pos

  // step 2 - avoid introducing extra level of parens
  // for when cP to parenNum is completely filled with parens
  for (i=parenNum, aP2=[i];--i>cP;)
    if (aParen[aP2[aP2.length-1]][0]==aParen[i][1]+1)
      aP2[aP2.length] = i;
  // the content of cP starts just after its "(", or at 0 for the whole RegExp
  if (aParen[aP2[aP2.length-1]][0]==(cP ? aCP[0]+1 : 0)) {
    if (!text.match(re)) return -1;
    for (res=0, i=aP2.length;--i;) res += RegExp['$'+aP2[i]].length;
    return res + (!cP ? RegExp.leftContext.length :
                  regExpPos(text, re, cP, aParen)); }

  // step 3 - insert parens from start of cP to start of parenNum
  //alert (aParen.join("\n"));
  src = src.slice(0,i=aCP[0]) + "(" +
        src.slice(i,i=aTP[0]) + ")" + src.slice(i);

  // step 4 - replace back references >= parenNum (single digit only)
  for (i=0;i<src.length;++i) {
    if ((chr=src.charAt(i))=="\\") {
      if (!mode && (code=src.charCodeAt(i+1))<57 && (code>=48+(cP+1)))
        src = src.slice(0,i+1) + String.fromCharCode(code+1) +
              src.slice(i+2);
      ++i;
      continue; }
    if (mode) { if (chr=="]") mode = 0; continue; }
    if (chr=="[") { mode = 1; continue; }
  }

  // step 5 - do the regular expression (keeping the i and m flags of re)
  var rex = new RegExp(src, (re.ignoreCase ? "i" : "") +
                            (re.multiline ? "m" : ""));
  if (!text.match(rex)) return -1;
  return RegExp['$'+(cP+1)].length +
         (!cP ? RegExp.leftContext.length :
          regExpPos(text, re, cP, aParen));
}

Lasse Reichstein Nielsen
#7: Apr 22 '06

"Csaba Gabor" <danswer@gmail.com> writes:

> I need to come up with a function
> function regExpPos (text, re, parenNum) { ... }
> that will return the position within text of RegExp.$parenNum if there
> is a match, and -1 otherwise.

I can't see an immediate way that works with all regexps and/or
texts. You only get the value of the group match, and that can be very
un-unique in the string, and even in the match. The only index you
ever get is the index of the entire match.
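
For readers on newer engines: ECMAScript 2022 added the "d" flag (match indices), which did not exist when this thread was written. Assuming an engine that supports it, a sketch of the requested function could look like the one below; the name regExpPosIndices is made up for this illustration.

```javascript
// Sketch assuming ES2022 match indices (the "d" flag) are available.
// regExpPosIndices is a hypothetical name used only for this illustration.
function regExpPosIndices(text, re, parenNum) {
  // Rebuild the RegExp with "d" added so that exec() fills in m.indices.
  var flags = re.flags.indexOf("d") < 0 ? re.flags + "d" : re.flags;
  var m = new RegExp(re.source, flags).exec(text);
  // m.indices[n] is [start, end] for group n, or undefined if the group
  // took no part in the match (or does not exist).
  if (!m || !m.indices[parenNum]) return -1;
  return m.indices[parenNum][0];
}

var re = /some(thing|or other)?.*(n(est)(?:ed)?.*(parens) )/;
var text1 = "There were some nesting parens in the test";
var text2 = "There were some questionable nesting parens in the test";

var a = regExpPosIndices(text1, re, 3);  // 17
var b = regExpPosIndices(text2, re, 3);  // 30
```

Note that a group which took no part in the match (group 1 in both example texts) also yields -1 here.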

/L
--
Lasse Reichstein Nielsen - lrn@hotpop.com
DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
'Faith without judgement merely degrades the spirit divine.'
Dr John Stockton
#8: Apr 22 '06

JRS: In article <adydnbH8hs4pttTZRVn-iA@comcast.com>, dated Fri, 21 Apr
2006 15:00:08 remote, seen in news:comp.lang.javascript, Randy Webb
<HikksNotAtHome@aol.com> posted :
>Csaba Gabor said the following on 4/21/2006 2:48 PM:
>> Randy Webb wrote:
>>> Csaba Gabor said the following on 4/21/2006 1:23 PM:
>>>> I need to come up with a function
>>>> function regExpPos (text, re, parenNum) { ... }
>>>> that will return the position within text of RegExp.$parenNum if there
>>>> is a match, and -1 otherwise.
>>> There is one already. indexOf :)
>>> Never tried it with RegExp's though :)
>>
>> The problem with
>> function regExpPos (text, re, parenNum) {
>> if (!text.match(re)) return -1;
>> return text.indexOf(RegExp['$'+parenNum], RegExp.leftContext.length)
>> }
>>
>> is that RegExp['$'+parenNum] may not be unique within text (though it
>> is in the example that I gave). So if I change text to
>> var text = "There were some questionable nesting parens in the test";
>> regExpPos (text, re, 3) would return 18 instead of the correct 30.
>
>My knowledge of RegExps may not be good enough to follow this, so I
>may be reading it wrong, but if you want the last match, then
>lastIndexOf gives it. -1 if no match.

ISTM that, if he had wanted that, he would have said so. After all, the
Viennese are good at English.


Testing such as

R = ("12j3456789").match(/(\d)(\d)(\d)(\d)/)
A = R['lastIndex']

suggests that A is indeed the index at which to start the next match,
and
A = R['lastIndex'] - R[R.length-1].length

is therefore the beginning of the last match.
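
A lastIndex property on the array returned by match is not part of the ECMAScript standard, so in engines where R['lastIndex'] is undefined the same two numbers can be computed from the standard index property. A minimal sketch, assuming (as in the test above) that the last group ends exactly where the whole match ends:

```javascript
var R = "12j3456789".match(/(\d)(\d)(\d)(\d)/);

// End of the whole match, i.e. where a following match attempt would start.
var A = R.index + R[0].length;        // 3 + 4 = 7

// Start of the last group; valid only when that group ends the match.
var B = A - R[R.length - 1].length;   // 7 - 1 = 6
```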

So, Csaba, you just need a RegExp that edits RegExps to have only n
matches, and a question very similar to the original is already
answered.

It looks as if RegExp.leftContext.length *may* actually answer the
modified question but IE4 appears not to have leftContext.

Small Flanagan asserts that IE4 has neither leftContext nor lastIndex.


<FAQENTRY> The FAQ needs a good link or two, and a supporting entry, for
RegExp.

--
© John Stockton, Surrey, UK. ?@merlyn.demon.co.uk Turnpike v4.00 IE 4 ©