In article <11**********************@m73g2000cwd.googlegroups.com>,
<wr********@gmail.com> wrote:
: I think I need to do a negative lookahead with a regular expression,
: but I'm a bit confused how to make it all work. Take these example
: texts:
:
: Need to match these two:
: =========================
:
: Item 4.01 Regulation and other items
: <b>Item 4. Regulation</b>
:
: =========================
: Need to avoid matching these two:
: =========================
:
: ...then he looked at Item 12.06 for more information...
: <a href="000">Item 6. Other</a> below
:
: [...]
When you're having trouble getting the right pattern, turn the
verbosity knob way up. In other words, be explicit about the
different cases:

    using System;
    using System.Text.RegularExpressions;

    using MbUnit.Framework;

    namespace Item
    {
      public class ItemScanner
      {
        // Two explicit cases: an item at the very start of the input,
        // or an item inside an element whose end tag is not </a>.
        static Regex valid =
          new Regex(@"
            ^(?<item>Item\s\d+\.\d*.+) |
            >(?<item>Item\s\d+\.\d*.*?)<(?!/a>)",
            RegexOptions.IgnorePatternWhitespace);

        // Returns the matched item text, or null when nothing matches.
        public string ExtractItem(string s)
        {
          Match m = valid.Match(s);

          return m.Success ? m.Groups["item"].Value : null;
        }
      }

      [TestFixture]
      public class Test
      {
        [RowTest]
        [Row("Item 4.01 Regulation and other items",
             "Item 4.01 Regulation and other items")]
        [Row("<b>Item 4. Regulation</b>",
             "Item 4. Regulation")]
        [Row("...then he looked at Item 12.06 for more...", null)]
        [Row("<a href=\"000\">Item 6. Other</a> below", null)]
        [Row("<abc>Item 7.</abc>",
             "Item 7.")]
        public void LookForInterestingItems(string input, string expect)
        {
          ItemScanner scanner = new ItemScanner();

          Assert.AreEqual(expect, scanner.ExtractItem(input));
        }
      }
    }

Using negative lookahead is tricky because it's easy to forget that
the matcher is the Little Engine That Could: the pesky thing will
keep on backtracking until it finds a match.
Consider a paraphrased version of your pattern:
Item\s\d+\..*?(?!</a>)
For "Item 7. blah</a>", the matcher sees the dot after one or more
digits and then tries the negative lookahead without consuming any
more input, i.e., .*? first tries zero repetitions, then one, then
two, and so on. With zero repetitions, the remaining input is
" blah</a>", which does not start with "</a>", so the lookahead is
satisfied and the match succeeds right there.
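
If you want to watch that happen, here is a minimal sketch using the
paraphrased pattern from above. The class and field names are made up
for illustration. The second input shows the backtracking more starkly:
when the lookahead fails at zero repetitions, .*? swallows the "<" and
the lookahead is then happy.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // Paraphrased pattern: lazy .*? followed by a negative lookahead.
    public static readonly Regex Loose =
        new Regex(@"Item\s\d+\..*?(?!</a>)");

    public static void Main()
    {
        // Zero repetitions of .*? already satisfy the lookahead.
        Console.WriteLine("[" + Loose.Match("Item 7. blah</a>").Value + "]");
        // prints "[Item 7.]"

        // Lookahead fails at zero repetitions, so .*? takes one
        // character (the '<') and the lookahead then succeeds.
        Console.WriteLine("[" + Loose.Match("Item 7.</a>").Value + "]");
        // prints "[Item 7.<]"
    }
}
```

Either way the pattern matches, which is the opposite of what the
lookahead was meant to enforce.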
In general, the more anchors or checkpoints you can put in your
patterns, the easier you'll make it on yourself. Note that in my
pattern, I match a less-than that I'm going to throw away. Because I'm
in the string-value branch, I know I have to find a less-than, so I
look for it and then make sure it's not a bad end element.
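
As a rough sketch of that checkpoint idea in isolation (only the
element branch of my pattern, slightly simplified, with hypothetical
class and method names): the less-than is consumed first, and only
then does the lookahead get to veto a closing </a>.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CheckpointDemo
{
    // Consume the '<' that must follow the item text, then reject "</a>".
    public static readonly Regex Anchored =
        new Regex(@">(?<item>Item\s\d+\..*?)<(?!/a>)");

    // Returns the item text, or null when there is no acceptable match.
    public static string Extract(string s)
    {
        Match m = Anchored.Match(s);
        return m.Success ? m.Groups["item"].Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(Extract("<b>Item 4. Regulation</b>"));
        // prints "Item 4. Regulation"

        Console.WriteLine(
            Extract("<a href=\"000\">Item 6. Other</a> below") == null);
        // prints "True"
    }
}
```

Because the "<" is part of the match, .*? cannot slide past it to
make the lookahead succeed somewhere more convenient.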
PLEASE NOTE: All that said, regular expressions are *very* poor
substitutes for HTML parsers. Say you have the following item:
<b>Item 4. <em>Really</em> bad!</b>
My pattern will report "Item 4. " as the item, which is probably
not what you want.
Find another approach if possible, e.g., XPath if your documents are
XHTML.
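
For what it's worth, here is a minimal sketch of the XPath route using
System.Xml. It assumes the input is a well-formed fragment with no
namespaces and no HTML-only entities such as &nbsp;, and it wraps the
fragment in a made-up <root> element so it can be loaded. The class
name, method name, and the choice of "any element except <a>" are my
assumptions, not something your data necessarily supports.

```csharp
using System;
using System.Xml;

public static class XPathDemo
{
    // Returns the text of the first non-<a> element whose content
    // starts with "Item ", or null when there is none.
    public static string ExtractItem(string fragment)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<root>" + fragment + "</root>");

        XmlNode node = doc.SelectSingleNode(
            "/root/*[not(self::a)]" +
            "[starts-with(normalize-space(.), 'Item ')]");

        return node == null ? null : node.InnerText;
    }

    public static void Main()
    {
        Console.WriteLine(
            ExtractItem("<b>Item 4. <em>Really</em> bad!</b>"));
        // prints "Item 4. Really bad!"

        Console.WriteLine(
            ExtractItem("<a href=\"000\">Item 6. Other</a> below") == null);
        // prints "True"
    }
}
```

Note that nested markup is no longer a problem: the parser hands back
the whole text of the element.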
Hope this helps,
Greg
--
Those who are asking for more government interference are asking
ultimately for more compulsion and less freedom.
-- Ludwig von Mises