si************@hotmail.com wrote:
Hi All,
I need to parse certain text from a paragraph (like 20 lines).
I know the exact tags that I am looking for.
My approach is to define an XML (config) file that defines which tags I
am looking for and the corresponding regular expression to search for
each pattern.
The XML file will also have a way to say what the previous tag should be
and what the next tag should be. Again, some of it through regular
expressions and some of it through logic.
At run time, just read the XML, find each tag and execute the
corresponding regular expression.
Assuming more patterns may be added and more rules might be coming up,
is this the best approach?
Are there other ways to make it more flexible and generic?
I don't want to end up with stringent rules, but rather develop some
sort of extendable grammar.
Any ideas?
You'll always end up with code that's tied to the grammar of your
'language', unless you're using an LR(n) parser core with action/goto
tables.
Normally, you'd use a lexical analyzer to convert text to tokens, then
interpret the tokens with a parser, which 'handles' them by converting
streams of terminals (tokens) into non-terminals and executing actions
based on the determined non-terminals. Terminals and non-terminals are
terms used in (E)BNF, the notation for grammars.
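
To make the lexer/parser split concrete, here is a minimal sketch in
Python. The token names (NUMBER, WORD), the 'Pair' non-terminal and the
function names are all made up for illustration; they are not from the
original question, and a real grammar would look different.

```python
import re

# Hypothetical token definitions: (terminal name, regular expression).
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("WORD", r"[A-Za-z]+"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(text):
    """Lexical analysis: turn raw text into a list of (terminal, value) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":  # whitespace is dropped, not tokenized
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def parse_pairs(tokens):
    """Parsing: reduce each 'WORD NUMBER' run to a Pair non-terminal.

    Grammar assumed for this sketch:
        Pairs -> Pair*
        Pair  -> WORD NUMBER
    """
    pairs = []
    i = 0
    while i < len(tokens):
        is_pair = (
            i + 1 < len(tokens)
            and tokens[i][0] == "WORD"
            and tokens[i + 1][0] == "NUMBER"
        )
        if not is_pair:
            raise SyntaxError(f"expected WORD NUMBER at token {i}")
        # Action executed when the Pair non-terminal is recognized.
        pairs.append((tokens[i][1], int(tokens[i + 1][1])))
        i += 2
    return pairs
```

Calling parse_pairs(tokenize("width 20 height 35")) gives a list of
(name, value) tuples. Note that the action inside parse_pairs is already
specific to this one grammar.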
What you should focus on is writing something that works, rather than
something that can parse every language in the world, because that
won't work: there's always a part of the code that's tied to the
grammar. For example, if you're using an LR(n) parser generator, which in
theory produces an action/goto table and uses a generic parser core, it
still has to have rule handlers which handle the action to be executed
when a non-terminal is found. For example, say you have the following
syntax:
http://www.microsoft.com
This can then be written in EBNF as:
URL -> UrlStartToken urltext UrlEndToken
UrlStartToken ->
UrlEndToken ->
urltext -> ...
Now, if the nonterminal 'URL' is found, it has to be handled, so the
rule handler for that nonterminal has to be written in code and is
therefore tied to the grammar and therefore not generic. But that's ok,
as you simply want to parse something, to get something done, not to
have something completely generic which doesn't do anything.
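
A rough sketch of that split in Python, with the grammar kept as data
and the handler kept as code. This is not an LR parser core; it simply
concatenates regular expressions, which is closer to the approach in the
question. The start and end markers '[url]' and '[/url]' are assumed
purely for illustration (the actual tokens aren't shown above), and all
names (GRAMMAR, HANDLERS, handle_url, find_all) are hypothetical.

```python
import re

# Grammar as data; this part could be loaded from a config file.
# ASSUMPTION: the '[url]' / '[/url]' markers are invented for this sketch.
GRAMMAR = {
    "UrlStartToken": r"\[url\]",
    "urltext": r"[^\[\]]+",
    "UrlEndToken": r"\[/url\]",
}

# URL -> UrlStartToken urltext UrlEndToken
URL_RULE = ["UrlStartToken", "urltext", "UrlEndToken"]


def build_rule_regex(rule, grammar):
    """Generic part: join the terminals of one rule into a single regex."""
    return re.compile("".join(f"(?P<{name}>{grammar[name]})" for name in rule))


def handle_url(match):
    """Grammar-specific part: the action run when 'URL' is recognized."""
    return {"type": "URL", "target": match.group("urltext").strip()}


# Handlers are keyed by non-terminal; every new non-terminal needs new code.
HANDLERS = {"URL": handle_url}


def find_all(text, name, rule, grammar):
    """Find every occurrence of the rule and pass it to its handler."""
    pattern = build_rule_regex(rule, grammar)
    return [HANDLERS[name](m) for m in pattern.finditer(text)]
```

Adding a pattern only means adding entries to GRAMMAR and a rule list,
but deciding what to do with a match still needs a function like
handle_url, and that is the part that stays tied to the grammar.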
Frans
--
------------------------------------------------------------------------
Lead developer of LLBLGen Pro, the productive O/R mapper for .NET
LLBLGen Pro website:
http://www.llblgen.com
My .NET blog:
http://weblogs.asp.net/fbouma
Microsoft MVP (C#)
------------------------------------------------------------------------