473,545 Members | 2,788 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Regexp optimization question

I'm working on a project (Atox) where I need to match quite a few
regular expressions (several hundred) in reasonably large text files.
I've found that this can easily get rather slow. (There are many
things that slow Atox down -- it hasn't been designed for speed, and
any optimizations will entail quite a bit of refactoring.)

I've tried to speed this up by using the same trick as SPARK, putting
all the regexps into a single or-group in a new regexp. That helped a
*lot* -- but now I have to find out which one of them matched at a
certain location. I haven't yet looked at the performance of the code
for checking this, because I encountered a problem before that: Using
named groups will only work for 100 patterns -- not a terrible
problem, since I can create several 100-group patterns -- and using
named groups slows down the matching *a lot*. As far as I could tell,
using named groups actually was slower than simply matching the
patterns one by one.

So: What can I do? Is there any way of getting more speed here, except
implementing the matching code (i.e. the code right around the calls
to _re) in C or Pyrex? (I've tried using Psyco, but that didn't help;
I guess it might help if I implemented things differently...)

Any ideas?

--
Magnus Lie Hetland "Oppression and harassment is a small price to pay
http://hetland.org to live in the land of the free." -- C. M. Burns
Jul 18 '05 #1
19 2166
Magnus Lie Hetland wrote:
I've tried to speed this up by using the same trick as SPARK, putting
all the regexps into a single or-group in a new regexp. That helped a
*lot* -- but now I have to find out which one of them matched at a
certain location.


Are you using the .lastindex attribute of match objects yet?

Martin

Jul 18 '05 #2
Magnus Lie Hetland <ml*@furu.idi.n tnu.no> wrote:
Any ideas?


Few concrete examples, perhaps? Sadly, my telepathetic power is not
what it used to be...

--
William Park, Open Geometry Consulting, <op**********@y ahoo.ca>
Linux solution/training/migration, Thin-client
Jul 18 '05 #3

"Magnus Lie Hetland" <ml*@furu.idi.n tnu.no> schrieb im Newsbeitrag
news:slrnc8gal3 .9da.ml*@furu.i di.ntnu.no...

Any ideas?


Maybe Plex is helpful. I did not use it already, but it seems to adress your
problem
The author of Plex is Greg Ewing. He build Pyrex on top of Plex

The documentation
http://www.cosc.canterbury.ac.nz/~gr...doc/index.html
contains

"""
Plex is designed to fill a need that is left wanting by the existing Python
regular expression modules. If you've ever tried to use one of them for
implementing a scanner, you will have found that they're not really suited
to the task. You can define a bunch of regular expressions which match your
tokens all right, but you can only match one of them at a time against your
input. To match all of them at once, you have to join them all together into
one big r.e., but then you've got no easy way to tell which one matched.
This is the problem that Plex is designed to solve.
"""

Hope I could help you
Guenter
Jul 18 '05 #4
Magnus Lie Hetland wrote:
but now I have to find out which one of them matched at a
certain location.


the techniques discussed in this article may be helpful:

http://effbot.org/zone/xml-scanner.htm

</F>


Jul 18 '05 #5
In article <c6************ *@news.t-online.com>, Martin v. Löwis wrote:
Magnus Lie Hetland wrote:
I've tried to speed this up by using the same trick as SPARK, putting
all the regexps into a single or-group in a new regexp. That helped a
*lot* -- but now I have to find out which one of them matched at a
certain location.
Are you using the .lastindex attribute of match objects yet?


Dang! I thought I had read the docs ;)

Seems very useful. I actually need to know all the matches, because
which one gets selected depends on the current priorities (ordering)
in the parser, and cannot be fixed for the scanner. (I could put the
patterns in the or-group in reverse prioritized order, but that would
have to be fixed...)

Thanks for the tip, anyway.
Martin


--
Magnus Lie Hetland "Wake up!" - Rage Against The Machine
http://hetland.org "Shut up!" - Linkin Park
Jul 18 '05 #6
In article <c6************ @ID-99293.news.uni-berlin.de>, William Park wrote:
Magnus Lie Hetland <ml*@furu.idi.n tnu.no> wrote:
Any ideas?


Few concrete examples, perhaps? Sadly, my telepathetic power is not
what it used to be...


Well... Hard to give specific examples, as the specifics will be
user-specified.

But in order to explain what I'm doing, I could give a rather generic
example.

I might have a bunch of regexps like 'foo1', 'foo2', 'foo3', ...,
'foo500' (these would be more complex, of course).

Now I can do something like this:

hits = []
for pat in pats:
hits.extend(pat .finditer(text) )
# Maybe sort them in order of occurrence

*Or* I can do something like this:

bigPat = '(' + '|'.join(pats) + ')'
hits = list(bigPat.fin diter(text))

The last one is *much* faster -- but only if I forego named groups.
And without them, I can't find out which patterns matched at which
locations. (I can, as suggested, use .lastindex to find *one* of them,
but not all, as I need.)

I'm sure there are plenty of bottlenecks in my code, but this seems to
be one of them, and it's slow even if I run it alone (without any of
the rest of my parsing code).

--
Magnus Lie Hetland "Wake up!" - Rage Against The Machine
http://hetland.org "Shut up!" - Linkin Park
Jul 18 '05 #7
In article <c6**********@n ewsreader2.netc ologne.de>, Günter Jantzen wrote:

"Magnus Lie Hetland" <ml*@furu.idi.n tnu.no> schrieb im Newsbeitrag
news:slrnc8gal 3.9da.ml*@furu. idi.ntnu.no...

Any ideas?

Maybe Plex is helpful. I did not use it already, but it seems to adress your
problem


Ahah.
The author of Plex is Greg Ewing. He build Pyrex on top of Plex
Yeah, I know -- I just never looked at Plex in detail.
The documentation
http://www.cosc.canterbury.ac.nz/~gr...doc/index.html
contains [snip]

Yeah, I looked at the docs, and it looks very promising!

One of the reasons I've avoided existing lexers is that I don't do
standard tokenization -- I don't partition all of the text into regexp
tokens. I allow the lexer to skip over text -- somewhat like how
whitespace is normally handled, except that this can be *any* text --
and to return the next token that is of any use to the current parsing
rule.

But who knows -- I may be able to use Plex anyway.

One problem might be that the regexp format seems to be quite
stripped-down (although, of course, a regexp is a regexp,
theoretically speaking ;)

No non-greedy matching, no lookahead/lookback etc.

But if Plex gives a real performance boost, I may be able to live with
that. (The regexp part is functionality that is available to the user,
in my case.)
Hope I could help you
It might. Thanks.
Guenter


--
Magnus Lie Hetland "Wake up!" - Rage Against The Machine
http://hetland.org "Shut up!" - Linkin Park
Jul 18 '05 #8
In article <ma************ *************** ***********@pyt hon.org>,
Fredrik Lundh wrote:
Magnus Lie Hetland wrote:
but now I have to find out which one of them matched at a
certain location.
the techniques discussed in this article may be helpful:

http://effbot.org/zone/xml-scanner.htm


Yes, this is indeed useful. The most useful thing, I guess, is the
lastindex attribute (which was pointed out earlier) -- but I suspect I
can't use it, as I have to know all the patterns that were matched at
a given location (and let the parser choose which one to use, based
on its context-dependent priorities). Or, at least, that's how it
seems to me now.

I *could*, of course, say that ambiguity is not allowed -- or that the
effects aren't defined. Or maybe I could order the patterns in terms
of specificity (but analyzing regexps for specificity is no mean feat,
of course ;)

Thanks for the pointer, though -- a nice tutorial!
</F>


--
Magnus Lie Hetland "Wake up!" - Rage Against The Machine
http://hetland.org "Shut up!" - Linkin Park
Jul 18 '05 #9
Magnus Lie Hetland <ml*@furu.idi.n tnu.no> wrote:
Now I can do something like this:

hits = []
for pat in pats:
hits.extend(pat .finditer(text) )
# Maybe sort them in order of occurrence

*Or* I can do something like this:

bigPat = '(' + '|'.join(pats) + ')'
hits = list(bigPat.fin diter(text))

The last one is *much* faster -- but only if I forego named groups.
And without them, I can't find out which patterns matched at which
locations. (I can, as suggested, use .lastindex to find *one* of them,
but not all, as I need.)


Since you want both the matched strings and their locations in file, you
pretty much have to this manually, one by one.

--
William Park, Open Geometry Consulting, <op**********@y ahoo.ca>
Linux solution/training/migration, Thin-client
Jul 18 '05 #10

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

1
2184
by: python_charmer2000 | last post by:
I want to match several regexps against a large body of text. What I have so far is similar to this: re1 = <some regexp> re2 = <some regexp> re3 = <some regexp> big_re = re.compile(re1 + '|' + re2 + '|' + re3) matches = big_re.finditer(file_list)
5
2338
by: Lukas Holcik | last post by:
Hi everyone! How can I simply search text for regexps (lets say <a href="(.*?)">(.*?)</a>) and save all URLs(1) and link contents(2) in a dictionary { name : URL}? In a single pass if it could. Or how can I replace the html &entities; in a string "blablabla&amp;blablabal&amp;balbalbal" with the chars they mean using re.sub? I found out they are...
93
3553
by: roman ziak | last post by:
I just read couple articles on this group and it keeps amazing me how the portability is used as strong argument for language cleanliness. In my opinion, porting the program (so you just take the source code and recompile) is a myth started 20-30 years ago when world consisted of UNIX systems. Well, world does not consist of UNIX systems...
7
3427
by: Csaba Gabor | last post by:
I need to come up with a function function regExpPos (text, re, parenNum) { ... } that will return the position within text of RegExp.$parenNum if there is a match, and -1 otherwise. For example: var re = /some(thing|or other)?.*(n(est)(?:ed)?.*(parens) )/ var text = "There were some nesting parens in the test"; alert (regExpPos (text,...
5
2375
by: wkaras | last post by:
I've compiled this code: const int x0 = 10; const int x1 = 20; const int x2 = 30; int x = { x2, x0, x1 }; struct Y {
9
1922
by: vbfoobar | last post by:
Hello I am looking for python code that takes as input a list of strings (most similar, but not necessarily, and rather short: say not longer than 50 chars) and that computes and outputs the python regular expression that matches these string values (not necessarily strictly, perhaps the code is able to determine patterns, i.e. families...
11
2904
by: HopfZ | last post by:
I coudn't understand some behavior of RegExp.test function. Example html code: ---------------- <html><head></head><body><script type="text/javascript"> var r = /^https?:\/\//g; document.write( ); </script></body></html> ---------------------
7
1073
by: cirfu | last post by:
pat = re.compile("(\w* *)*") this matches all sentences. if fed the string "are you crazy? i am" it will return "are you crazy". i want to find a in a big string a sentence containing Zlatan Ibrahimovic and some other text. ie return the first sentence containing the name Zlatan Ibrahimovic.
20
2323
by: Ravikiran | last post by:
Hi Friends, I wanted know about whatt is ment by zero optimization and sign optimization and its differences.... Thank you...
0
7432
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language...
0
7689
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. ...
0
7943
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that...
1
5359
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes...
0
3490
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in...
0
3470
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
1
1919
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
1
1044
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.
0
743
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.