Re: Regex on a huge text

On Fri, Aug 22, 2008 at 11:24 AM, Dan <re********@gmail.com> wrote:
> I'm looking on how to apply a regex on a pretty huge input text (a file
> that's a couple of gigabytes). I found finditer which would return results
> iteratively which is good but it looks like I still need to send a string
> which would be bigger than my RAM. Is there a way to apply a regex directly
> on a file?
>
> Any help would be appreciated.

You can call the POSIX *grep* utility.
But if each match can only occur within a single line of the file:
#<code>
res = []
with open(filename) as f:  # open() rather than the Python 2-only file()
    for line in f:
        res.extend(getmatches(regex, line))
# Of course, "getmatches" is a placeholder that describes the concept.
#</code>
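
A minimal concrete sketch of the same line-by-line idea, assuming a text file and a pattern that never needs to span a newline. The name `iter_line_matches` is hypothetical (it stands in for the conceptual "getmatches" above), and `re.finditer` is just one possible way to fill it in. It yields results lazily, so neither the file nor the full result list has to fit in RAM.

```python
import re

def iter_line_matches(path, pattern):
    """Yield (line_number, matched_text) for every match, one line at a time."""
    regex = re.compile(pattern)
    # errors="replace" is an assumption so that stray bytes do not abort the scan
    with open(path, encoding="utf-8", errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            # Only the current line is held in memory
            for m in regex.finditer(line):
                yield lineno, m.group(0)
```

Usage would look like `for lineno, text in iter_line_matches("big.log", r"\d+"): ...`, where the file name and pattern are placeholders.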

Regards
Aug 22 '08 #1
On Aug 23, 6:19 am, "Medardo Rodriguez" <med....@gmail.com> wrote:
> On Fri, Aug 22, 2008 at 11:24 AM, Dan <redalas...@gmail.com> wrote:
>> I'm looking on how to apply a regex on a pretty huge input text (a file
>> that's a couple of gigabytes). I found finditer which would return results
>> iteratively which is good but it looks like I still need to send a string
>> which would be bigger than my RAM. Is there a way to apply a regex directly
>> on a file?
>> Any help would be appreciated.
>
> You can call the POSIX *grep* utility.
> But if each match can only occur within a single line of the file:
> #<code>
> (snip)
> #</code>

Docs:
"""
mmap — Memory-mapped file support

Memory-mapped file objects behave like both strings and like file
objects. Unlike normal string objects, however, these are mutable. You
can use mmap objects in most places where strings are expected; for
example, you can use the re module to search through a memory-mapped
file.
"""

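A hedged sketch of what the quoted documentation describes, assuming Python 3, where the pattern must be a bytes pattern because a memory-mapped file exposes bytes. The function name `iter_mmap_matches` is hypothetical. Note that mapping an empty file raises an error, so this assumes a non-empty file.

```python
import mmap
import re

def iter_mmap_matches(path, pattern):
    """Yield (byte_offset, matched_bytes) for each match in a memory-mapped file."""
    regex = re.compile(pattern)  # must be a bytes pattern, e.g. rb"\d+"
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # The OS pages data in on demand, so the file is never read into
            # a Python string in one piece
            for m in regex.finditer(mm):
                yield m.start(), m.group(0)
```

The file still has to fit in the process's address space, so this mainly helps on 64-bit systems.
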
Aug 22 '08 #2
On Fri, 22 Aug 2008 18:56:51 -0300, John Machin <sj******@lexicon.net> wrote:
> On Aug 23, 6:19 am, "Medardo Rodriguez" <med....@gmail.com> wrote:
>> On Fri, Aug 22, 2008 at 11:24 AM, Dan <redalas...@gmail.com> wrote:
>>> I'm looking on how to apply a regex on a pretty huge input text (a file
>>> that's a couple of gigabytes). I found finditer which would return results
>>> iteratively which is good but it looks like I still need to send a string
>>> which would be bigger than my RAM. Is there a way to apply a regex directly
>>> on a file?

Docs:
"""
mmap — Memory-mapped file support

Memory-mapped file objects behave like both strings and like file
objects. Unlike normal string objects, however, these are mutable. You
can use mmap objects in most places where strings are expected; for
example, you can use the re module to search through a memory-mapped
file.
"""
That is still limited to the virtual address space available to a user process: 2 GB or 3 GB depending on the OS (assuming a 32-bit OS).
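
One possible workaround when the file does not fit in the address space, which nobody in the thread proposed and which is only a sketch under stated assumptions, is to scan the file in fixed-size chunks and carry a small tail over to the next chunk. The name `iter_chunked_matches` and the parameters `chunk_size` and `max_len` are hypothetical. It assumes a bytes pattern, non-empty matches no longer than `max_len` bytes, and a pattern that does not depend on anchors, `\b`, or lookaround at chunk boundaries.

```python
import re

def iter_chunked_matches(path, pattern, chunk_size=1 << 20, max_len=4096):
    """Yield (absolute_offset, matched_bytes) while holding only a small buffer."""
    regex = re.compile(pattern)  # bytes pattern, e.g. rb"\d+"
    buf = b""
    base = 0  # absolute file offset of buf[0]
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            eof = not chunk
            buf += chunk
            # A match starting in the last max_len bytes may be cut off by the
            # buffer end, so defer it to the next round unless we are at EOF.
            limit = len(buf) if eof else max(len(buf) - max_len, 0)
            keep_from = limit
            for m in regex.finditer(buf):
                if m.start() >= limit:
                    break
                yield base + m.start(), m.group(0)
                keep_from = max(keep_from, m.end())
            if eof:
                return
            # Drop what was already scanned; never re-scan a reported match.
            buf = buf[keep_from:]
            base += keep_from
```

If a match can be longer than `max_len`, it may be truncated or missed, so the bound has to be chosen with the pattern in mind.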

--
Gabriel Genellina

Aug 24 '08 #3
On Aug 22, 9:19 pm, "Medardo Rodriguez" <med....@gmail.com> wrote:
> On Fri, Aug 22, 2008 at 11:24 AM, Dan <redalas...@gmail.com> wrote:
>> I'm looking on how to apply a regex on a pretty huge input text (a file
>> that's a couple of gigabytes). I found finditer which would return results
>> iteratively which is good but it looks like I still need to send a string
>> which would be bigger than my RAM. Is there a way to apply a regex directly
>> on a file?
>> Any help would be appreciated.
>
> You can call the POSIX *grep* utility.
> But if each match can only occur within a single line of the file:
> #<code>
> res = []
> with open(filename) as f:
>     for line in f:
>         res.extend(getmatches(regex, line))
> # Of course, "getmatches" is a placeholder that describes the concept.
> #</code>
>
> Regards

Try to pre-filter your file line by line to cut it down, then
apply a further filter to the result.

For example, if you were looking for consecutive SPAM records with the
same Name field then you might first extract only the SPAM records
from the gigabytes to leave something more manageable to search for
consecutive Name fields in.
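
The SPAM example above could be sketched roughly as a two-stage pipeline. Everything about the record format here is an assumption made for illustration: the `SPAM` prefix, the `Name=...` field, and the names `spam_lines` and `consecutive_same_name` are all hypothetical.

```python
import re

SPAM_RE = re.compile(r"^SPAM\b")        # hypothetical marker for a SPAM record
NAME_RE = re.compile(r"\bName=(\w+)")   # hypothetical layout of the Name field

def spam_lines(lines):
    """Stage 1: cheap per-line pre-filter that keeps only SPAM records."""
    return (line for line in lines if SPAM_RE.search(line))

def consecutive_same_name(records):
    """Stage 2: yield a name each time two consecutive records share it."""
    prev = None
    for rec in records:
        m = NAME_RE.search(rec)
        name = m.group(1) if m else None
        if name is not None and name == prev:
            yield name
        prev = name
```

Both stages are generators, so `consecutive_same_name(spam_lines(f))` over an open file `f` processes one line at a time without building the reduced file in memory.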

- Paddy.
Aug 24 '08 #4
