Re: Regex on a huge text

On Fri, Aug 22, 2008 at 11:24 AM, Dan <re********@gmail.com> wrote:
> I'm looking for a way to apply a regex to a pretty huge input text (a file
> that's a couple of gigabytes). I found finditer, which returns results
> iteratively, which is good, but it looks like I still need to pass in a
> string that would be bigger than my RAM. Is there a way to apply a regex
> directly to a file?
>
> Any help would be appreciated.

You can call the POSIX *grep* utility.
But if the regex's matches can only occur within a single line of
that file:
#<code>
res = []
with open(filename) as f:
    for line in f:
        res.extend(getmatches(regex, line))
# Of course, "getmatches" only describes the concept.
#</code>

Regards
Aug 22 '08 #1
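
A minimal sketch of the line-by-line idea from the reply above, assuming Python 3 and matches that never span a line break. The function name `iter_line_matches` and the UTF-8 encoding are assumptions made for illustration; `re.finditer` stands in for the conceptual "getmatches".

```python
import re
from typing import Iterator


def iter_line_matches(path: str, pattern: str) -> Iterator[str]:
    """Yield each matched string, scanning the file one line at a time."""
    regex = re.compile(pattern)
    # errors="replace" keeps the scan going if a line is not valid UTF-8.
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:  # only the current line is held in memory
            for m in regex.finditer(line):
                yield m.group(0)
```

Because this is a generator, results can be consumed one at a time instead of being collected into a list that might itself grow large.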
On Aug 23, 6:19 am, "Medardo Rodriguez" <med....@gmail.com> wrote:
> On Fri, Aug 22, 2008 at 11:24 AM, Dan <redalas...@gmail.com> wrote:
> > I'm looking for a way to apply a regex to a pretty huge input text (a file
> > that's a couple of gigabytes). I found finditer, which returns results
> > iteratively, which is good, but it looks like I still need to pass in a
> > string that would be bigger than my RAM. Is there a way to apply a regex
> > directly to a file?
> > Any help would be appreciated.
>
> You can call the POSIX *grep* utility.
> But if the regex's matches can only occur within a single line of
> that file:
> #<code>
> (snip)
> #</code>

Docs:
"""
mmap — Memory-mapped file support

Memory-mapped file objects behave like both strings and like file
objects. Unlike normal string objects, however, these are mutable. You
can use mmap objects in most places where strings are expected; for
example, you can use the re module to search through a memory-mapped
file.
"""

Aug 22 '08 #2
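
A minimal sketch of what the quoted documentation describes, assuming Python 3, where a memory-mapped file exposes bytes and the pattern therefore has to be a bytes pattern. The function name `iter_mmap_matches` is hypothetical; `mmap.mmap`, `mmap.ACCESS_READ` and `re.finditer` are the standard-library calls.

```python
import mmap
import re
from typing import Iterator, Tuple


def iter_mmap_matches(path: str, pattern: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (byte offset, matched bytes) for each match in a memory-mapped file."""
    regex = re.compile(pattern)  # bytes pattern, because an mmap exposes bytes
    with open(path, "rb") as f:
        # Length 0 maps the whole file; mapping an empty file raises ValueError.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for m in regex.finditer(mm):
                yield m.start(), m.group(0)
```

The operating system pages the file in on demand, so the whole file does not have to fit in RAM, and matches may span line breaks.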
On Fri, 22 Aug 2008 18:56:51 -0300, John Machin <sj******@lexicon.net> wrote:
> On Aug 23, 6:19 am, "Medardo Rodriguez" <med....@gmail.com> wrote:
> > On Fri, Aug 22, 2008 at 11:24 AM, Dan <redalas...@gmail.com> wrote:
> > > I'm looking for a way to apply a regex to a pretty huge input text (a file
> > > that's a couple of gigabytes). I found finditer, which returns results
> > > iteratively, which is good, but it looks like I still need to pass in a
> > > string that would be bigger than my RAM. Is there a way to apply a regex
> > > directly to a file?
>
> Docs:
> """
> mmap — Memory-mapped file support
>
> Memory-mapped file objects behave like both strings and like file
> objects. Unlike normal string objects, however, these are mutable. You
> can use mmap objects in most places where strings are expected; for
> example, you can use the re module to search through a memory-mapped
> file.
> """

That is still limited to the virtual memory address range available to a user process: 2 GB or 3 GB depending on the OS (assuming a 32-bit OS).

--
Gabriel Genellina

Aug 24 '08 #3
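
One workaround that the thread itself does not spell out is to read the file in fixed-size chunks and carry a short tail over between reads, so that neither RAM nor the address space has to hold the whole file. The sketch below is an illustration under stated assumptions, not a method proposed by any poster: it assumes that no match is longer than `max_match` bytes and that the pattern uses no anchors or lookaround that depend on text outside the match. The names `iter_chunked_matches`, `chunk_size` and `max_match` are hypothetical.

```python
import re
from typing import Iterator, Tuple


def iter_chunked_matches(
    path: str,
    pattern: bytes,
    chunk_size: int = 1 << 20,
    max_match: int = 4096,
) -> Iterator[Tuple[int, bytes]]:
    """Yield (file offset, matched bytes) using a bounded amount of memory.

    Assumes no match is longer than max_match bytes and that the pattern
    uses no anchors or lookaround that depend on text outside the match.
    """
    regex = re.compile(pattern)
    buf = b""
    base = 0  # file offset of buf[0]
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            at_eof = not chunk
            buf += chunk
            # A match starting in the last max_match bytes could continue into
            # the next chunk, so it is deferred until more data has arrived.
            limit = len(buf) if at_eof else max(len(buf) - max_match, 0)
            consumed = 0
            for m in regex.finditer(buf):
                if m.start() >= limit:
                    break
                yield base + m.start(), m.group(0)
                consumed = m.end()
            if at_eof:
                return
            # Drop everything already scanned; keep only the deferred tail.
            keep_from = max(limit, consumed)
            base += keep_from
            buf = buf[keep_from:]
```

The buffer never grows much beyond `chunk_size + max_match` bytes, at the cost of the length restriction on matches.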
On Aug 22, 9:19 pm, "Medardo Rodriguez" <med....@gmail.com> wrote:
> On Fri, Aug 22, 2008 at 11:24 AM, Dan <redalas...@gmail.com> wrote:
> > I'm looking for a way to apply a regex to a pretty huge input text (a file
> > that's a couple of gigabytes). I found finditer, which returns results
> > iteratively, which is good, but it looks like I still need to pass in a
> > string that would be bigger than my RAM. Is there a way to apply a regex
> > directly to a file?
> > Any help would be appreciated.
>
> You can call the POSIX *grep* utility.
> But if the regex's matches can only occur within a single line of
> that file:
> #<code>
> res = []
> with open(filename) as f:
>     for line in f:
>         res.extend(getmatches(regex, line))
> # Of course, "getmatches" only describes the concept.
> #</code>
>
> Regards

Try pre-filtering your file line by line to cut it down, then
apply a further filter to the result.

For example, if you were looking for consecutive SPAM records with the
same Name field, you might first extract only the SPAM records
from the gigabytes, leaving something more manageable in which to
search for consecutive Name fields.

- Paddy.
Aug 24 '08 #4
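
A minimal sketch of the two-stage filtering described above. The record layout is an assumption made purely for illustration (one record per line, starting with its kind and containing a `Name=<value>` field); the post does not say what the real records look like, and the function names are hypothetical.

```python
import re
from typing import Iterable, Iterator, Tuple

# Hypothetical record layout, assumed for illustration: "<KIND> Name=<name> ..."
NAME_RE = re.compile(r"Name=(\w+)")


def spam_records(lines: Iterable[str]) -> Iterator[str]:
    """First pass: keep only the lines that are SPAM records."""
    return (line for line in lines if line.startswith("SPAM"))


def consecutive_same_name(records: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Second pass: yield pairs of adjacent records whose Name fields match."""
    prev_line, prev_name = None, None
    for line in records:
        m = NAME_RE.search(line)
        name = m.group(1) if m else None
        if name is not None and name == prev_name:
            yield prev_line, line
        prev_line, prev_name = line, name
```

Passing an open file object as `lines` keeps both passes streaming, so only two records are held in memory at any time.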
