473,386 Members | 1,668 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,386 software developers and data experts.

Re: Regex on a huge text

On Fri, Aug 22, 2008 at 11:24 AM, Dan <re********@gmail.comwrote:
I'm looking on how to apply a regex on a pretty huge input text (a file
that's a couple of gigabytes). I found finditer which would return results
iteratively which is good but it looks like I still need to send a string
which would be bigger than my RAM. Is there a way to apply a regex directly
on a file?

Any help would be appreciated.

You can call *grep* posix utility.
But if the regex's matches are possible only inner the context of a
line of that file:
#<code>
res = []
with file(filename) as f:
for line in f:
res.extend(getmatches(regex, line))
# Of course "getmatches" describes the concept.
#</code>

Regards
Aug 22 '08 #1
3 1592
On Aug 23, 6:19 am, "Medardo Rodriguez" <med....@gmail.comwrote:
On Fri, Aug 22, 2008 at 11:24 AM, Dan <redalas...@gmail.comwrote:
I'm looking on how to apply a regex on a pretty huge input text (a file
that's a couple of gigabytes). I found finditer which would return results
iteratively which is good but it looks like I still need to send a string
which would be bigger than my RAM. Is there a way to apply a regex directly
on a file?
Any help would be appreciated.

You can call *grep* posix utility.
But if the regex's matches are possible only inner the context of a
line of that file:
#<code>
(snip)
#</code>
Docs:
"""
mmap — Memory-mapped file support

Memory-mapped file objects behave like both strings and like file
objects. Unlike normal string objects, however, these are mutable. You
can use mmap objects in most places where strings are expected; for
example, you can use the re module to search through a memory-mapped
file.
"""

Aug 22 '08 #2
En Fri, 22 Aug 2008 18:56:51 -0300, John Machin <sj******@lexicon.netescribió:
On Aug 23, 6:19 am, "Medardo Rodriguez" <med....@gmail.comwrote:
>On Fri, Aug 22, 2008 at 11:24 AM, Dan <redalas...@gmail.comwrote:
I'm looking on how to apply a regex on a pretty huge input text (a file
that's a couple of gigabytes). I found finditer which would return results
iteratively which is good but it looks like I still need to send a string
which would be bigger than my RAM. Is there a way to apply a regex directly
on a file?

Docs:
"""
mmap — Memory-mapped file support

Memory-mapped file objects behave like both strings and like file
objects. Unlike normal string objects, however, these are mutable. You
can use mmap objects in most places where strings are expected; for
example, you can use the re module to search through a memory-mapped
file.
"""
Still limited to virtual memory address range for user processes, 2GB or 3GB depending on the OS (assuming a 32 bits OS).

--
Gabriel Genellina

Aug 24 '08 #3
On Aug 22, 9:19*pm, "Medardo Rodriguez" <med....@gmail.comwrote:
On Fri, Aug 22, 2008 at 11:24 AM, Dan <redalas...@gmail.comwrote:
I'm looking on how to apply a regex on a pretty huge input text (a file
that's a couple of gigabytes). I found finditer which would return results
iteratively which is good but it looks like I still need to send a string
which would be bigger than my RAM. Is there a way to apply a regex directly
on a file?
Any help would be appreciated.

You can call *grep* posix utility.
But if the regex's matches are possible only inner the context of a
line of that file:
#<code>
res = []
with file(filename) as f:
* * for line in f:
* * * * res.extend(getmatches(regex, line))
# *Of course "getmatches" describes the concept.
#</code>

Regards
Try and pre-filter your file on a line basis to cut it down , then
apply a further filter on the result.

For example, if you were looking for consecutive SPAM records with the
same Name field then you might first extract only the SPAM records
from the gigabytes to leave something more manageable to search for
consecutive Name fields in.

- Paddy.
Aug 24 '08 #4

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

3
by: Joe Fisherman | last post by:
I have used regex to parse a huge text file, and grab a tab delimited portion of it. I often use comma delimited text files, and use Jet Oledb4. I read that I would need an ini if the file wasn't...
9
by: Whitless | last post by:
Okay I am ready to pull what little hair I have left out. I pass the function below my String to search, my find string (a regular expression) and my replace string (another regular expression)....
5
by: Chris | last post by:
How Do I use the following auto-generated code from The Regulator? '------------------------------------------------------------------------------ ' <autogenerated> ' This code was generated...
4
by: mikewolfbaltimore | last post by:
Hello all have a regex question... I want to split an address into descrete parts so 709 S Milton Ave is split into number = 709 Direction = S Name = Milton
6
by: Extremest | last post by:
I have a huge regex setup going on. If I don't do each one by itself instead of all in one it won't work for. Also would like to know if there is a faster way tried to use string.replace with all...
10
by: Barry L. Camp | last post by:
Hi all... hope someone can help out. Not a unique situation, but my search for a solution has not yielded what I need yet. I'm trying to come up with a regular expression for a...
16
by: Mark Chambers | last post by:
Hi there, I'm seeking opinions on the use of regular expression searching. Is there general consensus on whether it's now a best practice to rely on this rather than rolling your own (string)...
0
by: Terry Reedy | last post by:
Medardo Rodriguez wrote: Does not grep only work a line at a time? Just like the code below?
6
by: | last post by:
Hi all, Sorry for the lengthy post but as I learned I should post concise-and-complete code. So the code belows shows that the execution of ValidateAddress consumes a lot of time. In the test...
0
by: taylorcarr | last post by:
A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.