Re: Regex on a huge text

On Fri, Aug 22, 2008 at 11:24 AM, Dan <re********@gmail.com> wrote:
> I'm looking for a way to apply a regex to a pretty huge input text (a file
> that's a couple of gigabytes). I found finditer, which returns results
> iteratively, which is good, but it looks like I still need to pass it a string
> that would be bigger than my RAM. Is there a way to apply a regex directly
> to a file?
>
> Any help would be appreciated.

You can call the POSIX *grep* utility.
But if the regex's matches can only occur within a single
line of that file:
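#<code>
import re

def getmatches(regex, line):
    # One possible "getmatches": all non-overlapping matches in the line.
    return re.findall(regex, line)

res = []
with open(filename) as f:  # file() exists only in Python 2
    for line in f:
        res.extend(getmatches(regex, line))
#</code>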

Regards
Aug 22 '08 #1
On Aug 23, 6:19 am, "Medardo Rodriguez" <med....@gmail.com> wrote:
> On Fri, Aug 22, 2008 at 11:24 AM, Dan <redalas...@gmail.com> wrote:
>> I'm looking for a way to apply a regex to a pretty huge input text (a file
>> that's a couple of gigabytes). I found finditer, which returns results
>> iteratively, which is good, but it looks like I still need to pass it a string
>> that would be bigger than my RAM. Is there a way to apply a regex directly
>> to a file?
>> Any help would be appreciated.
>
> You can call the POSIX *grep* utility.
> But if the regex's matches can only occur within a single
> line of that file:
> #<code>
> (snip)
> #</code>

Docs:
"""
mmap — Memory-mapped file support

Memory-mapped file objects behave like both strings and like file
objects. Unlike normal string objects, however, these are mutable. You
can use mmap objects in most places where strings are expected; for
example, you can use the re module to search through a memory-mapped
file.
"""

Aug 22 '08 #2
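
A minimal sketch of the mmap approach from the docs quoted in #2, assuming Python 3 (where searching a memory-mapped file requires a bytes pattern) and a non-empty file. `find_in_file` is a hypothetical helper name, not something anyone in the thread wrote.

```python
import mmap
import re


def find_in_file(path, pattern):
    """Yield (byte_offset, matched_bytes) for each match of a bytes pattern.

    Hypothetical helper: the file is memory-mapped rather than read into a
    string, so the regex engine works directly on the mapped bytes.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
        # Note: mapping an empty file raises ValueError.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for m in re.finditer(pattern, mm):
                yield m.start(), m.group()
```

Something like `for offset, text in find_in_file("big.log", rb"ERROR \d+"): ...` would then walk the matches one at a time; the file name and pattern here are only placeholders.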
On Fri, 22 Aug 2008 18:56:51 -0300, John Machin <sj******@lexicon.net> wrote:
> On Aug 23, 6:19 am, "Medardo Rodriguez" <med....@gmail.com> wrote:
>> On Fri, Aug 22, 2008 at 11:24 AM, Dan <redalas...@gmail.com> wrote:
>>> I'm looking for a way to apply a regex to a pretty huge input text (a file
>>> that's a couple of gigabytes). I found finditer, which returns results
>>> iteratively, which is good, but it looks like I still need to pass it a string
>>> that would be bigger than my RAM. Is there a way to apply a regex directly
>>> to a file?
>
> Docs:
> """
> mmap — Memory-mapped file support
>
> Memory-mapped file objects behave like both strings and like file
> objects. Unlike normal string objects, however, these are mutable. You
> can use mmap objects in most places where strings are expected; for
> example, you can use the re module to search through a memory-mapped
> file.
> """

That is still limited by the virtual address range available to user processes: 2 GB or 3 GB depending on the OS (assuming a 32-bit OS).

--
Gabriel Genellina

Aug 24 '08 #3
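
Where the file cannot be mapped at all (the 32-bit limit raised in #3), one common workaround, which nobody in the thread actually proposed, is to scan the file in chunks and carry an overlap across chunk boundaries. The sketch below rests on loud assumptions: no match is longer than `max_len` bytes, the pattern never matches the empty string, and it does not depend on context outside the match (anchors, lookarounds, `\b`). `finditer_chunked`, `max_len` and `chunk_size` are hypothetical names.

```python
import re


def finditer_chunked(path, pattern, max_len, chunk_size=1 << 20):
    """Yield (file_offset, matched_bytes) without loading the whole file.

    ASSUMPTIONS (not from the thread): every match is at most max_len bytes,
    the pattern is a bytes pattern that never matches b"", and it needs no
    context outside the matched text.
    """
    regex = re.compile(pattern)
    buf = b""
    base = 0  # file offset of buf[0]
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            at_eof = not chunk
            buf += chunk
            # A match starting before `limit` cannot be cut short by the
            # end of the buffer, so it is safe to report.
            limit = len(buf) if at_eof else len(buf) - max_len
            keep_from = max(limit, 0)
            for m in regex.finditer(buf):
                if m.start() >= limit:
                    break  # may be truncated; retry once more data arrives
                yield base + m.start(), m.group()
                keep_from = max(keep_from, m.end())
            if at_eof:
                return
            # Drop what has been fully scanned; keep the unsafe tail.
            base += keep_from
            buf = buf[keep_from:]
```

Memory use stays around `chunk_size + max_len` bytes, at the cost of rescanning the short tail of each buffer.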
On Aug 22, 9:19 pm, "Medardo Rodriguez" <med....@gmail.com> wrote:
> On Fri, Aug 22, 2008 at 11:24 AM, Dan <redalas...@gmail.com> wrote:
>> I'm looking for a way to apply a regex to a pretty huge input text (a file
>> that's a couple of gigabytes). I found finditer, which returns results
>> iteratively, which is good, but it looks like I still need to pass it a string
>> that would be bigger than my RAM. Is there a way to apply a regex directly
>> to a file?
>> Any help would be appreciated.
>
> You can call the POSIX *grep* utility.
> But if the regex's matches can only occur within a single
> line of that file:
> #<code>
> res = []
> with open(filename) as f:
>     for line in f:
>         res.extend(getmatches(regex, line))
> # Of course "getmatches" describes the concept.
> #</code>
>
> Regards

Try to pre-filter your file line by line to cut it down, then
apply a further filter to the result.

For example, if you were looking for consecutive SPAM records with the
same Name field then you might first extract only the SPAM records
from the gigabytes to leave something more manageable to search for
consecutive Name fields in.

- Paddy.
Aug 24 '08 #4
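
A rough sketch of the two-stage idea from #4. The record layout is entirely hypothetical (lines starting with `SPAM` and carrying a `Name=...` field), as are the names `spam_records`, `name_of` and `repeated_names`; the thread does not say what the real records look like.

```python
import itertools
import re

# Hypothetical field layout: "... Name=<word> ..."
NAME = re.compile(r"Name=(\w+)")


def spam_records(lines):
    """Stage 1: cheap line-by-line filter that keeps only SPAM records."""
    return (line for line in lines if line.startswith("SPAM"))


def name_of(line):
    """Return the Name field of a record, or None if it has none."""
    m = NAME.search(line)
    return m.group(1) if m else None


def repeated_names(lines):
    """Stage 2: names shared by two or more consecutive SPAM records."""
    result = []
    for name, run in itertools.groupby(spam_records(lines), key=name_of):
        if name is not None and sum(1 for _ in run) > 1:
            result.append(name)
    return result
```

Because both stages are lazy, `repeated_names(open(path))` would read the file one line at a time; only the (much smaller) stream of SPAM records reaches the second stage.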
