Finding messages in huge mboxes

Bastiaan Welmers

Hi,

I wondered if anyone has ever met this same mbox issue.

I'm having the following problem:

I need to find messages in huge mbox files (50MB or more).
The following way is (of course?) not very usable:

import mailbox

fp = open("mbox", "r")
archive = mailbox.UnixMailbox(fp)
i = 0
# skip every message before the one we want
while i < message_number_needed:
    i += 1
    archive.next()

needed_message = archive.next()

This is especially slow because I often need messages at the end
of the mbox file.
So I tried the following (scanning backwards for "From " lines
with readline()):

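i = 0
j = 0
while 1:
    i += 1
    fp.seek(-i, 2)  # whence=2: the offset is relative to the end of the file
    line = fp.readline()
    if not line:
        break
    if line[:5] == 'From ':
        j += 1
        if j == total_messages - message_number_needed:
            # readline() left the file position after the "From " line,
            # so step back to the start of that line
            archive.seekp = fp.tell() - len(line)
            message = archive.next()
            # message found
            break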

But this also seems to be slow and CPU-intensive.

Anyone who has a better idea?

Regards,

Bastiaan Welmers

Jul 18 '05 #1


Miklós

What about putting it into a database like MySQL? <pyWink>

Miklós

Jul 18 '05 #2

Diez B. Roggisch

> Anyone who has a better idea?

AFAIK MUAs usually use an mbox.index file for faster access. The index is
computed once, and updated whenever a new message is added. You could
create this index quite easily yourself by looping over the mbox and
pickling a list of tell'ed positions. If you also store the creation date
of the index and the file size of the mbox file, you should be able to
create a function that will update the index whenever the underlying mbox
has changed. Another approach would be to perform index creation on a
regular basis using cron.

Regards,

Diez

Jul 18 '05 #3

Donn Cave

In article <40*********************@news.xs4all.nl>,
Bastiaan Welmers <ha****@welmers.net> wrote:
> ....
> I need to find messages in huge mbox files (50MB or more).
> ....
> Especially because I often need messages at the end
> of the MBOX file.
> So I tried the following (scanning messages backwards
> on found "From " lines with readline())

readline() is not your friend here. I suggest that
you read large blocks of data, like 8192 bytes for
example, and search them iteratively. Like,

    next = block.find('\nFrom ', prev + 1)

This will give you the location of each message in
the current block, so you can split the block up
into a list of messages. (There will be an extra
chunk of data at the beginning of each block, before
the first "From " - recycle that onto the end of the
next block.)

Since file object buffering is at best useless in this
application, I would use posix.open, posix.lseek and
posix.read. Taking this approach, I find that reading
the last 10 messages in a 100 MB folder takes 0.05 sec.
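
One way the block-wise approach could look, reading backwards from the end of the file. This is a sketch in Python 3, not Donn's actual code: the function name and parameters are invented here, and it uses os.open, os.lseek and os.read, which are the portable spellings of the posix calls. The leftover data before the first "From " in a block is appended to the block read before it, so a separator split across two reads is rejoined before it is searched for.

```python
import os


def last_messages(path, count, blocksize=8192):
    """Return the last `count` messages of an mbox file as bytes, oldest first."""
    fd = os.open(path, os.O_RDONLY)
    try:
        pos = os.fstat(fd).st_size  # everything from pos onwards has been read
        messages = []               # collected newest first
        fragment = b""              # leading partial message of the previous block
        while pos > 0 and len(messages) < count:
            start = max(0, pos - blocksize)
            os.lseek(fd, start, os.SEEK_SET)
            # Recycle the fragment onto the end of the block that precedes it.
            block = os.read(fd, pos - start) + fragment
            pos = start
            # Peel complete messages off the end of the block.
            while len(messages) < count:
                cut = block.rfind(b"\nFrom ")
                if cut < 0:
                    break
                messages.append(block[cut + 1:])
                block = block[:cut + 1]
            fragment = block
        # The very first message in the file has no newline in front of it.
        if pos == 0 and len(messages) < count and fragment.startswith(b"From "):
            messages.append(fragment)
        messages.reverse()
        return messages
    finally:
        os.close(fd)
```

The fragment is searched again every time it is joined to a new block, so a very long message spanning many blocks costs some repeated scanning; a larger blocksize reduces that.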

Donn Cave, do**@u.washington.edu

Jul 18 '05 #4

David M. Cooke

At some point, Donn Cave <do**@u.washington.edu> wrote:

> In article <40*********************@news.xs4all.nl>,
> Bastiaan Welmers <ha****@welmers.net> wrote:
> ...
>> I need to find messages in huge mbox files (50MB or more).
> ...
>> Especially because I often need messages at the end
>> of the MBOX file.
>> So I tried the following (scanning messages backwards
>> on found "From " lines with readline())
>
> readline() is not your friend here. I suggest that
> you read large blocks of data, like 8192 bytes for
> example, and search them iteratively. Like,
> next = block.find('\nFrom ', prev + 1)

Unless, of course, you read '\nFr', then 'om ' in the next block...

I can't think of a simple way around this (except for reading by
lines). Concatenating the last two blocks means having to keep track of
what you've seen in the last block. Maybe picking off the last line
from the last block (using block.rfind('\n')), and concatenating that
to the beginning of the next.
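
The "pick off the last line" idea can be sketched like this for a forward scan (Python 3; the function name and block size are illustrative only). The incomplete last line of each block is held back and prepended to the next block, so a "From " split across two reads is only examined once it is whole.

```python
def from_offsets(path, blocksize=8192):
    """Yield the byte offset of every line that starts with b'From '."""
    with open(path, "rb") as fp:
        tail = b""  # incomplete last line carried over from the previous block
        base = 0    # file offset of the first byte of `tail`
        while True:
            data = fp.read(blocksize)
            if not data:
                break
            block = tail + data
            # Everything up to and including the last newline is complete lines.
            cut = block.rfind(b"\n") + 1
            complete, tail = block[:cut], block[cut:]
            # `complete` always begins at the start of a line.
            if complete.startswith(b"From "):
                yield base
            i = complete.find(b"\nFrom ")
            while i >= 0:
                yield base + i + 1
                i = complete.find(b"\nFrom ", i + 1)
            base += cut
        # A final line without a trailing newline.
        if tail.startswith(b"From "):
            yield base
```

Only the carried line has to be remembered between reads, which keeps the bookkeeping to two variables.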

--
|>|\/|<
/--------------------------------------------------------------------------\
|David M. Cooke
|cookedm(at)physics(dot)mcmaster(dot)ca

Jul 18 '05 #5

Donn Cave

Quoth co**********@physics.mcmaster.ca (David M. Cooke):
| At some point, Donn Cave <do**@u.washington.edu> wrote:
|> In article <40*********************@news.xs4all.nl>,
|> Bastiaan Welmers <ha****@welmers.net> wrote:
|> ...
|>> I need find messages in huge mbox files (50MB or more).
|> ...
|>> Especially because I often need messages at the end
|>> of the MBOX file.
|>> So I tried the following (scanning messages backwards
|>> on found "From " lines with readline())
|>
|> readline() is not your friend here. I suggest that
|> you read large blocks of data, like 8192 bytes for
|> example, and search them iteratively. Like,
|> next = block.find('\nFrom ', prev + 1)
|
| Unless, of course, you read '\nFr', then 'om ' in the next block...
|
| I can't think of a simple way around this (except for reading by
| lines). Concating the last two together means having to keep track of
| what you've seen in the last block. Maybe picking off the last line
| from the last block (using line.rfind('\n')), and concatenating that
| to the beginning of the next.

I'm reading from the end backwards, so the fragment is block[:start].
Append that to the block before it, and each block always will end at
a message boundary. If you start in the middle, you have to deal with
an extra boundary problem. If reading forward from the beginning, it
would be about as simple.

If I have overlooked some obvious problem with this, it wouldn't be
the first time, but I think it's as simple as it could be. The only
inelegance to it is that you have to scan the fragment at least twice
(one extra time for each time it's added to a new block.)

Donn Cave, do**@drizzle.com

Jul 18 '05 #6

Miki Tebeka

Hello Bastiaan,

> I need to find messages in huge mbox files (50MB or more).
> ...
> Anyone who has a better idea?

I find that sometimes using the little Unix utilities (which are
available for M$ as well) gives very good performance.

--- last.py ---
#!/usr/bin/env python
from os import popen
from sys import argv

# Find the last "From " separator line (mbox messages start with a
# "From " line, which comes before the "From:" header)
last = popen("grep -n '^From ' %s | tail -1" % argv[1]).read()
last = int(last.split(":")[0])
# Find total number of lines
size = popen("wc -l %s" % argv[1]).read()
size = int(size.split()[0].strip())
# Print the message, including its "From " line
print popen("tail -n %d %s" % (size - last + 1, argv[1])).read()
--- last.py ---
Took less than 1 sec on my computer on an 11MB mailbox.

HTH.
Miki

Jul 18 '05 #7

Cameron Laird

In article <4f**************************@posting.google.com >,
Miki Tebeka <mi*********@zoran.com> wrote:

Hell Bastiaan,
I need find messages in huge mbox files (50MB or more).
...
Anyone who has a better idea?

I find that sometime using the unix little utilties (which are
available for M$ as well) gives very good performance.

--- last.py ---
#!/usr/bin/env python
from os import popen
from sys import argv

# Find last "From:" line
last = popen("grep -n 'From:' %s | tail -1" % argv[1]).read()
last = int(last.split(":")[0])
# Find total number of lines
size = popen("wc -l %s" % argv[1]).read()
size = int(size.split()[0].strip())
# Print the message
print popen("tail -%d %s" % (size - last, argv[1])).read()
--- last.py ---
Tool less than 1sec on my computer on a 11MB mailbox.

Jul 18 '05 #8

Erno Kuusela

Bastiaan Welmers <ha****@welmers.net> writes:

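> Especially because I often need messages at the end
> of the MBOX file.
> So I tried the following (scanning messages backwards
> on found "From " lines with readline())
>
> i = 0
> j = 0
> while 1:
>     i += 1
>     fp.seek(-i, 2)  # whence=2: the offset is relative to the end of the file
>     line = fp.readline()
>     if not line:
>         break
>     if line[:5] == 'From ':
>         j += 1
>         if j == total_messages - message_number_needed:
>             # readline() left the file position after the "From " line,
>             # so step back to the start of that line
>             archive.seekp = fp.tell() - len(line)
>             message = archive.next()
>             # message found
>             break
>
> But also seems to be slow and CPU consuming.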

something like this might work. the loop below scanned a 115MB mailbox
in about 1 second on a 1.2ghz k7. extracts the next-to-last message,
but you get the idea. if you don't want to read the file into cache,
you could adapt it to start with a smaller mmapped chunk from the end
of the file and enlarge it until you find what you want.

import os, re, mmap, sys
from cStringIO import StringIO
import email

fd = os.open(sys.argv[1], os.O_RDONLY)
size = os.fstat(fd).st_size
print size
buf = mmap.mmap(fd, size, access=mmap.ACCESS_READ)
message_offsets = []
for m in re.finditer(r'(?s)\n\nFrom', buf):
    message_offsets.append(m.start())

msgfp = StringIO(buf[message_offsets[-2] + 2:message_offsets[-1] + 2])
msg = email.message_from_file(msgfp)
print msg['to']
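
The "smaller mmapped chunk" variation mentioned above might look roughly like this in Python 3. The function name and the default window size are assumptions made for this sketch. It searches for "\n\nFrom " with a trailing space, slightly stricter than the pattern in the code above, and returns only the offset of the last message.

```python
import mmap
import os


def last_message_offset(path, window=1 << 20):
    """Return the byte offset of the last message's 'From ' line, or None."""
    gran = mmap.ALLOCATIONGRANULARITY
    with open(path, "rb") as fp:
        size = os.fstat(fp.fileno()).st_size
        if size == 0:
            return None  # an empty file cannot be mapped
        while True:
            # The mmap offset must be a multiple of the allocation granularity.
            start = max(0, size - window)
            start -= start % gran
            buf = mmap.mmap(fp.fileno(), size - start,
                            access=mmap.ACCESS_READ, offset=start)
            try:
                hit = buf.rfind(b"\n\nFrom ")
                if hit >= 0:
                    return start + hit + 2
                if start == 0:
                    # Whole file mapped: only the first message can be left.
                    return 0 if buf[:5] == b"From " else None
            finally:
                buf.close()
            window *= 2  # nothing found yet: map a larger chunk from the end
```

Because only the tail of the file is mapped at first, a message near the end is found without pulling the whole mailbox into the cache.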

-- erno

Jul 18 '05 #9

Bastiaan Welmers

Miki Tebeka wrote:

> Hello Bastiaan,
>> I need to find messages in huge mbox files (50MB or more).
>> ...
>> Anyone who has a better idea?
>
> I find that sometimes using the little Unix utilities (which are
> available for M$ as well) gives very good performance.

Sounds like a very good idea. Thanks.

/Bastiaan

Jul 18 '05 #10

Bastiaan Welmers

Miklós wrote:

> What about putting it into a database like MySQL? <pyWink>

Too much work to achieve this. It's just a Mailman archive mbox
that has to be opened, so I would have to rewrite
the pipermail archiver.

/Bastiaan

Jul 18 '05 #11

Bastiaan Welmers

Diez B. Roggisch wrote:

>> Anyone who has a better idea?
>
> AFAIK MUAs usually use an mbox.index file for faster access. The index is
> computed once, and updated whenever a new message is added. You could
> create this index quite easily yourself by looping over the mbox and
> pickling a list of tell'ed positions. If you also store the creation date
> of the index and the file size of the mbox file, you should be able to
> create a function that will update the index whenever the underlying mbox
> has changed. Another approach would be to perform index creation on a
> regular basis using cron.

Also a good idea. It's a Mailman archive, so I would have
to hack Mailman to create an index file beside the
mbox file.

/Bastiaan

Jul 18 '05 #12
