Bytes IT Community

python: ascii read

Hello,

I tried to read in some large ASCII files (200MB-2GB) in Python using
scipy.io.read_array, but it did not work as I expected. The whole idea
was to find a fast Python routine to read in arbitrary ASCII files, to
replace Yorick (which I use right now and which is really fast, but not
as general as Python). The problem with scipy.io.read_array was that it
is really slow, returns errors when trying to process large files, and
also changes (truncates) the files (after scipy.io.read_array processed
a 2GB file, its size was only 64MB).

Can someone give me a hint on how to use Python to do this job correctly
and fast? (Maybe with another read-in routine.)

Thanks.

Greetings,
Sebastian
Jul 18 '05 #1
11 Replies


Sebastian Krause <ca*****@gmx.net> wrote:
Hello,

I tried to read in some large ASCII files (200MB-2GB) in Python using
scipy.io.read_array, but it did not work as I expected. The whole idea
was to find a fast Python routine to read in arbitrary ASCII files, to
replace Yorick (which I use right now and which is really fast, but not
as general as Python). The problem with scipy.io.read_array was that it
is really slow, returns errors when trying to process large files, and
also changes (truncates) the files (after scipy.io.read_array processed
a 2GB file, its size was only 64MB).

Can someone give me a hint on how to use Python to do this job correctly
and fast? (Maybe with another read-in routine.)


If all you need is what you say -- read a huge amount of ASCII data into
memory -- it's hard to beat
data = open('thefile.txt').read()

mmap may in fact be preferable for many uses, but it doesn't actually
read (it _maps_ the file into memory instead).
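
For later readers, a minimal sketch contrasting the two approaches (the
sample file and its contents are made up for illustration):

```python
import mmap
import os
import tempfile

# Write a small sample file (a stand-in for the multi-GB case).
path = tempfile.mkstemp()[1]
with open(path, 'wb') as f:
    f.write(b'1 2 3\n4 5 6\n')

# Approach 1: slurp everything into one immutable byte string.
data = open(path, 'rb').read()

# Approach 2: map the file instead; no bulk read happens up front,
# the OS pages data in on first access.
with open(path, 'r+b') as f:
    mm = mmap.mmap(f.fileno(), 0)
    head = mm[:5]  # touching a slice faults in only those pages
    mm.close()

os.unlink(path)
```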
Alex
Jul 18 '05 #2

Sebastian Krause wrote:
Hello,

I tried to read in some large ASCII files (200MB-2GB) in Python using
scipy.io.read_array, but it did not work as I expected. The whole idea
was to find a fast Python routine to read in arbitrary ASCII files, to
replace Yorick (which I use right now and which is really fast, but not
as general as Python). The problem with scipy.io.read_array was that it
is really slow, returns errors when trying to process large files, and
also changes (truncates) the files (after scipy.io.read_array processed
a 2GB file, its size was only 64MB).

Can someone give me a hint on how to use Python to do this job correctly
and fast? (Maybe with another read-in routine.)


What kind of data is it? What operations do you want to perform on the
data? What platform are you on?

Some of the scipy.io.read_array behaviors that you see look like bugs.
We would greatly appreciate it if you were to send a complete bug report
to the scipy-dev mailing list. Thank you.

--
Robert Kern
rk***@ucsd.edu

"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
-- Richard Harter
Jul 18 '05 #3

I did not explicitly mention that the ASCII file should be read in as an
array of numbers (either integer or float).
Using open() and read() is very fast, but it only reads the data in as a
string, and it also does not work with large files.

Sebastian

Alex Martelli wrote:
Sebastian Krause <ca*****@gmx.net> wrote:

Hello,

I tried to read in some large ASCII files (200MB-2GB) in Python using
scipy.io.read_array, but it did not work as I expected. The whole idea
was to find a fast Python routine to read in arbitrary ASCII files, to
replace Yorick (which I use right now and which is really fast, but not
as general as Python). The problem with scipy.io.read_array was that it
is really slow, returns errors when trying to process large files, and
also changes (truncates) the files (after scipy.io.read_array processed
a 2GB file, its size was only 64MB).

Can someone give me a hint on how to use Python to do this job correctly
and fast? (Maybe with another read-in routine.)

If all you need is what you say -- read a huge amount of ASCII data into
memory -- it's hard to beat
data = open('thefile.txt').read()

mmap may in fact be preferable for many uses, but it doesn't actually
read (it _maps_ the file into memory instead).
Alex

Jul 18 '05 #4

The input data is a large ASCII file of astrophysical parameters
(integer and float) from gas dynamics calculations. They should be read
in as an array of integer and float numbers, not as a string (as open()
and read() do). The array is then used to make different plots from the
data and to do some (simple) operations: subtraction and division of
columns. I am using SciPy with Python 2.3.x under Linux (SuSE 9.1).

Sebastian

Robert Kern wrote:
Sebastian Krause wrote:
Hello,

I tried to read in some large ASCII files (200MB-2GB) in Python using
scipy.io.read_array, but it did not work as I expected. The whole idea
was to find a fast Python routine to read in arbitrary ASCII files, to
replace Yorick (which I use right now and which is really fast, but not
as general as Python). The problem with scipy.io.read_array was that it
is really slow, returns errors when trying to process large files, and
also changes (truncates) the files (after scipy.io.read_array processed
a 2GB file, its size was only 64MB).

Can someone give me a hint on how to use Python to do this job correctly
and fast? (Maybe with another read-in routine.)

What kind of data is it? What operations do you want to perform on the
data? What platform are you on?

Some of the scipy.io.read_array behaviors that you see look like bugs.
We would greatly appreciate it if you were to send a complete bug report
to the scipy-dev mailing list. Thank you.

Jul 18 '05 #5

Sebastian Krause <ca*****@gmx.net> wrote:
I did not explicitly mention that the ASCII file should be read in as an
array of numbers (either integer or float).
Ah, right, you didn't. So I was answering the literal question you
asked rather than the one you had in mind.
Using open() and read() is very fast, but it only reads the data in as a
string, and it also does not work with large files.


It works just fine with files as large as you have memory for (and mmap
works for files as large as you have _spare address space_ for, if your
OS is decently good at its job). But if what you want is not the job
that .read() and mmap do, the fact that they _do_ perform that job quite
well on large files is of course of no use to you.

Back to why scipy.io.read_array works so badly for you -- I don't know:
it's rather complicated code, as well as maybe old-ish (it wraps files
into class instances to be able to iterate on their lines) and very
general (lots of options regarding what the separators are, etc.). If your
needs are very specific (you know a lot about the format of those huge
files -- e.g. they're column-oriented, or only use whitespace separators
and \n line termination, or other such specifics) you might well be able
to do better -- likely even in Python, worst case in C. I assume you
need Numeric arrays, 2-d, specifically, as the result of reading your
files? Would you know in advance whether you're reading int or float
(it might be faster to have two separate functions)? Could you
pre-dimension the Numeric array and pass it in, or do you need it to
dimension itself dynamically based on file contents? The less
flexibility you need, the simpler and faster the reading can be...
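
To illustrate Alex's closing point: assuming whitespace-separated
columns, a known column count, and a single conversion function, a
specialized reader can stay very small. The function name and file
layout below are hypothetical; with Numeric you would convert the row
list to a 2-d array at the end.

```python
import os
import tempfile

def read_table(path, ncols, convert=float):
    """Read a whitespace-separated numeric table into a list of rows.

    Deliberately inflexible: assumes every non-blank line has exactly
    `ncols` fields, plain whitespace separators, and \\n line endings --
    giving up generality is what keeps it simple and fast.
    """
    rows = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields:  # skip blank lines
                continue
            if len(fields) != ncols:
                raise ValueError('expected %d columns, got %d'
                                 % (ncols, len(fields)))
            rows.append([convert(x) for x in fields])
    return rows

# Tiny demonstration on a made-up two-row file.
path = tempfile.mkstemp()[1]
with open(path, 'w') as f:
    f.write('1 2.5 3\n4 5 6\n')
table = read_table(path, 3)
os.unlink(path)
```

Passing `convert=int` gives the integer variant Alex alludes to, at the
cost of one function argument rather than a second code path.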
Alex
Jul 18 '05 #6

Sebastian Krause wrote:
The input data is a large ASCII file of astrophysical parameters
(integer and float) from gas dynamics calculations. They should be read
in as an array of integer and float numbers, not as a string (as open()
and read() do). The array is then used to make different plots from the
data and to do some (simple) operations: subtraction and division of
columns. I am using SciPy with Python 2.3.x under Linux (SuSE 9.1).
Well, one option is to use the "lines" argument to scipy.io.read_array
to only read in chunks at a time. It probably won't help speed any, but
hopefully it will be correct.
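
The same chunking idea works outside scipy as well: process a fixed
number of lines at a time so memory use stays bounded whatever the file
size. A sketch (function name, chunk size, and file contents are made
up):

```python
import os
import tempfile
from itertools import islice

def iter_chunks(path, lines_per_chunk):
    """Yield the file's rows as lists of floats, a fixed number of
    lines at a time, so memory use stays bounded."""
    with open(path) as f:
        while True:
            lines = list(islice(f, lines_per_chunk))
            if not lines:
                break
            yield [[float(x) for x in line.split()] for line in lines]

# Demonstration on a made-up five-line file, two lines per chunk.
path = tempfile.mkstemp()[1]
with open(path, 'w') as f:
    for i in range(5):
        f.write('%d %d\n' % (i, i * i))
chunks = list(iter_chunks(path, 2))
os.unlink(path)
```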
Sebastian


--
Robert Kern
rk***@ucsd.edu

"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
-- Richard Harter
Jul 18 '05 #7

al*****@yahoo.com (Alex Martelli) writes:
If your needs are very specific (you know a lot about the format of
those huge files -- e.g. they're column-oriented, or only use
whitespace separators and \n line termination, or other such
specifics) you might well be able to do better -- likely even in
Python, worst case in C. I assume you need Numeric arrays, 2-d,
specifically, as the result of reading your files? Would you know
in advance whether you're reading int or float (it might be faster
to have two separate functions)? Could you pre-dimension the
Numeric array and pass it in, or do you need it to dimension itself
dynamically based on file contents? The less flexibility you need,
the simpler and faster the reading can be...


The last time I wanted to be able to read large lumps of numerical
data from an ASCII file, I ended up using (f)lex, for performance
reasons. (Pure C _might_ have been faster still, of course, but it
would _quite certainly_ also have been pure C.)

This has caused minor irritation - the code has been in use through
several upgrades of Python, and it is considered polite to recompile
to match the current C API - but I'd probably do it the same way again
in the same situation.

Des
--
"[T]he structural trend in linguistics which took root with the
International Congresses of the twenties and early thirties [...] had
close and effective connections with phenomenology in its Husserlian
and Hegelian versions." -- Roman Jakobson
Jul 18 '05 #8

Alex Martelli said unto the world upon 2004-09-16 07:22:
Sebastian Krause <ca*****@gmx.net> wrote:

Hello,

I tried to read in some large ASCII files (200MB-2GB) in Python using
scipy.io.read_array, but it did not work as I expected. The whole idea
was to find a fast Python routine to read in arbitrary ASCII files, to
replace Yorick (which I use right now and which is really fast, but not
as general as Python). The problem with scipy.io.read_array was that it
is really slow, returns errors when trying to process large files, and
also changes (truncates) the files (after scipy.io.read_array processed
a 2GB file, its size was only 64MB).

Can someone give me a hint on how to use Python to do this job correctly
and fast? (Maybe with another read-in routine.)

If all you need is what you say -- read a huge amount of ASCII data into
memory -- it's hard to beat
data = open('thefile.txt').read()

mmap may in fact be preferable for many uses, but it doesn't actually
read (it _maps_ the file into memory instead).
Alex


Hi all,

[neophyte question warning]

I'd not been aware of mmap until this post. Looking at the Library
Reference and my trusty copy of Python in a Nutshell, I've gotten some
idea of the differences between using mmap and the .read() method on a
file object -- such as it returns a mutable object vs an immutable
string, constraint on slice assignment that len(oldslice) must be equal
to len(newslice), etc.

But I don't really feel I've a handle on the significance of saying it
maps the file into memory versus reading the file. The naive thought is
that since the data gets into memory, the file must be read. But this
makes me sure I'm missing a distinction in the terminology. Explanations
and pointers for what to read gratefully received.

And, since mmap behaves differently on different platforms, it may
matter that I'm mostly a win32 user looking to transition to Linux.

Best to all,

Brian vdB

Jul 18 '05 #9

On Thursday, 16 September 2004 at 17:56, Brian van den Broek wrote:
But I don't really feel I've a handle on the significance of saying it
maps the file into memory versus reading the file. The naive thought is
that since the data gets into memory, the file must be read. But this
makes me sure I'm missing a distinction in the terminology. Explanations
and pointers for what to read gratefully received.


read()ing a file into memory does what it says; it reads the binary data from
the disk all at once, and allocates main memory (as needed) to fit all the
data there. Memory mapping a file (or device or whatever) means that the
virtual memory architecture is involved. What happens here:

mmapping a file creates virtual memory pages (just like the virtual
memory which is put into your paging file), which are registered with
the MMU of the processor as being absent initially.

Now, when the program tries to access such a page (pages have some fixed
short length, like 4k for most Pentium-style computers), a (page) fault
is generated by the MMU, which invokes the operating system's handler
for page faults. Now that the operating system sees that a certain page
is accessed (from the page address it can deduce the offset in the file
that you're trying to access), it loads the corresponding page from
disk, puts it into memory at some position, and alters the page-table
entry to be present.

Future accesses to the page will take place immediately (without a page fault
taking place).

Changes in memory are written to disk once the page is flushed (meaning
that it gets removed from main memory because too few pages of real main
memory are available). When a page is forcefully flushed (not due to
closing the mmap), the operating system marks its page-table entry as
absent again, and the next time the program tries to access this
location, a page fault again takes place and the OS can reload the page
from disk.

For speed, the operating system allows you to mmap read-only, which
means that once a page is discarded, it does not need to be written back
to disk (which of course is faster). Some MMUs (IIRC not the
Pentium-class MMU) set a dirty bit on the page-table entry once the page
has been altered; this can also be used to decide whether the page needs
to be written back to disk after access.

So, basically what you get is load-on-demand file handling, which is
similar to what the paging file (virtual memory file) on win32 does for
memory in general. Actually, internally, the architecture for handling
mmapped files and virtual memory is the same, and you could think of the
swap file as an operating-system-mmapped file, from which programs can
allocate slices through some OS calls (well, actually through the normal
malloc/calloc calls).
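
The load-on-demand behaviour described above can be poked at directly
from Python's mmap module. A sketch on a made-up two-page file (the
page-fault machinery itself is invisible from Python, of course):

```python
import mmap
import os
import tempfile

# A made-up file spanning two 4k pages.
path = tempfile.mkstemp()[1]
with open(path, 'wb') as f:
    f.write(b'x' * 8192)

with open(path, 'r+b') as f:
    mm = mmap.mmap(f.fileno(), 0)  # nothing is read yet
    byte = mm[4096:4097]  # first touch of the second page: the MMU
                          # faults and the OS loads that page from disk
    mm[0:1] = b'y'        # dirties the first page...
    mm.flush()            # ...which is written back to the file here
    mm.close()

with open(path, 'rb') as f:
    data = f.read()
os.unlink(path)
```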

HTH!

Heiko.
Jul 18 '05 #10

Heiko Wundram said unto the world upon 2004-09-16 12:56:
On Thursday, 16 September 2004 at 17:56, Brian van den Broek wrote:
But I don't really feel I've a handle on the significance of saying it
maps the file into memory versus reading the file. The naive thought is
that since the data gets into memory, the file must be read. But this
makes me sure I'm missing a distinction in the terminology. Explanations
and pointers for what to read gratefully received.

read()ing a file into memory does what it says; it reads the binary data from
the disk all at once, and allocates main memory (as needed) to fit all the
data there. Memory mapping a file (or device or whatever) means that the
virtual memory architecture is involved. What happens here:


<Much helpful detail SNIPed>


HTH!

Heiko.


Thanks a lot for the detailed account, Heiko.

Best,

Brian vdB

Jul 18 '05 #11

Brian van den Broek wrote:
But I don't really feel I've a handle on the significance of saying it
maps the file into memory versus reading the file. The naive thought is
that since the data gets into memory, the file must be read. But this
makes me sure I'm missing a distinction in the terminology. Explanations
and pointers for what to read gratefully received.
Eventually the file is read, of course (or at least parts thereof). Mmap
is a feature of the virtual memory system in modern operating systems,
so you need a basic understanding of virtual memory in order to
understand mmap. All details can be found e.g. in Modern Operating
Systems by Andrew Tanenbaum.
http://mirrors.kernel.org/LDP/LDP/tlk/tlk.html does a good job of
explaining how Linux handles it, but I'll try to explain the general
basics here in short.

With virtual memory systems, the addresses that are used by application
programs don't refer directly to memory locations. Instead the addresses
are split in two parts; the first part is a page number, the second is
the offset of the memory location in the page. The system keeps a list
of all pages. When an address is referenced, the page is looked up in
that list (Pages are blocks of memory, typically 4-8 kB). There are two
possibilities:
- The page is already in memory. In that case, the list contains the
real physical address of the page in memory. That address is combined
with the offset to form the physical address of the memory location.
- The page is not in memory. The virtual memory system loads it in
memory and stores the physical address in the list. Processing then
continues as in the other case. Note that it may be necessary to remove
another page from memory in order to load a new one; in that case, the
other page is paged to disk if it is still needed so that it can be read
again later.

This behind-the-scenes translation and paging to and from disk is what
allows modern operating systems to use much more memory than what's
physically available in the system.

mmap creates an entry in the list that says the page is not in memory,
but tells the system what file to load it from: a range of addresses is
'mapped' to the data in the file. It also returns the logical address of
the data. When an address in the range is referenced, the virtual
memory system loads the appropriate page from disk (or possibly more
than one page at a time, for efficiency reasons) into memory and stores
its (their) location in the list. An application program can access it
exactly the same way as any other part of memory.
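
In Python terms, the mapping behaves like a mutable view of the file's
bytes -- including the slice-assignment length constraint Brian
mentioned. A sketch with a made-up file:

```python
import mmap
import os
import tempfile

path = tempfile.mkstemp()[1]
with open(path, 'wb') as f:
    f.write(b'hello world')

length_mismatch_rejected = False
with open(path, 'r+b') as f:
    mm = mmap.mmap(f.fileno(), 0)
    mm[0:5] = b'HELLO'  # in-place edit of the mapped bytes
    try:
        mm[0:5] = b'hi'  # a mapping cannot be resized this way
    except (IndexError, ValueError):
        length_mismatch_rejected = True
    mm.flush()
    mm.close()

with open(path, 'rb') as f:
    data = f.read()
os.unlink(path)
```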
And, since mmap behaves differently on different platforms, it may
matter that I'm mostly a win32 user looking to transition to Linux.


I think Python hides much of the differences between the Windows and
Unix implementations of mmap (Windows doesn't really have mmap; instead
you use CreateFileMapping and MapViewOfFile).

--
"Codito ergo sum"
Roel Schroeven
Jul 18 '05 #12

This discussion thread is closed.