python: ascii read

Hello,

I tried to read in some large ASCII files (200MB-2GB) in Python using
scipy.io.read_array, but it did not work as I expected. The whole idea
was to find a fast Python routine to read in arbitrary ASCII files, to
replace Yorick (which I use right now and which is really fast, but not
as general as Python). The problem with scipy.io.read_array was that it
is really slow, returns errors when trying to process large files, and it
also changes (cuts) the files (after scipy.io.read_array processed a 2GB
file its size was only 64MB).

Can someone give me a hint how to use Python to do this job correctly and
fast? (Maybe with another read-in routine.)

Thanks.

Greetings,
Sebastian
Jul 18 '05 #1
Sebastian Krause <ca*****@gmx.net> wrote:
Hello,

I tried to read in some large ASCII files (200MB-2GB) in Python using
scipy.io.read_array, but it did not work as I expected. The whole idea
was to find a fast Python routine to read in arbitrary ASCII files, to
replace Yorick (which I use right now and which is really fast, but not
as general as Python). The problem with scipy.io.read_array was that it
is really slow, returns errors when trying to process large files, and it
also changes (cuts) the files (after scipy.io.read_array processed a 2GB
file its size was only 64MB).

Can someone give me a hint how to use Python to do this job correctly and
fast? (Maybe with another read-in routine.)


If all you need is what you say -- read a huge amount of ASCII data into
memory -- it's hard to beat
data = open('thefile.txt').read()

mmap may in fact be preferable for many uses, but it doesn't actually
read (it _maps_ the file into memory instead).
Alex
Jul 18 '05 #2
Sebastian Krause wrote:
Hello,

I tried to read in some large ASCII files (200MB-2GB) in Python using
scipy.io.read_array, but it did not work as I expected. The whole idea
was to find a fast Python routine to read in arbitrary ASCII files, to
replace Yorick (which I use right now and which is really fast, but not
as general as Python). The problem with scipy.io.read_array was that it
is really slow, returns errors when trying to process large files, and it
also changes (cuts) the files (after scipy.io.read_array processed a 2GB
file its size was only 64MB).

Can someone give me a hint how to use Python to do this job correctly and
fast? (Maybe with another read-in routine.)


What kind of data is it? What operations do you want to perform on the
data? What platform are you on?

Some of the scipy.io.read_array behaviors that you see look like bugs. We
would greatly appreciate it if you were to send a complete bug report to
the scipy-dev mailing list. Thank you.

--
Robert Kern
rk***@ucsd.edu

"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
-- Richard Harter
Jul 18 '05 #3
I did not explicitly mention that the ASCII file should be read in as an
array of numbers (either integer or float).
Using open() and read() is very fast, but it only reads the data in as a
string, and it also does not work with large files.

Sebastian

Alex Martelli wrote:
Sebastian Krause <ca*****@gmx.net> wrote:

Hello,

I tried to read in some large ASCII files (200MB-2GB) in Python using
scipy.io.read_array, but it did not work as I expected. The whole idea
was to find a fast Python routine to read in arbitrary ASCII files, to
replace Yorick (which I use right now and which is really fast, but not
as general as Python). The problem with scipy.io.read_array was that it
is really slow, returns errors when trying to process large files, and it
also changes (cuts) the files (after scipy.io.read_array processed a 2GB
file its size was only 64MB).

Can someone give me a hint how to use Python to do this job correctly and
fast? (Maybe with another read-in routine.)

If all you need is what you say -- read a huge amount of ASCII data into
memory -- it's hard to beat
data = open('thefile.txt').read()

mmap may in fact be preferable for many uses, but it doesn't actually
read (it _maps_ the file into memory instead).
Alex

Jul 18 '05 #4
The input data is a large ASCII file of astrophysical parameters
(integer and float) from gas-dynamics calculations. They should be read in
as an array of integer and float numbers, not as a string (as open() and
read() do). Then the array is used to make different plots from the
data and to do some (simple) operations: subtraction and division of
columns. I am using Scipy with Python 2.3.x under Linux (SuSE 9.1).
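
Once the data is in a 2-d Numeric array, those column operations are
one-liners. A minimal sketch, assuming the array is named data and with
made-up column indices:

    # Element-wise column arithmetic; Numeric works on whole columns at once.
    diff  = data[:,1] - data[:,0]   # subtract column 0 from column 1
    ratio = data[:,1] / data[:,0]   # note: integer columns divide as integers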

Sebastian

Robert Kern wrote:
Sebastian Krause wrote:
Hello,

I tried to read in some large ASCII files (200MB-2GB) in Python using
scipy.io.read_array, but it did not work as I expected. The whole idea
was to find a fast Python routine to read in arbitrary ASCII files, to
replace Yorick (which I use right now and which is really fast, but not
as general as Python). The problem with scipy.io.read_array was that it
is really slow, returns errors when trying to process large files, and it
also changes (cuts) the files (after scipy.io.read_array processed a 2GB
file its size was only 64MB).

Can someone give me a hint how to use Python to do this job correctly
and fast? (Maybe with another read-in routine.)

What kind of data is it? What operations do you want to perform on the
data? What platform are you on?

Some of the scipy.io.read_array behaviors that you see look like bugs. We
would greatly appreciate it if you were to send a complete bug report to
the scipy-dev mailing list. Thank you.

Jul 18 '05 #5
Sebastian Krause <ca*****@gmx.net> wrote:
I did not explicitly mention that the ASCII file should be read in as an
array of numbers (either integer or float).
Ah, right, you didn't. So I was answering the literal question you
asked rather than the one you had in mind.
Using open() and read() is very fast, but it only reads the data in as a
string, and it also does not work with large files.


It works just fine with files as large as you have memory for (and mmap
works for files as large as you have _spare address space_ for, if your
OS is decently good at its job). But if what you want is not the job
that .read() and mmap do, the fact that they _do_ perform that job quite
well on large files is of course of no use to you.

Back to why scipy.io.read_array works so badly for you -- I don't know;
it's rather complicated code, as well as maybe old-ish (it wraps files into
class instances to be able to iterate on their lines) and very general
(lots of options regarding what the separators are, etc.). If your
needs are very specific (you know a lot about the format of those huge
files -- e.g. they're column-oriented, or only use whitespace separators
and \n line termination, or other such specifics) you might well be able
to do better -- likely even in Python, worst case in C. I assume you
need Numeric arrays, 2-d, specifically, as the result of reading your
files? Would you know in advance whether you're reading int or float
(it might be faster to have two separate functions)? Could you
pre-dimension the Numeric array and pass it in, or do you need it to
dimension itself dynamically based on file contents? The less
flexibility you need, the simpler and faster the reading can be...
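
To make that concrete, here is a minimal sketch of such a specialized
reader -- assuming whitespace-separated floats, \n line termination, a
fixed column count, and a 2-d Numeric result; the function name is made up:

    import Numeric

    def read_float_table(filename, ncols):
        # Specialized on purpose: whitespace-separated floats only,
        # every row exactly ncols wide, no error handling.
        values = []
        for line in open(filename):
            values.extend(map(float, line.split()))
        a = Numeric.array(values, Numeric.Float)
        a.shape = (len(values) // ncols, ncols)
        return a

A variant that pre-dimensions the array and fills it row by row would
avoid the intermediate list, at the cost of knowing the row count up front.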
Alex
Jul 18 '05 #6
Sebastian Krause wrote:
The input data is a large ASCII file of astrophysical parameters
(integer and float) from gas-dynamics calculations. They should be read in
as an array of integer and float numbers, not as a string (as open() and
read() do). Then the array is used to make different plots from the
data and to do some (simple) operations: subtraction and division of
columns. I am using Scipy with Python 2.3.x under Linux (SuSE 9.1).
Well, one option is to use the "lines" argument to scipy.io.read_array
to only read in chunks at a time. It probably won't help speed any, but
hopefully it will be correct.
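
Something along these lines, perhaps -- a hypothetical sketch, since the
exact form the lines argument takes varies between scipy versions, and
process() here is a made-up placeholder:

    import scipy.io

    chunk = 100000
    start = 0
    while True:
        # Hypothetical chunked read; check your scipy version's docstring
        # for the real semantics of the lines argument.
        a = scipy.io.read_array('thefile.txt', lines=(start, start + chunk))
        if len(a) == 0:
            break
        process(a)      # placeholder for the plotting/column arithmetic
        start += chunk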
Sebastian


--
Robert Kern
rk***@ucsd.edu

"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
-- Richard Harter
Jul 18 '05 #7
al*****@yahoo.com (Alex Martelli) writes:
If your needs are very specific (you know a lot about the format of
those huge files -- e.g. they're column-oriented, or only use
whitespace separators and \n line termination, or other such
specifics) you might well be able to do better -- likely even in
Python, worst case in C. I assume you need Numeric arrays, 2-d,
specifically, as the result of reading your files? Would you know
in advance whether you're reading int or float (it might be faster
to have two separate functions)? Could you pre-dimension the
Numeric array and pass it in, or do you need it to dimension itself
dynamically based on file contents? The less flexibility you need,
the simpler and faster the reading can be...


The last time I wanted to be able to read large lumps of numerical
data from an ASCII file, I ended up using (f)lex, for performance
reasons. (Pure C _might_ have been faster still, of course, but it
would _quite certainly_ also have been pure C.)

This has caused minor irritation - the code has been in use through
several upgrades of Python, and it is considered polite to recompile
to match the current C API - but I'd probably do it the same way again
in the same situation.

Des
--
"[T]he structural trend in linguistics which took root with the
International Congresses of the twenties and early thirties [...] had
close and effective connections with phenomenology in its Husserlian
and Hegelian versions." -- Roman Jakobson
Jul 18 '05 #8
Alex Martelli said unto the world upon 2004-09-16 07:22:
Sebastian Krause <ca*****@gmx.net> wrote:

Hello,

I tried to read in some large ASCII files (200MB-2GB) in Python using
scipy.io.read_array, but it did not work as I expected. The whole idea
was to find a fast Python routine to read in arbitrary ASCII files, to
replace Yorick (which I use right now and which is really fast, but not
as general as Python). The problem with scipy.io.read_array was that it
is really slow, returns errors when trying to process large files, and it
also changes (cuts) the files (after scipy.io.read_array processed a 2GB
file its size was only 64MB).

Can someone give me a hint how to use Python to do this job correctly and
fast? (Maybe with another read-in routine.)

If all you need is what you say -- read a huge amount of ASCII data into
memory -- it's hard to beat
data = open('thefile.txt').read()

mmap may in fact be preferable for many uses, but it doesn't actually
read (it _maps_ the file into memory instead).
Alex


Hi all,

[neophyte question warning]

I'd not been aware of mmap until this post. Looking at the Library
Reference and my trusty copy of Python in a Nutshell, I've gotten some
idea of the differences between using mmap and the .read() method on a
file object -- such as that it returns a mutable object vs. an immutable
string, the constraint on slice assignment that len(oldslice) must equal
len(newslice), etc.

But I don't really feel I've a handle on the significance of saying it
maps the file into memory versus reading the file. The naive thought is
that since the data gets into memory, the file must be read. But this
makes me sure I'm missing a distinction in the terminology. Explanations
and pointers for what to read gratefully received.

And since mmap behaves differently on different platforms: I'm mostly a
win32 user looking to transition to Linux.

Best to all,

Brian vdB

Jul 18 '05 #9
On Thursday, 16 September 2004 at 17:56, Brian van den Broek wrote:
But I don't really feel I've a handle on the significance of saying it
maps the file into memory versus reading the file. The naive thought is
that since the data gets into memory, the file must be read. But this
makes me sure I'm missing a distinction in the terminology. Explanations
and pointers for what to read gratefully received.


read()ing a file into memory does what it says; it reads the binary data from
the disk all at once, and allocates main memory (as needed) to fit all the
data there. Memory mapping a file (or device or whatever) means that the
virtual memory architecture is involved. What happens here:

mmapping a file creates virtual memory pages (just like virtual memory which
is put into your paging file), which are registered with the MMU of the
processor as being absent initially.

Now, when the program tries to access the memory page (pages have some
fixed short length, like 4k for most Pentium-style computers), a (page)
fault is generated by the MMU, which invokes the operating system's
handler for page faults. Now that the operating system sees that a
certain page is being accessed (from the page address it can deduce the
offset in the file that you're trying to access), it loads the
corresponding page from disk, puts it into memory at some position, and
marks the page-table entry as present.

Future accesses to the page will take place immediately (without a page fault
taking place).

Changes in memory are written to disk once the page is flushed (meaning
that it gets removed from main memory because too few pages of real main
memory are available). When a page is forcefully flushed (not due to
closing the mmap), the operating system marks the page-table entry as
absent again, and the next time the program tries to access this location,
a page fault again takes place, and the OS can load the page from disk.

For speed, the operating system allows you to mmap read-only, which means
that once a page is discarded, it does not need to be written back to disk
(which of course is faster). Most MMUs (including the Pentium-class MMU)
set a dirty bit on the page-table entry once the page has been altered;
this can also be used to control whether the page needs to be written back
to disk after access.

So, basically what you get is load-on-demand file handling, which is
similar to what the paging file (virtual memory file) on win32 does for
ordinary memory. Internally, the architecture for handling mmapped files
and virtual memory is the same, and you could think of the swap file as an
operating-system-mmapped file, from which programs can allocate slices
through some OS calls (well, actually through the normal malloc/calloc
calls).
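
In Python terms, a minimal sketch (assuming a file named thefile.txt):

    import mmap, os

    f = open('thefile.txt', 'rb')
    m = mmap.mmap(f.fileno(), os.path.getsize('thefile.txt'),
                 access=mmap.ACCESS_READ)
    # Nothing has been read from disk yet; this slice faults in only
    # the page(s) that cover the first kilobyte.
    first_kb = m[:1024]
    m.close()
    f.close()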

HTH!

Heiko.
Jul 18 '05 #10
