Fast File Input

Hi, everyone,

I'm a relative novice to Python and am trying to reduce the processing time
for a very large text file that I am reading into my Python script. I'm
currently reading each line one at a time (readline()), stripping the
leading and trailing whitespace (strip()) and splitting its delimited data
(split()). For my large input files, this text processing is taking many
hours.

If I were working in C, I'd consider using a lower level I/O library,
minimizing text processing, and reducing memory redundancy. However, I have
no idea at all what to do to optimize this process in Python.

Can anyone offer some suggestions?

Thanks,
Scott

--
Remove ".nospam" from the user ID in my e-mail to reply via e-mail.
Jul 18 '05 #1
Scott Brady Drummonds wrote:
> Hi, everyone,
>
> I'm a relative novice to Python and am trying to reduce the processing time
> for a very large text file that I am reading into my Python script. I'm
> currently reading each line one at a time (readline()), stripping the
> leading and trailing whitespace (strip()) and splitting its delimited data
> (split()). For my large input files, this text processing is taking many
> hours.
>
> If I were working in C, I'd consider using a lower level I/O library,
> minimizing text processing, and reducing memory redundancy. However, I have
> no idea at all what to do to optimize this process in Python.
>
> Can anyone offer some suggestions?


This actually improved a lot with python version 2
but is still quite slow as you can see here:
http://www.pixelbeat.org/readline/
There are a few notes within the python script there.

Pádraig.

Jul 18 '05 #2

"Scott Brady Drummonds" <sc**********************@intel.com> wrote in
message news:c1**********@news01.intel.com...
> Hi, everyone,
>
> I'm a relative novice to Python and am trying to reduce the processing time
> for a very large text file that I am reading into my Python script. I'm
> currently reading each line one at a time (readline()), stripping the
> leading and trailing whitespace (strip()) and splitting its delimited data
> (split()). For my large input files, this text processing is taking many
> hours.


for line in file('somefile.txt'): ...
will be faster because the file iterator reads a much larger block with
each disk access.

Do you really need strip()? Clipping \n off the last item after split()
*might* be faster.
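
A minimal sketch of that suggestion (not from the original post), written for current Python 3 rather than the Python 2 of the thread. It assumes a comma delimiter; the helper name split_lines and the sample data are hypothetical. Note that dropping strip() also leaves any leading whitespace on the first field untouched, so it only fits data that has none.

```python
import io


def split_lines(lines, delimiter=","):
    """Yield one list of fields per line without calling strip() on the line."""
    for line in lines:
        fields = line.split(delimiter)
        # Only the last field can carry the line's trailing newline.
        fields[-1] = fields[-1].rstrip("\n")
        yield fields


# Stand-in for an open file; a real run would pass open("somefile.txt").
sample = io.StringIO("a,b,c\nd,e,f\n")
rows = list(split_lines(sample))
print(rows)
```

Whether this actually beats strip() depends on the data and the Python version, so it is worth timing on a slice of the real file.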

Terry J. Reedy


Jul 18 '05 #3
Scott Brady Drummonds wrote on Wed, 25 Feb 2004 08:35:43 -0800:
> Hi, everyone,
>
> I'm a relative novice to Python and am trying to reduce the processing time
> for a very large text file that I am reading into my Python script. I'm
> currently reading each line one at a time (readline()), stripping the
> leading and trailing whitespace (strip()) and splitting its delimited data
> (split()). For my large input files, this text processing is taking many
> hours.


An easy improvement is using "for line in sometextfile:" instead of
repetitive readline(). Not sure how much time this will save you (depends
on what you're doing after reading), but it can make a difference at
virtually no cost. You might also want to try rstrip() instead of strip()
(not sure if it's faster, but perhaps it is).
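
One way to check the strip() versus rstrip() question is to time both with the standard timeit module. This is only a rough sketch in Python 3; the sample line and the repeat count are made up, and the numbers will differ from machine to machine. Keep in mind that rstrip() leaves leading whitespace in place, so the two calls are not interchangeable for every input.

```python
import timeit

# Hypothetical sample line with whitespace on both ends.
line = "  alpha beta gamma  \n"

# Time each call over the same number of repetitions.
t_strip = timeit.timeit(lambda: line.strip(), number=100000)
t_rstrip = timeit.timeit(lambda: line.rstrip(), number=100000)

print("strip():  %.4f s" % t_strip)
print("rstrip(): %.4f s" % t_rstrip)

# The results differ: rstrip() keeps the two leading spaces.
stripped = line.strip()
right_stripped = line.rstrip()
```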

--
Yours,

Andrei

=====
Real contact info (decode with rot13):
ce******@jnanqb b.ay. Fcnz-serr! Cyrnfr qb abg hfr va choyvp cbfgf. V ernq
gur yvfg, fb gurer'f ab arrq gb PP.
Jul 18 '05 #4
"Scott Brady Drummonds" <sc**********************@intel.com> wrote in
message news:c1**********@news01.intel.com...
> Hi, everyone,
>
> I'm a relative novice to Python and am trying to reduce the processing time
> for a very large text file that I am reading into my Python script. I'm
> currently reading each line one at a time (readline()), stripping the
> leading and trailing whitespace (strip()) and splitting its delimited data
> (split()). For my large input files, this text processing is taking many
> hours.


If you mean delimited in the CSV sense, then I believe the csv module is
optimised for this. It is included in 2.3, IIRC.
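
For illustration, a small sketch of reading delimited data with the standard csv module, in Python 3 syntax. The function name read_rows and the in-memory sample are hypothetical; with a real file, the csv documentation recommends opening it with newline="".

```python
import csv
import io


def read_rows(stream, delimiter=","):
    """Yield each record as a list of strings, one record at a time."""
    # csv.reader handles quoting and embedded delimiters, which a
    # plain split() does not.
    for row in csv.reader(stream, delimiter=delimiter):
        yield row


# Stand-in for: open("somefile.txt", newline="")
sample = io.StringIO('a,b,"c,d"\n1,2,3\n')
rows = list(read_rows(sample))
print(rows)
```

Yielding rows one by one avoids holding the whole file in memory, which matters for very large inputs.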

Eddie

Jul 18 '05 #5

Pádraig> This actually improved a lot with python version 2
Pádraig> but is still quite slow as you can see here:
Pádraig> http://www.pixelbeat.org/readline/
Pádraig> There are a few notes within the python script there.

Your page doesn't mention precisely which version of Python 2 you used. I
suspect a rather old one (2.0? 2.1?) because of the style of loop you used
to read from sys.stdin. Eliminating comments, your python2 script was:

import sys

while 1:
    line = sys.stdin.readline()
    if line == '':
        break
    try:
        print line,
    except:
        pass

Running that using the CVS version of Python feeding it my machine's
dictionary as input I got this time(1) output (fastest real time of four runs):

% time python readltst.py < /usr/share/dict/words > /dev/null

real 0m1.384s
user 0m1.290s
sys 0m0.060s

Rewriting it to eliminate the try/except statement (why did you have that
there?) got it to:

% time python readltst.py < /usr/share/dict/words > /dev/null

real 0m1.373s
user 0m1.270s
sys 0m0.040s

Further rewriting it as the more modern:

import sys

for line in sys.stdin:
    print line,

yielded:

% time python readltst2.py < /usr/share/dict/words > /dev/null

real 0m0.660s
user 0m0.600s
sys 0m0.060s

My guess is that your python2 times are probably at least a factor of 2 too
large if you accept that people will use a recent version of Python in which
file objects are iterators.
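
For readers on Python 3, a rough equivalent of the iterator loop above might look like the sketch below. The helper name copy_lines is hypothetical, and the in-memory streams only stand in for sys.stdin and sys.stdout so the example runs on its own; the timings quoted in this post were measured with the Python 2 scripts, not with this code.

```python
import io


def copy_lines(src, dst):
    """Copy src to dst line by line using the file iterator; return the line count."""
    count = 0
    for line in src:  # file objects are their own iterators
        dst.write(line)
        count += 1
    return count


# Stand-ins for sys.stdin and sys.stdout.
source = io.StringIO("alpha\nbeta\ngamma\n")
target = io.StringIO()
copied = copy_lines(source, target)
print(copied)
```

To reproduce the kind of test shown earlier, the same function could be called as copy_lines(sys.stdin, sys.stdout) after importing sys.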

Skip
Jul 18 '05 #6
