What happens when the file being read is too big for all lines to be read with "readlines()"?

Hi,
Sorry if this is too simple a question, but I googled and also checked my
reference book, O'Reilly's Learning Python, and did not find a satisfactory
answer.

When I use readlines, what happens if the number of lines is huge? I have
a very big file (4GB) that I want to read in, but I'm sure there must be
some limitation to readlines, and I'd like to know how Python handles it.
I am using it like this:

    slines = infile.readlines()  # reads all lines into a list of strings called "slines"

Thanks to anyone who knows the answer to this one.
Nov 22 '05 #1
Newer versions of Python should use "for x in fh:", according to the docs:

    fh = open("your file")
    for x in fh: print x

which only reads one line at a time.
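
As a rough illustration in current Python 3 syntax (the thread itself uses Python 2), iterating over the file object holds only one line at a time, so the file size is not limited by memory the way a full readlines() list is. The helper name count_lines and the temporary sample file are made up for this sketch.

```python
import os
import tempfile


def count_lines(path):
    """Count lines by iterating over the file; only one line is held at a time."""
    total = 0
    with open(path, "r") as fh:
        for _line in fh:
            total += 1
    return total


# Build a small sample file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("alpha\nbeta\ngamma\n")
    sample_path = tmp.name

line_count = count_lines(sample_path)
os.remove(sample_path)
print(line_count)
```

The same loop should work unchanged on a multi-gigabyte file, since nothing but the running total is kept between iterations.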

Ross Reyes wrote:
HI -
Sorry for maybe a too simple a question but I googled and also checked my
reference O'Reilly Learning Python
book and I did not find a satisfactory answer.

When I use readlines, what happens if the number of lines is huge? I have
a very big file (4GB) I want to
read in, but I'm sure there must be some limitation to readlines and I'd
like to know how it is handled by python.
I am using it like this:
slines = infile.readlines() # reads all lines into a list of strings called
"slines"

Thanks for anyone who knows the answer to this one.


Nov 22 '05 #2
Ross Reyes <ro*******@rcn.com> wrote:
Sorry for maybe a too simple a question but I googled and also
checked my reference O'Reilly Learning Python book and I did not
find a satisfactory answer.
The Python documentation is online, and it's good to get familiar with
it:

<URL:http://docs.python.org/>

It's even possible to tell Google to search only that site with
"site:docs.python.org" as a search term.
When I use readlines, what happens if the number of lines is huge?
I have a very big file (4GB) I want to read in, but I'm sure there
must be some limitation to readlines and I'd like to know how it is
handled by python.


The documentation on methods of the 'file' type describes the
'readlines' method, and addresses this concern.

<URL:http://docs.python.org/lib/bltin-file-objects.html#l2h-244>

--
\ "If you're not part of the solution, you're part of the |
`\ precipitate." -- Steven Wright |
_o__) |
Ben Finney
Nov 22 '05 #4
Just try it, it is not that hard ... ;-)

/Jean Brouwers

PS) Here is what happens on Linux:

    $ limit vmemory 10000
    $ python
    ...
    >>> s = file(<bigfile>).readlines()
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    MemoryError


Nov 22 '05 #6
bo****@gmail.com wrote:
newer python should use "for x in fh:", according to the doc :

fh = open("your file")
for x in fh: print x

which would only read one line at a time.

I have some other questions:

When will "fh" be closed?

And what should I do if I want to explicitly close the file immediately
after reading all the data I want?

Ross Reyes wrote:

HI -
Sorry for maybe a too simple a question but I googled and also checked my
reference O'Reilly Learning Python
book and I did not find a satisfactory answer.

When I use readlines, what happens if the number of lines is huge? I have
a very big file (4GB) I want to
read in, but I'm sure there must be some limitation to readlines and I'd
like to know how it is handled by python.
I am using it like this:
slines = infile.readlines() # reads all lines into a list of strings called
"slines"

Thanks for anyone who knows the answer to this one.



Nov 22 '05 #8
On Sun, 20 Nov 2005 11:05:53 +0800, Xiao Jianfeng wrote:
I have some other questions:

When will "fh" be closed?

When all references to the file are no longer in scope:

    def handle_file(name):
        fp = file(name, "r")
        # reference to file now in scope
        do_stuff(fp)
        return fp

    f = handle_file("myfile.txt")
    # reference to file is now in scope
    f = None
    # reference to file is no longer in scope

At this point, Python *may* close the file. CPython currently closes the
file as soon as all references are out of scope. JPython does not -- it
will close the file eventually, but you can't guarantee when.

And what should I do if I want to explicitly close the file immediately
after reading all the data I want?

That is the best practice.

    f.close()
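
A small sketch of closing explicitly, written in current Python 3 syntax. The try/finally form makes sure close() runs even if reading fails. The with form was added in later versions of Python (2.5 onward) and did not exist when this thread was written. The function names and the sample file are invented for the example.

```python
import os
import tempfile


def read_first_line(path):
    """Read one line and close the file explicitly, even if reading fails."""
    fh = open(path, "r")
    try:
        return fh.readline()
    finally:
        fh.close()  # runs whether or not readline() raised


def read_first_line_with(path):
    """Same idea using a with block, which closes the file on exit."""
    with open(path, "r") as fh:
        first = fh.readline()
    return first, fh.closed


with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first\nsecond\n")
    demo_path = tmp.name

first_a = read_first_line(demo_path)
first_b, was_closed = read_first_line_with(demo_path)
os.remove(demo_path)
```

Either form avoids depending on when the interpreter happens to collect the file object.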
--
Steven.

Nov 22 '05 #10
Steven D'Aprano wrote:
On Sun, 20 Nov 2005 11:05:53 +0800, Xiao Jianfeng wrote:
I have some other questions:

when "fh" will be closed?


When all references to the file are no longer in scope:

    def handle_file(name):
        fp = file(name, "r")
        # reference to file now in scope
        do_stuff(fp)
        return fp

    f = handle_file("myfile.txt")
    # reference to file is now in scope
    f = None
    # reference to file is no longer in scope

At this point, Python *may* close the file. CPython currently closes the
file as soon as all references are out of scope. JPython does not -- it
will close the file eventually, but you can't guarantee when.
And what shoud I do if I want to explicitly close the file immediately
after reading all data I want?


That is the best practice.

f.close()

Let me first describe the problem I came across last night.

I need to read a file (which may be small or very big) and check it line
by line to find a specific token; the data on the line after the token is
what I want.

If I use readlines(), it will be a problem when the file is too big.

If I use "for line in OPENED_FILE:" to read one line at a time, how can I
get the next line when I find the specific token?
And I think reading one line at a time is less efficient, am I right?

Regards,

xiaojf

Nov 22 '05 #12
Xiao Jianfeng wrote:
Steven D'Aprano wrote:

On Sun, 20 Nov 2005 11:05:53 +0800, Xiao Jianfeng wrote:

I have some other questions:

when "fh" will be closed?


When all references to the file are no longer in scope:

    def handle_file(name):
        fp = file(name, "r")
        # reference to file now in scope
        do_stuff(fp)
        return fp

    f = handle_file("myfile.txt")
    # reference to file is now in scope
    f = None
    # reference to file is no longer in scope

At this point, Python *may* close the file. CPython currently closes the
file as soon as all references are out of scope. JPython does not -- it
will close the file eventually, but you can't guarantee when.

And what shoud I do if I want to explicitly close the file immediately
after reading all data I want?


That is the best practice.

f.close()


Let me introduce my problem I came across last night first.

I need to read a file(which may be small or very big) and to check line
by line
to find a specific token, then the data on the next line will be what I
want.

If I use readlines(), it will be a problem when the file is too big.

If I use "for line in OPENED_FILE:" to read one line each time, how can
I get
the next line when I find the specific token?
And I think reading one line each time is less efficient, am I right?

Not necessarily. Try this:

    import sys

    f = file("filename.txt")
    for line in f:
        if token in line:  # or whatever you need to identify it
            break
    else:
        # the loop finished without hitting break
        sys.exit("File does not contain token")
    line = f.next()

Then line will be the one you want. Since this will use code written in
C to do the processing you will probably be pleasantly surprised by its
speed. Only if this isn't fast enough should you consider anything more
complicated.
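
The same for/next idea can be sketched in current Python 3 syntax, where the method call f.next() is spelled next(f). The function name line_after_token is invented here, and io.StringIO stands in for a real file so the example is self-contained.

```python
import io


def line_after_token(lines, token):
    """Return the line following the first line that contains token, or None."""
    it = iter(lines)  # one shared iterator for the loop and for next()
    for line in it:
        if token in line:
            return next(it, None)  # None if the token was on the last line
    return None


sample = io.StringIO("header\nTOKEN\nwanted data\ntrailer\n")
found = line_after_token(sample, "TOKEN")
missing = line_after_token(io.StringIO("no match here\n"), "TOKEN")
```

Returning None instead of calling sys.exit() is a design choice made for this sketch so the caller decides how to handle a missing token.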

Premature optimizations can waste huge amounts of programming time.
Don't do it. First try measuring a solution that works!

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006 www.python.org/pycon/

Nov 22 '05 #14
On Sun, 20 Nov 2005 12:28:07 +0800, Xiao Jianfeng wrote:
Let me introduce my problem I came across last night first.

I need to read a file(which may be small or very big) and to check line
by line
to find a specific token, then the data on the next line will be what I
want.

If I use readlines(), it will be a problem when the file is too big.

If I use "for line in OPENED_FILE:" to read one line each time, how can
I get
the next line when I find the specific token?

Here is one solution using a flag:

    done = False
    for line in file("myfile", "r"):
        if done:
            break
        done = line == "token\n"  # note the newline
    # we expect Python to close the file when we exit the loop
    if done:
        DoSomethingWith(line)  # the line *after* the one with the token
    else:
        print "Token not found!"

Here is another solution, without using a flag:

    def get_line(filename, token):
        """Returns the next line following a token, or None if not found.
        Leading and trailing whitespace is ignored when looking for
        the token.
        """
        fp = file(filename, "r")
        for line in fp:
            if line.strip() == token:
                result = fp.readline()  # read the next line only
                break
        else:
            # runs only if we didn't break
            print "Token not found"
            result = None
        fp.close()
        return result

Here is a third solution that raises an exception instead of printing an
error message:

    def get_line(filename, token):
        fp = file(filename, "r")
        for line in fp:
            if line.strip() == token:
                break
        else:
            raise ValueError("Token not found")
        return fp.readline()
        # we rely on Python to close the file when we are done

And I think reading one line each time is less efficient, am I right?


Less efficient than what? Spending hours or days writing more complex code
that only saves you a few seconds, or even runs slower?

I believe Python will take advantage of your file system's buffering
capabilities. Try it and see, you'll be surprised how fast it runs. If you
try it and it is too slow, then come back and we'll see what can be done
to speed it up. But don't try to speed it up before you know if it is fast
enough.
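
One way to "try it and see" is a small timing harness like the sketch below, in current Python 3 syntax. The function names, the generated 100,000-line file, and the use of time.perf_counter are choices made for this illustration. The timings it prints depend entirely on the machine and file, so it makes no claim about which approach is faster.

```python
import os
import tempfile
import time


def count_by_iteration(path):
    """Count lines while holding one line at a time."""
    with open(path) as fh:
        return sum(1 for _ in fh)


def count_by_readlines(path):
    """Count lines after loading the whole file into a list."""
    with open(path) as fh:
        return len(fh.readlines())


def timed(func, path):
    """Return func(path) together with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = func(path)
    return result, time.perf_counter() - start


with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    for i in range(100000):
        tmp.write("line %d\n" % i)
    bench_path = tmp.name

n_iter, t_iter = timed(count_by_iteration, bench_path)
n_all, t_all = timed(count_by_readlines, bench_path)
os.remove(bench_path)
print("iteration: %d lines in %.4fs" % (n_iter, t_iter))
print("readlines: %d lines in %.4fs" % (n_all, t_all))
```

For a fair comparison, run each variant several times on the real data, since the operating system's file cache can favor whichever runs second.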
--
Steven.

Nov 22 '05 #16
On Sun, 20 Nov 2005 16:10:58 +1100, Steven D'Aprano wrote:
    def get_line(filename, token):
        """Returns the next line following a token, or None if not found.
        Leading and trailing whitespace is ignored when looking for
        the token.
        """
        fp = file(filename, "r")
        for line in fp:
            if line.strip() == token:
                result = fp.readline()  # read the next line only
                break
        else:
            # runs only if we didn't break
            print "Token not found"
            result = None
        fp.close()
        return result


Correction: checking the Library Reference, I find that this is
wrong. The reason is that file objects implement their own read-ahead
buffer, and mixing calls to next() and readline() may not work right.

See http://docs.python.org/lib/bltin-file-objects.html

Replace the fp.readline() with fp.next() and all should be good.
--
Steven.

Nov 22 '05 #18
Steve Holden wrote:
Xiao Jianfeng wrote:

Steven D'Aprano wrote:

On Sun, 20 Nov 2005 11:05:53 +0800, Xiao Jianfeng wrote:


I have some other questions:

when "fh" will be closed?


When all references to the file are no longer in scope:

    def handle_file(name):
        fp = file(name, "r")
        # reference to file now in scope
        do_stuff(fp)
        return fp

    f = handle_file("myfile.txt")
    # reference to file is now in scope
    f = None
    # reference to file is no longer in scope

At this point, Python *may* close the file. CPython currently closes the
file as soon as all references are out of scope. JPython does not -- it
will close the file eventually, but you can't guarantee when.


And what shoud I do if I want to explicitly close the file immediately
after reading all data I want?


That is the best practice.

f.close()

Let me introduce my problem I came across last night first.

I need to read a file(which may be small or very big) and to check line
by line
to find a specific token, then the data on the next line will be what I
want.

If I use readlines(), it will be a problem when the file is too big.

If I use "for line in OPENED_FILE:" to read one line each time, how can
I get
the next line when I find the specific token?
And I think reading one line each time is less efficient, am I right?

Not necessarily. Try this:

    import sys

    f = file("filename.txt")
    for line in f:
        if token in line:  # or whatever you need to identify it
            break
    else:
        # the loop finished without hitting break
        sys.exit("File does not contain token")
    line = f.next()

Then line will be the one you want. Since this will use code written in
C to do the processing you will probably be pleasantly surprised by its
speed. Only if this isn't fast enough should you consider anything more
complicated.

Premature optimizations can waste huge amounts of unnecessary
programming time. Don't do it. First try measuring a solution that works!

Oh yes, thanks.
regards
Steve

First, I must say thanks to all of you. And I'm really sorry that I didn't
describe my problem clearly.

There are many tokens in the file. Every time I find a token, I have to get
the data on the next line and do some operation with it. It would be easy
for me to find just one token using the above method, but there is more
than one.

My method was:

    f_in = open('input_file', 'r')
    data_all = f_in.readlines()
    f_in.close()

    for i in range(len(data_all)):
        line = data_all[i]
        if token in line:
            # do something with data_all[i + 1]
            pass

Since my method needs to read the whole file into memory, I think it
may not be efficient when processing a very big file.

I really appreciate all suggestions! Thanks again.

Regards,

xiaojf

Nov 22 '05 #20

Xiao Jianfeng wrote:
First, I must say thanks to all of you. And I'm really sorry that I
didn't
describe my problem clearly.

There are many tokens in the file, every time I find a token, I have
to get
the data on the next line and do some operation with it. It should be easy
for me to find just one token using the above method, but there are
more than
one.

My method was:

    f_in = open('input_file', 'r')
    data_all = f_in.readlines()
    f_in.close()

    for i in range(len(data_all)):
        line = data_all[i]
        if token in line:
            # do something with data_all[i + 1]
            pass

Since my method needs to read all the file into memeory, I think it
may be not
efficient when processing very big file.

I really appreciate all suggestions! Thanks again.

Something like this:

    for x in fh:
        if not has_token(x):
            continue
        else:
            process(fh.next())

You can also create an iterator with iter(fh), but I don't think that is
necessary.

This uses the iterator's "side effect" to your advantage: calling fh.next()
inside the loop consumes the line after the token, so the loop resumes on
the line after that. I have been bitten by this side effect before, but for
your particular application it becomes an advantage.
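
For the many-tokens case, the shared-iterator trick can be sketched as a generator in current Python 3 syntax. The name lines_after_tokens and the sample text are hypothetical, and io.StringIO stands in for the opened file.

```python
import io


def lines_after_tokens(lines, token):
    """Yield the line that follows each line containing token."""
    it = iter(lines)
    for line in it:
        if token in line:
            following = next(it, None)  # takes the data line from the same iterator
            if following is None:  # token was on the last line
                return
            yield following


text = io.StringIO("TOKEN\nfirst\nnoise\nTOKEN\nsecond\nTOKEN\n")
results = list(lines_after_tokens(text, "TOKEN"))
```

Because the data line is consumed by next(), it is never tested for the token itself. If a data line can also be a token line, this sketch would need a different loop.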

Nov 22 '05 #22

bo****@gmail.com wrote:
Xiao Jianfeng wrote:

First, I must say thanks to all of you. And I'm really sorry that I
didn't
describe my problem clearly.

There are many tokens in the file, every time I find a token, I have
to get
the data on the next line and do some operation with it. It should be easy
for me to find just one token using the above method, but there are
more than
one.

My method was:

    f_in = open('input_file', 'r')
    data_all = f_in.readlines()
    f_in.close()

    for i in range(len(data_all)):
        line = data_all[i]
        if token in line:
            # do something with data_all[i + 1]
            pass

Since my method needs to read all the file into memeory, I think it
may be not
efficient when processing very big file.

I really appreciate all suggestions! Thanks again.

something like this :

    for x in fh:
        if not has_token(x):
            continue
        else:
            process(fh.next())

you can also create an iterator by iter(fh), but I don't think that is
necessary

using the "side effect" to your advantage. I was bite before for the
iterator's side effect but for your particular apps, it becomes an
advantage.

Thanks all of you!

I have compared the two methods:
(1) "for x in fh:"
(2) reading the whole file into memory first.

I tested the two methods on two files; the first is 80M and the second
is 815M.
The first method gained a speedup of about 40% for the first file, and
a speedup of about 25% for the second file.

Sorry for my bad English, and I hope I haven't confused people.

Regards,

xiaojf
Nov 22 '05 #24

Xiao Jianfeng wrote:
I have compared the two methods,
(1). "for x in fh:"
(2). read all the file into memory firstly.

I have tested the two methods on two files, one is 80M and the second
one is 815M.
The first method gained a speedup of about 40% for the first file, and
a speedup
of about 25% for the second file.

Sorry for my bad English, and I hope I haven't made people confused.


So is the problem solved ?

Putting buffering implementation aside, (1) is the way to go as it runs
through content only once.

Nov 22 '05 #26

bo****@gmail.com wrote:
Xiao Jianfeng wrote:

I have compared the two methods:
(1). "for x in fh:"
(2). read the whole file into memory first.

I tested both methods on two files, one of 80M and one of 815M.
The first method gave a speedup of about 40% for the first file and
about 25% for the second file.

Sorry for my bad English, and I hope I haven't confused anyone.

So is the problem solved?

Yes, thank you.

Putting buffering implementation aside, (1) is the way to go as it runs
through the content only once.

I think so :-)

Regards,

xiaojf

Nov 22 '05 #28
Yes, I have read this part....

readlines( [sizehint])

Read until EOF using readline() and return a list containing the lines thus
read. If the optional sizehint argument is present, instead of reading up to
EOF, whole lines totalling approximately sizehint bytes (possibly after
rounding up to an internal buffer size) are read. Objects implementing a
file-like interface may choose to ignore sizehint if it cannot be
implemented, or cannot be implemented efficiently.
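
As a rough illustration of the sizehint argument described above, the sketch below reads a file in batches of whole lines instead of all at once. It uses Python 3 syntax; process_in_batches and the 1 MB default are assumptions made for the example, and batch sizes are only approximate, as the documentation says.

```python
def process_in_batches(path, sizehint=1024 * 1024):
    """Yield lists of whole lines, each list holding roughly sizehint bytes."""
    with open(path) as fh:
        while True:
            # Returns an empty list once the end of the file is reached.
            batch = fh.readlines(sizehint)
            if not batch:
                break
            yield batch
```

Each batch is an ordinary list of strings, so only one batch has to fit in memory at a time.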

Maybe I'm missing the obvious, but it does not seem to say what happens when
the input for readlines is too big. Or does it?

How does one tell exactly what the limitation is to the size of the
returned list of strings?

----- Original Message -----
From: "Ben Finney" <bi****************@benfinney.id.au>
Newsgroups: comp.lang.python
To: <py*********@python.org>
Sent: Saturday, November 19, 2005 6:48 AM
Subject: Re: what happens when the file begin read is too big for all lines
to be read with "readlines()"

Ross Reyes <ro*******@rcn.com> wrote:
Sorry for maybe a too simple a question but I googled and also
checked my reference O'Reilly Learning Python book and I did not
find a satisfactory answer.


The Python documentation is online, and it's good to get familiar with
it:

<URL:http://docs.python.org/>

It's even possible to tell Google to search only that site with
"site:docs.python.org" as a search term.

When I use readlines, what happens if the number of lines is huge?
I have a very big file (4GB) I want to read in, but I'm sure there
must be some limitation to readlines and I'd like to know how it is
handled by python.


The documentation on methods of the 'file' type describes the
'readlines' method, and addresses this concern.

<URL:http://docs.python.org/lib/bltin-file-objects.html#l2h-244>

--
\ "If you're not part of the solution, you're part of the |
`\ precipitate." -- Steven Wright |
_o__) |
Ben Finney
--
http://mail.python.org/mailman/listinfo/python-list

Nov 22 '05 #30
Ross Reyes wrote:
Maybe I'm missing the obvious, but it does not seem to say what happens when
the input for readlines is too big. Or does it?

readlines handles memory overflow in exactly the same way as any
other operation: by raising a MemoryError exception:

http://www.python.org/doc/current/li...s.html#l2h-296

How does one tell exactly what the limitation is to the size of the
returned list of strings?


you can't. it depends on how much memory you have, what your
files look like (shorter lines means more string objects means more
overhead), and how your operating system handles large processes.
as soon as the operating system says that it cannot allocate more
memory to the Python process, Python will abort the operation and
raise an exception. if the operating system doesn't complain, neither
will Python.
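
A minimal sketch of what that can look like in practice, in Python 3 syntax. count_lines is a hypothetical helper, not part of the standard library: it tries readlines() and, if Python raises MemoryError, falls back to iterating over the file. It assumes the operating system actually refuses the allocation; some systems swap heavily or kill the process instead, in which case no exception is ever seen.

```python
def count_lines(path):
    """Count lines with readlines(), falling back to streaming on MemoryError."""
    try:
        with open(path) as fh:
            return len(fh.readlines())
    except MemoryError:
        # The list of lines did not fit in memory; re-read one line at a time.
        with open(path) as fh:
            return sum(1 for _ in fh)
```

For a file known to be large it is simpler to skip the first attempt and iterate from the start.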

</F>

Nov 22 '05 #32
"Ross Reyes" <ro*******@rcn.com> writes:
Yes, I have read this part....
How does one tell exactly what the limitation is to the size of the
returned list of strings?


There's not really a good platform-independent way to do that, because
you'll get memory until the OS won't give you any more.
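
Since the limit cannot be queried portably, one pragmatic workaround is to check the file size and pick a strategy against a budget you choose yourself. The sketch below is Python 3 and rests on assumptions: choose_reader, total_length and the 100 MB default are invented for the example, and the size on disk underestimates the memory a list of lines needs, since every line carries per-object overhead.

```python
import os


def choose_reader(path, budget_bytes):
    """Return 'readlines' if the file is within the budget, else 'iterate'."""
    if os.path.getsize(path) <= budget_bytes:
        return "readlines"
    return "iterate"


def total_length(path, budget_bytes=100 * 1024 * 1024):
    """Sum line lengths using whichever strategy choose_reader picks."""
    with open(path) as fh:
        if choose_reader(path, budget_bytes) == "readlines":
            lines = fh.readlines()   # whole file held in memory as a list
        else:
            lines = fh               # file object yields one line at a time
        return sum(len(line) for line in lines)
```

The budget is a guess about your own machine, not something Python can verify.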

<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
Nov 22 '05 #34
