readlines() reading incorrect number of lines?

Wojciech Gryc

Hi,

I'm currently using Python to deal with a fairly large text file (800
MB), which I know has about 85,000 lines of text. I can confirm this
because (1) I built the file myself, and (2) running a basic Java
program to count lines yields a number in that range.

However, when I use Python's various methods -- readline(),
readlines(), or xreadlines() -- and loop through the lines of the file,
the program exits at 16,000 lines. No error output or anything --
it seems the end of the loop was reached, and the code was executed
successfully.

I'm baffled and confused, and would be grateful for any advice as to
what I'm doing wrong, or why this may be happening.

Thank you,
Wojciech Gryc

Dec 20 '07 #1

John Machin

On Dec 21, 6:48 am, Wojciech Gryc <wojci...@gmail.com> wrote:

> Hi,
>
> I'm currently using Python to deal with a fairly large text file (800
> MB), which I know has about 85,000 lines of text. I can confirm this
> because (1) I built the file myself, and (2) running a basic Java
> program to count lines yields a number in that range.
>
> However, when I use Python's various methods -- readline(),
> readlines(), or xreadlines() -- and loop through the lines of the file,
> the program exits at 16,000 lines. No error output or anything --
> it seems the end of the loop was reached, and the code was executed
> successfully.
>
> I'm baffled and confused, and would be grateful for any advice as to
> what I'm doing wrong, or why this may be happening.

What platform, and what version of Python?

One possibility: you are running this on Windows and the file contains
Ctrl-Z aka chr(26) aka '\x1a'.

Dec 20 '07 #2

Wojciech Gryc

Hi,

Python 2.5, on Windows XP. Actually, I think you may be right about
\x1a -- there's a few lines that definitely have some strange
character sequences, so this would make sense... Would you happen to
know how I can actually fix this (e.g. replace the character)? Since
Python doesn't see the rest of the file, I don't even know how to get
to it to fix the problem... Due to the nature of the data I'm working
with, manual editing is also not an option.

Thanks,
Wojciech

Dec 20 '07 #3

John Machin

On Dec 21, 7:41 am, Wojciech Gryc <wojci...@gmail.com> wrote:

> Hi,
>
> Python 2.5, on Windows XP. Actually, I think you may be right about
> \x1a -- there's a few lines that definitely have some strange
> character sequences, so this would make sense... Would you happen to
> know how I can actually fix this (e.g. replace the character)? Since
> Python doesn't see the rest of the file, I don't even know how to get
> to it to fix the problem... Due to the nature of the data I'm working
> with, manual editing is also not an option.

Please don't top-post.

Quick hack to remove all occurrences of '\x1a' (untested):

fin = open('old_file', 'rb') # note b BINARY
fout = open('new_file', 'wb')
blksz = 1024 * 1024
while True:
    blk = fin.read(blksz)
    if not blk:
        break
    fout.write(blk.replace('\x1a', ''))
fout.close()
fin.close()

You may however want to investigate the "strange character sequences"
that have somehow appeared in your file after you built it
yourself :-)
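
For anyone trying this on Python 3, the quick hack above needs bytes literals, because a file opened in 'rb' mode yields bytes rather than str. A minimal sketch of the same block-by-block copy; the function name strip_ctrl_z and its return value (a count of removed bytes) are illustrative assumptions, not something from this thread:

```python
def strip_ctrl_z(src_path, dst_path, block_size=1024 * 1024):
    """Copy src_path to dst_path, dropping every Ctrl-Z (0x1A) byte.

    Returns the number of bytes removed. Both files are opened in
    binary mode so nothing interprets Ctrl-Z as end-of-file.
    """
    removed = 0
    with open(src_path, "rb") as fin, open(dst_path, "wb") as fout:
        while True:
            block = fin.read(block_size)
            if not block:
                break
            removed += block.count(b"\x1a")
            fout.write(block.replace(b"\x1a", b""))
    return removed
```

Since the target is a single byte, it cannot straddle a block boundary, so reading in fixed-size blocks is safe here. Returning the count gives a quick check on how many stray characters the file actually contained.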

HTH,
John

Dec 20 '07 #4

Steven D'Aprano

[Fixing top-posting.]

On Thu, 20 Dec 2007 12:41:44 -0800, Wojciech Gryc wrote:

> On Dec 20, 3:30 pm, John Machin <sjmac...@lexicon.net> wrote:
>
> [snip]
>
>>> However, when I use Python's various methods -- readline(),
>>> readlines(), or xreadlines() -- and loop through the lines of the file,
>>> the program exits at 16,000 lines. No error output or anything
>>> -- it seems the end of the loop was reached, and the code was
>>> executed successfully.
>
> ...
>
>> One possibility: you are running this on Windows and the file contains
>> Ctrl-Z aka chr(26) aka '\x1a'.
>
> Hi,
>
> Python 2.5, on Windows XP. Actually, I think you may be right about \x1a
> -- there's a few lines that definitely have some strange character
> sequences, so this would make sense... Would you happen to know how I
> can actually fix this (e.g. replace the character)? Since Python doesn't
> see the rest of the file, I don't even know how to get to it to fix the
> problem... Due to the nature of the data I'm working with, manual
> editing is also not an option.
>
> Thanks,
> Wojciech

Open the file in binary mode:

open(filename, 'rb')

and Windows should do no special handling of Ctrl-Z characters.
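
One way to check that binary mode really sees the whole file is to count newline bytes block by block and compare the result with the expected line count. A rough sketch in Python 3 syntax; the function name count_lines_binary and the block size are illustrative assumptions:

```python
def count_lines_binary(path, block_size=1024 * 1024):
    """Count newline bytes in a file opened in binary mode.

    Binary mode hands back the raw bytes, so a Ctrl-Z (0x1A) in the
    data is just another byte and does not end the read early.
    """
    count = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            count += block.count(b"\n")
    return count
```

If this reports roughly the 85,000 lines expected while a text-mode loop stops at 16,000, a Ctrl-Z byte near that point is the likely cause.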

--
Steven

Dec 20 '07 #5

John Machin

On Dec 21, 8:13 am, Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> wrote:

> Open the file in binary mode:
>
> open(filename, 'rb')
>
> and Windows should do no special handling of Ctrl-Z characters.
>
> --
> Steven

I don't know whether it's a bug or a feature or just a dark corner,
but using mode='rU' does no special handling of Ctrl-Z either.

>>> x = 'foo\r\n\x1abar\r\n'
>>> f = open('udcray.txt', 'wb')
>>> f.write(x)
>>> f.close()
>>> open('udcray.txt', 'r').readlines()
['foo\n']
>>> open('udcray.txt', 'rU').readlines()
['foo\n', '\x1abar\n']
>>> for line in open('udcray.txt', 'rU'):
...     print repr(line)
...
'foo\n'
'\x1abar\n'
>>>

Using 'rU' should make the OP's task of finding the strange character
sequences a bit easier -- he won't have to read a block at a time and
worry about the guff straddling a block boundary.
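
The 'U' mode flag is not accepted by recent Python 3 releases, but iterating over a file opened in binary mode gives a similar line-at-a-time view without any Ctrl-Z handling. A sketch of how the offending lines might be located that way; the helper name find_ctrl_z_lines and the shape of its return value are assumptions made for illustration:

```python
def find_ctrl_z_lines(path):
    """Return (line_number, column, raw_line) for each line containing 0x1A.

    Iterating a binary file splits on b"\n" only, so each raw line
    still carries its b"\r\n" ending and any Ctrl-Z bytes.
    Only the first Ctrl-Z in each line is reported.
    """
    hits = []
    with open(path, "rb") as f:
        for lineno, line in enumerate(f, start=1):
            col = line.find(b"\x1a")
            if col != -1:
                hits.append((lineno, col, line))
    return hits
```

Printing repr() of each reported line should show the surrounding "strange character sequences" in context, which may help track down how they got into the file.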

Dec 20 '07 #6

Gerry

Something I've occasionally found helpful with problem text files is
to build a histogram of character counts, something like this:
"""
chist.py
print a histogram of character frequencies in a nemed input file
"""

import sys

whitespace = ' \t\n\r\v\f'
lowercase = 'abcdefghijklmnopqrstuvwxyz'
uppercase = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
letters = lowercase + uppercase
ascii_lowercase = lowercase
ascii_uppercase = uppercase
ascii_letters = ascii_lowercase + ascii_uppercase
digits = '0123456789'
hexdigits = digits + 'abcdef' + 'ABCDEF'
octdigits = '01234567'
punctuation = """!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""
printable = digits + letters + punctuation

try:
fname = sys.argv[1]
except:
print "usage is chist yourfilename"
sys.exit()

chars = {}

f = open (fname, "rb")
lines = f.readlines()
for line in lines:
for c in line:
try:
chars[ord(c)] += 1
except:
chars[ord(c)] = 1

ords = chars.keys()
ords.sort()

for o in ords:
if chr(o) in printable:
c = chr(o)
else:
c = "UNP"

print "%5d %-5s %10d" % (o, c, chars[o])
print "_" * 50
Gerry

Dec 21 '07 #7

Gabriel Genellina

On Fri, 21 Dec 2007 16:42:21 -0300, Gerry <ge**********@gmail.com>
wrote:

> whitespace = ' \t\n\r\v\f'
> lowercase = 'abcdefghijklmnopqrstuvwxyz'
> uppercase = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
> letters = lowercase + uppercase
> ascii_lowercase = lowercase
> ascii_uppercase = uppercase
> ascii_letters = ascii_lowercase + ascii_uppercase
> digits = '0123456789'
> hexdigits = digits + 'abcdef' + 'ABCDEF'
> octdigits = '01234567'
> punctuation = """!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""
> printable = digits + letters + punctuation

You do know that most -if not all- of those sets are available as
attributes of the string module, don't you?
You could replace all the lines above with: from string import printable,
as it's the only constant used.
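
Combining that suggestion with the histogram idea, a Python 3 version might look like the sketch below. The function names byte_histogram and format_histogram are hypothetical. One caveat: string.printable also includes whitespace, which the hand-built printable above does not, so this sketch assembles the visible set from string.digits, string.ascii_letters and string.punctuation to keep tabs and newlines labelled "UNP".

```python
import string
from collections import Counter


def byte_histogram(path, block_size=1024 * 1024):
    """Count how often each byte value (0..255) occurs in the file."""
    counts = Counter()
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            counts.update(block)  # iterating bytes yields ints
    return counts


def format_histogram(counts):
    """Return one formatted row per byte value, in ascending order."""
    visible = set(string.digits + string.ascii_letters + string.punctuation)
    rows = []
    for value in sorted(counts):
        label = chr(value) if chr(value) in visible else "UNP"
        rows.append("%5d %-5s %10d" % (value, label, counts[value]))
    return rows
```

Reading in blocks rather than with readlines() keeps memory use flat on an 800 MB file, and a non-zero count for byte value 26 would confirm the Ctrl-Z theory.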

--
Gabriel Genellina

Dec 27 '07 #8
