unicode .replace not working - why?

Kurt Peters

I'm using the code below to read a pdf document, and it has no line feeds
or carriage returns in the imported text. I'm therefore trying to just
replace the symbol that looks like it would be an end of line (found by
examining the characters in the "for loop"), unichr(167).
Unfortunately, the replace isn't working. Does anyone know what I'm
doing wrong? I tried a number of things, so I left comments in place as a
subset of the bunch of things I tried to no avail.

Any help?
Kurt

#!/usr/bin/python
# -*- coding: utf-8 -*-
from pyPdf import PdfFileWriter, PdfFileReader
import unicodedata
fileencoding = "utf-16-LE" #"iso-8859-1" # "utf-8"
doc = PdfFileReader(file(r"C:\Documents and Settings\kpeters\My Documents\SUA.pdf", "rb"))

# print the title of document1.pdf
print "title = %s" % (doc.getDocumentInfo().title)
print "Subject:", doc.getDocumentInfo().subject
print "PDF Version:", doc.getDocumentInfo().producer
page4 = doc.getPage(3)
textu= page4.extractText()
#textu=textu.decode(fileencoding)
print type(textu)
#print type(textu.encode(fileencoding))
#textu=textu.encode(fileencoding) #Converts to str
fn = unichr(167)
print('The char is %s' % fn)
textu.replace(unichr(167),'\n')
#print unicodedata.bidirectional(fn) unichr(167)
for i, c in enumerate(textu):
    if (i!=302):
        print('# %d has char %s, ord: %d , char: %s, category %s, and Name: %s'
              % (i, c, ord(c), unichr(ord(c)), unicodedata.category(c),
                 unicodedata.name(c)))

    #if (ord(c)==167):
    #    print('Found it!')
    #textu[i]='\n'
print('----------------------------------------------------')
print textu
print textu.encode(fileencoding)

Oct 11 '08 #1

John Machin

On Oct 12, 7:05 am, Kurt Peters <nospampete...@bigfoot.com> wrote:

> I'm using the code below to read a pdf document, and it has no line feeds
> or carriage returns in the imported text. I'm therefore trying to just
> replace the symbol that looks like it would be an end of line (found by
> examining the characters in the "for loop") unichr(167).
> Unfortunately, the replace isn't working, does anyone know what I'm
> doing wrong? I tried a number of things so I left comments in place as a
> subset of the bunch of things I tried to no avail.

This is the first time I've ever looked inside a PDF file, and *only*
one file, but:

import pyPdf, sys
filename = sys.argv[1]
doc = pyPdf.PdfFileReader(open(filename, "rb"))
for pageno in range(doc.getNumPages()):
    page = doc.getPage(pageno)
    textu = page.extractText()
    print "pageno", pageno
    print type(textu)
    print repr(textu)

gives me <type 'unicode'> and text with lots of \n at places where
you'd expect them.

The only problem I can see is that where I see (and expect) quotation
marks (U+201C and U+201D) when viewing the file with Acrobat Reader,
the repr is showing \ufb01 and \ufb02. Similar problems with em-dashes
and apostrophes. I had a bit of a poke around:

1. repr(result of FlateDecode) includes *both* the raw bytes \x93 and
\x94, *and* the octal escapes \\223 and \\224 (which pyPdf translates
into \x93 and \x94).

2. Then pyPdf appears to push these through a fixed transformation
table (_pdfDocEncoding in generic.py) and they become \ufb01 and
\ufb02.

3. However:
|>>> '\x93\x94'.decode('cp1252') # as suspected
|u'\u201c\u201d' # as expected
|>>>

AFAICT there is only one reference to encoding in the pyPdf docs: "if
pyPdf was unable to decode the string's text encoding" ...
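
The cp1252 observation in step 3 can be reproduced without pyPdf. What follows is only a sketch, written for Python 3 (the session above is Python 2). It shows that the bytes 0x93 and 0x94 decode to the curly quotes under cp1252. The `LIGATURE_TO_QUOTE` table and the `fix_quotes` helper are hypothetical names invented for illustration, not part of pyPdf, and the remapping is only reasonable under the assumption that the page contains no genuine fi/fl ligatures.

```python
# Bytes as described in the observation above (0x93 ... 0x94).
raw = b"\x93quoted\x94"

# Decoding as Windows-1252 yields the curly quotes U+201C and U+201D.
as_cp1252 = raw.decode("cp1252")
print(ascii(as_cp1252))

# Hypothetical post-processing table: map the ligature code points seen in
# the extracted text back to curly quotes. This is an assumption, and it
# would damage any real fi/fl ligatures in the text.
LIGATURE_TO_QUOTE = {0xFB01: 0x201C, 0xFB02: 0x201D}


def fix_quotes(text):
    """Return a copy of text with U+FB01/U+FB02 replaced by curly quotes."""
    return text.translate(LIGATURE_TO_QUOTE)


extracted = "\ufb01quoted\ufb02"  # made-up stand-in for the extracted text
print(ascii(fix_quotes(extracted)))
```

Whether such a remap is appropriate depends on the encoding the PDF's fonts actually declare, so it should be checked against the rendered page first.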

Cheers,
John

Oct 11 '08 #2

Kurt Peters

I had done that about 21 revisions ago. Nevertheless, why would you think
that would work, when the code as shown doesn't?
kurt
"Dennis Lee Bieber" <wl*****@ix.netcom.com> wrote in message
news:ms******************************@earthlink.com...

On Sat, 11 Oct 2008 15:05:43 -0500, Kurt Peters
<no***********@bigfoot.com> declaimed the following in comp.lang.python:

>textu.replace(unichr(167),'\n')

Might I suggest:

textu = textu.replace(fn, "\n") #you already created fn as the character
--
Wulfraed Dennis Lee Bieber KD6MOG
wl*****@ix.netcom.com wu******@bestiaria.com
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: we******@bestiaria.com)
HTTP://www.bestiaria.com/

Oct 12 '08 #3

Peter Otten

Kurt Peters wrote:

> I had done that about 21 revisions ago.

If you litter your module with code that is commented out it is hard to keep
track of what works and what doesn't.

> Nevertheless, why would you think
> that would work, when the code as shown doesn't?

Because he knows Python? Why don't /you/ try it before asking that question?

A good place to do "exploratory" programming is Python's interactive
interpreter. Here's a sample session:

Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:43)
[GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyPdf import PdfFileReader as PFR
>>> doc = PFR(open("SUA.pdf"))
>>> text = doc.getPage(3).extractText()
>>> type(text)
<type 'unicode'>
>>> text[:200]
u'2/16/08 7400.8P Table of Contents - Continued Section Page \xa773.49 New Hampshire (NH) 50 \xa773.50 New Jersey (NJ) 50 \xa773.51 New Mexico (NM) 51 \xa773.52 New York (NY) 56 \xa773.53 North '
>>> print text[:200].replace(u"\xa7", u"\n")
2/16/08 7400.8P Table of Contents - Continued Section Page
73.49 New Hampshire (NH) 50
73.50 New Jersey (NJ) 50
73.51 New Mexico (NM) 51
73.52 New York (NY) 56
73.53 North
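
The \xa7 in the repr above is the same character as unichr(167) from the original post. Here is a small sketch that confirms this, written for Python 3, where chr replaces Python 2's unichr. It also shows a name lookup that does not raise for unnamed characters. The `describe` helper and the sample string are made up for illustration.

```python
import unicodedata


def describe(ch):
    """Return (code point, category, name) for a single character."""
    # unicodedata.name() raises ValueError for characters without a name
    # (control characters, for example) unless a default is supplied.
    return ord(ch), unicodedata.category(ch), unicodedata.name(ch, "<unnamed>")


section = chr(167)
assert section == "\xa7"

# Made-up sample resembling the extracted page text.
sample = "Page \xa773.49 New Hampshire (NH) 50\n"
for index, ch in enumerate(sample):
    # Only report the characters worth a closer look.
    if ord(ch) > 127 or ch == "\n":
        print(index, describe(ch))
```

Passing a default to unicodedata.name() avoids having to skip individual indexes by hand when a character has no name.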

Peter

Oct 12 '08 #4

Kurt Peters

Thanks...

On a side note, do you really think the function call wouldn't interpret
the unichr before the function call?
Kurt

Oct 13 '08 #5

Kurt Peters

Thanks,
clearly though, my "for loop" shows a character with ord 167, and using
print repr(textu), it shows the character \xa7 (as does Peter Otten's post).
So you can see what I see, here's the document I'm using - the Special Use
Airspace document at
http://www.faa.gov/airports_airtraff.../publications/
which is JO 7400.8P (PDF).

If you just look at page three, it shows those unusual characters.
Once again, using a "simple" replace doesn't seem to work. I can't seem to
figure out how to get it to work, despite all the great posts attempting to
shed some light on the subject.

Regards,
Kurt

Oct 13 '08 #6

Mark Tolonen

In your original code:

textu.replace(unichr(167),'\n')

as Dennis suggested (but maybe you were distracted by his 'fn' replacement,
so I'll leave it out):

textu = textu.replace(unichr(167),'\n')

.replace does not modify the string in place. It returns the modified
string, so you have to reassign it.
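
The difference can be seen without a PDF at all. This is a minimal sketch, written for Python 3 (where chr replaces Python 2's unichr) and using a made-up sample string in place of the extracted page text:

```python
# Made-up stand-in for the text returned by extractText().
textu = "Section Page \xa773.49 New Hampshire (NH) 50 \xa773.50 New Jersey (NJ) 50"
original = textu

# Calling replace() and discarding the result leaves textu untouched,
# because strings are immutable.
textu.replace(chr(167), "\n")
assert textu == original
assert "\n" not in textu

# Rebinding the name to the returned string is what makes the change stick.
textu = textu.replace(chr(167), "\n")
assert chr(167) not in textu
print(textu)
```

The same holds for other string methods such as strip() and upper(): each returns a new string rather than changing the one it was called on.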

-Mark
Oct 13 '08 #7

Martin v. Löwis

> On a side note, do you really think the function call wouldn't interpret
> the unichr before the function call?

Dennis' main point was not that you can reuse fn (which he suggested
just as a performance improvement), but that you need to assign the result
of .replace back to textu.

Regards,
Martin

Oct 13 '08 #8

Kurt Peters

Thanks,
The "distraction" was my problem. I replaced the textu.replace as you
suggested and it works fine.
Kurt

Oct 18 '08 #9
