unicode .replace not working - why?

I'm using the code below to read a PDF document, and the imported text has
no line feeds or carriage returns. I'm therefore trying to replace the
symbol that looks like it marks the end of a line, unichr(167) (found by
examining the characters in the "for loop").
Unfortunately, the replace isn't working. Does anyone know what I'm
doing wrong? I tried a number of things to no avail, and left a subset of
them in place as comments.

Any help?
Kurt

#!/usr/bin/python
# -*- coding: utf-8 -*-
from pyPdf import PdfFileWriter, PdfFileReader
import unicodedata
fileencoding = "utf-16-LE" #"iso-8859-1" # "utf-8"
doc = PdfFileReader(file(r"C:\Documents and Settings\kpeters\My Documents\SUA.pdf", "rb"))

# print the title of document1.pdf
print "title = %s" % (doc.getDocumentInfo().title)
print "Subject:", doc.getDocumentInfo().subject
print "PDF Version:", doc.getDocumentInfo().producer
page4 = doc.getPage(3)
textu= page4.extractText()
#textu=textu.decode(fileencoding)
print type(textu)
#print type(textu.encode(fileencoding))
#textu=textu.encode(fileencoding) #Converts to str
fn = unichr(167)
print('The char is %s' % fn)
textu.replace(unichr(167),'\n')
#print unicodedata.bidirectional(fn) unichr(167)
for i, c in enumerate(textu):
    if (i!=302):
        print('# %d has char %s, ord: %d , char: %s, category %s, and Name: %s' % (i, c, ord(c), unichr(ord(c)), unicodedata.category(c), unicodedata.name(c)))

    #if (ord(c)==167):
    # print('Found it!')
    #textu[i]='\n'
print('----------------------------------------------------')
print textu
print textu.encode(fileencoding)
Oct 11 '08 #1
On Oct 12, 7:05 am, Kurt Peters <nospampete...@bigfoot.com> wrote:
> I'm using the code below to read a pdf document, and it has no line feeds
> or carriage returns in the imported text. I'm therefore trying to just
> replace the symbol that looks like it would be an end of line (found by
> examining the characters in the "for loop") unichr(167).
> Unfortunately, the replace isn't working, does anyone know what I'm
> doing wrong? I tried a number of things so I left comments in place as a
> subset of the bunch of things I tried to no avail.
This is the first time I've ever looked inside a PDF file, and *only*
one file, but:

import pyPdf, sys
filename = sys.argv[1]
doc = pyPdf.PdfFileReader(open(filename, "rb"))
for pageno in range(doc.getNumPages()):
    page = doc.getPage(pageno)
    textu = page.extractText()
    print "pageno", pageno
    print type(textu)
    print repr(textu)

gives me <type 'unicode'> and text with lots of \n at places where
you'd expect them.

The only problem I can see is that where I see (and expect) quotation
marks (U+201C and U+201D) when viewing the file with Acrobat Reader,
the repr is showing \ufb01 and \ufb02. Similar problems with em-dashes
and apostrophes. I had a bit of a poke around:

1. repr(result of FlateDecode) includes *both* the raw bytes \x93 and
\x94, *and* the octal escapes \\223 and \\224 (which pyPdf translates
into \x93 and \x94).

2. Then pyPdf appears to push these through a fixed transformation
table (_pdfDocEncoding in generic.py) and they become \ufb01 and
\ufb02.

3. However:
|>>> '\x93\x94'.decode('cp1252') # as suspected
|u'\u201c\u201d' # as expected
|>>>

AFAICT there is only one reference to encoding in the pyPdf docs: "if
pyPdf was unable to decode the string's text encoding" ...
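
The difference between the two decodings in step 3 can be reproduced without pyPdf. The sketch below is written for Python 3 (the thread itself uses Python 2). `PDFDOC_SAMPLE` is a hypothetical two-entry stand-in for the `_pdfDocEncoding` table, filled in only with the two values reported above; it is not the library's actual table.

```python
# Bytes seen in the decompressed PDF stream, per the observation above.
raw = b"\x93\x94"

# Interpreted as Windows-1252, they are the curly quotation marks.
as_cp1252 = raw.decode("cp1252")

# Hypothetical excerpt: only the two mappings reported in this thread.
PDFDOC_SAMPLE = {0x93: "\ufb01", 0x94: "\ufb02"}
as_pdfdoc = "".join(PDFDOC_SAMPLE[b] for b in raw)

# ascii() shows the escapes regardless of the terminal's encoding.
print(ascii(as_cp1252))
print(ascii(as_pdfdoc))
```

If the file's strings really are cp1252, the first result is the one that matches what Acrobat Reader displays.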

Cheers,
John
Oct 11 '08 #2
I had done that about 21 revisions ago. Nevertheless, why would you think
that would work, when the code as shown doesn't?
kurt
"Dennis Lee Bieber" <wl*****@ix.netcom.com> wrote in message
news:ms******************************@earthlink.com...
On Sat, 11 Oct 2008 15:05:43 -0500, Kurt Peters
<no***********@bigfoot.com> declaimed the following in comp.lang.python:

>textu.replace(unichr(167),'\n')

Might I suggest:

textu = textu.replace(fn, "\n") #you already created fn as the character
--
Wulfraed Dennis Lee Bieber KD6MOG
wl*****@ix.netcom.com wu******@bestiaria.com
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: we******@bestiaria.com)
HTTP://www.bestiaria.com/

Oct 12 '08 #3
Kurt Peters wrote:
> I had done that about 21 revisions ago.
If you litter your module with code that is commented out it is hard to keep
track of what works and what doesn't.
> Nevertheless, why would you think
> that would work, when the code as shown doesn't?
Because he knows Python? Why don't /you/ try it before asking that question?

A good place to do "exploratory" programming is Python's interactive
interpreter. Here's a sample session:

Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:43)
[GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyPdf import PdfFileReader as PFR
>>> doc = PFR(open("SUA.pdf"))
>>> text = doc.getPage(3).extractText()
>>> type(text)
<type 'unicode'>
>>> text[:200]
u'2/16/08 7400.8P Table of Contents - Continued Section Page \xa773.49 New Hampshire (NH) 50 \xa773.50 New Jersey (NJ) 50 \xa773.51 New Mexico (NM) 51 \xa773.52 New York (NY) 56 \xa773.53 North '
>>> print text[:200].replace(u"\xa7", u"\n")
2/16/08 7400.8P Table of Contents - Continued Section Page
73.49 New Hampshire (NH) 50
73.50 New Jersey (NJ) 50
73.51 New Mexico (NM) 51
73.52 New York (NY) 56
73.53 North

Peter
Oct 12 '08 #4
Thanks...

On a side note, do you really think the function call wouldn't interpret
the unichr before the function call?
Kurt
Oct 13 '08 #5
Thanks,
clearly though, my "for loop" shows a character with ord(c) == 167, and
print repr(textu) shows the character \xa7 (as does Peter Otten's post).
So you can see what I see, here's the document I'm using: the Special Use
Airspace document at
http://www.faa.gov/airports_airtraff.../publications/
which is JO 7400.8P (PDF).

If you just look at page three, it shows those unusual characters.
Once again, a "simple" replace doesn't seem to work. I can't seem to
figure out how to get it to work, despite all the great posts attempting to
shed some light on the subject.
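
One way to confirm what the loop and repr() are reporting is to inspect the character directly. This is a small sketch for Python 3, where `chr` takes the place of Python 2's `unichr`. The `sample` string is a made-up fragment modelled on the text quoted earlier in the thread, not real extractor output.

```python
import unicodedata

# Made-up fragment resembling the extracted page text.
sample = "Page \xa773.49 New Hampshire (NH) 50 \xa773.50 New Jersey (NJ) 50"

marker = chr(167)  # the same code point that repr() shows as \xa7

# Collect every position where the marker occurs and print its Unicode name.
positions = [i for i, c in enumerate(sample) if c == marker]
for i in positions:
    print(i, ascii(sample[i]), unicodedata.name(sample[i]))
```

Both spellings refer to the same character, so a replace on either should find it; if nothing changes, the cause lies elsewhere.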

Regards,
Kurt
Oct 13 '08 #6
In your original code:

textu.replace(unichr(167),'\n')

as Dennis suggested (but maybe you were distracted by his 'fn' replacement,
so I'll leave it out):

textu = textu.replace(unichr(167),'\n')

.replace does not modify the string in place. It returns the modified
string, so you have to reassign it.
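
A short sketch of that point, written for Python 3 (so `chr` stands in for `unichr`) and using a made-up `textu` value rather than real pyPdf output:

```python
# Made-up stand-in for the text returned by extractText().
textu = "Section Page \xa773.49 New Hampshire \xa773.50 New Jersey"

# Calling replace() on its own builds a new string and then discards it.
textu.replace(chr(167), "\n")
unchanged = chr(167) in textu  # the marker is still present

# Rebinding the name keeps the result.
textu = textu.replace(chr(167), "\n")
fixed = chr(167) not in textu

print(unchanged, fixed)
```

The same applies to other string methods such as strip() and upper(): strings are immutable, so each call returns a new object.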

-Mark
Oct 13 '08 #7
> On a side note, do you really think the function call wouldn't interpret
> the unichr before the function call?
Dennis' main point was not that you can reuse fn (which he suggested
just as performance improvement), but that you need to assign the result
of .replace back to textu.

Regards,
Martin
Oct 13 '08 #8
Thanks,
The "distraction" was my problem. I replaced the textu.replace as you
suggested and it works fine.
Kurt

Oct 18 '08 #9
