Help needed with python unicode cgi-bin script

Dear web gods:

After much, much, much struggle with unicode, many an hour reading all the
examples online, coding them, testing them, ripping them apart and putting
them back together, I am humbled. Therefore, I humble myself before you to
seek guidance on a simple python unicode cgi-bin scripting problem.

My problem is more complex than this, but how about I boil down one sticking
point for starters. I have a file with a Spanish word in it, "años", which I
wish to read with:
#!C:/Program Files/Python23/python.exe

STARTHTML= u'''Content-Type: text/html

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
</head>
<body>
'''
ENDHTML = u'''
</body>
</html>
'''
print STARTHTML
print open('c:/test/spanish.txt','r').read()
print ENDHTML
Instead of seeing "año" I see "a?o". BAD BAD BAD
Yet, if I open the file with the browser (IE/Mozilla), I see "año." THIS IS
WHAT I WANT

WHAT GIVES?

Next, I'll get into codecs and stuff, but how about starting with this?

The general question is, does anybody have a complete working example of a
cgi-bin script that does the above properly that they'd be willing to share?
I've tried various examples online but haven't been able to get any to work.
I end up seeing hex code for the non-ascii characters u'a\xf1o', and later
on 'a\xc3\xb1o', which are also BAD BAD BAD.

Thanks -- your humble supplicant.
Dec 10 '07 #1
You probably need to set stdout to binary mode. It is not binary by default
on Windows.

Dec 10 '07 #2
>My problem is more complex than this, but how about I boil down one sticking
>point for starters. I have a file with a Spanish word in it, "años", which I
>wish to read with:
What is the encoding of that file? Without a correct answer to that
question, you will not be able to achieve what you want.

Possible answers are "iso-8859-1", "utf-8", "windows-1252", and "cp850"
(these all support the word "años")
>Instead of seeing "año" I see "a?o". BAD BAD BAD
I don't see anything here. Where do you see the question mark? Did you
perhaps run the CGI script in a web server, point your web browser
to the web page, and see the question mark in the web browser?
>WHAT GIVES?
Sending "Content-type: text/html" is not enough. The web browser needs
to know what the encoding is. So you should send

Content-type: text/html; charset="your-encoding-here"

Use "extras/page information" in Firefox to find out what the web
browser thinks the encoding of the page is.
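
Martin's point about the header can be sketched as follows. This is a minimal Python 3 sketch, not the poster's Python 2.3 script, and the helper name `cgi_response` is hypothetical. It passes the body bytes through unchanged and only makes the header name the encoding those bytes are really in.

```python
def cgi_response(body: bytes, charset: str) -> bytes:
    """Prefix raw body bytes with a Content-Type header naming their encoding.

    `charset` must be the encoding the bytes are actually in; nothing is
    converted here.
    """
    header = "Content-Type: text/html; charset=%s\r\n\r\n" % charset
    # Header names and charset labels are plain ASCII.
    return header.encode("ascii") + body


# The single byte 0xF1 is "ñ" in windows-1252, so the header has to say so.
page = cgi_response(b"<p>a\xf1o</p>", "windows-1252")
```

If the header named utf-8 while the body still held the lone byte 0xF1, the browser would be decoding the body with the wrong codec, which is the kind of mismatch this thread is chasing.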

Regards,
Martin

P.S. Please, stop shouting.
Dec 10 '07 #3
Thanks for the reply, Jack. I tried setting mode to binary but it had no
effect.

Dec 10 '07 #4
Hi Martin, thanks for your response. My updates are interleaved with your
response below:

>What is the encoding of that file? Without a correct answer to that
>question, you will not be able to achieve what you want.
I don't know for sure the encoding of the file. I'm assuming it has no
intrinsic encoding since I copied the word "año" into vim and then saved it
as the example text file called "spanish.txt".
>Possible answers are "iso-8859-1", "utf-8", "windows-1252", and "cp850"
>(these all support the word "año")
>>Instead of seeing "año" I see "a?o".
>
>I don't see anything here. Where do you see the question mark? Did you
>perhaps run the CGI script in a web server, and pointed your web browser
>to the web page, and saw the question mark in the web browser?
The cgi-bin script prints to stdout, i.e. to my browser, and when I use
print I see a square box where the ñ should be. When I use print repr(...) I
see 'a\xf1o'. I never see the desired 'ñ' character.
>Sending "Content-type: text/html" is not enough. The web browser needs
>to know what the encoding is. So you should send
>
>Content-type: text/html; charset="your-encoding-here"
Sorry, somehow my cut and paste job into Outlook missed the exact line you
had above that specifies the encoding to be set as "utf8", but it's there in
my program. Not to worry.
>Use "extras/page information" in Firefox to find out what the web
>browser thinks the encoding of the page is.
Firefox says the page is UTF8.
>P.S. Please, stop shouting.
OK, it's just that it hurts when I've been pulling my hair out for days on
end over a single line of code. I don't want to go bald just yet.
Dec 10 '07 #5
On Dec 11, 9:55 am, "weheh" <we...@verizon.net> wrote:
>Hi Martin, thanks for your response. My updates are interleaved with your
>response below:
>>What is the encoding of that file? Without a correct answer to that
>>question, you will not be able to achieve what you want.
>
>I don't know for sure the encoding of the file. I'm assuming it has no
>intrinsic encoding since I copied the word "año" into vim and then saved it
>as the example text file called, "spanish.txt".
Every text file is encoded, and very few of them are tagged with the name
of the encoding in any reliable fashion.
Forget for the moment what you see in the browser. You need to find
out how your file is encoded.

Look at your file using
print repr(open('c:/test/spanish.txt','rb').read())

If you see 'a\xf1o' then use charset="windows-1252" else if you see
'a\xc3\xb1o' then use charset="utf-8" else ????

Based on your responses to Martin, it appears that your file is
actually windows-1252 but you are telling browsers that it is utf-8.

Another check: if the file is utf-8, then doing
open('c:/test/spanish.txt','rb').read().decode('utf8')
should be OK; if it's not valid utf8, it will complain.
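
The two checks above can be folded into one small function. This is a Python 3 sketch rather than the Python 2.3 used in the thread; `guess_encoding` is a hypothetical name, and falling back to windows-1252 is an assumption that fits this thread (a Western European Windows machine), not a general-purpose detector.

```python
def guess_encoding(raw: bytes) -> str:
    """Return 'utf-8' if raw is valid UTF-8, else fall back to 'windows-1252'."""
    try:
        raw.decode("utf-8")  # strict by default: raises on invalid sequences
    except UnicodeDecodeError:
        return "windows-1252"  # assumed fallback, see the note above
    return "utf-8"


print(guess_encoding(b"a\xf1o"))      # the bytes reported in this thread
print(guess_encoding(b"a\xc3\xb1o"))  # the same word saved as UTF-8
```

Note that pure ASCII input is also valid UTF-8, so a file with no accented characters would be reported as utf-8 here.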

Yet another check: open the file with Notepad. Do File/SaveAs, and
look at the Encoding box -- ANSI or UTF-8?

HTH,
John
Dec 10 '07 #6
Just want to make sure, how exactly are you doing that?
>Thanks for the reply, Jack. I tried setting mode to binary but it had no
>effect.

Dec 11 '07 #7
import sys

if sys.platform == "win32":
    import os, msvcrt
    msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)

Dec 11 '07 #8
Hi John:
Thanks for responding.
>Look at your file using
>print repr(open('c:/test/spanish.txt','rb').read())
>If you see 'a\xf1o' then use charset="windows-1252"
I did this ... no change ... still see 'a\xf1o'
>else if you see 'a\xc3\xb1o' then use charset="utf-8" else ????
>Based on your responses to Martin, it appears that your file is
>actually windows-1252 but you are telling browsers that it is utf-8.
>Another check: if the file is utf-8, then doing
>open('c:/test/spanish.txt','rb').read().decode('utf8')
>should be OK; if it's not valid utf8, it will complain.
No. This causes a decode error:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-4: invalid
data
args = ('utf8', 'a\xf1o', 1, 5, 'invalid data')
encoding = 'utf8'
end = 5
object = 'a\xf1o'
reason = 'invalid data'
start = 1

>Yet another check: open the file with Notepad. Do File/SaveAs, and
>look at the Encoding box -- ANSI or UTF-8?
Notepad says it's ANSI

Thanks. What now? Also, this is a general problem for me, whether I read
from a file or read from an html text field, or read from an html text area.
So I'm looking for a general solution. If it helps to debug by reading from
textarea or text field, let me know.
Dec 11 '07 #9
On Dec 12, 4:46 am, "weheh" <we...@verizon.net> wrote:
>Hi John:
>Thanks for responding.
>>Look at your file using
>>print repr(open('c:/test/spanish.txt','rb').read())
>>If you see 'a\xf1o' then use charset="windows-1252"
>
>I did this ... no change ... still see 'a\xf1o'
So it's not utf-8, it's windows-1252, so stop lying to browsers: like
I said, use charset="windows-1252"
>>else if you see 'a\xc3\xb1o' then use charset="utf-8" else ????
>>Based on your responses to Martin, it appears that your file is
>>actually windows-1252 but you are telling browsers that it is utf-8.
>>Another check: if the file is utf-8, then doing
>>open('c:/test/spanish.txt','rb').read().decode('utf8')
>>should be OK; if it's not valid utf8, it will complain.
>
>No. this causes decode error:
>
>UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-4: invalid
>data
No what? YES, the "decode error" is complaining that the data supplied
is NOT valid utf-8 data. So it's not utf-8, it's windows-1252, so stop
lying to browsers: like I said, use charset="windows-1252"
>args = ('utf8', 'a\xf1o', 1, 5, 'invalid data')
>encoding = 'utf8'
>end = 5
>object = 'a\xf1o'
>reason = 'invalid data'
>start = 1
>>Yet another check: open the file with Notepad. Do File/SaveAs, and
>>look at the Encoding box -- ANSI or UTF-8?
>
>Notepad says it's ANSI
That's correct (in Microsoft jargon) -- it's NOT utf-8. It's
windows-1252, so stop lying to browsers: like I said, use
charset="windows-1252"
>Thanks. What now?
Listen to the Bellman: "What I tell you three times is true".
Your file is encoded using windows-1252, NOT utf-8.
You need to use charset="window s-1252".
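
John's fix is to declare windows-1252. If one would rather keep sending charset=utf-8, as weheh's program did, an alternative that nobody in the thread proposed is to decode the file from its real encoding and re-encode it as UTF-8 before printing. A minimal Python 3 sketch, assuming the file really is windows-1252; `render_utf8_page` is a hypothetical name.

```python
import html


def render_utf8_page(raw: bytes, source_encoding: str = "windows-1252") -> bytes:
    """Decode file bytes from their real encoding and build a UTF-8 CGI response."""
    text = raw.decode(source_encoding)  # bytes -> str, using the file's encoding
    body = "<html><body><p>%s</p></body></html>" % html.escape(text)
    header = "Content-Type: text/html; charset=utf-8\r\n\r\n"
    return (header + body).encode("utf-8")  # str -> bytes, matching the header


# b"a\xf1o" stands in for the contents of c:/test/spanish.txt.
response = render_utf8_page(b"a\xf1o")
```

In a real Python 3 CGI script the result would be written with `sys.stdout.buffer.write(response)` so that no second, implicit encoding step happens on the way out.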

>Also, this is a general problem for me, whether I read
>from a file or read from an html text field, or read from an html text area.
>So I'm looking for a general solution. If it helps to debug by reading from
>textarea or text field, let me know.
If you are creating a file, you should know what its encoding is. As I
said earlier, *every* file is encoded -- so-called "Unicode" files on
Windows are encoded using UTF-16LE. If you don't explicitly specify the
encoding, it will typically be the default encoding for your locale
(e.g. cp1252 in Western Europe).
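
To make "every file is encoded" concrete, here is a small Python 3 sketch (not the thread's Python 2.3) that writes the same word with three explicitly named encodings and reads the raw bytes back. The temporary directory and file names are only for illustration.

```python
import locale
import os
import tempfile

word = "a\u00f1o"  # "año"
encoded = {}

with tempfile.TemporaryDirectory() as tmp:
    for name in ("cp1252", "utf-8", "utf-16-le"):
        path = os.path.join(tmp, name + ".txt")
        # Name the encoding explicitly instead of relying on the locale default.
        with open(path, "w", encoding=name) as f:
            f.write(word)
        with open(path, "rb") as f:
            encoded[name] = f.read()

for name, raw in encoded.items():
    print(name, repr(raw))

# What open() would typically fall back to if no encoding were given
# (platform- and locale-dependent).
print(locale.getpreferredencoding(False))
```

The cp1252 and utf-8 results are exactly the 'a\xf1o' and 'a\xc3\xb1o' byte strings discussed earlier in the thread.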

If you are reading a file created by others and its encoding is not
known, you will have to inspect the file and/or guess (using whatever
knowledge you have about the language/locale of the creator).

"whether I ... read from an html text field, or read from an html text
area": isn't that what "charset" is for?
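
For the form-field case, browsers generally submit form data in the encoding of the page that contains the form, so the script has to decode with that same charset. A minimal Python 3 sketch using `urllib.parse.parse_qs`; the field name `word` and both query strings are hypothetical.

```python
from urllib.parse import parse_qs

# Hypothetical query strings for a form field named "word".
from_cp1252_page = "word=a%F1o"     # form served as windows-1252
from_utf8_page = "word=a%C3%B1o"    # form served as utf-8

# Decode each with the charset of the page it came from.
print(parse_qs(from_cp1252_page, encoding="windows-1252"))
print(parse_qs(from_utf8_page, encoding="utf-8"))

# Decoding with the wrong charset does not raise by default (errors="replace");
# the undecodable byte becomes U+FFFD, the replacement character.
mismatch = parse_qs(from_cp1252_page, encoding="utf-8")
print(mismatch)
```

The first two calls both recover the same text, which is the general point: the bytes differ by encoding, and the decoder has to be told which one was used.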

HTH,
John
Dec 11 '07 #10
