
Encoding Questions

1. I download a page in python using urllib and now want to convert and
keep it as utf-8? I already know the original encoding of the page.
What calls should I make to convert the encoding of the page to utf8?
For example, let's say the page is encoded in gb2312 (Simplified Chinese)
and I want to keep it in utf-8?

2. Is this a good approach? Can I keep any pages in any languages in
this way and return them when requested using utf-8 encoding?

3. Does python 2.4 support all encodings?

By the way, I have set my default encoding in Python to utf8.

I appreciate any help.

-JF

Jul 19 '05 #1
5 Replies


ja***@feghhi.com wrote:
1. I download a page in python using urllib and now want to convert and
keep it as utf-8? I already know the original encoding of the page.
What calls should I make to convert the encoding of the page to utf8?
For example, let's say the page is encoded in gb2312 (Simplified Chinese)
and I want to keep it in utf-8?
Something like
import urllib
data = urllib.urlopen(...).read()         # raw bytes as served
unicodeData = data.decode('gb2312')       # byte string -> unicode object
utf8Data = unicodeData.encode('utf-8')    # unicode object -> UTF-8 bytes

You may want to supply the errors parameter to decode() or encode(); see the docs for details.
http://docs.python.org/lib/string-methods.html
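
To make the role of the errors parameter concrete, here is a minimal sketch. It is written for Python 3, where the byte string is a bytes object and the decoded text is a str; under Python 2.4 the same decode()/encode() calls apply to str and unicode objects. The helper name to_utf8 and the sample bytes are made up for illustration and stand in for the downloaded page.

```python
def to_utf8(raw_bytes, source_encoding, errors="strict"):
    """Decode bytes from source_encoding, then re-encode them as UTF-8."""
    text = raw_bytes.decode(source_encoding, errors)
    return text.encode("utf-8")

# Two Chinese characters encoded as GB2312 (a stand-in for the page body).
sample = "\u4e2d\u6587".encode("gb2312")
utf8_data = to_utf8(sample, "gb2312")

# Append a byte that is not valid GB2312 to simulate a damaged page.
broken = sample + b"\xff"

# With the default errors="strict" the bad byte raises an exception.
try:
    to_utf8(broken, "gb2312")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With errors="replace" the bad byte becomes U+FFFD and the rest survives.
lenient = to_utf8(broken, "gb2312", errors="replace")
```

Whether to raise or to substitute a replacement character depends on whether a partly damaged page is still worth keeping.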
2. Is this a good approach? Can I keep any pages in any languages in
this way and return them when requested using utf-8 encoding?
Yes, as long as you know reliably what the encoding is for the source pages.
3. Does python 2.4 support all encodings?


I doubt it :-) but it supports many encodings. The list is at
http://docs.python.org/lib/standard-encodings.html
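
If you would rather check at run time than read the list, something like the following sketch should work. It relies on codecs.lookup(), which raises LookupError for a name the installation has no codec for; the helper name is_supported is only illustrative.

```python
import codecs

def is_supported(encoding_name):
    """Return True if this Python installation has a codec for the name."""
    try:
        codecs.lookup(encoding_name)
        return True
    except LookupError:
        return False

has_gb2312 = is_supported("gb2312")
has_bogus = is_supported("no-such-encoding")
```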

Kent
Jul 19 '05 #2


<ja***@feghhi.com> wrote in message
news:11**********************@o13g2000cwo.googlegroups.com...
| 1. I download a page in python using urllib and now want to convert and
| keep it as utf-8? I already know the original encoding of the page.
| What calls should I make to convert the encoding of the page to utf8?
| For example, let's say the page is encoded in gb2312 (Simplified Chinese)
| and I want to keep it in utf-8?

Something like:

utf8_s = s.decode('gb2312').encode('utf-8')

- with s being the simplified chinese string - should work.

|
| 2. Is this a good approach? Can I keep any pages in any languages in
| this way and return them when requested using utf-8 encoding?
|
| 3. Does python 2.4 support all encodings?

See http://docs.python.org/lib/standard-encodings.html for an overview.

|
| By the way, I have set my default encoding in Python to utf8.
|

Why would you want to do that?

--

Vincent Wehren

|
| I appreciate any help.
|
| -JF
|
Jul 19 '05 #3

Kent Johnson wrote:
Something like
data = urllib.urlopen(...).read()
unicodeData = data.decode('gb2312')
utf8Data = unicodeData.encode('utf-8')

You may want to supply the errors parameter to decode() or encode(); see
the docs for details.
http://docs.python.org/lib/string-methods.html


In addition, for an HTML page, you might need to update the META element
for the content-type HTTP header. For an XHTML page, you might need to
update/remove the XML declaration.
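
As a rough illustration of that step, the sketch below works on the already decoded text, before it is re-encoded as UTF-8. It is an assumption-laden shortcut: the regular expressions and the helper name retarget_charset are invented here, it chooses to remove the XML declaration rather than update it, and a real HTML parser would be more robust than pattern matching.

```python
import re

# Matches "charset=" plus an optional opening quote, then the old charset name.
_CHARSET_RE = re.compile(r'(charset\s*=\s*["\']?)[\w.:-]+', re.IGNORECASE)
# Matches an XML declaration at the very start of the document.
_XML_DECL_RE = re.compile(r'^\s*<\?xml[^>]*\?>\s*')

def retarget_charset(html_text, new_charset="utf-8"):
    """Drop a leading XML declaration and point the first charset at new_charset."""
    without_decl = _XML_DECL_RE.sub("", html_text, count=1)
    return _CHARSET_RE.sub(
        lambda m: m.group(1) + new_charset, without_decl, count=1
    )

meta = '<meta http-equiv="Content-Type" content="text/html; charset=gb2312">'
fixed_meta = retarget_charset(meta)

xhtml = '<?xml version="1.0" encoding="gb2312"?>\n<html></html>'
fixed_xhtml = retarget_charset(xhtml)
```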

Regards,
Martin
Jul 19 '05 #4

Thanks for the replies. As for why I set my default encoding to utf-8
in Python: I did it a while ago, and I think it was because reading
some utf-8 strings from the database raised errors, since there were
some chars it couldn't recognize in the standard encoding. When I made
the change, the errors didn't happen anymore.

Does it make sense?

-JF

Jul 19 '05 #5

ja***@feghhi.com wrote:
Thanks for the replies. As for why I set my default encoding to utf-8
in Python: I did it a while ago, and I think it was because reading
some utf-8 strings from the database raised errors, since there were
some chars it couldn't recognize in the standard encoding. When I made
the change, the errors didn't happen anymore.

Does it make sense?


No. If reading the strings from the database already gives an exception
(i.e. without any processing of these strings), that is a bug in the
database. It is also unlikely that this is what actually happened.

More likely, you are reading the strings from the database, and then
combining them explicitly with Unicode strings. Instead of changing
the default encoding, you should tell your database adapter to return
the strings as Unicode objects; if this is not supported, you should
convert them to Unicode objects in the process of reading them.
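
A minimal sketch of converting at the point of reading, assuming the database hands back UTF-8 byte strings. It is written for Python 3, where mixing bytes and text fails outright instead of triggering an implicit conversion through the default encoding as in Python 2; the names rows_to_text and raw_rows are hypothetical, and the literal byte strings stand in for values fetched through a database adapter.

```python
def rows_to_text(raw_rows, db_encoding="utf-8"):
    """Decode each byte string as it is read, so later code sees only text."""
    return [raw.decode(db_encoding) for raw in raw_rows]

# Stand-in for values returned by a database adapter as UTF-8 bytes.
raw_rows = [b"caf\xc3\xa9", b"na\xc3\xafve"]

names = rows_to_text(raw_rows)

# Combining with other text is now safe and needs no default-encoding change.
greeting = u"Hello, " + names[0]
```

Decoding once at the boundary keeps the encoding decision in one place, so the rest of the program does not depend on the interpreter's default encoding.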

Regards,
Martin
Jul 19 '05 #6
