Html character entity conversion

pak.andrei

Here is my script:

from mechanize import *
from BeautifulSoup import *
import StringIO
b = Browser()
f = b.open("http://www.translate.ru/text.asp?lang=ru")
b.select_form(nr=0)
b["source"] = "hello python"
html = b.submit().get_data()
soup = BeautifulSoup(html)
print soup.find("span", id = "r_text").string

OUTPUT:
&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;
----------
In Russian it looks like:
"привет питон"

How can I translate this using standard Python libraries?

--
Pak Andrei, http://paxoblog.blogspot.com, icq://97449800

Jul 30 '06 #1

Claudio Grondi

Translate to what and with what purpose?

Assuming your intention is to get a Python Unicode string, what about:

strHTML = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;'
strUnicodeHexCode = strHTML.replace('&#','\u').replace(';','')
strUnicode = eval("u'%s'"%strUnicodeHexCode)

?

I am sure there is a more elegant and direct solution, but I just wanted
to provide a quick response here.

Claudio Grondi

Jul 30 '06 #2

danielx

I'm having trouble understanding how your script works (what would a
"BeautifulSoup" function do?), but assuming your intent is to find
character references in an HTML document, you might try using
the HTMLParser class in the HTMLParser module. This class provides
several handler methods for you to override. One of them is
handle_charref. It is called with one argument, the name of the
reference, which includes only the number part. HTMLParser is a lot more
powerful than that, though. There may be something more lightweight out
there that will accomplish what you want. Then again, you might be able
to find a use for all that power :P.
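
As a rough sketch of the handle_charref approach described above: this is written for Python 3, where the module is called html.parser rather than HTMLParser. The class name CharRefDecoder and the helper decode_charrefs are made up for illustration, and convert_charrefs=False is passed because otherwise the Python 3 parser decodes the references itself and never calls the handler.

```python
from html.parser import HTMLParser  # Python 3 home of the HTMLParser class


class CharRefDecoder(HTMLParser):
    """Collect text, replacing numeric character references with characters."""

    def __init__(self):
        # convert_charrefs=False makes the parser call handle_charref()
        # instead of silently decoding the references.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        # Ordinary text between references is kept unchanged.
        self.parts.append(data)

    def handle_charref(self, name):
        # name is "1087" for &#1087; and "x43f" for &#x43f;
        if name[0] in "xX":
            self.parts.append(chr(int(name[1:], 16)))
        else:
            self.parts.append(chr(int(name)))


def decode_charrefs(text):
    """Hypothetical helper: feed text through the parser and join the pieces."""
    parser = CharRefDecoder()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)


print(decode_charrefs("&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;"))
```

Named references such as &amp; would need handle_entityref as well; the sketch only covers the numeric ones asked about here.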

Jul 30 '06 #3

pak.andrei

Thank you, Claudio.
A really interesting solution, but it doesn't work...

In [19]: strHTML = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;'

In [20]: strUnicodeHexCode = strHTML.replace('&#','\u').replace(';','')

In [21]: strUnicode = eval("u'%s'"%strUnicodeHexCode)

In [22]: print strUnicode
---------------------------------------------------------------------------
exceptions.UnicodeEncodeError          Traceback (most recent call last)

C:\Documents and Settings\dron\<ipython console>

C:\usr\lib\encodings\cp866.py in encode(self, input, errors)
     16     def encode(self,input,errors='strict'):
     17
---> 18         return codecs.charmap_encode(input,errors,encoding_map)
     19
     20     def decode(self,input,errors='strict'):

UnicodeEncodeError: 'charmap' codec can't encode characters in position
0-5: character maps to <undefined>

In [23]: print strUnicode.encode("utf-8")
сВЗсВИсВАсБ┤сБ╖сВР сВЗсВАсВРсВЖсВЕ
<-- it's not my string "привет питон"

In [24]: strUnicode.encode("utf-8")
Out[24]:
'\xe1\x82\x87\xe1\x82\x88\xe1\x82\x80\xe1\x81\xb4\xe1\x81\xb7\xe1\x82\x90
\xe1\x82\x87\xe1\x82\x80\xe1\x82\x90\xe1\x82\x86\xe1\x82\x85'
<-- and too many chars

Jul 30 '06 #4

pak.andrei

Thank you for the response.
It doesn't matter what 'BeautifulSoup' is...
The general question is:

How can I convert the encoded string

sEncodedHtmlText = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;'

into a human-readable one:

sDecodedHtmlText == 'привет питон'

Jul 30 '06 #5

Marc 'BlackJack' Rintsch

Have you tried a more recent version of BeautifulSoup? IIRC current
versions always decode text to unicode objects before returning them.

Ciao,
Marc

Jul 30 '06 #6

Claudio Grondi

Have you considered that the HTML page specifies charset=windows-1251
in its
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1251"> tag?
You are apparently on Linux or so, so I can't track this problem down
having only a Windows box here, but in the meantime I have found
another problem with it:
I erroneously assumed that the numbers in &#1087; are hexadecimal,
but they are decimal, so it is necessary to do hex(int('1087')) on them
to get the right code to put into eval().
Now that you know the idea, I hope you will succeed as I did with:

>>> lstIntUnicodeDecimalCode = strHTML.replace('&#','').split(';')
>>> lstIntUnicodeDecimalCode
['1087', '1088', '1080', '1074', '1077', '1090', ' 1087', '1080',
'1090', '1086', '1085', '']
>>> lstIntUnicodeDecimalCode = lstIntUnicodeDecimalCode[:-1]
>>> lstHexUnicode = [ hex(int(item)) for item in lstIntUnicodeDecimalCode]
>>> lstHexUnicode
['0x43f', '0x440', '0x438', '0x432', '0x435', '0x442', '0x43f', '0x438',
'0x442', '0x43e', '0x43d']
>>> eval( 'u"%s"'%''.join(lstHexUnicode).replace('0x','\u0' ) )
u'\u043f\u0440\u0438\u0432\u0435\u0442\u043f\u0438\u0442\u043e\u043d'
>>> strUnicode = eval( 'u"%s"'%''.join(lstHexUnicode).replace('0x','\u0' ) )
>>> print strUnicode
приветпитон

Sorry for that mess not taking the space into consideration, but I think
you can get the idea anyway.
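
The hexadecimal/decimal mix-up described above can be checked directly. This is only a small sketch in Python 3 syntax (chr() there plays the role of unichr()); the variable names wrong and right are illustrative.

```python
import unicodedata

wrong = chr(0x1087)  # what u'\u1087' means: 1087 read as hexadecimal
right = chr(1087)    # what &#1087; means: 1087 read as decimal (U+043F)

print(unicodedata.name(right))

# The misread character needs three UTF-8 bytes, the Cyrillic one only two,
# which matches the "too many chars" observation earlier in the thread.
print(len(wrong.encode("utf-8")), len(right.encode("utf-8")))

# A cp866 console can show the Cyrillic letter but not the misread one.
try:
    wrong.encode("cp866")
except UnicodeEncodeError as exc:
    print("cp866 cannot encode it:", exc.reason)
print(right.encode("cp866"))
```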

Claudio Grondi

Jul 30 '06 #7

John Machin

Claudio Grondi wrote:

Sorry for that mess not taking the space into consideration, but I think
you can get the idea anyway.

I hope he *doesn't* get that "idea".

#>>> strHTML = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090;&#1087;&#1080;&#1090;&#1086;&#1085;'
#>>> strUnicode = [unichr(int(x)) for x in strHTML.replace('&#','').split(';') if x]
#>>> strUnicode
[u'\u043f', u'\u0440', u'\u0438', u'\u0432', u'\u0435', u'\u0442',
u'\u043f', u'\u0438', u'\u0442', u'\u043e', u'\u043d']
#>>>
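
For anyone following along on Python 3, a minimal equivalent of the session above might look like this. It assumes the input contains nothing but numeric references; chr() replaces unichr(), and the names str_html, chars and text are illustrative.

```python
str_html = "&#1087;&#1088;&#1080;&#1074;&#1077;&#1090;&#1087;&#1080;&#1090;&#1086;&#1085;"

# Strip the "&#" prefixes, split on ";" and skip the empty trailing piece.
chars = [chr(int(x)) for x in str_html.replace("&#", "").split(";") if x]

# Join the single characters back into one string.
text = "".join(chars)
print(text)
```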

Jul 31 '06 #8

Claudio Grondi

Knowing about the built-in function unichr() is a good thing, but
there are still drawbacks. For example (not tested!):
'100x hallo Python' translates to
'100x &#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1055;&#1080;&#1090;&#1086;&#1085;'
and that can't be handled just by using unichr() instead of the eval()
stuff, because the underlying approach of using .replace() and .split()
works only on the given example, not in the general case.
I am just too lazy to sit down and write code that extracts the &#....;
sequences from the HTML and converts only them, leaving the rest of the
string unchanged, to arrive at a solution that works in the general
case. (It should not be hard, and I suppose the OP has it already :-)
if his Python skills are at the level of playing around with the
mechanize module.)
I am still convinced that there must be a more elegant and direct
solution, so the subject is still fully open for improvements towards
the actual final goal.
I suppose that, in addition to unichr(), one can also use unicode() as
a replacement for eval().
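
To make the drawback concrete, here is a small Python 3 sketch of what happens when the replace/split approach meets literal text mixed with references. The sample string mixed is made up for illustration.

```python
mixed = "100x &#1087;&#1088;&#1080;&#1074;&#1077;&#1090;"

pieces = mixed.replace("&#", "").split(";")
# pieces[0] is now "100x 1087": the literal text is glued to the first number.

try:
    decoded = "".join(chr(int(p)) for p in pieces if p)
except ValueError as exc:
    decoded = None
    print("replace/split approach failed:", exc)
```

int() cannot parse the first piece, so the conversion stops with a ValueError before any character is produced.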

To Andrei: can you please post here what you have finally arrived at?

Claudio Grondi

Jul 31 '06 #9

Duncan Booth

pa********@gmail.com wrote:

How can I convert the encoded string

sEncodedHtmlText = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;'

into a human-readable one:

sDecodedHtmlText == 'привет питон'

How about:

>>> import re
>>> sEncodedHtmlText = 'text: &#1087;&#1088;&#1080;&#1074;&#1077;&#1090;&#1087;&#1080;&#1090;&#1086;&#1085;'
>>> def unescape(m):
...     return unichr(int(m.group(0)[2:-1]))
...
>>> print re.sub('&#[0-9]+;', unescape, sEncodedHtmlText)
text: ???????????

I'm afraid my newsreader couldn't cope with either your original text or my
output, but I think this gives the string you wanted. You probably also
ought to decode sEncodedHtmlText to unicode first; otherwise anything which
isn't an entity escape will be converted to unicode using the default ascii
encoding.
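
For readers on Python 3, the same idea might be sketched as follows. The helper names unescape_numeric and decode_numeric_refs are invented here, the regular expression is extended to also accept hexadecimal references, and the bytes are decoded as windows-1251 only because that is the charset reported for the page earlier in this thread. The standard library's html.unescape() is shown alongside as a ready-made alternative.

```python
import html
import re


def unescape_numeric(match):
    """Turn one &#NNNN; or &#xHHHH; match into the character it names."""
    body = match.group(1)
    if body[0] in "xX":
        return chr(int(body[1:], 16))
    return chr(int(body))


def decode_numeric_refs(text):
    """Replace only the numeric references; all other text is left alone."""
    return re.sub(r"&#([0-9]+|[xX][0-9a-fA-F]+);", unescape_numeric, text)


raw = b"text: &#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;"

# Decode the bytes to a str first, then deal with the references.
text = raw.decode("windows-1251")

print(decode_numeric_refs(text))
print(html.unescape(text))  # also handles named references such as &amp;
```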

Aug 1 '06 #10

yichun

pa********@gmail.com wrote:

Thank you for response.
It doesn't matter what is 'BeautifulSoup'...

However, the best solution is to ask BeautifulSoup to do that for you.
If you do

soup = BeautifulSoup(your_html_page, convertEntities="html")

you should not need to worry about the problem you had. This converts all
the HTML entities (the five you see as soup.entitydefs) and all the
"&#xxx;" references to their Python unicode strings.

yichun

Sep 10 '06 #11
