Bytes | Software Development & Data Engineering Community
Unicode from Web to MySQL


I'm trying to grab a document off the Web and toss it
into a MySQL database, but I keep running into the
various encoding problems with Unicode (that aren't
a problem for me with GB2312, Big5, etc.)

What I'd like is something as simple as:

CREATE TABLE junk (junklet VARCHAR(2500) CHARACTER SET utf8);

import MySQLdb, re,urllib

data = urllib.urlopen('http://localhost/test.html').read()

data2 = ???
....
c.execute(''' INSERT INTO junk ( junklet) VALUES ( '%s') ''' % data2 )

where data2 is somehow the UTF-8 converted version of the original Web page.

Additionally, I'd like to be able to do:

body_expr = re.compile('''<!-- MAIN -->(.*)<!-- /MAIN -->''')

data = urllib.urlopen('http://localhost/test.html').read()

main_body = body_expr.search(data).group(1)

and insert that into the database.

I'm sitting with a dozen explanations from the Web explaining
how to do this,
0) decode('utf-8','ignore') or 'strict', or 'replace'...
1) using re.compile('''(?u)<!-- MAIN>(.*)<!-- /MAIN -->'''),
re.UNICODE+re.IGNORECASE+re.MULTILINE+re.DOTALL)
2) Convert to unicode before UTF-8
3) replace quotation marks within the SQL statement:
data2.replace(u'"',u'\\"')

etc., etc., but after numerous tries in the end I still keep getting
either SQL errors or
the dreaded 'ascii' codec can't decode byte ... in position ...' errors.

Can someone give me any explanation of how to do this easily?

Thanks,
Bill

Jul 18 '05 #1
Bill Eldridge wrote in message ...
etc., etc., but after numerous tries in the end I still keep getting
either SQL errors or
the dreaded 'ascii' codec can't decode byte ... in position ...' errors.
Here's your clue: your string contains a byte which is not representable by
ascii.
Can someone give me any explanation of how to do this easily?


Quickstart guide:
You first need to decode your string to unicode. You do this by
'stringfromweb'.decode('encoding-of-the-string'). So if you grab a web page
that's in latin-1, you do 'stringfromweb'.decode('latin-1') and get unicode.
If you later want utf-8 (to plunk into SQL), take that unicode and
.encode('utf8').
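The quickstart above can be sketched as a round trip. This is written for modern Python 3, where str is unicode and the old byte-oriented str type is spelled bytes, but the decode-then-encode flow is the same:

```python
# A minimal sketch of the decode-then-encode round trip (modern Python 3,
# where str is unicode text and bytes is the old 8-bit string type).
raw = b'caf\xe9'               # Latin-1 bytes, as read off the wire
text = raw.decode('latin-1')   # bytes -> unicode text ('café')
utf8 = text.encode('utf-8')    # unicode text -> UTF-8 bytes for SQL
```

The one decision you cannot skip is naming the source encoding ('latin-1' here); the target encoding ('utf-8') is whatever the database expects.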

Path-to-understanding:
You need to understand how unicode plays in to this first.

Unicode is not an encoding. Unicode is an abstract mapping of numbers
(called code points) to letters. Pure, undistilled "Unicode" is what you
see in those huge charts which show a number on the left and a long
uppercase letter/symbol description on the right. Unicode itself has nothing
to do with bytes, or even with computers.

A Python Unicode object is just that: an ordered sequence of unicode code
points. It has no natural byte representation. If you want that, you need
to encode it.

Note that unicode objects have no "decode" method. This is because unicode
is a LACK of encoding! Encoding maps symbols to byte representations, and a
unicode object is the explicit lack of a byte representation. So there are
no bytes to decode from. (Now of course the computer needs *some*
representation, because all it knows is bytes, but that could be anything,
and is entirely an implementation detail that you don't need to know about.
But you can see it with the 'unicode-internal' codec.)

A Python str object is an ordered sequence of 8-bit bytes. It is not really
a string--that's a holdover from the bygone days of pre-unicode Python.
When you encode a unicode object, you get raw bytes in some representation
of unicode characters, which are held by a str. When you want a unicode
object, you give it a str and a *known encoding*.

Now, what is the encoding of a str? You see this is like a strange Koan,
because bytes is bytes. Bytes have no intrinsic meaning until we give them
some. So whenever you decode a string to get unicode, you MUST supply the
encoding of the string!

There are ways to specify a default encoding for strings in Python (see your
site.py and sys.get/setdefaultencoding), but the default default is ascii.
Hence if byte '\xef' is found in a str, any attempt to encode it will choke,
because that byte is not in the 'ascii' encoding and thus the claim that
this str is encoded in ascii is false. (str.encode(codec) is really
shorthand for str.decode(default-encoding) -> unicode.encode(codec) )

Now, let's examine:

>>> 'abc'.decode('utf8')
u'abc'

"Take three bytes 'abc', and decode it as if it were a unicode string
encoded as utf8."

>>> '\xef'.encode('utf8')
Traceback (most recent call last):
...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0:
ordinal not in range(128)
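As an aside, modern Python 3 removed the implicit ascii decode that produces this confusing traceback: bytes objects have no .encode method at all, so the same mistake fails immediately and obviously. A small sketch:

```python
# Sketch: in Python 3 the implicit ascii decode is gone. A bytes object has
# no .encode method, so 'encode a byte string' is no longer even possible --
# you are forced to decode (with an explicit encoding) first.
raw = b'\xef'
has_encode = hasattr(raw, 'encode')   # False: decode comes first
try:
    raw.decode('ascii')               # same failure, now clearly a *decode* error
    err = None
except UnicodeDecodeError as exc:
    err = str(exc)
```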

What you really need to do, then, is:
PureUnicodeUnsulliedByBits = stringfromtheBADBADweb.decode('latin-1')


Or:

import MySQLdb, re,urllib

data = urllib.urlopen('http://localhost/test.html').read()

data2 = data.decode(<the-encoding-of-this-string>).encode('utf8')
....
c.execute(''' INSERT INTO junk ( junklet) VALUES ( '%s') ''' % data2 )

Finding the encoding of that string from the web is where the tears come in.
If you're lucky, urllib.urlopen('http://....').info().getencoding() will
give it to you. However, this gets its info from the http headers, and if
they don't specify encoding, it defaults to '7bit'. But the html page
itself *might* have a different idea about its own encoding in the <meta>
element, 'content' attribute, which may be of the form "text/html;
charset=ISO-8859-1". Or it might not, who knows?

In other words, there is no standard, 100% reliable method of getting the
encoding of a web page. In an ideal world, the http header would have it,
and that's that. In the real world, you have to juggle various combinations
of information, missing information, and disinformation from the http
protocol header's info, the html file's meta info, and charset guessing
algorithms (look for Enca).

There might be a way to get urllib to request an encoding (as browsers do),
so that the http header will at least give some slightly more useful
information back, but I don't know how. As it is, it will almost always not
specify the charset if urllib is used, forcing you to look in the html file
itself.
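The juggling described above (http header first, then the html meta element, then a fallback) can be sketched as a small helper. The regexes here are illustrative, not a real HTML parser, and the function name is invented for this sketch:

```python
import re

def guess_encoding(content_type_header, html_bytes):
    """Hedged sketch: charset from the HTTP Content-Type header if present,
    else from a <meta ... charset=...> fragment near the top of the page,
    else the latin-1 fallback mentioned elsewhere in this thread."""
    m = re.search(r'charset=([\w-]+)', content_type_header or '', re.I)
    if m:
        return m.group(1)
    # Only scan the head of the document for a meta charset declaration.
    m = re.search(br'charset=([\w-]+)', html_bytes[:2048], re.I)
    if m:
        return m.group(1).decode('ascii')
    return 'latin-1'

html = b'<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'
```

So `guess_encoding('text/html', html)` falls through to the meta tag and returns 'ISO-8859-1', while an explicit header charset wins outright.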

But once you get the encoding, everything is fine....
--
Francis Avila

Jul 18 '05 #2

"Bill Eldridge" <bi**@rfa.org> wrote in message news:ma*************************************@pytho n.org...

I'm trying to grab a document off the Web and toss it
into a MySQL database, but I keep running into the
various encoding problems with Unicode (that aren't
a problem for me with GB2312, BIG 5, etc.)

What I'd like is something as simple as:

CREATE TABLE junk (junklet VARCHAR(2500) CHARACTER SET utf8);

import MySQLdb, re,urllib

data = urllib.urlopen('http://localhost/test.html').read()
You've got 8-bit data here. urllib doesn't currently handle encoding issues;
maybe submitting an sf feature request will change that. But right now you have
to do it yourself. Scan (or parse) the html header for the encoding; if it's absent,
grab the encoding from the http header. If that's absent too, then the encoding
AFAIR is latin-1. So your code should look like:
connection = urllib.urlopen('http://localhost/test.html')
encoding = 'latin-1'
header_encoding = get_http_header_encoding(connection)
data = connection.read()
content_encoding = get_http_content_encoding(data)
if header_encoding:
    encoding = header_encoding
if content_encoding:
    encoding = content_encoding

data2 = ???
data2 = data.decode(encoding,'replace')
...
c.execute(''' INSERT INTO junk ( junklet) VALUES ( '%s') ''' % data2 )
Quick scanning of mysql-python docs reveals that you should also
call connect with unicode='utf-8' parameter. Have you done that?

where data2 is somehow the UTF-8 converted version of the original Web page.

Additionally, I'd like to be able to do:

body_expr = re.compile('''<!-- MAIN -->(.*)<!-- /MAIN -->''')

data = urllib.urlopen('http://localhost/test.html').read()

main_body = body_expr.search(data).group(1)
Don't do that to the data var, because data is an 8-bit string which
contains bytes, not characters; use data2 instead.

As a rule of thumb you should decode to unicode as soon as you
can and leave unicode world as late as you can. And use unicode
aware APIs when they are available, this way you won't even
need to encode unicode objects.
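The decode-early, encode-late rule above can be sketched end to end. This is modern Python 3 syntax with an invented sample document, and note the re.DOTALL flag, which the search needs so that '.*' can span newlines inside the marked block:

```python
import re

# Sketch of the rule: decode to unicode text first, run the regex on text
# (not on raw bytes), and only encode back to UTF-8 at the database boundary.
raw = '<!-- MAIN -->Xin ch\u00e0o\nth\u1ebf gi\u1edbi<!-- /MAIN -->'.encode('utf-8')
text = raw.decode('utf-8')                        # decode as early as possible

body_expr = re.compile(r'<!-- MAIN -->(.*)<!-- /MAIN -->', re.DOTALL)
main_body = body_expr.search(text).group(1)       # match characters, not bytes
utf8 = main_body.encode('utf-8')                  # encode as late as possible
```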


and insert that into the database, and most likely I need to

I'm sitting with a dozen explanations from the Web explaining
how to do this,
0) decode('utf-8','ignore') or 'strict', or 'replace'...
1) using re.compile('''(?u)<!-- MAIN>(.*)<!-- /MAIN -->'''),
re.UNICODE+re.IGNORECASE+re.MULTILINE+re.DOTALL)
You don't need re.UNICODE if you don't use \w, \W, \b, or \B
2) Convert to unicode before UTF-8
Not sure what that means.
3) replace quotation marks within the SQL statement:
data2.replace(u'"',u'\\"')


It's not a unicode problem, is it?

-- Serge.
Jul 18 '05 #3

"Francis Avila" <fr***********@yahoo.com> wrote in message news:vu************@corp.supernews.com...
In other words, there is no standard, 100% reliable method of getting the
encoding of a web page.
There is a standard way. But you're right, it's not 100% reliable.
In an ideal world, the http header would have it, and that's that.
That actually is not a good idea, because it would force http server
writers to parse the html header for the encoding. They would get away
with a configuration parameter forcing all server files to be in one
encoding. But one day somebody will store a file in the wrong
encoding *for sure*. http header encoding is a bad idea.
In the real world, you have to juggle various combinations
of information, missing information, and disinformation from the http
protocol header's info, the html file's meta info, and charset guessing
algorithms (look for Enca).


It's not so bad. Web server and content editor writers are slowly
getting a clue. It used to be very bad, I remember it. I think the peak of
problems was around 1998-99. But nowadays more than 99% of
web documents get encoding right. So having a simple read_unicode()
method of urlopener class would be very useful.
Jul 18 '05 #4
Serge Orlov wrote:
"Bill Eldridge" <bi**@rfa.org> wrote in message news:ma*************************************@python.org...

I'm trying to grab a document off the Web and toss it
into a MySQL database, but I keep running into the
various encoding problems with Unicode (that aren't
a problem for me with GB2312, BIG 5, etc.)

What I'd like is something as simple as:

CREATE TABLE junk (junklet VARCHAR(2500) CHARACTER SET utf8);

import MySQLdb, re,urllib

data = urllib.urlopen('http://localhost/test.html').read()
You've got 8-bit data here. urllib doesn't currently handle encoding issues,
maybe submitting sf feature request will change that. But right now you have
to do it yourself. Scan (or parse) the html header for the encoding; if it's absent,
grab the encoding from the http header.

This part is fairly known - I'm setting up feeds that I'll actually look
at to scrape content, so identifying the encoding will be part of that.
What's driving me crazy is knowing the encoding
but still not getting the data all the way through the chain to MySQL.
If that's absent too, then the encoding
AFAIR is latin-1. So your code should look like:
connection = urllib.urlopen('http://localhost/test.html')
encoding = 'latin-1'
header_encoding = get_http_header_encoding(connection)
data = connection.read()
content_encoding = get_http_content_encoding(data)
if header_encoding:
    encoding = header_encoding
if content_encoding:
    encoding = content_encoding
The latin-1 stuff isn't giving me problems, it's the Asian languages,
but I'll look at the connection end.
data2 = ???


data2 = data.decode(encoding,'replace')
...
c.execute(''' INSERT INTO junk ( junklet) VALUES ( '%s') ''' % data2 )


Quick scanning of mysql-python docs reveals that you should also
call connect with unicode='utf-8' parameter. Have you done that?

No, I haven't, I'll try it.
where data2 is somehow the UTF-8 converted version of the original Web page.

Additionally, I'd like to be able to do:

body_expr = re.compile('''<!-- MAIN -->(.*)<!-- /MAIN -->''')

data = urllib.urlopen('http://localhost/test.html').read()

main_body = body_expr.search(data).group(1)


Don't do that to data var because data is an 8bit string which
contains bytes not characters, use data2 instead.

Alright, I've tried it both ways, but this makes it clearer why.
As a rule of thumb you should decode to unicode as soon as you
can and leave unicode world as late as you can. And use unicode
aware APIs when they are available, this way you won't even
need to encode unicode objects.

and insert that into the database, and most likely I need to

I'm sitting with a dozen explanations from the Web explaining
how to do this,
0) decode('utf-8','ignore') or 'strict', or 'replace'...
1) using re.compile('''(?u)<!-- MAIN>(.*)<!-- /MAIN -->'''),
re.UNICODE+re.IGNORECASE+re.MULTILINE+re.DOTALL)


You don't need re.UNICODE if you don't use \w, \W, \b, or \B

Thanks, I don't.
2) Convert to unicode before UTF-8


Not sure what that means.

data.decode(None,'strict')
or
unicode(data,'unicode','strict')
3) replace quotation marks within the SQL statement:
data2.replace(u'"',u'\\"')


It's not a unicode problem, is it?


Occasionally instead of getting the encoding error I get a SQL syntax error,
and figured somewhere it was misinterpreting something like the end
delimiter.
No proof though, just a guess, so I tried the replaces.

Thanks much,
Bill

Jul 18 '05 #5
What I'd like is something as simple as:

CREATE TABLE junk (junklet VARCHAR(2500) CHARACTER SET utf8);

import MySQLdb, re,urllib

data = urllib.urlopen('http://localhost/test.html').read()


data2 = data.decode(encoding,'replace')
...
c.execute(''' INSERT INTO junk ( junklet) VALUES ( '%s') ''' % data2 )

Quick scanning of mysql-python docs reveals that you should also
call connect with unicode='utf-8' parameter. Have you done that?

I added that now, but it doesn't seem to make much difference
(I think it's more for returning data from MySQL, not storing it,
but that will still be useful)

I did a test where I grabbed the URL using the same routines and
dumped the thing to a file, and then edited out all the English and
various HTML, and the SQL insert works at that point.

It seems the mixed language is throwing stuff off, which wouldn't
bother me if my re.search for only the Vietnamese text were working
properly, but it isn't.

Jul 18 '05 #6
Serge Orlov wrote:
What's driving me crazy is knowing the encoding
but still not getting the data all the way through the chain to MySQL.
That's because you're trying to create SQL statements manually.


As opposed to what?

2) Convert to unicode before UTF-8
Not sure what that means.

data.decode(None,'strict')
or
unicode(data,'unicode','strict')


See Francis' excellent post on why both of these calls make no sense at all.

Understand, I tried doing absolutely nothing
for the supposedly Unicode strings, just
taking them as-is and putting them into the
database, and I get these ASCII decoding errors.
I try decoding them as UTF-8, which supposedly
they are, and I get these errors. So then I
try stupid stuff just to see if something stupid
works.
3) replace quotation marks within the SQL statement:
data2.replace(u'"',u'\\"')
It's not a unicode problem, is it?

Occasionally instead of getting the encoding error I get a SQL syntax error,
and figured somewhere it was misinterpreting something like the end
delimiter.
No proof though, just a guess, so I tried the replaces.


Uh, I see. If you're creating SQL statement yourself, you have to take care
of escaping. I guess after data2.encode('utf-8') you've got illegal (in SQL)
characters and you have to escape them. Another quick scan of mysql-python
docs reveals that .execute method can be called as
c.execute(''' INSERT INTO junk ( junklet) VALUES ( '%s') ''', args=(data2,) )
where data2 is a unicode string. Then the python wrapper will take care of
unicode conversions and escaping. Don't forget that you need to .connect with
unicode='utf-8' parameter.
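The two-argument execute(query, args) pattern above can be sketched with sqlite3 as a stand-in DB-API driver (MySQLdb uses %s placeholders where sqlite3 uses ?, but the binding idea is identical): the driver quotes and escapes the value, so embedded quotation marks and non-ASCII text never break the SQL.

```python
# Hedged sketch with sqlite3 standing in for MySQLdb: same DB-API 2.0
# execute(query, args) pattern, and the driver does the escaping.
import sqlite3

conn = sqlite3.connect(':memory:')
curs = conn.cursor()
curs.execute('CREATE TABLE junk (junklet VARCHAR(2500))')

# Quotes, backslashes, and non-ASCII characters need no manual escaping
# when passed as a bound parameter instead of %-formatted into the SQL.
data2 = 'Ph\u1ea1m Th\u1ecb Ng\u1ecdc said: "x\\y"'
curs.execute('INSERT INTO junk (junklet) VALUES (?)', (data2,))

row = curs.execute('SELECT junklet FROM junk').fetchone()
```

With MySQLdb the equivalent call is c.execute("INSERT INTO junk (junklet) VALUES (%s)", (data2,)) — note the comma, not %, between the query and the args.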

Again, what do you mean "creating SQL statement yourself"? and what is the
alternative?

I'll try out this version of the execute statement.


Jul 18 '05 #7
Serge Orlov wrote:
It seems the mixed language is throwing stuff off, which wouldn't
bother me if my re.search for only the Vietnamese text were working
properly, but it isn't.


Why? re.search works fine for my Russian unicode characters.

Yes, it seems like it should be very easy and straightforward;
I don't know why it isn't, and will have to spend more time on it.


Jul 18 '05 #8
>> That's because you're trying to create SQL statements manually.

Bill> As opposed to what?

Let MySQLdb do the data escaping for you:

conn = MySQLdb.Connection(host=..., etc)
curs = conn.cursor()
curs.execute("insert into sometable"
" (field1, field2)"
" values"
" (%s, %s)",
(val1, val2))

Val1 and val2 can be utf-8-encoded strings. MySQLdb will do the right thing
for you...

Skip

Jul 18 '05 #9


