UTF-8 -> ISO8859-1 conversion problem

Cott Lang

ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1

Running PostgreSQL 7.4.5, I frequently get this error, and ONLY on this
particular character, despite seeing quite a bit of 8-bit data. I don't
really follow why it can't be converted; it's the same character (239) in
both character sets. The databases are in ISO8859-1, and the JDBC driver is
defaulting to UTF-8.

Am I flubbing something up? I'm probably going to (reluctantly) convert
to UTF-8 in the database at some point, but it'd sure be nice if this
worked without that. :)

thanks!

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

Nov 23 '05 #1


J. Michael Crawford

In my experience, there are just some characters that don't want to be
converted, even if they appear to be part of the normal 8-bit character
system. We went to Unicode databases to hold our Latin1 characters because
of this. There was even a case where the client was cutting and pasting
ASCII text into our database, and it just wouldn't take some of the
letters, giving the same error you reported.

I'm going to send a more detailed post on the topic, but in general,
we've found that there are four things that need to be done (four, if
you're not serving up web pages) for Latin1 characters to work on multiple
platforms.

1. Create the database in Unicode so that it will hold anything you
throw at it.

2. When importing data, set the encoding in the script that loads the
data, or if there's no script, use the "SET CLIENT_ENCODING TO (encoding)"
command. Setting the encoding in a tool like pgManager is not always
enough. Use this to be sure.

3. When retrieving data in a Java application, the JVM encoding will
vary from JVM to JVM, and no attempt on our part to change the JVM encoding
or translate the encoding of the database strings has worked, either to or
from the database. We spent weeks going through every permutation of
getBytes("ISO-8859-1") and related calls we could find, but to no
avail. The JVM will tell you it has a new encoding, but postgres will
return gibberish. You can translate the bytes, or get a translated string,
but it's all the same garbage. The solution: set the client encoding
manually through a JDBC prepared statement. Once you set the client
encoding properly, all seems to be fine:

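// dbCon is an open java.sql.Connection; PreparedStatement is java.sql.PreparedStatement.
// Use a real encoding name, either returned from the JVM or stated explicitly.
String DBEncoding = "anEncoding";
PreparedStatement statement = dbCon.prepareStatement(
    "SET CLIENT_ENCODING TO '" + DBEncoding + "'");
statement.execute();
statement.close();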

4. If writing HTML for a web page, make sure the encoding of the web
page matches the encoding of the strings you're throwing at it. So if you
have a Linux JVM that has a "UTF-8" encoding, the web page will need the
HTML equivalent:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
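To illustrate point 4, here is a minimal sketch of what goes wrong when the charset a page declares does not match the charset its bytes were written in. The class and method names (PageCharsetDemo, roundTrip, metaTag) are made up for this example and are not part of any library; the code only uses the standard java.nio.charset classes and does not touch a database or a servlet container.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class PageCharsetDemo {

    /**
     * Encodes text with one charset and decodes it with another, which is
     * roughly what a browser does when the declared charset is wrong.
     */
    static String roundTrip(String text, Charset written, Charset declared) {
        byte[] onTheWire = text.getBytes(written);
        return new String(onTheWire, declared);
    }

    /** Builds the meta tag from the charset that was actually used to encode. */
    static String metaTag(Charset charset) {
        return "<meta http-equiv=\"Content-Type\" content=\"text/html; charset="
                + charset.name().toLowerCase(Locale.ROOT) + "\">";
    }

    public static void main(String[] args) {
        String text = "na\u00EFve"; // contains U+00EF

        // Matching charsets: the text survives unchanged.
        String ok = roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(ok.equals(text));

        // Written as UTF-8 but read as ISO-8859-1: U+00EF turns into
        // the two characters U+00C3 U+00AF.
        String garbled = roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.equals("na\u00C3\u00AFve"));

        System.out.println(metaTag(StandardCharsets.UTF_8));
    }
}
```

Deriving the meta tag from the same Charset object that encodes the output is one way to keep the two from drifting apart; whether that fits a given web framework is an assumption to check.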

---

This is likely far more information than you require, but I thought I'd
add it anyway so that the information is in the archives. It took us
months to solve our problem, even with help from the postgres community, so
I at least want the basics to be posted while I get my act together and
write something with more detail.

- Mike

(In reply to Cott Lang's post of 12:12 PM, 10/29/2004.)

Nov 23 '05 #2

J. Michael Crawford

Correction: Four things that need to be done, THREE if you're not
serving up html. Sorry for the editing error.

- Mike

(In reply to J. Michael Crawford's post of 01:19 PM, 10/29/2004.)

Nov 23 '05 #3

Ian Pilcher

Cott Lang wrote:

> ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1

Can you post a code snippet? There's really no such thing as a "UTF-8
character". Java chars and Strings are UTF-16 (or maybe UCS-2 in JVMs
prior to 1.5), not UTF-8.

Note that 0xEF should not appear by itself in a UTF-8 bytestream. The
UTF-8 representation of U+00EF is 0xC3 0xAF.
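The byte values above can be checked with a small, self-contained Java sketch. The class and helper names (EfBytesDemo, hex, isValidUtf8) are invented for illustration; only standard java.nio.charset calls are used. It shows that U+00EF is one byte in ISO-8859-1 and two bytes in UTF-8, and that a strict UTF-8 decoder rejects a lone 0xEF byte. It does not claim to reproduce what the server or the JDBC driver does internally.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class EfBytesDemo {

    /** Formats bytes as space-separated hex, for example "0xC3 0xAF". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Returns true if the bytes are well-formed UTF-8 under a strict decoder. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String s = "\u00EF"; // U+00EF, code point 239

        System.out.println(hex(s.getBytes(StandardCharsets.ISO_8859_1))); // 0xEF
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));      // 0xC3 0xAF

        System.out.println(isValidUtf8(new byte[] {(byte) 0xC3, (byte) 0xAF})); // true
        System.out.println(isValidUtf8(new byte[] {(byte) 0xEF}));              // false
    }
}
```

So the character number is the same in both sets, but the bytes on the wire are not, and a raw 0xEF byte labelled as UTF-8 is a truncated multi-byte sequence.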


Nov 23 '05 #4

Cott Lang

Thanks for the detailed reply, you've confirmed what I suspected. :)

I guess I have some work to do!

(In reply to J. Michael Crawford's post of Fri, 2004-10-29 at 10:19.)

Nov 23 '05 #5
