Charset decoding problem

Dormilich (Moderator)
Join Date: Aug 2008
Location: Leipzig, Germany
Posts: 3,660
#1: Feb 2 '09
Hi,

I've got a very strange problem with UTF-8 encoded data outside ASCII range.

While everything went smoothly on localhost, the same pages on the server show � for Latin-1 characters (ä, ö, ü, ß, ...) and ? for characters above the Latin-1 range (typographic punctuation). Even the host's support has no clue that could help me.

reference: http://test.kulturbeutel-leipzig.net/main.php?f=presse

JavaScript on: everything works fine (data are fetched directly from a MySQL DB via AJAX).
JavaScript off: (a bit more complicated) data are fetched from the DB, where they are stored as WDDX serialized data, and deserialized into an object, which in turn is responsible for the output.

maybe there's some problem with the deserialization.....

Does anyone have an idea how I can find the source of the problem?

thanks

PS: the DB should contain the same data, because I used a SQL dump of one to build the other.

PPS: if you need class definitions, just ask (it would be too much to list all incorporated classes at once)

local system: Darwin Melchior 9.6.0 Darwin Kernel Version 9.6.0: Mon Nov 24 17:37:00 PST 2008; root:xnu-1228.9.59~1/RELEASE_I386 i386 / PHP 5.2.8.
(= Mac OS 10.5)

public system: Linux Custom Build 64 Bit prohost.de XEON SMP x86_64 (Red Hat Enterprise Linux) / PHP 5.2.6.

Atli (Moderator)
Join Date: Nov 2006
Location: Iceland
Posts: 3,752
#2: Feb 3 '09
Hi.

I don't really know much about WDDX, but as I understand it, it is basically XML?
I had similar problems when passing XML files around a while ago, where the server was sending stuff as Unicode, the browser was rendering using Unicode, but the output was all mangled.

Turned out all I had to do to fix this was add:

<?xml version="1.0" encoding="UTF-8" ?>

And everybody suddenly started understanding each other.

My mistake was to assume that the XML file would adopt the charset passed with a Content-Type header like HTML pages do.

Perhaps you left this out as well?

Dormilich (Moderator)
#3: Feb 3 '09
yepp, WDDX is XML (useful if you have your configuration stored as XML)

but the XML header was there from the start.... and obviously Javascript has no problems at all with it.

sample WDDX:

<?xml version="1.0" encoding="UTF-8" ?>
<wddxPacket version='1.0'>
  <header>
    <comment>Zeitungsausschnitte (Text)</comment>
  </header>
  <data>
    <array length='4'>
      <string>Helena – von Äpfeln, Göttern und anderen Helden</string>
      <struct>
        <var name='php_class_name'>
          <string>wddx_presse</string>
        </var>
        <var name='name'>
          <string>p</string>
        </var>
        <var name='content'>
          <string>Auch 2004 erfreut die Schau*spiel*gruppe „Kultur*beutel“ wieder […]</string>
        </var>
[…]
      </struct>
    </array>
  </data>
</wddxPacket>

note: * = soft hyphen (escaped by bytes' editor)

Dormilich (Moderator)
#4: Feb 3 '09
There seems to be something wrong with the deserializer. After some testing I can say that the problems occur right after deserialization.

Does anyone know how I can determine the encoding/charset of a variable's content? (that would be interesting to know)

thanks
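
A PHP string carries no charset label, so the encoding cannot be read off a variable. The most one can do is test whether its bytes are well-formed in a candidate encoding and look at them directly. A minimal sketch, assuming the mbstring extension is available (describe_bytes() is a hypothetical helper, not something from this thread):

```php
<?php
// Sketch: inspect the raw bytes of a string and test them against UTF-8.
// Assumes the mbstring extension is loaded; describe_bytes() is a made-up name.

function describe_bytes($value)
{
    return array(
        'hex'           => bin2hex($value),                    // raw bytes, two hex digits each
        'is_valid_utf8' => mb_check_encoding($value, 'UTF-8'), // true if the bytes are well-formed UTF-8
        'byte_length'   => strlen($value),                     // strlen() counts bytes, not characters
    );
}

$utf8   = "\xC3\x84pfel"; // "Äpfel" encoded as UTF-8 (Ä = C3 84)
$latin1 = "\xC4pfel";     // "Äpfel" encoded as ISO-8859-1 (Ä = C4)

print_r(describe_bytes($utf8));   // is_valid_utf8 is true, 6 bytes
print_r(describe_bytes($latin1)); // is_valid_utf8 is false, 5 bytes
```

Running such a check on the value directly before and directly after deserialization would show at which step the bytes stop being valid UTF-8. Note that a positive result only says the bytes could be UTF-8; plain ASCII text passes the same test.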

Atli (Moderator)
#5: Feb 3 '09
PHP strings (at least until the planned version 6) don't have any native support for Unicode, or any other charset for that matter.
A string character is essentially the same as a byte.

Try running the variable content through utf8_encode. See if that helps any.
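
To illustrate the "character equals byte" point and what utf8_encode does: it treats its input as ISO-8859-1 and re-encodes each byte as UTF-8. The sketch below writes the same conversion with mb_convert_encoding so the source charset is explicit; it assumes the mbstring extension, and latin1_to_utf8() is a hypothetical helper name.

```php
<?php
// Sketch: ISO-8859-1 -> UTF-8, the conversion utf8_encode() performs.
// Assumes mbstring; latin1_to_utf8() is a made-up helper.

function latin1_to_utf8($bytes)
{
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "\xE4\xF6\xFC\xDF";      // "äöüß" in ISO-8859-1: one byte per character
$utf8   = latin1_to_utf8($latin1); // the same text in UTF-8: two bytes per character

echo strlen($latin1), "\n";           // 4 (bytes)
echo strlen($utf8), "\n";             // 8 (bytes)
echo mb_strlen($utf8, 'UTF-8'), "\n"; // 4 (characters)
echo bin2hex($utf8), "\n";            // c3a4c3b6c3bcc39f

// Applying the conversion to bytes that are already UTF-8 double-encodes them:
$double = latin1_to_utf8($utf8);
echo strlen($double), "\n";           // 16 (bytes)
```

The conversion therefore only helps when the bytes really are ISO-8859-1; on data that is already UTF-8 it produces the familiar "Ã¤" style garbage.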

Dormilich (Moderator)
#6: Feb 3 '09
Quote (originally posted by Atli):

Try running the variable content through utf8_encode. See if that helps any.

Though it converts the Latin-1 characters, it's no help with the characters initially showing up as '?' („ “ – ’ … and the like)
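
This pattern is consistent with the text having been converted from UTF-8 to ISO-8859-1 somewhere along the way, although the thread does not confirm that this is what the deserializer does. Characters such as „ and “ have no ISO-8859-1 code point, so such a conversion replaces them with a literal '?', and no later re-encoding can bring them back. A sketch of that effect, assuming the mbstring extension:

```php
<?php
// Sketch of how the '?' symptom can arise (an assumption, not confirmed in the thread).
// Assumes mbstring.

mb_substitute_character(0x3F); // use an explicit '?' for characters that cannot be mapped

$utf8 = "\xE2\x80\x9E\xC3\xA4\xE2\x80\x9C"; // „ä“ as UTF-8 (U+201E, U+00E4, U+201C)

// U+201E and U+201C do not exist in ISO-8859-1, so they are substituted.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
echo bin2hex($latin1), "\n"; // 3fe43f  ('?', ä as one byte, '?')

// Converting back restores ä but cannot restore the quotation marks.
$back = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
echo bin2hex($back), "\n";   // 3fc3a43f  ('?', ä as two bytes, '?')
```

The lone E4 byte in the intermediate string is also what a browser reading the page as UTF-8 would display as �.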

Dormilich (Moderator)
#7: Feb 13 '09
Finally got the problem sorted, somehow, by converting all non-ASCII characters using Unicode entities and this little function: http://de2.php.net/manual/de/functio...code.php#75941
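
The linked function is not reproduced here (the URL is truncated), so the following is only a sketch of the general idea: replace every non-ASCII character with a numeric entity so that the stored data is pure ASCII and survives any single-byte charset handling. It assumes the mbstring extension, and to_numeric_entities() is a hypothetical name.

```php
<?php
// Sketch: turn all non-ASCII characters into numeric entities and back.
// Assumes mbstring; to_numeric_entities() is a made-up helper, not the linked function.

function to_numeric_entities($utf8)
{
    // convmap: code points 0x80..0x10FFFF, offset 0, mask 0x1FFFFF
    $convmap = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
    return mb_encode_numericentity($utf8, $convmap, 'UTF-8');
}

$text    = "\xC3\x84pfel \xE2\x80\x93 G\xC3\xB6tter"; // "Äpfel – Götter" as UTF-8
$encoded = to_numeric_entities($text);
echo $encoded, "\n"; // &#196;pfel &#8211; G&#246;tter

// The entities can be sent to the browser as they are, or decoded again:
$decoded = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');
var_dump($decoded === $text); // bool(true)
```

This works around the symptom rather than fixing the charset handling: the entities are ASCII, so it no longer matters which encoding the deserializer assumes.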

xaxis (Newbie)
Join Date: Feb 2009
Location: California, United States
Posts: 15
#8: Feb 13 '09
Quote (originally posted by Dormilich):

does anyone know, how I can determine the encoding/charset of a variable content? (that would be interesting to know)

Very interesting indeed. Interesting enough that I scoured the net, and I believe this resource is the most detailed one, and the closest any person or group has yet come to solving this extremely challenging problem: http://www.mozilla.org/projects/intl...Detection.html