Screenscraping UTF-8 characters problem

Hi! I'm having problems screen-scraping Chinese characters from a Google
Translate result and outputting them correctly. The output is always a
garbled mess, not Chinese characters. German, for instance, works fine.
Thanks for any hints!
Some relevant parts of the PHP 5 code:
/******************/

header('Content-type: text/html; charset=utf-8');
....
showResult( getTranslation('bird flu', 'zh-CN'), 'Chinese' );
....
function getTranslation($q, $lang)
{
    $out = '';
    // the Google page is supposed to be UTF-8 too:
    $in = getFileText( "http://google.com/translate_t?langpair=en|" .
        urlencode($lang) . "&text=" . urlencode($q) );
    preg_match('/<div id=result_box dir=ltr>(.*?)<\/div>/', $in, $out);

    $translation = $out[1]; // garbled!
    $translation = trim($translation);
    // garbled with or without this line...
    $translation = utf8_encode($translation);
    return $translation;
}

/******************/

Feb 24 '07 #1
Philipp Lenssen wrote:
Hi! I'm having problems screen-scraping Chinese characters from a Google
Translate result and outputting them correctly. The output is always a
garbled mess, not Chinese characters. German, for instance, works fine.
Thanks for any hints!
Some relevant parts of the PHP 5 code:
/******************/

header('Content-type: text/html; charset=utf-8');
...
showResult( getTranslation('bird flu', 'zh-CN'), 'Chinese' );
...
function getTranslation($q, $lang)
{
    $out = '';
    // the Google page is supposed to be UTF-8 too:
    $in = getFileText( "http://google.com/translate_t?langpair=en|" .
        urlencode($lang) . "&text=" . urlencode($q) );
    preg_match('/<div id=result_box dir=ltr>(.*?)<\/div>/', $in, $out);

    $translation = $out[1]; // garbled!
    $translation = trim($translation);
    // garbled with or without this line...
    $translation = utf8_encode($translation);
    return $translation;
}

/******************/
It seems to me that what you need are the multibyte functions. You should
replace preg_match with the multibyte-compatible mb_ereg_match:

http://fi2.php.net/manual/en/function.mb-ereg-match.php
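
A minimal sketch of how the extraction might look with the mbstring regex
functions. One caveat: mb_ereg_match() only returns true or false and only
tests the start of the string, so this sketch uses the related mb_ereg(),
which fills an array with the captured groups. The function name
extractTranslation() is hypothetical, and the sketch assumes the fetched page
really is UTF-8 and that the result markup looks like the pattern in the
original post.

```php
<?php
// Hypothetical helper, not part of the original code: pull the text of the
// result <div> out of a page that is assumed to be UTF-8 encoded.
function extractTranslation(string $html): ?string
{
    // Assumption check: give up if the input is not valid UTF-8 at all.
    if (!mb_check_encoding($html, 'UTF-8')) {
        return null;
    }

    // Tell the mb_ereg* functions to treat pattern and subject as UTF-8.
    mb_regex_encoding('UTF-8');

    // mb_ereg() patterns take no delimiters. [^<]* stops at the next tag,
    // which plays the role of the lazy (.*?) in the preg_match version.
    $pattern = '<div id=result_box dir=ltr>([^<]*)</div>';
    if (!mb_ereg($pattern, $html, $regs)) {
        return null; // no result box found
    }

    return trim($regs[1]);
}
```

Returning null both for invalid input and for a missing result box lets the
caller distinguish "nothing found" from an empty translation.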

Note that the mb functions aren't included in the default installation; you
need to enable the mbstring extension. See the installation instructions:
http://fi2.php.net/manual/en/ref.mbstring.php
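
Before relying on the mb functions, it may help to check at runtime whether
they are available. The sketch below is only an illustration, and the helper
name mbstringAvailable() is made up for this example. It also checks for
mb_ereg separately, because mbstring can be built without its regex support.

```php
<?php
// Hypothetical guard: true only if mbstring is loaded and its regex
// functions (mb_ereg and friends) were compiled in.
function mbstringAvailable(): bool
{
    return extension_loaded('mbstring') && function_exists('mb_ereg');
}

if (!mbstringAvailable()) {
    // Log a clear message instead of failing later with an
    // "undefined function" error.
    error_log('mbstring (with regex support) is not enabled');
}
```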

--
"I'm not a bad person, but apples are my livelihood." -Perttu Sirviö
sp**@outolempi.net | Gedoon-S @ IRCnet | rot13(xv***@bhgbyrzcv.arg)
Feb 24 '07 #2
