Screenscraping UTF-8 characters problem

Hi! I'm having some problems correctly screenscraping and outputting
e.g. Chinese characters from a Google translator search result. The
output is always a garbled mess, not Chinese characters. German for
instance works fine. Thanks for any hints...!!
Some relevant parts of the PHP 5 code:
/******************/

header('Content-type: text/html; charset=utf-8');
....
showResult( getTranslation('bird flu', 'zh-CN'), 'Chinese' );
....
function getTranslation($q, $lang)
{
    $out = '';
    // the Google page is supposed to be UTF-8 too:
    $in = getFileText( "http://google.com/translate_t?langpair=en|" .
        urlencode($lang) . "&text=" . urlencode($q) );
    preg_match('/<div id=result_box dir=ltr>(.*?)<\/div>/', $in, $out);

    $translation = $out[1]; // garbled!
    $translation = trim($translation);
    $translation = utf8_encode($translation); // garbled with or without this line...
    return $translation;
}

/******************/

Feb 24 '07 #1
Philipp Lenssen wrote:
Hi! I'm having some problems correctly screenscraping and outputting
e.g. Chinese characters from a Google translator search result. The
output is always a garbled mess, not Chinese characters.
[snip code]

It seems to me that what you need are the multibyte functions. You should
replace preg_match with the multibyte-compatible mb_ereg_match:

http://fi2.php.net/manual/en/function.mb-ereg-match.php

Note that the mb functions aren't included in the default installation; you
need to enable them. See the installation instructions:
http://fi2.php.net/manual/en/ref.mbstring.php
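
A minimal sketch of what that swap might look like, assuming the mbstring extension is enabled and the fetched page really is UTF-8. Since mb_ereg_match only returns true or false, the sketch uses its sibling mb_ereg, which fills an array with the captured groups. The helper name extractTranslation and the sample markup are hypothetical, made up for illustration; they are not from the original code or from Google's actual page.

```php
<?php
// Hypothetical helper: pulls the text of the result_box div out of a UTF-8 page.
function extractTranslation($html)
{
    // Tell the mbstring regex functions that pattern and subject are UTF-8.
    mb_regex_encoding('UTF-8');

    // mb_ereg() takes the pattern without delimiters and stores captures in $regs.
    $regs = array();
    if (mb_ereg('<div id=result_box dir=ltr>(.*?)</div>', $html, $regs) === false) {
        return ''; // no match found
    }
    return trim($regs[1]);
}

// Assumed sample input standing in for the fetched page.
$in = '<html><body><div id=result_box dir=ltr> 禽流感 </div></body></html>';
echo extractTranslation($in), "\n";
```

One thing to keep in mind: utf8_encode() assumes its input is ISO-8859-1, so calling it on bytes that are already UTF-8 re-encodes them and garbles the text. If the page is UTF-8 to begin with, that call is probably best left out.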

--
"I'm not a bad person, but apples are my livelihood." -Perttu Sirviö
sp**@outolempi.net | Gedoon-S @ IRCnet | rot13(xv***@bhgbyrzcv.arg)
Feb 24 '07 #2
