
Weird loadHTML behaviour

Hi all,

I'm setting up a PHP script that reads an HTML file, does a character
conversion and then displays the contents of a single HTML tag, as
follows:
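// Read the file and turn the ISO-8859-1 bytes into HTML entities
$str = mb_convert_encoding (file_get_contents ('aktuel.htm'),
    'HTML-ENTITIES', 'ISO-8859-1');

// Dump the converted markup for inspection
file_put_contents ('dmp.htm', $str);

// loadHTML is an instance method, so create the document first
$dom = new DOMDocument ();
$dom->loadHTML ($str);
$elem = $dom->getElementsByTagName ('h5');
if ($elem->length) {
    $n = $elem->item (0)->nodeValue;
    var_dump (bin2hex ($n));
}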

What's interesting is that the source HTML file is properly ISO-8859-1
encoded (which the contents of "dmp.htm" verify). The trouble starts
when I retrieve the contents of the first <h5> tag that has an umlaut
in it. In this case, the umlaut is garbled - what used to be a
"Ü" (capital U umlaut, ISO-8859-1 0xdc) has now become "Ãœ" (0xc3 0x9c,
as the var_dump confirms). Two things surprise me: that the character
changes at all, and that the umlaut is not entity-encoded as
HTML-ENTITIES would suggest. I use PHP version 5.2.1 on a Linux
box.
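
For reference, the two byte values quoted above can be compared directly. The following is only a minimal sketch, assuming the mbstring extension is available; the variable names are made up for illustration. It re-encodes the single ISO-8859-1 byte 0xdc as UTF-8 and prints both forms in hex:

```php
<?php
// "Ü" as a single ISO-8859-1 byte (0xdc)
$latin1 = "\xDC";

// The same character re-encoded as UTF-8
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

var_dump(bin2hex($latin1)); // string(2) "dc"
var_dump(bin2hex($utf8));   // string(4) "c39c"
```

A viewer that reads the two UTF-8 bytes 0xc3 0x9c as Western single-byte text would show them as "Ãœ".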

Any thoughts?

Cheers, Christoph

May 8 '07 #1
1 Reply


After some :-) research, it turns out that the encoding of the
contents of the first <h5> tag has actually changed to UTF-8 - hence
the strange byte sequence. This raises the question of whether UTF-8
is the default encoding for parsed HTML strings in the DOM package
(given that the input was HTML-ENTITIES-encoded to begin with). Is
this a bug in DOMDocument or a feature?
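
As far as the PHP manual goes, this looks like documented behaviour rather than a bug: the DOM extension works in UTF-8, so string values read from a node come back UTF-8 encoded whatever the input encoding was. A minimal sketch of converting the value back, assuming the mbstring extension and a made-up input string in place of aktuel.htm:

```php
<?php
// Hypothetical stand-in for the converted file contents:
// the umlaut is already an entity, the rest is plain ASCII
$html = '<html><body><h5>&Uuml;bersicht</h5></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// The parser decodes the entity; nodeValue is returned as UTF-8
$utf8 = $dom->getElementsByTagName('h5')->item(0)->nodeValue;
var_dump(bin2hex($utf8));   // begins with "c39c"

// Convert back to ISO-8859-1 if single-byte output is needed
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
var_dump(bin2hex($latin1)); // begins with "dc"
```

If the entity form is wanted instead, passing the UTF-8 value through htmlentities() with 'UTF-8' as the encoding argument should give it back.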

Cheers, Christoph

May 9 '07 #2
