UTF8: file_put_contents doesn't seem to write UTF8 content properly

Hi,

I'm trying to let PHP write a 'sitemap.xml' sitemap for Google and other
search engines. It's working, except that the content in the XML file doesn't
seem to be UTF-8. (Which it should be, judging by the information given on
Google's webmaster help center.)

The way I test to see if the content is UTF-8 is by opening the XML file in
Notepad and choosing 'Save as...'. Normally the encoding option should be set
to UTF-8, but now it just shows ANSI.

This is what I have tried to write UTF-8 content with:

file_put_contents( '.' . SITEMAP_FILE, utf8_encode( $this->sitemapForCrawlers ) );

...and...

file_put_contents( '.' . SITEMAP_FILE, iconv( "ISO-8859-1", "UTF8", $this->sitemapForCrawlers ) );

...where SITEMAP_FILE is the filename constant and $this->sitemapForCrawlers
is the string with the XML data.

With the last attempt I even got an error saying:

Wrong charset, conversion from `ISO-8859-1' to `UTF8' is not allowed in...

Any ideas on how I can make this work?

Thanks for the input.
Jun 13 '07 #1
On Wed, 13 Jun 2007 22:25:44 +0200, "amygdala" <no*****@noreply.com> wrote:
> I'm trying to let PHP write a 'sitemap.xml' sitemap for Google and other
> search engines. It's working, except that the content in the XML file doesn't
> seem to be UTF-8. (Which it should be, judging by the information given on
> Google's webmaster help center.)
>
> The way I test to see if the content is UTF-8 is by opening the XML file in
> Notepad and choosing 'Save as...'. Normally the encoding option should be set
> to UTF-8, but now it just shows ANSI.

Well, that's not a foolproof method...

> This is what I have tried to write UTF-8 content with:
>
> file_put_contents( '.' . SITEMAP_FILE, utf8_encode( $this->sitemapForCrawlers ) );
> ...and...
> file_put_contents( '.' . SITEMAP_FILE, iconv( "ISO-8859-1", "UTF8", $this->sitemapForCrawlers ) );
>
> ...where SITEMAP_FILE is the filename constant and $this->sitemapForCrawlers
> is the string with the XML data.
>
> With the last attempt I even got an error saying:
>
> Wrong charset, conversion from `ISO-8859-1' to `UTF8' is not allowed in...
>
> Any ideas on how I can make this work?

Start from the beginning; what character set encoding is the original data in?
The error implies that it's not ISO-8859-1 (which does have some gaps where
characters aren't valid...)
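
A note on the error message itself: as far as can be told from PHP's iconv extension, this wording is raised when the underlying library does not support one of the charset names, not when the input contains invalid bytes. The unhyphenated "UTF8" may therefore be the culprit on some platforms. A minimal sketch, assuming the iconv extension is loaded; toUtf8() is a hypothetical helper name, and the source encoding still has to be supplied by the caller:

```php
<?php
// Hypothetical helper: convert $text from a known source encoding to UTF-8.
// Assumes the iconv extension is loaded and that $from really is the
// encoding of $text (nothing in this function can verify that).
function toUtf8(string $text, string $from): string
{
    // "UTF-8" (hyphenated) is the registered charset name; some iconv
    // implementations do not accept the spelling "UTF8".
    $converted = @iconv($from, 'UTF-8', $text);
    if ($converted === false) {
        throw new RuntimeException("iconv could not convert from $from to UTF-8");
    }
    return $converted;
}
```

With this, the call in the question would become file_put_contents( '.' . SITEMAP_FILE, toUtf8( $this->sitemapForCrawlers, 'ISO-8859-1' ) ), provided the data really is ISO-8859-1.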

--
Andy Hassall :: an**@andyh.co.uk :: http://www.andyh.co.uk
http://www.andyhsoftware.co.uk/space :: disk and FTP usage analysis tool
Jun 13 '07 #2
C.
On 13 Jun, 21:25, "amygdala" <nore...@noreply.com> wrote:
> The way I test to see if the content is UTF-8 is by opening the XML file in
> Notepad and choosing 'Save as...'. Normally the encoding option should be set
> to UTF-8, but now it just shows ANSI.

ROFL.

Try inserting a BOM in front of the content.

C.

Jun 13 '07 #3

"Andy Hassall" <an**@andyh.co.uk> wrote in message
news:nb********************************@4ax.com...
> On Wed, 13 Jun 2007 22:25:44 +0200, "amygdala" <no*****@noreply.com>
> wrote:
>> I'm trying to let PHP write a 'sitemap.xml' sitemap for Google and other
>> search engines. It's working, except that the content in the XML file
>> doesn't seem to be UTF-8. (Which it should be, judging by the information
>> given on Google's webmaster help center.)
>>
>> The way I test to see if the content is UTF-8 is by opening the XML file
>> in Notepad and choosing 'Save as...'. Normally the encoding option should
>> be set to UTF-8, but now it just shows ANSI.
>
> Well, that's not a foolproof method...

I was afraid of that.

>> This is what I have tried to write UTF-8 content with:
>>
>> file_put_contents( '.' . SITEMAP_FILE, utf8_encode( $this->sitemapForCrawlers ) );
>> ...and...
>> file_put_contents( '.' . SITEMAP_FILE, iconv( "ISO-8859-1", "UTF8", $this->sitemapForCrawlers ) );
>>
>> ...where SITEMAP_FILE is the filename constant and
>> $this->sitemapForCrawlers is the string with the XML data.
>>
>> With the last attempt I even got an error saying:
>>
>> Wrong charset, conversion from `ISO-8859-1' to `UTF8' is not allowed in...
>>
>> Any ideas on how I can make this work?
>
> Start from the beginning; what character set encoding is the original data
> in? The error implies that it's not ISO-8859-1 (which does have some gaps
> where characters aren't valid...)

Well... I discovered the 'Set Code Page...' option in UltraEdit, the main
editor I use to code PHP. And it tells me my PHP code files are encoded in
'1252 (ANSI - Latin I)'. So, now my next question is... what would be the
correct first parameter for the iconv function to tell it that the original
data is '1252 (ANSI - Latin I)'? I've tried numerous strings, which include:

'1252 (ANSI - Latin I)'
'1252'
'1252 ANSI'
'1252-ANSI'
'ANSI-1252'
'ANSI 1252'

....and variations.

Is there any iconv encoding table with acceptable encodings I can consult?
Also, isn't '1252 (ANSI - Latin I)' just a pimped version of ISO-8859-1?

I'm still curious about this, though. Please read my reply to C. as well.

Thanks.
Jun 14 '07 #4

"C." <co************@gmail.com> wrote in message
news:11**********************@e9g2000prf.googlegroups.com...
> On 13 Jun, 21:25, "amygdala" <nore...@noreply.com> wrote:
>> The way I test to see if the content is UTF-8 is by opening the XML file
>> in Notepad and choosing 'Save as...'. Normally the encoding option should
>> be set to UTF-8, but now it just shows ANSI.
>
> ROFL.

Yes, very amusing. :-/

> Try inserting a BOM in front of the content.

OK, I did a little research on the BOM and came up with information that tells
me the BOM isn't strictly necessary for UTF-8. Then I ran a simple test
with utf8_encode:

<?php
echo utf8_encode( 'ï' );
?>

Which output looks like it works just fine:

ï

Since I've concluded (see my reply to Andy Hassall) that my files are
encoded in '1252 (ANSI - Latin I)' and this test file was also '1252
(ANSI - Latin I)', I guess it works. And I don't necessarily have to provide a
BOM and don't have to resort to iconv. Correct?

Or am I missing something vital here?
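
On the BOM question: for UTF-8 the byte order mark is the three bytes EF BB BF at the very start of the file, and it is optional. Notepad uses it as a strong hint when guessing the encoding, which would explain the "ANSI" it shows for a file without one. Note also that utf8_encode() always treats its input as ISO-8859-1 (and is deprecated as of PHP 8.2), so CP1252-only characters need a different conversion. Should a BOM be wanted anyway, a minimal sketch (both helper names are hypothetical):

```php
<?php
// Hypothetical helper: true if $bytes begins with the UTF-8 byte order
// mark, i.e. the three bytes EF BB BF.
function hasUtf8Bom(string $bytes): bool
{
    return strncmp($bytes, "\xEF\xBB\xBF", 3) === 0;
}

// Hypothetical helper: prepend a UTF-8 BOM unless one is already present.
// The BOM only labels the data; it does not convert anything, so $bytes
// must already be UTF-8.
function withUtf8Bom(string $bytes): string
{
    return hasUtf8Bom($bytes) ? $bytes : "\xEF\xBB\xBF" . $bytes;
}
```

Whether a sitemap consumer prefers the file with or without a BOM is not something this thread settles, so treat the BOM as a convenience for editors like Notepad rather than a requirement.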

Thanks.
Jun 14 '07 #5
On Thu, 14 Jun 2007 03:39:34 +0200, "amygdala" <no*****@noreply.com> wrote:
>> Start from the beginning; what character set encoding is the original data
>> in? The error implies that it's not ISO-8859-1 (which does have some gaps
>> where characters aren't valid...)
>
> Well... I discovered the 'Set Code Page...' option in UltraEdit, the main
> editor I use to code PHP. And it tells me my PHP code files are encoded in
> '1252 (ANSI - Latin I)'.

Well... again, that's not foolproof. It's generally not possible to
definitively detect the encoding of a file. You can work out whether it's
impossible to be in a particular encoding (invalid characters or byte
sequences), and you can make some guesses on character distribution or
spellings of words, but unless it's tagged in some way (like HTML and XML, or
through another channel like HTTP headers) then it's not certain.
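
The "rule it out" half of that can be sketched in PHP. This assumes the mbstring extension is available; couldBeUtf8() is a hypothetical name chosen to stress that a positive answer is only "not impossible":

```php
<?php
// Requires the mbstring extension.
// Returns true if $bytes is well-formed UTF-8. A false result rules UTF-8
// out; a true result does NOT prove the author intended UTF-8, because
// pure ASCII text, for example, is valid in many encodings at once.
function couldBeUtf8(string $bytes): bool
{
    return mb_check_encoding($bytes, 'UTF-8');
}
```

For example, the single byte 0xEF (an "ï" in ISO-8859-1 or CP1252) fails this check, while the two-byte sequence 0xC3 0xAF (an "ï" in UTF-8) passes.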

"Windows Codepage 1252" is a Windows character set encoding that is similar
to, but not exactly the same as, ISO-8859-1. It (1252) includes the Euro
character (at byte 0x80), which ISO-8859-1 does not have at all, and it has a
few other extra characters in the 0x80-0x9F range, which ISO-8859-1 reserves
for control codes.
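
The difference is easy to see by converting the same byte from both encodings. A small sketch, assuming the iconv extension; hexAsUtf8() is a hypothetical helper that returns the UTF-8 result as hex so the bytes are visible:

```php
<?php
// Requires the iconv extension.
// Converts $bytes from the encoding $from to UTF-8 and returns the result
// as a hex string, so the outcome can be inspected byte by byte.
function hexAsUtf8(string $bytes, string $from): string
{
    return bin2hex(iconv($from, 'UTF-8', $bytes));
}

// hexAsUtf8("\x80", 'CP1252')     gives "e282ac" (the Euro sign, U+20AC)
// hexAsUtf8("\x80", 'ISO-8859-1') gives "c280"   (control code U+0080)
// hexAsUtf8("\xEF", 'CP1252')     gives "c3af"   (ï, identical in both)
```

So for text without characters in the 0x80-0x9F range, the two source encodings produce the same UTF-8 output.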

Do you have any Euro currency symbols in the file?

> So, now my next question is... what would be the
> correct first parameter for the iconv function to tell it that the original
> data is '1252 (ANSI - Latin I)'? I've tried numerous strings, which include:
>
> '1252 (ANSI - Latin I)'
> '1252'
> '1252 ANSI'
> '1252-ANSI'
> 'ANSI-1252'
> 'ANSI 1252'
>
> ...and variations.
>
> Is there any iconv encoding table with acceptable encodings I can consult?

http://www.gnu.org/software/libiconv/

You possibly want:

CP1252
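
Put together with the original file_put_contents call, that could look like the sketch below. It assumes the iconv extension and that the input really is CP1252; writeAsUtf8() is a hypothetical helper name, not something from the original code:

```php
<?php
// Hypothetical helper: convert CP1252 text to UTF-8 and write it to $path.
// Returns the number of bytes written. Assumes the iconv extension is
// loaded and that $cp1252Text really is CP1252 encoded.
function writeAsUtf8(string $path, string $cp1252Text): int
{
    $utf8 = iconv('CP1252', 'UTF-8', $cp1252Text);
    if ($utf8 === false) {
        throw new RuntimeException('Conversion from CP1252 failed');
    }
    $written = file_put_contents($path, $utf8);
    if ($written === false) {
        throw new RuntimeException("Could not write $path");
    }
    return $written;
}
```

In the original code this would be called as writeAsUtf8( '.' . SITEMAP_FILE, $this->sitemapForCrawlers ).
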
>Also, isn't '1252 (ANSI - Latin I)' just a pimped version of ISO-8859-1?
I should read the entire message before typing ;-)

--
Andy Hassall :: an**@andyh.co.uk :: http://www.andyh.co.uk
http://www.andyhsoftware.co.uk/space :: disk and FTP usage analysis tool
Jun 14 '07 #6
> I'm trying to let PHP write a 'sitemap.xml' sitemap for Google and other
> search engines. It's working, except that the content in the XML file doesn't
> seem to be UTF-8. (Which it should be, judging by the information given on
> Google's webmaster help center.)

How can you tell? YOU tell the system what encoding is used. The system
rarely tells you, as bytes can be perfectly valid text in a lot of
encodings and look very different in each of them.

Even if the system tells you, it usually does so separately from the
text itself. Which is obvious, because you need the encoding to be able
to read the text! In webpages and e-mail, for example, headers are used
to set the encoding of the data.
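
For an XML file such as a sitemap, the place to state the encoding is the XML declaration on the first line. A minimal sketch, assuming the URLs passed in are already UTF-8; buildSitemap() is a hypothetical helper, and the namespace is the one from the sitemaps.org protocol:

```php
<?php
// Hypothetical helper: build a minimal sitemap from UTF-8 encoded URLs.
// The encoding is declared in the XML declaration, so a parser does not
// have to guess it.
function buildSitemap(array $urls): string
{
    $xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    $xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
    foreach ($urls as $url) {
        // Escape &, <, > and quotes so the output stays well-formed XML.
        $loc  = htmlspecialchars($url, ENT_QUOTES | ENT_XML1, 'UTF-8');
        $xml .= "  <url><loc>$loc</loc></url>\n";
    }
    return $xml . "</urlset>\n";
}
```

If the sitemap is served through PHP instead of written to disk, the same information can also go in the HTTP header, e.g. header('Content-Type: application/xml; charset=UTF-8').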

I suggest you search the net for encodings and how to work with them.
This is a good start:

http://www.joelonsoftware.com/articles/Unicode.html

Good luck with the onions,
--
Willem Bogaerts

Application smith
Kratz B.V.
http://www.kratz.nl/
Jun 15 '07 #7

"Willem Bogaerts" <w.********@kratz.maardanzonderditstuk.nl> wrote in
message news:46*********************@news.xs4all.nl...
>> I'm trying to let PHP write a 'sitemap.xml' sitemap for Google and other
>> search engines. It's working, except that the content in the XML file
>> doesn't seem to be UTF-8. (Which it should be, judging by the information
>> given on Google's webmaster help center.)
>
> How can you tell? YOU tell the system what encoding is used. The system
> rarely tells you, as bytes can be perfectly valid text in a lot of
> encodings and look very different in each of them.

Yes, you're right of course, what was I thinking.

> Even if the system tells you, it usually does so separately from the
> text itself. Which is obvious, because you need the encoding to be able
> to read the text! In webpages and e-mail, for example, headers are used
> to set the encoding of the data.
>
> I suggest you search the net for encodings and how to work with them.
> This is a good start:
>
> http://www.joelonsoftware.com/articles/Unicode.html

Great article. Thanks for the pointer.

> Good luck with the onions,

Hopefully that won't be necessary anymore.
Cheers.

> --
> Willem Bogaerts
>
> Application smith
> Kratz B.V.
> http://www.kratz.nl/

Jun 16 '07 #8
