Bytes IT Community

UTF8: file_put_contents doesn't seem to write UTF8 content properly

Hi,

I'm trying to let PHP write a 'sitemap.xml' sitemap for Google and other
search engines. It's working, except that the content in the XML file doesn't
seem to be UTF-8 (which it should be, judging by the information given in
Google's webmaster help center).

The way I test whether the content is UTF-8 is by opening the XML file in
Notepad and choosing 'Save as...'. Normally the encoding option should be set
to UTF-8, but now it just shows ANSI.

This is what I have tried to write UTF-8 content with:

file_put_contents( '.' . SITEMAP_FILE, utf8_encode( $this->sitemapForCrawlers ) );

...and...

file_put_contents( '.' . SITEMAP_FILE, iconv( "ISO-8859-1", "UTF8", $this->sitemapForCrawlers ) );

...where...
SITEMAP_FILE is the filename constant
...and...
$this->sitemapForCrawlers is the string with XML data

With the last attempt I even got an error saying:

Wrong charset, conversion from `ISO-8859-1' to `UTF8' is not allowed in...

Any ideas of how I can make this work?

Thanks for the input.
Jun 13 '07 #1
7 Replies


On Wed, 13 Jun 2007 22:25:44 +0200, "amygdala" <no*****@noreply.com> wrote:
> I'm trying to let PHP write a 'sitemap.xml' sitemap for Google and other
> search engines. It's working, except that the content in the XML file doesn't
> seem to be UTF8. (Which it should be, judging by the information given on
> Google's webmaster helpcenter).
>
> The way I test to see if the content is UTF8, is by opening the XML file in
> notepad and choose 'save as...'. Normally the coding option should be set to
> UTF8, but now it just shows ANSI.

Well, that's not a foolproof method...

> This is what I have tried to write UTF8 content with:
>
> file_put_contents( '.' . SITEMAP_FILE, utf8_encode( $this->sitemapForCrawlers ) );
> ...and...
> file_put_contents( '.' . SITEMAP_FILE, iconv( "ISO-8859-1", "UTF8", $this->sitemapForCrawlers ) );
>
> ...where...
> SITEMAP_FILE is the filename constant
> ...and...
> $this->sitemapForCrawlers is the string with XML data
>
> With the last attempt I even got an error saying:
>
> Wrong charset, conversion from `ISO-8859-1' to `UTF8' is not allowed in...
>
> Any ideas of how I can make this work?

Start from the beginning; what character set encoding is the original data in?
The error implies that it's not ISO-8859-1 (which does have some gaps where
characters aren't valid...)
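
A side note on that error message: as far as I know, PHP reports "Wrong charset, conversion from ... is not allowed" when the underlying iconv library does not support one of the requested charset names, rather than when the input bytes are invalid. Some iconv builds only accept the hyphenated spelling, so 'UTF-8' is the safer target name. A minimal sketch, assuming PHP 7 or later with the iconv extension; the helper name `toUtf8` and the sample bytes are made up for illustration:

```php
<?php
// Hypothetical helper: convert a string from a named source charset to UTF-8.
// iconv() returns false on failure, so the result is checked explicitly.
function toUtf8(string $data, string $fromCharset): string
{
    // '@' silences the PHP notice/warning; failure is reported via the exception.
    $converted = @iconv($fromCharset, 'UTF-8', $data);
    if ($converted === false) {
        throw new RuntimeException("iconv could not convert from $fromCharset to UTF-8");
    }
    return $converted;
}

// "caf\xE9" is "cafe" with an e-acute stored as the single ISO-8859-1 byte 0xE9.
echo bin2hex(toUtf8("caf\xE9", 'ISO-8859-1')), "\n"; // 636166c3a9
```

The last two bytes, c3 a9, are the UTF-8 form of the one-byte 0xE9.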

--
Andy Hassall :: an**@andyh.co.uk :: http://www.andyh.co.uk
http://www.andyhsoftware.co.uk/space :: disk and FTP usage analysis tool
Jun 13 '07 #2

On 13 Jun, 21:25, "amygdala" <nore...@noreply.com> wrote:
> The way I test to see if the content is UTF8, is by opening the XML file in
> notepad and choose 'save as...'. Normally the coding option should be set to
> UTF8, but now it just shows ANSI.

ROFL.

Try inserting a BOM in front of the content.
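
For reference, the UTF-8 BOM is the three bytes EF BB BF at the very start of the file. A rough sketch of prepending it, assuming the string is already UTF-8; the constant `UTF8_BOM` and the helper `withUtf8Bom` are names invented for this example. The BOM is optional for UTF-8, so this mainly helps editors such as Notepad guess the encoding:

```php
<?php
const UTF8_BOM = "\xEF\xBB\xBF";

// Hypothetical helper: prepend a UTF-8 BOM unless one is already there.
function withUtf8Bom(string $utf8Text): string
{
    if (strncmp($utf8Text, UTF8_BOM, 3) === 0) {
        return $utf8Text; // avoid adding a second BOM
    }
    return UTF8_BOM . $utf8Text;
}

echo bin2hex(substr(withUtf8Bom('<urlset/>'), 0, 3)), "\n"; // efbbbf
```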

C.

Jun 13 '07 #3

"Andy Hassall" <an**@andyh.co.uk> wrote in message
news:nb********************************@4ax.com...
> On Wed, 13 Jun 2007 22:25:44 +0200, "amygdala" <no*****@noreply.com>
> wrote:
>> I'm trying to let PHP write a 'sitemap.xml' sitemap for Google and other
>> search engines. It's working, except that the content in the XML file
>> doesn't seem to be UTF8. (Which it should be, judging by the information
>> given on Google's webmaster helpcenter).
>>
>> The way I test to see if the content is UTF8, is by opening the XML file
>> in notepad and choose 'save as...'. Normally the coding option should be
>> set to UTF8, but now it just shows ANSI.
>
> Well, that's not a foolproof method...

I was afraid of that.

>> This is what I have tried to write UTF8 content with:
>>
>> file_put_contents( '.' . SITEMAP_FILE, utf8_encode( $this->sitemapForCrawlers ) );
>> ...and...
>> file_put_contents( '.' . SITEMAP_FILE, iconv( "ISO-8859-1", "UTF8", $this->sitemapForCrawlers ) );
>>
>> ...where...
>> SITEMAP_FILE is the filename constant
>> ...and...
>> $this->sitemapForCrawlers is the string with XML data
>>
>> With the last attempt I even got an error saying:
>>
>> Wrong charset, conversion from `ISO-8859-1' to `UTF8' is not allowed in...
>>
>> Any ideas of how I can make this work?
>
> Start from the beginning; what character set encoding is the original data
> in? The error implies that it's not ISO-8859-1 (which does have some gaps
> where characters aren't valid...)

Well... I discovered the 'Set Code Page...' option in UltraEdit, the main
editor I use to code PHP, and it tells me my PHP code files are encoded in
'1252 (ANSI - Latin I)'. So now my next question is: what would be the
correct first parameter for the iconv function to tell it that the original
data is '1252 (ANSI - Latin I)'? I've tried numerous strings, which include:

'1252 (ANSI - Latin I)'
'1252'
'1252 ANSI'
'1252-ANSI'
'ANSI-1252'
'ANSI 1252'

...and variations.

Is there any iconv encoding table with acceptable encodings I can consult?
Also, isn't '1252 (ANSI - Latin I)' just a pimped version of ISO-8859-1?

Although I'm still curious about this, please read my reply to C. also.

Thanks.
Jun 14 '07 #4

"C." <co************@gmail.com> wrote in message
news:11**********************@e9g2000prf.googlegroups.com...
> On 13 Jun, 21:25, "amygdala" <nore...@noreply.com> wrote:
>> The way I test to see if the content is UTF8, is by opening the XML file
>> in notepad and choose 'save as...'. Normally the coding option should be
>> set to UTF8, but now it just shows ANSI.
>
> ROFL.

Yes, very amusing. :-/

> Try inserting a BOM in front of the content.

OK, I did a little research on the BOM and came up with information that tells
me the BOM isn't particularly necessary for UTF-8. Then I ran a simple test
with utf8_encode:

<?php
echo utf8_encode( 'ï' );
?>

The output looks like it works just fine:

ï

Since I've concluded (see my reply to Andy Hassall) that my files are
encoded in '1252 (ANSI - Latin I)', and this test file was also '1252
(ANSI - Latin I)', I guess it works, and I don't necessarily have to provide a
BOM or resort to iconv. Correct?

Or am I missing something vital here?
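
One way to take the editor and the browser out of the picture is to look at the raw bytes. A small sketch, assuming the source character is the single byte 0xEF (which is 'ï' in both ISO-8859-1 and CP1252). It uses iconv in place of utf8_encode, since newer PHP releases deprecate the latter:

```php
<?php
$latin1 = "\xEF";                                // 'ï' as one ISO-8859-1 byte
$utf8   = iconv('ISO-8859-1', 'UTF-8', $latin1); // same mapping utf8_encode applies

echo bin2hex($latin1), "\n"; // ef
echo bin2hex($utf8), "\n";   // c3af
```

The two bytes C3 AF display as 'Ã' and '¯' when a viewer assumes a single-byte encoding such as CP1252, which is why correctly encoded UTF-8 output can look like 'ï' on screen.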

Thanks.
Jun 14 '07 #5

On Thu, 14 Jun 2007 03:39:34 +0200, "amygdala" <no*****@noreply.com> wrote:
>> Start from the beginning; what character set encoding is the original data
>> in? The error implies that it's not ISO-8859-1 (which does have some gaps
>> where characters aren't valid...)
>
> Well... I discovered the 'Set Code Page...' option in UltraEdit, the main
> editor I use to code PHP. And it tells me my PHP code files are encoded in
> '1252 (ANSI - Latin I)'.

Well... again, that's not foolproof. It's generally not possible to
definitively detect the encoding of a file. You can work out whether it's
impossible to be in a particular encoding (invalid characters or byte
sequences), and you can make some guesses on character distribution or
spellings of words, but unless it's tagged in some way (like HTML and XML, or
through another channel like HTTP headers) then it's not certain.
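
The "impossible to be in a particular encoding" check is the one part that can be automated reliably. A sketch, assuming the PCRE extension is available (it is in standard PHP builds); the function name `isValidUtf8` is made up for this example:

```php
<?php
// Returns true when $data is a well-formed UTF-8 byte sequence.
// With the /u modifier, preg_match() returns false on invalid UTF-8 input.
function isValidUtf8(string $data): bool
{
    return preg_match('//u', $data) === 1;
}

var_dump(isValidUtf8("caf\xC3\xA9")); // bool(true): C3 A9 is a valid two-byte sequence
var_dump(isValidUtf8("caf\xE9"));     // bool(false): a lone E9 byte is not valid UTF-8
```

Passing this check only shows that the bytes could be UTF-8. Pure ASCII text passes too, and it is equally valid in ISO-8859-1 and CP1252.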

"Windows Codepage 1252" is a Windows character set encoding that is similar,
but not identical, to ISO-8859-1. CP1252 assigns printable characters,
including the Euro sign at 0x80, to most of the 0x80-0x9F range, which
ISO-8859-1 reserves for control codes. ISO-8859-1 has no Euro sign at all.
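
The difference is easy to see on byte 0x80. A quick sketch, assuming the local iconv implementation knows both 'CP1252' and 'ISO-8859-1' (GNU libiconv and glibc do):

```php
<?php
$byte = "\x80";

// Under CP1252, 0x80 is the Euro sign (U+20AC), which is three bytes in UTF-8.
echo bin2hex(iconv('CP1252', 'UTF-8', $byte)), "\n";     // e282ac

// Under ISO-8859-1, 0x80 is a C1 control code (U+0080), two bytes in UTF-8.
echo bin2hex(iconv('ISO-8859-1', 'UTF-8', $byte)), "\n"; // c280
```

So text that contains Euro signs or "smart quotes" and is converted as ISO-8859-1 ends up with invisible control characters in place of those symbols.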

Do you have any Euro currency symbols in the file?

> So, now my next question is... what would be the
> correct first parameter for the iconv function to tell it that the original
> data is '1252 (ANSI - Latin I)'. I've tried numerous strings, which include:
>
> '1252 (ANSI - Latin I)'
> '1252'
> '1252 ANSI'
> '1252-ANSI'
> 'ANSI-1252'
> 'ANSI 1252'
>
> ...and variations.
>
> Is there any iconv encoding table with acceptable encodings I can consult?

http://www.gnu.org/software/libiconv/

You possibly want:

CP1252

> Also, isn't '1252 (ANSI - Latin I)' just a pimped version of ISO-8859-1?

I should read the entire message before typing ;-)
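
Combining the CP1252 suggestion with the original file_put_contents call, a hedged sketch might look like the following. It assumes the XML string really is CP1252; `writeSitemap`, the sample element and the temp-file path are placeholders, not part of the original poster's code:

```php
<?php
// Hypothetical helper: convert CP1252 XML to UTF-8 and write it to $path.
function writeSitemap(string $path, string $cp1252Xml): bool
{
    $utf8 = iconv('CP1252', 'UTF-8', $cp1252Xml);
    if ($utf8 === false) {
        return false; // conversion failed; nothing is written
    }
    // file_put_contents() writes the bytes as given; it does no encoding itself.
    return file_put_contents($path, $utf8) !== false;
}

$path = tempnam(sys_get_temp_dir(), 'sitemap');
$ok   = writeSitemap($path, "<note>caf\xE9</note>");
var_dump($ok);
```

The design point is that the conversion happens on the string, before writing. The file functions only ever see bytes.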

--
Andy Hassall :: an**@andyh.co.uk :: http://www.andyh.co.uk
http://www.andyhsoftware.co.uk/space :: disk and FTP usage analysis tool
Jun 14 '07 #6

> I'm trying to let PHP write a 'sitemap.xml' sitemap for Google and other
> search engines. It's working, except that the content in the XML file doesn't
> seem to be UTF8. (Which it should be, judging by the information given on
> Google's webmaster helpcenter).

How can you tell? YOU tell the system what encoding is used. The system
rarely tells you, as bytes can be perfectly valid text in a lot of
encodings and look very different in each of them.

Even if the system tells you, it usually does so separately from the
text itself. Which is obvious, because you need the encoding to be able
to read the text! In webpages and e-mail, for example, headers are used
to set the encoding of the data.
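
For an XML file, the place to "tell the system" is the XML declaration on the first line. A sketch under the assumption that the page URLs are already UTF-8 strings; `buildSitemap` is an invented name, and the namespace is the one published at sitemaps.org:

```php
<?php
// Hypothetical helper: build a minimal sitemap document from a list of URLs.
function buildSitemap(array $urls): string
{
    // The declaration states the encoding; the bytes written must match it.
    $xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    $xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
    foreach ($urls as $url) {
        // Escape &, <, > and quotes so the URL is safe inside an XML element.
        $escaped = htmlspecialchars($url, ENT_XML1 | ENT_QUOTES, 'UTF-8');
        $xml .= '  <url><loc>' . $escaped . "</loc></url>\n";
    }
    return $xml . "</urlset>\n";
}

echo buildSitemap(['http://example.com/?a=1&b=2']);
```

The declaration is only a label. If the string is actually CP1252, it still has to be converted before it is written.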

I suggest you search the net for encodings and how to work with them.
This is a good start:

http://www.joelonsoftware.com/articles/Unicode.html

Good luck with the onions,
--
Willem Bogaerts

Application smith
Kratz B.V.
http://www.kratz.nl/
Jun 15 '07 #7

"Willem Bogaerts" <w.********@kratz.maardanzonderditstuk.nl> wrote in
message news:46*********************@news.xs4all.nl...
>> I'm trying to let PHP write a 'sitemap.xml' sitemap for Google and other
>> search engines. It's working, except that the content in the XML file
>> doesn't seem to be UTF8. (Which it should be, judging by the information
>> given on Google's webmaster helpcenter).
>
> How can you tell? YOU tell the system what encoding is used. The system
> rarely tells you, as bytes can be perfectly valid text in a lot of
> encodings and look very different in each of them.

Yes, you're right of course, what was I thinking.

> Even if the system tells you, it usually does so separately from the
> text itself. Which is obvious, because you need the encoding to be able
> to read the text! In webpages and e-mail, for example, headers are used
> to set the encoding of the data.
>
> I suggest you search the net for encodings and how to work with them.
> This is a good start:
>
> http://www.joelonsoftware.com/articles/Unicode.html

Great article. Thanks for the pointer.

> Good luck with the onions,

Hopefully that won't be necessary anymore.
Cheers.

> --
> Willem Bogaerts
>
> Application smith
> Kratz B.V.
> http://www.kratz.nl/

Jun 16 '07 #8
