need to mass convert pages to UTF-8 encoding

Validator chokes on my pages now because I started sending a
character encoding header of UTF-8, but the pages are full of non-UTF-8
characters. Any quick way to convert them?

http://validator.w3.org/check?uri=ht...krubner.com%2F
Jul 17 '05 #1
On 29 Sep 2004 13:50:30 -0700, lk******@geocities.com (lawrence) wrote:
> Validator chokes on my pages now because I started sending a
> character encoding header of UTF-8, but the pages are full of non-UTF-8
> characters. Any quick way to convert them?
>
> http://validator.w3.org/check?uri=ht...krubner.com%2F


Convert them from what? You can't convert without knowing the source encoding.

What tools do you have available? I'd be inclined to knock up a short Perl
script using the Encode module and possibly File::Find::Rule if there's a lot
to change.
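
For anyone without Perl to hand, the same idea (walk the tree, decode each file from a known source encoding, write it back as UTF-8) can be sketched in Python. This is only a sketch: the `convert_tree` name, the suffix list, and the ISO-8859-1 default are assumptions for illustration, and it overwrites files in place, so try it on a copy first.

```python
from pathlib import Path


def convert_tree(root, source_encoding="iso-8859-1", suffixes=(".html", ".htm", ".php")):
    """Re-encode every matching file under root from source_encoding to UTF-8.

    Returns the list of paths that were rewritten. Files are overwritten in
    place, so work on a backup copy.
    """
    converted = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        raw = path.read_bytes()
        text = raw.decode(source_encoding)   # bytes -> text, using the assumed source encoding
        encoded = text.encode("utf-8")       # text -> UTF-8 bytes
        if encoded != raw:                   # pure-ASCII files come out unchanged
            path.write_bytes(encoded)
            converted.append(path)
    return converted
```

Note that the script has no way of knowing whether a file was already converted, so running it twice over the same tree would double-encode the non-ASCII characters.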

--
Andy Hassall / <an**@andyh.co.uk> / <http://www.andyh.co.uk>
<http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool
Jul 17 '05 #2
lawrence wrote:
> Validator chokes on my pages now because I started sending a
> character encoding header of UTF-8, but the pages are full of non-UTF-8
> characters. Any quick way to convert them?
>
> http://validator.w3.org/check?uri=ht...krubner.com%2F


http://www.php.net/utf8_decode

.... and check your meta tags

--
USENET would be a better place if everybody read: | to email me: use |
http://www.catb.org/~esr/faqs/smart-questions.html | my name in "To:" |
http://www.netmeister.org/news/learn2quote2.html | header, textonly |
http://www.expita.com/nomime.html | no attachments. |
Jul 17 '05 #3
On 29 Sep 2004 21:13:30 GMT, Pedro Graca <he****@hotpop.com> wrote:
> lawrence wrote:
>> Validator chokes on my pages now because I started sending a
>> character encoding header of UTF-8, but the pages are full of non-UTF-8
>> characters. Any quick way to convert them?
>>
>> http://validator.w3.org/check?uri=ht...krubner.com%2F
>
> http://www.php.net/utf8_decode
>
> ... and check your meta tags


Isn't that the wrong way around? He's sending non-UTF-8 data but flagging it as
UTF-8, resulting in errors - if the headers remain that way, then isn't what he
wants to encode it to UTF-8, not decode it?

The other big question is - why did the OP start sending UTF8 headers if he's
not actually sending UTF8?
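
To make the direction concrete: "encoding to UTF-8" means reading the existing bytes as ISO-8859-1 and writing the same characters out again as UTF-8, which is what PHP's utf8_encode() does; utf8_decode() goes the other way. A small Python sketch of both directions, with hypothetical function names chosen only to mirror the PHP ones:

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Roughly what PHP's utf8_encode() does: ISO-8859-1 bytes in, UTF-8 bytes out."""
    return data.decode("iso-8859-1").encode("utf-8")


def utf8_to_latin1(data: bytes) -> bytes:
    """Roughly what PHP's utf8_decode() does: UTF-8 bytes in, ISO-8859-1 bytes out.

    Characters with no ISO-8859-1 equivalent become '?'.
    """
    return data.decode("utf-8").encode("iso-8859-1", errors="replace")


page = b"caf\xe9"               # "café" stored as ISO-8859-1: one byte for the é
print(latin1_to_utf8(page))     # b'caf\xc3\xa9' (é is now a two-byte UTF-8 sequence)
```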

--
Andy Hassall / <an**@andyh.co.uk> / <http://www.andyh.co.uk>
<http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool
Jul 17 '05 #4
Andy Hassall wrote:
> On 29 Sep 2004 13:50:30 -0700, lk******@geocities.com (lawrence) wrote:
>>
>> [edited]
>> http://www.krubner.com/
>
> Convert them from what? You can't convert without knowing the source encoding.


According to his meta tags, convert from iso-8859-1 :)
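
Reading the declared encoding out of the meta tag can be automated too. A rough Python sketch, assuming the pages use the usual `<meta http-equiv="Content-Type" content="...; charset=...">` form. The regex and the `declared_charset` name are illustrative, and a declaration is only a claim about the bytes, not proof of what they really are:

```python
import re

# Matches charset=... inside a <meta ...> tag; good enough for typical hand-written HTML.
META_CHARSET = re.compile(rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE)


def declared_charset(html_bytes, default=None):
    """Return the charset named in the first meta tag that declares one, lowercased."""
    match = META_CHARSET.search(html_bytes[:4096])   # declarations sit near the top of the page
    return match.group(1).decode("ascii").lower() if match else default


sample = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"></head>'
print(declared_charset(sample))   # iso-8859-1
```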
--
USENET would be a better place if everybody read: | to email me: use |
http://www.catb.org/~esr/faqs/smart-questions.html | my name in "To:" |
http://www.netmeister.org/news/learn2quote2.html | header, textonly |
http://www.expita.com/nomime.html | no attachments. |
Jul 17 '05 #5
Andy Hassall wrote:
> On 29 Sep 2004 21:13:30 GMT, Pedro Graca <he****@hotpop.com> wrote:
>> lawrence wrote:
>>> [...] because I started sending a
>>> character encoding header of UTF-8, but the pages are full of non-UTF-8
>>> characters. Any quick way to convert them?
>>
>> http://www.php.net/utf8_decode
>>
>> ... and check your meta tags
>
> Isn't that the wrong way around? He's sending non-UTF-8 data but flagging it as
> UTF-8, resulting in errors - if the headers remain that way, then isn't what he
> wants to encode it to UTF-8, not decode it?

Yes, of course.

> The other big question is - why did the OP start sending UTF-8 headers if he's
> not actually sending UTF-8?

Maybe he wants to 'upgrade' his system to UTF-8 :)
... and he thought the easiest way would be to change

<?php
header('Content-Type: text/html; charset=UTF-8');
?>

and presto! :-)

--
USENET would be a better place if everybody read: | to email me: use |
http://www.catb.org/~esr/faqs/smart-questions.html | my name in "To:" |
http://www.netmeister.org/news/learn2quote2.html | header, textonly |
http://www.expita.com/nomime.html | no attachments. |
Jul 17 '05 #6
On 29 Sep 2004 21:31:42 GMT, Pedro Graca <he****@hotpop.com> wrote:
> Andy Hassall wrote:
>> On 29 Sep 2004 13:50:30 -0700, lk******@geocities.com (lawrence) wrote:
>>>
>>> [edited]
>>> http://www.krubner.com/
>>
>> Convert them from what? You can't convert without knowing the source encoding.
>
> According to his meta tags, convert from iso-8859-1 :)


Ah, I should have looked at that.

There does appear to be evidence of damage caused by improper encoding already
on there - various occurrences of '?' characters where there should be some
form of quote. Most likely from pasting in from Microsoft Word with the "smart
quotes" option turned on - ISO-8859-1 doesn't have the 'left' and 'right'
double quotes produced by this, only 'plain' double quotes.
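
The smart-quote point can be checked directly. Word's curly double quotes sit at bytes 0x93 and 0x94 in Windows-1252, while in ISO-8859-1 those byte values are invisible control characters. A short Python sketch (the sample bytes are made up for illustration) shows why decoding such pages as Windows-1252 rather than ISO-8859-1 may preserve the quotes, assuming the original bytes have not already been replaced by '?':

```python
raw = b"\x93smart quotes\x94"          # hypothetical bytes pasted from Word on Windows

as_latin1 = raw.decode("iso-8859-1")   # 0x93/0x94 become C1 control characters
as_cp1252 = raw.decode("cp1252")       # 0x93/0x94 become the curly double quotes

print(repr(as_latin1))                 # '\x93smart quotes\x94'
print(repr(as_cp1252))                 # '“smart quotes”'
print(as_cp1252.encode("utf-8"))       # b'\xe2\x80\x9csmart quotes\xe2\x80\x9d'
```

Where a literal '?' is already stored in the page, the original character is gone and no conversion can bring it back.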

--
Andy Hassall / <an**@andyh.co.uk> / <http://www.andyh.co.uk>
<http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool
Jul 17 '05 #7
aa
To convert ANSI to UTF-8, open the file in Notepad (Windows 2000 or XP) and save
it under the same name, but change the encoding to UTF-8.
Then either remove the charset tag or change it to UTF-8.

"lawrence" <lk******@geocities.com> wrote in message
news:da**************************@posting.google.com...
> Validator chokes on my pages now because I started sending a
> character encoding header of UTF-8, but the pages are full of non-UTF-8
> characters. Any quick way to convert them?
>
> http://validator.w3.org/check?uri=ht...krubner.com%2F

Jul 17 '05 #8
Andy Hassall <an**@andyh.co.uk> wrote in message news:<ko********************************@4ax.com>...
> On 29 Sep 2004 21:13:30 GMT, Pedro Graca <he****@hotpop.com> wrote:
>> lawrence wrote:
>>> Validator chokes on my pages now because I started sending a
>>> character encoding header of UTF-8, but the pages are full of non-UTF-8
>>> characters. Any quick way to convert them?
>>>
>>> http://validator.w3.org/check?uri=ht...krubner.com%2F
>>
>> http://www.php.net/utf8_decode
>>
>> ... and check your meta tags
>
> Isn't that the wrong way around? He's sending non-UTF-8 data but flagging it as
> UTF-8, resulting in errors - if the headers remain that way, then isn't what he
> wants to encode it to UTF-8, not decode it?
>
> The other big question is - why did the OP start sending UTF-8 headers if he's
> not actually sending UTF-8?

The current site was broken in the sense that it is a weblog and I'd
like to put an RSS feed on it, because all weblogs have RSS feeds
nowadays.
But RSS feeds won't validate if the feed is sent out without a
character encoding. So I have to give it a character encoding of some
kind. So I decided on UTF-8 after hashing it out some over on
comp.lang.php. And now that I'm forcing the issue, there is a lot of
code that was input previously that is balking.

The site has been built in a hodge-podge way over the last 6 years and
has debris from previous incarnations. The weblog software I now use
started long before I knew what a character encoding was. Developing
the software has been a process of finding out about stuff and then
trying to make the existing content fit whatever the new issue is. No
doubt this process will continue for many more years, as there will
always be things I don't know, and then there will always be new
technologies or uses I want to try for.

The way the software now works is that any input gets hit with
utf8_encode() and therefore any output from now on should be UTF-8.
But in the meantime I've got to clean up the old stuff.
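
One caveat with running utf8_encode() over all input: it assumes the input is ISO-8859-1, so anything that already arrives as UTF-8 gets encoded a second time. A guard that leaves valid UTF-8 alone is one way around that. A minimal Python sketch of the idea follows; `ensure_utf8` and the Windows-1252 fallback are assumptions for illustration, not part of the weblog software:

```python
def ensure_utf8(data: bytes, fallback="cp1252") -> bytes:
    """Return data as UTF-8, converting from the fallback encoding only when needed."""
    try:
        data.decode("utf-8")      # already valid UTF-8 (this includes plain ASCII)
        return data
    except UnicodeDecodeError:
        # Not valid UTF-8, so treat it as legacy text; undefined bytes become U+FFFD.
        return data.decode(fallback, errors="replace").encode("utf-8")


legacy = b"caf\xe9"               # ISO-8859-1 / Windows-1252 bytes
once = ensure_utf8(legacy)
twice = ensure_utf8(once)         # second pass is a no-op, so no double encoding
print(once, once == twice)        # b'caf\xc3\xa9' True
```

The check is a heuristic: legacy text whose bytes happen to form valid UTF-8 would be passed through untouched, though that is rare in ordinary Western European prose.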
Jul 17 '05 #9
Pedro Graca <he****@hotpop.com> wrote in message news:<sl*******************@ID-203069.user.uni-berlin.de>...
>> Isn't that the wrong way around? He's sending non-UTF-8 data but flagging it as
>> UTF-8, resulting in errors - if the headers remain that way, then isn't what he
>> wants to encode it to UTF-8, not decode it?
>
> Yes, of course.
>
>> The other big question is - why did the OP start sending UTF-8 headers if he's
>> not actually sending UTF-8?
>
> Maybe he wants to 'upgrade' his system to UTF-8 :)
> ... and he thought the easiest way would be to change
>
> <?php
> header('Content-Type: text/html; charset=UTF-8');
> ?>
>
> and presto! :-)


Well, as I said, all new input is encoded to UTF-8. It's the old stuff
that needs to be cleaned up. Simon Stienen was nice enough to suggest
a way of possibly finding the non-UTF-8 code:

http://groups.google.com/groups?hl=e...ngerouscat.net
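
Independently of the suggestion linked above (its details are not reproduced here), one generic way to locate the offending content is to try decoding it as UTF-8 and report where that fails. A Python sketch, with `find_invalid_utf8` as a made-up helper name:

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, byte_value) for each byte that cannot be part of valid UTF-8."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break                                  # the rest of the data is clean
        except UnicodeDecodeError as err:
            bad = pos + err.start                  # err.start is relative to the slice
            problems.append((bad, data[bad]))
            pos = bad + 1                          # skip the bad byte and keep scanning
    return problems


print(find_invalid_utf8(b"ok \x93quoted\x94 caf\xc3\xa9"))   # [(3, 147), (10, 148)]
```

Bytes 147 and 148 (0x93 and 0x94) in the sample are the Windows-1252 curly quotes, which is the kind of leftover likely to turn up in old pasted text. An empty list means the data is already valid UTF-8.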
Jul 17 '05 #10
"aa" <aa@virgin.net> wrote in message news:<41***********************@ptn-nntp-reader01.plus.net>...
> To convert ANSI to UTF-8, open the file in Notepad (Windows 2000 or XP) and save
> it under the same name, but change the encoding to UTF-8.
> Then either remove the charset tag or change it to UTF-8.


I have to do this for over 100 websites, and we're talking about many
thousands of pages, so it would be best to have an automated solution.
What I'm looking for is to recreate exactly the event you describe -
but what is it exactly that Notepad is doing?

Having looked into it, I guess I need to do something with
utf8_encode(), but I haven't decided what yet.

http://us4.php.net/manual/en/function.utf8-encode.php
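
In rough terms, what Notepad does on such a save is: interpret the file's bytes in the system's ANSI code page (Windows-1252 on a typical Western European or US install), then write the same characters out as UTF-8 with a byte order mark in front. An automated equivalent could look like the Python sketch below. The function name, the cp1252 assumption, and the meta-tag regex are illustrative only, and the BOM is left off by default because it tends to get in the way of PHP output.

```python
import re

# Captures everything up to and including "charset=" inside a meta tag.
CHARSET_IN_META = re.compile(rb"(<meta[^>]+charset\s*=\s*[\"']?)[A-Za-z0-9_.:-]+", re.IGNORECASE)


def notepad_style_convert(raw: bytes, ansi_codepage="cp1252", add_bom=False) -> bytes:
    """Re-encode ANSI bytes as UTF-8 and point any meta charset declaration at UTF-8."""
    text = raw.decode(ansi_codepage, errors="replace")    # step 1: ANSI bytes -> characters
    out = text.encode("utf-8")                            # step 2: characters -> UTF-8 bytes
    out = CHARSET_IN_META.sub(rb"\g<1>utf-8", out)        # step 3: fix the declared charset
    return (b"\xef\xbb\xbf" + out) if add_bom else out    # Notepad itself prepends this BOM
```

Like utf8_encode(), this should only be applied to pages that are still in the old encoding; applied to a page that is already UTF-8 it would double-encode it.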
Jul 17 '05 #11
