Encoding/characterset/font family confusion

Erwin Moller

Hi group,

I could use a bit of guidance on the following matter.

I am starting a new project now and must make some decisions regarding
encoding.
Environment: PHP4.3, Postgres7.4.3

I must be able to receive form information, store it in a database, and
later display it on screen on the client (just plain HTML).
Nothing special. I have done this for many years, but I never paid much
attention to special characters.

A few days ago I discovered that the euro sign is not defined in all
font families.
They cannot produce the right sign no matter whether I use &euro; or the
hexadecimal equivalent.
After a little research I found I could put font tags around the euro sign
with another font family (Arial in this case) to get the euro sign.

I am completely graphically impaired and only understand programming code (and
HTML/JavaScript of course), so this is a weak point on my side, hence this
question.

I am targeting Europe only at the moment (no need for Chinese
character support).
That said, does the following setup make sense?

Postgresql db encoding scheme: LATIN1
In the headers of all my HTML: content-type: text/html charset: iso-8859-1

A few related questions:
1) Will people be able to copy/paste info from other sources (like
wordprocessing programs and other websites) into my forms?

2) Can I use regular expressions as I am used to (ASCII) in my PHP code?
Will I match e acute, eurosign, etc?

3) Will the round trip described below have problems with the normally
expected European characters?

client copies some text from some source ->
paste in the form ->
receive by PHP ->
insert in Postgresql (or update) ->
retrieve from postgresql ->
display as HTML (with content-type: text/html charset: iso-8859-1)

Is that OK?
Any pitfalls?
Should I maybe use UTF-8?

Any pointers are hugely appreciated because, to me, this is all quite
confusing.

Thanks in advance!

Regards,
Erwin Moller

Mar 30 '07 #1

Willem Bogaerts

A few days ago I discovered that the euro sign is not defined in all
font families.

This is a client issue; there is nothing you can do about it. All you can
do is use HTML entities (&euro;) so the browser knows what you mean (and
maybe switches fonts, depending on how intelligent the browser is).

They cannot produce the right sign no matter if I use € or the
hexadecimal equivalent.
After a little research I found I could put font-tags around the euro-sign
with another font-family (Arial in this case) to get the Euro sign.

I am completely graphical impaired, and only understand programmingcode (and
HTML/JavaScript of course) , so this is a weak point on my side, hence this
question.

I target on Europe only at the moment (no need for Chineese
charactersupport)
That said, will the following setup make sense?

Postgresql db encoding scheme: LATIN1
In the headers of all my HTML: content-type: text/html charset: iso-8859-1

Latin-1 does not include a euro sign at all. However, latin-1 is
sometimes replaced by enhanced encodings (like cp-1252 or Windows
encoding) and the euro sign does appear.
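
To make the difference between these encodings concrete, here is a minimal sketch, assuming PHP 5 or later with the mbstring extension loaded. It prints the byte that each encoding uses for the euro sign. ISO-8859-15 is included for comparison because it is the ISO revision that added the euro sign; the variable names are only illustrative.

```php
<?php
// The euro sign (U+20AC) written as UTF-8 bytes.
$euro = "\xE2\x82\xAC";

// Windows-1252 (cp-1252) puts the euro sign at byte 0x80.
$cp1252 = bin2hex(mb_convert_encoding($euro, 'Windows-1252', 'UTF-8'));

// ISO-8859-15 puts it at byte 0xA4.
$latin9 = bin2hex(mb_convert_encoding($euro, 'ISO-8859-15', 'UTF-8'));

// ISO-8859-1 has no slot for it, so mbstring emits a substitute byte.
$latin1 = bin2hex(mb_convert_encoding($euro, 'ISO-8859-1', 'UTF-8'));

echo "cp-1252: $cp1252, iso-8859-15: $latin9, iso-8859-1: $latin1\n";
```

A page that is labelled iso-8859-1 but contains byte 0x80 only shows a euro sign because the browser quietly treats the label as cp-1252.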

A few related questions:
1) Will people be able to copy/paste info from other sources (like
wordprocessing programs and other websites) into my forms?

In short: yes. It is up to the browser to convert the encoding to the
one used by the OS. I never had any trouble with it.

2) Can I use regular expressions as I am used to (ASCII) in my PHP code?
Will I match e acute, eurosign, etc?

Yes. All latin-1 characters are just one byte. No problem.

3) Will the roundtrip describe here under have problems with normal expected
european characters?

client copies some text from some source ->
paste in the form ->
receive by PHP ->
insert in Postgresql (or update) ->
retrieve from postgresql ->
display as HTML (with content-type: text/html charset: iso-8859-1)

Is that OK?
Any pitfalls?
Should I maybe use UTF-8?

I switched to using utf-8 a few months ago, and I still have trouble
with it. For some vague reason, you can set all encoding startup
variables to utf-8, and connections are STILL made with latin-1 unless
you specifically use the SET NAMES command. Someone wrote an article
"utf-8, love at fifth site". That is so true! It can do a lot, but it is
a real hell to configure all systems to use it. Furthermore, the
implementations are all non-encoding-aware. The problem is that a text
always has an encoding, while a string does not. And texts are treated
as strings, so with every string operation, you will have to make sure
that the correct encoding is used.
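
The point about strings carrying no encoding can be illustrated with a short sketch, assuming the mbstring extension is available. The plain string functions count bytes, while the mb_* functions count characters once they are told the encoding; the sample text is made up.

```php
<?php
// "café" encoded as UTF-8: the é takes two bytes (0xC3 0xA9).
$text = "caf\xC3\xA9";

$bytes = strlen($text);              // counts bytes
$chars = mb_strlen($text, 'UTF-8');  // counts characters

// substr() cuts on byte offsets and can split a character in half;
// mb_substr() cuts on character offsets.
$broken = substr($text, 0, 4);            // ends in a lone 0xC3 byte
$whole  = mb_substr($text, 0, 4, 'UTF-8');

echo "$bytes bytes, $chars characters\n";
```

Nothing in `$text` itself says it is UTF-8, which is why the encoding has to be passed along with every call.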

>
Any pointers are hugely appriciated because, to me, this is all quite
confusing.

Here are some links:
http://www.phpwact.org/php/i18n/charsets
http://www.gravitonic.com/downloads/...hp_unicode.pdf

Best regards
--
Willem Bogaerts

Application smith
Kratz B.V.
http://www.kratz.nl/

Mar 30 '07 #2

Erwin Moller

Willem Bogaerts wrote:

>A few day ago I discovered that the euro-sign is not defined in all
fontfamilies.

This is a client issue - nothing you can do about. All you can do is
using HTML entities (€) so the browser knows what you mean (and
maybe switch fonts, depending on how intelligent the browser is)

>They cannot produce the right sign no matter if I use € or the
hexadecimal equivalent.
After a little research I found I could put font-tags around the
euro-sign with another font-family (Arial in this case) to get the Euro
sign.

I am completely graphical impaired, and only understand programmingcode
(and HTML/JavaScript of course) , so this is a weak point on my side,
hence this question.

I target on Europe only at the moment (no need for Chineese
charactersupport)
That said, will the following setup make sense?

Postgresql db encoding scheme: LATIN1
In the headers of all my HTML: content-type: text/html charset:
iso-8859-1

Latin-1 does not include a euro sign at all. However, latin-1 is
sometimes replaced by enhanced encodings (like cp-1252 or Windows
encoding) and the euro sign does appear.

>A few related questions:
1) Will people be able to copy/paste info from other sources (like
wordprocessing programs and other websites) into my forms?

In short: yes. It is up to the browser to convert the encoding to the
one used by the OS. I never had any trouble with it.

>2) Can I use regular expressions as I am used to (ASCII) in my PHP code?
Will I match e acute, eurosign, etc?

Yes. All latin-1 characters are just one byte. No problem.

>3) Will the roundtrip describe here under have problems with normal
expected european characters?

client copies some text from some source ->
paste in the form ->
receive by PHP ->
insert in Postgresql (or update) ->
retrieve from postgresql ->
display as HTML (with content-type: text/html charset: iso-8859-1)

Is that OK?
Any pitfalls?
Should I maybe use UTF-8?

I switched to using utf-8 a few months ago, and I still have trouble
with it. For some vague reason, so can set all encoding startup
variables to utf-8, and connections are STILL made with latin-1 unless
you specifically use the SET NAMES command. Someone wrote an article
"utf-8, love at fifth site". That is so true! It can do a lot, but it is
a real hell to configure all systems to use it. Furthermore, the
implementations are all non-encoding-aware. The problem is that a text
always has an encoding, while a string does not. And texts are treated
as strings, so with every string operation, you will have to make sure
that the correct encoding is used.

>>
Any pointers are hugely appriciated because, to me, this is all quite
confusing.

Here are some links:
http://www.phpwact.org/php/i18n/charsets
http://www.gravitonic.com/downloads/...hp_unicode.pdf

Best regards

Thank you Willem.
Exactly the kind of info I needed to read.

I like the link to www.joelonsoftware.com/articles/Unicode.html
He describes a type of programmer that exactly fits me: the one trying
to ignore issues with character sets. :-)

So I have an announcement to make: if you are a programmer working in 2003
and you don't know the basics of characters, character sets, encodings, and
Unicode, and I catch you, I'm going to punish you by making you peel onions
for 6 months in a submarine. I swear I will.

And one more thing: IT'S NOT THAT HARD.

In this article I'll fill you in on exactly what every working programmer
should know. All that stuff about "plain text = ascii = characters are 8
bits" is not only wrong, it's hopelessly wrong, and if you're still
programming that way, you're not much better than a medical doctor who
doesn't believe in germs. Please do not write another line of code until
you finish reading this article.

I think I will follow his advice (threat). ;-)
Time to grow up/read up.

Thanks.

Regards,
Erwin Moller

Mar 30 '07 #3

Toby A Inkster

Erwin Moller wrote:

A few day ago I discovered that the euro-sign is not defined in all
fontfamilies.

Browsers are *supposed* to switch fonts when they encounter a character
that does not exist in the current font. Unfortunately, Internet Explorer
is famously bad at this.

They cannot produce the right sign no matter if I use € or the
hexadecimal equivalent.

Yep -- it's not a problem with the way you've specified the character,
just a problem that the browser is trying to display it using a font that
doesn't contain that character.

After a little research I found I could put font-tags around the euro-sign
with another font-family (Arial in this case) to get the Euro sign.

If you care about the symbol being rendered correctly in legacy browsers,
then this is the best solution. Either change the fonts of your whole
page, or use a little PHP+HTML+CSS:

$str = str_replace('&euro;',
    '<acronym class="e" title="euro">&euro;</acronym>',
    $str);

With CSS:

acronym.e { border-bottom:none; font-family: "Arial"; }

If you use output buffering, then you should be able to do this with
minimal code changes.
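
The output-buffering approach could look roughly like the following sketch. The callback name `wrap_euro` is hypothetical, and the code assumes the page writes the euro sign as the `&euro;` entity.

```php
<?php
// Hypothetical callback: wraps every &euro; entity in the markup that
// the acronym.e CSS rule targets.
function wrap_euro($html)
{
    return str_replace(
        '&euro;',
        '<acronym class="e" title="euro">&euro;</acronym>',
        $html
    );
}

// Everything echoed after ob_start() passes through wrap_euro()
// when the buffer is flushed.
ob_start('wrap_euro');
echo "<p>Price: 10 &euro;</p>\n";
ob_end_flush();
```

Only the ob_start() call has to be added at the top of each script; the code that produces the page stays as it is.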

I'm actually doing something fairly similar on a current project, but with
ampersands instead of euro-signs. I wanted them all rendered in a
particular font which has an especially nice ampersand, but didn't want
the rest of the page to appear in that font.

Also, take a look at Jukka's page on the euro sign:
http://www.cs.tut.fi/~jkorpela/html/euro.html

As far as character sets are concerned, do not worry too much. HTML
documents effectively have two character sets: the one they're transmitted
in and the one they're translated into by the browser. The one they're
translated into is always Unicode, so always includes the euro symbol. So
you just need to worry about the one they're transmitted in -- you've
chosen ISO-8859-1, which does not include the euro symbol, but all that
means is that you need to use an entity instead -- you can't just type in
a raw €.
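
As a rough sketch of that idea, assuming the text is held as UTF-8 inside PHP and the mbstring extension is loaded, the helper below (a hypothetical name) produces bytes that are safe to send in a page declared as ISO-8859-1. Characters that ISO-8859-1 cannot represent become numeric character references.

```php
<?php
// Hypothetical helper: UTF-8 text in, ISO-8859-1 HTML out.
function utf8_to_latin1_html($utf8)
{
    // Escape HTML metacharacters first, so the "&" of the numeric
    // references added below is not escaped a second time.
    $escaped = htmlspecialchars($utf8, ENT_QUOTES, 'UTF-8');

    // Characters above U+00FF do not exist in ISO-8859-1:
    // rewrite them as decimal references such as &#8364;.
    $map = array(0x100, 0x10FFFF, 0, 0x1FFFFF);
    $with_refs = mb_encode_numericentity($escaped, $map, 'UTF-8');

    // What is left fits in ISO-8859-1, one byte per character.
    return mb_convert_encoding($with_refs, 'ISO-8859-1', 'UTF-8');
}

echo utf8_to_latin1_html("caf\xC3\xA9: 5 \xE2\x82\xAC"), "\n";
```

The é is sent as the single byte 0xE9, while the euro sign travels as the ASCII text `&#8364;`, which the browser turns back into the Unicode character.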

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact
Geek of ~ HTML/SQL/Perl/PHP/Python*/Apache/Linux

* = I'm getting there!

Mar 30 '07 #4

Willem Bogaerts

..., and I catch you, I'm going to punish you by making you peel onions
for 6 months in a submarine. I swear I will.

Good luck with the onions!

And one more thing: IT'S NOT THAT HARD.

I completely disagree. The theory is not hard at all, but the difference
between strings and texts is one that I have never encountered on the
web. Encodings simply are not linked with the strings themselves, and
that makes it almost impossible. And it is really hard to find which
programs do translate encodings, and which don't. MySQL alone has far
too many encoding settings that are counter-intuitive at best. Also, the
complete lack of proper escaping possibilities makes it even more
difficult, unless you want to escape the characters by turning them into
HTML entities.

Best regards,
--
Willem Bogaerts

Application smith
Kratz B.V.
http://www.kratz.nl/

Mar 30 '07 #5

Erwin Moller

Toby A Inkster wrote:

Hi Toby,

Erwin Moller wrote:

>A few day ago I discovered that the euro-sign is not defined in all
fontfamilies.

Browsers are *supposed* to switch fonts when they encounter a character
that does not exist in the current font. Unfortunately, Internet Explorer
is famously bad at this.

Yes, the 'browser' in question was IE6.

Changing the font family to Arial helped, but from a designer's point of
view it would have been nicer to have the same font all over the page.

>
>They cannot produce the right sign no matter if I use € or the
hexadecimal equivalent.

Yep -- it's not a problem with the way you've specified the character,
just a problem that the browser is trying to display it using a font that
doesn't contain that character.

>After a little research I found I could put font-tags around the
euro-sign with another font-family (Arial in this case) to get the Euro
sign.

If you care about the symbol being rendered correctly in legacy browsers,
then this is the best solution. Either change the fonts of your whole
page, or use a little PHP+HTML+CSS:

I tried to switch to Arial, but my designer doesn't let me change the font
for the site.
These people are so strict!
I stopped asking what is wrong with monospaced fonts, such questions are
dangerous for my wellbeing. ;-)

>
$str = str_replace('€',
'<acronym class="e" title="euro">€</acronym>',
$str);

With CSS:

acronym.e { border-bottom:none; font-family: "Arial"; }

If you use output buffering, then you should be able to do this with
minimal code changes.

I'm actually doing something fairly similar on a current project, but with
ampersands instead of euro-signs. I wanted them all rendered in a
particular font which has an especially nice ampersand, but didn't want
the rest of the page to appear in that font.

Also, take a look at Jukka's page on the euro sign:
http://www.cs.tut.fi/~jkorpela/html/euro.html

Found that page earlier in my quest for the euro.

Thanks for your comment.

Regards,
Erwin Moller

>
As far as character sets are concerned, do not worry too much. HTML
documents effectively have two character sets: the one they're transmitted
in and the one they're translated into by the browser. The one they're
translated into is always Unicode, so always includes the euro symbol. So
you just need to worry about the one they're transmitted in -- you've
chosen ISO-8859-1, which does not include the euro symbol, but all that
means is that you need to use an entity instead -- you can't just type in
a raw €.

Mar 30 '07 #6

Erwin Moller

Willem Bogaerts wrote:

>..., and I catch you, I'm going to punish you by making you peel onions
for 6 months in a submarine. I swear I will.

Good luck with the onions!

>And one more thing: IT'S NOT THAT HARD.

I completely disagree. The theory is not hard at all, but the difference
between strings and texts is one that I have never encountered on the
web. Encodings simply are not linked with the strings themselves, and
that makes it almost impossible. And it is really hard to find which
programs do translate encodings, and which don't. MySQL alone has far
too many encoding settings that are counter-intuitive at best. Also, the
complete lack of proper escaping possibilities makes it even more
difficult, unless you want to escape the characters by turning them into
HTML entities.

Best regards,

Yes, I completely agree.
After reading through a few pages of UTF-8, and after discovering that the
length of strings in PHP is something completely different than I want to
see, I decided to NOT use UTF-8 (yet).
Maybe when PHP6 is out, and debugged, and I switch my server to PHP6,
well... maybe then.

I'll stick to LATIN1 for now, that is something I understand, and something
I can use my existing functionlib on without a headache.

In case my app becomes so popular people outside Europe want it, I'll dive
into Unicode again. ;-)

Thanks for your time!

Regards,
Erwin Moller

Mar 30 '07 #7

Umberto Salsi

Erwin Moller <si******************************************@spam yourself.com> wrote:

Hi group,

I could use a bit of guidance on the following matter.

I am starting a new project now and must make some decisions regarding
encoding.
Environment: PHP4.3, Postgres7.4.3

OK for PostgreSQL, but since you are starting a new project, it is better
to use PHP 5.

I must be able to receive forminformation and store that in a database and
later produce it on screen on the client (just plain HTML).
Nothing special. I do this for many years, but I never paid a lot of
attention to special characters.

A few day ago I discovered that the euro-sign is not defined in all
fontfamilies.
They cannot produce the right sign no matter if I use € or the
hexadecimal equivalent.
After a little research I found I could put font-tags around the euro-sign
with another font-family (Arial in this case) to get the Euro sign.

I am completely graphical impaired, and only understand programmingcode (and
HTML/JavaScript of course) , so this is a weak point on my side, hence this
question.

I target on Europe only at the moment (no need for Chineese
charactersupport)
That said, will the following setup make sense?

Postgresql db encoding scheme: LATIN1
In the headers of all my HTML: content-type: text/html charset: iso-8859-1

Latin1 (aka ISO-8859-1) does not include the Euro sign.
ISO-8859-15 is a revision of it that adds the Euro sign.
Since more and more countries are joining the European Community, Latin1
cannot cover all the writing systems (Polish and Turkish people will
have problems entering their names and addresses, for example).

UTF-8 (the most used encoding of the UNICODE charset) would be the best
solution, since it includes ALL the charsets currently used in the world,
Euro sign included.

A few related questions:
1) Will people be able to copy/paste info from other sources (like
wordprocessing programs and other websites) into my forms?

Browsers all work internally in UNICODE: pages are converted from the
encoding of the page (ISO-8859-1 in your case) to UNICODE once received,
then data provided by the user is converted back to the original encoding
of the page (ISO-8859-1) before being sent back to the server; characters
that do not fit that encoding are coded as &#NNN;, where NNN is the decimal
UNICODE code point of the character. UTF-8 (the recommended interchange
encoding for the UNICODE charset) is definitely the best choice.

2) Can I use regular expressions as I am used to (ASCII) in my PHP code?
Will I match e acute, eurosign, etc?

preg_*() functions support the /u modifier for UTF-8 strings, required
only if the pattern contains non-ASCII chars.
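
A small sketch of what the /u modifier changes, assuming PHP's PCRE extension was built with UTF-8 support (the usual case). Besides allowing non-ASCII characters in the pattern, /u also decides whether `.` and character classes match a byte or a whole character in the subject. The variable names are illustrative.

```php
<?php
// "é" as UTF-8 is two bytes: 0xC3 0xA9.
$e_acute = "\xC3\xA9";

// Without /u the subject is treated as bytes, so "." sees two units
// and a pattern for exactly one character does not match.
$without_u = preg_match('/^.$/', $e_acute);

// With /u the subject is decoded as UTF-8, so "." sees one character.
$with_u = preg_match('/^.$/u', $e_acute);

// \p{L} matches any letter, accented ones included, when used with /u.
$is_word = preg_match('/^\p{L}+$/u', "caf" . $e_acute);

echo "without /u: $without_u, with /u: $with_u, letters only: $is_word\n";
```

So a pattern written entirely in ASCII can still need /u when the subject is UTF-8 and the pattern counts or classifies characters.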

3) Will the roundtrip describe here under have problems with normal expected
european characters?

client copies some text from some source ->

Good programs and OS should copy the text as UNICODE chars.

paste in the form ->

Since browsers internally already use UNICODE, no conversion takes place here.

receive by PHP ->

The browser converts the text into the encoding of the page containing the
FORM. PHP handles every string as a sequence of bytes, whatever its charset
or encoding may be.

Every string must be validated:

$s = (string) $_POST['address'];
# Remove ASCII control chars 0-31 and 127:
$s = preg_replace("/[\\000-\\037\\177]/", "", $s);
# Ensure the UTF-8 encoding; bad sequences are removed or replaced:
$s = mb_convert_encoding($s, 'UTF-8', 'UTF-8');
# Ensure the max length (50 chars); mb_substr() counts characters,
# whereas mb_strcut() would count bytes:
if( mb_strlen($s, 'UTF-8') > 50 )
    $s = mb_substr($s, 0, 50, 'UTF-8');

insert in Postgresql (or update) ->

PostgreSQL requires you to declare the encoding when the DB is created.
For example

CREATE DATABASE mydb WITH TEMPLATE = template0 ENCODING = 'UNICODE';

will create a new DB where all the text fields are UTF-8, so VARCHAR(50)
might actually store up to 50*4 bytes (UTF-8 as currently defined uses at
most 4 bytes per character). The non-standard PostgreSQL type
TEXT is often more convenient, since the manual states that VARCHAR and
TEXT are treated internally in exactly the same way; enforcing the max
length of every field can be left to the web interface implemented in PHP,
as in the example above.

$db = pg_connect("dbname=mydb");

# The strings we are sending to the DB server are encoded
# as UTF-8; since the DB we created already uses UTF-8, no
# conversion takes place between PHP and the DB:
pg_set_client_encoding($db, "UTF-8");

pg_query($db, "INSERT INTO sometable (aString) VALUES "
    . "('" . pg_escape_string($s) . "')");
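
As an alternative to building the SQL by hand, a sketch using pg_query_params() is shown below. It assumes PHP 5.1 or later with the pgsql extension and a PostgreSQL 7.4 or later server; the helper name `insert_address` is hypothetical, and the table and column names are the ones used in the post.

```php
<?php
// Hypothetical helper; "sometable" and "aString" are the names from the post.
function insert_address($db, $s)
{
    // $1 is a placeholder: the value travels separately from the SQL
    // text, so no manual quoting or escaping is needed.
    return pg_query_params(
        $db,
        'INSERT INTO sometable (aString) VALUES ($1)',
        array($s)
    );
}
```

It would be called as insert_address($db, $s) after the pg_connect() and pg_set_client_encoding() calls shown above. The bytes in $s are still sent as they are, so the client encoding has to match what the string really contains.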

retrieve from postgresql ->

$db = pg_connect("dbname=mydb");
pg_set_client_encoding($db, "UTF-8");
$table = pg_query($db, "SELECT * FROM sometable");

A UTF-8 string is returned.

display as HTML (with content-type: text/html charset: iso-8859-1)

(there is a missing ";" before "charset", and the parameter is written
"charset=iso-8859-1", with "=" rather than ":")

If the encoding of the DB matches that of the page, no conversion is required.

If the string appears as HTML text, apply htmlspecialchars():

echo "Your address is: " . htmlspecialchars($s);

If the string must be inserted inside an attribute, enclose between double
quotes and apply htmlspecialchars():

echo "<input type=text value=\"" . htmlspecialchars($s) . "\" name=xxx>";

If the string must be inserted inside a <textarea>, apply htmlspecialchars()
only; nl2br() would make literal <br /> tags appear in the edit box:

echo
"Your new address: <textarea name=newaddress>\n", # required \n
htmlspecialchars($s),
"</textarea>";

Is that OK?
Any pitfalls?

Don't try to dereference single chars from an UTF-8 string.
Don't use str*(), always use their mb_str*() counterpart.

Every static HTML page must be UTF-8 encoded and must contain
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

Every PHP page must be UTF-8 encoded and must contain
header("Content-Type: text/html; charset=UTF-8");

Should I maybe use UTF-8?

Definitely.

Regards,
___
/_|_\ Umberto Salsi
\/_\/ www.icosaedro.it

Mar 30 '07 #8

Toby A Inkster

Umberto Salsi wrote:

Erwin Moller wrote:

>I am starting a new project now and must make some decisions regarding
encoding. Environment: PHP4.3, Postgres7.4.3

Ok for PostgreSQL, but since you are starting a new project better to use
PHP 5.

Well, if we're talking database upgrades, PostgreSQL 8.x has been out for
over two years. It includes full text indexing via the TSearch2 module.
(TSearch2 was previously available as an add-on for 7.4.x, but the newer
versions work better.)

And from 8.1 onwards, OIDs are disabled by default, Erwin! ;-)

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact
Geek of ~ HTML/SQL/Perl/PHP/Python*/Apache/Linux

* = I'm getting there!

Mar 30 '07 #9
