
Having trouble converting a few characters using the htmlentities function

Hi,

We are using the normal HTML controls (textarea) in the posting form.
The form page has the UTF-8 character set.

Users are copying the text from MS Word or OpenOffice docs etc.

Our PHP code is handling the conversion of RTF text characters and UTF-8
characters into HTML entities (e.g. & is being converted to &amp; by
the built-in PHP function 'htmlentities').

However, many common characters/symbols are not being converted
properly. When I say common, even the ones like '-' (hyphen) are not
being converted by htmlentities. They get converted to junk characters
like ?? or &((&^.

Is there any fix for this problem?

regards,

Mahesh
Aug 14 '08 #1
4 Replies


On Aug 14, 11:24 am, BG Mahesh <mah...@mahesh.com> wrote:
> The form page has the UTF-8 character set.
> However, many common characters/symbols are not being converted
> properly. When I say common, even the ones like '-' (hyphen) are not
> being converted by htmlentities. They get converted to junk characters
> like ?? or &((&^.

Hyphen does not need to be converted by htmlentities. It can appear in
HTML just like any letter or number.

However, the source of your problem is probably that you are reading
UTF-8 data but not outputting it as UTF-8. The hyphen may be some
special hyphen that occupies more than one byte. If you print this as
ASCII or Latin-1 or anything other than UTF-8, something other than a
hyphen will show.
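
To make the byte-level picture concrete, here is a minimal sketch of the mismatch described above. It assumes PHP 7 or later (for the `\u{...}` escape) and the mbstring extension; the variable names are illustrative only. It takes an en dash, one of the characters a word processor may substitute for a typed hyphen, shows its UTF-8 bytes, and then simulates what happens when those bytes are misread as ISO-8859-1.

```php
<?php
// An en dash (U+2013), a typical "special hyphen" pasted from a word processor.
$dash = "\u{2013}";

// In UTF-8 this single character is stored as three bytes.
echo bin2hex($dash), "\n"; // e28093

// Simulate the bug: treat every UTF-8 byte as its own ISO-8859-1 character
// and re-encode the result as UTF-8.
$misread = mb_convert_encoding($dash, 'UTF-8', 'ISO-8859-1');

// One character has turned into three, which is what shows up as junk.
echo mb_strlen($misread, 'UTF-8'), "\n"; // 3
```

If the page is served with a UTF-8 declaration, for example via `header('Content-Type: text/html; charset=UTF-8');` before any output, the browser decodes the three bytes back into a single dash.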
Aug 14 '08 #2

BG Mahesh wrote:
> Hi,
>
> We are using the normal HTML controls (textarea) in the posting form.
> The form page has the UTF-8 character set.
>
> Users are copying the text from MS Word or OpenOffice docs etc.
>
> Our PHP code is handling the conversion of RTF text characters and UTF-8
> characters into HTML entities (e.g. & is being converted to &amp; by
> the built-in PHP function 'htmlentities').
>
> However, many common characters/symbols are not being converted
> properly. When I say common, even the ones like '-' (hyphen) are not
> being converted by htmlentities. They get converted to junk characters
> like ?? or &((&^.
>
> Is there any fix for this problem?
>
> regards,
>
> Mahesh

MS Word has a habit of converting a pair of hyphens to a dash (see
AutoCorrect options, tab AutoFormat) and changing 3 full stops to an
ellipsis (see AutoCorrect options, tab AutoCorrect).
It is these that are causing your problems, for the reason that Sjoerd
explains.
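
If plain ASCII punctuation is acceptable for the stored text, one possible approach (not something proposed in this thread) is to map Word's typographic substitutions back to their typed equivalents before any further processing. The sketch below assumes PHP 7 or later and UTF-8 input; the function name `normalize_word_punctuation` is hypothetical and the character list is illustrative, not exhaustive.

```php
<?php
// Replace typographic characters commonly inserted by Word's AutoFormat and
// AutoCorrect with plain ASCII equivalents. Assumes $text is valid UTF-8.
function normalize_word_punctuation(string $text): string
{
    return strtr($text, [
        "\u{2013}" => '-',   // en dash
        "\u{2014}" => '--',  // em dash
        "\u{2026}" => '...', // horizontal ellipsis
        "\u{2018}" => "'",   // left single quotation mark
        "\u{2019}" => "'",   // right single quotation mark
        "\u{201C}" => '"',   // left double quotation mark
        "\u{201D}" => '"',   // right double quotation mark
    ]);
}

$pasted = "Wait\u{2026} 10\u{2013}20 items";
echo normalize_word_punctuation($pasted), "\n"; // Wait... 10-20 items
```

This loses the typographic distinction, so it only makes sense where the original dashes and quotes do not need to be preserved.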
Aug 14 '08 #3

I V
On Thu, 14 Aug 2008 02:24:31 -0700, BG Mahesh wrote:
> We are using the normal HTML controls (textarea) in the posting form.
> The form page has the UTF-8 character set.
>
> Users are copying the text from MS Word or OpenOffice docs etc.
>
> Our PHP code is handling the conversion of RTF text characters and UTF-8
> characters into HTML entities (e.g. & is being converted to &amp; by the
> built-in PHP function 'htmlentities').
>
> However, many common characters/symbols are not being converted properly.
> When I say common, even the ones like '-' (hyphen) are not being
> converted by htmlentities. They get converted to junk characters like ??
> or &((&^.

htmlentities assumes an ISO-8859-1 character set by default; so, it will
misinterpret the UTF-8 characters supplied by your users. You could
specify the character set explicitly with

htmlentities($some_utf8_string, ENT_COMPAT, 'UTF-8')

or you could use htmlspecialchars, which only converts ampersands,
angle brackets, and quote marks, and should pass your UTF-8 characters
through unchanged.
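
The difference between the two suggestions can be seen in a short sketch. It assumes PHP 7 or later for the `\u{...}` escape, and the sample string is made up for illustration. Note that the ISO-8859-1 default described above applies to PHP versions before 5.4; later versions default to UTF-8, but passing the charset explicitly keeps the behaviour the same everywhere.

```php
<?php
// Sample input containing an ampersand, an en dash, quotes, and an ellipsis.
$input = "Tom & Jerry \u{2013} \"best of\u{2026}\"";

// Explicit UTF-8: multi-byte characters are decoded correctly and
// converted to named entities.
$entities = htmlentities($input, ENT_COMPAT, 'UTF-8');
echo $entities, "\n";
// Tom &amp; Jerry &ndash; &quot;best of&hellip;&quot;

// htmlspecialchars escapes only &, <, > and (with ENT_COMPAT) the double
// quote; the dash and ellipsis pass through as raw UTF-8.
$special = htmlspecialchars($input, ENT_COMPAT, 'UTF-8');
echo $special, "\n";

// Wrong charset: each UTF-8 byte is treated as a separate Latin-1 character,
// so the first byte of the dash (0xE2) becomes &acirc; followed by stray bytes.
$wrong = htmlentities($input, ENT_COMPAT, 'ISO-8859-1');
var_dump(strpos($wrong, '&acirc;') !== false);
```

With a page that is consistently UTF-8, htmlspecialchars is usually enough, since the remaining characters do not need entities to display correctly.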
Aug 14 '08 #4

On Aug 14, 9:30 pm, I V <ivle...@gmail.com> wrote:
> On Thu, 14 Aug 2008 02:24:31 -0700, BG Mahesh wrote:
>> We are using the normal HTML controls (textarea) in the posting form.
>> The form page has the UTF-8 character set.
>> Users are copying the text from MS Word or OpenOffice docs etc.
>> Our PHP code is handling the conversion of RTF text characters and UTF-8
>> characters into HTML entities (e.g. & is being converted to &amp; by the
>> built-in PHP function 'htmlentities').
>> However, many common characters/symbols are not being converted properly.
>> When I say common, even the ones like '-' (hyphen) are not being
>> converted by htmlentities. They get converted to junk characters like ??
>> or &((&^.
>
> htmlentities assumes an ISO-8859-1 character set by default; so, it will
> misinterpret the UTF-8 characters supplied by your users. You could
> specify the character set explicitly with
>
> htmlentities($some_utf8_string, ENT_COMPAT, 'UTF-8')
>
> or you could use htmlspecialchars, which only converts ampersands,
> angle brackets, and quote marks, and should pass your UTF-8 characters
> through unchanged.

Thank you everybody. It works now.

Aug 18 '08 #5
