"smart" quotes in PHP

Hello all,

I've been struggling for a few days with the question of how to convert
"smart" (curly) quotes into straight quotes. I tried playing with the
htmlentities() function, but all that is doing is changing the smart
quotes into nonsense characters. I also searched the web for quite a
while and was unsuccessful in finding a solution.

What puzzles me is that doing it the other way around is simple enough.
For example, this works fine in converting a straight quote into an
"open" smart quote:

if ($content[$k] == "\"")
    $content = substr($content, 0, $k) . "&#147;" . substr($content, $k+1, strlen($content)-$k+1);

But the other way around doesn't work. Any ideas?

Thanks,

Martin Goldman
My e-mail address's correct domain name is mgoldman.com.
Jul 17 '05 #1
Martin Goldman <ww*@nowhere.foo> wrote:
I've been struggling for a few days with the question of how to convert
"smart" (curly) quotes into straight quotes.
Smart/curly quotes? straight quotes? What are these?
What puzzles me is that doing it the other way around is simple enough.
For example, this works fine in converting a straight quote into an
"open" smart quote:

if ($content[$k] == "\"")
    $content = substr($content, 0, $k) . "&#147;" . substr($content, $k+1, strlen($content)-$k+1);


Funny way to do a str_replace :)

What character is represented by #147? AFAIK it's not in any character
set I know (ASCII or ISO-8859-x). So your actual problem might be that
the character you want to replace is in a different encoding from the one
PHP is actually using!

BTW the 3rd parameter of htmlentities() specifies the character set.

--

Daniel Tryba

Jul 17 '05 #2
On Fri, 14 Nov 2003 17:42:08 GMT, Martin Goldman <ww*@nowhere.foo> wrote:
I've been struggling for a few days with the question of how to convert
"smart" (curly) quotes into straight quotes. I tried playing with the
htmlentities() function, but all that is doing is changing the smart
quotes into nonsense characters. I also searched the web for quite a
while and was unsuccessful in finding a solution.
You've got to work out what character set the text is encoded in, for
starters, since 'smart quotes' exist in Microsoft's codepage 1252 but not in
the standard ISO 8859 character sets, e.g. iso-8859-15.

In codepage 1252:

hex  dec  Unicode  Unicode name
91   145  8216     LEFT SINGLE QUOTATION MARK
92   146  8217     RIGHT SINGLE QUOTATION MARK
93   147  8220     LEFT DOUBLE QUOTATION MARK
94   148  8221     RIGHT DOUBLE QUOTATION MARK

But in iso-8859-15, 145-148 aren't defined as printable characters; 128-159
are reserved for control characters.

So if you change it to &#147;, but output your page encoded in iso-8859-1,
you're just changing it to the code for a non-printable character. The same
entity will appear as a left double quotation mark if encoded in Windows-1252
though.
What puzzles me is that doing it the other way around is simple enough.
For example, this works fine in converting a straight quote into an
"open" smart quote:

if ($content[$k] == "\"")
    $content = substr($content, 0, $k) . "&#147;" . substr($content, $k+1, strlen($content)-$k+1);

But the other way around doesn't work. Any ideas?


In what way doesn't it work? What does str_replace(chr(147), '"', $content);
appear to do in your setup?
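
For reference, a minimal sketch of that replacement, assuming the text really is Windows-1252 so that each curly quote is a single byte. The function name is made up for illustration.

```php
<?php
// Hypothetical helper: assumes $text is Windows-1252 encoded,
// where each curly quote is a single byte (145-148).
function cp1252_quotes_to_straight($text)
{
    $search  = array(chr(145), chr(146), chr(147), chr(148));
    $replace = array("'", "'", '"', '"');
    // str_replace() takes search, replace, subject, in that order.
    return str_replace($search, $replace, $text);
}

echo cp1252_quotes_to_straight(chr(147) . "Hello" . chr(148)), "\n";
```

If the string comes back unchanged, the quotes are probably not stored as those single bytes, which points to a different encoding.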

--
Andy Hassall (an**@andyh.co.uk) icq(5747695) (http://www.andyh.co.uk)
Space: disk usage analysis tool (http://www.andyhsoftware.co.uk/space)
Jul 17 '05 #3
Martin Goldman wrote:
I've been struggling for a few days with the question of how to convert
"smart" (curly) quotes into straight quotes.
As D. Tryba hinted at, str_replace should work fine. After all,
you're replacing one character with another.

$string = str_replace($chr,'"',$string);

where $chr is the character you want to replace.
I tried playing with the htmlentities() function, but all that is doing
is changing the smart quotes into nonsense characters.
I'd be interested in seeing what you actually tried. Since so-called
smart quotes aren't in the Latin-1 repertoire, you'd have to specify
a charset other than the default ISO-8859-1. Say you typed smart
quotes on a bog standard Windows system by holding down Alt and
pressing 0, 1, 4, and 7 (or 8) on the numeric keypad, you'd use

$string = htmlentities($string,ENT_COMPAT,'cp1252');

where $string is the string containing smart quotes. That converts
smart quotes to their respective entity references.
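
A small sketch of that call, assuming the input bytes really are Windows-1252. The sample string is invented for the example.

```php
<?php
// Assumes the input bytes are Windows-1252; "\x93" and "\x94" are the
// curly double quotes in that code page (decimal 147 and 148).
$string = "\x93Hello\x94";

// The third argument tells htmlentities() how to interpret the bytes.
$encoded = htmlentities($string, ENT_COMPAT, 'cp1252');

echo $encoded, "\n";
```

If the bytes are actually in some other encoding, the same call will interpret them wrongly and the output will look like nonsense.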
What puzzles me is that doing it the other way around is simple enough.
Eek! I'd have thought that was *more* difficult...
if ($content[$k] == "\"")
    $content = substr($content, 0, $k) . "&#147;" . substr($content, $k+1, strlen($content)-$k+1);


How does your script know that the quotation mark was intended as an
opening quotation mark? ;-)

In HTML, the character reference &#147; is undefined. The LEFT DOUBLE
QUOTATION MARK can be represented using the character reference
&#8220; or the entity reference &ldquo;. The RIGHT DOUBLE QUOTATION
MARK can be represented using the character reference &#8221; or the
entity reference &rdquo;.

--
Jock
Jul 17 '05 #4
John Dunlop <jo*********@johndunlop.info> wrote in
news:MP************************@news.freeserve.net :
Martin Goldman wrote:
I'd be interested in seeing what you actually tried. Since so-called
smart quotes aren't in the Latin-1 repertoire, you'd have to specify
a charset other than the default ISO-8859-1. Say you typed smart
quotes on a bog standard Windows system by holding down Alt and
pressing 0, 1, 4, and 7 (or 8) on the numeric keypad, you'd use

$string = htmlentities($string,ENT_COMPAT,'cp1252');

where $string is the string containing smart quotes. That converts
smart quotes to their respective entity references.
This results in the smart quotes being replaced with nonsense characters.
The thing is, though, that I'm totally unfamiliar with character sets,
the differences between them, etc. I've never had any reason to care
about them. So I'm a little confused about what you guys are talking
about when it comes to them.
How does your script know that the quotation mark was intended as an
opening quotation mark? ;-)

Well, I didn't paste the whole thing. :) I wrote a loop that goes through
the string. It toggles a flag each time a quotation mark is found. If the
flag is set, it makes it an open quote; if it's not, it makes it a closed
quote. Hence the reason I'm not just using a str_replace for that. :)
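
The toggle loop described above might look roughly like this. It is only a sketch, not the original script: the function name is hypothetical, it assumes the double quotes are balanced and never nested, and it emits the &ldquo; and &rdquo; entities instead of raw bytes.

```php
<?php
// Hypothetical sketch of the toggle approach; assumes quotes are balanced
// and never nested, and emits HTML entities rather than raw bytes.
function curl_double_quotes($content)
{
    $result = '';
    $open = true; // the next quote found is treated as an opening quote
    $length = strlen($content);
    for ($k = 0; $k < $length; $k++) {
        if ($content[$k] === '"') {
            $result .= $open ? '&ldquo;' : '&rdquo;';
            $open = !$open; // flip the flag for the next quote
        } else {
            $result .= $content[$k];
        }
    }
    return $result;
}

echo curl_double_quotes('He said "hi" and "bye".'), "\n";
```

Building a new string avoids having to adjust $k after each replacement, since the entity is longer than the character it replaces.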

Oh, and to answer Mr. Hassall's question -- str_replace(chr(147), "\"",
$content) doesn't do anything. The exact same string is returned.

-Martin
Jul 17 '05 #5
Martin Goldman <ww*@nowhere.foo> wrote:
[confused about charsets]
Oh, and to answer Mr. Hassall's question -- str_replace(chr(147), "\"",
$content) doesn't do anything. The exact same string is returned.


That might mean that there is no chr(147) in the string, although you
_see_ a character that might be represented as the character you know as
147 in cp1252! Another fine example is the euro symbol: IIRC it's 128 in
cp1252 and 164 in iso-8859-15, while in iso-8859-1 164 is the generic
currency symbol and the euro symbol is missing altogether. That's why, if
you want to display the euro symbol, you are encouraged to use the HTML
entity &euro;, which can be rendered in any font and any character set
(with a fallback to EUR).

So your job is to figure out how your quote is encoded (just step through
the string and print the ord() value for each character)...
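
Stepping through the string could be sketched like this. The helper name is hypothetical, and the sample input assumes a UTF-8 left double quote purely for illustration.

```php
<?php
// Hypothetical helper: returns the decimal value of every byte in $text.
function byte_values($text)
{
    $values = array();
    $length = strlen($text); // strlen() counts bytes, not characters
    for ($i = 0; $i < $length; $i++) {
        $values[] = ord($text[$i]);
    }
    return $values;
}

// A UTF-8 encoded left double quote is three bytes long.
echo implode(' ', byte_values("\xE2\x80\x9C")), "\n"; // 226 128 156
```

A single value of 147 would suggest cp1252. Several values above 127 for one visible character suggest a multi-byte encoding.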

BTW Unicode kind of solves the problem by defining every known character
in one set; the problem is that not every program supports it yet. But
Unicode also introduces another problem: the way the characters are
encoded (e.g. utf7, utf8, utf16...). I don't know if PHP supports utf16+.

--

Daniel Tryba

Jul 17 '05 #6
Daniel Tryba <ne****************@canopus.nl> wrote in news:bp5nhq$d0e$1
@news.tue.nl:
That might mean that there is no chr(147) in the string, although you
_see_ a character that might be represented as the character you know as
147 in cp1252! Another fine example is the euro symbol: IIRC it's 128 in
cp1252 and 164 in iso-8859-15, while in iso-8859-1 164 is the generic
currency symbol and the euro symbol is missing altogether. That's why, if
you want to display the euro symbol, you are encouraged to use the HTML
entity &euro;, which can be rendered in any font and any character set
(with a fallback to EUR).

So your job is to figure out how your quote is encoded (just step through
the string and print the ord() value for each character)...

Interesting you should suggest this, because I just did that. And indeed,
it's not coming out as 147. It's coming out as 226, followed by 128,
followed by 156. I suppose I could do a str_replace for these 3
characters and replace it with 147. Although, then I'd have to do that
for every character I want to support. What a drag.

Thanks,
Martin
Jul 17 '05 #7
On Sat, 15 Nov 2003 19:57:14 GMT, Martin Goldman <ww*@nowhere.foo> wrote:
Daniel Tryba <ne****************@canopus.nl> wrote in news:bp5nhq$d0e$1
@news.tue.nl:
That might mean that there is no chr(147) in the string, although you
_see_ a character that might be represented as the character you know as
147 in cp1252! Another fine example is the euro symbol: IIRC it's 128 in
cp1252 and 164 in iso-8859-15, while in iso-8859-1 164 is the generic
currency symbol and the euro symbol is missing altogether. That's why, if
you want to display the euro symbol, you are encouraged to use the HTML
entity &euro;, which can be rendered in any font and any character set
(with a fallback to EUR).

So your job is to figure out how your quote is encoded (just step through
the string and print the ord() value for each character)...


Interesting you should suggest this, because I just did that. And indeed,
it's not coming out as 147. It's coming out as 226, followed by 128,
followed by 156. I suppose I could do a str_replace for these 3
characters and replace it with 147. Although, then I'd have to do that
for every character I want to support. What a drag.


Your text is encoded in UTF-8. Going back to the characters again:

hex  dec  Unicode  Unicode name
91   145  8216     LEFT SINGLE QUOTATION MARK
92   146  8217     RIGHT SINGLE QUOTATION MARK
93   147  8220     LEFT DOUBLE QUOTATION MARK
94   148  8221     RIGHT DOUBLE QUOTATION MARK

226,128,156 in binary is:

11100010
10000000
10011100

'1110' in the first few bits of the first byte indicates it is a lead byte for
a three-byte character. The remaining two are trail bytes, as they start with
10. So separating out the data gets:

1110 0010
10 000000
10 011100

=> 0010000000011100 (binary)
= 8220 (decimal)

Which is LEFT DOUBLE QUOTATION MARK.
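
The same bit manipulation can be sketched in PHP. The function name is hypothetical, and it only handles a single three-byte sequence with no validation beyond the length check.

```php
<?php
// Hypothetical helper: decodes one three-byte UTF-8 sequence into its
// Unicode code point. No validation beyond the length check.
function decode_three_byte_utf8($bytes)
{
    if (strlen($bytes) !== 3) {
        return null;
    }
    $lead   = ord($bytes[0]) & 0x0F; // drop the 1110 marker, keep 4 bits
    $trail1 = ord($bytes[1]) & 0x3F; // drop the 10 marker, keep 6 bits
    $trail2 = ord($bytes[2]) & 0x3F; // drop the 10 marker, keep 6 bits
    // Reassemble the 4 + 6 + 6 data bits into one 16-bit value.
    return ($lead << 12) | ($trail1 << 6) | $trail2;
}

echo decode_three_byte_utf8(chr(226) . chr(128) . chr(156)), "\n"; // 8220
```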

--
Andy Hassall (an**@andyh.co.uk) icq(5747695) (http://www.andyh.co.uk)
Space: disk usage analysis tool (http://www.andyhsoftware.co.uk/space)
Jul 17 '05 #8
Andy Hassall <an**@andyh.co.uk> wrote:
So your job is to figure out how your quote is encoded (just step through
the string and print the ord() value for each character)...


Interesting you should suggest this, because I just did that. And indeed,
it's not coming out as 147. It's coming out as 226, followed by 128,
followed by 156. I suppose I could do a str_replace for these 3
characters and replace it with 147. Although, then I'd have to do that
for every character I want to support. What a drag.


Your text is encoded in UTF-8. Going back to the characters again:

[in depth UTF-8 decoding :)]

So Martin, you should take a look at iconv or, if your server lacks
support for it, utf8_decode(). The manual page for the latter also has a
user-contributed note on how to use str_replace on a UTF-8 encoded string.
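
A rough sketch of the str_replace route, assuming the text is UTF-8 as worked out earlier in the thread. The function name is made up, and the map covers only the four quote characters from the table.

```php
<?php
// Hypothetical helper: assumes $text is UTF-8, where each curly quote is
// a three-byte sequence starting with 0xE2 0x80.
function utf8_quotes_to_straight($text)
{
    $map = array(
        "\xE2\x80\x98" => "'", // U+2018 LEFT SINGLE QUOTATION MARK
        "\xE2\x80\x99" => "'", // U+2019 RIGHT SINGLE QUOTATION MARK
        "\xE2\x80\x9C" => '"', // U+201C LEFT DOUBLE QUOTATION MARK
        "\xE2\x80\x9D" => '"', // U+201D RIGHT DOUBLE QUOTATION MARK
    );
    // str_replace() works on bytes, which is safe here because a complete
    // UTF-8 sequence never occurs inside another character's encoding.
    return str_replace(array_keys($map), array_values($map), $text);
}

echo utf8_quotes_to_straight("\xE2\x80\x9CHello\xE2\x80\x9D"), "\n";
```

One caveat worth checking before relying on utf8_decode() alone: it converts to ISO-8859-1, which has no curly quotes, so those characters would most likely come out as question marks. It is also deprecated in recent PHP versions.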

--

Daniel Tryba

Jul 17 '05 #9
Daniel Tryba <ne****************@canopus.nl> wrote in
news:bp**********@news.tue.nl:
Andy Hassall <an**@andyh.co.uk> wrote:
So Martin, you should take a look at iconv or, if your server lacks
support for it, utf8_decode(). The manual page for the latter also has a
user-contributed note on how to use str_replace on a UTF-8 encoded string.


Great. Thanks to everyone who replied.

-Martin
my correct domain name is mgoldman.com
Jul 17 '05 #10
