
html tidy, word 2003 and "smart quotes"

Ron
Hello, I'm having an aggravating time getting the "html" spewed by Word
2003 to display correctly in a webpage.

The situation here is that the people creating the documents only know
Word, and aren't very computer savvy. I created a system where they
can save their Word documents as "html" and upload them to a certain
directory, and the web page dynamically runs them through tidylib using
the tidy extension to php4, thus causing the document to display
correctly. I also run the files through a couple of sed expressions to
remove XML tags that have no business being there.
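
In outline, the PHP side looks something like this -- a simplified
sketch rather than my exact code (it uses the newer tidy API for
readability; the path is made up, and the regexp is only a stand-in for
the sed step):

<?php
// Clean up the Word-generated "html" before it reaches the browser.
$raw = file_get_contents('/var/www/uploads/report.html'); // illustrative path

$config = array(
    'bare'      => true,   // the three tidy options mentioned below
    'clean'     => true,
    'word-2000' => true,
);

$clean = tidy_repair_string($raw, $config);

// Stand-in for the sed pass: drop <o:p>, <w:...>, <v:...> namespace tags.
$clean = preg_replace('/<\/?[ovw]:[^>]*>/i', '', $clean);

echo $clean;
?>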

It alllllmost works. The resulting document follows the page's css
rules and displays correctly, except for those durned "smart quotes".

As you know, Word defaults to replacing straight quotes with fancy
quotes using an encoding that doesn't work on web pages. When you
"save as html", the resulting code doesn't display correctly. You can
turn off "smart quotes" (which I have suggested), but that only applies
to *new* documents -- existing documents still have the problem.

Now when I use TidyUI on Windows XP, I can SEE the fancy quotes turn
into straight quotes. But when I use tidy on the command line or
tidylib through the php extension, the substitution does *not* take
place. (Freshly downloaded version of tidy in every case.)

On the Linux box I have "bare", "clean" and "word-2000" turned on.
(The code looks different if I turn any of them off, so I'm sure
they're getting turned on.) What it seems to come down to is that
tidy, with the same options, cleans up *different* things on Linux than
it does on Windows.

What are my options at this point? The users will continue to use Word
2003 -- no help there. My web server is Apache on Linux -- that's not
going to change. How do I get from here to there, dynamically, with no
user intervention?

Thanks very much for any and all suggestions. If I can solve this,
I've made it that much less likely that we'll switch to IIS.

Ron (ro**@europa.com)

Jul 24 '05 #1
11 Replies


Ron wrote:
Hello, I'm having an aggravating time getting the "html" spewed by Word
2003 to display correctly in a webpage.
Not a good idea to use Word for HTML at all, but at least you're trying
to clean it up.
I created a system where they can save their Word documents as "html"
and upload them to a certain directory, and the web page dynamically
runs them through tidylib...

It alllllmost works. The resulting document follows the page's css
rules and displays correctly, except for those durned "smart quotes".


There's nothing inherently wrong with the curly quotes; the problem is
that people fail to handle the character encoding properly. Word
documents are saved in the Windows-1252 encoding by default. The quotes
you are referring to are at positions 145 (‘), 146 (’), 147 (“) and 148
(”). However, these code points (and all others in the range from 128
to 159) are control codes in ISO-8859-1 and related encodings. Thus,
the main problem is simply that the wrong character encoding is being
declared.

Although declaring the encoding as Windows-1252 in the HTTP headers will
work, it is not recommended, because Windows-1252 is a proprietary
encoding designed for Windows only (support may have been added to other
systems too, but that's not guaranteed).

The best options are either to save the files as UTF-8 and declare that
encoding in the HTTP headers, or to continue to use ISO-8859-1 and
replace the quotes (and the other special Windows-1252 characters) with
numeric character references. I think Word does have an option to save
files as UTF-8, which I recommend.

More information about Windows-1252 and the numeric character references
is available here:
http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
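
For the second option, the substitution is small enough to sketch in
PHP (the function name and the exact list of bytes covered are only
illustrative):

<?php
// Rewrite the raw Windows-1252 quote bytes as numeric character
// references for the corresponding Unicode code points, so the page
// can still be served as ISO-8859-1.
function win1252_quotes_to_ncr($html)
{
    return strtr($html, array(
        "\x91" => '&#8216;', // left single quotation mark  (U+2018)
        "\x92" => '&#8217;', // right single quotation mark (U+2019)
        "\x93" => '&#8220;', // left double quotation mark  (U+201C)
        "\x94" => '&#8221;', // right double quotation mark (U+201D)
    ));
}
?>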

--
Lachlan Hunt
http://lachy.id.au/
http://GetFirefox.com/ Rediscover the Web
http://GetThunderbird.com/ Reclaim your Inbox
Jul 24 '05 #2

Lachlan Hunt wrote:
Ron wrote:
Hello, I'm having an aggravating time getting the "html" spewed by Word
2003 to display correctly in a webpage.


Not a good idea to use Word for HTML at all, but at least you're trying
to clean it up.

That was not his idea...
The same problem occurs when you use a WYSIWYG editor component (like
HTMLArea) on a web page and people copy & paste content from Word. I
hate these things (besides the fact that the web and the WYSIWYG concept
are completely incompatible, they only cause problems), but I was not
able to prevent the decision to embed WYSIWYG editors :(

[...]
The best options are either to save the files as UTF-8 and declare that
encoding in the HTTP headers, or to continue to use ISO-8859-1 and
replace the quotes (and the other special Windows-1252 characters) with
numeric character references. I think Word does have an option to save
files as UTF-8, which I recommend.

I had this problem myself often enough, and I usually used a list of
str_replace expressions to turn these characters into the correct &#...;
counterparts. After reading Lachlan's comment, an untested idea popped
into my head: you could try using the iconv module of PHP to convert the
Windows-1252 into UTF-8 on the fly.
I have neither Word nor Windows available, so I can't test it now...
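
Something along these lines, completely untested (the variable name is
made up):

<?php
// Convert Word's Windows-1252 output to UTF-8 on the fly, then tell
// the browser what it is getting.
$html = iconv('Windows-1252', 'UTF-8', file_get_contents($word_html_file));
header('Content-Type: text/html; charset=utf-8');
echo $html;
?>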

--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://www.odahoda.de/
Jul 24 '05 #3

On Thu, 14 Apr 2005, Lachlan Hunt wrote:
There's nothing inherently wrong with the curly quotes, the problem
with them is only that people fail to understand the character
encoding issues properly. Word documents are saved in the
Windows-1252 encoding by default. The quotes you are referring to
are in the positions 145 (), 146 (), 147 () and 148 ().
Thereby neatly presenting yet another demonstration of the problem ;-}
However, these code points (and all others in the range from 128 to
159 are control codes in ISO-8859-1 and others. Thus, the main
problem is only caused by declaring the incorrect character
encoding.
agreed
Although declaring the encoding as Windows-1252 in the HTTP headers
will work, it is not recommended because Windows-1252 is a
proprietary encoding designed for windows only (although support may
have been added to other systems too, but that's not guarenteed).
in fact, support is pretty widespread, but I'd still counsel against
using it.
The best options are to either save the files as UTF-8 and declare
that encoding in the HTTP headers or, continue to use ISO-8859-1 and
replace the quotes (and other special windows-1252 chars) with
numeric character references.
I just wanted to make sure that nobody reading this thought that you
meant character references such as &#145; etc.: funnily enough,
historically MS software seems to have generated those undefined
references more enthusiastically than the actual 8-bit characters, but
the undefined references are quite bogus from Unicode's point of view.

The correct Unicode
code points for these characters are all greater than 255, as you
obviously already know (there's a somewhat official table of them
with hex equivalents at
http://www.unicode.org/Public/MAPPIN...OWS/CP1252.TXT )
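
If one did want to repair those bogus references after the fact, the
mapping is tiny. Purely as an illustration, in PHP ($html is just a
placeholder here):

<?php
// Rewrite the bogus &#145;-&#148; references to the correct Unicode
// code points, per the mapping table above.
$html = strtr($html, array(
    '&#145;' => '&#8216;',
    '&#146;' => '&#8217;',
    '&#147;' => '&#8220;',
    '&#148;' => '&#8221;',
));
?>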
I think word does have an option to save files as UTF-8, which I
recommend.
I guess it depends on what version you're using. The subject line
mentioned 2003, but plenty of folks aren't there yet.
http://www.cs.tut.fi/~jkorpela/www/windows-chars.html


good cite.

all the best
Jul 24 '05 #4

In article <Pi******************************@ppepc56.ph.gla.ac.uk>,
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
Although declaring the encoding as Windows-1252 in the HTTP headers
will work, it is not recommended because Windows-1252 is a
proprietary encoding designed for windows only (although support may
have been added to other systems too, but that's not guarenteed).


in fact, support is pretty widespread, but I'd still counsel against
using it.


As a matter of principle or based on practical concerns?

--
Henri Sivonen
hs******@iki.fi
http://hsivonen.iki.fi/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 24 '05 #5

On Thu, 14 Apr 2005, Henri Sivonen wrote:
Although declaring the encoding as Windows-1252 in the HTTP headers
will work, it is not recommended because Windows-1252 is a
proprietary encoding designed for windows only (although support may
have been added to other systems too, but that's not guarenteed).


in fact, support is pretty widespread, but I'd still counsel against
using it.


As a matter of principle or based on practical concerns?


A bit of both, really. You can catch me out with some lukewarm
support for it at
http://ppewww.ph.gla.ac.uk/~flavell/.../checklist#s3a but support
for utf-8 is getting steadily better, whereas I doubt that support for
win-1252, widespread though it is, is going to improve much more (any
new browser versions which don't support it might well be leaving it
out on principle).
Jul 24 '05 #6

On 13 Apr 2005, Ron wrote:
The situation here is that the people creating the documents only know
Word, and aren't very computer savvy. I created a system where they
can save their Word documents as "html"
In which encoding? Code page 1252?
and upload them to a certain
directory, and the web page dynamically runs them through tidylib using
the tidy extension to php4, thus causing the document to display
correctly. I also run the files through a couple sed expressions to
remove xml tags that have no business being there.

It alllllmost works. The resulting document follows the page's css
rules and displays correctly, except for those durned "smart quotes".


Just add (e.g. to your sed expressions):
\221 -> ‘
\222 -> ’
\223 -> “
\224 -> ”

Jul 24 '05 #7

Alan J. Flavell wrote:
On Thu, 14 Apr 2005, Lachlan Hunt wrote:

There's nothing inherently wrong with the curly quotes, the problem
with them is only that people fail to understand the character
encoding issues properly. Word documents are saved in the
Windows-1252 encoding by default. The quotes you are referring to
are in the positions 145 (), 146 (), 147 () and 148 ().
Thereby neatly presenting yet another demonstration of the problem ;-}


I assume you meant my inclusion of the characters within the post.
Well, my news client did encode them as UTF-8 and correctly declare the
encoding in the headers, although yours decided to reply in US-ASCII
anyway, hence their removal from the quote.
Although declaring the encoding as Windows-1252 in the HTTP headers
will work, it is not recommended because Windows-1252 is a
proprietary encoding designed for windows only (although support may
have been added to other systems too, but that's not guarenteed).


in fact, support is pretty widespread, but I'd still counsel against
using it.


I suspected as much for Mac and Linux, but what about phones/PDAs
running an OS like Symbian? Then again, I'm sure it would be much
simpler to fully support Windows-1252 on such devices than it would be
to support the entire Unicode repertoire.
...numeric character references.


I just wanted to make sure that nobody reading this thought that you
meant character references such as &#145; etc.


Good point. I decided not to mention them as they were discussed in
more detail in the cited document, but perhaps I should have anyway.
funnily enough, historically MS software seems to have generated those undefined
references more enthusiastically than the actual 8-bit characters, but
the undefined references are quite bogus from Unicode's point of view.


Indeed, I don't even understand why MS decided to support such invalid
constructs. If they hadn't, invalid uses like that wouldn't even be
widely used on the web today. I guess it's just like all the other crap
they're responsible for introducing.

--
Lachlan Hunt
http://lachy.id.au/
http://GetFirefox.com/ Rediscover the Web
http://GetThunderbird.com/ Reclaim your Inbox
Jul 24 '05 #8


Thanks everyone for the excellent suggestions!
Ron
-
http://www.christianfamilywebsite.com
http://www.iswizards.com
Definition: Nelp: Contraction of "no help". Colloquial: Help
messages that are of no help whatsoever. Pertains to help files,
messages or documentation that convey no useful information, or
pedantically repeat the blindingly obvious.
Jul 24 '05 #9

On Thu, 14 Apr 2005, Lachlan Hunt wrote:
Alan J. Flavell wrote:
default. The quotes you are referring to are in the positions 145 (), 146
(), 147 () and 148 ().
Thereby neatly presenting yet another demonstration of the problem ;-}


I assume you meant my inclusion of the characters within the post.


Well, I really mean the exclusion of the characters by PINE.
Well, my news client did encode them as UTF-8 and correctly declare
the encoding in the headers.
Indeed. But since I had my side configured to use iso-8859-1, PINE
would've noticed they couldn't be represented in that, and so filtered
them out.

(In this respect I have to say that Lynx's behaviour is superior to
PINE's, though of course they perform different tasks.)
Although, yours decided to reply in US-ASCII anyway,


That would be because, after removal of the above characters, there
was nothing left that needed iso-8859-1 coding, and so PINE deliberately
downgrades the advertised charset to us-ascii, in the interests of
cross-compatibility.

_If_ you had included any iso-8859-1 characters, then PINE would have
posted in iso-8859-1. But it would still have filtered out the
extraneous codes which couldn't be represented in 8859-1. At least,
that's how I understand the behaviour of this version of PINE, and
others like it.
Jul 24 '05 #10

In article <Pi******************************@ppepc56.ph.gla.ac.uk>,
Alan J. Flavell <fl*****@ph.gla.ac.uk> wrote:
}On Thu, 14 Apr 2005, Lachlan Hunt wrote:
}
}> Well, my news client did encode them as UTF-8 and correctly declare
}> the encoding in the headers.
}
}Indeed. But since I had my side configured to use iso-8859-1, PINE
}would've noticed they couldn't be represented in that, and so filtered
}them out.
}
}[...]

There are an awful lot of news reading programs out there that do not
understand UTF-8. Many of them don't even understand the Content-Type
header at all, let alone HTML-formatted messages. None of that was in
the original news spec, and I don't know whether the current spec
includes it even now.
--
= Eric Bustad, Norwegian bachelor programmer
Jul 24 '05 #11

In article <57******************@news-server.bigpond.net.au>,
Lachlan Hunt <sp***********@gmail.com> wrote:
I suspected as much from Mac and Linux, but what about Phones/PDAs
running an OS like Symbian? Although, I'm sure it would be much simpler
to fully support Windows-1252 on such devices than it would to support
the entire unicode repertoire.


At least Ericsson phones support the UTF-8 encoding (well, the BMP part
of it) even when they do not support all of Unicode. Interestingly
enough, the T610, for example, does not support the Windows-1252
*repertoire* although it supports the UTF-8 *encoding*.

--
Henri Sivonen
hs******@iki.fi
http://hsivonen.iki.fi/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 24 '05 #12
