
html tidy, word 2003 and "smart quotes"

Ron
Hello, I'm having an aggravating time getting the "html" spewed by Word
2003 to display correctly in a webpage.

The situation here is that the people creating the documents only know
Word, and aren't very computer savvy. I created a system where they
can save their Word documents as "html" and upload them to a certain
directory, and the web page dynamically runs them through tidylib using
the tidy extension to php4, thus causing the document to display
correctly. I also run the files through a couple of sed expressions to
remove XML tags that have no business being there.
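
In outline, the PHP side looks something like this -- a simplified
sketch rather than my exact code (it uses the newer tidy API for
readability; the path is made up, and the regexp is only a stand-in for
the sed step):

<?php
// Clean up the Word-generated "html" before it reaches the browser.
$raw = file_get_contents('/var/www/uploads/report.html'); // illustrative path

$config = array(
    'bare'      => true,   // the three tidy options mentioned below
    'clean'     => true,
    'word-2000' => true,
);

$clean = tidy_repair_string($raw, $config);

// Stand-in for the sed pass: drop <o:p>, <w:...>, <v:...> namespace tags.
$clean = preg_replace('/<\/?[ovw]:[^>]*>/i', '', $clean);

echo $clean;
?>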

It alllllmost works. The resulting document follows the page's css
rules and displays correctly, except for those durned "smart quotes".

As you know, Word defaults to replacing straight quotes with fancy
quotes using an encoding that doesn't work on web pages. When you
"save as html", the resulting code doesn't display correctly. You can
turn off "smart quotes" (which I have suggested), but that only applies
to *new* documents -- existing documents still have the problem.

Now when I use TidyUI on Windows XP, I can SEE the fancy quotes turn
into straight quotes. But when I use tidy on the command line or
tidylib through the php extension, the substitution does *not* take
place. (Freshly downloaded version of tidy in every case.)

On the Linux box I have "bare", "clean" and "word-2000" turned on.
(The code looks different if I turn any of them off, so I'm sure
they're getting turned on.) What it seems to come down to is that
tidy, with the same options, cleans up *different* things on Linux than
it does on Windows.

What are my options at this point? The users will continue to use Word
2003 -- no help there. My web server is Apache on Linux -- that's not
going to change. How do I get from here to there, dynamically, with no
user intervention?

Thanks very much for any and all suggestions. If I can solve this,
I've made it that much less likely that we'll switch to IIS.

Ron (ro**@europa.com)

Jul 24 '05 #1
11 Replies


Ron wrote:
Hello, I'm having an aggravating time getting the "html" spewed by Word
2003 to display correctly in a webpage.
Not a good idea to use Word for HTML at all, but at least you're trying
to clean it up.
I created a system where they can save their Word documents as "html"
and upload them to a certain directory, and the web page dynamically
runs them through tidylib...

It alllllmost works. The resulting document follows the page's css
rules and displays correctly, except for those durned "smart quotes".


There's nothing inherently wrong with the curly quotes; the problem is
that people fail to handle the character encoding properly. Word
documents are saved in the Windows-1252 encoding by default. The quotes
you are referring to are at positions 145 (‘), 146 (’), 147 (“) and 148
(”). However, these code points (and all others in the range from 128
to 159) are control codes in ISO-8859-1 and related encodings. Thus,
the main problem is simply that the wrong character encoding is being
declared.

Although declaring the encoding as Windows-1252 in the HTTP headers will
work, it is not recommended, because Windows-1252 is a proprietary
encoding designed for Windows only (support may have been added to other
systems too, but that's not guaranteed).

The best options are either to save the files as UTF-8 and declare that
encoding in the HTTP headers, or to continue to use ISO-8859-1 and
replace the quotes (and the other special Windows-1252 characters) with
numeric character references. I think Word does have an option to save
files as UTF-8, which I recommend.

More information about Windows-1252 and the numeric character references
is available here:
http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
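
For the second option, the substitution is small enough to sketch in
PHP (the function name and the exact list of bytes covered are only
illustrative):

<?php
// Rewrite the raw Windows-1252 quote bytes as numeric character
// references for the corresponding Unicode code points, so the page
// can still be served as ISO-8859-1.
function win1252_quotes_to_ncr($html)
{
    return strtr($html, array(
        "\x91" => '&#8216;', // left single quotation mark  (U+2018)
        "\x92" => '&#8217;', // right single quotation mark (U+2019)
        "\x93" => '&#8220;', // left double quotation mark  (U+201C)
        "\x94" => '&#8221;', // right double quotation mark (U+201D)
    ));
}
?>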

--
Lachlan Hunt
http://lachy.id.au/
http://GetFirefox.com/ Rediscover the Web
http://GetThunderbird.com/ Reclaim your Inbox
Jul 24 '05 #2

Lachlan Hunt wrote:
Ron wrote:
Hello, I'm having an aggravating time getting the "html" spewed by Word
2003 to display correctly in a webpage.


Not a good idea to use Word for HTML at all, but at least you're trying
to clean it up.

That was not his idea...
The same problem occurs when you use a WYSIWYG editor component (like
HTMLArea) on a web page and people copy & paste content from Word. I
hate these things (besides the fact that the web and the WYSIWYG concept
are completely incompatible, they only cause problems), but I was not
able to prevent the decision to embed WYSIWYG editors :(

[...]
The best options are either to save the files as UTF-8 and declare that
encoding in the HTTP headers, or to continue to use ISO-8859-1 and
replace the quotes (and the other special Windows-1252 characters) with
numeric character references. I think Word does have an option to save
files as UTF-8, which I recommend.

I had this problem myself often enough, and I usually used a list of
str_replace expressions to turn these characters into the correct &#...;
counterparts. After reading Lachlan's comment, an untested idea popped
into my head: you could try using the iconv module of PHP to convert the
Windows-1252 into UTF-8 on the fly.
I have neither Word nor Windows available, so I can't test it now...
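
Something along these lines, completely untested (the variable name is
made up):

<?php
// Convert Word's Windows-1252 output to UTF-8 on the fly, then tell
// the browser what it is getting.
$html = iconv('Windows-1252', 'UTF-8', file_get_contents($word_html_file));
header('Content-Type: text/html; charset=utf-8');
echo $html;
?>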

--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://www.odahoda.de/
Jul 24 '05 #3

On Thu, 14 Apr 2005, Lachlan Hunt wrote:
There's nothing inherently wrong with the curly quotes, the problem
with them is only that people fail to understand the character
encoding issues properly. Word documents are saved in the
Windows-1252 encoding by default. The quotes you are referring to
are in the positions 145 (), 146 (), 147 () and 148 ().
Thereby neatly presenting yet another demonstration of the problem ;-}
However, these code points (and all others in the range from 128 to
159 are control codes in ISO-8859-1 and others. Thus, the main
problem is only caused by declaring the incorrect character
encoding.
agreed
Although declaring the encoding as Windows-1252 in the HTTP headers
will work, it is not recommended because Windows-1252 is a
proprietary encoding designed for windows only (although support may
have been added to other systems too, but that's not guarenteed).
in fact, support is pretty widespread, but I'd still counsel against
using it.
The best options are to either save the files as UTF-8 and declare
that encoding in the HTTP headers or, continue to use ISO-8859-1 and
replace the quotes (and other special windows-1252 chars) with
numeric character references.
I just wanted to make sure that nobody reading this thought that you
meant character references such as &#145; etc.: funnily enough,
historically MS software seems to have generated those undefined
references more enthusiastically than the actual 8-bit characters, but
the undefined references are quite bogus from Unicode's point of view.

The correct Unicode
code points for these characters are all greater than 255, as you
obviously already know (there's a somewhat official table of them
with hex equivalents at
http://www.unicode.org/Public/MAPPIN...OWS/CP1252.TXT )
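
If one did want to repair those bogus references after the fact, the
mapping is tiny. Purely as an illustration, in PHP ($html is just a
placeholder here):

<?php
// Rewrite the bogus &#145;-&#148; references to the correct Unicode
// code points, per the mapping table above.
$html = strtr($html, array(
    '&#145;' => '&#8216;',
    '&#146;' => '&#8217;',
    '&#147;' => '&#8220;',
    '&#148;' => '&#8221;',
));
?>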
I think word does have an option to save files as UTF-8, which I
recommend.
I guess it depends on what version you're using. The subject line
mentioned 2003, but plenty of folks aren't there yet.
http://www.cs.tut.fi/~jkorpela/www/windows-chars.html


good cite.

all the best
Jul 24 '05 #4

In article <Pi******************************@ppepc56.ph.gla.ac.uk>,
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
Although declaring the encoding as Windows-1252 in the HTTP headers
will work, it is not recommended because Windows-1252 is a
proprietary encoding designed for windows only (although support may
have been added to other systems too, but that's not guarenteed).


in fact, support is pretty widespread, but I'd still counsel against
using it.


As a matter of principle or based on practical concerns?

--
Henri Sivonen
hs******@iki.fi
http://hsivonen.iki.fi/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 24 '05 #5

On Thu, 14 Apr 2005, Henri Sivonen wrote:
Although declaring the encoding as Windows-1252 in the HTTP headers
will work, it is not recommended because Windows-1252 is a
proprietary encoding designed for windows only (although support may
have been added to other systems too, but that's not guarenteed).


in fact, support is pretty widespread, but I'd still counsel against
using it.


As a matter of principle or based on practical concerns?


A bit of both, really. You can catch me out with some lukewarm
support for it at
http://ppewww.ph.gla.ac.uk/~flavell/.../checklist#s3a but support
for utf-8 is getting steadily better, whereas I doubt that support for
win-1252, widespread though it is, is going to improve much more (any
new browser versions which don't support it might well be leaving it
out on principle).
Jul 24 '05 #6

On 13 Apr 2005, Ron wrote:
The situation here is that the people creating the documents only know
Word, and aren't very computer savvy. I created a system where they
can save their Word documents as "html"
In which encoding? Code page 1252?
and upload them to a certain
directory, and the web page dynamically runs them through tidylib using
the tidy extension to php4, thus causing the document to display
correctly. I also run the files through a couple sed expressions to
remove xml tags that have no business being there.

It alllllmost works. The resulting document follows the page's css
rules and displays correctly, except for those durned "smart quotes".


Just add (e.g. to your sed expressions):
\221 -> ‘
\222 -> ’
\223 -> “
\224 -> ”

Jul 24 '05 #7

Alan J. Flavell wrote:
On Thu, 14 Apr 2005, Lachlan Hunt wrote:

There's nothing inherently wrong with the curly quotes, the problem
with them is only that people fail to understand the character
encoding issues properly. Word documents are saved in the
Windows-1252 encoding by default. The quotes you are referring to
are in the positions 145 (), 146 (), 147 () and 148 ().
Thereby neatly presenting yet another demonstration of the problem ;-}


I assume you meant my inclusion of the characters within the post.
Well, my news client did encode them as UTF-8 and correctly declare the
encoding in the headers, although yours decided to reply in US-ASCII
anyway, hence their removal from the quote.
Although declaring the encoding as Windows-1252 in the HTTP headers
will work, it is not recommended because Windows-1252 is a
proprietary encoding designed for windows only (although support may
have been added to other systems too, but that's not guarenteed).


in fact, support is pretty widespread, but I'd still counsel against
using it.


I suspected as much for Mac and Linux, but what about phones/PDAs
running an OS like Symbian? Then again, I'm sure it would be much
simpler to fully support Windows-1252 on such devices than it would be
to support the entire Unicode repertoire.
...numeric character references.


I just wanted to make sure that nobody reading this thought that you
meant character references such as &#145; etc.


Good point. I decided not to mention them as they were discussed in
more detail in the cited document, but perhaps I should have anyway.
funnily enough, historically MS software seems to have generated those undefined
references more enthusiastically than the actual 8-bit characters, but
the undefined references are quite bogus from Unicode's point of view.


Indeed, I don't even understand why MS decided to support such invalid
constructs. If they hadn't, invalid uses like that wouldn't even be
widely used on the web today. I guess it's just like all the other crap
they're responsible for introducing.

--
Lachlan Hunt
http://lachy.id.au/
http://GetFirefox.com/ Rediscover the Web
http://GetThunderbird.com/ Reclaim your Inbox
Jul 24 '05 #8


Thanks everyone for the excellent suggestions!
Ron
-
http://www.christianfamilywebsite.com
http://www.iswizards.com
Definition: Nelp: Contraction of "no help". Colloquial: Help
messages that are of no help whatsoever. Pertains to help files,
messages or documentation that convey no useful information, or
pedantically repeat the blindingly obvious.
Jul 24 '05 #9

On Thu, 14 Apr 2005, Lachlan Hunt wrote:
Alan J. Flavell wrote:
default. The quotes you are referring to are in the positions 145 (), 146
(), 147 () and 148 ().
Thereby neatly presenting yet another demonstration of the problem ;-}


I assume you meant my inclusion of the characters within the post.


Well, I really mean the exclusion of the characters by PINE.
Well, my news client did encode them as UTF-8 and correctly declare
the encoding in the headers.
Indeed. But since I had my side configured to use iso-8859-1, PINE
would've noticed they couldn't be represented in that, and so filtered
them out.

(In this respect I have to say that Lynx's behaviour is superior to
PINE's, though of course they perform different tasks.)
Although, yours decided to reply in US-ASCII anyway,


That would be because, after removal of the above characters, there
was nothing left that needed iso-8859-1 coding, and so PINE deliberately
downgrades the advertised charset to us-ascii, in the interests of
cross-compatibility.

_If_ you had included any iso-8859-1 characters, then PINE would have
posted in iso-8859-1. But it would still have filtered out the
extraneous codes which couldn't be represented in 8859-1. At least,
that's how I understand the behaviour of this version of PINE, and
others like it.
Jul 24 '05 #10

In article <Pi******************************@ppepc56.ph.gla.ac.uk>,
Alan J. Flavell <fl*****@ph.gla.ac.uk> wrote:
}On Thu, 14 Apr 2005, Lachlan Hunt wrote:
}
}> Well, my news client did encode them as UTF-8 and correctly declare
}> the encoding in the headers.
}
}Indeed. But since I had my side configured to use iso-8859-1, PINE
}would've noticed they couldn't be represented in that, and so filtered
}them out.
}
}[...]

There are an awful lot of news reading programs out there that do not
understand UTF-8. Many of them don't even understand the Content-Type
header at all, let alone HTML-formatted messages. None of that was in
the original news spec, and I don't know whether the current spec
includes it even now.
--
= Eric Bustad, Norwegian bachelor programmer
Jul 24 '05 #11

In article <57******************@news-server.bigpond.net.au>,
Lachlan Hunt <sp***********@gmail.com> wrote:
I suspected as much from Mac and Linux, but what about Phones/PDAs
running an OS like Symbian? Although, I'm sure it would be much simpler
to fully support Windows-1252 on such devices than it would to support
the entire unicode repertoire.


At least Ericsson phones support the UTF-8 encoding (well, the BMP part
of it) even when they do not support all of Unicode. Interestingly
enough, the T610, for example, does not support the Windows-1252
*repertoire* although it supports the UTF-8 *encoding*.

--
Henri Sivonen
hs******@iki.fi
http://hsivonen.iki.fi/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 24 '05 #12
