Bytes IT Community

Cleaning MS Word input - last resort!!

Dear all,

I have a problem with a form, and I have tried various permutations of
htmlentities() and html_entity_decode() to resolve, but without success.

Here is the workflow.

1: User pastes MS Word formatted text into form field.
2: Server uses mail() to send input text to mail client.
3: Recipient pastes text into html file.

The problem is that MS Word contains peculiar characters for things like
bullets, which come out as tabs, which then come out as different, but
spurious, html characters in the html translation.

Does anyone know of a function(s) that can clean up MS Word input into
something that can be simply pasted as plain text without spurious
characters?

Turner
Feb 21 '06 #1
3 Replies


From a comment on the PHP documentation for the utf8_decode() function
http://us2.php.net/manual/en/function.utf8-decode.php
peter dot mescalchin at geemail dot com
27-Dec-2005 06:43

Adding to below I have a few more MS Word characters that need
replacing. Found this was required when "fixing" some phpMyAdmin export
scripts from a live server where MS Word characters were all through
the content - before importing them back into my local MySQL database.

The code I wrote for this process also does a strpos for any extra
"\xe2\x80" strings - which are the tell-tale sign of any funny
characters I want removed.

Here are my updated arrays()

<?php
// UTF-8 byte sequences for the characters MS Word substitutes
$badchr = array(
"\xe2\x80\xa6", // ellipsis
"\xe2\x80\x93", // en dash
"\xe2\x80\x94", // em dash
"\xe2\x80\x98", // single quote opening
"\xe2\x80\x99", // single quote closing
"\xe2\x80\x9c", // double quote opening
"\xe2\x80\x9d", // double quote closing
"\xe2\x80\xa2" // dot used for bullet points
);

// Plain ASCII replacements, in the same order as $badchr
$goodchr = array(
'...',
'-',
'-',
'\'',
'\'',
'"',
'"',
'*'
);
?>
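
The quoted comment gives only the two arrays, not the code that applies them. A minimal sketch of how they might be used follows, assuming the form input arrives as UTF-8. The function names clean_word_chars() and has_leftover_word_chars() are hypothetical and not part of the original comment; str_replace() and strpos() are the standard PHP functions.

```php
<?php
// Hypothetical helper (name not from the original comment):
// swaps Word's "smart" punctuation for plain ASCII equivalents.
function clean_word_chars(string $text): string
{
    // UTF-8 byte sequences for the characters Word tends to insert.
    $badchr = array(
        "\xe2\x80\xa6", // ellipsis
        "\xe2\x80\x93", // en dash
        "\xe2\x80\x94", // em dash
        "\xe2\x80\x98", // single quote opening
        "\xe2\x80\x99", // single quote closing
        "\xe2\x80\x9c", // double quote opening
        "\xe2\x80\x9d", // double quote closing
        "\xe2\x80\xa2", // bullet
    );
    // Replacements, in the same order as $badchr.
    $goodchr = array('...', '-', '-', "'", "'", '"', '"', '*');

    // str_replace() pairs the two arrays element by element.
    return str_replace($badchr, $goodchr, $text);
}

// Hypothetical helper: reports whether any other "\xe2\x80" sequence
// is still present, as the strpos check in the comment describes.
function has_leftover_word_chars(string $text): bool
{
    return strpos($text, "\xe2\x80") !== false;
}

$input = "\xe2\x80\x9cHello\xe2\x80\x9d \xe2\x80\x93 it\xe2\x80\x99s done\xe2\x80\xa6";
$clean = clean_word_chars($input);
echo $clean, "\n";
```

This only handles input that is really UTF-8. If the browser submits the form as Windows-1252, the same characters arrive as single bytes and these patterns will not match, so the form's character encoding would need checking first.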
--
Julien CROUZET - DSI Theoconcept
julien.crouzet@/enlever ca\theoconcept.com
http://www.theoconcept.com
Feb 21 '06 #2

tidy perhaps?

http://us3.php.net/manual/en/ref.tidy.php

http://www.zend.com/php5/articles/php5-tidy.php

http://www.w3.org/People/Raggett/tidy/

--
Justin Koivisto, ZCE - ju****@koivi.com
http://koivi.com
Feb 21 '06 #3

Thank you!!
Feb 25 '06 #4
