
Converting Word to XML

P: n/a
Hello everybody,
I have to convert a Word document into multiple HTML files, according to the templates in the Word document.

I have converted the Word document to XML, and through EXSLT I have generated the multiple output HTML files.

The problem is that while reading the Word document's paragraphs, the special characters are not recognized.

So in the XML file they are just stored as "?", and I am not able to retrieve the characters in my output HTML files.
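
One common way for characters to end up as "?" is a lossy encoding step: when text is written in an encoding that cannot represent a character, many writers substitute "?". The post does not show the conversion code, so this is only an assumption about the cause, sketched in Python with made-up sample text:

```python
# Hypothetical sample text; the real Word content is not shown in the post.
text = "Welcome ¢£¤"

# Encoding to ASCII with replacement turns each unsupported character
# into "?", and the original character cannot be recovered afterwards.
lossy = text.encode("ascii", errors="replace").decode("ascii")

# UTF-8 can represent every Unicode character, so the round trip is lossless.
lossless = text.encode("utf-8").decode("utf-8")

print(lossy)
print(lossless)
```

If this is what is happening, the fix is to read the Word text as Unicode and write the XML file as UTF-8 (or UTF-16) rather than in a single-byte encoding.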


Nov 12 '05 #1


P: n/a
Using CDATA, the XML preserves the special characters, but when it is processed through XSL they are not recognized.

Nov 12 '05 #2

P: n/a
prabha wrote:
Using CDATA, the XML preserves the special characters, but when it is processed through XSL they are not recognized.


What do you mean?

--
Oleg Tkachenko
XmlInsider
http://blog.tkachenko.com
Nov 12 '05 #3

P: n/a

I am reading from tables in the Word document, which include special characters and HTML tags. When I converted the Word document to structured XML, it was converting < to &lt; and special characters like "¢" to &cent;. So when I write the XML element text, I wrap it in CDATA, which stores the value unchanged.
In the XSL, I used EXSLT to convert the XML to multiple HTML files.
The XSLT processor outputs < as &lt;. I tried disable-output-escaping and xsl:copy-of, but I could not get the conversion to work with those. Then I used HttpUtility's HtmlDecode method to convert "&lt;" to "<".
Now my problem is solved. Also, for indenting the HTML files I used the Tidy component.
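
The decoding step described above can be sketched in Python, whose standard library offers `html.unescape` as a rough counterpart to .NET's `HttpUtility.HtmlDecode`. The input string is made up for illustration; it is not taken from the actual output files.

```python
import html

# Hypothetical fragment as it might look after the XSLT step escaped the markup.
escaped = "&lt;UL&gt;&lt;LI&gt;Welcome &cent;&pound;&lt;/UL&gt;"

# html.unescape turns entity and character references back into characters,
# so the escaped markup becomes real HTML tags again.
decoded = html.unescape(escaped)

print(decoded)
```

Note that decoding a whole file this way also unescapes any "&lt;" that was meant to appear as literal text, so it only suits content where every escaped tag is really intended as markup.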

-- This is my XML output --
<Screen><Id>4_1_0</Id><Graphics>40</Graphics><Textcontent>60</Textcontent><TitleText><![CDATA[Refinance for a Higher Rate? / Welcome¢£¤]]></TitleText><Text><![CDATA[<UL><LI>fdfdgdgg¢£¤<LI>In this module, you will learn about situations where refinancing for a higher rate is a great benefit for your customers<UL><LI>Rtertert¢£¤<LI>Fseretert<UL><LI>Fsdgj¢£¤ <UL><LI>Ereret¢£¤<LI>Etetewt<UL><LI>dfndgdfg¢ ¤<LI>dfsdfdsg<LI>dsfgdsg<UL><LI>dfhdejthjrt©<LI>rrtrttrret<LI>ygfhgfhh<UL><LI>fgfdgdh<LI>d gdfggdfg<LI>dfdhjfdfjkgf</UL></UL></UL><LI>etetretry©</UL><LI>Dgkf<LI>Ldfldg©</UL></UL><LI>DOES THE CLIENT WANT TO ADD TARGETED OBJECTIVES FOR THIS MODULE? (do not code this) ©]]></Text></Screen>

Anyway, thanks for your response.

Nov 12 '05 #4

P: n/a
prabha wrote:
I am reading from tables in the Word document, which include special
characters and HTML tags. When I converted the Word document to structured
XML, it was converting < to &lt; and special characters like "¢" to &cent;.

While the former is OK, the latter means your output encoding doesn't allow
the ¢ character to be placed natively.

prabha wrote:
When I write the XML element text, I wrap it in CDATA, which stores the
value unchanged. In the XSL, I used EXSLT to convert the XML to multiple
HTML files. The XSLT processor outputs < as &lt;. I tried
disable-output-escaping and xsl:copy-of, but I could not get the conversion
to work with those.

That's a known limitation of the EXSLT.NET implementation of the
exsl:document extension element. disable-output-escaping is ignored, as
always, when you are transforming to an XmlWriter.
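
The point about the output encoding can be illustrated with a small Python sketch. It uses the standard library's `xml.etree.ElementTree` serializer rather than the .NET classes discussed in the thread, so treat it as an analogy for the general behavior, not as a description of XmlWriter itself. The element name and text are sample values.

```python
import xml.etree.ElementTree as ET

# Sample element holding the cent sign mentioned in the thread.
elem = ET.Element("TitleText")
elem.text = "Welcome ¢"

# US-ASCII cannot represent "¢", so the serializer falls back to a
# numeric character reference.
as_ascii = ET.tostring(elem, encoding="us-ascii")

# UTF-8 can represent it, so the character is written natively.
as_utf8 = ET.tostring(elem, encoding="utf-8")

print(as_ascii)
print(as_utf8)
```

Both outputs describe the same character to an XML parser; choosing an encoding such as UTF-8 simply avoids the reference form.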
--
Oleg Tkachenko
XmlInsider
http://blog.tkachenko.com
Nov 12 '05 #5

P: n/a
Can you just tell me whether my approach is OK or not?
Is there a more efficient approach for my requirements?

Nov 12 '05 #6

P: n/a
prabha wrote:
Can you just tell me whether my approach is OK or not?

I think it's OK.

prabha wrote:
Is there a more efficient approach for my requirements?

The best way would be to avoid escaped HTML markup (make it XHTML, for
instance). As you can see now, escaped markup is almost always trouble.
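
A minimal sketch of the difference, using Python's `xml.etree.ElementTree` and shortened, made-up versions of the `<Text>` element from earlier in the thread. It assumes the HTML can be made well-formed (closed `<li>` tags and so on), which the original sample is not.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened versions of the <Text> element from the thread.
escaped_doc = "<Text><![CDATA[<UL><LI>Welcome ¢</LI></UL>]]></Text>"
xhtml_doc = "<Text><ul><li>Welcome ¢</li></ul></Text>"

escaped = ET.fromstring(escaped_doc)
embedded = ET.fromstring(xhtml_doc)

# With CDATA the markup is only a string: there are no child elements,
# and a serializer will escape "<" again on output.
print(len(escaped), repr(escaped.text))

# With well-formed XHTML the list is part of the tree, so a stylesheet
# (or any XML tool) can copy it to the output without unescaping tricks.
items = [li.text for li in embedded.iter("li")]
print(len(embedded), items)
```

With the markup embedded as real elements, xsl:copy-of can carry it into the result tree as is, and no HtmlDecode pass is needed afterwards.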
--
Oleg Tkachenko
XmlInsider
http://blog.tkachenko.com
Nov 12 '05 #7

P: n/a

prabha wrote:
Hello everybody,
I have to convert a Word document into multiple HTML files, according to the
templates in the Word document.

I have converted the Word document to XML, and through EXSLT I have generated
the multiple output HTML files.

The problem is that while reading the Word document's paragraphs, the
special characters are not recognized.

So in the XML file they are just stored as "?", and I am not able
to retrieve the characters in my output HTML files.


*************
Hi, could you please let me know how you were able to convert Word (is
it 2003?) to XML? I am looking for something that can parse out
customer-supplied XML tags from a Word 2003 doc. Any help is appreciated.

Thanks,
Me

memorex

Nov 12 '05 #8

P: n/a
memorex wrote:
Hi, could you please let me know how you were able to convert Word (is
it 2003?) to XML? I am looking for something that can parse out
customer-supplied XML tags from a Word 2003 doc. Any help is appreciated.


If your customer can save Word documents as XML, then this "something"
is an XML parser.
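
As a rough illustration, here is a Python sketch that pulls custom-namespace elements out of a document saved in the Word 2003 XML format. The customer namespace, the tag names, and the sample document are all made up for this example; a real file saved by Word contains far more markup, and the custom tags depend on the schema attached to the document.

```python
import xml.etree.ElementTree as ET

# Namespace used by Word 2003's XML format (WordprocessingML).
W = "http://schemas.microsoft.com/office/word/2003/wordml"
# Hypothetical customer namespace, made up for this sketch.
CUST = "urn:example:customer"

sample = f"""<w:wordDocument xmlns:w="{W}" xmlns:c="{CUST}">
  <w:body>
    <c:Screen>
      <c:TitleText><w:p><w:r><w:t>Refinance for a Higher Rate?</w:t></w:r></w:p></c:TitleText>
    </c:Screen>
  </w:body>
</w:wordDocument>"""

root = ET.fromstring(sample)

# Collect every customer-namespace element together with the text of the
# Word runs (w:t) nested inside it.
found = {}
for elem in root.iter():
    if elem.tag.startswith("{" + CUST + "}"):
        name = elem.tag.split("}", 1)[1]
        found[name] = "".join(t.text or "" for t in elem.iter(f"{{{W}}}t"))

print(found)
```

The same idea carries over to any XML parser: match on the customer's namespace and ignore the Word formatting elements around the text.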

--
Oleg Tkachenko [XML MVP, XmlInsider]
http://blog.tkachenko.com
Nov 12 '05 #9
