
Word 2007 XML document and the w namespace

Hi, I have an XML document that uses namespaces (it is from a Word 2007
file). I want to retrieve all the "t" elements that belong to the
"w" namespace (<w:t>) using XPath from VB.NET 2003 (.NET framework
1.1).

I've successfully loaded the document into an XmlDocument DOM parser
(I can dump the contents using OuterXml).

And, I've created an XmlNamespaceManager and assigned it the "w"
namespace.

But, when I call SelectNodes on the document using the XPath expression
w:* (this is supposed to return ALL elements in the "w" namespace),
only the w:document element is returned. And, if I change the
expression to w:*/*, only the w:body element is returned. If I try to
use the expression w:t, then nothing is returned.

Why isn't XPath returning all <w:t> nodes?

My code is:

    resultXML.LoadXml(xmlDoc)
    Dim manager As XmlNamespaceManager = New XmlNamespaceManager(resultXML.NameTable)
    manager.AddNamespace("w", "http://schemas.openxmlformats.org/wordprocessingml/2006/main")

    XPath = "w:t"
    Dim nodes As XmlNodeList = resultXML.SelectNodes(XPath, manager)
    If nodes.Count > 0 Then
        Dim node As XmlNode
        For Each node In nodes
            result = result + node.Name + " "
        Next
    Else
        result = "Can't find w:t"
    End If

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document
    xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006"
    xmlns:o="urn:schemas-microsoft-com:office:office"
    xmlns:o12="http://schemas.microsoft.com/office/2004/7/core"
    xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
    xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
    xmlns:v="urn:schemas-microsoft-com:vml"
    xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
    xmlns:w10="urn:schemas-microsoft-com:office:word"
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml">
  <w:body>
    <w:tbl>
      <w:tr w:rsidR="00000000">
        <w:tc>
          <w:p>
            <w:r w:rsidR="111111111">
              <w:t xml:space="preserve">THIS IS SOME TEXT</w:t>
            </w:r>
          </w:p>
        </w:tc>
      </w:tr>
    </w:tbl>
  </w:body>
</w:document>

Dec 5 '06 #1


Found it!

XPath should be: XPath = "//w:document/descendant::w:t"
Andy
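
For anyone reading later, here is a minimal sketch of why the earlier expressions matched so little. SelectNodes evaluates a relative path starting from the node it is called on, so on the XmlDocument itself w:* sees only the document element, w:*/* sees only that element's children, and w:t matches nothing because no w:t is a direct child of the document node. The descendant axis (or its // abbreviation) searches the whole tree. The sample XML is trimmed down from the first post, and the module and function names are made up for illustration.

```vbnet
Imports System
Imports System.Xml

Module XPathAxesDemo
    Const WordNs As String = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

    ' Counts the nodes an XPath expression selects, starting from the document node.
    Function CountMatches(ByVal xml As String, ByVal xpath As String) As Integer
        Dim doc As New XmlDocument()
        doc.LoadXml(xml)
        Dim manager As New XmlNamespaceManager(doc.NameTable)
        manager.AddNamespace("w", WordNs)
        Return doc.SelectNodes(xpath, manager).Count
    End Function

    ' A cut-down stand-in for word/document.xml (illustrative only).
    Function SampleXml() As String
        Return "<w:document xmlns:w=""" & WordNs & """><w:body><w:p>" & _
               "<w:r><w:t>one</w:t></w:r><w:r><w:t>two</w:t></w:r>" & _
               "</w:p></w:body></w:document>"
    End Function

    Sub Main()
        Dim xml As String = SampleXml()
        Console.WriteLine(CountMatches(xml, "w:*"))      ' children of the document node: w:document only
        Console.WriteLine(CountMatches(xml, "w:*/*"))    ' grandchildren: w:body only
        Console.WriteLine(CountMatches(xml, "w:t"))      ' no w:t is a direct child, so nothing
        Console.WriteLine(CountMatches(xml, "//w:t"))    ' every w:t anywhere in the tree
        Console.WriteLine(CountMatches(xml, "//w:document/descendant::w:t")) ' same nodes, long form
    End Sub
End Module
```

The shorter //w:t selects the same nodes as //w:document/descendant::w:t here, since every element in the file sits under w:document.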

Dec 5 '06 #2


Hi, I thought I'd share how my program works.

The Word 2007 docx format follows the ECMA Office Open XML standard,
which defines the XML that describes word processing, spreadsheet, and
other commonly used text documents (see
http://www.ecma-international.org/ne...te%20Paper.pdf).

The docx file itself is a standard ZIP file that contains several of
these XML documents in various folders, much like the older structured
storage MS Word files, which also kept several streams inside a single
container file.

To read the contents of the file, you first have to decompress it, and
then load the document.xml file located in the word folder into an XML
parser. You then use XPath to retrieve the document's text nodes and
concatenate them into a single string; this gives you all the text
(less its presentation) contained in the Word document.

To decompress the file, there are several .NET libraries available in
the .NET 2.0 and 3.0 frameworks. But, if you have to do this using
.NET 1.1 and VS2003, you will have to resort to using some J#.NET
libraries. I've had to use VS2003 and .NET 1.1.
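
On newer frameworks the J# dependency is not needed. The following is only a sketch, assuming .NET Framework 4.5 or later (which postdates this thread) and a project reference to the System.IO.Compression assembly. ReadEntry and BuildSampleZip are hypothetical helper names, and the sample zip is built in memory only so the sketch runs without a real .docx file. It also assumes the entry is UTF-8 text.

```vbnet
Imports System
Imports System.IO
Imports System.IO.Compression
Imports System.Text

Module DocxEntryReader
    ' Returns the text of one entry inside a zip stream, or Nothing if the entry is absent.
    ' Note that GetEntry matches the name exactly, including case.
    Function ReadEntry(ByVal zipStream As Stream, ByVal entryName As String) As String
        Using archive As New ZipArchive(zipStream, ZipArchiveMode.Read)
            Dim entry As ZipArchiveEntry = archive.GetEntry(entryName)
            If entry Is Nothing Then Return Nothing
            Using reader As New StreamReader(entry.Open(), Encoding.UTF8)
                Return reader.ReadToEnd()
            End Using
        End Using
    End Function

    ' Builds a tiny in-memory zip holding a stand-in word/document.xml (illustrative only).
    Function BuildSampleZip() As Byte()
        Dim ms As New MemoryStream()
        Using archive As New ZipArchive(ms, ZipArchiveMode.Create, True)
            Dim entry As ZipArchiveEntry = archive.CreateEntry("word/document.xml")
            Using writer As New StreamWriter(entry.Open(), New UTF8Encoding(False))
                writer.Write("<doc>hello</doc>")
            End Using
        End Using
        Return ms.ToArray()
    End Function

    Sub Main()
        Dim zipBytes() As Byte = BuildSampleZip()
        Console.WriteLine(ReadEntry(New MemoryStream(zipBytes), "word/document.xml"))
    End Sub
End Module
```

With a real file, passing File.OpenRead(filename) as the stream should work the same way.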

To use VS2003 and .NET 1.1, first ensure you have vjslib.dll on
your computer, as J# doesn't ship with .NET 1.1. If it's not there, you
will have to install the .NET 1.1 Visual J# Redistributable Package
from Microsoft.

Next, add a reference to vjslib.dll in your project (which can be in
VB.NET, C# or Managed C++). And, add Imports (or using) statements for
the following namespaces to the top of your listing:

java.util
java.util.zip
System.Collections
System.Xml
System.Xml.XPath

This will give you access to two J# zip classes (ZipFile and ZipEntry),
the Java-style Enumeration used to walk the zip entries, the XML DOM
classes, and the XmlNamespaceManager that resolves namespace prefixes
in XPath queries against the Word document.

To locate and extract the document.xml file, I used the following
VB.NET 2003 code:

    Dim xmlDoc As System.String = "" 'holds your extracted xml data (declared here so it is visible after the loop)
    Dim zFile As ZipFile = New ZipFile(filename) 'open zip file
    Dim zEntry As ZipEntry 'holds a zip file entry
    Dim zEnum As Enumeration = zFile.entries() 'walks the zip file
    While zEnum.hasMoreElements() 'walk the directory
        zEntry = CType(zEnum.nextElement(), ZipEntry) 'goto the next directory entry
        'test the entry itself; testing hasMoreElements() again here would skip the last entry
        If Not zEntry Is Nothing Then 'was there an entry?
            If LCase(zEntry.getName()) = "word/document.xml" Then 'was it document.xml?
                Dim entrySize As System.Int32 = CInt(zEntry.getSize()) 'get the uncompressed size
                'java uses SByte for raw bytes
                Dim sbuffer() As System.SByte = New System.SByte(entrySize - 1) {}
                'everyone else uses Byte for raw bytes
                Dim buffer() As System.Byte = New System.Byte(entrySize - 1) {}
                Dim zStream As java.io.InputStream = zFile.getInputStream(zEntry) 'open the compressed document
                Dim totalread As System.Int32 = 0
                Dim readcount As System.Int32 = 0
                While totalread < entrySize
                    'read at most 1k at a time, and never past the end of the buffer
                    readcount = zStream.read(sbuffer, totalread, System.Math.Min(1024, entrySize - totalread))
                    If readcount < 0 Then Exit While 'end of stream reached early
                    totalread = totalread + readcount
                End While
                zStream.close()
                System.Buffer.BlockCopy(sbuffer, 0, buffer, 0, totalread) 'convert from SByte to Byte
                xmlDoc = System.Text.Encoding.UTF8.GetString(buffer, 0, totalread) 'convert Byte to string
            End If
        End If
    End While
    zFile.close() 'close the zip file

At this point, xmlDoc contains a Word document similar to the one that
appears in the first post of this thread. If you got garbage, it's
probably because the encoding used by the Word document wasn't UTF-8;
the System.Text.Encoding class provides the other encodings.
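
One way to sidestep the encoding guess, sketched here as a suggestion rather than as part of the original program: hand the raw bytes to XmlDocument.Load through a stream, so the XML parser picks the encoding from the byte order mark or the XML declaration. LoadFromBytes and SampleUtf16Bytes are hypothetical names, and the UTF-16 sample only stands in for a file that is not UTF-8.

```vbnet
Imports System
Imports System.IO
Imports System.Text
Imports System.Xml

Module EncodingAwareLoad
    ' Parses the first "count" bytes of "data"; the parser detects the encoding itself.
    Function LoadFromBytes(ByVal data() As Byte, ByVal count As Integer) As XmlDocument
        Dim doc As New XmlDocument()
        doc.Load(New MemoryStream(data, 0, count))
        Return doc
    End Function

    ' Encodes a small UTF-16 document, with byte order mark, as a non-UTF-8 test input.
    Function SampleUtf16Bytes() As Byte()
        Dim xml As String = "<?xml version=""1.0"" encoding=""UTF-16""?><root>caf" & ChrW(233) & "</root>"
        Dim ms As New MemoryStream()
        Dim writer As New StreamWriter(ms, Encoding.Unicode)
        writer.Write(xml)
        writer.Flush()
        Return ms.ToArray()
    End Function

    Sub Main()
        Dim data() As Byte = SampleUtf16Bytes()
        Dim doc As XmlDocument = LoadFromBytes(data, data.Length)
        ' Compare against the expected text instead of printing the accented character.
        Console.WriteLine(doc.DocumentElement.InnerText = "caf" & ChrW(233))
    End Sub
End Module
```

In the extraction code above, that would mean calling Load on a MemoryStream over buffer and totalread instead of decoding with Encoding.UTF8 and calling LoadXml.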

The next step is to walk the <w:t> nodes and collect all the text data
scattered throughout the document. Depending on how the document was
edited, contiguous text can be broken across several <w:t> nodes. Text
from <w:t> siblings on the same parent node has to be concatenated
together, while text from <w:t> nodes on different parents has to be
concatenated together but separated by a space. I've also interpreted
<w:tab> nodes as a single text space character.

I've used the following VB.NET 2003 code to do this:

    Dim resultXML As New XmlDocument
    resultXML.LoadXml(xmlDoc) 'load the document.xml into the parser
    Dim manager As XmlNamespaceManager = New XmlNamespaceManager(resultXML.NameTable)
    'namespace w is defined at the top in the word document xml
    manager.AddNamespace("w", "http://schemas.openxmlformats.org/wordprocessingml/2006/main")

    Dim text As System.String = "" 'holds your extracted text
    Dim nodes As XmlNodeList
    Dim node As XmlNode
    'get all the text, tab, and paragraph nodes in the order that they appear
    Dim XPath As System.String = _
        "//w:document/descendant::w:t|//w:document/descendant::w:p|//w:document/descendant::w:tab"
    'concatenate all the text
    nodes = resultXML.SelectNodes(XPath, manager)
    If nodes.Count > 0 Then
        For Each node In nodes
            If node.Name = "w:p" Or node.Name = "w:tab" Then
                text = text + " "
            Else
                text = text + node.InnerText
            End If
        Next
    End If

At this point, text contains all the textual content of my Word 2007
document.
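
A variation on the same idea, offered as a sketch and not as the code I actually ran: walk the document one paragraph at a time, join the runs inside each paragraph directly, and put a single space between paragraphs. It selects w:tab only when it is a child of a run (w:r), because w:tab elements also appear under w:tabs as tab stop definitions, which are not text. ExtractText and SampleXml are hypothetical names. Paragraphs nested inside other paragraphs (text boxes, for example) are not handled specially and would have their text counted twice.

```vbnet
Imports System
Imports System.Text
Imports System.Xml

Module ParagraphText
    Const WordNs As String = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

    ' Joins the text of each paragraph, then joins non-empty paragraphs with one space.
    Function ExtractText(ByVal documentXml As String) As String
        Dim doc As New XmlDocument()
        doc.LoadXml(documentXml)
        Dim manager As New XmlNamespaceManager(doc.NameTable)
        manager.AddNamespace("w", WordNs)

        Dim result As New StringBuilder()
        For Each para As XmlNode In doc.SelectNodes("//w:p", manager)
            Dim paraText As New StringBuilder()
            ' Text nodes and run-level tabs, in document order, relative to this paragraph.
            For Each node As XmlNode In para.SelectNodes(".//w:t | .//w:r/w:tab", manager)
                If node.LocalName = "tab" Then
                    paraText.Append(" ")
                Else
                    paraText.Append(node.InnerText)
                End If
            Next
            If paraText.Length > 0 Then
                If result.Length > 0 Then result.Append(" ")
                result.Append(paraText.ToString())
            End If
        Next
        Return result.ToString()
    End Function

    ' Two paragraphs: split runs in the first, a tab between runs in the second.
    Function SampleXml() As String
        Return "<w:document xmlns:w=""" & WordNs & """><w:body>" & _
               "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>wor</w:t><w:t>ld</w:t></w:r></w:p>" & _
               "<w:p><w:r><w:t>a</w:t><w:tab/><w:t>b</w:t></w:r></w:p>" & _
               "</w:body></w:document>"
    End Function

    Sub Main()
        Console.WriteLine(ExtractText(SampleXml())) ' prints "Hello world a b"
    End Sub
End Module
```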

Dec 6 '06 #3
