I'm feeling very stupid about this ...
pdf2html (http://pdf2html.sourceforge.net) is an app that reads a PDF
and can generate HTML or XML; in my case I'm using the XML. The PDF I'm
working with is a concatenation of many reports; my objective is to
find the first page of each report, which I've discovered can be found
in this particular instance by looking for a "text" element whose
"left" attribute equals 277.
So I want to consume this XML using XPath, to find every "text" element
whose "left" attribute is 277 and then take the "page" element that
contains it. The XPath expression is therefore:
"/pdf2xml/page/text[@left=277]"
Works great ... IF I change the XML output by the tool to remove the
DTD reference. If I leave the DTD reference in there, it stops finding
any nodes. Why? Does the presence of the DTD reference automatically
assign a namespace? Do I need an XmlNamespaceManager? What do I use it
with?
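One way to test the namespace theory, as a sketch only: load the file
through the same XmlTextReader and print the namespace URI of the root
element. If it comes back non-empty, something (most likely a defaulted
xmlns attribute in the DTD) is putting the elements in a namespace, and
the expression would need a prefix registered through an
XmlNamespaceManager. The module name NamespaceCheck and the prefix "p"
are made up for illustration; SelectSingleNode and the Select overload
that takes a namespace resolver assume .NET 2.0 or later.

```vbnet
Imports System
Imports System.Xml
Imports System.Xml.XPath

Module NamespaceCheck
    Sub Main()
        ' Same reader type as the original code, so the DTD is handled the same way.
        Dim r As New XmlTextReader("test.xml")
        Dim doc As New XPathDocument(r)
        r.Close()
        Dim nav As XPathNavigator = doc.CreateNavigator()

        ' "/*" selects the document element regardless of its namespace.
        Dim root As XPathNavigator = nav.SelectSingleNode("/*")
        Console.WriteLine("root: {0}, namespace: '{1}'", root.LocalName, root.NamespaceURI)

        If root.NamespaceURI <> "" Then
            ' Hypothetical prefix "p" bound to whatever namespace the root reports.
            Dim ns As New XmlNamespaceManager(nav.NameTable)
            ns.AddNamespace("p", root.NamespaceURI)
            Dim hits As XPathNodeIterator = _
                nav.Select("/p:pdf2xml/p:page/p:text[@left=277]", ns)
            Console.WriteLine("matches with prefix: {0}", hits.Count)
        End If
    End Sub
End Module
```

If the namespace prints as empty, the namespace theory is ruled out and
the cause lies elsewhere in how the DTD is processed.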
Altering the input XML is not the preferred option here. I also have a
version that just uses the Reader to walk the tree ... I want to get
away from that because I eventually want to be able to specify an XPath
query as input.
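A possible way to leave the file alone, sketched under the assumption
that loading the DTD is what interferes: ask the reader to skip the
DOCTYPE. This relies on XmlReaderSettings.DtdProcessing, which assumes
.NET Framework 4.0 or later (older versions would need a different
switch). The names IgnoreDtdSketch and SelectIgnoringDtd are made up.
The query is passed in as a string, in line with wanting to take the
XPath as input.

```vbnet
Imports System
Imports System.Xml
Imports System.Xml.XPath

Module IgnoreDtdSketch
    ' Loads the file with the DOCTYPE skipped and runs the given XPath query.
    Function SelectIgnoringDtd(ByVal inputfile As String, _
                               ByVal query As String) As XPathNodeIterator
        Dim settings As New XmlReaderSettings()
        settings.DtdProcessing = DtdProcessing.Ignore

        Using reader As XmlReader = XmlReader.Create(inputfile, settings)
            ' XPathDocument reads everything into memory in its constructor,
            ' so the iterator stays usable after the reader is disposed.
            Dim doc As New XPathDocument(reader)
            Return doc.CreateNavigator().Select(query)
        End Using
    End Function

    Sub Main()
        Dim hits As XPathNodeIterator = _
            SelectIgnoringDtd("test.xml", "/pdf2xml/page/text[@left=277]")
        Console.WriteLine("matches: {0}", hits.Count)
    End Sub
End Module
```

With the DTD ignored, any attribute defaults or entities it declares
are not applied, which should be equivalent to removing the DOCTYPE
line by hand.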
My code:
' Needs Imports System.Xml and Imports System.Xml.XPath at file level.
Sub test()
    Dim inputfile As String = "test.xml"
    Dim r As New XmlTextReader(inputfile)
    Dim xd As New XPathDocument(r)
    Dim nav As XPathNavigator = xd.CreateNavigator()
    Dim expr As XPathExpression = _
        nav.Compile("/pdf2xml/page/text[@left=277]")
    Dim ni As XPathNodeIterator = nav.Select(expr)
    Do While ni.MoveNext()
        ' Each match is a "text" element; step up to the enclosing "page".
        Dim node As XPathNavigator = ni.Current
        Dim ani As XPathNodeIterator = _
            node.SelectAncestors(XPathNodeType.Element, False)
        ani.MoveNext()
        Dim pagenum As Integer = _
            CInt(ani.Current.GetAttribute("number", ""))
        Debug.WriteLine(pagenum)
    Loop
    r.Close()
End Sub
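The ancestor walk in the loop could probably be folded into the
expression itself by moving the test into a predicate on "page". A
minimal sketch, assuming the same test.xml; the module name
PagePredicateSketch is made up, and this does nothing about the DTD
problem:

```vbnet
Imports System
Imports System.Xml
Imports System.Xml.XPath

Module PagePredicateSketch
    Sub Main()
        Dim r As New XmlTextReader("test.xml")
        Dim doc As New XPathDocument(r)
        r.Close()
        Dim nav As XPathNavigator = doc.CreateNavigator()

        ' Select each "page" that has at least one "text" child with left = 277.
        Dim pages As XPathNodeIterator = _
            nav.Select("/pdf2xml/page[text[@left=277]]")
        Do While pages.MoveNext()
            Dim pagenum As Integer = _
                Integer.Parse(pages.Current.GetAttribute("number", ""))
            Console.WriteLine(pagenum)
        Loop
    End Sub
End Module
```

Each page is returned once even if several of its "text" elements
match, so no de-duplication is needed.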
My XML is below, showing two pages; the desired result is to get the
first page. It's actual output from pdf2html, slightly stripped and
censored.
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml>
  <page number="1" position="absolute" top="0" left="0" height="1188" width="918">
    <text top="805" left="277" width="0" height="18" font="0"><i><b>Person Name</b></i></text>
    <text top="805" left="298" width="0" height="18" font="0"><i><b>123 Main St</b></i></text>
    <text top="805" left="319" width="0" height="18" font="0"><i><b>Hometown, IL 60000</b></i></text>
  </page>
  <page number="2" position="absolute" top="0" left="0" height="1188" width="918">
    <text top="245" left="144" width="136" height="18" font="0"><i><b>Person Name</b></i></text>
    <text top="266" left="144" width="124" height="18" font="0"><i><b>123 Main St</b></i></text>
    <text top="287" left="144" width="168" height="18" font="0"><i><b>Hometown, IL 60000</b></i></text>
    <text top="470" left="143" width="319" height="19" font="1"><b>STATEMENT OF MANAGEMENT FEES</b></text>
  </page>
</pdf2xml>