Interested in System ID only, not the whole parsing ...

I am not sure if this is a naive question, but I have an arbitrarily
long document in which I know that a DOCTYPE declaration exists. I am
not interested in "parsing" the document. All I am interested in is
finding out what the system ID and public ID of the document are.

One way I can think of is to write an entity resolver and arrange for
its implementation of resolveEntity() to return an appropriate
InputSource while recording the system ID, because the system and
public IDs are passed to that method.

If that's the only way to achieve it, my question is:
- Will doing it this way carry a performance overhead, given that I
have to call the parse() method?

If there are other ways of achieving this (again, noting that I am
only interested in the declaration part), please let me know.

Thank you!

Mar 3 '07 #1
Dhurandhar Bhatvadekar wrote:
> I am not sure if this is a naive question. But I have an arbitrarily
> long document where I know that a DOCTYPE
> declaration exists. I am not interested in "parsing" the document. All
> I am interested is in finding out what the
> System id and Public id of the document is.

Outside of writing a parser yourself for that much of the document...

Run a SAX parser, and as soon as you've gotten that information have
your handler throw an exception to crash the parser. (Obviously the code
that calls the parser will want to catch and recognize this particular
exception as a "normal abnormal exit.")

However, when I proposed that to one manager, he held his nose and
insisted that I let the parser finish spinning instead. And I can't
_entirely_ disagree with him.
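
A minimal sketch of the "throw an exception to stop" approach described above, assuming a JAXP parser that supports the SAX2 lexical-handler property (the parser bundled with the JDK does). Rather than an EntityResolver it uses LexicalHandler.startDTD(), which is handed the DOCTYPE name, public ID, and system ID directly. The names DoctypeSniffer, StopParsing, and sniff() are made up for illustration and are not part of any library.

```java
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

class DoctypeSniffer {

    /** Hypothetical marker exception: a deliberate early exit, not a real error. */
    static final class StopParsing extends SAXException {
        private static final long serialVersionUID = 1L;
        StopParsing() { super("stopped on purpose"); }
    }

    /** Returns {publicId, systemId}, or null if the root element is reached with no DOCTYPE. */
    static String[] sniff(InputSource in)
            throws IOException, SAXException, ParserConfigurationException {
        final String[][] found = new String[1][];
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void startDTD(String name, String publicId, String systemId)
                    throws SAXException {
                found[0] = new String[] { publicId, systemId };
                throw new StopParsing(); // we have what we came for
            }

            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes atts) throws SAXException {
                throw new StopParsing(); // root element reached: no DOCTYPE present
            }
        };

        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        try {
            reader.parse(in);
        } catch (StopParsing expected) {
            // the "normal abnormal exit"; any other SAXException still propagates
        }
        return found[0];
    }
}
```

Per the SAX documentation, the system ID reported to startDTD() is the one declared in the document and is not resolved against the base URI, whereas the one passed to resolveEntity() is fully resolved if it is a URL. Which of the two is wanted depends on the use case.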
--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Mar 3 '07 #2
Hi Joe,

Thanks for your reply. So, here is some code-review time for you. Can
you please let me know if the following
will work? With my preliminary tests it appears to work. But I want to
be sure.

----------------------------------------------
// Imports needed by this method:
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

private String getSystemIdFromDtd() {
    // Uses a streaming (SAX) parser. Any parsing or I/O error is
    // rethrown as a RuntimeException.
    BufferedInputStream bis = null;
    try {
        bis = new BufferedInputStream(new FileInputStream(xml)); // xml is defined elsewhere
        final XMLReader xr = XMLReaderFactory.createXMLReader();
        final InputSource is = new InputSource(bis);
        xr.setEntityResolver(new EntityResolver() {
            public InputSource resolveEntity(final String pid, final String sid)
                    throws SAXException, IOException {
                if (sid != null) {
                    mSystemId = sid.trim(); // mSystemId is defined elsewhere
                    // resolve the entities locally somehow and
                    // return a meaningful InputSource instance
                } // else default resolution
                return null;
            }
        });
        xr.parse(is);
        return mSystemId;
    } catch (final Exception ioe) {
        throw new RuntimeException(ioe);
    } finally {
        try {
            if (bis != null) {
                bis.close();
            }
        } catch (Exception ee) {
            // squelching ee on purpose
        }
    }
}
------------------------------------------------------------------
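
One possible way to fill in the "resolve the entities locally somehow" step, offered only as a sketch: record the IDs from the first resolveEntity() call and hand the parser an empty InputSource so that it never fetches the external DTD. This assumes the first external entity the parser asks for is the external DTD subset (not guaranteed if the internal subset pulls in external parameter entities first) and that the document does not depend on entities or defaults declared in the DTD. The class name CapturingResolver and the capture() helper are hypothetical. Note that the parser still reads the whole document.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

class CapturingResolver implements EntityResolver {
    private boolean seen;
    private String publicId;
    private String systemId;

    @Override
    public InputSource resolveEntity(String pid, String sid) {
        if (!seen) {
            // Assumption: the first request is for the external DTD subset.
            seen = true;
            publicId = pid;
            systemId = sid;
        }
        // An empty stream stands in for every external entity, so nothing is fetched.
        return new InputSource(new StringReader(""));
    }

    String getPublicId() { return publicId; }

    String getSystemId() { return systemId; }

    /** Parses the whole input and returns the resolver holding whatever IDs it saw. */
    static CapturingResolver capture(InputSource in)
            throws IOException, SAXException, ParserConfigurationException {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        CapturingResolver resolver = new CapturingResolver();
        reader.setEntityResolver(resolver);
        reader.parse(in);
        return resolver;
    }
}
```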

Thanks again!

Mar 3 '07 #3
Sorry, but code review goes beyond what you get for free.
Mar 3 '07 #4
Dhurandhar Bhatvadekar wrote:
> I am not sure if this is a naive question. But I have an arbitrarily
> long document where I know that a DOCTYPE
> declaration exists. I am not interested in "parsing" the document. All
> I am interested is in finding out what the
> System id and Public id of the document is.

All XML tools conduct a formal parse, either for well-formedness or for
validity as well. This implies they read to the end of the file. Most
XML tools don't provide for fragmentary reading, so the penalty when you
"just" want something from the top of the file is enormous unless you do
the "crash me when I find it" trick.

If you can guarantee that the entire Document Type Declaration will be
contained in the first nn lines of the file, and that double quotes
have been used to delimit the identifiers, then the following Unix
commands will do the job, returning two lines. For a SYSTEM declaration
the first line is the system identifier and the second is empty; for a
PUBLIC declaration the first line is the FPI and the second is the
system identifier:

head -nn yourfile.xml | tr '\012\015<' '\040\040\012' | grep -m 1 '^!DOCTYPE' | awk -F\" '{print $2 "\n" $4}'

The commands head, tr, grep, and awk are also available for Windows.
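
The same "look only at the top of the file" idea can be sketched in Java with a regular expression. This is not a real parse: it assumes the declaration falls within the first maxChars characters, that the file can be decoded as UTF-8, and that no earlier comment or processing instruction contains text resembling a DOCTYPE. Unlike the pipeline above, it accepts single or double quotes. The names DoctypeRegex, readHead(), and extract() are illustrative only.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DoctypeRegex {
    // A quoted literal in either quote style; exactly one of its two groups matches.
    private static final String LITERAL = "(?:\"([^\"]*)\"|'([^']*)')";

    private static final Pattern DOCTYPE = Pattern.compile(
            "<!DOCTYPE\\s+[^\\s\\[>]+\\s+(?:PUBLIC\\s+" + LITERAL + "\\s+" + LITERAL
            + "|SYSTEM\\s+" + LITERAL + ")");

    /** Reads at most maxChars characters from the start of the file (UTF-8 assumed). */
    static String readHead(Path file, int maxChars) throws IOException {
        char[] buf = new char[maxChars];
        int len = 0;
        try (Reader r = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            int n;
            while (len < maxChars && (n = r.read(buf, len, maxChars - len)) != -1) {
                len += n;
            }
        }
        return new String(buf, 0, len);
    }

    /** Returns {publicId, systemId} (publicId may be null), or null if nothing matched. */
    static String[] extract(String head) {
        Matcher m = DOCTYPE.matcher(head);
        if (!m.find()) {
            return null;
        }
        String publicId = either(m.group(1), m.group(2));
        String systemId = either(either(m.group(3), m.group(4)),
                                 either(m.group(5), m.group(6)));
        return new String[] { publicId, systemId };
    }

    private static String either(String a, String b) {
        return a != null ? a : b;
    }
}
```

A call such as DoctypeRegex.extract(DoctypeRegex.readHead(path, 4096)) would cover the common case; the 4096 is an arbitrary guess at how far into the file the declaration can sit.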

///Peter
--
XML FAQ: http://xml.silmaril.ie/
Mar 4 '07 #5
Joe Kesselman wrote:
> Dhurandhar Bhatvadekar wrote:
>> I am not sure if this is a naive question. But I have an arbitrarily
>> long document where I know that a DOCTYPE
>> declaration exists. I am not interested in "parsing" the document. All
>> I am interested is in finding out what the
>> System id and Public id of the document is.
>
> Outside of writing a parser yourself for that much of the document...
>
> Run a SAX parser, and as soon as you've gotten that information have
> your handler throw an exception to crash the parser. (Obviously the code

Following Joe's idea (and assuming there always
_is_ a DOCTYPE declaration in your file), I
implemented this in XMLgawk:

XMLSTARTDOCT {
print XMLATTR["PUBLIC"], XMLATTR["SYSTEM"]
exit
}

The "exit" statement ensures that the XML data
will only be read up to the point where the
DOCTYPE declaration is. Immediately after this,
parsing will be terminated. I described such an
approach in the XMLgawk doc:

http://home.vrweb.de/~juergen.kahrs/...ling-with-DTDs
Mar 4 '07 #6
