Hacks for parsing non well-formed XML ?

Given this badly-formed fragment, any suggestions on how best to parse
it?

[...]
<dc:title><Browse By Subject></dc:title>
[...]

The minimal problem is "unexpected < character at the beginning of
character data"

I don't know how it arises. I suspect that it's a character string
with "<" in that isn't being encoded properly. Although it might be
some crazy tag-name getting squirted into the wrong end of the XML
generator. Anyway, it's the badly-formed output of a major bluechip
dot-com and it's likely to stay that way. Our problem is how to chow
down on it, despite its bad formation. 8-(

It's not too important to preserve the content here. The good stuff is
elsewhere in the document, this is just grit in the way.

So, any suggestions on how best to abuse XML standards or tools and
get it parsed with minimum work?

I've wondered about hacking to recognise tag closure as being
triggered by any whitespace, or by discarding starttags that aren't
from a small known list. I don't much like either though. Most robust
so far seems to be a parser where "<dc:title>" becomes part of the
syntax itself and has special handling. Any better ideas?

Mar 16 '07 #1
In article <11**********************@l75g2000hse.googlegroups.com>,
Andy Dingley <di*****@codesmiths.com> wrote:
>Given this badly-formed fragment, any suggestions on how best to parse
>it?
>
><dc:title><Browse By Subject></dc:title>
[...]
>I've wondered about hacking to recognise tag closure as being
>triggered by any whitespace, or by discarding starttags that aren't
>from a small known list.

You could make a pass through to determine probably-legal element
names, by looking for end tags. "</Browse" is much less likely to
occur than "<Browse". Then escape less-thans that don't precede an
element name for which you found a plausible end tag. Empty tags
are less clear cut, but you could probably find a 99% solution.

-- Richard
--
"Consideration shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.
Mar 16 '07 #2
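
The two-pass idea in the reply above can be sketched roughly as follows. This is only an illustration, not anything posted in the thread: the function name `escape_stray_lt`, the simplified `NAME` pattern, and the sample document are all hypothetical, and binding the `dc` prefix to the Dublin Core namespace is an assumption made so that the repaired sample parses. As the reply notes, empty-element tags are not handled: a name that never appears in an end tag would have its "<" escaped too.

```python
import re
import xml.etree.ElementTree as ET

# Simplified approximation of an XML name with an optional prefix.
NAME = r"[A-Za-z_][\w.\-]*(?::[A-Za-z_][\w.\-]*)?"


def escape_stray_lt(text):
    """Escape each "<" that does not start a tag whose name also has an end tag."""
    # Pass 1: names seen in end tags such as </dc:title> are probably real elements.
    known = set(re.findall(r"</(" + NAME + r")\s*>", text))

    # Pass 2: keep "<" before a known name (start or end tag) and before "?" or "!"
    # (declarations, comments, CDATA); escape every other "<".
    keep = [r"[?!]"]
    if known:
        names = "|".join(re.escape(n) for n in sorted(known))
        keep.append(r"/?(?:" + names + r")[\s/>]")
    stray = re.compile(r"<(?!" + "|".join(keep) + r")")
    return stray.sub("&lt;", text)


# Hypothetical sample modelled on the fragment in the question.
# Assumption: "dc" is bound to the Dublin Core namespace.
sample = (
    '<rec xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title><Browse By Subject></dc:title><id>7</id></rec>"
)
fixed = escape_stray_lt(sample)
root = ET.fromstring(fixed)  # the repaired text parses as ordinary XML
title = root.find("{http://purl.org/dc/elements/1.1/}title").text
```

The heuristic stays fooled by stray text that happens to begin with a real element name, so it is a "99% solution" at best.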
Andy Dingley wrote:
>Given this badly-formed fragment, any suggestions on how best to parse
>it?

Best suggestions I've got are:

1) XML tools won't touch this. Write a text-processing layer which finds
and fixes these abuses before even thinking about it as XML. It's going
to be messy, fragile, ad-hoc programming.

2) Fix the code that generates it. Seriously. This is going to be an
ongoing hassle, and cost, until you do.
--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Mar 16 '07 #3
On 16 Mar, 12:49, Joe Kesselman <keshlam-nos...@comcast.net> wrote:
>2) Fix the code that generates it. Seriously. This is going to be an
>ongoing hassle, and cost, until you do.

It's! a! big! famous! dotcom! not! my! own! code!
(Can you guess who it is yet?)

Do You Snafu! 8-)

Mar 16 '07 #4
in message <11**********************@l75g2000hse.googlegroups.com>, Andy
Dingley ('d******@codesmiths.com') wrote:
>Given this badly-formed fragment, any suggestions on how best to parse
>it?
>
>[...]
><dc:title><Browse By Subject></dc:title>
>[...]
>
>The minimal problem is "unexpected < character at the beginning of
>character data"

sed 's/<Browse By Subject>//'

There's no particular reason why you shouldn't use old and proven text
manipulation tools on XML.

--
si***@jasmine.org.uk (Simon Brooke) http://www.jasmine.org.uk/~simon/

A message from our sponsor: This site is now in free fall

Mar 16 '07 #5
>It's! a! big! famous! dotcom! not! my! own! code!

Talk! To! Them! About! It!.

Though you may find that this is a deliberate poison-pill to prevent
unauthorized folks mining their servers... in which case you should
probably be talking to them about getting more official access, since
they're probably changing the poison on a regular basis and anything you
attempt to do to bypass it is likely to break again in a few weeks.

--
Joe Kesselman / Beware the fury of a patient man. -- John Dryden
Mar 16 '07 #6
Andy Dingley wrote:
>On 16 Mar, 12:49, Joe Kesselman <keshlam-nos...@comcast.net> wrote:
>>2) Fix the code that generates it. Seriously. This is going to be an
>>ongoing hassle, and cost, until you do.
>
>It's! a! big! famous! dotcom! not! my! own! code!
>(Can you guess who it is yet?)
>
>Do You Snafu! 8-)

Nevertheless, charge them extra and mark it on the invoice as overhead
for manual handling of non-XML material. If they're that big, they'll
pay, and if they're that stupid, they'll continue to pay you rather than
fix the bug.

///Peter

Mar 16 '07 #7
On 16 Mar, 17:35, Joseph Kesselman <keshlam-nos...@comcast.net> wrote:
>Though you may find that this is a deliberate poison-pill to prevent
>unauthorized folks mining their servers...

Oh, I _wish_ they were that smart.

Just to clarify, it's a public interface to their services that they
encourage(sic) the use of. The likelihood of them fixing it is on the
avian-pig scale. It's also not a static string, so any sed-ing would
need a slightly more sophisticated regex to work on it, although it's
entirely viable. Sadly it's also an embedded app, so Unix tools just
aren't present. A similar pre-processor approach seems best though,
rather than frobbing a parser.

Thanks for all your suggestions.

Mar 19 '07 #8
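
A pre-processor of the kind settled on above, which copes with title text that is not a static string, might look something like this sketch. The thread does not say what language the embedded app uses, so Python is only a stand-in here. The name `escape_title_text` and the sample string are hypothetical, and the code assumes `<dc:title>` carries no attributes and never has legitimate child elements.

```python
import re

# Assumption: <dc:title> carries no attributes and never has legitimate child elements.
TITLE = re.compile(r"(<dc:title>)(.*?)(</dc:title>)", re.DOTALL)


def escape_title_text(text):
    """Escape every "<" found between <dc:title> and the next </dc:title>."""
    def fix(match):
        start, body, end = match.groups()
        return start + body.replace("<", "&lt;") + end
    return TITLE.sub(fix, text)


# Hypothetical input: one broken title and one well-formed title.
broken = "<dc:title><Browse By Subject></dc:title><dc:title>Plain</dc:title>"
repaired = escape_title_text(broken)
```

Because the title content is said not to matter, the `fix` helper could equally return `start + end` and drop the text altogether, which is what the sed one-liner does for the fixed string.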
