Validating XML/XHTML in email

Kenneth Porter
#1: Oct 23 '08
I'm thinking it might be a good idea to use the "quality" of an XML/XHTML
email's structure as a metric for spamminess. More errors are likely to
imply spam. Does there exist a lightweight validator that can quickly
produce a metric of how many errors exist in a message? Ideally this would
be something I could invoke from a Perl process, perhaps over a pipe to a
validation server (similar to the way ClamAV and SpamAssassin can be
invoked).

Peter Flynn
#2: Oct 23 '08

re: Validating XML/XHTML in email


Kenneth Porter wrote:
Quote:
I'm thinking it might be a good idea to use the "quality" of an XML/XHTML
email's structure as a metric for spamminess. More errors are likely to
imply spam. Does there exist a lightweight validator that can quickly
produce a metric of how many errors exist in a message? Ideally this would
be something I could invoke from a Perl process, perhaps over a pipe to a
validation server (similar to the way ClamAV and SpamAssassin can be
invoked).

onsgmls -wxml -s -E 5000 xml.dcl yourfile.xml 2>&1 | grep ':E:' | wc -l

onsgmls is in the OpenSP package.
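
For scripting, the one-liner above could be wrapped roughly as follows. This is only a sketch: the function names count_validation_errors and count_error_lines are made up for illustration, and it assumes onsgmls is on the PATH and that xml.dcl can be found from the working directory or the SGML catalog.

```sh
#!/bin/sh
# Sketch only: both function names are hypothetical helpers, not part of OpenSP.

# Count lines on stdin that carry the ':E:' (error) severity marker.
count_error_lines() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so mask the status to keep callers using "set -e" happy.
    grep -c ':E:' || true
}

# Validate one file and print the number of error messages.
count_validation_errors() {
    # -wxml   : warn about constructs that are not allowed in XML
    # -s      : suppress the normal parse output; messages are still printed
    # -E 5000 : raise the error limit (onsgmls stops after 200 by default)
    # Messages go to stderr, hence 2>&1 before filtering.
    onsgmls -wxml -s -E 5000 xml.dcl "$1" 2>&1 | count_error_lines
}
```

Something like `count_validation_errors yourfile.xml` would then print a single integer that a calling process can read over a pipe.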

///Peter
--
XML FAQ: http://xml.silmaril.ie/
Kenneth Porter
#3: Oct 24 '08

re: Validating XML/XHTML in email


Peter Flynn <peter.nosp@m.silmaril.ie> wrote in news:6mcaioFg2irhU1@mid.individual.net:
Quote:
onsgmls -wxml -s -E 5000 xml.dcl yourfile.xml 2>&1 | grep ':E:' | wc -l

onsgmls is in the OpenSP package.

That sounds good. Now to see what's involved in incorporating that into a
SpamAssassin plugin....
Kenneth Porter
#4: Oct 24 '08

re: Validating XML/XHTML in email


Peter Flynn <peter.nosp@m.silmaril.ie> wrote in news:6mcaioFg2irhU1@mid.individual.net:
Quote:
onsgmls -wxml -s -E 5000 xml.dcl yourfile.xml 2>&1 | grep ':E:' | wc -l

onsgmls is in the OpenSP package.

With that hint I found that "tidy -eq" gives a pretty good result. To
normalize the score, I figure it makes sense to divide the resulting line
count by the byte count of the input file.
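
A rough sketch of that normalization, in case it helps anyone else. The function names error_density and tidy_error_density are invented for illustration, and it assumes tidy is on the PATH. Note that "tidy -eq" reports warnings as well as errors, so the line count includes both.

```sh
#!/bin/sh
# Sketch only: error_density and tidy_error_density are hypothetical names.

# Divide a diagnostic-line count by an input size in bytes.
error_density() {
    # $1 = number of diagnostic lines, $2 = input size in bytes
    awk -v lines="$1" -v bytes="$2" 'BEGIN {
        if (bytes <= 0) { print "0.000000"; exit }  # avoid dividing by zero
        printf "%.6f\n", lines / bytes
    }'
}

# Run "tidy -eq" on a file and print diagnostic lines per input byte.
tidy_error_density() {
    # tidy writes its diagnostics to stderr, hence 2>&1 before counting.
    lines=$(tidy -eq "$1" 2>&1 | wc -l)
    bytes=$(wc -c < "$1")
    error_density "$lines" "$bytes"
}
```

Calling `tidy_error_density message.html` (message.html being a placeholder file name) prints one number per message; where to put the spam threshold would have to be worked out against real mail.
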
Peter Flynn
#5: Oct 28 '08

re: Validating XML/XHTML in email


Kenneth Porter wrote:
Quote:
Peter Flynn <peter.nosp@m.silmaril.ie> wrote in news:6mcaioFg2irhU1@mid.individual.net:
Quote:
> onsgmls -wxml -s -E 5000 xml.dcl yourfile.xml 2>&1 | grep ':E:' | wc -l
>
> onsgmls is in the OpenSP package.

With that hint I found that "tidy -eq" gives a pretty good result. To
normalize the score, I figure it makes sense to divide the resulting line
count by the byte count of the input file.

Ah. If it's only HTML you're handling, Tidy will be much easier to work
with. OpenSP requires well-formed XML at least, which would mean running
Tidy on the HTML first anyway.

///Peter