
hCard parsing

Hi group,

I am new to xgawk (and seemingly to XML as well), and I've been struggling
all afternoon to get xgawk¹ to parse an XHTML file containing an hCard²,
without luck. I wonder if you guys could give me a push...

Let's say I have the following XHTML file:

#v+

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>hCard example</title>
</head>
<body>
<h1>hCard</h1>
<div class="vcard">
<h2 class="fn n">
<span class="given-name">John</span>
<span class="additional-name">Brian</span>
<span class="family-name">Doe</span>
</h2>
<p class="adr">
<div class="street-address">123 Circle Drive</div>
<div class="locality">South Metropolis</div>
<span class="region">XYZ</span> <span class="postal-code">012345</span><br />
<abbr title="Aaland Islands"><span class="country">AX</span></abbr>
</p>
</div>
</body>
</html>

#v-

I would like to end up with something like:

#v+

John Brian Doe
123 Circle Drive
South Metropolis
XYZ 012345
AX

#v-

I have been playing with xgawk and the ECB forex reference rates³,
and I have no problem extracting the exchange rates and having xgawk
calculate cross rates, but I can't seem to get xgawk to parse the
simple hCard above. I have read the XMLgawk documentation and
studied the examples, and I have been googling this group. Still
no luck.

Thanks for any help or hints.

Cheers,
Klaus.

¹) http://home.vrweb.de/~juergen.kahrs/gawk/XML/
²) http://microformats.org/wiki/hcard
³) http://www.ecb.int/stats/eurofxref/eurofxref-daily.xml
--
Klaus Alexander Seistrup
Copenhagen, Denmark
http://seistrup.dk/klaus/
Oct 17 '06 #1
12 Replies


Klaus Alexander Seistrup wrote:
> I would like to end up with something like:
>
> #v+
>
> John Brian Doe
> 123 Circle Drive
> South Metropolis
> XYZ 012345
> AX
>
> #v-

Try this one:

XMLCHARDATA { data = $0 }
XMLENDELEM == "span" { name = name data " " }
XMLENDELEM == "h2" ||
XMLENDELEM == "br" ||
XMLENDELEM == "abbr" { print name; name = "" }
XMLENDELEM == "div" { print data }

xgawk -lxml -f hcard.awk hcard.xml

John Brian Doe
123 Circle Drive
South Metropolis
XYZ 012345
AX

> ³) http://www.ecb.int/stats/eurofxref/eurofxref-daily.xml

Thanks for the link.
Oct 17 '06 #2

Jürgen Kahrs wrote:
> Try this one:
>
> XMLCHARDATA { data = $0 }
> XMLENDELEM == "span" { name = name data " " }
> XMLENDELEM == "h2" ||
> XMLENDELEM == "br" ||
> XMLENDELEM == "abbr" { print name; name = "" }
> XMLENDELEM == "div" { print data }

Thanks. I guess I should have been more specific -- or general, as
it is.

What I want to achieve is a general hCard parser. While your code
above works for the specific hCard sample, it doesn't trigger on
hCard markup, and it will fail for most other hCards in the universe.

Rather than relying on certain tags, the parser should look for
things like 'class="given-name"', 'class="street-address"', and only
if found within a 'class="vcard"'.

Cheers,
Klaus.

--
Klaus Alexander Seistrup
SubZeroNet, Copenhagen, Denmark
http://magnetic-ink.dk/
Oct 17 '06 #3
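
The requirement described in the post above (match on class values such as "given-name", and only inside an element whose class contains "vcard") can be sketched as follows. This is a minimal illustration in Python's standard-library ElementTree rather than xgawk; the helper names `has_class` and `hcard_values` are made up for this example, and the sample is a trimmed version of the XHTML from the first post, without the DOCTYPE and namespace.

```python
import xml.etree.ElementTree as ET

# Trimmed sample (assumption: no DOCTYPE or xmlns, to keep tag names plain).
SAMPLE = """<body>
<div class="other"><span class="given-name">Ignored</span></div>
<div class="vcard">
<h2 class="fn n">
<span class="given-name">John</span>
<span class="family-name">Doe</span>
</h2>
<div class="street-address">123 Circle Drive</div>
</div>
</body>"""


def has_class(elem, name):
    """True if `name` is one of the space-separated tokens of the class attribute."""
    return name in elem.get("class", "").split()


def hcard_values(root, prop):
    """Collect the text of every element with class `prop` inside a vcard element."""
    values = []
    for card in root.iter():
        if not has_class(card, "vcard"):
            continue
        for elem in card.iter():
            if has_class(elem, prop):
                # itertext() also gathers text of nested tags;
                # split/join trims and collapses whitespace.
                values.append(" ".join("".join(elem.itertext()).split()))
    return values


root = ET.fromstring(SAMPLE)
print(hcard_values(root, "given-name"))      # ['John']
print(hcard_values(root, "street-address"))  # ['123 Circle Drive']
```

The tag names never appear in the matching logic, so the same code works whether the card is a div, a p, or a ul. Nested vcard elements would be reported more than once; the sketch does not try to handle that.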

Klaus Alexander Seistrup wrote:
> Rather than relying on certain tags, the parser should look for
> things like 'class="given-name"', 'class="street-address"', and only
> if found within a 'class="vcard"'.

That's easy. It should go like this:

XMLCHARDATA { data = $0 }
XMLSTARTELEM == "vcard" { delete addr }
"class" in XMLATTR {
    addr[XMLATTR["class"]] = data
}
XMLENDELEM == "vcard" {
    print addr["given-name"]
    print addr["street-address"]
}
Oct 17 '06 #4

Jürgen Kahrs wrote:
>> Rather than relying on certain tags, the parser should look
>> for things like 'class="given-name"', 'class="street-address"',
>> and only if found within a 'class="vcard"'.
>
> That's easy. It should go like this:
>
> XMLCHARDATA { data = $0 }
> XMLSTARTELEM == "vcard" { delete addr }
> "class" in XMLATTR {
> addr[XMLATTR["class"]] = data
> }
> XMLENDELEM == "vcard" {
> print addr["given-name"]
> print addr["street-address"]
> }

Either it's a bit more complicated than that, or my brain is just
not working...

The hCard is not found within a <vcard/>, rather within "something"
with a "vcard" attribute, in my example case it's

<div class="vcard">
:
</div>

but it could be a <p/>, a <ul/>, or any other tag. I guess I could
find the tag by looking for '"vcard" in XMLATTR', but how do I find
the corresponding XMLENDELEM (the </div>) so that I know when to stop?

Cheers,

--
Klaus Alexander Seistrup
Copenhagen, Denmark
http://streetkids.dk/
Oct 17 '06 #5

Klaus Alexander Seistrup wrote:
> The hCard is not found within a <vcard/>, rather within "something"
> with a "vcard" attribute, in my example case it's
>
> <div class="vcard">
> :
> </div>

Yes, you are right, I overlooked that "vcard" is
not a tag but a value of the class attribute, which
can appear on any tag.

> but it could be a <p/>, a <ul/>, or any other tag. I guess I could
> find the tag by looking for '"vcard" in XMLATTR', but how do I find
> the corresponding XMLENDELEM (the </div>) so that I know when to stop?

You are on the right track. It should go like this:

XMLCHARDATA { data = $0 }
"vcard" in XMLATTR { delete addr; vcard_in = XMLSTARTELEM }
"class" in XMLATTR {
addr[XMLATTR["class"]] = data
}
XMLENDELEM == vcard_in {
print addr["given-name"]
print addr["street-address"]
}
Oct 17 '06 #6

Jürgen Kahrs wrote:
> It should go like this:
>
> XMLCHARDATA { data = $0 }
> "vcard" in XMLATTR { delete addr; vcard_in = XMLSTARTELEM }
> "class" in XMLATTR {
> addr[XMLATTR["class"]] = data
> }
> XMLENDELEM == vcard_in {
> print addr["given-name"]
> print addr["street-address"]
> }

We're getting closer, but still no cigar.

In the specific case "XMLENDELEM == vcard_in" will match all </div>s,
which is not what I want. I have better luck with:

vcard_in = XMLPATH
and
XMLENDELEM && XMLPATH == vcard_in

However, while xgawk will find all relevant attributes, it seems that
the data variable will always either be empty or contain a linefeed.
I don't quite understand why, but I'm trying to find out...

Cheers,
Klaus.

--
Klaus Alexander Seistrup
Copenhagen, Denmark
http://surdej.dk/
Oct 18 '06 #7
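
The problem raised in the post above (a tag-name comparison matches every </div>, while a position-based comparison matches only the one that closes the card) can be illustrated with a depth counter, which plays the same role as comparing XMLPATH. This is a hedged sketch in Python's ElementTree `iterparse`, not xgawk; the function name `classes_inside_vcard` and the sample markup are assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: nested <div> elements inside and after the card.
SAMPLE = """<body>
<div class="vcard">
<div class="street-address">123 Circle Drive</div>
<div class="locality">South Metropolis</div>
</div>
<div class="footer">not part of the card</div>
</body>"""


def classes_inside_vcard(xml_text):
    """Return the class tokens seen between a vcard start tag and its matching end tag."""
    depth = 0
    card_depth = None   # depth of the open vcard element, None when outside a card
    seen = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            classes = elem.get("class", "").split()
            if card_depth is None:
                if "vcard" in classes:
                    card_depth = depth
            else:
                seen.extend(classes)
        else:
            # Only the end tag at the recorded depth closes the card;
            # the nested </div> tags are deeper and do not match.
            if card_depth is not None and depth == card_depth:
                card_depth = None
            depth -= 1
    return seen


print(classes_inside_vcard(SAMPLE))  # ['street-address', 'locality']
```

The "footer" div has the same tag name as the card but is opened after the card has been closed, so it is not collected.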

Klaus Alexander Seistrup wrote:
> In the specific case "XMLENDELEM == vcard_in" will match all </div>s,
> which is not what I want. I have better luck with:
>
> vcard_in = XMLPATH
> and
> XMLENDELEM && XMLPATH == vcard_in

This should be equivalent to

XMLENDELEM == vcard_in && XMLDEPTH == 1

> However, while xgawk will find all relevant attributes, it seems that
> the data variable will always either be empty or contain a linefeed.
> I don't quite understand why, but I'm trying to find out...

The data variable is filled with the content of the very
last character data that occurred immediately before
the </div> (usually a newline). If you want anything else
to be assigned to the data variable, change the condition
in front of the assignment of the data variable.
Oct 18 '06 #8
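
The effect described in the post above can be reproduced with any event-based parser. The following sketch uses Python's `xml.sax` rather than xgawk: one variable is overwritten by every character-data event (like `data = $0`), the other only by events that contain something other than whitespace. The class name `DataDemo` and the sample are assumptions for illustration.

```python
import xml.sax

# Hypothetical sample: the text just before the closing </div> is a newline.
SAMPLE = """<div class="vcard">
<div class="street-address">123 Circle Drive</div>
<div class="locality">South Metropolis</div>
</div>"""


class DataDemo(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.data = ""      # overwritten by every text event
        self.text = ""      # overwritten only by non-blank text events
        self.at_card_end = None

    def startElement(self, name, attrs):
        self.depth += 1

    def characters(self, content):
        self.data = content
        if content.strip():
            self.text = content

    def endElement(self, name):
        if self.depth == 1:  # end tag of the outermost element
            self.at_card_end = (self.data, self.text)
        self.depth -= 1


handler = DataDemo()
xml.sax.parseString(SAMPLE.encode("utf-8"), handler)
naive, filtered = handler.at_card_end
print(repr(naive))     # whitespace only
print(repr(filtered))  # last non-blank text seen before the end tag
```

Filtering out blank text is only one possible condition; collecting text per property, as in the longer script later in this thread, is another.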

Jürgen Kahrs wrote:
> Klaus Alexander Seistrup wrote:
>> ...
>> However, while xgawk will find all relevant attributes, it seems that
>> the data variable will always either be empty or contain a linefeed.
>> I don't quite understand why, but I'm trying to find out...
>
> The data variable is filled with the content of the very
> last character data that occurred immediately before
> the </div> (usually a newline). If you want anything else
> to be assigned to the data variable, change the condition
> in front of the assignment of the data variable.

Parsing hCard is a really challenging task. The XSLT transformation program
from Brian Suda exceeds 1900 lines of text. This is a touchstone for any
XML processor.

Recognizing the elements with hCard content is the easy part. Collecting
values is much more difficult. The following code shows how to collect text
data for hCard properties/subproperties that can be arbitrarily nested. It
is a quick and dirty hack. The hCard structure is lost and converted to a
flat list of property name/value pairs. Only text data is collected.
Properties whose value is held in an attribute (like 'href') are not
handled properly.

Maybe someday I could write a usable hCard parser, just to show the
capabilities of xmlgawk.
-- hcard.awk ---------------------------------------------------------
# Extract content of hCard

# Partial work. Just a quick and dirty hack!
# Author: Manuel Collado

# Global variables:
# hcroot (string) : path of the hCard root element
# hcard (array prop.name->value) : the extracted hCard (flat structure)
# hclevel (number) : nesting level of hCard prop/subprop
# hcpath (array hclevel->path) : stack of nested prop/subprop
# hckey (array hclevel->prop.name) : stack of nested prop/subprop
# hcvalue (array hclevel->value) : stack of nested prop/subprop

@load xml

BEGIN {
    XMLCHARSET = "UTF-8"
    hcsep = "|"
    # hCard property/subproperty names, matched as whole words
    hckeys = "\\<(fn|n|family-name|given-name|additional-name|" \
             "honorific-prefix|honorific-suffix|nickname|sort-string|" \
             "url|email|type|value|tel|type|value|adr|post-office-box|" \
             "extended-address|street-address|locality|region|postal-code|" \
             "country-name|type|value|label|geo|latitude|longitude|tz|" \
             "photo|logo|sound|bday|title|role|org|organization-name|" \
             "organization-unit|category|note|class|key|mailer|uid|rev)\\>"
}

# hCard start
XMLSTARTELEM && (XMLATTR["class"] ~ "\\<vcard\\>") {
    hcroot = XMLPATH
    delete hcard
    hclevel = 0
    delete hcpath
    delete hckey
    delete hcvalue
}

# skip content outside the hCard
!hcroot {
    next
}

# data element (property) start
XMLSTARTELEM && hcroot {
    # push each keyword (if any) on the stack
    split( XMLATTR["class"], keylist, " " )
    for (k in keylist) {
        if (keylist[k] ~ hckeys) {
            hclevel++
            hckey[hclevel] = keylist[k]
            hcpath[hclevel] = XMLPATH
            hcvalue[hclevel] = ""
        }
    }
}

# character data
XMLCHARDATA && hclevel {
    # concatenate text fragments inside the same property
    hcvalue[hclevel] = hcvalue[hclevel] $0
}

# data element (property) end
XMLENDELEM && XMLPATH == hcpath[hclevel] {
    # pop the value from the stack and accumulate on parent data and hcard
    while (XMLPATH == hcpath[hclevel]) {
        value = hcvalue[hclevel]
        key = hckey[hclevel]
        if (key in hcard) {
            hcard[key] = hcard[key] hcsep xs_trim(value)
        } else {
            hcard[key] = xs_trim(value)
        }
        delete hcvalue[hclevel]
        hclevel--
        if (hclevel) {
            hcvalue[hclevel] = hcvalue[hclevel] value
        }
    }
}

# hCard end
XMLENDELEM && XMLPATH == hcroot {
    hcroot = ""
    # dump the collected data
    print "------------------------------------------"
    for (key in hcard) {
        print key ": " hcard[key]
    }
    print "------------------------------------------"
}

END {
XmlCheckError()
}

# XMLgawk error reporting needs some redesign.
# Interim code: uses both ERRNO and XMLERROR to generate consistent messages
function XmlCheckError() {
    if (XMLERROR) {
        printf("\n%s:%d:%d:(%d) %s\n", FILENAME, XMLROW, XMLCOL, XMLLEN,
               XMLERROR)
    } else if (ERRNO) {
        printf("\n%s\n", ERRNO)
        ERRNO = ""
    }
}

#------------------------------------------------------------------
# xs_trim: remove leading and trailing [[:space:]] characters, and
# collapse repeated spaces into a single one
#------------------------------------------------------------------
function xs_trim( string ) {
    sub(/^[[:space:]]+/, "", string)
    if (string) sub( /[[:space:]]+$/, "", string )
    if (string) gsub( /[[:space:]]+/, " ", string )
    return string
}

--------------------------------------------------------------------------------
Regards.
--
Manuel Collado
Oct 19 '06 #9
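
For readers without xgawk at hand, the stack-based collection in the script above can be approximated with Python's `xml.sax`. This is a sketch under assumptions, not a port: depth numbers stand in for XMLPATH, only a handful of property names are listed in `HCKEYS`, and the names `HCardHandler` and `trim` are invented for the example. As in the original, the result is a flat mapping and only text content is collected.

```python
import xml.sax

# Assumption: a small subset of the property names listed in the script above.
HCKEYS = {"fn", "n", "given-name", "family-name", "adr",
          "street-address", "locality", "region", "postal-code"}

SAMPLE = """<body>
<div class="vcard">
<h2 class="fn n">
<span class="given-name">John</span>
<span class="family-name">Doe</span>
</h2>
<p class="adr">
<span class="street-address">123 Circle Drive</span>
<span class="locality">South Metropolis</span>
</p>
</div>
</body>"""


def trim(s):
    """Strip and collapse whitespace, in the spirit of xs_trim."""
    return " ".join(s.split())


class HCardHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.root_depth = None  # depth of the open vcard element
        self.stack = []         # entries: [depth, property name, collected text]
        self.hcard = {}         # flat result: property name -> value(s)

    def startElement(self, name, attrs):
        self.depth += 1
        classes = attrs.get("class", "").split()
        if self.root_depth is None:
            if "vcard" not in classes:
                return          # skip content outside the hCard
            self.root_depth = self.depth
        # push one stack entry per recognised class token
        for key in classes:
            if key in HCKEYS:
                self.stack.append([self.depth, key, ""])

    def characters(self, content):
        if self.stack:
            self.stack[-1][2] += content

    def endElement(self, name):
        # pop every property opened by the element that is closing
        while self.stack and self.stack[-1][0] == self.depth:
            _, key, value = self.stack.pop()
            if key in self.hcard:
                self.hcard[key] += "|" + trim(value)
            else:
                self.hcard[key] = trim(value)
            if self.stack:
                self.stack[-1][2] += value  # text also counts for the parent
        if self.depth == self.root_depth:
            self.root_depth = None
        self.depth -= 1


handler = HCardHandler()
xml.sax.parseString(SAMPLE.encode("utf-8"), handler)
for key in sorted(handler.hcard):
    print(key + ": " + handler.hcard[key])
```

With the sample above, "fn" and "n" both end up as "John Doe" because the text of nested properties is passed up to the enclosing one, which is the same flattening the awk script performs. The sketch keeps only the last card's data and does not print a separator per card.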

Hello Manuel,

> Parsing hCard is a really challenging task. The XSLT transformation
> program from Brian Suda exceeds 1900 lines of text. This is a touchstone
> for any XML processor.

I wasn't aware that the hCard format has such widespread support.
I just looked at the example from Klaus and tried to
help him analyse his particular example. Do you have
a link to some kind of specification for the hCard format?

Jürgen Kahrs wrote:
> Do you have any link to some kind of specification for the
> hCard format ?

There was a link to microformats in my original posting:
http://microformats.org/

Cheers,

--
Klaus Alexander Seistrup
SubZeroNet, Copenhagen, Denmark
http://magnetic-ink.dk/
Oct 19 '06 #11

Jürgen Kahrs wrote:
> Do you have
> any link to some kind of specification for the hCard format ?

<http://microformats.org/wiki/hcard>
--
Johannes Koch
Spem in alium nunquam habui praeter in te, Deus Israel.
(Thomas Tallis, 40-part motet)
Oct 19 '06 #12

Manuel Collado wrote:
> The following code shows how to collect text data for hCard
> properties/subproperties that can be arbitrarily nested. It
> is a quick and dirty hack.

Thanks a lot, this is a great help to me in understanding
XML processing with xgawk, I certainly appreciate it!

Cheers,

--
Klaus Alexander Seistrup
SubZeroNet, Copenhagen, Denmark
http://magnetic-ink.dk/
Oct 19 '06 #13
