hCard parsing

Hi group,

I am new to xgawk (and seemingly to XML as well), and I've been struggling
all afternoon to get xgawk¹ to parse an XHTML file containing an hCard²,
without luck. I wonder if you could give me a push...

Let's say I have the following XHTML file:

#v+

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>hCard example</title>
</head>
<body>
<h1>hCard</h1>
<div class="vcard">
<h2 class="fn n">
<span class="given-name">John</span>
<span class="additional-name">Brian</span>
<span class="family-name">Doe</span>
</h2>
<p class="adr">
<div class="street-address">123 Circle Drive</div>
<div class="locality">South Metropolis</div>
<span class="region">XYZ</span> <span class="postal-code">012345</span><br />
<abbr title="Aaland Islands"><span class="country">AX</span></abbr>
</p>
</div>
</body>
</html>

#v-

I would like to end up with something like:

#v+

John Brian Doe
123 Circle Drive
South Metropolis
XYZ 012345
AX

#v-

I have been playing with xgawk and the ECB forex reference rates³,
and I have no problem extracting the exchange rates and having xgawk
calculate cross rates, but I can't seem to get xgawk to parse the
simple hCard above. I have read the XMLgawk documentation,
studied the examples, and searched this group. Still
no luck.

Thanks for any help or hints.

Cheers,
Klaus.

¹) http://home.vrweb.de/~juergen.kahrs/gawk/XML/
²) http://microformats.org/wiki/hcard
³) http://www.ecb.int/stats/eurofxref/eurofxref-daily.xml
--
Klaus Alexander Seistrup
Copenhagen, Denmark
http://seistrup.dk/klaus/
Oct 17 '06 #1
Klaus Alexander Seistrup wrote:
> I would like to end up with something like:
>
> #v+
>
> John Brian Doe
> 123 Circle Drive
> South Metropolis
> XYZ 012345
> AX
>
> #v-

Try this one:

XMLCHARDATA { data = $0 }
XMLENDELEM == "span" { name = name data " " }
XMLENDELEM == "h2" ||
XMLENDELEM == "br" ||
XMLENDELEM == "abbr" { print name; name = "" }
XMLENDELEM == "div" { print data }

xgawk -lxml -f hcard.awk hcard.xml

John Brian Doe
123 Circle Drive
South Metropolis
XYZ 012345
AX

> ³) http://www.ecb.int/stats/eurofxref/eurofxref-daily.xml

Thanks for the link.
Oct 17 '06 #2
Jürgen Kahrs wrote:
> Try this one:
>
> XMLCHARDATA { data = $0 }
> XMLENDELEM == "span" { name = name data " " }
> XMLENDELEM == "h2" ||
> XMLENDELEM == "br" ||
> XMLENDELEM == "abbr" { print name; name = "" }
> XMLENDELEM == "div" { print data }

Thanks. I guess I should have been more specific -- or general, as
it is.

What I want to achieve is a general hCard parser. While your code
above works for the specific hCard sample, it doesn't trigger on
hCard markup, and it will fail for most other hCards in the universe.

Rather than relying on certain tags, the parser should look for
things like 'class="given-name"', 'class="street-address"', and only
if found within a 'class="vcard"'.

Cheers,
Klaus.

--
Klaus Alexander Seistrup
SubZeroNet, Copenhagen, Denmark
http://magnetic-ink.dk/
Oct 17 '06 #3
Klaus Alexander Seistrup wrote:
> Rather than relying on certain tags, the parser should look for
> things like 'class="given-name"', 'class="street-address"', and only
> if found within a 'class="vcard"'.

That's easy. It should go like this:

XMLCHARDATA { data = $0 }
XMLSTARTELEM == "vcard" { delete addr }
"class" in XMLATTR {
    addr[XMLATTR["class"]] = data
}
XMLENDELEM == "vcard" {
    print addr["given-name"]
    print addr["street-address"]
}
Oct 17 '06 #4
Jürgen Kahrs wrote:
>> Rather than relying on certain tags, the parser should look
>> for things like 'class="given-name"', 'class="street-address"',
>> and only if found within a 'class="vcard"'.
>
> That's easy. It should go like this:
>
> XMLCHARDATA { data = $0 }
> XMLSTARTELEM == "vcard" { delete addr }
> "class" in XMLATTR {
> addr[XMLATTR["class"]] = data
> }
> XMLENDELEM == "vcard" {
> print addr["given-name"]
> print addr["street-address"]
> }

Either it's a bit more complicated than that, or my brain is just
not working...

The hCard is not found within a <vcard/>, rather within "something"
with a "vcard" attribute, in my example case it's

<div class="vcard">
:
</div>

but it could be a <p/>, a <ul/> or any other tag. I guess I could
find the tag by looking for '"vcard" in XMLATTR', but how do I find
the corresponding XMLENDELEM (the </div>) so that I know when to stop?

Cheers,

--
Klaus Alexander Seistrup
Copenhagen, Denmark
http://streetkids.dk/
Oct 17 '06 #5
Klaus Alexander Seistrup wrote:
> The hCard is not found within a <vcard/>, rather within "something"
> with a "vcard" attribute, in my example case it's
>
> <div class="vcard">
> :
> </div>

Yes, you are right, I overlooked that "vcard" is
not a tag but an attribute of any tag.

> but it could be a <p/>, a <ul/> or any other tag. I guess I could
> find the tag by looking for '"vcard" in XMLATTR', but how do I find
> the corresponding XMLENDELEM (the </div>) so that I know when to stop?

You are on the right track. It should go like this:

XMLCHARDATA { data = $0 }
"vcard" in XMLATTR { delete addr; vcard_in = XMLSTARTELEM }
"class" in XMLATTR {
addr[XMLATTR["class"]] = data
}
XMLENDELEM == vcard_in {
print addr["given-name"]
print addr["street-address"]
}
Oct 17 '06 #6
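A side note on the test used in the script above: in awk, "vcard" in XMLATTR asks whether XMLATTR has an index named vcard, that is, an attribute called vcard, so it does not fire on class="vcard". What follows is a minimal sketch, not the poster's method, that tests the attribute value instead and finds the matching end tag with a nesting counter. The names has_class, depth and vcard_depth are made up for illustration; XMLSTARTELEM, XMLENDELEM and XMLATTR are assumed to behave as they do elsewhere in this thread.

```awk
# Run as in post #2: xgawk -lxml -f vcard-span.awk hcard.xml

# True if token is one of the space-separated names in a class attribute value.
function has_class(value, token,    n, i, part) {
    n = split(value, part, " ")
    for (i = 1; i <= n; i++)
        if (part[i] == token)
            return 1
    return 0
}

XMLSTARTELEM {
    depth++                              # our own nesting counter
    if (!vcard_depth && ("class" in XMLATTR) && has_class(XMLATTR["class"], "vcard")) {
        vcard_depth = depth              # remember where the hCard root opened
        print "hCard starts at <" XMLSTARTELEM ">"
    }
}

XMLENDELEM {
    if (vcard_depth && depth == vcard_depth) {
        print "hCard ends at </" XMLENDELEM ">"
        vcard_depth = 0                  # ready for another hCard in the same file
    }
    depth--
}
```

Because has_class compares whole tokens, a multi-valued class such as "fn n" is handled, and "given-name" is not mistaken for "n".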
Jürgen Kahrs wrote:
> It should go like this:
>
> XMLCHARDATA { data = $0 }
> "vcard" in XMLATTR { delete addr; vcard_in = XMLSTARTELEM }
> "class" in XMLATTR {
> addr[XMLATTR["class"]] = data
> }
> XMLENDELEM == vcard_in {
> print addr["given-name"]
> print addr["street-address"]
> }

We're getting closer, but still no cigar.

In the specific case "XMLENDELEM == vcard_in" will match all </div>s,
which is not what I want. I have better luck with:

vcard_in = XMLPATH
and
XMLENDELEM && XMLPATH == vcard_in

However, while xgawk will find all relevant attributes, it seems that
the data variable will always either be empty or contain a linefeed.
I don't quite understand why, but I'm trying to find out...

Cheers,
Klaus.

--
Klaus Alexander Seistrup
Copenhagen, Denmark
http://surdej.dk/
Oct 18 '06 #7
Klaus Alexander Seistrup wrote:
> In the specific case "XMLENDELEM == vcard_in" will match all </div>s,
> which is not what I want. I have better luck with:
>
> vcard_in = XMLPATH
> and
> XMLENDELEM && XMLPATH == vcard_in

This should be equivalent to

XMLENDELEM == vcard_in && XMLDEPTH == 1

> However, while xgawk will find all relevant attributes, it seems that
> the data variable will always either be empty or contain a linefeed.
> I don't quite understand why, but I'm trying to find out...

The data variable is filled with the content of the very
last character data that occurred immediately before
the </div> (usually a newline). If you want anything else
to be assigned to the data variable, change the condition
in front of the assignment of the data variable.
Oct 18 '06 #8
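One way to read this advice: in the earlier script the assignment addr[XMLATTR["class"]] = data runs on the start tag, before the element's own text has been reported, so it stores whatever text preceded the tag. A hedged sketch of an alternative, under the assumption that $0 holds the text on XMLCHARDATA events as in the scripts above: remember the class when an element opens and store the text when it closes. The helper names on_start, on_text and on_end and the arrays cls and text are illustrative only.

```awk
# Run as in post #2: xgawk -lxml -f hcard-text.awk hcard.xml

# Element opened: push its class value (possibly "") and start with empty text.
function on_start(c) {
    depth++
    cls[depth]  = c
    text[depth] = ""
}

# Text seen: append to the innermost open element, ignoring whitespace-only runs.
function on_text(s) {
    if (s ~ /[^[:space:]]/)
        text[depth] = text[depth] s
}

# Element closed: its text is complete now, so store it under its class value.
function on_end() {
    if (cls[depth] != "")
        addr[cls[depth]] = text[depth]
    delete cls[depth]
    delete text[depth]
    depth--
}

XMLSTARTELEM { on_start(("class" in XMLATTR) ? XMLATTR["class"] : "") }
XMLCHARDATA  { on_text($0) }
XMLENDELEM   { on_end() }

END {
    print addr["given-name"]
    print addr["street-address"]
}
```

Whitespace-only runs, such as the newline before </div>, are ignored. An element with several class names, such as "fn n", is stored under the whole string here, and text is not passed up to enclosing elements.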
Jürgen Kahrs wrote:
> Klaus Alexander Seistrup wrote:
>> ...
>> However, while xgawk will find all relevant attributes, it seems that
>> the data variable will always either be empty or contain a linefeed.
>> I don't quite understand why, but I'm trying to find out...
>
> The data variable is filled with the content of the very
> last character data that occurred immediately before
> the </div> (usually a newline). If you want anything else
> to be assigned to the data variable, change the condition
> in front of the assignment of the data variable.

Parsing hCard is a really challenging task. The XSLT transformation program
from Brian Suda exceeds 1900 lines of text. This is a touchstone for any
XML processor.

Recognizing the elements with hCard content is the easy part. Collecting
values is much more difficult. The following code shows how to collect text
data for hCard properties/subproperties that can be arbitrarily nested. It
is a quick and dirty hack. The hCard structure is lost, and converted to a
flat list of property name/values. Only text data is collected. Properties
with value in attributes (like 'href') are not handled properly.

Maybe someday I could write a useable hCard parser, just to show the
capabilities of xmlgawk.
-- hcard.awk ---------------------------------------------------------
# Extract content of hCard

# Partial work. Just a quick and dirty hack!
# Author: Manuel Collado

# Global variables:
# hcroot (string) : path of the hCard root element
# hcard (array prop.name->value) : the extracted hCard (flat structure)
# hclevel (number) : nesting level of hCard prop/subprop
# hcpath (array hclevel->path) : stack of nested prop/subprop
# hckey (array hclevel->prop.name) : stack of nested prop/subprop
# hcvalue (array hclevel->value) : stack of nested prop/subprop

@load xml

BEGIN {
    XMLCHARSET = "UTF-8"
    hcsep = "|"
    hckeys = "\\<(fn|n|family-name|given-name|additional-name|honorific-prefix|honorific-suffix|nickname|sort-string|url|email|type|value|tel|adr|post-office-box|extended-address|street-address|locality|region|postal-code|country-name|label|geo|latitude|longitude|tz|photo|logo|sound|bday|title|role|org|organization-name|organization-unit|category|note|class|key|mailer|uid|rev)\\>"
}

# hCard start
XMLSTARTELEM && (XMLATTR["class"] ~ "\\<vcard\\>") {
    hcroot = XMLPATH
    delete hcard
    hclevel = 0
    delete hcpath
    delete hckey
    delete hcvalue
}

# skip content outside the hCard
!hcroot {
    next
}

# data element (property) start
XMLSTARTELEM && hcroot {
    # push each keyword (if any) on the stack
    split( XMLATTR["class"], keylist, " " )
    for (k in keylist) {
        if (keylist[k] ~ hckeys) {
            hclevel++
            hckey[hclevel] = keylist[k]
            hcpath[hclevel] = XMLPATH
            hcvalue[hclevel] = ""
        }
    }
}

# character data
XMLCHARDATA && hclevel {
    # concatenate text fragments inside the same property
    hcvalue[hclevel] = hcvalue[hclevel] $0
}

# data element (property) end
XMLENDELEM && XMLPATH == hcpath[hclevel] {
    # pop the value from the stack and accumulate on parent data and hcard
    while (XMLPATH == hcpath[hclevel]) {
        value = hcvalue[hclevel]
        key = hckey[hclevel]
        if (key in hcard) {
            hcard[key] = hcard[key] hcsep xs_trim(value)
        } else {
            hcard[key] = xs_trim(value)
        }
        delete hcvalue[hclevel]
        hclevel--
        if (hclevel) {
            hcvalue[hclevel] = hcvalue[hclevel] value
        }
    }
}

# hCard end
XMLENDELEM && XMLPATH == hcroot {
    hcroot = ""
    # dump the collected data
    print "------------------------------------------"
    for (key in hcard) {
        print key ": " hcard[key]
    }
    print "------------------------------------------"
}

END {
    XmlCheckError()
}

# XMLgawk error reporting needs some redesign.
# Interim code: uses both ERRNO and XMLERROR to generate consistent messages
function XmlCheckError() {
    if (XMLERROR) {
        printf("\n%s:%d:%d:(%d) %s\n", FILENAME, XMLROW, XMLCOL, XMLLEN,
               XMLERROR)
    } else if (ERRNO) {
        printf("\n%s\n", ERRNO)
        ERRNO = ""
    }
}

#------------------------------------------------------------------
# xs_trim: remove leading and trailing [[:space:]] characters, and
# collapse repeated spaces into a single one
#------------------------------------------------------------------
function xs_trim( string ) {
    sub(/^[[:space:]]+/, "", string)
    if (string) sub( /[[:space:]]+$/, "", string )
    if (string) gsub( /[[:space:]]+/, " ", string )
    return string
}

--------------------------------------------------------------------------------
Regards.
--
Manuel Collado
Oct 19 '06 #9
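Two loose ends in the script above can be sketched as small helper functions. First, since each class token is already split out, an exact comparison avoids partial hits: a pattern like \<title\> also matches a token such as "entry-title", because the hyphen counts as a word boundary. Second, the stated limitation about values held in attributes could be approached by choosing between an attribute and the collected text. Both functions are hypothetical additions, not part of the original script. The rule that url and email take their value from href and that abbr uses title is an assumption based on the hCard wiki page linked in this thread, and the key list is shortened for brevity.

```awk
# Exact test: is token one of the hCard property names? (list shortened here)
function is_hcard_key(token) {
    return token ~ /^(fn|n|given-name|family-name|adr|street-address|locality|region|postal-code|country-name|tel|email|url|org|title)$/
}

# Choose a property value. elem is the element name, key the property name,
# href and title are attribute values ("" when absent), text is the collected text.
function prop_value(elem, key, href, title, text) {
    if (key == "email" && href != "") {
        sub(/^mailto:/, "", href)        # drop the URI scheme, keep the address
        return href
    }
    if (key == "url" && href != "")
        return href
    if (elem == "abbr" && title != "")
        return title
    return text
}

BEGIN {
    print is_hcard_key("title")          # exact token: accepted
    print is_hcard_key("entry-title")    # contains a key, but is a different token
    print prop_value("a", "email", "mailto:john@example.org", "", "Mail me")
    print prop_value("abbr", "country-name", "", "Aaland Islands", "AX")
}
```

To use prop_value inside the script above, the href and title attributes would have to be saved alongside hckey when the element starts, since XMLATTR describes only the current start tag.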
Hello Manuel,
> Parsing hCard is a really challenging task. The XSLT transformation
> program from Brian Suda exceeds 1900 lines of text. This is a touchstone
> for any XML processor.

I wasn't aware that the hCard format has such widespread support.
I just looked at the example from Klaus and tried to
help him analyse his particular example. Do you have
a link to some kind of specification for the hCard format?
Oct 19 '06 #10
Jürgen Kahrs wrote:
> Do you have any link to some kind of specification for the
> hCard format ?

There was a link to microformats in my original posting:
http://microformats.org/

Cheers,

--
Klaus Alexander Seistrup
SubZeroNet, Copenhagen, Denmark
http://magnetic-ink.dk/
Oct 19 '06 #11
Jürgen Kahrs wrote:
> Do you have
> any link to some kind of specification for the hCard format ?

<http://microformats.org/wiki/hcard>
--
Johannes Koch
Spem in alium nunquam habui praeter in te, Deus Israel.
(Thomas Tallis, 40-part motet)
Oct 19 '06 #12
Manuel Collado wrote:
> The following code shows how to collect text data for hCard
> properties/subproperties that can be arbitrarily nested. It
> is a quick and dirty hack.

Thanks a lot, this is of great help to me in order to understand
XML processing with xgawk, I certainly appreciate it!

Cheers,

--
Klaus Alexander Seistrup
SubZeroNet, Copenhagen, Denmark
http://magnetic-ink.dk/
Oct 19 '06 #13
