Jürgen Kahrs wrote:
Klaus Alexander Seistrup wrote:
>...
However, while xgawk will find all relevant attributes, it seems that
the data variable will always either be empty or contain a linefeed.
I don't quite understand why, but I'm trying to find out...
The data variable is filled with the content of the very
last character data that occurred immediately before
the </div> (usually a newline). If you want anything else
to be assigned to the data variable, change the condition
in front of the assignment of the data variable.
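
To make the point concrete outside xmlgawk, here is a minimal Python sketch. It uses the standard xml.parsers.expat module, whose start / character-data / end callbacks are assumed here to play the role of XMLSTARTELEM, XMLCHARDATA and XMLENDELEM. The function name div_text and the sample markup are hypothetical. It contrasts keeping only the last text chunk seen before </div> with concatenating every chunk.

```python
import xml.parsers.expat

def div_text(xml_text):
    """Return (last_chunk, all_text) for character data inside <div> elements."""
    state = {"depth": 0, "last": "", "chunks": []}

    def start(name, attrs):
        # Start counting nesting at a <div>, and keep counting inside it.
        if name == "div" or state["depth"]:
            state["depth"] += 1

    def chardata(data):
        if state["depth"]:
            state["last"] = data          # overwritten by every new chunk
            state["chunks"].append(data)  # accumulated across chunks

    def end(name):
        if state["depth"]:
            state["depth"] -= 1

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # deliver contiguous text as a single chunk
    parser.StartElementHandler = start
    parser.CharacterDataHandler = chardata
    parser.EndElementHandler = end
    parser.Parse(xml_text, True)
    return state["last"], "".join(state["chunks"])

last, everything = div_text("<div>\n<span>John</span>\n</div>")
print(repr(last), repr(everything))
```

The last chunk is only the newline that precedes </div>, while the accumulated text also contains the name.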
Parsing hCard is a really challenging task. The XSLT transformation program
from Brian Suda exceeds 1900 lines of text. This is a touchstone for any
XML processor.
Recognizing the elements with hCard content is the easy part. Collecting
values is much more difficult. The following code shows how to collect text
data for hCard properties/subproperties that can be arbitrarily nested. It
is a quick and dirty hack. The hCard structure is lost: it is flattened
into a list of property name/value pairs. Only text data is collected.
Properties whose value lives in an attribute (like 'href') are not handled
properly.
Maybe someday I could write a useable hCard parser, just to show the
capabilities of xmlgawk.
-- hcard.awk ---------------------------------------------------------
# Extract content of hCard
# Partial work. Just a quick and dirty hack!
# Author: Manuel Collado
# Global variables:
# hcroot (string) : path of the hCard root element
# hcard (array prop.name->value) : the extracted hCard (flat structure)
# hclevel (number) : nesting level of hCard prop/subprop
# hcpath (array hclevel->path) : stack of nested prop/subprop
# hckey (array hclevel->prop.name) : stack of nested prop/subprop
# hcvalue (array hclevel->value) : stack of nested prop/subprop
@load xml
BEGIN {
    XMLCHARSET = "UTF-8"
    hcsep = "|"
    hckeys = "\\<(fn|n|family-name|given-name|additional-name|honorific-prefix|honorific-suffix|nickname|sort-string|url|email|type|value|tel|type|value|adr|post-office-box|extended-address|street-address|locality|region|postal-code|country-name|type|value|label|geo|latitude|longitude|tz|photo|logo|sound|bday|title|role|org|organization-name|organization-unit|category|note|class|key|mailer|uid|rev)\\>"
}
# hCard start
XMLSTARTELEM && (XMLATTR["class"] ~ "\\<vcard\\>") {
    hcroot = XMLPATH
    delete hcard
    hclevel = 0
    delete hcpath
    delete hckey
    delete hcvalue
}
# skip content outside the hCard
!hcroot {
    next
}
# data element (property) start
XMLSTARTELEM && hcroot {
    # push each keyword (if any) on the stack, in the order the class
    # attribute lists them (for-in traversal order is unspecified in awk)
    nkeys = split( XMLATTR["class"], keylist, " " )
    for (k = 1; k <= nkeys; k++) {
        if (keylist[k] ~ hckeys) {
            hclevel++
            hckey[hclevel] = keylist[k]
            hcpath[hclevel] = XMLPATH
            hcvalue[hclevel] = ""
        }
    }
}
# character data
XMLCHARDATA && hclevel {
    # concatenate text fragments inside the same property
    hcvalue[hclevel] = hcvalue[hclevel] $0
}
# data element (property) end
XMLENDELEM && XMLPATH == hcpath[hclevel] {
    # pop the value from the stack and accumulate on parent data and hcard
    while (XMLPATH == hcpath[hclevel]) {
        value = hcvalue[hclevel]
        key = hckey[hclevel]
        if (key in hcard) {
            hcard[key] = hcard[key] hcsep xs_trim(value)
        } else {
            hcard[key] = xs_trim(value)
        }
        delete hcvalue[hclevel]
        hclevel--
        if (hclevel) {
            hcvalue[hclevel] = hcvalue[hclevel] value
        }
    }
}
# hCard end
XMLENDELEM && XMLPATH == hcroot {
    hcroot = ""
    # dump the collected data
    print "------------------------------------------"
    for (key in hcard) {
        print key ": " hcard[key]
    }
    print "------------------------------------------"
}
END {
    XmlCheckError()
}
# XMLgawk error reporting needs some redesign.
# Interim code: uses both ERRNO and XMLERROR to generate consistent messages
function XmlCheckError() {
    if (XMLERROR) {
        printf("\n%s:%d:%d:(%d) %s\n", FILENAME, XMLROW, XMLCOL, XMLLEN,
               XMLERROR)
    } else if (ERRNO) {
        printf("\n%s\n", ERRNO)
        ERRNO = ""
    }
}
#------------------------------------------------------------------
# xs_trim: remove leading and trailing [[:space:]] characters, and
# collapse repeated spaces into a single one
#------------------------------------------------------------------
function xs_trim( string ) {
    sub(/^[[:space:]]+/, "", string)
    if (string) sub( /[[:space:]]+$/, "", string )
    if (string) gsub( /[[:space:]]+/, " ", string )
    return string
}
--------------------------------------------------------------------------------
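
As a cross-check of the stack technique used in hcard.awk, here is a rough Python sketch of the same idea built on the standard xml.parsers.expat module. It is not a port of the script. Element depth stands in for XMLPATH, only a small illustrative subset of property names is listed, and class tokens are tested by exact set membership instead of a regular expression. One further assumption goes beyond the script: a 'url' property takes its value from the 'href' attribute when one is present. The names flatten_hcard, HC_KEYS, HREF_KEYS and SAMPLE are hypothetical.

```python
import xml.parsers.expat

# Illustrative subset of hCard property names (not the full list).
HC_KEYS = {"fn", "n", "family-name", "given-name", "url", "email", "tel", "org"}
# Assumption: these properties prefer the href attribute over element text.
HREF_KEYS = {"url"}

def flatten_hcard(xhtml):
    """Return one {property: [values]} dict per class="vcard" element."""
    cards = []
    card = {}
    stack = []     # open properties, innermost last
    depth = 0      # current element nesting depth
    root = None    # depth of the open vcard element, if any

    def start(name, attrs):
        nonlocal depth, root
        depth += 1
        classes = attrs.get("class", "").split()
        if root is None:
            if "vcard" not in classes:
                return              # skip content outside the hCard
            root = depth
        for token in classes:       # push each keyword, in attribute order
            if token in HC_KEYS:
                stack.append({"depth": depth, "key": token,
                              "text": [], "href": attrs.get("href", "")})

    def chardata(data):
        if stack:                   # text belongs to the innermost property
            stack[-1]["text"].append(data)

    def end(name):
        nonlocal depth, root, card
        while stack and stack[-1]["depth"] == depth:
            prop = stack.pop()
            raw = "".join(prop["text"])
            if prop["key"] in HREF_KEYS and prop["href"]:
                value = prop["href"]
            else:
                value = " ".join(raw.split())   # trim and collapse spaces
            card.setdefault(prop["key"], []).append(value)
            if stack:               # the parent property also sees this text
                stack[-1]["text"].append(raw)
        if root == depth:           # the vcard element itself is closing
            cards.append(card)
            card = {}
            root = None
        depth -= 1

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True
    parser.StartElementHandler = start
    parser.CharacterDataHandler = chardata
    parser.EndElementHandler = end
    parser.Parse(xhtml, True)
    return cards

SAMPLE = """<div class="vcard">
  <span class="fn n">
    <span class="given-name">Ada</span>
    <span class="family-name">Lovelace</span>
  </span>
  <a class="url" href="http://example.org/">home page</a>
</div>"""

for found in flatten_hcard(SAMPLE):
    for key in sorted(found):
        print(key + ": " + "|".join(found[key]))
```

Like the awk script, this flattens the card and loses its nesting. Values are kept as lists here, and joined with "|" only when printed.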
Regards.
--
Manuel Collado