I'm trying to figure out how to extract the keywords from an HTML
document.
The input string would typically look like:
<meta name='keywords' content='word1, more stuff, etc'>
Either single quotes or double quotes can be used, and there can be any
amount of whitespace (spaces or line breaks) between any two elements.
Keywords can contain special characters, except for a comma or a closing
angle bracket (>). For example, the HTML might be:
<
meta name =
'
keywords'
content=
"word1 ,
more
stuff
,
etc"
>
The coolest thing would be to have a routine that returns one
keyword at a time (the keywords are separated by commas). However, I'd
be happy just to have the routine return only the keywords, without all
the rest of the surrounding HTML.
Here's the regex I've tried so far:
"[<][\s\n\r\t]*meta[\s\n\r\t]name[\s\n\r\t]*='[\s\n\r\t]*'[\s\n\r\t]*keywords[\s\n\r\t]content[\s\n\r\t]*=[\s\n\r\t]*'[^>]'[\s\n\r\t]*>"
It's not working very well :) (this regex stuff is complicated!)
Can anybody help a regex newbie?
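For reference, here is a minimal sketch of the kind of routine described above, written in Python with the standard re module. The post doesn't name a language, so Python is an assumption, and the names META_KEYWORDS and iter_keywords are made up for illustration. It also assumes the name attribute comes before content, as in both examples, and it stops at the closing quote of content instead of insisting on the final >. A real HTML parser would probably be more robust than any regex here.

```python
import re

# Hypothetical pattern (an illustration, not a definitive answer):
#   (['"]) captures the opening quote and \1 / \2 require the same closing quote,
#   \s* allows any run of spaces, tabs, or line breaks between the pieces,
#   (.*?) lazily captures everything inside the content attribute.
META_KEYWORDS = re.compile(
    r"""<\s*meta\s+name\s*=\s*(['"])\s*keywords\s*\1\s*content\s*=\s*(['"])(.*?)\2""",
    re.IGNORECASE | re.DOTALL,
)


def iter_keywords(html):
    """Yield the keywords of the first keywords meta tag, one at a time."""
    match = META_KEYWORDS.search(html)
    if match is None:
        return
    for raw in match.group(3).split(","):
        # Collapse internal line breaks and repeated spaces into single spaces.
        keyword = " ".join(raw.split())
        if keyword:
            yield keyword


if __name__ == "__main__":
    sample = "<\nmeta name =\n'\nkeywords'\ncontent=\n\"word1 ,\nmore\nstuff\n,\netc\""
    for keyword in iter_keywords(sample):
        print(keyword)
```

Calling list(iter_keywords(...)) gives all the keywords at once instead of one at a time. Splitting on commas after the match keeps the regex itself short, which seemed easier to follow than trying to capture each keyword inside the pattern.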