Regex to grab keywords from HTML header

Digital.Rebel.18

I'm trying to figure out how to extract the keywords from an HTML
document.
The input string would typically look like:
<meta name='keywords' content='word1, more stuff, etc'>

Either single quotes or double quotes can be used and there can be any
number of spaces or returns between any element. Keywords can contain
special characters except for a comma or a closed bracket. For
example, the HTML might be:

<
meta name =
'
keywords'
content=
"word1 ,
more
stuff
,
etc"
>

The coolest thing would be to have a routine actually return one
keyword at a time (the keywords are separated by commas). However, I'd
be happy just to have the routine return only the keywords, without
the rest of the surrounding HTML.

Here's what I've tried so far for a Regex string.

"[<][\s\n\r\t]*meta[\s\n\r\t]name[\s\n\r\t]*='[\s\n\r\t]*'[\s\n\r\t]*keywords[\s\n\r\t]content[\s\n\r\t]*=[\s\n\r\t]*'[^>]'[\s\n\r\t]*>"

It's not working very well :) (this regex stuff is complicated!)

Can anybody help a regex newbie?

Nov 23 '05 #1

Boni

You might want to try Expresso
http://www.ultrapico.com/
or
The Regulator (search Google to find the address)
HIH

Nov 23 '05 #2

Veign

Try one of the many RegEx sites:

How To Use Regular Expressions in Microsoft Visual Basic 6.0
http://support.microsoft.com/default...b;en-us;818802

RegEx Tutorial for VB:
http://juicystudio.com/tutorial/vb/regexp.asp

RegEx Library:
http://www.regexlib.com/

RegEx Module for VB:
http://www.aivosto.com/regexpr.html

--
Chris Hanscom - Microsoft MVP (VB)
Veign's Resource Center
http://www.veign.com/vrc_main.asp
Veign's Blog
http://www.veign.com/blog
--
Nov 23 '05 #3

Digital.Rebel.18

Woohoo! Great reference Boni!

Here's the regex string that returns the keywords:
<\s*meta\s*name\s*=\s*"\s*keywords\s*"\s*content\s*=\s*"\s*([^"]+)"\s*>

This makes a lot more sense now...
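
The pattern above only accepts double quotes, while the original question allows either quote style. Here is a minimal sketch of one way to cover both, written in Python purely for illustration (the thread does not settle on VB6 or VB.NET): capture the opening quote and require the same character to close the value via a backreference. The names META_KEYWORDS and extract_keywords_attr are made up for this example, and it assumes that name comes before content, as in the posted samples.

```python
import re

# Hypothetical pattern: groups q1/q2 remember the opening quote character,
# and the backreferences (?P=q1)/(?P=q2) require the same character to close.
META_KEYWORDS = re.compile(
    r"""<\s*meta\s+name\s*=\s*(?P<q1>['"])\s*keywords\s*(?P=q1)"""
    r"""\s*content\s*=\s*(?P<q2>['"])(?P<content>.*?)(?P=q2)\s*/?\s*>""",
    re.IGNORECASE | re.DOTALL,  # DOTALL lets the value span several lines
)


def extract_keywords_attr(html):
    """Return the raw content value of the keywords meta tag, or None."""
    match = META_KEYWORDS.search(html)
    return match.group("content") if match else None
```

The value comes back exactly as written in the tag, including any line breaks; separating it into individual keywords is a later step.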

Is there a way to further parse the keywords inside the ([^"]+) group
by adding to the string above?

Keywords are listed as:
at least one keyword (ending in either a quote or a comma)
if it ends with a quote, then throw away the quote and we're done.
if it ends in a comma, then look for repeating groups of [,next keyword]
and throw away the comma each time

I've looked at a number of tutorials online, and this part is more
complicated.
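
The rules listed above can also be handled as a second step instead of inside the same pattern. In many regex engines a repeated capturing group keeps only its last match, so splitting the captured text on commas afterwards tends to be simpler. Here is a minimal sketch in Python, used only for illustration; split_keywords and iter_keywords are hypothetical names, and the input is assumed to be the text already captured by the ([^"]+) group.

```python
def split_keywords(content):
    """Split a captured content value on commas and tidy each keyword."""
    keywords = []
    for part in content.split(","):
        # Collapse runs of spaces and line breaks, and trim both ends.
        cleaned = " ".join(part.split())
        if cleaned:  # skip empty entries, e.g. after a trailing comma
            keywords.append(cleaned)
    return keywords


def iter_keywords(content):
    """Yield the keywords one at a time."""
    for keyword in split_keywords(content):
        yield keyword
```

Usage: calling list(iter_keywords(text)) on the captured text gives the keywords in document order, with the commas and surrounding whitespace discarded.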

Note: Veign - the "juicystudio" reference 404'd out :(

Nov 23 '05 #4

Herfried K. Wagner [MVP]

<Di**************@gmail.com> schrieb:

I'm trying to figure out how to extract the keywords from an HTML
document.

I'd consider the alternatives to using regular expressions for this purpose:

<URL:http://groups.google.de/group/microsoft.public.dotnet.languages.csharp/msg/d3a373d9d9f8367b>
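
The linked post is not reproduced here. As one possible illustration of a non-regex approach, here is a minimal sketch using the HTML parser from Python's standard library (Python is used only for illustration; KeywordsMetaParser and keywords_from_html are hypothetical names). The parser deals with quoting and line breaks inside attribute values itself. Note that it does not treat "<" followed by whitespace as the start of a tag, so the exaggerated sample from the question would not be recognized.

```python
from html.parser import HTMLParser


class KeywordsMetaParser(HTMLParser):
    """Collect keywords from <meta name="keywords" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; values arrive unquoted.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if (attr_map.get("name") or "").strip().lower() != "keywords":
            return
        content = attr_map.get("content") or ""
        for part in content.split(","):
            cleaned = " ".join(part.split())  # collapse spaces and newlines
            if cleaned:
                self.keywords.append(cleaned)


def keywords_from_html(html):
    """Parse an HTML string and return the list of meta keywords."""
    parser = KeywordsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.keywords
```

Usage: keywords_from_html(page_source) returns a list of keywords, or an empty list if the page has no keywords meta tag.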

--
M S Herfried K. Wagner
M V P <URL:http://dotnet.mvps.org/>
V B <URL:http://classicvb.org/petition/>

Nov 23 '05 #5

Michael Cole

Di**************@gmail.com wrote:

I'm trying to figure out how to extract the keywords from an HTML
document.
The input string would typically look like:
<meta name='keywords' content='word1, more stuff, etc'>

You have posted this to both a dotnet group and a VB Classic group - the two
are different languages. You need to specify which language you are using,
because...
--
<response type="generic" language="VB.Net">
This newsgroup (.vb.syntax) is for users of Visual Basic version 6.0
and earlier and not the misleadingly named VB.Net
or VB 200x. Solutions, and often even the questions,
for one platform will be meaningless in the other.
When VB.Net was released Microsoft created new newsgroups
devoted to the new platform so that neither group of
developers need wade through the clutter of unrelated
topics. Look for newsgroups with the words "dotnet" or
"vsnet" in their name. For the msnews.microsoft.com news
server try these:

microsoft.public.dotnet.general
microsoft.public.dotnet.languages.vb

</response>
--
Regards,

Michael Cole

Nov 23 '05 #6
