473,624 Members | 2,272 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Regex to grab keywords from HTML header

I'm trying to figure out how to extract the keywords from an HTML
document.
The input string would typically look like:
<meta name='keywords' content='word1, more stuff, etc'>

Either single quotes or double quotes can be used and there can be any
number of spaces or returns between any element. Keywords can contain
special characters except for a comma or a closed bracket. For
example, the HTML might be:

<
meta name =
'
keywords'
content=
"word1 ,
more
stuff
,
etc"


The coolest thing would be to have a routine actually return one
keyword at a time (the keywords are separated by commas) However, I'd
be happy just to have the routine return only the keywords w/o all the
rest of the surrounding HTML.

Here's what I've tried so far for a Regex string.

"[<][\s\n\r\t]*meta[\s\n\r\t]name[\s\n\r\t]*='[\s\n\r\t]*'[\s\n\r\t]*keywords[\s\n\r\t]content[\s\n\r\t]*=[\s\n\r\t]*'[^>]'[\s\n\r\t]*>"

It's not working very well :) (this regex stuff is complicated!)

Can anybody help a regex newbie?

Nov 23 '05 #1
5 3050
You might want to try Expresso
http://www.ultrapico.com/
or
Regulator (look at google to find out the address)
HIH

<Di************ **@gmail.com> schrieb im Newsbeitrag
news:11******** **************@ g47g2000cwa.goo glegroups.com.. .
I'm trying to figure out how to extract the keywords from an HTML
document.
The input string would typically look like:
<meta name='keywords' content='word1, more stuff, etc'>

Either single quotes or double quotes can be used and there can be any
number of spaces or returns between any element. Keywords can contain
special characters except for a comma or a closed bracket. For
example, the HTML might be:

<
meta name =
'
keywords'
content=
"word1 ,
more
stuff
,
etc"
>


The coolest thing would be to have a routine actually return one
keyword at a time (the keywords are separated by commas) However, I'd
be happy just to have the routine return only the keywords w/o all the
rest of the surrounding HTML.

Here's what I've tried so far for a Regex string.

"[<][\s\n\r\t]*meta[\s\n\r\t]name[\s\n\r\t]*='[\s\n\r\t]*'[\s\n\r\t]*keywords[\s\n\r\t]content[\s\n\r\t]*=[\s\n\r\t]*'[^>]'[\s\n\r\t]*>"

It's not working very well :) (this regex stuff is complicated!)

Can anybody help a regex newbie?

Nov 23 '05 #2
Try one of the many RegEx sites:

How To Use Regular Expressions in Microsoft Visual Basic 6.0
http://support.microsoft.com/default...b;en-us;818802

RegEx Tutorial for VB:
http://juicystudio.com/tutorial/vb/regexp.asp

RegEx Library:
http://www.regexlib.com/

RegEx Module for VB:
http://www.aivosto.com/regexpr.html

--
Chris Hanscom - Microsoft MVP (VB)
Veign's Resource Center
http://www.veign.com/vrc_main.asp
Veign's Blog
http://www.veign.com/blog
--
<Di************ **@gmail.com> wrote in message
news:11******** **************@ g47g2000cwa.goo glegroups.com.. .
I'm trying to figure out how to extract the keywords from an HTML
document.
The input string would typically look like:
<meta name='keywords' content='word1, more stuff, etc'>

Either single quotes or double quotes can be used and there can be any
number of spaces or returns between any element. Keywords can contain
special characters except for a comma or a closed bracket. For
example, the HTML might be:

<
meta name =
'
keywords'
content=
"word1 ,
more
stuff
,
etc"
>


The coolest thing would be to have a routine actually return one
keyword at a time (the keywords are separated by commas) However, I'd
be happy just to have the routine return only the keywords w/o all the
rest of the surrounding HTML.

Here's what I've tried so far for a Regex string.

"[<][\s\n\r\t]*meta[\s\n\r\t]name[\s\n\r\t]*='[\s\n\r\t]*'[\s\n\r\t]*keywords[\s\n\r\t]content[\s\n\r\t]*=[\s\n\r\t]*'[^>]'[\s\n\r\t]*>"

It's not working very well :) (this regex stuff is complicated!)

Can anybody help a regex newbie?

Nov 23 '05 #3
Woohoo! Great reference Boni!

Here's the regex string that returns the keywords:
<\s*meta\s*name \s*=\s*"\s*keyw ords\s*"\s*cont ent\s*=\s*"\s*([^"]+)"\s*>

This makes a lot more sense now...

Is there a way to further parse the keywords inside the ([^"]+)
adding to the string above?

Keywords are listed as
at least one keyword (ending in either quote or comma)
if it ends with a quote, then throw away the quote and we're done.
if it ends in comma then look for repeating groups of [,next keyword]
and throw away the comma each time

I've looked at a number of tutorials online and this part is more
complicated

Note: Veign - the "jucystudio " reference 404'd out :(

Nov 23 '05 #4
<Di************ **@gmail.com> schrieb:
I'm trying to figure out how to extract the keywords from an HTML
document.


I'd consider the alternatives to using regular expressions for this purpose:

<URL:http://groups.google.d e/group/microsoft.publi c.dotnet.langua ges.csharp/msg/d3a373d9d9f8367 b>

--
M S Herfried K. Wagner
M V P <URL:http://dotnet.mvps.org/>
V B <URL:http://classicvb.org/petition/>

Nov 23 '05 #5
Di************* *@gmail.com wrote:
I'm trying to figure out how to extract the keywords from an HTML
document.
The input string would typically look like:
<meta name='keywords' content='word1, more stuff, etc'>


You have posted this to both a dotnet group and a VB Classic group - the two
are different languages. You need to specify which language you are using,
because...
--
<response type="generic" language="VB.Ne t">
This newsgroup (.vb.syntax) is for users of Visual Basic version 6.0
and earlier and not the misleadingly named VB.Net
or VB 200x. Solutions, and often even the questions,
for one platform will be meaningless in the other.
When VB.Net was released Microsoft created new newsgroups
devoted to the new platform so that neither group of
developers need wade through the clutter of unrelated
topics. Look for newsgroups with the words "dotnet" or
"vsnet" in their name. For the msnews.microsof t.com news
server try these:

microsoft.publi c.dotnet.genera l
microsoft.publi c.dotnet.langua ges.vb

</response>
--
Regards,

Michael Cole
Nov 23 '05 #6

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

0
2712
by: Baby Blue | last post by:
I have 2 code like below to grab a news website for my site. However, when I click some links (such as : http://wwww.vnexpress.net/xxx/xxxx ) inside the site which I want to grab, it has some errors. Can any body help me ?? Demo : news.thuthao.info The real website : vnexpress.net ---------------------------index.php------------------------------- <?php
7
5720
by: alphatan | last post by:
Is there relative source or document for this purpose? I've searched the index of "Mastering Regular Expression", but cannot get the useful information for C. Thanks in advanced. -- Learning is to improve, but not to prove.
1
2891
by: darrel | last post by:
I have some vb.net code that is running a regex, matching groups, and replacing them. I'm trying to come up with a simple script that will strip all attributes from all HTML tags. This is what I have: ============================================================= function stripAllAttributes(ByVal textToParse as String, ByVal tagToFind as String) as String
4
1914
by: MooMaster | last post by:
I'm trying to develop a little script that does some string manipulation. I have some few hundred strings that currently look like this: cond(a,b,c) and I want them to look like this: cond(c,a,b)
0
8242
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...
0
8177
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
0
8681
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...
0
8629
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
0
5570
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
0
4084
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
0
4183
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
1
1793
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.
2
1488
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.