Bytes | Software Development & Data Engineering Community
Regular Expression for all attributes in HTML tag

I need to list all the key/value pairs of an HTML tag. I already have
the complete tag as a text string.

For example: (Worst case scenario where standards were not followed in
the past)
<myTag key1="aaa" key2 = "bbb" key3='ccc' key4=444 key5= 555
key5="Please click here" >

I ended up with two versions, each with its own flaw, and I can't seem to
merge them:
A) Allows values without " or ' around them, but fails when there is a
space in the attribute value:
\b(?<Keyword>[^>\s][\w]+)[\s]*=[\s]*[",']?(?<Value>[\w]*)[",']?

B) Allows spaces in the attribute value, but misses values without " or '
around them:
\b(?<Keyword>[^>\s][\w]+)[\s]*=[\s]*[",']?(?<Value>[\w\s]*)[",']

This is my merge attempt, which finds all the keys and integer values,
but not the text values:
\b(?<Keyword>[^>\s][\w]+)[\s]*=[\s]*(?<Value>((?<!["'])[\d]+(?!["']))|((?<=["']?)[\w\s]*(?=["']?)))

Thanks in advance - help here would be much appreciated.

Gert

Aug 25 '06 #1
Hi Gert,
> I need to list all the key/value pairs of an HTML tag. I already have
> the complete tag as a text string.
> (Worst case scenario where standards were not followed in the past)

Since your parser needs to be aware of all kinds of ways to write
attributes, I think trying to write an all-around regular expression quickly
becomes a steep uphill climb.

I would probably forget about regular expressions altogether, and instead
write a simple text parser of my own. I think that would be simpler.

Just a thought, I'm not saying you can't do it with regex.
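
To make the "simple text parser" idea concrete, here is a minimal sketch in Python (the thread itself targets .NET, so treat this as an illustration of the approach only). The function name `scan_attributes` and the decision to return `None` for attributes without a value are assumptions made for this example; nothing here comes from an existing library.

```python
def scan_attributes(tag):
    """Return (name, value) pairs from one tag string such as '<a href=x>'."""
    pairs = []
    i, n = 0, len(tag)
    # Skip the opening '<' and the tag name.
    if i < n and tag[i] == "<":
        i += 1
    while i < n and not tag[i].isspace() and tag[i] != ">":
        i += 1
    while i < n:
        # Skip whitespace and stray slashes between attributes.
        while i < n and (tag[i].isspace() or tag[i] == "/"):
            i += 1
        if i >= n or tag[i] == ">":
            break
        # Read the attribute name up to whitespace, '=' or '>'.
        start = i
        while i < n and not tag[i].isspace() and tag[i] not in "=>":
            i += 1
        name = tag[start:i]
        # Allow whitespace before the '='.
        while i < n and tag[i].isspace():
            i += 1
        if i >= n or tag[i] != "=":
            pairs.append((name, None))  # attribute without a value
            continue
        i += 1  # consume '='
        # Allow whitespace after the '='.
        while i < n and tag[i].isspace():
            i += 1
        if i < n and tag[i] in "\"'":
            # Quoted value: read up to the matching quote character.
            quote = tag[i]
            end = tag.find(quote, i + 1)
            if end == -1:
                end = n  # unterminated quote: take the rest of the string
            value = tag[i + 1:end]
            i = min(end + 1, n)
        else:
            # Unquoted value: read up to whitespace or '>'.
            start = i
            while i < n and not tag[i].isspace() and tag[i] != ">":
                i += 1
            value = tag[start:i]
        pairs.append((name, value))
    return pairs
```

On the sample tag from the question this yields all six pairs, including both `key5` entries, with `444` and `555` as unquoted values. It makes no attempt to handle entities, comments, or nested markup.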

--
Regards,

Mr. Jani Järvinen
C# MVP
Helsinki, Finland
ja***@removethi s.dystopia.fi
http://www.saunalahti.fi/janij/
Aug 25 '06 #2
You may want to give this HTML parser a try:

http://www.netomatix.com/products/do...parsernet.aspx

There is a fully functional community edition that you can use for free.
Aug 25 '06 #3
This ought to do it for you:

(\w+)=(?:["']?([^"'>=]*)["']?)

Translation: a sequence of one or more word characters (letters, digits,
and/or underscores), followed by an equals sign, followed by 0 or 1 single
or double quotes, followed by any number of characters that are not a
single quote, a double quote, a right angle bracket, or an equals sign,
followed by 0 or 1 single or double quotes.
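
For anyone who wants to see how this pattern behaves on the sample tag, here is a small check in Python rather than .NET. The pattern is copied unchanged (it uses no .NET-specific syntax); the names `PATTERN`, `SAMPLE`, and `matches` are only for this illustration, and the sample tag is written on one line.

```python
import re

# The pattern from this post, unchanged.
PATTERN = re.compile(r"""(\w+)=(?:["']?([^"'>=]*)["']?)""")

# The sample tag from the question, joined onto a single line.
SAMPLE = ('<myTag key1="aaa" key2 = "bbb" key3=\'ccc\' key4=444 key5= 555 '
          'key5="Please click here" >')

# findall returns one (key, value) tuple per match.
matches = PATTERN.findall(SAMPLE)
for key, value in matches:
    print(f"{key!r} -> {value!r}")
```

The pattern has no allowance for whitespace around the equals sign, so `key2 = "bbb"` and `key5= 555` are skipped. The unquoted `key4=444` does match, but because a space is not excluded from the value, the captured value runs on to `444 key5`.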

--
HTH,

Kevin Spencer
Microsoft MVP
Chicken Salad Surgery

It takes a tough man to make a tender chicken salad.
Aug 25 '06 #4
Hi Kevin & others,
> (\w+)=(?:["']?([^"'>=]*)["']?)
This one misses "key4=444" in my example, but it surely makes my attempt
look like a goods train by comparison. :) I will use it as a starting point
to try again.
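
One possible direction for such a second attempt, sketched in Python: allow whitespace around the equals sign and use an alternation with one branch per value form (double-quoted, single-quoted, unquoted). This is an assumption about how the pattern could be revised, not the solution anyone in the thread settled on. The group names `key`, `dq`, `sq`, and `bare` and the helper `parse_attributes` are invented for this example, and Python spells named groups `(?P<name>...)` where .NET uses `(?<name>...)`.

```python
import re

ATTR_RE = re.compile(r"""
    (?P<key>[\w-]+)            # attribute name (allowing '-' is an assumption)
    \s*=\s*                    # equals sign with optional whitespace
    (?:
        "(?P<dq>[^"]*)"        # double-quoted value, may contain spaces
      | '(?P<sq>[^']*)'        # single-quoted value, may contain spaces
      | (?P<bare>[^\s"'>]+)    # unquoted value, ends at whitespace or '>'
    )
""", re.VERBOSE)


def parse_attributes(tag):
    """Return (key, value) pairs for every key=value attribute in the tag."""
    pairs = []
    for m in ATTR_RE.finditer(tag):
        # Exactly one of the three value groups participates in each match.
        value = m.group("dq")
        if value is None:
            value = m.group("sq")
        if value is None:
            value = m.group("bare")
        pairs.append((m.group("key"), value))
    return pairs
```

Because each branch states its own terminator, a quoted value can contain spaces while an unquoted one stops at the first space. On the sample tag this returns all six pairs, with `444` and `555` for the unquoted values. Attributes without a value are not reported by this sketch.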

Jani & Winista, I will try the parser and let you know the results...

Thanks, gert
Aug 28 '06 #5
