Regex Question - General

3,256 Recognized Expert Specialist

I know almost nothing of Regular Expressions other than that they exist. However, I know that they're probably the answer to my co-worker's problem.

We need to strip some HTML out of some data. The biggest problem we have is the <p> tags, because they also carry attributes. For example:

 <p class="asdf1234">Some Text</p>

<p class="qwer567890">Some Other Text</p>

Our end goal is

 Some Text

Some Other Text

We've so far gotten a regex to remove the closing </p>, and to get rid of an empty open <p>, but if it has any attributes included, the regex won't match it.

Can anyone suggest a regex that will match "<p" + any number of characters/symbols + ">" for me? I'd appreciate it.

Jan 5 '10 #1

tlhintoq

3,525

Recognized Expert Specialist

I know nothing of RegEx either, so my way is probably more brute force:
Get the indexes of the first ">" and the last "<" characters.
Discard everything up to and including the first ">", and everything from the last "<" onward.
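
As a rough sketch of that index-based idea in Python (the function name `strip_outer_tags` is made up for illustration, and it assumes each string holds a single element with no nested tags):

```python
def strip_outer_tags(line):
    """Drop everything up to the first '>' and from the last '<' onward."""
    start = line.find(">")   # end of the opening tag
    end = line.rfind("<")    # start of the closing tag
    if start == -1 or end == -1 or end <= start:
        return line.strip()  # no usable tag pair: leave the text alone
    return line[start + 1:end]

print(strip_outer_tags('<p class="asdf1234">Some Text</p>'))  # Some Text
```

This stops working as soon as one string holds more than one element or nested markup, which is where a regex becomes the better fit.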

Jan 5 '10 #2

bvdet

2,851

Recognized Expert Moderator Specialist

insertAlias,

In Python:

r"<p\b[^>]*>([^<>]+?)</p>"

Example:

import re

# "<p", a word boundary (so <pre> is skipped), any attributes, then the text
patt = re.compile(r"<p\b[^>]*>([^<>]+?)</p>")

def tag_text(s):
    """Return the text of every <p>...</p> element found in s."""
    output = []
    while True:
        m = patt.search(s)
        if m:
            output.append(m.group(1))
            # Resume searching right after this match
            s = s[m.end():]
        else:
            return output

s = '''
<p class="asdf1234">Some Text</p>
<p class="qwer567890">Some Other Text</p>'''

textList = tag_text(s)
print(textList)

Output:

['Some Text', 'Some Other Text']
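
The same extraction can probably be written more compactly with `re.findall`, which returns the captured group for every match. This is only a sketch: it uses a tightened variant of the pattern (`\b[^>]*` in place of `.*`) so that two paragraphs on one line are not merged and `<pre>` is not picked up.

```python
import re

# "<p", a word boundary, any attributes, then the captured text, then "</p>"
patt = re.compile(r"<p\b[^>]*>([^<>]+?)</p>")

def tag_text(s):
    """Return the text inside every <p>...</p> element in s."""
    # With one capturing group, findall returns a list of that group's text
    return patt.findall(s)

s = '''
<p class="asdf1234">Some Text</p>
<p class="qwer567890">Some Other Text</p>'''

print(tag_text(s))  # ['Some Text', 'Some Other Text']
```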

Jan 5 '10 #3

Markus

6,050

Recognized Expert Expert

@bvdet
The regex will largely be the same throughout languages.

Jan 5 '10 #4

NeoPa

32,558

Recognized Expert Moderator MVP

</?p[^>]*>

will match either the opening tag or the closing tag.

If you go through your text replacing each match with an empty string, you should have what you need.

< ==> Find string starting with <.
/? ==> Next it may, or may not, have a /.
p ==> A p must follow.
[^>] ==> Any character other than >.
* ==> Match any number of the preceding specification.
> ==> A > must follow.
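
Put together, the replace-with-blank idea might look like this in Python. This is a sketch only: the names `P_TAG` and `strip_p_tags` are invented for illustration, and the pattern is used exactly as given above.

```python
import re

# Opening or closing p tag, with or without attributes
P_TAG = re.compile(r"</?p[^>]*>")

def strip_p_tags(html):
    """Replace every match of the pattern with an empty string."""
    return P_TAG.sub("", html)

s = '<p class="asdf1234">Some Text</p>\n<p class="qwer567890">Some Other Text</p>'
print(strip_p_tags(s))
```

Note that `p[^>]*` also matches other tags that begin with p, such as `<pre>`; writing `</?p\b[^>]*>` limits it to `<p>` itself.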

Jan 6 '10 #5

Curtis Rutland

3,256

Recognized Expert Specialist

I think Ade's is going to do it for me. I'll give it to my co-worker tomorrow and see if it works.

Thanks for the response. I've been meaning to learn regex, but I'm a big slacker.

Jan 6 '10 #6

MMcCarthy

14,534

Recognized Expert Moderator MVP

IA, if this answers the question (after consulting with your colleague), can you move this into misc? Otherwise it won't be public in searches.

Thanks

Mary

Jan 6 '10 #7

NeoPa

32,558

Recognized Expert Moderator MVP

Sorry Mary. I should have done this before. There's no need to wait for the answer to be confirmed. That's where it should be anyway.

Jan 6 '10 #8

NeoPa

32,558

Recognized Expert Moderator MVP

BTW. I learnt all I know about RegExes from the Help section of a utility called TextPad. A pretty powerful text editor (I'm sure there are various other good ones out there too) which supports them. If you click Help while in the Search or Replace dialog boxes it takes you to three pages of info and details. Go through that and practice a bit (I found it so powerful I didn't need to try to practice) and you'll be making them dance in no time. It also acts as a reference when you can sort of remember what you need but need a memory jog.

There are probably other resources more readily available on the web, but this is the only one I know well, as I used it to learn from and it did a good job for me.

Jan 6 '10 #9

bvdet

2,851

Recognized Expert Moderator Specialist

I learned about regular expressions a little at a time. I found this page to be very informative and easy to understand. It is helpful having a way to easily test a regular expression. When you cannot figure out why an expression won't work, editing in a regex debugger makes it almost tolerable. I have been using Kodos.

Jan 6 '10 #10

Markus

6,050

Recognized Expert Expert

The only problem is... where the hell is misc? It's disappeared from the navigation again.

Jan 6 '10 #11

NeoPa

32,558

Recognized Expert Moderator MVP

You could try the Ask Question link at the top.

It's not there either, but you could try just for fun :D

Otherwise the breadcrumbs are good from here :S

Jan 6 '10 #12

dgreenhouse

250

Recognized Expert Contributor

I know this is a 3 week old thread, but I'd like to note
that the following book is the RegEx bible:

Mastering Regular Expressions by Jeffrey E. F. Friedl
Publisher: O'Reilly (it's currently in its 3rd edition).

It sits on my desk to the right; I've barely touched its depths. It hurts your head! :-)

Feb 2 '10 #13

Markus

6,050

Recognized Expert Expert

That's going on my wishlist. Thanks, dgreenhouse.

Feb 2 '10 #14
