A challenge: regex to convert all urls within HTML

Hi!

My task: take HTML and convert it into plain text.

Sub-task:
1. Find all URLs within HTML (<a href="http://www.abc.com">More about baby bears</a>).
2. And convert them into plain text: More about baby bears (http://www.abc.com)
The question:
Can it be done with a single regex (i.e. a single pass)? If not, what
would be the most efficient way of doing it?

Thank you very much for your time!

Sep 17 '07 #1
I don't claim it would be better or worse - but if the source is
XHTML, an alternative might be XSLT? But it can be hard to write tidy
XSLT that correctly handles mixed content (which is typical in XHTML).

Just a thought...

But yes - I would *imagine* that you can do this with a regex replace,
but handling all permutations of attributes / sequence etc could be a
pain.

Marc
Sep 17 '07 #2
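The concern above about attribute order and quoting can be illustrated with a parser-based alternative. This is only a minimal sketch in Python using the standard library's html.parser, not something proposed in the thread; the names LinkFlattener and html_to_text are hypothetical. It assumes that dropping all tags and appending the href after each link's text is the desired plain-text form.

```python
from html.parser import HTMLParser


class LinkFlattener(HTMLParser):
    """Collects text and appends ' (URL)' after each closed <a href=...> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []   # output text fragments, in document order
        self.hrefs = []   # stack of href values for currently open <a> tags

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; the parser lower-cases names,
            # so attribute order, case, and quote style do not matter here.
            self.hrefs.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "a" and self.hrefs:
            href = self.hrefs.pop()
            if href:
                self.parts.append(f" ({href})")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html: str) -> str:
    """Hypothetical helper: strip tags, keeping link targets in parentheses."""
    parser = LinkFlattener()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(html_to_text(
    "<p>See <A class='x' HREF='http://www.abc.com'>More about baby bears</A>.</p>"
))
```

The trade-off is that this is no longer a single regex, but it is still one pass over the input, and it does not need to anticipate every permutation of attributes.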
(?i)(?s)<a[^>]+?href="?(?<url>[^"]+)"?>(?<innerHtml>.+?)</a\s*>

Group "url" contains the URL.
Group "innerHtml" contains the inner HTML, i.e. the text between the tags.
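
To show how the two groups could drive a single-pass replace, here is a minimal sketch in Python rather than .NET. It assumes the pattern above is otherwise unchanged: .NET's (?<name>...) groups are written as (?P<name>...), and the inline (?i)(?s) options are passed as flags. The names ANCHOR_RE and links_to_text are hypothetical.

```python
import re

# The pattern from the post, adapted to Python syntax (an assumption of this sketch):
# (?<name>...) becomes (?P<name>...), and (?i)(?s) become IGNORECASE | DOTALL.
ANCHOR_RE = re.compile(
    r'<a[^>]+?href="?(?P<url>[^"]+)"?>(?P<innerHtml>.+?)</a\s*>',
    re.IGNORECASE | re.DOTALL,
)


def links_to_text(html: str) -> str:
    """Replace each <a href="URL">text</a> with 'text (URL)' in one substitution pass."""
    return ANCHOR_RE.sub(
        lambda m: f"{m.group('innerHtml')} ({m.group('url')})", html
    )


sample = '<p>See <a href="http://www.abc.com">More about baby bears</a>.</p>'
print(links_to_text(sample))
```

As written, the pattern expects href to be the last attribute before the closing ">", so a link such as <a href="x" class="y">T</a> is left untouched by this sketch.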

--
HTH,

Kevin Spencer
Microsoft MVP

DSI PrintManager, Miradyne Component Libraries:
http://www.miradyne.net

Sep 17 '07 #3
Marc, thank you! It's a cool idea!
Kevin, thank you, it works a treat! You're a star!

Sep 17 '07 #4
