Parsing HTML to remove pictures and stylesheets

Seb
Hello,

I am trying to find an object or function that can take an HTML page
(code) as input, strip out all images, stylesheets and other
external references, and return either "cleaned" HTML only (without
external references) or a text-only version of the page.

Any ideas?

Thanks,
Seb

Oct 21 '06 #1
I would start with the Html Agility Pack and see if it helps you.

http://www.codeplex.com/Wiki/View.as...tmlagilitypack

If that fails, then a few well-targeted regular expressions would, I
expect, do the job of finding the offending parts. Regex.Replace takes
regular expressions (String.Replace does not).
--
Regards

John Timney (MVP)
VISIT MY WEBSITE:
http://www.johntimney.com

Oct 21 '06 #2
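
A rough sketch of the parser-first approach recommended in the reply above. The thread is about .NET and the Html Agility Pack; this example does not use that library's API. It assumes Python and its standard html.parser module, purely to illustrate the idea of walking the parsed tags and re-emitting everything except images, <style> blocks and stylesheet links. The names Cleaner and strip_external are made up for illustration.

```python
from html.parser import HTMLParser


class Cleaner(HTMLParser):
    """Re-emit HTML, dropping <img>, <style> blocks and stylesheet <link> tags."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; unchanged.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_style = False

    def _dropped(self, tag, attrs):
        if tag in ("img", "style"):
            return True
        if tag == "link":
            rel = (dict(attrs).get("rel") or "").lower()
            return "stylesheet" in rel.split()
        return False

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self.in_style = True
        if not self._dropped(tag, attrs):
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, as written.
        if not self._dropped(tag, attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False
        elif tag != "img":
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.in_style:  # skip the CSS text inside <style>
            self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def strip_external(html):
    """Return html without images, <style> blocks and stylesheet links."""
    cleaner = Cleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.out)
```

Calling strip_external(page_source) returns the cleaned markup as a string. Other external references (scripts, frames, and so on) are left alone in this sketch and would need their own rules in _dropped.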
Hey Seb,

I don't know of a program or function that already does this kind of thing,
but you could implement it by using Regular Expressions
(System.Text.RegularExpressions).

check out:
http://msdn.microsoft.com/library/de...classtopic.asp

Eric

Oct 21 '06 #3
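
A minimal sketch of the regular-expression route suggested above. It is written in Python with the standard re module for illustration only; in .NET the same patterns would be passed to Regex.Replace from System.Text.RegularExpressions. The three patterns and the function name strip_with_regex are assumptions made for this example, and they can be fooled by unusual markup (for instance a ">" inside an attribute value), which is why a real parser is usually the safer first choice.

```python
import re

# Illustrative patterns only; real-world HTML can defeat any regex.
STYLE_BLOCK = re.compile(r"<style\b[^>]*>.*?</style\s*>", re.IGNORECASE | re.DOTALL)
STYLESHEET_LINK = re.compile(
    r"<link\b(?=[^>]*\brel\s*=\s*[\"']?stylesheet)[^>]*>", re.IGNORECASE
)
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def strip_with_regex(html):
    """Remove <style> blocks, stylesheet <link> tags and <img> tags."""
    for pattern in (STYLE_BLOCK, STYLESHEET_LINK, IMG_TAG):
        html = pattern.sub("", html)
    return html
```

The lookahead in STYLESHEET_LINK is there so that other <link> tags, such as a favicon link, are kept.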
I'd second that; the Html Agility Pack is ideal for this sort of thing.
In fact, it includes sample code for converting HTML into plain text.

More info here:
http://chrisfulstow.blogspot.com/200...ml-in-net.html

--
Chris Fulstow
MCP, MCTS
http://chrisfulstow.blogspot.com/
Oct 21 '06 #4
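
For the text-only version asked about in the original question, here is a small sketch of the idea. It is not the Html Agility Pack sample mentioned above; it assumes Python's standard html.parser module, and the names TextExtractor and html_to_text are hypothetical. It collects text nodes, skips the contents of <script> and <style>, and collapses whitespace.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        # The default convert_charrefs=True decodes entities such as &amp;.
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the text content of html with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with a space puts a gap at every tag boundary, so inline
    # markup inside a word (fo<b>o</b>) would gain a space.
    return " ".join(" ".join(parser.parts).split())
```

Images need no special handling here because an <img> tag carries no text node; only its attributes reference the picture.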
