How to find <tag> to </tag> HTML strings and 'save' them?

Hi All,

Apologies for the newbie question but I've searched and tried all
sorts for a few days and I'm pulling my hair out ;[

I have a 'reference' HTML file and a 'test' HTML file from which I
need to pull 10 strings, all of which are contained within <h2> tags,
e.g.:
<h2 class=r><a href="http://www.someplace.com/">Go Someplace</a></h2>

Once I've found the 10 I'd like to write them to another 'results'
html file. Perhaps a 'reference results' and a 'test results' file.
From where I would then like to 'diff' the results to see if they
match.

Here's the rub: I cannot find a way to pull those 10 strings so I can
save them to the results pages.
Can anyone please suggest how this can be done?

I've tried all sorts, but I've been learning Python for 1 week and just
don't know enough to modify example scripts, it seems. Don't even get me
started on python docs.. ayaa ;] Please feel free to teach me to suck
eggs because it's all new to me :)

Thanks in advance,

Mark.

Mar 25 '07 #1

On Mar 25, 2007, at 12:04 PM, ma**@agtechnical.co.uk wrote:
don't even get me
started on python docs.. ayaa ;]
ok, try getting started with this then:
http://www.crummy.com/software/BeautifulSoup/
Mar 25 '07 #2
Take a look at BeautifulSoup. It is easy to use and works well with some
malformed HTML that you might find ahead.

--
Jorge Godoy <jg****@gmail.com>
Mar 25 '07 #3
Great, thanks so much for posting that. It's worked a treat and I'm
getting HTML files with the list of h2 tags I was looking for. Here's
the code just to share, what a relief :) :
................................
from BeautifulSoup import BeautifulSoup
import re

page = open("soup_test/tomatoandcream.html", 'r')
soup = BeautifulSoup(page)

myTagSearch = str(soup.findAll('h2'))

myFile = open('Soup_Results.html', 'w')
myFile.write(myTagSearch)
myFile.close()

del myTagSearch
................................

I do have two other small queries that I wonder if anyone can help
with.

Firstly, I'm getting the following character: "[" at the start, "]" at
the end of the code. Along with "," in between each tag line listing.
This seems like normal behaviour but I can't find the way to strip
them out.

There's an example of stripping comments and I understand the example,
but what's the *reference* to the above '[', ']' and ',' elements?
for the comma I tried:
soup.find(text=",").replaceWith("")

but that throws this error:
AttributeError: 'NoneType' object has no attribute 'replaceWith'

Again working with the 'Removing Elements' example I tried:
soup = BeautifulSoup("you are a banana, banana, banana")
a = str(",")
comments = soup.findAll(text=",")
[",".extract() for "," in comments]
But if I do 'import beautifulSoup', the line
soup = BeautifulSoup("you are a banana, banana, banana")
gives me a "TypeError: 'module' object is not callable" error, and
"import beautifulSoup from BeautifulSoup" does nothing.
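
A "'module' object is not callable" TypeError is what Python raises when a module itself is called instead of something defined inside it. The sketch below is only an illustration: it uses the standard library's decimal module and its Decimal class as a stand-in (an assumption, chosen because it needs no third-party install) for a module and class that share a similar name.

```python
import decimal                # binds the name "decimal" to the module
from decimal import Decimal   # binds the name "Decimal" to the class inside it


def call_module():
    """Call the module itself, as `import X` followed by `X(...)` would."""
    try:
        decimal("1.5")        # a module object cannot be called
    except TypeError as exc:
        return str(exc)
    return None


error_message = call_module()
value_via_class = Decimal("1.5")            # calling the imported class works
value_via_module = decimal.Decimal("1.5")   # so does qualifying it with the module
print(error_message)
```

Either form works for the class: import it by name with "from module import Name", or keep the plain import and write module.Name(...).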

Secondly, in the above working code that is just pulling the h2 tags -
how the blazes do I 'prettify' before writing to the file?

Thanks in advance!

Mark.

Mar 25 '07 #4
On Sun, 25 Mar 2007 19:44:17 -0300, <ma**@agtechnical.co.uk> wrote:
Firstly, I'm getting the following character: "[" at the start, "]" at
the end of the code. Along with "," in between each tag line listing.
This seems like normal behaviour but I can't find the way to strip
them out.
findAll() returns a list. You convert that list to its string
representation using str(...), and that is what lists look like: with
[] around them and commas separating the elements. If you don't like that,
don't use str(some_list).
Do you want one item per line? Use "\n".join(str(tag) for tag in myTagSearch)
(remember to remove the str() around findAll; join() needs strings, so each
tag is converted individually instead).
Do you want comma-separated items? Use ",".join(str(tag) for tag in myTagSearch)
Read about lists here http://docs.python.org/lib/typesseq.html and strings
here http://docs.python.org/lib/string-methods.html
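
The difference between str(some_list) and join() can be seen without any HTML library. In this minimal sketch, FakeTag is a hypothetical stand-in for a parsed tag object (it is not part of BeautifulSoup or any other library); since join() only accepts strings, each element is converted with str() first.

```python
class FakeTag:
    """Hypothetical stand-in for a parsed tag; prints as its own markup."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup

    __repr__ = __str__   # lists display their elements using repr()


tags = [FakeTag("<h2>One</h2>"), FakeTag("<h2>Two</h2>")]

# str() on the whole list adds the brackets and the ", " separators.
as_list_text = str(tags)

# join() on the individual strings gives full control over the separator.
one_per_line = "\n".join(str(tag) for tag in tags)
comma_separated = ",".join(str(tag) for tag in tags)

print(as_list_text)
print(one_per_line)
```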

For the remaining questions, I strongly suggest reading the Python
Tutorial (or any other book like Dive into Python). You should grasp some
basic knowledge of the language at least, before trying to use other tools
like BeautifulSoup; it's too much for a single step.

--
Gabriel Genellina

Mar 25 '07 #5
Yep, I agree! Once I've got this done I'll be back to trawling the
tutorials.
Life never gives you the convenience of learning something fully
before having to apply what you have learnt ;]

Thanks for the feedback and links, I'll be sure to check those out.

Mark.

Mar 26 '07 #6
ma**@agtechnical.co.uk wrote:
Firstly, I'm getting the following character: "[" at the start, "]" at
the end of the code. Along with "," in between each tag line listing.
This seems like normal behaviour but I can't find the way to strip
them out.
Ah. What you want is more like this:

from BeautifulSoup import BeautifulSoup

page = open("soup_test/tomatoandcream.html", 'r')
soup = BeautifulSoup(page)
htags = soup.findAll({'h2':True, 'H2' : True}) # get all H2 tags, both cases

myFile = open('Soup_Results.html', 'w')

for htag in htags : # for each H2 tag
    texts = htag.findAll(text=True) # find all text items within this h2
    s = ' '.join(texts).strip() + '\n' # combine text items into clean string
    myFile.write(s) # write each text from an H2 element on a line.

myFile.close()
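
The original question also asked about comparing the 'reference' and 'test' results, which the thread never covers. What follows is a rough sketch of the whole round trip under stated assumptions: it swaps BeautifulSoup for the standard library's html.parser so that it runs without extra installs, and the names H2TextCollector, h2_texts and compare are made up for illustration. The comparison uses difflib.unified_diff from the standard library.

```python
import difflib
from html.parser import HTMLParser


class H2TextCollector(HTMLParser):
    """Collect the text inside each <h2> element, one string per element."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.parts = []
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
            self.parts = []

    def handle_data(self, data):
        if self.in_h2:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self.in_h2:
            self.headings.append(" ".join(self.parts).strip())
            self.in_h2 = False


def h2_texts(html):
    """Return the text of every <h2> element found in the given HTML."""
    collector = H2TextCollector()
    collector.feed(html)
    collector.close()
    return collector.headings


def compare(reference_html, test_html):
    """Return unified-diff lines; an empty list means the headings match."""
    return list(difflib.unified_diff(
        h2_texts(reference_html), h2_texts(test_html),
        fromfile="reference", tofile="test", lineterm=""))


reference = '<h2 class=r><a href="http://www.someplace.com/">Go Someplace</a></h2>'
changed = '<h2 class=r><a href="http://www.someplace.com/">Go Elsewhere</a></h2>'

print(compare(reference, reference))            # same input on both sides
print("\n".join(compare(reference, changed)))   # one heading differs
```

In real use the two HTML strings would come from reading the reference and test files; the inline strings here only keep the sketch self-contained.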

John Nagle
Mar 26 '07 #7
John Nagle <na***@animats.com> wrote:
htags = soup.findAll({'h2':True, 'H2' : True}) # get all H2 tags, both cases
Have you been bitten by this? When I read this, I was operating under
the assumption that BeautifulSoup wasn't case sensitive, and then I
tried this:
>>> import BeautifulSoup as BS
>>> soup = BS.BeautifulSoup('<b>one</b><B>two</B>')
>>> soup.findAll('b')
[<b>one</b>, <b>two</b>]
>>> soup.findAll({'b':True})
[<b>one</b>, <b>two</b>]
>>>
So I am a little curious.
max
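
The session above fits with HTML tag names being case-insensitive. As a cross-check that does not depend on BeautifulSoup, the sketch below feeds the same markup to the standard library's html.parser (a different parser, used here only for illustration) and records the tag names it reports; TagNameRecorder is a made-up name.

```python
from html.parser import HTMLParser


class TagNameRecorder(HTMLParser):
    """Record the name of every start tag exactly as the parser reports it."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.names.append(tag)


recorder = TagNameRecorder()
recorder.feed("<b>one</b><B>two</B>")
recorder.close()
print(recorder.names)
```

Both tags are reported as 'b', because html.parser converts tag names to lower case before calling the handler.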

Mar 26 '07 #8
