How to find <tag> to </tag> HTML strings and 'save' them?

mark

Hi All,

Apologies for the newbie question but I've searched and tried all
sorts for a few days and I'm pulling my hair out ;[

I have a 'reference' HTML file and a 'test' HTML file from which I
need to pull 10 strings, all of which are contained within <h2> tags,
e.g.:
<h2 class=r><a href="http://www.someplace.com/">Go Someplace</a></h2>

Once I've found the 10 I'd like to write them to another 'results'
html file. Perhaps a 'reference results' and a 'test results' file.

From where I would then like to 'diff' the results to see if they
match.

Here's the rub: I cannot find a way to pull those 10 strings so I can
save them to the results pages.
Can anyone please suggest how this can be done?

I've tried all sorts but I've been learning Python for 1 week and just
don't know enough to mod example scripts, it seems. Don't even get me
started on python docs.. ayaa ;] Please feel free to teach me to suck
eggs because it's all new to me :)

Thanks in advance,

Mark.

Mar 25 '07 #1

Michael Bentley

On Mar 25, 2007, at 12:04 PM, ma**@agtechnical.co.uk wrote:

don't even get me
started on python docs.. ayaa ;]

ok, try getting started with this then:
http://www.crummy.com/software/BeautifulSoup/

Mar 25 '07 #2

Jorge Godoy

Take a look at BeautifulSoup. It is easy to use and works well with some
malformed HTML that you might find ahead.

--
Jorge Godoy <jg****@gmail.com>

Mar 25 '07 #3

mark

Great, thanks so much for posting that. It's worked a treat and I'm
getting HTML files with the list of h2 tags I was looking for. Here's
the code just to share, what a relief :) :
................................
from BeautifulSoup import BeautifulSoup
import re

page = open("soup_test/tomatoandcream.html", 'r')
soup = BeautifulSoup(page)

myTagSearch = str(soup.findAll('h2'))

myFile = open('Soup_Results.html', 'w')
myFile.write(myTagSearch)
myFile.close()

del myTagSearch
................................

I do have two other small queries that I wonder if anyone can help
with.

Firstly, I'm getting the following character: "[" at the start, "]" at
the end of the code. Along with "," in between each tag line listing.
This seems like normal behaviour but I can't find the way to strip
them out.

There's an example of stripping comments and I understand the example,
but what's the *reference* to the above '[', ']' and ',' elements?
for the comma I tried:
soup.find(text=",").replaceWith("")

but that throws this error:
AttributeError: 'NoneType' object has no attribute 'replaceWith'

Again working with the 'Removing Elements' example I tried:
soup = BeautifulSoup("you are a banana, banana, banana")
a = str(",")
comments = soup.findAll(text=",")
[",".extract() for "," in comments]
But if I do 'import BeautifulSoup', this gives me the error
"TypeError: 'module' object is not callable" on the line
soup = BeautifulSoup("you are a banana, banana, banana"),
and "import beautifulSoup from BeautifulSoup" does nothing.

Secondly, in the above working code that is just pulling the h2 tags -
how the blazes do I 'prettify' before writing to the file?

Thanks in advance!

Mark.

Mar 25 '07 #4

Gabriel Genellina

On Sun, 25 Mar 2007 19:44:17 -0300, <ma**@agtechnical.co.uk> wrote:

Firstly, I'm getting the following character: "[" at the start, "]" at
the end of the code. Along with "," in between each tag line listing.
This seems like normal behaviour but I can't find the way to strip
them out.

findAll() returns a list. You convert the list to its string
representation, using str(...), and that's what lists look like: with
[] around, and commas separating elements. If you don't like that, don't
use str(some_list).
Do you want one item per line? Use "\n".join(str(tag) for tag in myTagSearch)
(remember to remove the str() around findAll; join() needs strings, so
each tag is converted on its own).
Do you want comma-separated items? Use ",".join(str(tag) for tag in myTagSearch)
Read about lists here http://docs.python.org/lib/typesseq.html and strings
here http://docs.python.org/lib/string-methods.html
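
The difference between calling str() on a whole list and joining its items can be sketched without BeautifulSoup at all. This is only an illustration: the `tags` list below is made-up data standing in for whatever findAll() returns, and the variable names are hypothetical.

```python
# Stand-in strings for the tags that findAll() would return (made-up data).
tags = ['<h2 class="r">Go Someplace</h2>', '<h2 class="r">Go Elsewhere</h2>']

# str() on the list itself gives the list's repr: brackets, quotes, commas.
as_list_repr = str(tags)

# Converting each item and joining gives only the items and the separator.
one_per_line = "\n".join(str(t) for t in tags)
comma_separated = ",".join(str(t) for t in tags)

print(as_list_repr)
print(one_per_line)
print(comma_separated)
```

The str(t) call is a no-op for plain strings, but it is kept here because the real items would be tag objects, which join() would likely refuse without that conversion.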

For the remaining questions, I strongly suggest reading the Python
Tutorial (or any other book like Dive into Python). You should grasp some
basic knowledge of the language at least, before trying to use other tools
like BeautifulSoup; it's too much for a single step.

--
Gabriel Genellina

Mar 25 '07 #5

Mark Crowther

Yep, I agree! Once I've got this done I'll be back to trawling the
tutorials.
Life never gives you the convenience of learning something fully
before having to apply what you have learnt ;]

Thanks for the feedback and links, I'll be sure to check those out.

Mark.

Mar 26 '07 #6

John Nagle

ma**@agtechnical.co.uk wrote:

Firstly, I'm getting the following character: "[" at the start, "]" at
the end of the code. Along with "," in between each tag line listing.
This seems like normal behaviour but I can't find the way to strip
them out.

Ah. What you want is more like this:

from BeautifulSoup import BeautifulSoup

page = open("soup_test/tomatoandcream.html", 'r')
soup = BeautifulSoup(page)
htags = soup.findAll({'h2':True, 'H2' : True}) # get all H2 tags, both cases

myFile = open('Soup_Results.html', 'w')

for htag in htags : # for each H2 tag
    texts = htag.findAll(text=True) # find all text items within this h2
    s = ' '.join(texts).strip() + '\n' # combine text items into clean string
    myFile.write(s) # write each text from an H2 element on a line.

myFile.close()
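
For anyone without BeautifulSoup installed, roughly the same extraction, plus the 'diff' step from the original question, might be sketched with only the standard library. This swaps in html.parser and difflib for BeautifulSoup, so it is not the code used in this thread; the class name `H2TextCollector`, the function `h2_texts`, and both sample HTML strings are hypothetical, and html.parser is less forgiving of badly malformed HTML.

```python
import difflib
from html.parser import HTMLParser


class H2TextCollector(HTMLParser):
    """Collect the text inside each <h2> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # > 0 while we are inside an <h2>
        self.parts = []      # text pieces of the <h2> being read
        self.headings = []   # one cleaned string per finished <h2>

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag names, so <H2> also arrives as "h2".
        if tag == "h2":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "h2" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                pieces = [p.strip() for p in self.parts if p.strip()]
                self.headings.append(" ".join(pieces))
                self.parts = []

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def h2_texts(html):
    """Return a list with the text of every <h2> in the given HTML string."""
    collector = H2TextCollector()
    collector.feed(html)
    collector.close()
    return collector.headings


# Made-up stand-ins for the 'reference' and 'test' pages.
reference_html = ('<h2 class=r><a href="http://www.someplace.com/">'
                  'Go Someplace</a></h2><H2>Second <b>one</b></H2>')
test_html = ('<h2 class=r><a href="http://www.someplace.com/">'
             'Go Someplace</a></h2><h2>Second two</h2>')

reference_results = h2_texts(reference_html)
test_results = h2_texts(test_html)

# unified_diff yields nothing when both lists match.
diff_lines = list(difflib.unified_diff(
    reference_results, test_results,
    fromfile="reference", tofile="test", lineterm=""))

print("\n".join(diff_lines) if diff_lines else "results match")
```

In real use the two HTML strings would come from reading the reference and test files, and the two result lists could be written out one heading per line before diffing.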

John Nagle

Mar 26 '07 #7

Max Erickson

John Nagle <na***@animats.com> wrote:

htags = soup.findAll({'h2':True, 'H2' : True}) # get all H2 tags, both cases

Have you been bitten by this? When I read this, I was operating under
the assumption that BeautifulSoup wasn't case sensitive, and then I
tried this:

>>> import BeautifulSoup as BS
>>> soup=BS.BeautifulSoup('<b>one</b><B>two</B>')
>>> soup.findAll('b')
[<b>one</b>, <b>two</b>]
>>> soup.findAll({'b':True})
[<b>one</b>, <b>two</b>]
>>>

So I am a little curious.
max

Mar 26 '07 #8
