How to find <tag> to </tag> HTML strings and 'save' them?

Hi All,

Apologies for the newbie question but I've searched and tried all
sorts for a few days and I'm pulling my hair out ;[

I have a 'reference' HTML file and a 'test' HTML file from which I
need to pull 10 strings, all of which are contained within <h2> tags,
e.g.:
<h2 class=r><a href="http://www.someplace.com/">Go Someplace</a></h2>

Once I've found the 10 I'd like to write them to another 'results'
HTML file, perhaps a 'reference results' and a 'test results' file.
From there I would then like to 'diff' the results to see if they
match.

Here's the rub: I cannot find a way to pull those 10 strings so I can
save them to the results pages.
Can anyone please suggest how this can be done?

I've tried all sorts, but I've been learning Python for 1 week and just
don't know enough to modify example scripts, it seems. Don't even get me
started on the Python docs.. ayaa ;] Please feel free to teach me to suck
eggs because it's all new to me :)

Thanks in advance,

Mark.

Mar 25 '07 #1


On Mar 25, 2007, at 12:04 PM, ma**@agtechnical.co.uk wrote:
> don't even get me
> started on python docs.. ayaa ;]

ok, try getting started with this then:
http://www.crummy.com/software/BeautifulSoup/
Mar 25 '07 #2

Take a look at BeautifulSoup. It is easy to use and works well with some
malformed HTML that you might find ahead.

--
Jorge Godoy <jg****@gmail.com>
Mar 25 '07 #3

Great, thanks so much for posting that. It's worked a treat and I'm
getting HTML files with the list of h2 tags I was looking for. Here's
the code just to share, what a relief :) :
................................
from BeautifulSoup import BeautifulSoup
import re

page = open("soup_test/tomatoandcream.html", 'r')
soup = BeautifulSoup(page)

myTagSearch = str(soup.findAll('h2'))

myFile = open('Soup_Results.html', 'w')
myFile.write(myTagSearch)
myFile.close()

del myTagSearch
................................

I do have two other small queries that I wonder if anyone can help
with.

Firstly, I'm getting a "[" at the start and a "]" at the end of the
output, along with a "," between each tag in the listing.
This seems like normal behaviour but I can't find a way to strip
them out.

There's an example of stripping comments and I understand the example,
but what's the *reference* to the above '[', ']' and ',' elements?
For the comma I tried:
soup.find(text=",").replaceWith("")

but that throws this error:
AttributeError: 'NoneType' object has no attribute 'replaceWith'

Again working with the 'Removing Elements' example I tried:
soup = BeautifulSoup("you are a banana, banana, banana")
a = str(",")
comments = soup.findAll(text=",")
[",".extract() for "," in comments]
But if I use 'import beautifulSoup', the line
soup = BeautifulSoup("you are a banana, banana, banana")
gives me this error:
TypeError: 'module' object is not callable
and "import beautifulSoup from BeautifulSoup" does nothing.
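
The TypeError described above is what Python raises when a module is called as if it were a class. A minimal sketch of the same situation, using the standard library's datetime as a stand-in (an assumption made for illustration, because it also has a module and a class that share one name, much like the BeautifulSoup module and its BeautifulSoup class):

```python
import datetime  # binds the name "datetime" to the module, not to the class inside it

try:
    datetime(2007, 3, 25)  # calls the module itself, which is not callable
except TypeError as exc:
    error_message = str(exc)  # contains "'module' object is not callable"

# Option 1: reach the class through the module name.
d1 = datetime.datetime(2007, 3, 25)

# Option 2: import the class directly, the same shape as
# "from BeautifulSoup import BeautifulSoup" in the working script.
from datetime import datetime as DateTime
d2 = DateTime(2007, 3, 25)
```

Note that the keyword order matters: "from X import Y" is valid Python, whereas "import Y from X" is a syntax error.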

Secondly, in the above working code that is just pulling the h2 tags -
how the blazes do I 'prettify' before writing to the file?

Thanks in advance!

Mark.

Mar 25 '07 #4

On Sun, 25 Mar 2007 19:44:17 -0300, <ma**@agtechnical.co.uk> wrote:

> Firstly, I'm getting the following character: "[" at the start, "]" at
> the end of the code. Along with "," in between each tag line listing.
> This seems like normal behaviour but I can't find the way to strip
> them out.

findAll() returns a list. You convert the list to its string
representation, using str(...), and that's what lists look like: with
[] around, and commas separating elements. If you don't like that, don't
use str(some_list).
Do you want one item per line? Use "\n".join(myTagSearch) (remember to
remove the str() around findAll).
Do you want comma-separated items? Use ",".join(myTagSearch)
Read about lists here http://docs.python.org/lib/typesseq.html and strings
here http://docs.python.org/lib/string-methods.html
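
The difference between str(some_list) and join() can be sketched without BeautifulSoup at all. In this minimal example, plain strings stand in for the items that findAll() returns (an assumption made so the sketch is self-contained). join() only accepts strings, so with real tag objects each item would probably need str() applied first, as the last two lines show for integers:

```python
# Plain strings stand in for the results of findAll() (illustrative assumption).
tags = ['<h2>One</h2>', '<h2>Two</h2>', '<h2>Three</h2>']

# str() on a list gives its representation: brackets, quotes and commas included.
as_list_repr = str(tags)

# join() glues the items together with the chosen separator and nothing else.
one_per_line = "\n".join(tags)
comma_separated = ",".join(tags)

# join() raises TypeError on non-string items, so convert each one first.
numbers = [1, 2, 3]
joined_numbers = "\n".join(str(n) for n in numbers)
```

Writing one_per_line to the results file instead of as_list_repr gives one heading per line, with no "[", "]" or "," characters.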

For the remaining questions, I strongly suggest reading the Python
Tutorial (or any other book like Dive into Python). You should grasp some
basic knowledge of the language at least, before trying to use other tools
like BeautifulSoup; it's too much for a single step.

--
Gabriel Genellina

Mar 25 '07 #5

Yep, I agree! Once I've got this done I'll be back to trawling the
tutorials.
Life never gives you the convenience of learning something fully
before having to apply what you have learnt ;]

Thanks for the feedback and links, I'll be sure to check those out.

Mark.


Mar 26 '07 #6

ma**@agtechnical.co.uk wrote:

> Firstly, I'm getting the following character: "[" at the start, "]" at
> the end of the code. Along with "," in between each tag line listing.
> This seems like normal behaviour but I can't find the way to strip
> them out.

Ah. What you want is more like this:

from BeautifulSoup import BeautifulSoup  # same import as the script earlier in the thread

page = open("soup_test/tomatoandcream.html", 'r')
soup = BeautifulSoup(page)
htags = soup.findAll({'h2':True, 'H2' : True}) # get all H2 tags, both cases

myFile = open('Soup_Results.html', 'w')

for htag in htags : # for each H2 tag
    texts = htag.findAll(text=True) # find all text items within this h2
    s = ' '.join(texts).strip() + '\n' # combine text items into clean string
    myFile.write(s) # write each text from an H2 element on a line.

myFile.close()
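
For comparison, here is a rough sketch of the whole workflow from the original question (pull the h2 texts from a 'reference' and a 'test' page, then diff them) using only the standard library's html.parser and difflib rather than BeautifulSoup. This is a swapped-in technique, not the method from the code above. The class H2TextCollector, the function h2_texts and the two sample HTML strings are hypothetical names and data made up for illustration, and the stdlib parser may be less forgiving of malformed HTML than BeautifulSoup:

```python
import difflib
from html.parser import HTMLParser


class H2TextCollector(HTMLParser):
    """Hypothetical helper: collects the text inside each <h2> element."""

    def __init__(self):
        super().__init__()
        self.headings = []   # one cleaned string per <h2>
        self._parts = None   # text pieces of the <h2> being read, else None

    def handle_starttag(self, tag, attrs):
        if tag == "h2":      # html.parser passes tag names in lower case
            self._parts = []

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._parts is not None:
            # Collapse runs of whitespace so each heading is one clean line.
            self.headings.append(" ".join("".join(self._parts).split()))
            self._parts = None


def h2_texts(html):
    """Return the text of every <h2> in the given HTML string."""
    collector = H2TextCollector()
    collector.feed(html)
    collector.close()
    return collector.headings


# Made-up sample pages; real use would read the two HTML files instead.
reference = ('<h2 class=r><a href="http://www.someplace.com/">Go Someplace</a></h2>'
             '<h2>Second</h2>')
test = ('<h2 class=r><a href="http://www.someplace.com/">Go Someplace</a></h2>'
        '<h2>Changed</h2>')

ref_lines = h2_texts(reference)
test_lines = h2_texts(test)

# An empty diff means the two pages have the same headings in the same order.
diff = list(difflib.unified_diff(ref_lines, test_lines,
                                 "reference", "test", lineterm=""))
print("\n".join(diff))
```

Comparing the two lists directly (ref_lines == test_lines) would answer "do they match?" on its own; the diff is only needed to see where they differ.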

John Nagle
Mar 26 '07 #7

John Nagle <na***@animats.com> wrote:

> htags = soup.findAll({'h2':True, 'H2' : True}) # get all H2 tags, both cases

Have you been bitten by this? When I read this, I was operating under
the assumption that BeautifulSoup wasn't case sensitive, and then I
tried this:

>>> import BeautifulSoup as BS
>>> soup=BS.BeautifulSoup('<b>one</b><B>two</B>')
>>> soup.findAll('b')
[<b>one</b>, <b>two</b>]
>>> soup.findAll({'b':True})
[<b>one</b>, <b>two</b>]
>>>

So I am a little curious.
max
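
HTML tag names are case-insensitive, and parsers commonly normalise them to lower case, which would be consistent with the session above. The sketch below shows this for the standard library's html.parser only. It says nothing definite about BeautifulSoup's internals, and TagNameRecorder is a hypothetical helper written for this example:

```python
from html.parser import HTMLParser


class TagNameRecorder(HTMLParser):
    """Hypothetical helper: records each start tag name as the parser reports it."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append(tag)


recorder = TagNameRecorder()
recorder.feed('<b>one</b><B>two</B>')
recorder.close()
print(recorder.seen)  # ['b', 'b']
```

With a parser that behaves this way, searching for 'b' finds both elements and a separate 'B' entry has nothing extra to match.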

Mar 26 '07 #8
