Great, thanks so much for posting that. It's worked a treat and I'm
getting HTML files with the list of h2 tags I was looking for. Here's
the code just to share, what a relief :) :
................................
from BeautifulSoup import BeautifulSoup
import re
page = open("soup_test/tomatoandcream.html", 'r')
soup = BeautifulSoup(page)
myTagSearch = str(soup.findAll('h2'))
myFile = open('Soup_Results.html', 'w')
myFile.write(myTagSearch)
myFile.close()
del myTagSearch
................................
I do have two other small queries that I wonder if anyone can help
with.
Firstly, I'm getting the following characters in the output: "[" at the
start, "]" at the end, and "," between each tag listing.
This seems like normal behaviour but I can't find a way to strip
them out.
There's an example of stripping comments and I understand the example,
but what's the *reference* to the above '[', ']' and ',' elements?
For the comma I tried:
soup.find(text=",").replaceWith("")
but that throws this error:
AttributeError: 'NoneType' object has no attribute 'replaceWith'
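The brackets and commas are most likely not part of the parsed page at
all: str() applied to the list that findAll returns produces the list's
own text form, which adds "[", "]" and ", ". That would also explain the
AttributeError, since a search for text="," finds nothing in the soup and
returns None. A minimal sketch of the effect, using a hypothetical
StandInTag class in place of real BeautifulSoup tags (an assumption made
so the example runs without the library):

```python
# Hypothetical stand-in for a parsed tag: like a BeautifulSoup tag, its
# repr() is simply its markup. No real parsing happens here.
class StandInTag:
    def __init__(self, markup):
        self.markup = markup

    def __repr__(self):
        return self.markup


tags = [StandInTag("<h2>First</h2>"), StandInTag("<h2>Second</h2>")]

# str() on a list wraps the items in "[...]" and separates them with ", ".
as_list_text = str(tags)

# Converting and joining the items one by one leaves only the markup.
joined_text = "\n".join(str(tag) for tag in tags)

print(as_list_text)
print(joined_text)
```

In the working code above, the same idea would mean joining str(tag) for
each tag in soup.findAll('h2') rather than calling str() on the whole
result.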
Again working with the 'Removing Elements' example I tried:
soup = BeautifulSoup("you are a banana, banana, banana")
a = str(",")
comments = soup.findAll(text=",")
[",".extract() for "," in comments]
But if I do 'import beautifulSoup', this gives me the following error:
soup = BeautifulSoup("you are a banana, banana, banana")
TypeError: 'module' object is not callable
and "import beautifulSoup from BeautifulSoup" does nothing.
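The "'module' object is not callable" message usually means a name is
bound to a module rather than to the class inside it. The working code at
the top avoids this with "from BeautifulSoup import BeautifulSoup"; the
reversed form "import X from Y" is not valid Python syntax. A small
sketch of the same situation, using the standard-library datetime module
as a stand-in (an assumption, chosen only because the module and its
class share a name, as BeautifulSoup's do):

```python
import datetime  # binds the name "datetime" to the module

try:
    datetime(2000, 1, 1)  # calling the module itself fails
except TypeError as error:
    module_call_error = str(error)

from datetime import datetime  # rebinds the name to the class in the module

when = datetime(2000, 1, 1)  # now the call reaches the class and succeeds

print(module_call_error)
print(when.year)
```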
Secondly, in the above working code that is just pulling the h2 tags -
how the blazes do I 'prettify' before writing to the file?
Thanks in advance!
Mark.
...................
On Mar 25, 6:51 pm, Jorge Godoy <jgo...@gmail.com> wrote:
m...@agtechnical.co.uk writes:
Hi All,
Apologies for the newbie question but I've searched and tried all
sorts for a few days and I'm pulling my hair out ;[
I have a 'reference' HTML file and a 'test' HTML file from which I
need to pull 10 strings, all of which are contained within <h2> tags,
e.g.:
<h2 class=r><a href="http://www.someplace.com/">Go Someplace</a></h2>
Once I've found the 10 I'd like to write them to another 'results'
html file. Perhaps a 'reference results' and a 'test results' file.
From where I would then like to 'diff' the results to see if they
match.
Here's the rub: I cannot find a way to pull those 10 strings so I can
save them to the results pages.
Can anyone please suggest how this can be done?
I've tried all sorts but I've been learning Python for 1 week and just
don't know enough to mod example scripts, it seems. Don't even get me
started on the Python docs.. ayaa ;] Please feel free to teach me to suck
eggs because it's all new to me :)
Thanks in advance,
Mark.
Take a look at BeautifulSoup. It is easy to use and works well with some
malformed HTML that you might find ahead.
--
Jorge Godoy <jgo...@gmail.com>