Bytes | Software Development & Data Engineering Community

Using Beautiful Soup to entangle bookmarks.html

Hi,

I'm trying to use the Beautiful Soup package to parse through the
"bookmarks.html" file which Firefox exports all your bookmarks into.
I've been struggling with the documentation, trying to figure out how to
extract all the URLs. Has anybody got a couple of longer examples using
Beautiful Soup I could play around with?

Thanks,
Martin.

Sep 7 '06 #1
Francach schrieb:
> Hi,
>
> I'm trying to use the Beautiful Soup package to parse through the
> "bookmarks.html" file which Firefox exports all your bookmarks into.
> I've been struggling with the documentation trying to figure out how to
> extract all the urls. Has anybody got a couple of longer examples using
> Beautiful Soup I could play around with?

Why do you use BeautifulSoup on that? It's generated content, and I
suppose it is well-formed, most probably even XML. So use a standard
parser here, or better yet something like lxml/ElementTree.

Diez
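Diez's lxml/ElementTree suggestion could be sketched with the standard library like this. The sample string is a made-up, well-formed toy stand-in for a bookmarks export; as Diez himself concedes later in the thread, real Firefox exports are often not well-formed, in which case this parse raises a `ParseError` and BeautifulSoup-style tolerant parsing is needed after all.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a bookmarks export fragment;
# a real bookmarks.html usually will NOT parse as XML.
sample = '<dl><dt><a href="http://example.com/">Example</a></dt></dl>'

root = ET.fromstring(sample)
# Collect the href attribute of every <a> element in the tree:
urls = [a.get('href') for a in root.iter('a')]
```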
Sep 7 '06 #2

Diez B. Roggisch wrote:
> suppose it is well-formed, most probably even XML.
Maybe not. Otherwise, why would there be a script like this one[1]?
Anyway, I found that and other scripts that work with firefox
bookmarks.html files with a quick search [2]. Perhaps you will find
something there that is helpful.

[1]:
http://www.physic.ut.ee/~kkannike/en...e/bookmarks.py
[2]: http://www.google.com/search?q=firef...ks.html+python

Waylan

Sep 7 '06 #3
Diez B. Roggisch wrote:
> Francach schrieb:
>> Hi,
>>
>> I'm trying to use the Beautiful Soup package to parse through the
>> "bookmarks.html" file which Firefox exports all your bookmarks into.
>> I've been struggling with the documentation trying to figure out how to
>> extract all the urls. Has anybody got a couple of longer examples using
>> Beautiful Soup I could play around with?
>
> Why do you use BeautifulSoup on that? It's generated content, and I
> suppose it is well-formed, most probably even XML. So use a standard
> parser here, or better yet something like lxml/ElementTree.
>
> Diez
Once upon a time I wrote, for my own purposes, some code on this
subject, so maybe it can serve as a starter (tested a bit, but
consider its status a kind of alpha release):

<code>
from urllib import urlopen
from sgmllib import SGMLParser

class mySGMLParserClassProvidingListOf_HREFs(SGMLParser):
    # Provides only the HREFs of <a href="someURL"> links to other
    # pages, skipping references to:
    #   - internal links on the same page: "#..."
    #   - email addresses: "mailto:..."
    # and stripping any appended internal-link fragment, so that e.g.
    # "LinkSpec#internalLinkID" will be listed as "LinkSpec" only.

    # reset() overrides an empty method of the SGMLParser base class:
    def reset(self):
        SGMLParser.reset(self)
        self.A_HREFs = []

    # start_a() overrides an empty method of the SGMLParser base class
    # from which this class is derived. start_a() is called each time
    # SGMLParser detects an <a ...> tag within the feed(ed) HTML document:
    def start_a(self, tagAttributes_asListOfNameValuePairs):
        for attrName, attrValue in tagAttributes_asListOfNameValuePairs:
            if attrName == 'href':
                # [:1] instead of [0] so an empty href can't raise:
                if attrValue[:1] != '#' and attrValue[:7] != 'mailto:':
                    if attrValue.find('#') >= 0:
                        attrValue = attrValue[:attrValue.find('#')]
                    self.A_HREFs.append(attrValue)

# ------------------------------------------------------------------
# Execution block:
fileLikeObjFrom_urlopen = urlopen('http://www.google.com')  # set URL
mySGMLParserClassObj_withListOfHREFs = mySGMLParserClassProvidingListOf_HREFs()
mySGMLParserClassObj_withListOfHREFs.feed(fileLikeObjFrom_urlopen.read())
mySGMLParserClassObj_withListOfHREFs.close()
fileLikeObjFrom_urlopen.close()

for href in mySGMLParserClassObj_withListOfHREFs.A_HREFs:
    print href
</code>

Claudio Grondi
Sep 7 '06 #4
waylan schrieb:
> Diez B. Roggisch wrote:
>> suppose it is well-formed, most probably even XML.
>
> Maybe not. Otherwise, why would there be a script like this one[1]?
> Anyway, I found that and other scripts that work with firefox
> bookmarks.html files with a quick search [2]. Perhaps you will find
> something there that is helpful.

I have to admit: I didn't check on that file, and simply couldn't
believe it was as badly written as it apparently is.

But I was at least able to shove it through HTMLParser. I'm not
sure if that is of any use, though.

Excuse me for causing confusion.

Diez
Sep 7 '06 #5

Francach wrote:
> Hi,
>
> I'm trying to use the Beautiful Soup package to parse through the
> "bookmarks.html" file which Firefox exports all your bookmarks into.
> I've been struggling with the documentation trying to figure out how to
> extract all the urls. Has anybody got a couple of longer examples using
> Beautiful Soup I could play around with?
>
> Thanks,
> Martin.
If the only thing you want out of the document is the URLs, why not
search for href="..."? You could write a regular expression that
matches that pretty easily. I think this should just about get you
there, but my regular expressions have gotten very rusty:

/href=\".+\"/
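In modern Python, that regex idea might be sketched like this. The sample line and names are invented for illustration; a non-greedy `.+?` is used instead of the greedy `.+` above, since the greedy form would run from the first quote of the first href to the last quote on the line.

```python
import re

# Hypothetical sample of the lines Firefox writes into bookmarks.html:
sample = ('<DT><A HREF="http://example.com/a" ADD_DATE="0">A</A>'
          '<DT><A HREF="http://example.com/b" ADD_DATE="0">B</A>')

# Non-greedy variant of the suggested /href=".+"/ pattern; IGNORECASE
# because Firefox writes the attribute as HREF= in upper case.
urls = re.findall(r'href="(.+?)"', sample, re.IGNORECASE)
```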

Sep 7 '06 #6
On 7 Sep 2006 14:30:25 -0700, Adam Jones <aj*****@gmail.com> wrote:
> Francach wrote:
>> Hi,
>>
>> I'm trying to use the Beautiful Soup package to parse through the
>> "bookmarks.html" file which Firefox exports all your bookmarks into.
>> I've been struggling with the documentation trying to figure out how to
>> extract all the urls. Has anybody got a couple of longer examples using
>> Beautiful Soup I could play around with?
>>
>> Thanks,
>> Martin.
>
> If the only thing you want out of the document is the URLs, why not
> search for href="..."? You could write a regular expression that
> matches that pretty easily. I think this should just about get you
> there, but my regular expressions have gotten very rusty:
>
> /href=\".+\"/
I doubt the bookmarks file is huge, so something simple like

f = open('bookmarks.html').readlines()
data = [x for x in f if x.strip().startswith('<DT><A ')]

would get you started.

On my exported Firefox bookmarks, this gives me all the urls; they
just need to be parsed a bit more accurately. I might be tempted to
just use a couple of split()s to keep it real simple.
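The "couple of split()s" follow-up might look like this sketch. The helper name and sample line are invented for illustration; it assumes each kept line carries an HREF="..." attribute, as Firefox writes it.

```python
# Hypothetical helper: pull the URL out of one <DT><A HREF="..."> line.
def href_from_line(line):
    # Take everything after the first HREF=" up to the next quote.
    return line.split('HREF="', 1)[1].split('"', 1)[0]

line = '<DT><A HREF="http://example.com/" ADD_DATE="0">Example</A>'
url = href_from_line(line)
```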

HTH
--

Tim Williams
Sep 7 '06 #7
Francach wrote:
> Hi,
>
> I'm trying to use the Beautiful Soup package to parse through the
> "bookmarks.html" file which Firefox exports all your bookmarks into.
> I've been struggling with the documentation trying to figure out how to
> extract all the urls. Has anybody got a couple of longer examples using
> Beautiful Soup I could play around with?
>
> Thanks,
> Martin.

from BeautifulSoup import BeautifulSoup
urls = [tag['href'] for tag in
        BeautifulSoup(open('bookmarks.html')).findAll('a')]

Regards,
George

Sep 8 '06 #8
Hi,

thanks for the helpful reply.
I wanted to do two things: learn to use Beautiful Soup, and bring out
all the information in the bookmarks file to import into another
application. So I need to be able to travel down the tree in the
bookmarks file. bookmarks.html seems to use header tags which can then
contain <a> tags where the href attributes are. What I don't understand
is how to create objects which can then be used to return the
information in the next level of the tree.

Thanks again,
Martin.

George Sakkis wrote:
> Francach wrote:
>> Hi,
>>
>> I'm trying to use the Beautiful Soup package to parse through the
>> "bookmarks.html" file which Firefox exports all your bookmarks into.
>> I've been struggling with the documentation trying to figure out how to
>> extract all the urls. Has anybody got a couple of longer examples using
>> Beautiful Soup I could play around with?
>>
>> Thanks,
>> Martin.
>
> from BeautifulSoup import BeautifulSoup
> urls = [tag['href'] for tag in
>         BeautifulSoup(open('bookmarks.html')).findAll('a')]
>
> Regards,
> George
Sep 8 '06 #9
Francach wrote:
> George Sakkis wrote:
>> Francach wrote:
>>> Hi,
>>>
>>> I'm trying to use the Beautiful Soup package to parse through the
>>> "bookmarks.html" file which Firefox exports all your bookmarks into.
>>> I've been struggling with the documentation trying to figure out how to
>>> extract all the urls. Has anybody got a couple of longer examples using
>>> Beautiful Soup I could play around with?
>>>
>>> Thanks,
>>> Martin.
>>
>> from BeautifulSoup import BeautifulSoup
>> urls = [tag['href'] for tag in
>>         BeautifulSoup(open('bookmarks.html')).findAll('a')]
>
> Hi,
>
> thanks for the helpful reply.
> I wanted to do two things: learn to use Beautiful Soup, and bring out
> all the information in the bookmarks file to import into another
> application. So I need to be able to travel down the tree in the
> bookmarks file. bookmarks.html seems to use header tags which can then
> contain <a> tags where the href attributes are. What I don't understand
> is how to create objects which can then be used to return the
> information in the next level of the tree.
>
> Thanks again,
> Martin.
I'm not sure I understand what you want to do. Originally you asked to
extract all URLs, and BeautifulSoup can do this for you in one line. Why
do you care about intermediate objects, or whether the anchor tags are
nested under header tags or not? Read and embrace BeautifulSoup's philosophy:
"You didn't write that awful page. You're just trying to get some data
out of it. Right now, you don't really care what HTML is supposed to
look like."

George
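For the folder-by-folder traversal Martin asked about, here is a stdlib-only sketch in modern Python, using `html.parser` rather than BeautifulSoup; the `BookmarkWalker` class name and the sample string are invented. It relies on the layout Martin describes: Firefox writes each folder as an <H3> header followed by a <DL> list containing that folder's <A> entries.

```python
from html.parser import HTMLParser

class BookmarkWalker(HTMLParser):
    # Pairs each bookmark URL with the folder path it was found under.
    def __init__(self):
        super().__init__()
        self.folders = []   # stack of enclosing folder names
        self.entries = []   # (folder path, url) pairs
        self.in_h3 = False

    def handle_starttag(self, tag, attrs):
        if tag == 'h3':
            self.in_h3 = True       # next text chunk is a folder name
        elif tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.entries.append(('/'.join(self.folders), href))

    def handle_data(self, data):
        if self.in_h3 and data.strip():
            self.folders.append(data.strip())
            self.in_h3 = False

    def handle_endtag(self, tag):
        # Leaving a <DL> list means leaving the current folder.
        if tag == 'dl' and self.folders:
            self.folders.pop()

# Hypothetical miniature of a Firefox export:
sample = ('<DL><DT><H3>News</H3>'
          '<DL><DT><A HREF="http://example.org/">Example</A></DL></DL>')
walker = BookmarkWalker()
walker.feed(sample)
```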

Sep 8 '06 #10

This thread has been closed and replies have been disabled. Please start a new discussion.
