Using Beautiful Soup to entangle bookmarks.html

Francach

Hi,

I'm trying to use the Beautiful Soup package to parse through the
"bookmarks.html" file which Firefox exports all your bookmarks into.
I've been struggling with the documentation trying to figure out how to
extract all the urls. Has anybody got a couple of longer examples using
Beautiful Soup I could play around with?

Thanks,
Martin.

Sep 7 '06 #1


Diez B. Roggisch

Francach wrote:

> Hi,
>
> I'm trying to use the Beautiful Soup package to parse through the
> "bookmarks.html" file which Firefox exports all your bookmarks into.
> I've been struggling with the documentation trying to figure out how to
> extract all the urls. Has anybody got a couple of longer examples using
> Beautiful Soup I could play around with?

Why do you use BeautifulSoup on that? It's generated content, and I
suppose it is well-formed, most probably even XML. So use a standard
parser here, or better yet something like lxml/ElementTree.

Diez

Sep 7 '06 #2

waylan

Diez B. Roggisch wrote:

> suppose it is well-formed, most probably even xml.

Maybe not. Otherwise, why would there be a script like this one[1]?
Anyway, I found that and other scripts that work with firefox
bookmarks.html files with a quick search [2]. Perhaps you will find
something there that is helpful.

[1]: http://www.physic.ut.ee/~kkannike/en...e/bookmarks.py
[2]: http://www.google.com/search?q=firef...ks.html+python

Waylan

Sep 7 '06 #3

Claudio Grondi

Diez B. Roggisch wrote:

> Francach wrote:
>
>> Hi,
>>
>> I'm trying to use the Beautiful Soup package to parse through the
>> "bookmarks.html" file which Firefox exports all your bookmarks into.
>> I've been struggling with the documentation trying to figure out how to
>> extract all the urls. Has anybody got a couple of longer examples using
>> Beautiful Soup I could play around with?
>
> Why do you use BeautifulSoup on that? It's generated content, and I
> suppose it is well-formed, most probably even xml. So use a standard
> parser here, better yet something like lxml/elementtree
>
> Diez

I once wrote some code on this subject for my own purposes, so maybe
it can serve as a starting point (it has been tested a bit, but treat
it as a kind of alpha release):

<code>
# Python 2 code: sgmllib is not available in Python 3.
from urllib import urlopen
from sgmllib import SGMLParser


class mySGMLParserClassProvidingListOf_HREFs(SGMLParser):
    # Provides only the HREFs of <a href="someURL"> links to other pages,
    # skipping references to:
    #   - internal links on the same page: "#..."
    #   - email addresses: "mailto:..."
    # and dropping any appended internal link info, so that e.g.
    #   - "LinkSpec#internalLinkID" is listed as "LinkSpec" only.

    # reset() extends SGMLParser.reset(), which SGMLParser.__init__() calls.
    def reset(self):
        SGMLParser.reset(self)
        self.A_HREFs = []

    # SGMLParser looks up a start_a() method and calls it each time it
    # detects an <a ...> tag within the feed()ed HTML document.
    def start_a(self, tagAttributes_asListOfNameValuePairs):
        for attrName, attrValue in tagAttributes_asListOfNameValuePairs:
            if attrName != 'href' or not attrValue:
                continue  # also guards against an empty href=""
            if attrValue[0] != '#' and attrValue[:7] != 'mailto:':
                if attrValue.find('#') >= 0:
                    attrValue = attrValue[:attrValue.find('#')]
                self.A_HREFs.append(attrValue)


# Execution block:
# urlopen() needs the scheme; a bare 'www.google.com' is treated as a file.
fileLikeObjFrom_urlopen = urlopen('http://www.google.com')  # set URL
mySGMLParserClassObj_withListOfHREFs = mySGMLParserClassProvidingListOf_HREFs()
mySGMLParserClassObj_withListOfHREFs.feed(fileLikeObjFrom_urlopen.read())
mySGMLParserClassObj_withListOfHREFs.close()
fileLikeObjFrom_urlopen.close()

for href in mySGMLParserClassObj_withListOfHREFs.A_HREFs:
    print href
</code>
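
For anyone on Python 3, where sgmllib no longer exists, the same idea can be sketched with the standard library's html.parser. This is a rough port under stated assumptions, not the code above: the class name HrefCollector, the helper extract_hrefs and the SAMPLE string are made up for illustration, and the sample only imitates the layout of an exported bookmarks file.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values of <a> tags, skipping '#...' and 'mailto:' links."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names in lower case.
        if tag != "a":
            return
        for name, value in attrs:
            if name != "href" or not value:
                continue
            if value.startswith("#") or value.startswith("mailto:"):
                continue
            # Drop a trailing "#fragment", keeping only the page address.
            self.hrefs.append(value.split("#", 1)[0])


def extract_hrefs(html_text):
    """Hypothetical helper: feed a whole document and return the hrefs."""
    parser = HrefCollector()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs


# Made-up sample imitating the bookmarks layout (assumption).
SAMPLE = """<DL><p>
<DT><A HREF="http://www.python.org/#top">Python</A>
<DT><A HREF="mailto:someone@example.com">Mail</A>
<DT><A HREF="#local">Local</A>
</DL><p>"""

print(extract_hrefs(SAMPLE))
```

To run it on a real export, read the file into a string first and pass that to extract_hrefs().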

Claudio Grondi

Sep 7 '06 #4

Diez B. Roggisch

waylan wrote:

> Diez B. Roggisch wrote:
>> suppose it is well-formed, most probably even xml.
>
> Maybe not. Otherwise, why would there be a script like this one[1]?
> Anyway, I found that and other scripts that work with firefox
> bookmarks.html files with a quick search [2]. Perhaps you will find
> something there that is helpful.

I have to admit that I didn't check the file, and I simply couldn't
believe it was as badly written as it apparently is.

I was at least able to push it through HTMLParser, but I'm not sure
whether that is of any use.

Sorry for causing confusion.

Diez

Sep 7 '06 #5

Adam Jones

Francach wrote:

> Hi,
>
> I'm trying to use the Beautiful Soup package to parse through the
> "bookmarks.html" file which Firefox exports all your bookmarks into.
> I've been struggling with the documentation trying to figure out how to
> extract all the urls. Has anybody got a couple of longer examples using
> Beautiful Soup I could play around with?
>
> Thanks,
> Martin.

If the only thing you want out of the document is the URLs, why not
search for href="..."? You could write a regular expression that
matches that pretty easily. I think this should just about get you
there, but my regular expressions have gotten very rusty.

/href=\".+\"/
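
One caveat with the pattern as written: `.+` is greedy, so on a line that carries more quoted attributes after the href it would run on to the last quote. A minimal Python sketch of a tighter version follows. The names HREF_RE and find_hrefs and the sample line are made up for illustration, and it assumes the href value is always wrapped in double quotes.

```python
import re

# [^"]* stops at the first closing quote instead of the last one.
# IGNORECASE because exported bookmark files tend to write HREF in capitals
# (an assumption about the input, not something the pattern requires).
HREF_RE = re.compile(r'href="([^"]*)"', re.IGNORECASE)


def find_hrefs(text):
    """Hypothetical helper: return every double-quoted href value in text."""
    return HREF_RE.findall(text)


# Made-up sample line with a second quoted attribute after the href.
line = '<DT><A HREF="http://www.python.org/" ADD_DATE="1157600000">Python</A>'
print(find_hrefs(line))
```

A regular expression like this is fine for pulling out a flat list of URLs, but it knows nothing about nesting, so it will not recover folder structure.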

Sep 7 '06 #6

Tim Williams

On 7 Sep 2006 14:30:25 -0700, Adam Jones <aj*****@gmail.com> wrote:

> Francach wrote:
>> Hi,
>>
>> I'm trying to use the Beautiful Soup package to parse through the
>> "bookmarks.html" file which Firefox exports all your bookmarks into.
>> I've been struggling with the documentation trying to figure out how to
>> extract all the urls. Has anybody got a couple of longer examples using
>> Beautiful Soup I could play around with?
>>
>> Thanks,
>> Martin.
>
> If the only thing you want out of the document is the URL's why not
> search for: href="..." ? You could get a regular expression that
> matches that pretty easily. I think this should just about get you
> there, but my regular expressions have gotten very rusty.
>
> /href=\".+\"/

I doubt the bookmarks file is huge so something simple like

f = open('bookmarks.html').readlines()
data = [x for x in f if x.strip().startswith('<DT><A ')]

would get you started.

On my exported Firefox bookmarks, this gives me every line that holds a
URL; the URLs just need to be parsed out a bit more accurately. I might
be tempted to use a couple of split() calls to keep it really simple.
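
A minimal sketch of that split() idea, assuming each bookmark line writes the attribute exactly as HREF="..." (the function name urls_from_lines and the sample lines are made up for illustration):

```python
def urls_from_lines(lines):
    """Return the HREF value of every line that starts with '<DT><A '."""
    urls = []
    for line in lines:
        if not line.strip().startswith('<DT><A '):
            continue
        # First split: everything after HREF=" ; second split: up to the
        # closing quote. Assumes upper-case HREF and double quotes.
        parts = line.split('HREF="', 1)
        if len(parts) == 2:
            urls.append(parts[1].split('"', 1)[0])
    return urls


# Made-up lines imitating an exported bookmarks file (assumption).
sample = [
    '<DT><H3 ADD_DATE="1157600000">Python</H3>\n',
    '    <DT><A HREF="http://www.python.org/" ADD_DATE="1157600001">Python</A>\n',
    '<DD>A description line\n',
]
print(urls_from_lines(sample))
```

On a real file, the list returned by readlines() can be passed in directly.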

HTH
--

Tim Williams

Sep 7 '06 #7

George Sakkis

Francach wrote:

> Hi,
>
> I'm trying to use the Beautiful Soup package to parse through the
> "bookmarks.html" file which Firefox exports all your bookmarks into.
> I've been struggling with the documentation trying to figure out how to
> extract all the urls. Has anybody got a couple of longer examples using
> Beautiful Soup I could play around with?
>
> Thanks,
> Martin.

from BeautifulSoup import BeautifulSoup
urls = [tag['href'] for tag in
BeautifulSoup(open('bookmarks.html')).findAll('a')]

Regards,
George

Sep 8 '06 #8

Francach

Hi,

Thanks for the helpful reply.
I wanted to do two things: learn to use Beautiful Soup, and pull all
the information out of the bookmarks file so that I can import it into
another application. So I need to be able to travel down the tree in
the bookmarks file. The file seems to use header tags, which can then
contain <a> tags holding the href attributes. What I don't understand
is how to create objects that can then be used to return the
information at the next level of the tree.

Thanks again,
Martin.


Sep 8 '06 #9

George Sakkis

Francach wrote:

> George Sakkis wrote:
>> Francach wrote:
>>> Hi,
>>>
>>> I'm trying to use the Beautiful Soup package to parse through the
>>> "bookmarks.html" file which Firefox exports all your bookmarks into.
>>> I've been struggling with the documentation trying to figure out how to
>>> extract all the urls. Has anybody got a couple of longer examples using
>>> Beautiful Soup I could play around with?
>>>
>>> Thanks,
>>> Martin.
>>
>> from BeautifulSoup import BeautifulSoup
>> urls = [tag['href'] for tag in
>> BeautifulSoup(open('bookmarks.html')).findAll('a')]
>
> Hi,
>
> thanks for the helpful reply.
> I wanted to do two things - learn to use Beautiful Soup and bring out
> all the information
> in the bookmarks file to import into another application. So I need to
> be able to travel down the tree in the bookmarks file. bookmarks seems
> to use header tags which can then contain a tags where the href
> attributes are. What I don't understand is how to create objects which
> can then be used to return the information in the next level of the
> tree.
>
> Thanks again,
> Martin.

I'm not sure I understand what you want to do. Originally you asked how
to extract all URLs, and BeautifulSoup can do this for you in one line.
Why do you care about intermediate objects, or whether the anchor tags
are nested under header tags or not? Read and embrace BeautifulSoup's
philosophy:
"You didn't write that awful page. You're just trying to get some data
out of it. Right now, you don't really care what HTML is supposed to
look like."

George

Sep 8 '06 #10

Francach

Hi George,

Firefox lets you group bookmarks, along with other information, into
directories and sub-directories, and it uses header tags for this
purpose. The idea is to extract this grouping information as well.

Regards,
Martin.

Sep 8 '06 #11

Paul Boddie

Francach wrote:

> Firefox lets you group the bookmarks along with other information into
> directories and sub-directories. Firefox uses header tags for this
> purpose. I'd like to get this grouping information out as well.

import libxml2dom # http://www.python.org/pypi/libxml2dom
d = libxml2dom.parse("bookmarks.html", html=1)
for node in d.xpath("html/body//dt/*[1]"):
    if node.localName == "h3":
        print "Section:", node.nodeValue
    elif node.localName == "a":
        print "Link:", node.getAttribute("href")

One exercise, using the above code as a starting point, would be to
reproduce the hierarchy exactly, rather than just showing the section
names and the links which follow them. Ultimately, you may be looking
for a way to just convert the HTML into a simple XML document or into
another hierarchical representation which excludes the HTML baggage and
details irrelevant to your problem.
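
One possible take on that exercise is sketched below. It swaps libxml2dom for Python 3's standard html.parser, so it is not the approach shown above, and it rests on assumptions about the file layout: a folder is an <H3> followed by a <DL> holding its contents, and links are plain <A HREF="..."> tags. The names BookmarkTreeParser, dump and SAMPLE are made up for illustration.

```python
from html.parser import HTMLParser


class BookmarkTreeParser(HTMLParser):
    """Build a nested dict tree of folders and links in one pass."""

    def __init__(self):
        super().__init__()
        self.root = {"name": "root", "children": []}
        self._stack = [self.root]  # folders whose <dl> is currently open
        self._pending = None       # folder named by the last <h3>
        self._href = None
        self._text = None          # text chunks while inside <h3> or <a>

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._text = []
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []
        elif tag == "dl":
            # A <dl> right after an <h3> holds that folder's contents;
            # otherwise (the outermost <dl>) stay in the current folder.
            if self._pending is not None:
                self._stack.append(self._pending)
            else:
                self._stack.append(self._stack[-1])
            self._pending = None

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self._text is not None:
            folder = {"name": "".join(self._text).strip(), "children": []}
            self._stack[-1]["children"].append(folder)
            self._pending = folder
            self._text = None
        elif tag == "a" and self._text is not None:
            link = {"title": "".join(self._text).strip(), "href": self._href}
            self._stack[-1]["children"].append(link)
            self._text = None
        elif tag == "dl" and len(self._stack) > 1:
            self._stack.pop()


def dump(folder, depth=0):
    """Return the tree as indented text lines, folders in [brackets]."""
    lines = []
    for child in folder["children"]:
        if "children" in child:
            lines.append("  " * depth + "[" + child["name"] + "]")
            lines.extend(dump(child, depth + 1))
        else:
            lines.append("  " * depth + child["title"] + " -> " + str(child["href"]))
    return lines


# Made-up sample imitating the exported layout (assumption).
SAMPLE = """<DL><p>
    <DT><A HREF="http://www.python.org/">Python</A>
    <DT><H3>News</H3>
    <DL><p>
        <DT><A HREF="http://lwn.net/">LWN</A>
    </DL><p>
    <DT><A HREF="http://www.mozilla.org/">Mozilla</A>
</DL><p>"""

parser = BookmarkTreeParser()
parser.feed(SAMPLE)
parser.close()
print("\n".join(dump(parser.root)))
```

The stack mirrors the open <DL> elements, which is what keeps this to a single pass. A folder heading that is not followed by its own <DL> would confuse it, so treat it as a starting point only.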

Paul

Sep 8 '06 #12

George Sakkis

Francach wrote:

> Hi George,
>
> Firefox lets you group the bookmarks along with other information into
> directories and sub-directories. Firefox uses header tags for this
> purpose. I'd like to get this grouping information out as well.
>
> Regards,
> Martin.

Here's what I came up with:
http://rafb.net/paste/results/G91EAo70.html. Tested only on my
bookmarks; see if it works for you.

For each subfolder there is a recursive call that walks the respective
subtree, so it's probably not the most efficient solution, but I
couldn't think of any one-pass way to do it using BeautifulSoup.

George

Sep 8 '06 #13

Francach

Hello George,

thanks a lot! This is exactly the direction I had in mind.
Your script demonstrates nicely how Beautiful Soup works.

Regards,
Martin.


Sep 9 '06 #14

robin

"George Sakkis" <ge***********@gmail.com> wrote:

> Here's what I came up with:
> http://rafb.net/paste/results/G91EAo70.html. Tested only on my
> bookmarks; see if it works for you.

That URL is dead. Got another?

-----
robin
noisetheatre.blogspot.com

Sep 21 '06 #15

George Sakkis

robin wrote:

> "George Sakkis" <ge***********@gmail.com> wrote:
>
>> Here's what I came up with:
>> http://rafb.net/paste/results/G91EAo70.html. Tested only on my
>> bookmarks; see if it works for you.
>
> That URL is dead. Got another?

Yeap, try this one:
http://gsakkis-utils.googlecode.com/...s/bookmarks.py

George

Sep 21 '06 #16
