HTML to dictionary

Hi everyone,

I have a small, probably trivial even, problem. I have the following HTML:
<b>
METAR:
</b>
ENBR 270920Z 00000KT 9999 FEW018 02/M01 Q1004 NOSIG
<br />
<b>
short-TAF:
</b>
ENBR 270800Z 270918 VRB05KT 9999 FEW020 SCT040
<br />
<b>
long-TAF:
</b>
ENBR 271212 VRB05KT 9999 FEW020 BKN030 TEMPO 2012 2000 SNRA VV010 BECMG 2124 15012KT
<br />
I need to make this into a dictionary like this:

dictionary = {"METAR:": "ENBR 270920Z 00000KT 9999 FEW018 02/M01 Q1004 NOSIG",
              "short-TAF:": "ENBR 270800Z 270918 VRB05KT 9999 FEW020 SCT040",
              "long-TAF:": "ENBR 271212 VRB05KT 9999 FEW020 BKN030 TEMPO 2012 2000 SNRA VV010 BECMG 2124 15012KT"}

I have played around with BeautifulSoup, but I'm stuck at stripping off
the tags and chopping the text up into what I need to put in the dict.
If someone can offer some hints or an example to get me going, I would
greatly appreciate it.

Thanks!
Tina
Feb 27 '07 #1
Forgot to mention that the labels "METAR:", "short-TAF:", and "long-TAF:"
are always named as such, whereas the line of data ("ENBR 271212 VRB05KT 9999
FEW020 BKN030 TEMPO 2012 2000 SNRA VV010 ") is dynamic and can be
anything...

Tina
Feb 27 '07 #2
Tina I:
I have a small, probably trivial even, problem. I have the following HTML:
This is a little data munging problem.
If it's a one-shot problem, then you can just load it with a browser,
copy and paste it as text, and then process the lines of the text in a
simple way (splitting lines according to ":", and using the stripped
pairs to feed a dict).
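
The copy-and-paste route described above might look roughly like this. It is only a minimal sketch, assuming the pasted text puts each label and its report on a single line; `parse_pasted` and `pasted_text` are illustrative names, not anything from the thread.

```python
def parse_pasted(text):
    """Build a dict from lines of the form 'LABEL: report text'."""
    result = {}
    for line in text.splitlines():
        # Skip blank lines and anything without a label separator.
        if ":" not in line:
            continue
        # Split on the first ":" only, in case the report contains one.
        key, value = line.split(":", 1)
        result[key.strip()] = value.strip()
    return result


# Sample input, assumed to be what the browser gives on copy and paste.
pasted_text = """
METAR: ENBR 270920Z 00000KT 9999 FEW018 02/M01 Q1004 NOSIG
short-TAF: ENBR 270800Z 270918 VRB05KT 9999 FEW020 SCT040
"""

print(parse_pasted(pasted_text))
```

Note that the keys come out without the trailing colon ("METAR" rather than "METAR:"), which differs slightly from the dictionary asked for in the original post.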

If there are more HTML files, or you want to automate things more, you
can use html2text:
http://www.aaronsw.com/2002/html2text/

A little script like this may help you:

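from html2text import html2text

txt = html2text(the_html_data)
# html2text renders <b> as **bold** markers; drop them before splitting.
lines = str(txt).replace("**", "").strip().splitlines()
# Split on the first ":" only and skip lines that have no ":" at all.
fields = [[field.strip() for field in line.split(":", 1)]
          for line in lines if ":" in line]
print(dict(fields))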

Note that splitlines() is tricky, if you find some problems, then you
may want a smarter splitter.
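
One possible "smarter splitter" is sketched below. It relies on the labels being fixed (as stated earlier in the thread) and treats any line that does not start with a known label as a continuation of the previous report. The names `KNOWN_LABELS` and `parse_wrapped` are made up for illustration, and the sample text is an assumption about how a wrapped report might look.

```python
KNOWN_LABELS = ("METAR:", "short-TAF:", "long-TAF:")


def parse_wrapped(text, labels=KNOWN_LABELS):
    """Map each known label to its report, joining wrapped lines."""
    result = {}
    current = None  # label whose report is currently being collected
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        for label in labels:
            if line.startswith(label):
                # A known label starts a new entry.
                current = label
                result[current] = line[len(label):].strip()
                break
        else:
            # No label matched: treat the line as a continuation.
            if current is not None:
                result[current] = (result[current] + " " + line).strip()
    return result


sample = """
METAR: ENBR 270920Z 00000KT 9999 FEW018 02/M01 Q1004 NOSIG
long-TAF: ENBR 271212 VRB05KT 9999 FEW020 BKN030 TEMPO 2012 2000 SNRA VV010
BECMG 2124 15012KT
"""

print(parse_wrapped(sample))
```

Matching on known labels instead of splitting on ":" also avoids trouble if a report ever contains a colon of its own.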

Bye,
bearophile

Feb 27 '07 #3
Tina I wrote:
Hi everyone,

I have a small, probably trivial even, problem. I have the following HTML:
<b>
METAR:
</b>
ENBR 270920Z 00000KT 9999 FEW018 02/M01 Q1004 NOSIG
<br />
....

BeautifulSoup is really fun to work with ;-)
I have played around with BeautifulSoup but I'm stuck at stripping off
the tags and chop it up to what I need to put in the dict. If someone
can offer some hints or example to get me going I would greatly
appreciate it.

Thanks!
Tina
#!/usr/bin/python
# -*- coding: utf-8 -*-

from BeautifulSoup import BeautifulSoup

html = """<html><head><title>Title</title></head>
<body>
<b>METAR: </b>ENBR 270920Z 00000KT 9999 ... <br />
<b>short-TAF:</b>ENBR 270800Z 270918 VRB05KT ... <br />
<b>long-TAF: </b>ENBR 271212 VRB05KT 9999 ... <br />
</body>
</html>
"""

soup = BeautifulSoup(html, convertEntities='html')
bolds = soup.findAll('b')

result = {}  # named "result" to avoid shadowing the built-in dict

for b in bolds:
    # b.next is the text node inside <b>; b.next.next is the text after </b>.
    key = b.next.strip()
    val = b.next.next.strip()
    print 'key=', key
    print 'val=', val, '\n'
    result[key] = val

print result

#---- end ----
happy pythoning

Herbert
Feb 27 '07 #4
On 27 Feb, 11:08, Tina I <tina...@bestemselv.com> wrote:

I have a small, probably trivial even, problem. I have the following HTML:
<b>
METAR:
</b>
ENBR 270920Z 00000KT 9999 FEW018 02/M01 Q1004 NOSIG
<br />
<b>
short-TAF:
</b>
ENBR 270800Z 270918 VRB05KT 9999 FEW020 SCT040
<br />
<b>
long-TAF:
</b>
ENBR 271212 VRB05KT 9999 FEW020 BKN030 TEMPO 2012 2000 SNRA VV010
BECMG 2124 15012KT
<br />
This looks almost like XHTML which means that you might be able to use
a normal XML parser.
I need to make this into a dictionary like this:

dictionary = {"METAR:" : "ENBR 270920Z 00000KT 9999 FEW018 02/M01 Q1004
NOSIG" , "short-TAF:" : "ENBR 270800Z 270918 VRB05KT 9999 FEW020 SCT040"
, "long-Taf:" : "ENBR 271212 VRB05KT 9999 FEW020 BKN030 TEMPO 2012 2000
SNRA VV010 BECMG 2124 15012KT"}
So what you want to do is to find each "b" element, extract the
contents to produce a dictionary key, and then find all following text
nodes up to the "br" element, extracting the contents of those nodes
to produce the corresponding dictionary value.
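
The same idea (text inside "b" becomes the key, text up to the next "br" becomes the value) can also be sketched with only the standard library. This is not the libxml2dom approach shown in this post; it swaps in Python 3's `html.parser`, and the names `LabelValueParser` and `html_to_dict` are invented for the example.

```python
from html.parser import HTMLParser


class LabelValueParser(HTMLParser):
    """Use text inside <b> as a key and the text up to <br> as its value."""

    def __init__(self):
        super().__init__()
        self.result = {}
        self._in_bold = False
        self._key_parts = []
        self._value_parts = []

    def handle_starttag(self, tag, attrs):
        # "<br />" also arrives here via the default handle_startendtag.
        if tag == "b":
            self.flush()
            self._in_bold = True
        elif tag == "br":
            self.flush()

    def handle_endtag(self, tag):
        if tag == "b":
            self._in_bold = False

    def handle_data(self, data):
        if self._in_bold:
            self._key_parts.append(data)
        elif self._key_parts:
            # Only collect value text once a key has been seen.
            self._value_parts.append(data)

    def flush(self):
        """Store the pending key/value pair, if any, and reset the buffers."""
        key = "".join(self._key_parts).strip()
        if key:
            # Collapse runs of whitespace, including line breaks, to one space.
            self.result[key] = " ".join("".join(self._value_parts).split())
        self._key_parts = []
        self._value_parts = []


def html_to_dict(html):
    parser = LabelValueParser()
    parser.feed(html)
    parser.close()
    parser.flush()  # handles a final entry that has no closing <br>
    return parser.result


sample_html = """<b>
METAR:
</b>
ENBR 270920Z 00000KT 9999 FEW018 02/M01 Q1004 NOSIG
<br />
<b>
short-TAF:
</b>
ENBR 270800Z 270918 VRB05KT 9999 FEW020 SCT040
<br />
"""

print(html_to_dict(sample_html))
```

Because the parser is event-driven, it does not need the input to be well-formed XML, which may help if the real pages are messier than the sample.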

Now, with a DOM/XPath library, the first part is quite
straightforward. Let's first parse the document, though:

import libxml2dom # my favourite ;-)
d = libxml2dom.parse(the_file) # add html=1 if it's HTML

Now, let's get the "b" elements providing the keys:

key_elements = d.xpath("//b")

The above will find all "b" elements throughout the document. If
that's too broad a search, you can specify something more narrow. For
example:

key_elements = d.xpath("/html/body/b")

At this point, key_elements should contain a list of nodes, each
corresponding to a "b" element, and you can get the contents of each
element by asking for all the text nodes inside it and joining them
together, stripping the whitespace off each end to make the dictionary
key itself:

def get_key(key_element):
    texts = []
    # Get all text child nodes, collecting the contents.
    for n in key_element.xpath("text()"):
        texts.append(n.nodeValue)
    # Join them together, removing leading/trailing space.
    return "".join(texts).strip()

(Currently, libxml2dom lets you ask an element for its nodeValue,
erroneously returning text inside that element, but I don't want to
promote this as a solution since I may change it at some point.)

The process of getting the dictionary values is a bit more difficult.
What we need to do is to ask for the following siblings of the "b"
element, then to loop over them until we find a "br" element. The
dictionary value is then obtained from the discovered text fragments
by joining them together and stripping whitespace from the ends:

def get_value(key_element):
    texts = []
    # Loop over nodes following the element...
    for n in key_element.xpath("following-sibling::node()"):
        # Stop looping if we find a "br" element.
        if n.nodeType == n.ELEMENT_NODE and n.localName == "br":
            break
        # Otherwise get the (assumed) text content.
        texts.append(n.nodeValue)
    # Join the texts and remove leading/trailing space.
    return "".join(texts).strip()

So, putting this together, you should get something like this:

dictionary = {}
for key_element in key_elements:
    dictionary[get_key(key_element)] = get_value(key_element)

As always with HTML processing, your mileage may vary with such an
approach, but I hope this is helpful. You should also be able to use
something like 4Suite or PyXML with the above code, albeit possibly
slightly modified.
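
As an illustration of the "normal XML parser" route with no third-party library, here is a rough sketch using the standard library's `xml.etree.ElementTree` instead of libxml2dom, 4Suite or PyXML. It assumes the fragment is well-formed XML and wraps it in a made-up `<root>` element so that it parses; in ElementTree the text following `</b>` is available as the element's `tail`.

```python
import xml.etree.ElementTree as ET

fragment = """<b>
METAR:
</b>
ENBR 270920Z 00000KT 9999 FEW018 02/M01 Q1004 NOSIG
<br />
<b>
short-TAF:
</b>
ENBR 270800Z 270918 VRB05KT 9999 FEW020 SCT040
<br />"""

# The fragment has no single top-level element, so wrap it in one.
root = ET.fromstring("<root>" + fragment + "</root>")

dictionary = {}
for b in root.iter("b"):
    # b.text is the label inside <b>; b.tail is the text between </b>
    # and the next element, which here is the <br />.
    key = (b.text or "").strip()
    value = " ".join((b.tail or "").split())
    dictionary[key] = value

print(dictionary)
```

This only works while the markup stays XML-clean; a stray unclosed tag would make `ET.fromstring` raise a `ParseError`.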

Paul

P.S. Hopefully, Google Groups won't wrap the code badly. Whatever
happened to the preview option, Google?

Feb 27 '07 #5
Tina,
In addition to Beautiful Soup which others have mentioned, Connelly
Barnes' HTMLData module will take (X)HTML and convert it into a
dictionary for you:
http://oregonstate.edu/~barnesc/htmldata/

The dictionary won't have the exact format you want, but I think it
would be fairly easy for you to convert to what you're looking for.

I use HTMLData a lot. Beautiful Soup is great for parsing iteratively,
but if I just want to throw some HTML at a function and get data back,
HTMLData is my tool of choice.

Good luck with whatever you choose

--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
Feb 27 '07 #6
Thanks people, I learned a lot!! :)

I went for Herbert's solution in my application but I explored, and
learned from, all of them.

Tina
Feb 27 '07 #7
