HTML to dictionary

Hi everyone,

I have a small, probably trivial even, problem. I have the following HTML:

METAR:

ENBR 270920Z 00000KT 9999 FEW018 02/M01 Q1004 NOSIG
 

short-TAF:

ENBR 270800Z 270918 VRB05KT 9999 FEW020 SCT040
 

long-TAF:

ENBR 271212 VRB05KT 9999 FEW020 BKN030 TEMPO 2012 2000 SNRA VV010 BECMG 2124 15012KT

I need to make this into a dictionary like this:

dictionary = {"METAR:": "ENBR 270920Z 00000KT 9999 FEW018 02/M01 Q1004 NOSIG",
              "short-TAF:": "ENBR 270800Z 270918 VRB05KT 9999 FEW020 SCT040",
              "long-TAF:": "ENBR 271212 VRB05KT 9999 FEW020 BKN030 TEMPO 2012 2000 SNRA VV010 BECMG 2124 15012KT"}

I have played around with BeautifulSoup but I'm stuck at stripping off
the tags and chopping the text up into what I need to put in the dict.
If someone can offer some hints or an example to get me going I would
greatly appreciate it.

Thanks!
Tina

Feb 27 '07 #1


Tina I

Forgot to mention that the labels "METAR:", "short-TAF:" and "long-TAF:"
are always named as such, whereas the line of data ("ENBR 271212 VRB05KT
9999 FEW020 BKN030 TEMPO 2012 2000 SNRA VV010 ") is dynamic and can be
anything...

Tina

Feb 27 '07 #2

bearophileHUGS

Tina I:

> I have a small, probably trivial even, problem. I have the following HTML:

This is a little data munging problem.
If it's a one-shot problem, then you can just load it with a browser,
copy and paste it as text, and then process the lines of the text in a
simple way (splitting lines according to ":", and using the stripped
pairs to feed a dict).
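
A minimal sketch of that one-shot approach, written for Python 3. It assumes the pasted text puts each label either on its own line or directly in front of its data, and that the labels are the three fixed ones mentioned earlier in the thread. The names LABELS and text_to_dict are made up for illustration:

```python
# Hypothetical helper: turn browser-pasted text into a dict.
# Assumption: only these three labels occur, each followed by ":".
LABELS = ("METAR", "short-TAF", "long-TAF")


def text_to_dict(text):
    result = {}
    current = None  # key that following data lines belong to
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Split at the first ":" only, so the data itself is left alone.
        head, sep, rest = line.partition(":")
        if sep and head.strip() in LABELS:
            current = head.strip() + ":"
            result[current] = rest.strip()
        elif current is not None:
            # Wrapped data lines are glued back onto the current value.
            result[current] = (result[current] + " " + line).strip()
    return result


pasted = """METAR:

ENBR 270920Z 00000KT 9999 FEW018 02/M01 Q1004 NOSIG

short-TAF:

ENBR 270800Z 270918 VRB05KT 9999 FEW020 SCT040
"""
print(text_to_dict(pasted))
```

Checking the text before the colon against the known labels keeps a stray ":" in the weather data from being mistaken for a new key.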

If there are more HTML files, or you want to automate things more, you
can use html2text:
http://www.aaronsw.com/2002/html2text/

A little script like this may help you:

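from html2text import html2text

# the_html_data is the HTML source as a string
txt = html2text(the_html_data)
# html2text marks bold text with "**"; drop the markers
lines = str(txt).replace("**", "").strip().splitlines()
# split each line at its first ":" only, skipping lines without one
fields = [[field.strip() for field in line.split(":", 1)]
          for line in lines if ":" in line]
print(dict(fields))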

Note that splitlines() is tricky; if you run into problems, you may
want a smarter splitter.

Bye,
bearophile

Feb 27 '07 #3

WEINHANDL Herbert

Tina I wrote:

> Hi everyone,
>
> I have a small, probably trivial even, problem. I have the following HTML:
>
> METAR:
>
> ENBR 270920Z 00000KT 9999 FEW018 02/M01 Q1004 NOSIG
>
> ....

BeautifulSoup is really fun to work with ;-)

> I have played around with BeautifulSoup but I'm stuck at stripping off
> the tags and chopping the text up into what I need to put in the dict.
> If someone can offer some hints or an example to get me going I would
> greatly appreciate it.
>
> Thanks!
> Tina

#!/usr/bin/python
# -*- coding: utf-8 -*-

# Python 2 with BeautifulSoup 3
from BeautifulSoup import BeautifulSoup, Tag, NavigableString

html = """<html><head><title>Title</title></head>
<body>
<b>METAR: </b>ENBR 270920Z 00000KT 9999 ... 
<b>short-TAF:</b>ENBR 270800Z 270918 VRB05KT ... 
<b>long-TAF: </b>ENBR 271212 VRB05KT 9999 ... 
</body>
</html>
"""

soup = BeautifulSoup( html, convertEntities='html' )
bolds = soup.findAll( 'b' )

result = {}    # named 'result' so the builtin dict is not shadowed

for b in bolds :
    key = b.next.strip()         # the text inside the <b> tag
    val = b.next.next.strip()    # the text node right after </b>
    print 'key=', key
    print 'val=', val, '\n'
    result[key] = val

print result

#---- end ----
happy pythoning

Herbert

Feb 27 '07 #4

Paul Boddie

On 27 Feb, 11:08, Tina I <tina...@bestemselv.com> wrote:

> I have a small, probably trivial even, problem. I have the following HTML:
>
> METAR:
>
> ENBR 270920Z 00000KT 9999 FEW018 02/M01 Q1004 NOSIG
>
> short-TAF:
>
> ENBR 270800Z 270918 VRB05KT 9999 FEW020 SCT040
>
> long-TAF:
>
> ENBR 271212 VRB05KT 9999 FEW020 BKN030 TEMPO 2012 2000 SNRA VV010
> BECMG 2124 15012KT

This looks almost like XHTML which means that you might be able to use
a normal XML parser.

> I need to make this into a dictionary like this:
>
> dictionary = {"METAR:": "ENBR 270920Z 00000KT 9999 FEW018 02/M01 Q1004 NOSIG",
>               "short-TAF:": "ENBR 270800Z 270918 VRB05KT 9999 FEW020 SCT040",
>               "long-TAF:": "ENBR 271212 VRB05KT 9999 FEW020 BKN030 TEMPO 2012 2000 SNRA VV010 BECMG 2124 15012KT"}

So what you want to do is to find each "b" element, extract the
contents to produce a dictionary key, and then find all following text
nodes up to the "br" element, extracting the contents of those nodes
to produce the corresponding dictionary value.
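
The procedure just described can also be sketched without any third-party library, using the event-based html.parser module from the Python 3 standard library. This is only an illustration: it assumes each value directly follows its "b" element and ends at a "br" element, and the class name, function name and sample markup are invented here:

```python
from html.parser import HTMLParser


class BoldKeyParser(HTMLParser):
    """Collect {text inside <b>: text between </b> and the next <br>}."""

    def __init__(self):
        super().__init__()
        self.result = {}
        self._in_b = False
        self._key_parts = []
        self._key = None
        self._value_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.flush()              # a new key ends any open value
            self._in_b = True
            self._key_parts = []
        elif tag == "br":
            self.flush()              # "br" terminates the value

    def handle_endtag(self, tag):
        if tag == "b":
            self._in_b = False
            self._key = "".join(self._key_parts).strip()
            self._value_parts = []

    def handle_data(self, data):
        if self._in_b:
            self._key_parts.append(data)
        elif self._key is not None:
            self._value_parts.append(data)

    def flush(self):
        """Store the value collected so far under the current key."""
        if self._key is not None:
            # Collapse runs of whitespace, including wrapped lines.
            value = " ".join("".join(self._value_parts).split())
            self.result[self._key] = value
            self._key = None


def html_to_dict(html):
    parser = BoldKeyParser()
    parser.feed(html)
    parser.close()
    parser.flush()                    # in case the last value has no "br"
    return parser.result


# Assumed markup, modelled on the page described in the question.
SAMPLE = """<html><body>
<b>METAR: </b>ENBR 270920Z 00000KT 9999 FEW018 02/M01 Q1004 NOSIG<br />
<b>short-TAF:</b>ENBR 270800Z 270918 VRB05KT 9999 FEW020 SCT040<br />
</body></html>"""
print(html_to_dict(SAMPLE))
```

If the real page puts a "br" between the label and its data, the value would come out empty, so the markup is worth checking first.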

Now, with a DOM/XPath library, the first part is quite
straightforward. Let's first parse the document, though:

import libxml2dom # my favourite ;-)
d = libxml2dom.parse(the_file) # add html=1 if it's HTML

Now, let's get the "b" elements providing the keys:

key_elements = d.xpath("//b")

The above will find all "b" elements throughout the document. If
that's too broad a search, you can specify something more narrow. For
example:

key_elements = d.xpath("/html/body/b")

At this point, key_elements should contain a list of nodes, each
corresponding to a "b" element, and you can get the contents of each
element by asking for all the text nodes inside it and joining them
together, stripping the whitespace off each end to make the dictionary
key itself:

def get_key(key_element):
    texts = []
    # Get all text child nodes, collecting the contents.
    for n in key_element.xpath("text()"):
        texts.append(n.nodeValue)
    # Join them together, removing leading/trailing space.
    return "".join(texts).strip()

(Currently, libxml2dom lets you ask an element for its nodeValue,
erroneously returning text inside that element, but I don't want to
promote this as a solution since I may change it at some point.)

The process of getting the dictionary values is a bit more difficult.
What we need to do is to ask for the following siblings of the "b"
element, then to loop over them until we find a "br" element. The
dictionary value is then obtained from the discovered text fragments
by joining them together and stripping whitespace from the ends:

def get_value(key_element):
    texts = []
    # Loop over nodes following the element...
    for n in key_element.xpath("following-sibling::node()"):
        # Stop looping if we find a "br" element.
        if n.nodeType == n.ELEMENT_NODE and n.localName == "br":
            break
        # Otherwise get the (assumed) text content.
        texts.append(n.nodeValue)
    # Join the texts and remove leading/trailing space.
    return "".join(texts).strip()

So, putting this together, you should get something like this:

dictionary = {}
for key_element in key_elements:
    dictionary[get_key(key_element)] = get_value(key_element)

As always with HTML processing, your mileage may vary with such an
approach, but I hope this is helpful. You should also be able to use
something like 4Suite or PyXML with the above code, albeit possibly
slightly modified.
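
As one possible illustration of such a modification, here is a sketch using xml.dom.minidom from the Python 3 standard library. minidom has no XPath support, so getElementsByTagName, childNodes and nextSibling stand in for the xpath() calls. It only works when the input is well-formed XML, and the sample markup is an assumption about what the page looks like:

```python
from xml.dom import minidom

# Assumed, well-formed XHTML modelled on the page in the question.
XHTML = """<html><body>
<b>METAR:</b> ENBR 270920Z 00000KT 9999 FEW018 02/M01 Q1004 NOSIG<br/>
<b>short-TAF:</b> ENBR 270800Z 270918 VRB05KT 9999 FEW020 SCT040<br/>
<b>long-TAF:</b> ENBR 271212 VRB05KT 9999 FEW020 BKN030 TEMPO 2012 2000 SNRA VV010 BECMG 2124 15012KT<br/>
</body></html>"""


def get_key(key_element):
    # Join the text child nodes of the "b" element.
    texts = [n.nodeValue for n in key_element.childNodes
             if n.nodeType == n.TEXT_NODE]
    return "".join(texts).strip()


def get_value(key_element):
    texts = []
    n = key_element.nextSibling
    # Walk the following siblings until a "br" element turns up.
    while n is not None:
        if n.nodeType == n.ELEMENT_NODE and n.tagName == "br":
            break
        if n.nodeType == n.TEXT_NODE:
            texts.append(n.nodeValue)
        n = n.nextSibling
    return "".join(texts).strip()


document = minidom.parseString(XHTML)
dictionary = {}
for key_element in document.getElementsByTagName("b"):
    dictionary[get_key(key_element)] = get_value(key_element)
print(dictionary)
```

Unlike the version above, this one skips non-text siblings instead of appending their nodeValue, since minidom returns None for elements.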

Paul

P.S. Hopefully, Google Groups won't wrap the code badly. Whatever
happened to the preview option, Google?

Feb 27 '07 #5

Nikita the Spider

In article <br********************@telenor.com>,
Tina I <ti*****@bestemselv.com> wrote:

> Hi everyone,
>
> I have a small, probably trivial even, problem. I have the following HTML:
>
> METAR:
>
> ENBR 270920Z 00000KT 9999 FEW018 02/M01 Q1004 NOSIG
>
> short-TAF:
>
> ENBR 270800Z 270918 VRB05KT 9999 FEW020 SCT040
>
> long-TAF:
>
> ENBR 271212 VRB05KT 9999 FEW020 BKN030 TEMPO 2012 2000 SNRA VV010 BECMG
> 2124 15012KT
>
> I need to make this into a dictionary like this:
>
> dictionary = {"METAR:": "ENBR 270920Z 00000KT 9999 FEW018 02/M01 Q1004 NOSIG",
>               "short-TAF:": "ENBR 270800Z 270918 VRB05KT 9999 FEW020 SCT040",
>               "long-TAF:": "ENBR 271212 VRB05KT 9999 FEW020 BKN030 TEMPO 2012 2000 SNRA VV010 BECMG 2124 15012KT"}

Tina,
In addition to Beautiful Soup, which others have mentioned, Connelly
Barnes' HTMLData module will take (X)HTML and convert it into a
dictionary for you:
http://oregonstate.edu/~barnesc/htmldata/

The dictionary won't have the exact format you want, but I think it
would be fairly easy for you to convert to what you're looking for.

I use HTMLData a lot. Beautiful Soup is great for parsing iteratively,
but if I just want to throw some HTML at a function and get data back,
HTMLData is my tool of choice.

Good luck with whatever you choose.

--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more

Feb 27 '07 #6

Tina I

Thanks people, I learned a lot!! :)

I went for Herbert's solution in my application but I explored, and
learned from, all of them.

Tina

Feb 27 '07 #7
