Parsing HTML code in Python

Hello world, I'm a big-time rookie using Python 2.5 on Windows...

I have a block of HTML to parse that has 2 damn near identical lines. The difference is the numbers in them and they are always changing. The two nearly identical lines are:

<tr><td bgcolor="EEEEEE">Total</td><td bgcolor="EEEEEE" align="right">3,421</td><td bgcolor="EEEEEE" align="right">1,221</td><td bgcolor="EEEEEE" align="right">3,189</td><td bgcolor="EEEEEE" align="right">1,775</td></tr>

And

<tr><td bgcolor="EEEEEE">Total</td><td bgcolor="EEEEEE" align="right">1,478,819,000</td><td bgcolor="EEEEEE" align="right">42,765,000</td><td bgcolor="EEEEEE" align="right">2,023,516,000</td><td bgcolor="EEEEEE" align="right">3,448,129,075</td></tr>

I believe the only difference is the numbers in them, which, again, are constantly changing.

I tried this :

>>> def scrape():
...     for line in urllib2.urlopen('http://bigcharts.marketwatch.com/markets/'):
...         if '<tr><td bgcolor="EEEEEE">Total</td><td bgcolor="EEEEEE" align="right">' in line:
...             return line
...
>>> data = r'>([0-9,]+)<'
>>> dataList = re.findall(data, scrape())
>>> print dataList
['3,421', '1,221', '3,189', '1,775']

In my if statement I matched on the HTML that runs all the way up to the first number. But as you can see, the code only gets the first line of numbers, not the second.

Any guess why that might be happening, and how to fix it?

Thanks

Pat

May 23 '07 #1

bartonc

6,596

Expert 4TB

I think that you'll want to look into the re module. Regular Expressions are the way to go for this kind of task.
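
As a rough illustration of that suggestion, here is a minimal sketch that runs re.findall over the first sample row from the question. The names row, pattern and numbers are made up for the example, the row is pasted in as a literal string instead of being downloaded, and print() is written in Python 3 syntax, not the 2.5 syntax used in the thread.

```python
import re

# First sample row from the question, pasted in as a literal string.
row = ('<tr><td bgcolor="EEEEEE">Total</td>'
       '<td bgcolor="EEEEEE" align="right">3,421</td>'
       '<td bgcolor="EEEEEE" align="right">1,221</td>'
       '<td bgcolor="EEEEEE" align="right">3,189</td>'
       '<td bgcolor="EEEEEE" align="right">1,775</td></tr>')

# One or more digits or commas sitting directly between '>' and '<'.
pattern = re.compile(r'>([0-9,]+)<')

numbers = pattern.findall(row)
print(numbers)  # ['3,421', '1,221', '3,189', '1,775']
```

The word Total is skipped because it contains no digits or commas, so only the four figures come back.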

May 23 '07 #2

bvdet

2,851

Expert Mod 2GB

Pat,
Your scrape() only returns one string. You need to do something like this (untested):

import re

def scrape():
    s = 'part of the string'
    # you have to create the file object first; file_object is a placeholder
    return [line for line in file_object if s in line]

data = r'>([0-9,]+)<'
dataList = []
for item in scrape():
    # findall() returns a list, so dataList ends up as a list of lists
    dataList.append(re.findall(data, item))
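
To make the difference concrete, here is a small sketch that compares returning inside the loop with collecting every matching line. It is only an illustration: PAGE_LINES is a hand-made stand-in for the downloaded page, MARKER is a shortened version of the test string, and first_match and all_matches are hypothetical names. It is written in Python 3 syntax.

```python
import re

# Stand-in for the downloaded page: four lines, two of which are "Total" rows.
PAGE_LINES = [
    '<tr><td bgcolor="EEEEEE">Issues</td></tr>',
    '<tr><td bgcolor="EEEEEE">Total</td><td bgcolor="EEEEEE" align="right">3,421</td></tr>',
    '<tr><td bgcolor="EEEEEE">Volume</td></tr>',
    '<tr><td bgcolor="EEEEEE">Total</td><td bgcolor="EEEEEE" align="right">1,478,819,000</td></tr>',
]
MARKER = '<td bgcolor="EEEEEE">Total</td>'
NUMBER = re.compile(r'>([0-9,]+)<')


def first_match(lines):
    # 'return' leaves the function on the first hit, so later rows are never seen.
    for line in lines:
        if MARKER in line:
            return line


def all_matches(lines):
    # The list comprehension checks every line before anything is returned.
    return [line for line in lines if MARKER in line]


print(NUMBER.findall(first_match(PAGE_LINES)))
# ['3,421']
print([NUMBER.findall(line) for line in all_matches(PAGE_LINES)])
# [['3,421'], ['1,478,819,000']]
```

The second result is a list of lists, one inner list per matching row, which is also what appending each findall() result produces.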

May 23 '07 #3

bvdet

2,851

Expert Mod 2GB

Patrick,

urllib.urlopen(url) creates a file-like object, suitable for iteration.
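
A quick way to see what "file-like" means without touching the network is to loop over an io.StringIO object, which supports the same line-by-line iteration. This is only a stand-in for the real response: fake_page and total_lines are made-up names, and the code uses Python 3 syntax. In Python 3 the corresponding call is urllib.request.urlopen, and iterating over its result yields bytes, not text.

```python
import io

# Stand-in for the object returned by urlopen(): any file-like object
# can be looped over one line at a time.
fake_page = io.StringIO(
    '<html>\n'
    '<tr><td>Total</td><td>3,421</td></tr>\n'
    '</html>\n'
)

# Keep only the lines that mention "Total", dropping the trailing newline.
total_lines = [line.rstrip('\n') for line in fake_page if 'Total' in line]
print(total_lines)  # ['<tr><td>Total</td><td>3,421</td></tr>']
```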

May 24 '07 #4

ghostdog74

511

Expert 256MB

only tested with provided data:

for line in open("file"):
    line = line.strip().split()
    for item in line:
        if "COLOR" in item and "</font>" in item \
           and not "Total" in item:
            stindex=item.index('">')
            endindex=item.index("</font>")
            print item[stindex+2:endindex]

output:

3,421
1,221
3,189
1,775
1,478,819,000
42,765,000
2,023,516,000
3,448,129,075

May 24 '07 #5

Patrick C

Patrick,

urllib.urlopen(url) creates a file-like object, suitable for iteration.

Ok... that I thought I could do.

So I tried this:

import urllib2, urllib, re

def test():
    s = '<tr><td bgcolor="EEEEEE"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000"><b>Total</b></font></td><td bgcolor="EEEEEE" align="right"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000">'
    def scrape():
        for line in urllib.urlopen('http://bigcharts.marketwatch.com/markets/'):
            return [line for line in scrape() if s in line]

data = r'>([0-9,]+)<'
datalist = []
for item in test():
    datalist.append(re.findall(data, item)

I'm sure a few things are off with this. Here are the problems I'm having:

1. After the last line it doesn't get me out of that function (maybe that's the wrong word; either way I don't get the >>> prompt back).
2. You suggested the line return [line for line in file_object if s in line]. What would the variable name for my file_object be in this case? I used test() because it calls urllib.urlopen, which you said acts like a file object.

thanks
pc

May 24 '07 #6

Patrick C

OK, I tried your method and had more success than with the other suggestion, but I still ran into some dead ends.

Here's what I did:

>>> import urllib2
>>> import re
>>>
>>> def scrape():
...     for line in urllib2.urlopen('http://bigcharts.marketwatch.com/markets/'):
...         line = line.strip().split()
...         for item in line:
...             if "COLOR" in item and "</font>" in item and not "Total" in item:
...                 stindex=item.index('">')
...                 endindex=item.index("</font>")
...                 print item[stindex+2:endindex]
...
>>> print scrape()
<b>Issues:</b>
<b>NYSE</b>
<b>AMEX</b>
<b>Nasdaq</b>
<b>Advancing</b>
613
309
761
529
<b>Declining</b>
2,689
792
2,278
729
<b>Unchanged</b>
110
99
139
415
3,412
1,200
3,178
1,673
26
40
28
134
15
15
33
169
<b>Volume:</b>
<b>Advancing</b>
322,581,000
12,414,000
336,729,000
370,865,638
<b>Declining</b>
1,437,309,000
31,021,000
2,031,162,000
1,712,486,962
<b>Unchanged</b>
11,209,000
2,062,000
-2,345,316,000
508,293,225
1,771,099,000
45,497,000
22,575,000
2,591,645,825
None
>>>

Now that works GREAT, but how can I call just one of the numbers, say for example 45,497,000, or whatever number falls into that place when it gets updated?

If I do return item[stindex+2:endindex]
and then do print scrape()
I just get the top line.

Since I want to scrape the data and put it into a file, I think I'll need to return at some point.

How do I return while getting more than just one line?

thanks
pc

May 24 '07 #7

ghostdog74

511

Expert 256MB

Now that works GREAT, but, how can I call just one of the numbers, say for example 45,497,000, or whatever number would fall into that place when it gets updated.

I don't understand. Do you mean that you only want the numbers that have been updated on the website, and only want to see the ones that changed from your previous values? If so, one way is to store the values somewhere, such as a file, a database, or pickled objects, and then on the next run grab the numbers and compare them to the saved data. However, the code will be longer than what I have provided, and you will also need to redefine how you search for specific patterns.
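
A minimal sketch of that store-and-compare idea, assuming the scraped values arrive as a list of strings. The function changed_values, the file name last_totals.pkl, and the sample values are all hypothetical, and the code uses Python 3 syntax, not the 2.5 syntax seen elsewhere in the thread.

```python
import os
import pickle
import tempfile


def changed_values(current, path):
    """Return {position: (old, new)} for values that differ from the last run."""
    previous = []
    if os.path.exists(path):
        with open(path, 'rb') as handle:
            previous = pickle.load(handle)
    changes = {}
    for position, new in enumerate(current):
        old = previous[position] if position < len(previous) else None
        if old != new:
            changes[position] = (old, new)
    # Save the current values so the next run has something to compare against.
    with open(path, 'wb') as handle:
        pickle.dump(current, handle)
    return changes


# Hypothetical storage location; a temporary directory keeps the demo tidy.
store = os.path.join(tempfile.mkdtemp(), 'last_totals.pkl')
first_run = changed_values(['3,421', '1,221'], store)
second_run = changed_values(['3,412', '1,221'], store)
print(first_run)   # {0: (None, '3,421'), 1: (None, '1,221')}
print(second_run)  # {0: ('3,421', '3,412')}
```

On the first run everything counts as changed because nothing has been saved yet; on the second run only position 0 differs.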

May 24 '07 #8

bvdet

2,851

Expert Mod 2GB

Patrick,

Here's a little test script similar to what you are doing:

import re
import urllib

def scrape():
    # capture runs of digits and commas that sit between '>' and '<'
    patt = re.compile(r'>([0-9,]+)<')
    s = '<tr><td bgcolor="EEEEEE"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000"><b>Total</b></font></td><td bgcolor="EEEEEE" align="right"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000">'
    # keep only the lines of the page that contain the "Total" row prefix
    data = [line for line in urllib.urlopen('http://bigcharts.marketwatch.com/markets/') if s in line]
    dataList = []
    for item in data:
        # += extends the flat list with every match found on this line
        dataList += patt.findall(item)
    return dataList

numList = scrape()
print numList

'''
>>> ['3,412', '1,200', '3,178', '1,673', '1,771,099,000', '45,497,000', '22,575,000', '2,591,645,825']
'''

Now that you have the data stored in a list, you can do a number of things. I don't like testing for 'in' with a string that long though.
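
Two of those follow-ups can be sketched briefly: picking a single value out of the list by position, and swapping the very long 'in' test for a shorter regular expression. This is only an illustration in Python 3 syntax. The helper to_int, the pattern TOTAL_ROW and the shortened sample line are made-up names and data, and which column a given position corresponds to is an assumption based on the sample output.

```python
import re

# Sample result copied from the output shown above.
numList = ['3,412', '1,200', '3,178', '1,673',
           '1,771,099,000', '45,497,000', '22,575,000', '2,591,645,825']


def to_int(text):
    # Strip the thousands separators before converting.
    return int(text.replace(',', ''))


# Pick one value by position: index 5 is the sixth number in the list.
print(numList[5], to_int(numList[5]))  # 45,497,000 45497000

# A shorter test for the "Total" row: the word Total sitting
# between a '>' and a closing tag.
TOTAL_ROW = re.compile(r'>Total</')
line = '<tr><td bgcolor="EEEEEE"><b>Total</b></font></td><td align="right">3,412</font></td></tr>'
print(bool(TOTAL_ROW.search(line)))  # True
```

Indexing by position only stays correct as long as the page keeps its columns in the same order, so it is worth checking the list length before relying on a fixed index.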

May 25 '07 #9

Patrick C

What I'm shooting for is a scraper that, when it runs, gets the numbers that are being constantly updated on the site. The example you gave does that, but I can't figure out how to get just one of the numbers.

I'd then store them in a file and build a database from it.

The other example, the one bvdet gave, prints the numbers out as a list of strings, which is what I need because I only want to reference some of the numbers scraped. Now I'll add a few lines so it writes to a file.

Does that explain what I'm aiming to do better? Sorry for the confusion.
And thanks a lot for your help, it's very much appreciated.
pc
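
For the write-to-a-file step, one possible approach is to append each run as a timestamped row of a CSV file. This is a sketch under assumptions, not something taken from the thread: append_row and totals.csv are hypothetical names, the values are the sample numbers posted above, and the code uses Python 3 syntax.

```python
import csv
import datetime
import os
import tempfile


def append_row(path, values):
    """Append one timestamped row of scraped values to a CSV file."""
    stamp = datetime.datetime.now().isoformat()
    # newline='' lets the csv module control line endings itself.
    with open(path, 'a', newline='') as handle:
        csv.writer(handle).writerow([stamp] + list(values))


# Hypothetical output file inside a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), 'totals.csv')
append_row(out_path, ['3,412', '1,200', '3,178', '1,673'])
append_row(out_path, ['3,421', '1,221', '3,189', '1,775'])

with open(out_path, newline='') as handle:
    rows = list(csv.reader(handle))
print(len(rows), rows[0][1:])  # 2 ['3,412', '1,200', '3,178', '1,673']
```

The csv module quotes any value that contains a comma, so figures such as 3,412 survive the round trip as single fields.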

May 25 '07 #10
