Parsing Html code in Python

Hello world, I'm a big-time rookie using Python 2.5 on Windows...

I have a block of HTML to parse that has 2 damn near identical lines. The difference is the numbers in them and they are always changing. The two nearly identical lines are:



<tr><td bgcolor="EEEEEE"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000"><b>Total</b></font></td><td bgcolor="EEEEEE" align="right"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000">3,421</font></td><td bgcolor="EEEEEE" align="right"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000">1,221</font></td><td bgcolor="EEEEEE" align="right"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000">3,189</font></td><td bgcolor="EEEEEE" align="right"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000">1,775</font></td></tr>



And



<tr><td bgcolor="EEEEEE"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000"><b>Total</b></font></td><td bgcolor="EEEEEE" align="right"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000">1,478,819,000</font></td><td bgcolor="EEEEEE" align="right"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000">42,765,000</font></td><td bgcolor="EEEEEE" align="right"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000">2,023,516,000</font></td><td bgcolor="EEEEEE" align="right"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000">3,448,129,075</font></td></tr>



I believe the only difference is the numbers in them, which, again, are constantly changing.



I tried this:



>>> def scrape():
...     for line in urllib2.urlopen('http://bigcharts.marketwatch.com/markets/'):
...         if '<tr><td bgcolor="EEEEEE"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000"><b>Total</b></font></td><td bgcolor="EEEEEE" align="right"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000">' in line:
...             return line
...
>>> data = r'>([0-9,]+)<'
>>> dataList = re.findall(data, scrape())
>>> print dataList
['3,421', '1,221', '3,189', '1,775']



In my if statement I used the HTML that goes all the way up to the numbers. But as you can see, the code is only getting the first line of numbers, not the second.



Any guess why that might be happening? And how to fix this?

Thanks

Pat
May 23 '07 #1
bartonc
I think that you'll want to look into the re module. Regular Expressions are the way to go for this kind of task.
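For example, something along these lines pulls the comma-grouped numbers out of a chunk of HTML (just a sketch against a made-up sample string):

import re

# Grab runs of digits and commas that sit between a '>' and a '<',
# i.e. the text content of the table cells.
sample = '<td>3,421</font></td><td>1,221</font></td>'
print re.findall(r'>([0-9,]+)<', sample)
# prints: ['3,421', '1,221']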
May 23 '07 #2
bvdet
Pat,
Your scrape() returns as soon as it hits the first matching line, so you only ever get one string back. You need to do something like this (untested):
import re

def scrape():
    s = 'part of the string'
    # you have to create the file object
    return [line for line in file_object if s in line]

data = r'>([0-9,]+)<'
dataList = []
for item in scrape():
    dataList.append(re.findall(data, item))
May 23 '07 #3
bvdet
Patrick,

urllib.urlopen(url) creates a file-like object, suitable for iteration.
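For instance (a minimal sketch, untested here):

import urllib

# The object urlopen() returns can be looped over line by line,
# just like an open file.
for line in urllib.urlopen('http://bigcharts.marketwatch.com/markets/'):
    if 'Total' in line:
        print line.strip()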
May 24 '07 #4
ghostdog74
Only tested with the provided data:
# Split each line on whitespace; the cells that hold numbers end up as
# single tokens like ...COLOR="#000000">3,421</font></td><td...
for line in open("file"):
    line = line.strip().split()
    for item in line:
        # keep tokens that contain a value, skipping the "Total" label cell
        if "COLOR" in item and "</font>" in item \
           and not "Total" in item:
            # slice out the text between '">' and '</font>'
            stindex = item.index('">')
            endindex = item.index("</font>")
            print item[stindex+2:endindex]
output:
3,421
1,221
3,189
1,775
1,478,819,000
42,765,000
2,023,516,000
3,448,129,075
May 24 '07 #5
Patrick,

urllib.urlopen(url) creates a file-like object, suitable for iteration.
OK... that I think/thought I could do.

So I tried this:

import urllib2, urllib, re
def test():
    s = '<tr><td bgcolor="EEEEEE"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000"><b>Total</b></font></td><td bgcolor="EEEEEE" align="right"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000">'
    def scrape():
        for line in urllib.urlopen('http://bigcharts.marketwatch.com/markets/'):
             return [line for line in scrape() if s in line]

data = r'>([0-9,]+)<'
datalist = []
for item in test():
    datalist.append(re.findall(data, item)
There are a few things that are funky with this, I'm sure. Here are the problems I'm having...

1. After the last line it doesn't get me out of that function (maybe that's the wrong word; either way, I don't get the >>> prompt).
2. You had suggested I use the line return [line for line in file_object if s in line]
... what would the variable name for my file_object be in this case? I used test() because it calls urllib.urlopen, which you said acts like a file object.


thanks
pc
May 24 '07 #6
OK, I tried your method and had more success than with the other way suggested, but I still ran into some dead ends...

here's what I did:

>>> import urllib2
>>> import re
>>>
>>> def scrape():
...     for line in urllib2.urlopen('http://bigcharts.marketwatch.com/markets/'):
...         line = line.strip().split()
...         for item in line:
...             if "COLOR" in item and "</font>" in item and not "Total" in item:
...                 stindex = item.index('">')
...                 endindex = item.index("</font>")
...                 print item[stindex+2:endindex]
...
>>> print scrape()
<b>Issues:</b>
<b>NYSE</b>
<b>AMEX</b>
<b>Nasdaq</b>
<b>Advancing</b>
613
309
761
529
<b>Declining</b>
2,689
792
2,278
729
<b>Unchanged</b>
110
99
139
415
3,412
1,200
3,178
1,673
26
40
28
134
15
15
33
169
<b>Volume:</b>
<b>Advancing</b>
322,581,000
12,414,000
336,729,000
370,865,638
<b>Declining</b>
1,437,309,000
31,021,000
2,031,162,000
1,712,486,962
<b>Unchanged</b>
11,209,000
2,062,000
-2,345,316,000
508,293,225
1,771,099,000
45,497,000
22,575,000
2,591,645,825
None
>>>

Now that works GREAT, but how can I call just one of the numbers, say for example 45,497,000, or whatever number falls into that place when it gets updated?

If I do return item[stindex+2:endindex] and then do print scrape(), I just get the top line.

Seeing as I want to scrape the data and put it into a file, I think I'll need to return at some point.

How do I return without getting more than just one line?

thanks
pc


May 24 '07 #7
ghostdog74

Now that works GREAT, but, how can I call just one of the numbers, say for example 45,497,000, or whatever number would fall into that place when it gets updated.
I don't understand. Does it mean that you want to get only the numbers that are updated on the website, and you only want to see those that changed from your previous values? Then one way is to store the values somewhere, like a file, a database, or pickled objects; then on the next run, grab the numbers and compare them to the saved data. However, the code will be longer than what I have provided, and you will also need to redefine how you are actually going to search for specific patterns.
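A rough sketch of that save-and-compare idea using pickle (the file name and the sample list are just placeholders):

import os, pickle

DATAFILE = 'lastrun.pkl'   # made-up file name

def load_previous():
    # numbers saved on the last run, or None on the very first run
    if os.path.exists(DATAFILE):
        return pickle.load(open(DATAFILE, 'rb'))
    return None

current = ['3,421', '1,221', '3,189', '1,775']   # your scraped list goes here
previous = load_previous()
if previous is not None and current != previous:
    print 'values changed since last run:', current
pickle.dump(current, open(DATAFILE, 'wb'))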
May 24 '07 #8
bvdet
Patrick,

Here's a little test script similar to what you are doing:
import re, urllib

def scrape():
    patt = re.compile(r'>([0-9,]+)<')
    s = '<tr><td bgcolor="EEEEEE"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000"><b>Total</b></font></td><td bgcolor="EEEEEE" align="right"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000">'
    data = [line for line in urllib.urlopen('http://bigcharts.marketwatch.com/markets/') if s in line]
    dataList = []
    for item in data:
        dataList += patt.findall(item)
    return dataList

numList = scrape()
print numList

'''
>>> ['3,412', '1,200', '3,178', '1,673', '1,771,099,000', '45,497,000', '22,575,000', '2,591,645,825']
'''
Now that you have the data stored in a list, you can do a number of things. I don't like testing for 'in' with a string that long though.
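For example, since the list positions follow the page layout, you can index straight into it, assuming the layout stays put (position 5 held '45,497,000' in the run above):

numList = scrape()
# one value by position
print numList[5]            # '45,497,000' in the run above

# or strip the commas and convert to integers for arithmetic
values = [int(n.replace(',', '')) for n in numList]
print values[5]             # 45497000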
May 25 '07 #9
What I'm shooting for is to make a scraper that, when it runs, gets the numbers that are being constantly updated from the site. The example you gave does that, but I can't seem to figure out how to get just one of the numbers.

I'd then store them in a file and build a database from it.

In the other example that bvdet gave, it prints the numbers out as strings, which is what I need because I only want to reference some of the numbers scraped. Now I'll add a few lines so it writes to a file, something like the sketch below, maybe.
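This is what I have in mind for the file part (untested; the file name is made up) — appending one timestamped row per run to a CSV file:

import csv, time

def save_row(numbers, filename='market_data.csv'):
    # append one row per run: timestamp first, then the scraped numbers
    f = open(filename, 'ab')
    csv.writer(f).writerow([time.strftime('%Y-%m-%d %H:%M:%S')] + numbers)
    f.close()

save_row(['3,412', '1,200', '3,178', '1,673'])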

Does that explain what I'm aiming to do better? Sorry for the confusion.
And thanks a lot for your help, it's very much appreciated.
pc

May 25 '07 #10
