Bytes IT Community

Parsing Html code in Python

P: 54
Hello world, I'm a big-time rookie using Python 2.5 on Windows...

I have a block of HTML to parse that has 2 damn near identical lines. The difference is the numbers in them and they are always changing. The two nearly identical lines are:



<tr><td bgcolor="EEEEEE"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000"><b>Total</b></font></td><td bgcolor="EEEEEE" align="right"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000">3,421</font></td><td bgcolor="EEEEEE" align="right"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000">1,221</font></td><td bgcolor="EEEEEE" align="right"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000">3,189</font></td><td bgcolor="EEEEEE" align="right"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000">1,775</font></td></tr>



And



<tr><td bgcolor="EEEEEE"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000"><b>Total</b></font></td><td bgcolor="EEEEEE" align="right"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000">1,478,819,000</font></td><td bgcolor="EEEEEE" align="right"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000">42,765,000</font></td><td bgcolor="EEEEEE" align="right"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000">2,023,516,000</font></td><td bgcolor="EEEEEE" align="right"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000">3,448,129,075</font></td></tr>



I believe the only difference is the numbers in them, which, again, are constantly changing.



I tried this :



>>> import re, urllib2
>>> def scrape():
...     for line in urllib2.urlopen('http://bigcharts.marketwatch.com/markets/'):
...         if '<tr><td bgcolor="EEEEEE"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000"><b>Total</b></font></td><td bgcolor="EEEEEE" align="right"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000">' in line:
...             return line
...
>>> data = r'>([0-9,]+)<'
>>> dataList = re.findall(data, scrape())
>>> print dataList
['3,421', '1,221', '3,189', '1,775']



In my if statement I matched on the HTML that runs all the way up to the first number. But as you can see, the code only gets the 1st line of numbers, not the 2nd.



Any guess why that might be happening? And how to fix this?

Thanks

Pat
May 23 '07 #1
9 Replies


bartonc
Expert 5K+
P: 6,596
I think that you'll want to look into the re module. Regular Expressions are the way to go for this kind of task.
May 23 '07 #2

bvdet
Expert Mod 2.5K+
P: 2,851
Pat,
Your scrape() returns as soon as it finds the first matching line, so it only ever gives you one string. You need to do something like this (untested):

import re

def scrape():
    s = 'part of the string'
    # you have to create the file object (file_object) yourself
    # collect every matching line instead of returning at the first one
    return [line for line in file_object if s in line]

data = r'>([0-9,]+)<'
dataList = []
for item in scrape():
    # findall() returns the list of numbers found on that line
    dataList.append(re.findall(data, item))
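To make the difference concrete, here is a minimal sketch that needs no network access. It is written for Python 3 rather than the 2.5 used in this thread, and the names PAGE_LINES, MARKER, first_match and all_matches are made up for illustration; the three HTML lines are shortened stand-ins for the real page, not its actual markup.

```python
import re

# Shortened stand-ins for the page; two lines carry the "Total" label.
PAGE_LINES = [
    '<tr><td><b>Advancing</b></td><td>613</td></tr>',
    '<tr><td><b>Total</b></td><td>3,421</td><td>1,221</td></tr>',
    '<tr><td><b>Total</b></td><td>1,478,819,000</td><td>42,765,000</td></tr>',
]
MARKER = '<b>Total</b>'
NUMBER = re.compile(r'>([0-9,]+)<')


def first_match(lines):
    """Like the original scrape(): return ends the loop at the first hit."""
    for line in lines:
        if MARKER in line:
            return line
    return None


def all_matches(lines):
    """Run to the end of the input and keep every line with the marker."""
    return [line for line in lines if MARKER in line]


# Only the first "Total" row is ever seen.
print(NUMBER.findall(first_match(PAGE_LINES)))

# Both rows are seen; extend() keeps the result as one flat list.
numbers = []
for line in all_matches(PAGE_LINES):
    numbers.extend(NUMBER.findall(line))
print(numbers)
```

Using extend() (or +=) gives one flat list of numbers; append() would give one sub-list per matching line, which may or may not be what you want.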
May 23 '07 #3

bvdet
Expert Mod 2.5K+
P: 2,851
Patrick,

urllib.urlopen(url) creates a file-like object, suitable for iteration.
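As a rough illustration of what "file-like, suitable for iteration" means, the sketch below loops over an io.StringIO object in place of a real response, so it runs offline. It is Python 3, and fake_response and its contents are invented for the example. Note that in Python 3 the call lives at urllib.request.urlopen and iterating the response yields bytes rather than str, so the lines would need decoding first.

```python
import io

# Stand-in for the response object; io.StringIO is file-like too.
fake_response = io.StringIO(
    "<html>\n"
    "<tr><td><b>Total</b></td><td>3,421</td></tr>\n"
    "</html>\n"
)

# Iterating a file-like object yields one line at a time, newline included.
total_lines = [line for line in fake_response if "<b>Total</b>" in line]
print(total_lines)
```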
May 24 '07 #4

Expert 100+
P: 511
Only tested with the provided data:

for line in open("file"):
    # break the line into whitespace-separated chunks
    line = line.strip().split()
    for item in line:
        # keep chunks that hold a number cell; skip the "Total" label cell
        if "COLOR" in item and "</font>" in item \
           and "Total" not in item:
            # the number sits between '">' and '</font>'
            stindex = item.index('">')
            endindex = item.index("</font>")
            print item[stindex+2:endindex]

Output:

3,421
1,221
3,189
1,775
1,478,819,000
42,765,000
2,023,516,000
3,448,129,075
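The same slicing idea can hand the numbers back to the caller instead of printing them. The following is only a sketch in Python 3; numbers_in_line and sample are names I made up, and sample is a two-number cut-down of the first "Total" row from the question.

```python
def numbers_in_line(line):
    """Return the text between '">' and '</font>' for each number cell."""
    found = []
    for item in line.strip().split():
        if "COLOR" in item and "</font>" in item and "Total" not in item:
            start = item.index('">') + 2   # first character after '">'
            end = item.index("</font>")
            found.append(item[start:end])
    return found


# Cut-down copy of the first "Total" row (label cell plus two numbers).
sample = (
    '<tr><td bgcolor="EEEEEE"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" '
    'COLOR="#000000"><b>Total</b></font></td>'
    '<td bgcolor="EEEEEE" align="right"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" '
    'COLOR="#000000">3,421</font></td>'
    '<td bgcolor="EEEEEE" align="right"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" '
    'COLOR="#000000">1,221</font></td></tr>'
)

print(numbers_in_line(sample))
```

Building a list and returning it once at the end is what lets a function give back more than one value; a return inside the loop would stop at the first number.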
May 24 '07 #5

P: 54
Patrick,

urllib.urlopen(url) creates a file-like object, suitable for iteration.
OK... that I thought I could do.

So I tried this:

import urllib2, urllib, re
def test():
    s = '<tr><td bgcolor="EEEEEE"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000"><b>Total</b></font></td><td bgcolor="EEEEEE" align="right"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000">'
    def scrape():
        for line in urllib.urlopen('http://bigcharts.marketwatch.com/markets/'):
             return [line for line in scrape() if s in line]

data = r'>([0-9,]+)<'
datalist = []
for item in test():
    datalist.append(re.findall(data, item)

I'm sure a few things are funky with this. Here are the problems I'm having...

1. After the last line it doesn't get me out of that function (maybe that's the wrong word; either way I don't get the >>> prompt back).
2. You had suggested I use the line return [line for line in file_object if s in line]. What would the variable name for my file_object be in this case? I used test() because it calls urllib.urlopen, which you said acts like a file object.


thanks
pc
May 24 '07 #6

P: 54
OK, I tried your method and had more success than with the other way suggested, but still ran into some dead ends...

Here's what I did:

>>> import urllib2
>>> import re
>>> 
>>> def scrape():
...     for line in urllib2.urlopen('http://bigcharts.marketwatch.com/markets/'):
...         line = line.strip().split()
...         for item in line:
...             if "COLOR" in item and "</font>" in item and not "Total" in item:
...                 stindex=item.index('">')
...                 endindex=item.index("</font>")
...                 print item[stindex+2:endindex]
...                 
>>> print scrape()
<b>Issues:</b>
<b>NYSE</b>
<b>AMEX</b>
<b>Nasdaq</b>
<b>Advancing</b>
613
309
761
529
<b>Declining</b>
2,689
792
2,278
729
<b>Unchanged</b>
110
99
139
415
3,412
1,200
3,178
1,673
26
40
28
134
15
15
33
169
<b>Volume:</b>
<b>Advancing</b>
322,581,000
12,414,000
336,729,000
370,865,638
<b>Declining</b>
1,437,309,000
31,021,000
2,031,162,000
1,712,486,962
<b>Unchanged</b>
11,209,000
2,062,000
-2,345,316,000
508,293,225
1,771,099,000
45,497,000
22,575,000
2,591,645,825
None
>>> 

Now that works GREAT, but how can I call up just one of the numbers, say for example 45,497,000, or whatever number would fall into that place when it gets updated?

If I do return item[stindex+2:endindex]
Then do print scrape()
I just get the top line.

Since I want to scrape the data and put it into a file, I think I'll need to return at some point.

How do I return and get more than just 1 line?

thanks
pc


May 24 '07 #7

Expert 100+
P: 511

Now that works GREAT, but, how can I call just one of the numbers, say for example 45,497,000, or whatever number would fall into that place when it gets updated.
I don't understand. Does it mean that you only want the numbers that have been updated on the website since your previous values? If so, one way is to store the values somewhere, like a file, a database, or pickled objects; then on the next run, grab the numbers and compare them to the saved data. However, the code will be longer than what I have provided, and you will need to redefine how you are actually going to search for specific patterns.
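The store-and-compare idea might look roughly like this. It is a Python 3 sketch under my own assumptions: changed_positions and the file name last_run.json are hypothetical, and JSON is used instead of pickle purely because the saved file stays readable in a text editor.

```python
import json
import os
import tempfile


def changed_positions(new_values, path):
    """Compare new_values with the list saved at path, then overwrite it.

    Returns the indexes whose value differs from the previous run; on the
    first run (no file yet) every index counts as changed.
    """
    try:
        with open(path) as fh:
            old_values = json.load(fh)
    except FileNotFoundError:
        old_values = []
    changed = [i for i, value in enumerate(new_values)
               if i >= len(old_values) or old_values[i] != value]
    with open(path, "w") as fh:
        json.dump(new_values, fh)
    return changed


# Hypothetical file name; the temp directory keeps the demo self-cleaning.
with tempfile.TemporaryDirectory() as tmp:
    store = os.path.join(tmp, "last_run.json")
    first = changed_positions(["3,421", "1,221"], store)
    second = changed_positions(["3,412", "1,221"], store)
    print(first, second)
```

In real use the list passed in would come from the scraper, and the file would live somewhere permanent rather than in a temporary directory.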
May 24 '07 #8

bvdet
Expert Mod 2.5K+
P: 2,851
Patrick,

Here's a little test script similar to what you are doing:
import re
import urllib

def scrape():
    patt = re.compile(r'>([0-9,]+)<')
    s = '<tr><td bgcolor="EEEEEE"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000"><b>Total</b></font></td><td bgcolor="EEEEEE" align="right"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000">'
    # keep every line of the page that contains the marker string
    data = [line for line in urllib.urlopen('http://bigcharts.marketwatch.com/markets/') if s in line]
    dataList = []
    for item in data:
        # += adds the numbers themselves, so dataList stays one flat list
        dataList += patt.findall(item)
    return dataList

numList = scrape()
print numList

'''
>>> ['3,412', '1,200', '3,178', '1,673', '1,771,099,000', '45,497,000', '22,575,000', '2,591,645,825']
'''
Now that you have the data stored in a list, you can do a number of things. I don't like testing for 'in' with a string that long though.
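For instance, one number can be picked out by position and converted to an integer, and the membership test can use a much shorter marker. This is a Python 3 sketch; to_int, num_list, picked and row are illustrative names, and the short marker assumes that only the rows of interest carry a bold "Total" label, which I have not checked against the live page.

```python
import re

NUMBER = re.compile(r'>([0-9,]+)<')


def to_int(text):
    """Turn a string such as '45,497,000' into the integer 45497000."""
    return int(text.replace(",", ""))


# The list shown in the output of the script above.
num_list = ['3,412', '1,200', '3,178', '1,673',
            '1,771,099,000', '45,497,000', '22,575,000', '2,591,645,825']

# Index 5 is the second number of the second "Total" row.
picked = to_int(num_list[5])
print(picked)

# A shorter test than the full HTML prefix: just the bold "Total" label.
row = '<tr><td><b>Total</b></td><td>45,497,000</td></tr>'
if '<b>Total</b>' in row:
    print(NUMBER.findall(row))
```

Indexing by position only stays correct while the page keeps its rows and columns in the same order, so it is worth checking the length of the list before trusting a fixed index.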
May 25 '07 #9

P: 54
What I'm shooting for is to make a scraper that, when it runs, gets the numbers that are being constantly updated on the site. The example you gave does, but I can't seem to figure out how to get just one of the numbers.

I'd then store them in a file and build a database from it.

The other example, the one bvdet gave, puts the numbers in a list, which is what I need because I only want to reference some of the numbers scraped. Now I'll add a few lines so it writes to a file.

Does that explain what I'm aiming to do better? Sorry for the confusion.
And thanks a lot for your help, it's very much appreciated.
pc

May 25 '07 #10
