I've put the script at the bottom. But i'm also having a problem with it.
What i'm trying to do:
1.) Go the the SEC's website and look for recently filed 10-q's (that's a financial report)
2.) collect all the links for these new 10-q's
3.) add the link to the end of what i call pageroot (which is www.sec.gov)
4.) on the newly formed full web address go one page at a time and look for a piece in the source code that is " <td nowrap="nowrap"><a href= " which will lead me to the next linked addres i need. (to navigate to the actual 10-q its 2 or 3 links away from the original search)
5.) also write these 2nd linked addresses to a file, so that i can check to make sure that it is working the intended way
6.) clean up the linked addresses with a bunch of regex
now once I get that working i'll add more, but my problem is this...
it seems to be reading to do this: (purely for example)
"google, apple, ebay, and IBM filed 10-q's, now lets collect a history of 10-qs filed for just google"
and again it should be
"google, apple, ebay and IBM filed 10-q's, now lets collect the link for each of them so that I can redirect my scrape to the actual 10-q"
if anyone could help i'd be very appreciative.
here's the code.
Expand|Select|Wrap|Line Numbers
- import urllib
- import re
- page = 'http://www.sec.gov/cgi-bin/browse-edgar?company=&CIK=&type=10-Q&owner=include&count=100&action=getcurrent'
- raw = []
- for line in urllib.urlopen(page):
- if '<td bgcolor="#E6E6E6" valign="top" align="left"><a href="' in line:
- raw.append(line)
- codestring = ' '.join(raw)
- pattern = re.compile('/\S+')
- results = re.findall(pattern, codestring)
- pageroot = 'http://www.sec.gov'
- count= len(results)
- fn = open("c://Python25/tmp.txt", 'w')
- line10q = []
- number = 0
- while number < count:
- newpage = pageroot + results[number]
- for line in urllib.urlopen(newpage):
- if '<td nowrap="nowrap"><a href="' in line:
- line10q.append(line)
- fn.write(line)
- number += 1
- fn.close()
- line10qstring = ' '.join(line10q)
- pattern2 = re.compile('="/\S+">')
- results10q = re.findall(pattern, line10qstring)
- newstring = ' '.join(results10q)
- pattern3 = re.compile('/\S+.htm')
- linkresults = re.findall(pattern3, newstring)
- pattern4 = re.compile('/\S+.[a-z]{3}"')
- linktest2 = ' '.join(linkresults)
- link2 = re.findall(pattern4, linktest2)
- link2string = ' '.join(link2)
- pattern5 = re.compile('/\S+.htm')
- link4 = re.findall(pattern5, link2string)
- link4string = ' '.join(link4)
- linkNumber = len(link4)