slow import

I'm importing a script that I made and it's literally taking 10+ minutes to run or import into PythonWin.

I've put the script at the bottom, but I'm also having a problem with it.
What I'm trying to do:
1.) Go to the SEC's website and look for recently filed 10-Qs (that's a financial report)
2.) Collect all the links for these new 10-Qs
3.) Add each link to the end of what I call pageroot (which is www.sec.gov)
4.) On each newly formed full web address, go one page at a time and look for a piece of the source code, " <td nowrap="nowrap"><a href= ", which will lead me to the next linked address I need. (The actual 10-Q is 2 or 3 links away from the original search.)
5.) Also write these second-level linked addresses to a file, so that I can check that it is working the intended way
6.) Clean up the linked addresses with a bunch of regexes
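
Steps 2, 3 and 6 above can be sketched with one capturing regex in place of a chain of clean-up patterns. This is only a sketch, assuming Python 3 (the script in this thread targets Python 2.5). The helper name `extract_links` and the sample HTML are made up for illustration and are not the real SEC markup.

```python
import re
from urllib.parse import urljoin

PAGEROOT = "http://www.sec.gov"

# Capture only the text between the quotes of href="...".
HREF = re.compile(r'href="([^"]+)"')


def extract_links(html, marker, root=PAGEROOT):
    """Return absolute URLs for every href on lines that contain `marker`."""
    links = []
    for line in html.splitlines():
        if marker in line:
            for path in HREF.findall(line):
                links.append(urljoin(root, path))
    return links


# Hypothetical sample input; not actual SEC markup.
sample = (
    '<td nowrap="nowrap"><a href="/Archives/example/one.htm">one</a></td>\n'
    '<td>no link here</td>\n'
    '<td nowrap="nowrap"><a href="/Archives/example/two.htm">two</a></td>\n'
)
links = extract_links(sample, '<td nowrap="nowrap"><a href="')
print(links)
```

By contrast, a pattern like `/\S+` keeps matching up to the next whitespace, so it also swallows the closing quote and any tags that follow the link, which is why further clean-up passes become necessary.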

Once I get that working I'll add more, but my problem is this.
It seems to be doing this (purely for example):
"Google, Apple, eBay, and IBM filed 10-Qs, now let's collect a history of 10-Qs filed for just Google"
when it should be doing this:
"Google, Apple, eBay, and IBM filed 10-Qs, now let's collect the link for each of them so that I can redirect my scrape to the actual 10-Q"

If anyone could help I'd be very appreciative.

Here's the code.

import urllib
import re
page = 'http://www.sec.gov/cgi-bin/browse-edgar?company=&CIK=&type=10-Q&owner=include&count=100&action=getcurrent'
raw = []
for line in urllib.urlopen(page):
    if '<td bgcolor="#E6E6E6" valign="top" align="left"><a href="' in line:
        raw.append(line)

codestring = ' '.join(raw)
pattern = re.compile('/\S+')
results = re.findall(pattern, codestring)

pageroot = 'http://www.sec.gov'
count= len(results)

fn = open("c://Python25/tmp.txt", 'w')

line10q = []
number = 0
while number < count:
    newpage = pageroot + results[number]
    for line in urllib.urlopen(newpage):
        if '<td nowrap="nowrap"><a href="' in line:
            line10q.append(line)
        fn.write(line)
    number += 1

fn.close()

line10qstring = ' '.join(line10q)
pattern2 = re.compile('="/\S+">')
results10q = re.findall(pattern, line10qstring)

newstring = ' '.join(results10q)
pattern3 = re.compile('/\S+.htm')
linkresults = re.findall(pattern3, newstring)

pattern4 = re.compile('/\S+.[a-z]{3}"')
linktest2 = ' '.join(linkresults)
link2 = re.findall(pattern4, linktest2)

link2string = ' '.join(link2)
pattern5 = re.compile('/\S+.htm')
link4 = re.findall(pattern5, link2string)
link4string = ' '.join(link4)

linkNumber = len(link4)

Aug 31 '07 #1
It's because all of your code sits at module level, so it is executed when the module is imported. Enclosing it in a function would fix the problem.
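
To see the effect in isolation, here is a small sketch, assuming Python 3. The module name `slow_demo` and its contents are made up for the example: it writes a throwaway module to a temporary directory and imports it, showing that top-level statements run during the import while a function body waits until it is called.

```python
import importlib
import os
import sys
import tempfile

# Source of a throwaway module with a visible top-level side effect.
# The name `slow_demo` is hypothetical.
source = (
    "calls = []\n"
    "calls.append('top-level code ran')\n"
    "def myfunc():\n"
    "    calls.append('myfunc ran')\n"
)

tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "slow_demo.py"), "w") as fh:
    fh.write(source)

sys.path.insert(0, tmpdir)
importlib.invalidate_caches()

slow_demo = importlib.import_module("slow_demo")
after_import = list(slow_demo.calls)  # top-level code has run, myfunc has not
print(after_import)

slow_demo.myfunc()
print(slow_demo.calls)
```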

So change the script you posted to:

import urllib
import re

def myfunc():
    page = 'http://www.sec.gov/cgi-bin/browse-edgar?company=&CIK=&type=10-Q&owner=include&count=100&action=getcurrent'
    # rest of code....

That way you just do:
import myscript
myscript.myfunc()

and you're done!
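
If the file should also still do its work when run directly as a script, a common idiom is the `__name__` guard. A minimal sketch, where `myfunc` is a stand-in that returns a placeholder string instead of doing the real scraping:

```python
def myfunc():
    # Stand-in for the scraping code; the real body would fetch and parse pages.
    return "scrape finished"


if __name__ == "__main__":
    # Runs only when the file is executed as a script, not when it is imported.
    print(myfunc())
```

With this layout, `import myscript` stays fast, and the slow work happens only on an explicit `myscript.myfunc()` call or when the file itself is run.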
Sep 1 '07 #2
