slow import - Python

I'm importing a script that I made, and it literally takes 10+ minutes to run or import into PythonWin.

I've put the script at the bottom, but I'm also having a problem with it.
What I'm trying to do:
1.) Go to the SEC's website and look for recently filed 10-Qs (that's a financial report)
2.) Collect all the links for these new 10-Qs
3.) Add each link to the end of what I call pageroot (which is www.sec.gov)
4.) On the newly formed full web address, go one page at a time and look for a piece in the source code that is " <td nowrap="nowrap"><a href= ", which will lead me to the next linked address I need (the actual 10-Q is 2 or 3 links away from the original search)
5.) Also write these second-level linked addresses to a file, so that I can check that it is working the intended way
6.) Clean up the linked addresses with a bunch of regexes

Once I get that working I'll add more, but my problem is this.
It seems to be doing this (purely as an example):
"Google, Apple, eBay, and IBM filed 10-Qs, now let's collect a history of 10-Qs filed for just Google"
when it should be doing this:
"Google, Apple, eBay, and IBM filed 10-Qs, now let's collect the link for each of them so that I can redirect my scrape to the actual 10-Q"

If anyone could help, I'd be very appreciative.

Here's the code.


 
import urllib
import re

page = 'http://www.sec.gov/cgi-bin/browse-edgar?company=&CIK=&type=10-Q&owner=include&count=100&action=getcurrent'
raw = []
for line in urllib.urlopen(page):
    if '<td bgcolor="#E6E6E6" valign="top" align="left"><a href="' in line:
        raw.append(line)

codestring = ' '.join(raw)
pattern = re.compile('/\S+')
results = re.findall(pattern, codestring)

pageroot = 'http://www.sec.gov'
count = len(results)

fn = open("c://Python25/tmp.txt", 'w')

line10q = []
number = 0
while number < count:
    newpage = pageroot + results[number]
    for line in urllib.urlopen(newpage):
        if '<td nowrap="nowrap"><a href="' in line:
            line10q.append(line)
        fn.write(line)
    number += 1

fn.close()

line10qstring = ' '.join(line10q)
pattern2 = re.compile('="/\S+">')
results10q = re.findall(pattern, line10qstring)

newstring = ' '.join(results10q)
pattern3 = re.compile('/\S+.htm')
linkresults = re.findall(pattern3, newstring)

pattern4 = re.compile('/\S+.[a-z]{3}"')
linktest2 = ' '.join(linkresults)
link2 = re.findall(pattern4, linktest2)

link2string = ' '.join(link2)
pattern5 = re.compile('/\S+.htm')
link4 = re.findall(pattern5, link2string)
link4string = ' '.join(link4)

linkNumber = len(link4)

Aug 31 '07 #1


William Manley

It's because your code sits at the top level of the script, so all of it (including every page download) is executed when the script is imported. Enclosing it in a function would fix the problem.
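To see the effect in isolation, here is a minimal sketch. It is written for Python 3 (unlike the Python 2 script in the question), and the module name `demo_slow_module` and its contents are made up purely for illustration. It writes a tiny module to a temporary directory, imports it, and shows that the top-level statement runs during the import while the function body waits until it is called.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical module source: one top-level statement and one function.
MODULE_SOURCE = '''
events = []
events.append("top-level code ran")   # executed as soon as the module is imported

def myfunc():
    events.append("myfunc ran")       # executed only when myfunc() is called
'''


def import_demo_module():
    """Write the hypothetical module to a temp directory and import it."""
    tmpdir = tempfile.mkdtemp()
    with open(os.path.join(tmpdir, "demo_slow_module.py"), "w") as handle:
        handle.write(MODULE_SOURCE)
    sys.path.insert(0, tmpdir)
    importlib.invalidate_caches()  # make sure the freshly written file is visible
    try:
        return importlib.import_module("demo_slow_module")
    finally:
        sys.path.remove(tmpdir)


demo = import_demo_module()
print(demo.events)   # at this point only the top-level statement has run
demo.myfunc()
print(demo.events)   # the function body has now run as well
```

In the script from the question, the "top-level statement" is the whole scrape, which is why the import itself takes so long.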

So change this:


 
import urllib
import re

page = 'http://www.sec.gov/cgi-bin/browse-edgar?company=&CIK=&type=10-Q&owner=include&count=100&action=getcurrent'
raw = []
for line in urllib.urlopen(page):
    if '<td bgcolor="#E6E6E6" valign="top" align="left"><a href="' in line:
        raw.append(line)

codestring = ' '.join(raw)
pattern = re.compile('/\S+')
results = re.findall(pattern, codestring)

pageroot = 'http://www.sec.gov'
count = len(results)

fn = open("c://Python25/tmp.txt", 'w')

line10q = []
number = 0
while number < count:
    newpage = pageroot + results[number]
    for line in urllib.urlopen(newpage):
        if '<td nowrap="nowrap"><a href="' in line:
            line10q.append(line)
        fn.write(line)
    number += 1

fn.close()

line10qstring = ' '.join(line10q)
pattern2 = re.compile('="/\S+">')
results10q = re.findall(pattern, line10qstring)

newstring = ' '.join(results10q)
pattern3 = re.compile('/\S+.htm')
linkresults = re.findall(pattern3, newstring)

pattern4 = re.compile('/\S+.[a-z]{3}"')
linktest2 = ' '.join(linkresults)
link2 = re.findall(pattern4, linktest2)

link2string = ' '.join(link2)
pattern5 = re.compile('/\S+.htm')
link4 = re.findall(pattern5, link2string)
link4string = ' '.join(link4)

linkNumber = len(link4)

to this:

 
import urllib
import re

def myfunc():
    page = 'http://www.sec.gov/cgi-bin/browse-edgar?company=&CIK=&type=10-Q&owner=include&count=100&action=getcurrent'
    # rest of code....

That way you just do:


 
import myscript
myscript.myfunc()

and you're done!
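For what it's worth, here is a rough sketch of how the function-wrapped version might be laid out. Treat it as an illustration under assumptions, not the finished script: it is written for Python 3, the names `extract_links`, `main`, `LINK_PATTERN` and `PAGEROOT` are invented here, and the sample HTML lines (including the `/Archives/example/...` path) are hypothetical stand-ins for a downloaded page so that nothing touches the network. It also swaps the chain of five patterns for a single regex with a capture group, which is a simplification of my own rather than what the original script does.

```python
import re

# Marker string taken from the script in the question; the capture group
# grabs the value of the href attribute that follows it.
LINK_PATTERN = re.compile(r'<td nowrap="nowrap"><a href="([^"]+)"')
PAGEROOT = 'http://www.sec.gov'


def extract_links(lines):
    """Return absolute URLs for every marked link in an iterable of HTML lines."""
    links = []
    for line in lines:
        for href in LINK_PATTERN.findall(line):
            links.append(PAGEROOT + href)
    return links


def main():
    # Hypothetical sample lines standing in for one downloaded page; the real
    # script would iterate over the response returned by urlopen() instead.
    sample = [
        '<td nowrap="nowrap"><a href="/Archives/example/filing-index.htm">10-Q</a></td>',
        '<td>no link on this line</td>',
    ]
    for link in extract_links(sample):
        print(link)


if __name__ == "__main__":
    # This branch runs only when the file is executed directly,
    # not when it is imported from PythonWin or another module.
    main()
```

With the `if __name__ == "__main__":` guard in place, importing the file only defines the functions, and you decide when the slow part actually runs by calling `main()` yourself.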

Sep 1 '07 #2
