help: SIGABRT intermittent crash for threaded website crawler on Python 2.4.4c1

I've been experiencing an intermittent crash where no Python
stack trace is provided. It happens in a URL downloading process that
can last up to 12 hours and crawls about 50,000 URLs.

I'm using urllib2 for the downloads. There are 5-10 downloading
threads, and some custom website exploration code for providing the
urls to crawl.

The downloads are completed in memory (not piped), then saved to a
file. Per-domain / per-IP politeness rules are also enforced, so many
downloads and exploration tasks are either waiting or in progress at
any moment, sometimes up to 40 at once. As a result, I've seen the
process memory footprint climb to upwards of 800 MB.
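
For readers unfamiliar with per-domain politeness, here is a minimal sketch of the idea. It is not the poster's code: `DomainGate`, `min_delay`, and `reserve` are hypothetical names, the 2-second spacing is an arbitrary assumption, and it targets current Python 3 rather than 2.4.4c1.

```python
import threading
import time
from urllib.parse import urlsplit


class DomainGate:
    """Hands out per-host start times so requests to one host are spaced out."""

    def __init__(self, min_delay=2.0, clock=time.monotonic):
        self.min_delay = min_delay   # assumed spacing between hits on one host
        self._clock = clock          # injectable so the gate can be tested
        self._next_ok = {}           # host -> earliest allowed start time
        self._lock = threading.Lock()

    def reserve(self, url):
        """Return how many seconds the caller should sleep before fetching url."""
        host = urlsplit(url).hostname or ""
        with self._lock:
            now = self._clock()
            start = max(now, self._next_ok.get(host, now))
            # Book the slot now so concurrent callers queue up behind it.
            self._next_ok[host] = start + self.min_delay
            return start - now
```

A download thread would call `time.sleep(gate.reserve(url))` before fetching. Because the slot is booked under the lock, several threads targeting the same host end up waiting in turn, which is one way a crawler can have many tasks "waiting" at once.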

About 20-40% of the time, the entire process bails out with no
stack trace, at no consistent memory usage or running time, sometimes
after as little as 2 hours. My guess is that there is a bug in
urllib2 or some third-party software I'm using, or that it was not
meant to be run in a multithreaded environment. Decreasing the
bandwidth/aggressiveness of the crawler MAY have an effect on the
frequency, but I haven't done any formal 'studies' on that yet. My
current solution is to restart the crawler, but this is unfair to
the websites (recrawling) and costs me extra crawl time.

I bet that if I switch to a 1-download-per-process scenario with Pyro
for IPC (to uphold niceness rules, etc.), I will fix this situation, as
I suspect from reading about similar SIGABRT issues that it has
something to do with the multi-threading. But I figured I'd ask around
before I take such drastic measures.
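
To illustrate why process isolation would at least contain the crash, here is a small sketch. It uses the standard `subprocess` module rather than Pyro, assumes a POSIX system and current Python 3, and `run_isolated` and `aborted` are hypothetical helper names.

```python
import signal
import subprocess
import sys


def run_isolated(source, timeout=60):
    """Run a Python snippet in a child interpreter and return its exit status.

    On POSIX, a negative value -N means the child was killed by signal N.
    """
    proc = subprocess.run([sys.executable, "-c", source], timeout=timeout)
    return proc.returncode


def aborted(returncode):
    """True if the exit status says the child died from SIGABRT (POSIX only)."""
    return returncode == -signal.SIGABRT
```

If a child dies from SIGABRT, the parent sees a negative return code and can requeue just that one URL instead of restarting the whole crawl.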

Since the process is so long-running, I have not tried running strace,
and I'm not even sure if it would make sense to me or someone else.
Let me know if you have a method of catching just the last 1000 calls
and not saving earlier ones or whatever, if that would be useful.
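
One way to keep only the last 1000 lines of a long trace is a fixed-size ring buffer. The sketch below is an illustration in current Python 3 (the `maxlen` argument to `deque` is not available in 2.4), and `tail_stream` is a hypothetical name.

```python
from collections import deque


def tail_stream(src, dst, limit=1000):
    """Read lines from src until EOF, then write only the last `limit` to dst.

    deque(maxlen=...) silently discards the oldest line once it is full,
    so memory use stays bounded however long the trace runs.
    """
    buffer = deque(src, maxlen=limit)
    for line in buffer:
        dst.write(line)
    return len(buffer)
```

A script that ends with `tail_stream(sys.stdin, sys.stdout)` could have the trace piped into it, for example from `strace -f -p <pid> 2>&1`. The buffered lines are printed once the traced process exits and the pipe closes.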

I'm using an older version of Python, 2.4.4c1. Since the bug is
intermittent, I'm not sure yet if an upgrade to Python 2.5 has solved
my problem.

Does anyone have any clues for me to try? My threading code uses a
messaging queue per thread, and one notification queue that the main
thread checks and assigns new crawls back to free threads. No other
variables are referenced by multiple threads other than the thread
objects themselves (to my knowledge).
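
For reference, the threading layout described above can be sketched roughly as follows. This is an assumed reconstruction in current Python 3, not the actual crawler code: `worker`, `crawl`, and the `fetch` callable are hypothetical names.

```python
import queue
import threading


def worker(worker_id, inbox, notify, fetch):
    """Pull URLs from this worker's private inbox until a None sentinel arrives."""
    while True:
        url = inbox.get()
        if url is None:
            break
        try:
            result = fetch(url)
        except Exception as exc:   # report failures instead of dying silently
            result = exc
        notify.put((worker_id, url, result))


def crawl(urls, fetch, num_workers=3):
    """Main-thread dispatcher: hand one URL at a time to whichever worker is free."""
    notify = queue.Queue()
    inboxes = [queue.Queue() for _ in range(num_workers)]
    threads = [
        threading.Thread(target=worker, args=(i, inboxes[i], notify, fetch))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()

    pending = list(urls)
    results = {}
    busy = 0
    for inbox in inboxes:          # prime each worker with one URL
        if pending:
            inbox.put(pending.pop())
            busy += 1
    while busy:
        worker_id, url, result = notify.get()
        results[url] = result
        if pending:                # the worker that just reported is free again
            inboxes[worker_id].put(pending.pop())
        else:
            busy -= 1
    for inbox in inboxes:          # tell every worker to exit
        inbox.put(None)
    for t in threads:
        t.join()
    return results
```

In this layout the only objects shared between threads are the queues, which are thread-safe. That matches the claim that no other variables are referenced by multiple threads.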
Oct 3 '08 #1
