design choice: multi-threaded / asynchronous wxpython client?

bullockbefriending bard

I am a complete ignoramus and newbie when it comes to designing and
coding networked clients (or servers for that matter). I have a copy
of Goerzen (Foundations of Python Network Programming) and once
pointed in the best direction should be able to follow my nose and get
things sorted... but I am not quite sure which is the best path to
take and would be grateful for advice from networking gurus.

I am writing a program to display horse racing tote odds in a desktop
client program. I have access to an HTTP (open one of several URLs,
and I get back an XML doc with some data... not XML-RPC.) source of
XML data which I am able to parse and munge with no difficulty at all.
I have written and successfully tested a simple command line program
which allows me to repeatedly poll the server and parse the XML. Easy
enough, but the real world production complications are:

1) The data for the race about to start updates every (say) 15
seconds, and the data for earlier and later races updates only every
(say) 5 minutes. There is no point for me to be hammering the server
with requests every 15 seconds for data for races after the upcoming
race... I should query for this perhaps every 150s to be safe. But for
the upcoming race, I must not miss any updates and should query every
~7s to be safe. So... in the middle of a race meeting the situation
might be:
race 1 (race done with, no longer querying), race 2 (race done with,
no longer querying), race 3 (about to start, data on server for this
race updating every 15s, my client querying every 7s), races 4-8 (data
on server for these races updating every 5 mins, my client querying
every 2.5 mins)

2) After a race has started and betting is cut off and there are
consequently no more tote updates for that race (it is possible to
determine when this occurs precisely because of an attribute in the
XML data), I need to stop querying (say) race 3 every 7s and remove
race 4 from the 150s query group and begin querying its data every 7s.

3) I need to dump this data (for all races, not just current about to
start race) to text files, store it as BLOBs in a DB *and* update real
time display in a wxpython windowed client.

My initial thought was to have two threads for the different update
polling cycles. In addition I would probably need another thread to
handle UI stuff, and perhaps another for dealing with file/DB data
write out. But, I wonder if using Twisted is a better idea? I will
still need to handle some threading myself, but (I think) only for
keeping wxpython happy by doing all this other stuff off the main
thread + perhaps also persisting received data in yet another thread.

I have zero experience with these kinds of design choices and would be
very happy if those with experience could point out the pros and cons
of each (synchronous/multithreaded, or Twisted) for dealing with the
two differing sample rates problem outlined above.

Many TIA!

Jun 27 '08 #1

Larry Bates

IMHO using twisted will give you the best performance and framework. Since it
uses callbacks for every request, your machine could handle a LOT of different
external queries and keep everything updated in WX. It might be a little tricky to
get working with WX, but I recall Googling for something like this not long ago
and there appeared to be sufficient information on how to get it working.

http://twistedmatrix.com/projects/co...g-reactor.html

Twisted even automatically uses threads to keep SQL database storage routines
from blocking (see Chapter 4 of Twisted Network Programming Essentials)

This is an ambitious project, good luck.

-Larry

Jun 27 '08 #2

Eric Wertman

Hi, that does look like a lot of fun... You might consider breaking
that into 2 separate programs. Write one that's threaded to keep a db
updated properly, and write a completely separate one to handle
displaying data from your db. This would allow you to later change or
add a web interface without having to muck with the code that handles
data.

Jun 27 '08 #3

David

1) The data for the race about to start updates every (say) 15
seconds, and the data for earlier and later races updates only every
(say) 5 minutes. There is no point for me to be hammering the server
with requests every 15 seconds for data for races after the upcoming

Try using an HTTP HEAD instruction instead to check if the data has
changed since last time.

Jun 27 '08 #4

bullockbefriending bard

On Apr 27, 10:10 pm, David <wizza...@gmail.com> wrote:

1) The data for the race about to start updates every (say) 15
seconds, and the data for earlier and later races updates only every
(say) 5 minutes. There is no point for me to be hammering the server
with requests every 15 seconds for data for races after the upcoming

Try using an HTTP HEAD instruction instead to check if the data has
changed since last time.

Thanks for the suggestion... am I going about this the right way here?

import urllib2
request = urllib2.Request("http://get-rich.quick.com")
request.get_method = lambda: "HEAD"
http_file = urllib2.urlopen(request)

print http_file.headers

->>>
Age: 0
Date: Sun, 27 Apr 2008 16:07:11 GMT
Content-Length: 521
Content-Type: text/xml; charset=utf-8
Expires: Sun, 27 Apr 2008 16:07:41 GMT
Cache-Control: public, max-age=30, must-revalidate
Connection: close
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
X-AspNet-Version: 1.1.4322
Via: 1.1 jcbw-nc3 (NetCache NetApp/5.5R4D6)

Date is the time of the server response and not of the last data update. Date
is definitely the time of the server's response to my request and bears no
relation to when the live XML data was updated. I know this for a fact
because right now there is no active race meeting and any data still
available is static and many hours old. I would not feel confident
rejecting incoming data as duplicate based only on same content length
criterion. Am I missing something here?

Actually there doesn't seem to be too much difficulty performance-wise
in fetching and parsing (minidom) the XML data and checking the
internal (it's an attribute) update time stamp in the parsed doc. If
timings got really tight, presumably I could more quickly check each
doc's time stamp with SAX (time stamp comes early in data as one might
reasonably expect) before deciding whether to go the whole hog with
minidom if the time stamp has in fact changed since I last polled the
server.

But if there is something I don't get about HTTP HEAD approach, please
let me know as a simple check like this would obviously be a good
thing for me.
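
For what it's worth, the SAX pre-check described above could look roughly like the sketch below. It stops parsing as soon as the time stamp attribute has been seen, so the rest of the document is never processed. The element name `race` and attribute name `updated` are placeholders, since the real feed's names are not given here.

```python
import io
import xml.sax


class _Found(Exception):
    """Raised to abandon parsing once the time stamp has been read."""


class TimestampHandler(xml.sax.ContentHandler):
    def __init__(self, element, attribute):
        super().__init__()
        self.element = element
        self.attribute = attribute
        self.value = None

    def startElement(self, name, attrs):
        # Stop at the first matching element that carries the attribute.
        if name == self.element and self.attribute in attrs:
            self.value = attrs[self.attribute]
            raise _Found()


def peek_timestamp(xml_bytes, element="race", attribute="updated"):
    """Return the time stamp attribute, or None if it is not present.

    The element and attribute names are hypothetical placeholders.
    """
    handler = TimestampHandler(element, attribute)
    try:
        xml.sax.parse(io.BytesIO(xml_bytes), handler)
    except _Found:
        pass
    return handler.value
```

Only if `peek_timestamp()` returns something different from the previously stored value would the document be handed to minidom for the full parse.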

Jun 27 '08 #5

Jorge Godoy

bullockbefriending bard wrote:

3) I need to dump this data (for all races, not just current about to
start race) to text files, store it as BLOBs in a DB *and* update real
time display in a wxpython windowed client.

Why in a BLOB? Why not into specific data types and normalized tables? You
can also save the BLOB for backup or auditing, but this won't allow you to
use your DB to the best of its capabilities... It will just act as a data
container, the same as a network share (which would not penalize you too
much to have connections open/closed).

Jun 27 '08 #6

Jarkko Torppa

On 2008-04-27, David <wi******@gmail.com> wrote:

1) The data for the race about to start updates every (say) 15
seconds, and the data for earlier and later races updates only every
(say) 5 minutes. There is no point for me to be hammering the server
with requests every 15 seconds for data for races after the upcoming

Try using an HTTP HEAD instruction instead to check if the data has
changed since last time.

A GET with If-Modified-Since is still better
(http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html 14.25)
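
A minimal sketch of such a conditional GET with the standard library (Python 3 `urllib`), for illustration only. It assumes the server sends a `Last-Modified` header to echo back, which the headers posted earlier in this thread do not show, so it may not help with this particular feed. The function names are made up for the example.

```python
import urllib.error
import urllib.request


def conditional_request(url, last_modified=None):
    """Build a GET request that lets the server skip an unchanged body."""
    request = urllib.request.Request(url)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    return request


def fetch_if_modified(url, last_modified=None):
    """Return (body, last_modified); body is None on 304 Not Modified."""
    request = conditional_request(url, last_modified)
    try:
        with urllib.request.urlopen(request) as response:
            return response.read(), response.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:  # urllib reports 304 as an HTTPError
            return None, last_modified
        raise
```

The caller keeps the returned `last_modified` value and passes it back in on the next poll.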

--
Jarkko Torppa

Jun 27 '08 #7

BJörn Lindqvist

I think twisted is overkill for this problem. Threading, elementtree
and urllib should more than suffice. One thread polling the server for
each race with the desired polling interval. Each time some data is
treated, that thread sends a signal containing information about what
changed. The gui listens to the signal and will, if needed, update
itself with the new information. The database handler also listens to
the signal and updates the db.
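
A rough sketch of that design, assuming a hypothetical `fetch(race_no)` callable that downloads and parses one race, and plain callables as the "signal" listeners:

```python
import threading


class RacePoller(threading.Thread):
    """Polls one race at its own interval and notifies listeners of changes."""

    def __init__(self, race_no, interval, fetch, listeners):
        super().__init__(daemon=True)
        self.race_no = race_no
        self.interval = interval    # seconds; can be reassigned while running
        self.fetch = fetch          # hypothetical callable: race_no -> parsed data
        self.listeners = listeners  # callables taking (race_no, data)
        self._stop_event = threading.Event()
        self._last = None

    def run(self):
        while not self._stop_event.is_set():
            data = self.fetch(self.race_no)
            if data != self._last:  # only signal real changes
                self._last = data
                for listener in self.listeners:
                    listener(self.race_no, data)
            # Sleep for the interval, but wake immediately if stop() is called.
            self._stop_event.wait(self.interval)

    def stop(self):
        self._stop_event.set()
```

Note that the listeners run in the polling thread, so a wxPython listener should only hand the data over to the GUI thread (for example with `wx.CallAfter`) rather than touch widgets directly. Setting `poller.interval = 7` promotes a race to the fast rate, and it takes effect after the current wait finishes.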

--
mvh Björn

Jun 27 '08 #8

bullockbefriending bard

On Apr 27, 11:27 pm, "BJörn Lindqvist" <bjou...@gmail.com> wrote:

I think twisted is overkill for this problem. Threading, elementtree
and urllib should more than suffice. One thread polling the server for
each race with the desired polling interval. Each time some data is
treated, that thread sends a signal containing information about what
changed. The gui listens to the signal and will, if needed, update
itself with the new information. The database handler also listens to
the signal and updates the db.

So, if I understand you correctly:

Assuming 8 races and we are just about to start race 1, we would
have 8 polling threads, with the race 1 thread polling at a faster rate
than the other ones. After race 1 betting closed, I could dispense with
that thread, change the race 2 thread to poll faster, and so on...? I had
been rather stupidly thinking of just two polling threads, one for the
current race and one for races not yet run... but starting out with a
thread for each extant race seems simpler given there then is no need
to handle the mechanics of shifting the polling of races from the
omnibus slow thread to the current race fast thread.

Having got my minidom parser working nicely, I'm inclined to stick
with it for now while I get other parts of the problem licked into
shape. However, I do take your point that it's probably overkill for
this simple kind of structured, mostly numerical data and will try to
find time to experiment with the elementtree approach later. No harm
at all in shaving the odd second off document parse times.

Jun 27 '08 #9

David

Date is the time of the server response and not last data update. Date
is definitely time of server response to my request and bears no
relation to when the live XML data was updated. I know this for a fact
because right now there is no active race meeting and any data still
available is static and many hours old. I would not feel confident
rejecting incoming data as duplicate based only on same content length
criterion. Am I missing something here?

It looks like the data is dynamically generated on the server, so the
web server doesn't know if/when the data changed. You will usually only
see a modification time for static content (images, HTML files, etc.). You could go by the
Cache-Control line and only fetch data every 30 seconds, but it's
possible for you to miss some updates this way.

Another thing you could try (if necessary, this is a bit of an
overkill) - download the first part of the XML (GET request with a
range header), and check the timestamp you mentioned. If that changed
then re-request the doc (a download resume is risky, the XML might
change between your 2 requests).
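
Roughly, as a sketch only: the attribute name `updated` and the 512-byte prefix are guesses, and the server is free to ignore the Range header and send the whole document.

```python
import re
import urllib.request

# Hypothetical attribute name; adjust to whatever the real feed uses.
STAMP = re.compile(rb'updated="([^"]*)"')


def head_of_document(url, nbytes=512):
    """Ask for only the first nbytes of the document."""
    headers = {"Range": "bytes=0-%d" % (nbytes - 1)}
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        # Status 206 means the range was honoured; 200 means it was ignored
        # and the full body is on its way, so read no more than nbytes.
        return response.read(nbytes)


def stamp_in(prefix):
    """Pull the time stamp out of a raw (possibly truncated) XML prefix."""
    match = STAMP.search(prefix)
    return match.group(1).decode("ascii") if match else None
```

If `stamp_in(head_of_document(url))` differs from the stored stamp, fetch the whole document again with an ordinary GET.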

David.

Jun 27 '08 #10

David

3) I need to dump this data (for all races, not just current about to
start race) to text files, store it as BLOBs in a DB *and* update real
time display in a wxpython windowed client.

A few important questions:

1) How real-time must the display be? (should update immediately after
you get new XML data, or is it ok to update a few seconds later?).

2) How much data is being processed at peak? (100 records a second, 1000?)

3) Does your app need to share fetched data with other apps? If so,
how? (read from db, download HTML, RPC, etc).

4) Does your app need to use data from previous executions? (eg: if
you restart it, does it need to have a fully populated UI, or can it
start from an empty UI and start updating as it downloads new XML
updates).

How you answer the above questions determines what kind of algorithm
will work best.

David.

PS: I suggest that you contact the people you're downloading the XML
from if you haven't already. eg: it might be against their TOS to
constantly scrape data (I assume not, since they provide XML). You
don't want them to black-list your IP address ;-). Also, maybe they
have ideas for efficient data retrieval (eg: RSS feeds).

Jun 27 '08 #11

David

Tempting thought, but one of the problems with this kind of horse
racing tote data is that a lot of it is for combinations of runners
rather than single runners. Whilst there might be (say) 14 horses in a
race, there are 91 quinella price combinations (1-2 through 13-14,
i.e. the 2-subsets of range(1, 15)) and 364 trio price combinations.
It is not really practical (I suspect) to have database tables with
columns for that many combinations?

If you normalise your tables correctly, these will be represented as
one-to-many or many-to-many relationships in your database. Like the
other poster I don't know the first thing about horses, and I may be
misunderstanding something, but here is one (basic) normalised db
schema:

tables & descriptions:

- horse - holds info about each horse
- race - one record per race. Has times, etc
- race_horse - holds records linking horses and races together.

You can derive all possible horse combinations from the above info.
You don't need to store it in the db unless you need to link something
else to it (eg: betting data). In which case:

- combination - represents one combination of horses.
- combination_horse - links a combination to 1 horse. 1 of these per
horse per combination.
- bet - Represents a bet. Has foreign relationship with combination
(and other tables, eg: bettor, race)

With a structure like the above you don't need hundreds of database columns :-)
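
As a rough illustration, here is the same idea in SQLite. The column names, the `pool` column and the use of runner numbers in `combination_horse` are my own guesses, not anything from the real data. The combinations are rows, not columns, and the counts come out at the 91 quinellas and 364 trios mentioned for a 14-horse race.

```python
import itertools
import sqlite3

# Illustrative schema only; table names follow the list above.
SCHEMA = """
CREATE TABLE horse (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE race (id INTEGER PRIMARY KEY, start_time TEXT);
CREATE TABLE race_horse (
    race_id   INTEGER NOT NULL REFERENCES race(id),
    horse_id  INTEGER NOT NULL REFERENCES horse(id),
    runner_no INTEGER NOT NULL,
    PRIMARY KEY (race_id, runner_no)
);
CREATE TABLE combination (
    id      INTEGER PRIMARY KEY,
    race_id INTEGER NOT NULL REFERENCES race(id),
    pool    TEXT NOT NULL               -- e.g. 'quinella' or 'trio'
);
CREATE TABLE combination_horse (
    combination_id INTEGER NOT NULL REFERENCES combination(id),
    runner_no      INTEGER NOT NULL,
    PRIMARY KEY (combination_id, runner_no)
);
"""


def store_combinations(db, race_id, runners, pool, size):
    """Insert one combination row per size-subset of the runner numbers."""
    for combo in itertools.combinations(runners, size):
        cur = db.execute(
            "INSERT INTO combination (race_id, pool) VALUES (?, ?)",
            (race_id, pool),
        )
        db.executemany(
            "INSERT INTO combination_horse VALUES (?, ?)",
            [(cur.lastrowid, runner) for runner in combo],
        )


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
db.execute("INSERT INTO race (id) VALUES (1)")
store_combinations(db, 1, range(1, 15), "quinella", 2)
store_combinations(db, 1, range(1, 15), "trio", 3)
```

A price table keyed on the combination id plus a time stamp would then hold each tote update as one row per combination.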

David.

Jun 27 '08 #12

Bjoern Schliessmann

bullockbefriending bard wrote:

1) The data for the race about to start updates every (say) 15
seconds, and the data for earlier and later races updates only every
(say) 5 minutes. There is no point for me to be hammering the
server with requests every 15 seconds for data for races after the
upcoming race... I should query for this perhaps every 150s to be
safe. But for the upcoming race, I must not miss any updates and
should query every ~7s to be safe. So... in the middle of a race
meeting the situation might be:

I don't fully understand this, but can't you design the server in a
way that you can connect to it and it notifies you about important
things? IMHO, polling isn't ideal.

My initial thought was to have two threads for the different
update polling cycles. In addition I would probably need another
thread to handle UI stuff, and perhaps another for dealing with
file/DB data write out.

No need for any additional threads. UI, networking and file I/O can
operate asynchronously. Using wxPython's timers with callback
functions, you should need only standard Python modules (except
wx).
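
To make that concrete, here is a sketch of the bookkeeping a single timer callback would need, kept free of wx so it can be tried on its own. The class and method names are made up; the 7 s and 150 s intervals are the ones from the original post. A `wx.Timer` firing about once a second would call `due()` from its `EVT_TIMER` handler and start a request for each race returned.

```python
import time

FAST, SLOW = 7.0, 150.0  # polling intervals in seconds, from the original post


class PollSchedule:
    """Tracks when each open race is next due; the first open race polls fast."""

    def __init__(self, race_numbers):
        # 0.0 means "due immediately".
        self.next_due = {n: 0.0 for n in sorted(race_numbers)}

    def interval(self, race_no):
        # The lowest-numbered open race is the one about to start.
        return FAST if race_no == min(self.next_due) else SLOW

    def due(self, now=None):
        """Return the races to poll now and reschedule each of them."""
        now = time.monotonic() if now is None else now
        ready = [n for n, t in self.next_due.items() if t <= now]
        for n in ready:
            self.next_due[n] = now + self.interval(n)
        return ready

    def close(self, race_no):
        """Betting has closed: drop the race and promote the next one."""
        self.next_due.pop(race_no, None)
        if self.next_due:
            # Poll the new current race straight away, then at the fast rate.
            self.next_due[min(self.next_due)] = 0.0
```

One caveat: the fetch started from the timer callback must itself not block for long, otherwise the GUI freezes for the duration of the request.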

But, I wonder if using Twisted is a better idea?

IMHO that's only advisable if you like to create your own protocols and
reuse them in different apps, or need full-featured customisable
implementations of advanced protocols.

Additionally, you'd *have to* use multiple threads: One for the
Twisted event loop and one for the wxPython one.

There is a wxreactor in Twisted which integrates the wxPython event
loop, but I stopped using it due to strange deadlock problems which
began with some wxPython version. Also, it seems it's no longer under
development. But my alternative works perfectly (main thread with
Twisted, and a GUI thread for wxPython, communicating over Python
standard queues).
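
The queue hand-off is nothing fancy. A stripped-down sketch, with neither Twisted nor wx involved and all names invented for the example:

```python
import queue
import threading

updates = queue.Queue()  # thread-safe hand-off between the two event loops


def network_side(results):
    """Stands in for the networking thread: push each parsed result."""
    for item in results:
        updates.put(item)


def drain_updates(apply):
    """Called periodically from the GUI thread; never blocks."""
    handled = 0
    while True:
        try:
            item = updates.get_nowait()
        except queue.Empty:
            return handled
        apply(item)
        handled += 1
```

The GUI side calls `drain_updates()` from a timer or idle handler, so widgets are only ever touched from the GUI thread.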

You'd only need additional threads if you would do heavy number
crunching inside the wxPython or Twisted thread. For the respective
event loop not to hang, it's advisable to use a separate thread for
long-running calculations.

I have zero experience with these kinds of design choices and
would be very happy if those with experience could point out the
pros and cons of each (synchronous/multithreaded, or Twisted) for
dealing with the two differing sample rates problem outlined
above.

I'd favor an "as few threads as necessary" approach. In my experience
this saves pain (i. e. deadlocks and boilerplate queueing code).

Regards,
Björn

--
BOFH excuse #27:

radiosity depletion

Jun 27 '08 #13
