
Async Client with 1K connections?

Before I take the plunge, I'd appreciate any advice on the feasibility
and degree of difficulty of the following...

I need extremely efficient and robust _client_ software for some
common protocols like HTTP and POP3, supporting 1,000 simultaneous
independent connections and commensurate network throughput. The data
get written to files or sockets, so no GUI needed.

I am not a Python programmer :-( but I am a "fan" :-) and I have been
reading about asyncore/Medusa/Twisted -- which would be my best bet?

Any advantage to using a particular unix/version -- Linux 32/64bit?
FreeBSD 4/5? Solaris Sun/Intel?

If anyone who is expert in this area may be available, please contact
me at "w c h a n g at a f f i n i dot com". (I'm in the SF Bay Area.)

My background is C -- I was the principal author of Infoseek (RIP),
including the Python modularization that was the core of Ultraseek aka
Inktomi Enterprise Search aka Verity. (For those of you old enough to
remember!) Unfortunately, I moved upstairs and never did much Python.

Thanks in advance, --William
Jul 18 '05 #1
wi***********@hotmail.com (William Chang) writes:
> I need extremely efficient and robust _client_ software for some
> common protocols like HTTP and POP3, supporting 1,000 simultaneous
> independent connections and commensurate network throughput. The data
> get written to files or sockets, so no GUI needed.

You're writing a monstrous web spider in Python?

> I am not a Python programmer :-( but I am a "fan" :-) and I have been
> reading about asyncore/Medusa/Twisted -- which would be my best bet?


With enough hardware, you can do practically anything. Some Apache
servers fork off that many processes.
Jul 18 '05 #2
wi***********@hotmail.com (William Chang) writes:
> I need extremely efficient and robust _client_ software for some
> common protocols like HTTP and POP3, supporting 1,000 simultaneous
> independent connections and commensurate network throughput. The data
> get written to files or sockets, so no GUI needed.
>
> I am not a Python programmer :-( but I am a "fan" :-) and I have been
> reading about asyncore/Medusa/Twisted -- which would be my best bet?

Seriously, I'd probably use asyncore since it's the simplest. Twisted
is more flexible but maybe you don't need that.
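
For what it's worth, the core of an asyncore-based HTTP client is quite
small. A minimal sketch (Python 2, as used elsewhere in this thread; the
host names are placeholders and error handling is omitted):

    import asyncore, socket

    class HTTPClient(asyncore.dispatcher):
        """One dispatcher per connection; asyncore.loop() drives them all."""

        def __init__(self, host, path):
            asyncore.dispatcher.__init__(self)
            self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
            self.connect((host, 80))
            self.buffer = 'GET %s HTTP/1.0\r\nHost: %s\r\n\r\n' % (path, host)
            self.data = []

        def handle_connect(self):
            pass

        def writable(self):
            # only ask for write events while part of the request is unsent
            return len(self.buffer) > 0

        def handle_write(self):
            sent = self.send(self.buffer)
            self.buffer = self.buffer[sent:]

        def handle_read(self):
            self.data.append(self.recv(8192))

        def handle_close(self):
            self.close()   # the response (headers + body) is in self.data

    # create as many dispatchers as you want concurrent fetches,
    # then let a single loop multiplex them all
    for host in ('www.example.com', 'www.example.org'):
        HTTPClient(host, '/')
    asyncore.loop()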

Why do you want to write this client in Python? What is it doing?

Rather than going crazy tuning the software, you can parallelize it
and run it on multiple boxes. Does that work for you?

> Any advantage to using a particular unix/version -- Linux 32/64bit?
> FreeBSD 4/5? Solaris Sun/Intel?


Google has something like 8000 servers in its farm, running 32 bit
Linux, so they're probably onto something. Solaris is a lot slower.
64 bit Linux is maybe too new to deploy in some big production system.
Jul 18 '05 #3
[P&M]

William Chang wrote:
> Before I take the plunge, I'd appreciate any advice on the feasibility
> and degree of difficulty of the following...
>
> I need extremely efficient and robust _client_ software for some
> common protocols like HTTP and POP3, supporting 1,000 simultaneous
> independent connections

I've got an httpd stress tool that uses asyncore. I can run up 1020
independent simulated clients on my RH9 box (1x3GHz CPU, 1GB RAM),
driving at over 600 requests per second against a modest (2x1GHz)
webserver, just pulling a static page.

> and commensurate network throughput.

That could vary a lot, couldn't it?

> The data get written to files or sockets, so no GUI needed.

Writing to files could slow you down a lot, depending on how much needs
to be written, how fast your disks are, how you go about getting the
data from the async client to the file, etc.. Much of the same goes for
sockets, too.

> I am not a Python programmer :-( but I am a "fan" :-) and I have been
> reading about asyncore/Medusa/Twisted -- which would be my best bet?

I should think all can do the job for you, depending on the details
which you haven't told us.

> Any advantage to using a particular unix/version -- Linux 32/64bit?
> FreeBSD 4/5? Solaris Sun/Intel?
>
> If anyone who is expert in this area may be available, please contact
> me at "w c h a n g at a f f i n i dot com". (I'm in the SF Bay Area.)
>
> My background is C -- I was the principal author of Infoseek (RIP),
> including the Python modularization that was the core of Ultraseek aka
> Inktomi Enterprise Search aka Verity. (For those of you old enough to
> remember!) Unfortunately, I moved upstairs and never did much Python.
>
> Thanks in advance, --William

Jul 18 '05 #4
Hi!

See Erlang: its sample web server can serve more than 50,000 connections on
one standard CPU.


Jul 18 '05 #5
> Before I take the plunge, I'd appreciate any advice on the feasibility
> and degree of difficulty of the following...
>
> I need extremely efficient and robust _client_ software for some
> common protocols like HTTP and POP3, supporting 1,000 simultaneous
> independent connections and commensurate network throughput. The data
> get written to files or sockets, so no GUI needed.

1000+ connections is not a problem, although (on Linux at least, and probably
others) you'll need to make sure your process is allowed to have more file
descriptors open, especially if you're turning around and writing data to
disk (since that uses file descriptors too). This is OS-specific and has
nothing to do with Python, but IIRC you can do something like
os.sysconf(os.sysconf_names['SC_OPEN_MAX']) to see how many fd's your process
can have open.
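
For example (a sketch -- the resource-module calls below are an addition for
illustration, not something from the post above):

    import os, resource   # resource is Unix-only

    # What the OS reports as the per-process limit on open file descriptors.
    print(os.sysconf(os.sysconf_names['SC_OPEN_MAX']))

    # The same limit as seen (and adjustable) through setrlimit: raise the
    # soft limit toward the hard limit before opening ~1000 sockets plus files.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    desired = 4096
    if hard != resource.RLIM_INFINITY:
        desired = min(desired, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (desired, hard))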

> I am not a Python programmer :-( but I am a "fan" :-) and I have been
> reading about asyncore/Medusa/Twisted -- which would be my best bet?


You're probably going to be ok either way, but what are your throughput
requirements exactly? Are these connections pulling down HTML pages and small
images or are they big, multi-megabyte downloads? How big is your connection?
For 99% of uses asyncore or Twisted will be fine - but if you need very high
numbers of new connections per second (hundreds) or throughput (hundreds of
Mbps) then you might need to modify the framework or build your own - still in
Python but more tailored to your specific needs - in order to get those levels
of performance.

-Dave
Jul 18 '05 #6
Paul Rubin wrote:

> wi***********@hotmail.com (William Chang) writes:
>> I need extremely efficient and robust _client_ software for some
>> common protocols like HTTP and POP3, supporting 1,000 simultaneous
>> independent connections and commensurate network throughput. The data
>> get written to files or sockets, so no GUI needed.
>>
>> I am not a Python programmer :-( but I am a "fan" :-) and I have been
>> reading about asyncore/Medusa/Twisted -- which would be my best bet?
>
> Seriously, I'd probably use asyncore since it's the simplest. Twisted
> is more flexible but maybe you don't need that.


I agree Twisted is more flexible, but having tried both I'd argue that
it is also simpler. I was able to get farther, faster, just by following
the simple examples (e.g. http://www.twistedmatrix.com/documents/howto/clients)
on the web site than I was with asyncore. I also found the source
_much_ cleaner and more readable when it came time to look there as well.
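
For comparison, a client in the spirit of that howto boils down to a Protocol
plus a ClientFactory. A rough sketch (Python-2-era Twisted; the host and the
request are placeholders):

    from twisted.internet import reactor, protocol

    class PageGetter(protocol.Protocol):
        def connectionMade(self):
            # fire off a minimal HTTP request once the connection is up
            self.transport.write('GET / HTTP/1.0\r\nHost: www.example.com\r\n\r\n')

        def dataReceived(self, data):
            # chunks arrive as they come off the wire; a real client would parse them
            self.factory.received += len(data)

    class PageGetterFactory(protocol.ClientFactory):
        protocol = PageGetter
        received = 0

        def clientConnectionLost(self, connector, reason):
            reactor.stop()

        def clientConnectionFailed(self, connector, reason):
            reactor.stop()

    reactor.connectTCP('www.example.com', 80, PageGetterFactory())
    reactor.run()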

-Peter
Jul 18 '05 #7
Bill Scherer <Bi**********@verizonwireless.com> writes:
>> The data get written to files or sockets, so no GUI needed.
>
> Writing to files could slow you down a lot, depending on how much
> needs to be written, how fast your disks are, how you go about
> getting the data from the async client to the file, etc.. Much of the
> same goes for sockets, too.


That's a good point: you should put everything into one file serially,
then sort it afterwards to separate out the data from individual
connections.
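
For what it's worth, one way to do that (a sketch only -- the record layout
and function names are invented for illustration, and real page bodies would
need length-prefixed or escaped records rather than one line each):

    import itertools

    # During the crawl: every completed fetch appends one tab-separated line,
    # keyed by a connection id, to a single shared log file.
    def record(log, conn_id, url, nbytes):
        log.write('%s\t%s\t%d\n' % (conn_id, url, nbytes))

    # Afterwards: sort so each connection's lines are contiguous, then regroup.
    def regroup(path):
        lines = sorted(open(path))
        for conn_id, rows in itertools.groupby(lines, lambda line: line.split('\t')[0]):
            yield conn_id, list(rows)
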
Jul 18 '05 #8
Thank you all for the discussion! Some additional information:

One of the intended uses is indeed a next-gen web spider. I did the
math, and yes I will need about 10 cutting-edge PCs to spider like
you-know-who. But I shouldn't need 100 -- and would rather not
spend money unnecessarily... Throughput per PC would be on
the order of 1MB/s assuming 200x5KB downloads/sec using 1-2000
simultaneous connections. (That's 17M pages per day per PC.)
My search & content engine can index and store at such a rate,
but can the spider initiate (at least) 200 new requests per second,
assuming each request lasts 5-10 seconds?
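
Checking that arithmetic (all figures taken from the paragraph above):

    # 200 downloads/sec x 5 KB each ~= 1 MB/s of throughput per PC
    requests_per_sec = 200
    page_kb = 5
    throughput_kb = requests_per_sec * page_kb        # 1000 KB/s

    # sustained for a day: 200 * 86400 = 17,280,000 pages/day per PC
    pages_per_day = requests_per_sec * 86400

    # Little's law: connections in flight = request rate * request duration,
    # so 5-10 s per request at 200 req/s means 1000-2000 open connections.
    in_flight = (requests_per_sec * 5, requests_per_sec * 10)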

Of course, that assumes the spider algorithm/coordinator is pretty
intelligent and well-engineered, and that the hardware stays up, etc.
Managing storage is certainly nontrivial; at such a scale nothing is
to be taken for granted!

Nevertheless, it shouldn't cost millions. Maybe $100K :-)

Time for a sanity check? --William


Jul 18 '05 #9
"William Chang" <wi***********@hotmail.com> writes:
> Thank you all for the discussion! Some additional information:
>
> One of the intended uses is indeed a next-gen web spider. I did the
> math, and yes I will need about 10 cutting-edge PCs to spider like
> you-know-who. But I shouldn't need 100 -- and would rather not
> spend money unnecessarily... Throughput per PC would be on
> the order of 1MB/s assuming 200x5KB downloads/sec using 1-2000
> simultaneous connections. (That's 17M pages per day per PC.)

That's orders of magnitude less than you-know-who. Also, don't forget
how many queries you have to take from users, and the amount of disk seeks
needed for each one.

> Nevertheless, it shouldn't cost millions. Maybe $100K :-)


10 MB/s of internet connectivity is at least a few K$/month all by itself.
Jul 18 '05 #10
In article <ze********************@comcast.com>,
William Chang <wi***********@hotmail.com> wrote:

> One of the intended uses is indeed a next-gen web spider. I did the
> math, and yes I will need about 10 cutting-edge PCs to spider like
> you-know-who.


Note that while you-know-who makes extensive use of Python, I don't
think they're using it for spidering/searching. I do have some
background writing a spider in Python, using Verity's engine for
indexing/retrieval, but we were using threading rather than
asyncore-style operations.
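
For reference, the threaded approach is only a few lines in outline (a sketch,
not the actual spider; the URLs, thread count, and worker body are
placeholders):

    import threading, Queue, urllib2

    urls = Queue.Queue()
    for u in ('http://www.example.com/', 'http://www.example.org/'):
        urls.put(u)

    def worker():
        while True:
            try:
                url = urls.get_nowait()
            except Queue.Empty:
                return
            try:
                body = urllib2.urlopen(url).read()
                # hand `body` off to the indexer here
            except Exception:
                pass   # a real spider would log the failure and maybe retry

    threads = [threading.Thread(target=worker) for _ in range(50)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
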
--
Aahz (aa**@pythoncraft.com) <*> http://www.pythoncraft.com/

"Argue for your limitations, and sure enough they're yours." --Richard Bach
Jul 18 '05 #11
aa**@pythoncraft.com (Aahz) wrote:
> Note that while you-know-who makes extensive use of Python, I don't
> think they're using it for spidering/searching. I do have some
> background writing a spider in Python, using Verity's engine for
> indexing/retrieval, but we were using threading rather than
> asyncore-style operations.


Interesting -- did you try maxing out the number of threads/connections?
On an UltraSparc with hardware thread/LWP support, a thousand threads
can co-exist reliably, at least for computations and disk I/O. Linux
is another matter entirely.

--William
Jul 18 '05 #12
Paul Rubin <http://ph****@NOSPAM.invalid> wrote:
"William Chang" <wi***********@hotmail.com> writes:
... Throughput per PC would be on
the order of 1MB/s assuming 200x5KB downloads/sec using 1-2000
simultaneous connections. (That's 17M pages per day per PC.)
That's orders of magnitude less than you-know-who.


Do you know how frequently you-know-who refreshes its entire index? A year
ago things were pretty dire, easily over 10% dead links, if I recall correctly.
10 PCs at 17M/day each will refresh 3B pages in 18 days, easily world-class.

> ... Also, don't forget
> how many queries you have to take from users, and the amount of disk seeks
> needed for each one.

Sure, that's what I do. However, spidering and querying are independent tasks,
generally speaking.

> 10 MB/s of internet connectivity is at least a few K$/month all by itself.


Yes, $2500 to be specific.

There's no reason to be intimidated (if I may use that word) by you-know-who's
marketing message (80,000 machines). Back in '96 Infoseek could handle 10M
queries per day on a single Sun E4000 with 8 CPUs (<200 MHz), 4GB, and
20x4GB RAID. Sure the WWW is much bigger now, but so are the disk drives!

-- William
Jul 18 '05 #13
