Bytes IT Community

XML <=> Text conversion platform requiring high performance

Hello everyone.

I have to find the right architecture for this XML <=> Text
conversion platform. The platform (based on Windows Server 2003) will have to
handle 21 million XML files and 16 million text files a day. The
average file size is 1.1 KB, but the files arrive at the platform in
the form of big archives (7,000 files per archive, approx. 7.7 MB each).

After some investigation on the Internet, I have decided (95% sure) to
use SAX as the API for processing my files. And I will not use XSLT as
the main converter, because it would be too slow.
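For what it's worth, here is a minimal sketch of the SAX approach in Java. The element and attribute names (`record`, `field`, `name`) are made up for illustration, since the real document structure is not shown in this thread:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxToText {
    /** Flatten one small XML document into a text line (hypothetical mapping). */
    public static String convert(String xml) throws Exception {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                // Emit "name=" for each <field> element we encounter.
                if ("field".equals(qName)) {
                    out.append(atts.getValue("name")).append('=');
                }
            }
            @Override
            public void characters(char[] ch, int start, int len) {
                // SAX may deliver text content in several chunks; append each one.
                out.append(ch, start, len);
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(convert("<record><field name=\"id\">42</field></record>"));
    }
}
```

Because SAX streams the document, memory use stays constant per file; at 21 million files a day it would also be worth reusing one parser instance rather than constructing a new one per file.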

However, I must say that I am a complete newbie concerning XML; these
decisions were made after much reading and after discussions with
others who are supposedly slightly better at XML than I am. Which
leads to my two questions:

* Does this architecture look good to you?
Win2003 Server, Websphere 5.1.1, Java and SAX

* Do you have any idea of the performance of this architecture?
Wouldn't it be better to choose Unix or Solaris as the processing platform,
especially given that, in all, around 40 GB of data will have to
be processed each day?

Thanks for your help.

Aug 24 '06 #1


Benjamin Bécar wrote:
* Do you have any idea of the performance of this architecture?
Wouldn't it be better to choose Unix or Solaris as the processing platform,
especially given that, in all, around 40 GB of data will have to
be processed each day?
This depends on how you process the files. If you start one
new process per file, you might run into problems with the
amount of RAM available on your host. If 1,000 files are being
processed at any given instant, handled by 1,000 processes,
the per-process overhead also adds up 1,000-fold.
Aug 24 '06 #2

Thanks for your answer.

I intend to process one file after the other, without threads.
Why? Because a multithreaded design would be harder to develop,
and thus slower to deliver, which I want to avoid.
Do you think the application will be able to deal with those 40 GB
of data if they are processed sequentially?

Aug 24 '06 #3

If you're talking about 40GB XML files, I'd strongly suggest considering
a serious XML database (which, these days, includes IBM's DB2). Or
keeping the main database in non-XML form and using XML only for the
extracted subsets that you're going to expose to the outside world.

--
Joe Kesselman / Beware the fury of a patient man. -- John Dryden
Aug 24 '06 #4

The problem is that this platform is only supposed to be a gateway for
text files (around 17 GB a day) and XML files (around 23 GB a day).
Moreover, the XML files have special features that prevent them from
being stored easily in an XML database: they are zipped and signed.
This means I have no choice but to deal with those big archives one
after the other, as they arrive at the gateway.
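As an aside, "zipped" need not force an unpack-to-disk step: `java.util.zip` can stream each archive entry straight into the SAX parser. A sketch (signature verification is omitted here; the one subtlety is that the parser must not be allowed to close the archive stream between entries):

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ArchiveGateway {
    /** Parse every .xml entry of a zip stream with SAX; returns how many were parsed. */
    public static int parseArchive(InputStream in) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        int parsed = 0;
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().endsWith(".xml")) continue;
                // SAX closes the stream it is handed; shield the archive stream
                // so the remaining entries stay readable.
                InputStream shield = new FilterInputStream(zip) {
                    @Override public void close() throws IOException { /* keep zip open */ }
                };
                parser.parse(new InputSource(shield), new DefaultHandler());
                parser.reset();  // make the parser reusable for the next entry
                parsed++;
            }
        }
        return parsed;
    }
}
```

`ZipInputStream.read()` reports end-of-stream at each entry boundary, so the parser sees exactly one document per entry; no temporary files are needed for a 7.7 MB archive.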

Aug 24 '06 #5

Benjamin Bécar wrote:
I intend to process one file after the other, and not to use threads.
This is a conservative design decision.
In production environments it is generally a
good idea to choose proven designs instead of
futuristic ones.
Why that ? Because it will be harder to develop and thus slower, which
I do not want.
Indeed.
Do you think the application will be able to deal with those 40Gb data
if processed sequentialy ?
A rule of thumb is that a good SAX parser can parse about 10 MB/s.
40 GB of data should therefore take about 4,000 seconds, i.e.
roughly an hour of CPU time for a day's traffic. If there is
little other overhead in your software system, you have plenty
of headroom with just one server (with just one CPU).
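That 10 MB/s figure depends heavily on the hardware and the parser, so it is worth measuring on the target machine rather than trusting the rule of thumb. A rough micro-benchmark sketch, using the thread's 1.1 KB average file size (the generated document content is made up):

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ParseBench {
    /** Parse a ~1.1 KB synthetic document `runs` times; return throughput in MB/s. */
    public static double measure(int runs) throws Exception {
        // Build a document of roughly the average file size in this thread.
        StringBuilder sb = new StringBuilder("<doc>");
        while (sb.length() < 1100) sb.append("<item attr=\"value\">payload</item>");
        sb.append("</doc>");
        byte[] doc = sb.toString().getBytes("UTF-8");

        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        long t0 = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            parser.parse(new InputSource(new ByteArrayInputStream(doc)), new DefaultHandler());
            parser.reset();
        }
        double seconds = (System.nanoTime() - t0) / 1e9;
        return (double) doc.length * runs / (1024 * 1024) / seconds;
    }

    public static void main(String[] args) throws Exception {
        double mbPerSec = measure(20000);  // ~22 MB of parsing
        System.out.printf("%.1f MB/s -> 40 GB in about %.0f s%n",
                mbPerSec, 40.0 * 1024 / mbPerSec);
    }
}
```

Note this measures parsing of in-memory data only; real throughput will also depend on disk I/O, unzipping, and whatever the conversion itself does per element.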
Aug 24 '06 #6

Benjamin Bécar wrote:
Moreover, concerning XML files, they have special features that
"prevent" them from an easy storage in a XML database : they are zipped
and signed. Which means that I have no other choice but to deal with
those big files one after the other, when they arrive at the gateway.
Now I understand why you opted for a sequential approach.
For the prototyping phase, the following tool may be
useful for building a first implementation of the
processing pipeline:

http://home.vrweb.de/~juergen.kahrs/gawk/XML/

I know of at least one developer (Andrew Schorr) who
has built a production environment with this tool.
But his constraints differed a bit from yours.
Andrew uses a Solaris server. He has XML files as
large as 1 GB and has to import them into the
PostgreSQL database.
Aug 24 '06 #7

Thank you very much for this information; I'm going to look into it
right now. And please excuse my previous post, which may have looked
a bit "stubborn", as I did not explain my reasons.

Aug 24 '06 #8


Jürgen Kahrs wrote:
This
is about 1 hour of CPU time for the traffic of a day. If there
is few other overhead in your software system, then you've got
some headroom left with just one server (with just one CPU).
That doesn't seem like much time to me, which is a good thing. But
could you tell me what kind of hardware you assumed for this estimate?

Thanks.

Aug 25 '06 #9
