XML <=> Text conversion platform requiring high performance

Hello everyone.

I need to find a suitable architecture for this XML <=> Text
conversion platform. The platform (based on Win2003 Server) will have to
deal with 21 million XML files and 16 million text files a day. The
average file size is 1.1 KB, but the platform receives them in
the form of big archives (7000 files per archive, approx. 7.7 MB).

After some investigation on the Internet, I have decided (95% sure) to
use SAX as the API to deal with my files. And I will not use XSLT as
the main converter, because it will be too slow.
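
For readers unfamiliar with SAX, here is a minimal sketch of what an event-driven XML-to-text converter can look like in Java. The class names (`XmlToText`, `FlattenHandler`) and the `path=value` output format are assumptions made for illustration; the thread does not describe the real target text format.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class XmlToText {

    /** Hypothetical handler: writes one "path=value" line per element that has text. */
    static final class FlattenHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();
        private final StringBuilder text = new StringBuilder();
        private final Deque<String> path = new ArrayDeque<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            path.addLast(qName);   // remember where we are in the document
            text.setLength(0);     // simplification: text preceding a child element is dropped
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // SAX may deliver one text node in several chunks, so accumulate them.
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            String value = text.toString().trim();
            if (!value.isEmpty()) {
                out.append(String.join("/", path)).append('=').append(value).append('\n');
            }
            path.removeLast();
            text.setLength(0);
        }
    }

    /** Converts one small XML document to text without building a tree in memory. */
    static String convert(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        FlattenHandler handler = new FlattenHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.out.toString();
    }
}
```

Because the handler only keeps the current element path and a small text buffer, memory use stays flat regardless of how many files pass through it, which is the main reason SAX is usually preferred over tree-based APIs for this kind of gateway.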

However, I must say that I am a complete "newbie" concerning XML, and
these decisions were made after much reading and discussions
with "others" who are supposed to be slightly better than me at XML.
That leaves me with two questions:

* Does this architecture look good to you?
Win2003 Server, WebSphere 5.1.1, Java and SAX

* Do you have any idea of the performance of this architecture?
Wouldn't it be better to choose Unix or Sun as the processing platform,
especially when we know that, in all, around 40 GB of data will have to
be processed each day?

Thanks for your help.

Aug 24 '06 #1
Benjamin Bécar wrote:
> * Do you have any idea of the performance of this architecture?
> Wouldn't it be better to choose Unix or Sun as the processing platform,
> especially when we know that, in all, around 40 GB of data will have to
> be processed each day?

This depends on how you process the files. If you
start a new process for each file, then
you might run out of available
RAM on your host. If 1000 files are being processed at any
given instant, and 1000 processes handle them, then the
per-process overhead is also multiplied by 1000.
Aug 24 '06 #2
Thanks for your answer.

I intend to process one file after the other, and not to use threads.
Why? Because a threaded design would be harder, and thus slower, to
develop, which I do not want.
Do you think the application will be able to deal with those 40 GB of data
if they are processed sequentially?

Aug 24 '06 #3
If you're talking about 40GB XML files, I'd strongly suggest considering
a serious XML database (which, these days, includes IBM's DB2). Or
keeping the main database in non-XML form and using XML only for the
extracted subsets that you're going to expose to the outside world.

--
Joe Kesselman / Beware the fury of a patient man. -- John Dryden
Aug 24 '06 #4
The problem is that this platform is supposed to be only a gateway for
text files (around 17 GB a day) and XML files (around 23 GB a day).
Moreover, the XML files have special features that
"prevent" them from being stored easily in an XML database: they are zipped
and signed. This means that I have no choice but to deal with
those big files one after the other, as they arrive at the gateway.
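
A minimal sketch of this one-after-the-other processing in Java, assuming the archives are ordinary ZIP files whose XML members end in `.xml`. Signature verification is deliberately left out, and the names `ArchiveSaxReader`, `CountingHandler` and `NonClosingStream` are hypothetical, not part of the platform described here.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class ArchiveSaxReader {

    /** Hypothetical handler: only counts start tags; a real one would emit the text output. */
    static final class CountingHandler extends DefaultHandler {
        long elements;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            elements++;
        }
    }

    /** Shields the archive stream from parsers that close their input after parsing. */
    static final class NonClosingStream extends FilterInputStream {
        NonClosingStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() {
            // Intentionally empty: the archive must stay open for the next entry.
        }
    }

    /** Parses every .xml entry of one archive in order and returns the total element count. */
    static long countElements(InputStream archive)
            throws IOException, SAXException, ParserConfigurationException {
        // One parser and one handler are created per archive and reused for all entries.
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        CountingHandler handler = new CountingHandler();
        try (ZipInputStream zip = new ZipInputStream(archive)) {
            for (ZipEntry entry; (entry = zip.getNextEntry()) != null; ) {
                if (!entry.isDirectory() && entry.getName().endsWith(".xml")) {
                    // ZipInputStream reports end-of-stream at the end of each entry,
                    // so the parser sees every member as a separate document.
                    parser.parse(new NonClosingStream(zip), handler);
                }
            }
        }
        return handler.elements;
    }
}
```

Reading entries straight from the archive stream avoids unpacking 7000 small files to disk, and reusing a single parser keeps the per-file overhead low.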

Aug 24 '06 #5
Benjamin Bécar wrote:
> I intend to process one file after the other, and not to use threads.

This is a conservative design decision.
In production environments it is generally a
good idea to choose proven designs over
futuristic ones.

> Why that? Because it will be harder to develop and thus slower, which
> I do not want.

Indeed.

> Do you think the application will be able to deal with those 40 GB of data
> if processed sequentially?

The rule of thumb is that a good SAX parser can parse 10 MB/s.
40 GB of data should therefore take about 4000 seconds. This
is about 1 hour of CPU time for the traffic of a day. If there
is little other overhead in your software system, then you've got
some headroom left with just one server (with just one CPU).
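
The arithmetic behind this estimate can be sketched as follows. The 10 MB/s parser rate is the rule of thumb quoted above and should be treated as an assumption to verify by measurement; the class and method names are hypothetical.

```java
final class ThroughputEstimate {
    static final double SECONDS_PER_DAY = 86_400;

    /** CPU seconds needed to parse totalBytes at a sustained parser rate. */
    static double parseSeconds(double totalBytes, double bytesPerSecond) {
        return totalBytes / bytesPerSecond;
    }

    /** Average per-second rate needed to spread a daily total evenly over 24 hours. */
    static double perSecond(double dailyTotal) {
        return dailyTotal / SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        double dailyBytes = 40e9;  // about 40 GB per day (figure from the thread)
        double dailyFiles = 37e6;  // 21 million XML + 16 million text files
        double parserRate = 10e6;  // 10 MB/s rule of thumb; an assumption, not a measurement

        double seconds = parseSeconds(dailyBytes, parserRate);
        System.out.println("parse time (s):     " + seconds);
        System.out.println("parse time (h):     " + seconds / 3600);
        System.out.println("share of a day:     " + seconds / SECONDS_PER_DAY);
        System.out.println("avg files/s needed: " + perSecond(dailyFiles));
    }
}
```

With these inputs the parse time comes to 4000 seconds (about 1.1 hours, under 5% of a day). The same figures imply an average of roughly 428 files per second, so fixed per-file costs such as unzipping, signature checks and parser setup may matter as much as raw parsing speed.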
Aug 24 '06 #6
Benjamin Bécar wrote:
> Moreover, concerning XML files, they have special features that
> "prevent" them from an easy storage in an XML database: they are zipped
> and signed. Which means that I have no other choice but to deal with
> those big files one after the other, when they arrive at the gateway.

Now I understand why you opted for a sequential approach.
For the prototyping phase, the following tool may be
useful in building a first implementation of the
processing pipeline:

http://home.vrweb.de/~juergen.kahrs/gawk/XML/

I know of at least one developer (Andrew Schorr) who
has built a production environment with this tool.
But his constraints differed a bit from yours.
Andrew uses a Solaris server. He has XML files as
large as 1 GB and has to import them into a
PostgreSQL database.
Aug 24 '06 #7
Thank you very much for this information, I'm going to look into it right
now. And please excuse my previous message, which looked a bit
"stubborn" because I did not explain my reasons.

Aug 24 '06 #8

Jürgen Kahrs wrote:
> This
> is about 1 hour of CPU time for the traffic of a day. If there
> is little other overhead in your software system, then you've got
> some headroom left with just one server (with just one CPU).

That doesn't seem like much time to me, which is a good thing. But could you
tell me what kind of hardware you assumed for this estimate?

Thanks.

Aug 25 '06 #9
