Bytes | Software Development & Data Engineering Community

XML <=> Text conversion platform requiring high performance

Hello everyone.

I have to find a suitable architecture for this XML <=> Text
conversion platform. The platform (based on Win2003Server) will have to
deal with 21 million XML files and 16 million text files a day. The
average file size is 1.1 KB, but the platform receives the files in
the form of big archives (7,000 files per archive, approx. 7.7 MB).

After some investigation on the Internet, I have decided (95% sure) to
use SAX as the API to deal with my files. And I will not use XSLT as
the main converter, because it will be too slow.

However, I must say that I am a complete "newbie" concerning XML, and
those decisions were made after much reading and after discussions
with "others" who are supposed to be slightly better than me at XML. This
leaves me with two questions:

* Does this architecture look good to you?
Win2003 Server, WebSphere 5.1.1, Java and SAX

* Do you have any idea of the performance of this architecture?
Wouldn't it be better to choose Unix or Sun as the processing platform,
especially since, in all, around 40 GB of data will have to
be processed each day?

Thanks for your help.

Aug 24 '06 #1
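To make the SAX plan above concrete, here is a minimal sketch of an XML-to-text conversion done with a SAX handler. The thread never describes the target text format, so the "name=value" line layout, the class names and the sample document are all assumptions made purely for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

final class XmlToTextSketch {

    /** Writes every leaf element as one "name=value" line (hypothetical format). */
    static final class FlatTextHandler extends DefaultHandler {
        private final StringBuilder out = new StringBuilder();
        private final StringBuilder text = new StringBuilder();
        private boolean childClosed; // true once an element has closed inside the current one

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            text.setLength(0);       // start collecting text for this element
            childClosed = false;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length); // SAX may deliver one text node in several calls
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (!childClosed) {      // no child element was seen, so this is a leaf
                out.append(qName).append('=').append(text.toString().trim()).append('\n');
            }
            childClosed = true;      // tells the enclosing element that it is not a leaf
            text.setLength(0);
        }

        String result() {
            return out.toString();
        }
    }

    /** Streams the document through the handler; no tree is built in memory. */
    static String convert(InputStream xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        FlatTextHandler handler = new FlatTextHandler();
        parser.parse(xml, handler);
        return handler.result();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<order><id>42</id><customer>ACME</customer></order>";
        System.out.print(convert(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))));
    }
}
```

Because the handler only keeps the text of the element currently open, memory use stays roughly constant regardless of document size, which is the usual argument for SAX over a tree-based approach in a gateway like this one.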
Benjamin Bécar wrote:
> * Do you have any idea of the performance of this architecture?
> Wouldn't it be better to choose Unix or Sun as the processing platform,
> especially since, in all, around 40 GB of data will have to
> be processed each day?

This depends on how you process the files. If you
start a new process for each file, then
you might run into problems with the amount of available
RAM in your host. If 1000 files are being processed at any given
moment, and 1000 processes handle them, then the per-process
overhead is also multiplied a thousandfold.
Aug 24 '06 #2
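One way to avoid the per-process overhead described above is to keep a single long-lived process and reuse one parser for every file. The sketch below is only an illustration under assumptions: the class and method names are made up, the documents are passed in as strings instead of being read from disk, and `SAXParser.reset()` requires JAXP 1.3 (Java 5) or later, which may not match the WebSphere 5.1.1 runtime mentioned in the thread.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

final class SingleProcessSketch {

    /** Counts start tags; a stand-in for the real conversion logic. */
    static final class CountingHandler extends DefaultHandler {
        int elements;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            elements++;
        }
    }

    /** Parses all documents sequentially with one parser inside one process. */
    static int countElements(List<String> documents) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser(); // created once
        CountingHandler handler = new CountingHandler();
        for (String doc : documents) {
            parser.reset(); // restore the parser's initial configuration before reuse
            parser.parse(new ByteArrayInputStream(doc.getBytes(StandardCharsets.UTF_8)), handler);
        }
        return handler.elements;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countElements(Arrays.asList("<a><b/></a>", "<c/>")));
    }
}
```

The start-up cost of the JVM and of the parser is paid once here, not once per file, so it no longer grows with the number of files.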
Thanks for your answer,

I intend to process one file after the other, and not to use threads.
Why? Because it would be harder to develop and thus slower, which
I do not want.
Do you think the application will be able to deal with those 40 GB of data
if they are processed sequentially?

Aug 24 '06 #3
If you're talking about 40GB XML files, I'd strongly suggest considering
a serious XML database (which, these days, includes IBM's DB2). Or
keeping the main database in non-XML form and using XML only for the
extracted subsets that you're going to expose to the outside world.

--
Joe Kesselman / Beware the fury of a patient man. -- John Dryden
Aug 24 '06 #4
The problem is that this platform is supposed to be only a gateway for
text files (around 17 GB a day) and XML files (around 23 GB a day).
Moreover, the XML files have special features that
"prevent" them from being stored easily in an XML database: they are zipped
and signed. This means that I have no choice but to deal with
those big files one after the other, as they arrive at the gateway.

Aug 24 '06 #5
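Processing an archive "one file after the other" could look roughly like the sketch below. It assumes the archives are ordinary ZIP files, which the thread does not actually confirm, and it leaves signature verification out entirely. The class names and the in-memory sample archive are invented for the example. The small wrapper stream is there because an XML parser may close the stream it is given, which would otherwise close the whole archive after the first entry.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

final class ArchiveWalkSketch {

    /** Shields the archive stream from close() calls made by the XML parser. */
    static final class NonClosingStream extends FilterInputStream {
        NonClosingStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() {
            // intentionally empty: the archive must stay open for the next entry
        }
    }

    /** Parses each entry in turn; parse() throws if an entry is not well-formed XML. */
    static int parseEntries(InputStream archive) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        int parsed = 0;
        try (ZipInputStream zip = new ZipInputStream(archive)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                parser.reset();
                // DefaultHandler ignores all content; real code would plug in its converter here
                parser.parse(new NonClosingStream(zip), new DefaultHandler());
                parsed++;
            }
        }
        return parsed;
    }

    /** Builds a tiny in-memory archive so the sketch runs without real data. */
    static byte[] sampleArchive() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (int i = 0; i < 3; i++) {
                zip.putNextEntry(new ZipEntry("file" + i + ".xml"));
                zip.write(("<doc n=\"" + i + "\"/>").getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseEntries(new ByteArrayInputStream(sampleArchive())));
    }
}
```

Entries are streamed straight from the archive into the parser, so the 7,000 small files never have to be unpacked to disk first.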
Benjamin Bécar wrote:
> I intend to process one file after the other, and not to use threads.

This is a conservative design decision.
In production environments it is generally a
good idea to choose proven designs over
futuristic ones.

> Why? Because it would be harder to develop and thus slower, which
> I do not want.

Indeed.

> Do you think the application will be able to deal with those 40 GB of data
> if they are processed sequentially?

The rule of thumb is that a good SAX parser can parse 10 MB/s.
40 GB of data should therefore take about 4000 seconds. This
is about 1 hour of CPU time for a day's traffic. If there
is little other overhead in your software system, then you have
some headroom left with just one server (with just one CPU).
Aug 24 '06 #6
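The back-of-the-envelope estimate above can be reproduced in a few lines. This is only a sketch of the arithmetic: the 10 MB/s figure is the rule of thumb quoted in the post, not a measurement, and the method names and the use of decimal units (1 GB = 1000 MB) are assumptions.

```java
final class ThroughputEstimate {

    /** Daily volume in GB for a file count and an average file size in KB (decimal units). */
    static double dailyGigabytes(long filesPerDay, double averageKilobytes) {
        return filesPerDay * averageKilobytes / 1_000_000.0;
    }

    /** Seconds of parsing for a volume in GB at a rate in MB/s (1 GB = 1000 MB). */
    static double parseSeconds(double gigabytes, double megabytesPerSecond) {
        return gigabytes * 1000.0 / megabytesPerSecond;
    }

    public static void main(String[] args) {
        // 21 million XML files plus 16 million text files at 1.1 KB each
        double volume = dailyGigabytes(21_000_000L + 16_000_000L, 1.1);
        // the thread rounds the volume to 40 GB and assumes 10 MB/s
        double seconds = parseSeconds(40.0, 10.0);
        System.out.println("daily volume (GB): " + volume);
        System.out.println("parse time (s): " + seconds);
        System.out.println("parse time (h): " + seconds / 3600.0);
        System.out.println("share of a 24 h day: " + seconds / 86_400.0);
    }
}
```

At those numbers, parsing takes 4000 seconds, a little over one hour, which is under 5% of a day. Unzipping, signature checks and disk I/O are not included, so the real load would be higher.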
Benjamin Bécar wrote:
> Moreover, the XML files have special features that
> "prevent" them from being stored easily in an XML database: they are zipped
> and signed. This means that I have no choice but to deal with
> those big files one after the other, as they arrive at the gateway.

Now I understand why you opted for a sequential approach.
For the prototyping phase, the following tool may be
useful in building a first implementation of the
processing pipeline:

http://home.vrweb.de/~juergen.kahrs/gawk/XML/

I know of at least one developer (Andrew Schorr) who
has built a production environment with this tool.
But his constraints differed a bit from yours.
Andrew uses a Solaris server. He has XML files as
large as 1 GB and has to import them into a
PostgreSQL database.
Aug 24 '06 #7
Thank you very much for this information, I'm going to look into it right
now. And please excuse my previous message, which may have looked a bit
"stubborn" because I had not explained my reasons.

Aug 24 '06 #8

Jürgen Kahrs wrote:
> This
> is about 1 hour of CPU time for a day's traffic. If there
> is little other overhead in your software system, then you have
> some headroom left with just one server (with just one CPU).

That doesn't seem like much time to me, which is a good thing. But could you
tell me what kind of hardware you assumed for this estimate?

Thanks.

Aug 25 '06 #9
