Hi everyone,
I've written some code that parses an XML file. It works correctly, but it's really slow. My example below has only two elem_x elements, whereas a real XML file could theoretically have tens of thousands. I'm testing with a file that has 150 elem_x elements, and even that takes too long. Basically, I'm trying to convert the XML file into a column-based format, as follows:
Input XML file
<elem_x att_a="1" att_b="2" att_c="3">
<elem_y>
<elem_z att_k="9" att_l="8" att_m="7"></elem_z>
</elem_y>
<elem_y lang="EN-US">
<elem_z att_k="9" att_m="7"></elem_z>
<elem_r>textEN</elem_r>
</elem_y>
<elem_y lang="DE-DE">
<elem_z att_k="9" att_l="8" att_m="7"></elem_z>
<elem_r>textDE</elem_r>
</elem_y>
</elem_x>
<elem_x att_a="4" att_b="5" att_c="6">
<elem_y>
<elem_z att_k="6" att_l="5" att_m="4"></elem_z>
</elem_y>
<elem_y lang="EN-US">
<elem_z att_k="6" att_l="5" att_m="4"></elem_z>
<elem_r>textEN</elem_r>
</elem_y>
<elem_y lang="DE-DE">
<elem_z att_k="6" att_l="5" att_m="4"></elem_z>
<elem_r>textDE</elem_r>
</elem_y>
</elem_x>...
The parsed output is a tab-separated file and should look something like this:
att_a  att_b  att_c  att_k  att_l  att_m  EN-US   DE-DE
1      2      3      9             7      textEN  textDE
4      5      6      6      5      4...   textEN  textDE
Since some attributes can be missing from a particular element, I have to loop through the entire file to make sure each value ends up in the right column. To complicate matters slightly, I only want to read the elem_z attributes from one of the elem_y elements, as they are always the same for each elem_y.
I've used the XmlDocument class, and with XPath and SelectNodes I can drill down through the XML file, navigating to each node block and looping through it, reading the attribute names and values as I go. This lets me build an array which I can then write to the output file. However, I have a feeling my problem is the high number of nested loops, which is slowing everything down. I've also tried parsing the XML file with an XmlReader and loading it into a DataSet. This is much faster, but it doesn't seem to solve my problem, because the attributes for elem_z are not read out on one line but line by line.
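To make the intended conversion concrete, here is a minimal sketch of it as a single pass, written in Python purely for illustration (it is not the XmlDocument code described above). It rests on several assumptions: the elem_x elements sit inside a hypothetical `<root>` wrapper, the elem_z attributes are taken from the first elem_y of each elem_x, and columns appear in first-seen order, so an attribute that is missing early on ends up as a later column unless a fixed column list is supplied. The names `to_rows`, `to_tsv`, and `SAMPLE` are made up for this example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Shortened sample; the <root> wrapper is an assumption for this sketch.
SAMPLE = """<root>
<elem_x att_a="1" att_b="2" att_c="3">
  <elem_y><elem_z att_k="9" att_m="7"></elem_z></elem_y>
  <elem_y lang="EN-US"><elem_z att_k="9" att_m="7"></elem_z><elem_r>textEN</elem_r></elem_y>
  <elem_y lang="DE-DE"><elem_z att_k="9" att_m="7"></elem_z><elem_r>textDE</elem_r></elem_y>
</elem_x>
<elem_x att_a="4" att_b="5" att_c="6">
  <elem_y><elem_z att_k="6" att_l="5" att_m="4"></elem_z></elem_y>
  <elem_y lang="EN-US"><elem_z att_k="6" att_l="5" att_m="4"></elem_z><elem_r>textEN</elem_r></elem_y>
  <elem_y lang="DE-DE"><elem_z att_k="6" att_l="5" att_m="4"></elem_z><elem_r>textDE</elem_r></elem_y>
</elem_x>
</root>"""


def to_rows(xml_text):
    """Return (header, rows); each elem_x becomes one row."""
    root = ET.fromstring(xml_text)
    records = []
    columns = {}  # dict used as an ordered set of column names
    for x in root.iter("elem_x"):
        rec = dict(x.attrib)
        # Read elem_z attributes from the first elem_y only (assumption).
        z = x.find("elem_y/elem_z")
        if z is not None:
            rec.update(z.attrib)
        # One column per language, filled from the elem_r text.
        for y in x.findall("elem_y"):
            lang = y.get("lang")
            if lang is not None:
                rec[lang] = y.findtext("elem_r", default="")
        records.append(rec)
        for name in rec:
            columns.setdefault(name, None)
    header = list(columns)
    # Missing attributes become empty cells, so values never shift columns.
    rows = [[rec.get(col, "") for col in header] for rec in records]
    return header, rows


def to_tsv(xml_text):
    """Render the header and rows as tab-separated text."""
    header, rows = to_rows(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    print(to_tsv(SAMPLE), end="")
```

Keeping one dictionary per elem_x and only resolving the column list at the end means each element is visited once, rather than re-querying the document for every attribute name.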
Is XML my problem? Should I try to use XSLT to transform the XML instead? Or would simply parsing it as a text file be more effective?
Any assistance would be greatly appreciated.
Robert