
How to modify the XML structure internally so that the program works?

P: 55
Dear Friends,

I have a Python application that takes an XML document as input. The XML document is supplied externally and I cannot change its structure. But there is a problem with how the elements are arranged in the XML. I am using xml.dom.minidom for parsing.

A simple change of position would be enough, but I have no idea how to change an element of the DOM, i.e. the tree returned by self.tree = MD.parse(fichero). Please advise a good way to do this.

Please refer to the problematic and working HTML structures attached here.
N.B. We have no option to edit the source HTML, because it may also come from a CD.



Thanks
Anes
Attached Files
File Type: txt working html.txt (1.9 KB, 309 views)
File Type: txt problem html.txt (2.2 KB, 258 views)
Jan 15 '16 #1
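
For readers wondering what a "position change" on a parsed minidom tree can look like, here is a minimal sketch. It assumes the fix is to move one element in front of another; the attached files are not reproduced in this thread, so SAMPLE and the helper move_before are hypothetical stand-ins, not the poster's data or code. Only standard minidom calls (removeChild, insertBefore, getElementsByTagName, toxml) are used.

```python
from xml.dom import minidom as MD

# Hypothetical input; the real attachment is not shown in the thread.
SAMPLE = "<html><body><span>intro</span><h1>Title</h1><p>text</p></body></html>"

def move_before(node, reference):
    """Detach `node` and re-insert it immediately before `reference`."""
    node.parentNode.removeChild(node)
    reference.parentNode.insertBefore(node, reference)

tree = MD.parseString(SAMPLE)          # MD.parse(path) works the same way for files
body = tree.getElementsByTagName("body")[0]
heading = body.getElementsByTagName("h1")[0]

# Move the <h1> so that it becomes the first child of <body>.
move_before(heading, body.firstChild)

result = body.toxml()
print(result)
```

The tree is changed in memory only, so the externally supplied file stays untouched, which matches the constraint that the source cannot be edited.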
3 Replies


Expert 100+
P: 619
What is the problem, and what do you want to extract? It would possibly be easier to process this as a plain text file and split/group by the <h1>, <h2>, and <span> tags. I will post some code later tonight, time permitting.
Jan 15 '16 #2

Expert 100+
P: 619
This code should be self-explanatory. The combined records are printed, but you could also search for a string within each record, or write the records to a file.
def process_group(group_in):
    # Print one combined record: all lines of the group joined by spaces.
    print(" ".join(group_in))

with open("problem_or_working_html.txt", "r") as fp_in:
    # A line beginning with any of these literals starts a new group.
    starters = ["<h1", "<h2", "<span", "</body"]
    this_group = []
    for rec in fp_in:
        rec = rec.strip()
        for start_lit in starters:
            if rec.startswith(start_lit):
                # Flush the group collected so far, then start a fresh one.
                process_group(this_group)
                this_group = []
        this_group.append(rec)

## process last group
process_group(this_group)
Jan 15 '16 #3
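
The suggestion above, to search within a record or write records to a file instead of printing them, could be sketched roughly as follows. This is an illustration written for Python 3, not the poster's code: the needle and out parameters are hypothetical additions, and an in-memory io.StringIO stands in for a real output file.

```python
import io

def process_group(group_in, needle=None, out=None):
    """Join one group into a record; optionally filter by `needle` and write to `out`."""
    if not group_in:
        return None                      # nothing collected yet
    record = " ".join(group_in)
    if needle is not None and needle not in record:
        return None                      # record does not contain the search string
    if out is not None:
        out.write(record + "\n")         # one record per line
    return record

# Usage: keep only the records that mention "first".
out = io.StringIO()                      # stand-in for open("records.txt", "w")
process_group(["<h1>Intro</h1>", "<p>first</p>"], needle="first", out=out)
process_group(["<h2>Other</h2>", "<p>second</p>"], needle="first", out=out)
written = out.getvalue()
```

Returning the record, rather than only printing it, leaves it to the caller to decide what to do with each group.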

P: 55
Dear dwblas,
Thanks for your fantastic answer. It works fine with small indentation changes.
#!/bin/python
def process_group(group_in):
    print(" ".join(group_in))

with open("problem_html.txt", "r") as fp_in:
    starters = ["<h1", "<h2", "<span", "</body"]
    this_group = []
    for rec in fp_in:
        rec = rec.strip()
        for start_lit in starters:
            if rec.startswith(start_lit):
                process_group(this_group)
            #this_group = []
        this_group.append(rec)

# process last group
process_group(this_group) #function invoking...
But in my current situation I get the result as a DOM element, which a normal Python print shows as

[<DOM Element: body at 0xb199054c>]

So it is a NodeList, and we cannot apply the strip() method to a NodeList. Please advise a solution for this case.

With lots of gratitude

Anes
Jan 16 '16 #4
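
The thread ends without an answer to this last question. One possible direction, offered only as a sketch, is to apply the same grouping idea to the child nodes of the body element rather than to text lines: index into the NodeList to get the Element, walk its childNodes, and serialize each child with toxml(). SAMPLE, STARTERS and group_children below are hypothetical names for illustration, and the sketch assumes that the content of interest sits in element children directly under body (bare text between elements is skipped).

```python
from xml.dom import minidom as MD

# Hypothetical input; the real attachment is not shown in the thread.
SAMPLE = """<html><body>
<h1>One</h1>
<p>a</p>
<h2>Two</h2>
<p>b</p>
</body></html>"""

STARTERS = ("h1", "h2", "span")          # tags that begin a new group

def group_children(body):
    """Group the element children of `body`, starting a new group at each starter tag."""
    groups, current = [], []
    for child in body.childNodes:
        if child.nodeType != child.ELEMENT_NODE:
            continue                     # skip whitespace text nodes and comments
        if child.tagName in STARTERS and current:
            groups.append(current)
            current = []
        current.append(child.toxml())    # serialize the element back to a string
    if current:
        groups.append(current)
    return groups

tree = MD.parseString(SAMPLE)
bodies = tree.getElementsByTagName("body")   # a NodeList, not a string
groups = group_children(bodies[0])           # index into the NodeList to get the Element
for group in groups:
    print(" ".join(group))
```

Each group is a list of plain strings, so string methods such as strip() or startswith() can be used on its items again.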
