
Replace special characters in xml but do not replace tags

My requirement is this: I receive an XML file whose text content may contain characters like "<" or ">".

For example:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<RecordDivision> abc>efg</RecordDivision>

Here I want to replace the ">" character in the string abc>efg.

This is just a sample; I may receive any kind of XML.

The methods I found by googling also replace the "<" and ">" characters of the XML tags themselves.

My requirement is that the tags must not be replaced, but the content should be.
I am using C#.
Apr 16 '10 #1
8,656 Expert Mod 8TB
Ask your source to send you well-formed XML (which includes escaping such characters).
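A minimal sketch for checking whether a given string is well-formed, assuming Java (the language used for the code later in this thread) and the JDK's built-in DOM parser. The class and method names are made up for illustration. Note that per the XML specification only "<" and "&" must always be escaped in text; a lone ">" is allowed in content (except as part of "]]>"), so the parser should accept the sample abc>efg as it stands and reject a lone "<".

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of the original thread.
class WellFormedCheck {
    /** Returns true if the JDK's XML parser accepts the string as well-formed. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return false; // the parser rejected the input
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<RecordDivision> abc>efg</RecordDivision>"));
        System.out.println(isWellFormed("<RecordDivision> abc<efg</RecordDivision>"));
        System.out.println(isWellFormed("<RecordDivision> abc&lt;efg</RecordDivision>"));
    }
}
```

A check like this only tells you whether the input needs repairing; it does not repair anything.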

If the XML is formatted linewise (one element per line), you may be able to use a regex.

(not C# code)
    // this ain't correct either, but currently I don't know how to describe it better
    <([^ >]+)[^>]*>(.+)</\1>
    // 1st parenthesis: opening XML tag name
    // 2nd parenthesis: anything that is between 1 and 3
    // 3rd part (</\1>): closing XML tag, which must repeat the name captured in 1
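A minimal sketch of applying the regex above to one line at a time, written in Java rather than C# (the pattern itself should carry over to .NET's Regex class unchanged). The class and method names are hypothetical. It assumes each element sits on its own line and has no child elements, and it escapes only "<" and ">", leaving "&" alone so that existing entities are not double-escaped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the linewise regex idea.
class LinewiseEscaper {
    // The regex from the post: group 1 = tag name, group 2 = text between the tags.
    private static final Pattern ELEMENT_LINE =
        Pattern.compile("<([^ >]+)[^>]*>(.+)</\\1>");

    /** Escapes '<' and '>' in the element content of a single line; tags stay untouched. */
    static String escapeLine(String line) {
        Matcher m = ELEMENT_LINE.matcher(line);
        if (!m.find()) {
            return line; // no "<tag>content</tag>" shape on this line: leave it alone
        }
        String content = m.group(2).replace("<", "&lt;").replace(">", "&gt;");
        // Splice the escaped content back between the original opening and closing tags.
        return line.substring(0, m.start(2)) + content + line.substring(m.end(2));
    }

    public static void main(String[] args) {
        String[] lines = {
            "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>",
            "<RecordDivision> abc>efg</RecordDivision>"
        };
        for (String line : lines) {
            System.out.println(escapeLine(line));
        }
    }
}
```

If a line contains nested elements, such as <a><b>x</b></a>, this approach escapes the inner tags as well, which is the kind of case where the regex is too simple.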
Apr 16 '10 #2
No, they won't.

I have to handle this on my side.

Apr 16 '10 #3
8,656 Expert Mod 8TB
"No, they won't."
Then notify them that they don't serve (well-formed) XML, which contradicts the meaning of XML.
Apr 16 '10 #4
If you won't look down on me for asking,

could you please explain how to use your regular expression <([^ >]+)[^>]*>(.+)</\1> ?
Just a small example would be great, please.
Apr 16 '10 #5
8,656 Expert Mod 8TB
I don't know C#, so I can't. Besides, I would complain about the invalid XML a lot; there are web standards, after all.
Apr 16 '10 #6
Can we do something similar to this?

It checks an HTML file.
Also, it is in PHP and I don't understand it.

Any help?
Apr 16 '10 #7
8,656 Expert Mod 8TB
You will encounter the same problems discussed there. There's no helping it, because the invalid XML is the source of the problem.

Otherwise it would work for you too if, and only if, your input has the simplicity the regex requires.

Can't help with C#, though.
Apr 16 '10 #8
2,057 Expert 2GB
This is a pretty good algorithm problem. You won't really be able to use any XML parsing tools, since this isn't XML, unless you're using them to find the errors. Unpolished idea follows:
Create a stack to keep track of your elements. Assuming you know your root element is <Incident>: take in input, and output it until you reach an <Incident> start node. From here the real work begins.

1. Push <Incident> to the stack. All following output will be considered within the domain of the <Incident> node.

While we have input:
2. Continue parsing until you reach a <. If you see a lone >, convert it immediately to &gt;

3. When you see a <, add all the preceding output to the current node.

4. Check whether this is a start element, an end element, a complete empty element, or just a lone <.
Lone < - easy, convert it to &lt;
Empty element - render the complete element.
Start element - Push a new element to the stack and continue.
End element - Peek at the stack. If it matches, good. End the output for this element, and pop the element from the stack. Add this output to the preceding element. If there is not a match, this means either:
(a) One or more preceding tags are invalid.
(b) This is an invalid tag; just render it as text.
To check (a), try looking through the stack to find the corresponding start element. If it is found, then we treat all the start elements in between as just text. If it's not found, then we go the route of (b).

E.g. if we have <a><b><c><d></b>...
then we look at the top, d, and have no match. We go backwards: c, no match; b, match. So we add the following output to the a element: <b>&lt;c&gt;&lt;d&gt;</b>

If you have tag soup like: <r> <b> </r> </b>
this algorithm will render the first one first, i.e. <r> &lt;b&gt; </r> &lt;/b&gt;

This algorithm is not guaranteed to give you what you want; it depends on how irregular your XML is.
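The steps above can be sketched roughly as follows. This is a simplified illustration in Java, not the poster's actual program: the class and method names are hypothetical, the "looks like a tag" check is a regex heuristic of my own, a nameless root frame replaces the known <Incident> root, and comments and CDATA sections are not handled.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the stack-based repair idea.
class TagSoupEscaper {
    // Heuristic: optional '/', a name, optional attributes, optional '/', then '>'.
    private static final Pattern TAG =
        Pattern.compile("<(/?)([A-Za-z_][\\w.:-]*)([^<>]*?)(/?)>");

    /** One open element: its start tag as written, its name, and the output collected so far. */
    private static final class Frame {
        final String startTag, name;
        final StringBuilder out = new StringBuilder();
        Frame(String startTag, String name) { this.startTag = startTag; this.name = name; }
        /** Renders an unmatched element as text: escaped start tag plus its content. */
        String asText() { return escape(startTag) + out; }
    }

    static String escape(String s) {
        return s.replace("<", "&lt;").replace(">", "&gt;");
    }

    static String repair(String input) {
        Deque<Frame> stack = new ArrayDeque<>();
        stack.push(new Frame("", null)); // nameless root frame collects the final output
        Matcher m = TAG.matcher(input);
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '>') { stack.peek().out.append("&gt;"); i++; continue; } // lone '>'
            if (c != '<') { stack.peek().out.append(c); i++; continue; }      // plain text
            if (input.startsWith("<?", i) && input.indexOf("?>", i) >= 0) {
                int end = input.indexOf("?>", i) + 2; // XML declaration: copy through
                stack.peek().out.append(input, i, end);
                i = end;
                continue;
            }
            m.region(i, input.length());
            if (!m.lookingAt()) { stack.peek().out.append("&lt;"); i++; continue; } // lone '<'
            String tag = m.group(), name = m.group(2);
            i = m.end();
            if (!m.group(1).isEmpty()) closeElement(stack, tag, name);    // end element
            else if (!m.group(4).isEmpty()) stack.peek().out.append(tag); // empty element
            else stack.push(new Frame(tag, name));                        // start element
        }
        while (stack.size() > 1) { // elements still open at end of input become text
            String text = stack.pop().asText();
            stack.peek().out.append(text);
        }
        return stack.peek().out.toString();
    }

    private static void closeElement(Deque<Frame> stack, String endTag, String name) {
        boolean open = false;
        for (Frame f : stack) if (name.equals(f.name)) { open = true; break; }
        if (!open) { stack.peek().out.append(escape(endTag)); return; } // stray end tag
        while (!name.equals(stack.peek().name)) { // start tags above the match become text
            String text = stack.pop().asText();
            stack.peek().out.append(text);
        }
        Frame f = stack.pop();
        stack.peek().out.append(f.startTag).append(f.out).append(endTag);
    }

    public static void main(String[] args) {
        System.out.println(repair("<RecordDivision> abc>efg</RecordDivision>"));
        System.out.println(repair("<r> <b> </r> </b>"));
    }
}
```

One design choice to be aware of: text such as "a <b and c> d" looks like a start tag to the heuristic and is only turned back into text if no matching end tag follows.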

I am now seriously considering writing such a program for my own personal use, since I come across this problem a lot.
Apr 16 '10 #9
2,057 Expert 2GB
Unfortunately for you I use primarily Java, but it shouldn't be that hard to port.
    /**
     * @author jkmyoung
     * Simple class to store element name and inner xml, allowing for easy addition.
     * It is declared static, so it has to be nested inside an outer class to compile.
     */
    public static class Element{
        StringBuffer sb; // contents of element
        String ename;    // name of element

        /**
         * Creates element with given start tag
         * @param startTag starting tag. May contain spaces.
         */
        Element(String startTag){
            ename = startTag;
            sb = new StringBuffer();
        }

        /**
         * Creates an element without a name (ename is null).
         */
        Element(){
            ename = null;
            sb = new StringBuffer();
        }
        /**
         * Adds input to the element
         * @param input  The string input to be added.
         */
        void Add(String input){
            sb.append(input);
        }
        /**
         * Adds input to the element
         * @param input  The character input to be added.
         */
        void Add(char input){
            sb.append(input);
        }

        /**
         * Abruptly ends the element, treating it as text.
         * Can be used as a debugging function as well.
         * There may be 
         * @pre Element is not the root.
         */
        String Truncate(){
            return "&lt;"+ename+"&gt;"+sb.toString();
        }

        /**
         * Ends and outputs the entire element.
         * @param endTag given endTag.
         * @pre Element is not the root, endTag matches startTag. 
         */
        String ElementEnd(String endTag){
            //sb.append("<"); // slight speed increase if added to sb first, but this makes it hard to debug.
            //sb.append(endTag);
            //sb.append(">");
            //return "<"+ename+">"+sb.toString();
            return "<"+ename+">"+sb.toString()+"<"+endTag+">";
        }

        /**
         * Outputs the root
         */
        void OutputRoot(){
            System.out.println(sb.toString());

        }
    }
This is the basis for the code I am working on. Need to test more before I release here.
Apr 16 '10 #10
