How to remove junk from the beginning of line

P: 30
Hello.

I can't identify a junk character that appears at the beginning of every line in an XML file. It looks like a square, and it came from an InDesign file when the content was extracted. I think it is some kind of newline junk.

see below.


<p align="left"

<p align="left"

<p align="left"

I would like to delete this symbol from the beginning of each line.

Thanks.
May 31 '07 #1
9 Replies


KevinADC
Expert 2.5K+
P: 4,059
could be a carriage return, try:

$line =~ s/\r//g;
May 31 '07 #2

prn
Expert 100+
P: 254
Hi vishwa,

Probably the simplest way to handle it doesn't involve Perl at all.

From what you have said, it looks like the problem is that InDesign (running on a Windows PC) created this file and then you transferred it to a Linux/Unix box to do whatever you plan to do next with it.

The first and most obvious fix is to transfer it in text mode rather than binary mode. That will take care of the end-of-line characters. If, for some reason, you cannot do the transfer in text mode, then your Unix/Linux system almost certainly has a utility on it for exactly this purpose. It would probably be called "dos2unix", and you would use it to fix a file "foo.xml" using the command:
dos2unix foo.xml

If this doesn't work or if some of my other assumptions were incorrect, please let us know.
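
If dos2unix is not available, the same line-ending conversion can be sketched in Perl. This is only a minimal sketch, assuming the problem really is DOS (CRLF) line endings; the helper name crlf_to_lf is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: convert DOS (CRLF) line endings to Unix (LF).
# A lone "\r" that is not followed by "\n" is left untouched.
sub crlf_to_lf {
    my ($text) = @_;
    $text =~ s/\r\n/\n/g;
    return $text;
}

# Small demonstration on an in-memory string.
my $dos_text = "<p align=\"left\">one</p>\r\n<p align=\"left\">two</p>\r\n";
print crlf_to_lf($dos_text);
```

To use it on a real file, read the file in with the :raw layer, pass the contents through the helper, and write the result back out.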

HTH,
Paul
May 31 '07 #3

P: 30
Dear KevinADC,

I tried the code you suggested, but the character was not deleted.

Thanks
vishwa Ram.
Jun 1 '07 #4

P: 30
Dear Paul,
Yes, that's right: the InDesign extraction is done on a Windows PC using UTF-8 format (this is the only option here), but I have to clean up the XML file. Please give a solution as a Perl program.

Thanks

vishwa Ram.
Jun 1 '07 #5

miller
Expert 100+
P: 1,089
Vishwa,

"Random junk at the beginning of each line" isn't specific enough for us to be able to help you. It looks like it may be carriage returns, as Kevin suggested, but you say that the Perl substitution didn't work. I suggest that you open the file in a hex editor to determine what this character is, or write a Perl script that goes through each line and converts it to its hex values so that you can determine what this character is.

Something like this should work.

foreach my $line (<IN>) {
    print join ' ', map {sprintf("%x", $_) . '-' . pack('c', $_)} unpack 'c*', $line;
}

Then you simply use a regex like before on the line to strip out the offending character.

$line =~ s/\x45//g;

Although obviously 45 probably isn't the hex number that is the problem.
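
One caveat with the dump code above: the 'c' template unpacks signed chars, so any byte above 0x7f is printed sign-extended (with a run of leading f's) instead of as a plain two-digit value. A minimal sketch of an unsigned variant follows, assuming the line holds raw bytes; hex_dump is a hypothetical helper name, not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: return the bytes of a string as two-digit hex values.
# 'C*' unpacks unsigned chars, so byte 0xC3 prints as "c3" with no sign extension.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;
}

# "\xC3\xA9" is the UTF-8 encoding of e-acute, followed by a plain ASCII tag.
print hex_dump("\xC3\xA9<p>"), "\n";   # c3 a9 3c 70 3e
```

The two-digit values can then be used directly in a substitution of the \xNN form shown above.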

Alternatively, if your file really is UTF-8, then you need to convert it from that.

use Unicode::MapUTF8 qw(from_utf8);

# Convert a string in 'UTF8' encoding to encoding 'ISO-8859-1'
$line = from_utf8({ -string => $line, -charset => 'ISO-8859-1' });

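Unicode::MapUTF8 is a CPAN module and may not be installed. As a possible alternative, the Encode module that ships with Perl 5.8 and later can do the same kind of conversion. This is a minimal sketch, assuming the input really is valid UTF-8; utf8_to_latin1 is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert UTF-8 bytes to ISO-8859-1 bytes.
sub utf8_to_latin1 {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);    # bytes -> Perl characters
    return encode('ISO-8859-1', $text);    # characters -> Latin-1 bytes
}

# "caf\xC3\xA9" is "cafe" with an e-acute, encoded as UTF-8.
my $latin1 = utf8_to_latin1("caf\xC3\xA9");
print length($latin1), "\n";   # 4
```

With the default settings, a character that has no ISO-8859-1 equivalent is replaced with "?" rather than raising an error, so the conversion can silently lose information.
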
Either way, it's up to you to determine what the exact problem is. We are not going to be able to decipher what "junk" is without actual data.

Good luck,
- Miller
Jun 1 '07 #6

miller
Expert 100+
P: 1,089
PS.

How did you come to the conclusion that Kevin's code did not work? What exactly did you do?

- Miller
Jun 1 '07 #7

P: 30
Hi Miller.

foreach my $line (<IN>)
{
print join ' ', map {sprintf("%x", $_) . '-' . pack('c', $_)} unpack 'c*', $line;
}

Your code above returns these hex values for the junk character: 'ffffffe2-â ffffff80-€ ffffffa9-©'

use Unicode::MapUTF8 qw(from_utf8);

$line = from_utf8({ -string => $line, -charset => 'ISO-8859-1' });

When I ran the above code, I got this error:

Can't locate Unicode/MapUTF8.pm in @INC (@INC contains: C:/Perl/site/lib C:/Perl/lib .) at 1.pl line 5.
BEGIN failed--compilation aborted at 1.pl line 5.

I also tried

$_ =~ s/\r//g;

as Kevin suggested, but the character was still not removed.

This is the real content after extraction from InDesign (style-based elements):


<p>In North American popular imagination.</p>
<p>Today’s Inuit usually live in towns.</p>
<p>Welcome to ground zero on the road to environmental apocalypse:</p>
<p>Very suddenly, the Inuit.</p>

Below is the same content after I converted it with Perl to structured XML (DTD-based):


<p local-id="-1">In North American popular imagination.</p>

<p local-id="-2">Today’s Inuit usually live in towns.</p>

<p local-id="-3">Welcome to ground zero apocalypse:</p>

<p local-id="-4">Very suddenly, the Inuit.</p>

When I open my XML file in the DOS editor, it shows:

 for that junk.

I have tried many ways but can't find a solution. Please advise.

Thanks & Regards.
vishwa Ram.
Jun 2 '07 #8
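
Once the sign extension is dropped, the three bytes in the dump above are e2 80 a9, which is the UTF-8 encoding of U+2029 PARAGRAPH SEPARATOR. That character is neither \r nor \n, which would explain why s/\r//g has no effect. A minimal sketch for stripping it follows, assuming it is the only junk character and that lines are read as raw bytes; strip_paragraph_separator is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: delete the raw UTF-8 byte sequence for U+2029
# (E2 80 A9) wherever it occurs in a line of bytes.
sub strip_paragraph_separator {
    my ($line) = @_;
    $line =~ s/\xE2\x80\xA9//g;
    return $line;
}

my $dirty = "\xE2\x80\xA9<p>In North American popular imagination.</p>\n";
print strip_paragraph_separator($dirty);
```

If the file is opened with an :encoding(UTF-8) layer instead, the text is already decoded into characters and the equivalent pattern would be s/\x{2029}//g.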

savanm
P: 85
If you want to remove a junk character, try this:

$line =~ s/(\n)(.{1})(<([^>]+)>)/$1$3/sg;

This removes any single character that comes after a newline and before the starting tag.
Jun 2 '07 #9

P: 1
I have the same problem in C#/.NET. I have a 24 GB XML file in UTF-8 encoding that contains illegal hexadecimal character codes in places. I am able to locate them but can't remove them from the document. Because the file is so large I can't use XmlDocument, so at present I am trying SAXParser and XmlReader as well as StreamReader. Can anybody help me with this?

the xml line is:
<ASIN>0140172572 𤀬 </ASIN>
and the error is:
Error 1 Character '�', hexdecimal value 0xd850 is illegal in XML documents.
Jul 16 '07 #10
