Bytes | Developer Community
How to remove junk from the beginning of line

Hello.

I can't identify a junk character that appears at the beginning of every line in an XML file. The junk looks like a square. It came from an InDesign file when the content was extracted. It actually seems to be junk left over from the newline.

see below.


<p align="left"

<p align="left"

<p align="left"

I would like to delete the above symbol at the beginning of each line.

Thanks.
May 31 '07 #1
KevinADC
4,059 Expert 2GB
could be a carriage return, try:

$line =~ s/\r//g;
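
A minimal sketch of that substitution wrapped in a helper, so it can be tried on a sample string first. The name strip_cr is made up for illustration and is not from the thread. One assumption worth checking: on Windows, Perl's default text-mode read usually translates \r\n to \n already, so in that setup there may be no \r left to match.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): remove every carriage
# return (\r, byte 0x0D) from a line, leaving the line feed in place.
sub strip_cr {
    my ($line) = @_;
    $line =~ s/\r//g;
    return $line;
}

# A Windows-style line ends in "\r\n"; after stripping, only "\n" remains.
my $cleaned = strip_cr("<p align=\"left\">text</p>\r\n");
print $cleaned;
```

If the junk survives this, it is probably not a carriage return at all, and the next step is to look at the actual bytes.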
May 31 '07 #2
prn
254 Expert 100+
Hi vishwa,

Probably the simplest way to handle it doesn't involve Perl at all.

From what you have said, it looks like the problem is that InDesign (running on a Windows PC) created this file and then you transferred it to a Linux/Unix box to do whatever you plan to do next with it.

The first and most obvious fix is to transfer it in text mode rather than binary mode. That will take care of the handling of end-of-line characters. If, for some reason, you cannot do the transfer in text mode, then your Unix/Linux system almost certainly has a utility on it for exactly this purpose. It would probably be called "dos2unix" and you would use it to fix a file "foo.xml" using the command:
dos2unix foo.xml

If this doesn't work or if some of my other assumptions were incorrect, please let us know.

HTH,
Paul
May 31 '07 #3
Dear KevinADC,

I tried the code you suggested, but the character was not deleted.

Thanks
vishwa Ram.
Jun 1 '07 #4
Dear Paul,
Yes, that's right. The InDesign extraction was done on a Windows PC using UTF-8 format (this is the only option here), but I have to clean up the XML file. Please give a solution as a Perl program.

Thanks

vishwa Ram.
Jun 1 '07 #5
miller
1,089 Expert 1GB
Vishwa,

Random junk at the beginning of the line isn't specific enough for us to be able to help you. It looks like it may be carriage returns like Kevin suggested, but you say that the Perl substitution didn't work. I suggest that you open up the file in a hex editor to determine what this character is, or create a Perl script that will analyze each line and convert it to its hex values so that you can determine what this character is.

Something like this should work.

# Dump every byte of each line as "hex-character" pairs (IN must already be open)
foreach my $line (<IN>) {
    print join ' ', map {sprintf("%x", $_) . '-' . pack('c', $_)} unpack 'c*', $line;
}
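
One caveat about the dump above: the 'c*' template unpacks signed bytes, so any byte above 0x7f comes out negative and %x prints it with a long run of leading f digits. A minimal sketch of the same idea with the unsigned template 'C*' follows. The helper name hex_dump is invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the bytes of a string as two-digit hex values.
# 'C*' unpacks unsigned bytes (0..255), so values above 0x7f are not
# sign-extended the way they are with the signed 'c*' template.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;
}

print hex_dump("\x0d<p>"), "\n";   # prints "0d 3c 70 3e"
```

Calling hex_dump on the first few bytes of a problem line should make the junk stand out in front of the 3c that encodes "<".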
Then you simply use a regex like before on the line to strip out the offending character.

$line =~ s/\x45//g;
Although obviously 45 probably isn't the hex number that is the problem.

Alternatively, if your file really is UTF-8, then you need to convert it from that.

use Unicode::MapUTF8 qw(from_utf8);

# Convert a string in 'UTF8' encoding to encoding 'ISO-8859-1'
$line = from_utf8({ -string => $line, -charset => 'ISO-8859-1' });
Either way, it's up to you to determine what the exact problem is. We are not going to be able to decipher what "junk" is without actual data.

Good luck,
- Miller
Jun 1 '07 #6
miller
1,089 Expert 1GB
PS.

How did you come to the conclusion that Kevin's code did not work? What exactly did you do?

- Miller
Jun 1 '07 #7
Hi Miller.

foreach my $line (<IN>)
{
print join ' ', map {sprintf("%x", $_) . '-' . pack('c', $_)} unpack 'c*', $line;
}

Your code above returns these hex values for the junk character: 'ffffffe2-â ffffff80-€ ffffffa9-©'
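
Read as unsigned bytes, the three values in that output are e2 80 a9 (the leading ffffff is only sign extension from the signed 'c' template). That byte sequence is the UTF-8 encoding of U+2029 PARAGRAPH SEPARATOR, which is not a carriage return and so is untouched by s/\r//g. A minimal sketch of removing it at the byte level follows, assuming the junk is exactly this sequence and the line has been read as raw bytes. The helper name strip_para_sep_bytes is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: delete the three-byte UTF-8 sequence e2 80 a9
# (U+2029 PARAGRAPH SEPARATOR) from a string of raw, undecoded bytes.
sub strip_para_sep_bytes {
    my ($bytes) = @_;
    $bytes =~ s/\xE2\x80\xA9//g;
    return $bytes;
}

my $raw = "\xE2\x80\xA9<p>Very suddenly, the Inuit.</p>\n";
print strip_para_sep_bytes($raw);   # prints the <p> line without the junk
```

Matching all three bytes together matters here: other characters in the file, such as the curly apostrophe, also start with e2 80 and must be left alone.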

use Unicode::MapUTF8 qw(from_utf8);

$line = from_utf8({ -string => $line, -charset => 'ISO-8859-1' });

When executing the above code, I got this error:

Can't locate Unicode/MapUTF8.pm in @INC (@INC contains: C:/Perl/site/lib C:/Perl/lib .) at 1.pl line 5.
BEGIN failed--compilation aborted at 1.pl line 5.
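
That error means Unicode::MapUTF8, a CPAN module, is not installed on this machine. A sketch that avoids it by using the Encode module, which ships with Perl 5.8 and later, is shown here. It assumes the input is valid UTF-8 and that the junk is one of the Unicode line or paragraph separators. The helper name strip_unicode_separators is invented for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode UTF-8 bytes to characters, delete the
# Unicode line separator (U+2028) and paragraph separator (U+2029),
# and encode the result back to UTF-8 bytes.
sub strip_unicode_separators {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);
    $text =~ s/[\x{2028}\x{2029}]//g;
    return encode('UTF-8', $text);
}

my $raw = "\xE2\x80\xA9<p>Today\xE2\x80\x99s Inuit usually live in towns.</p>\n";
print strip_unicode_separators($raw);
```

Unlike converting to ISO-8859-1, this keeps the file in UTF-8, so characters with no Latin-1 equivalent, such as the curly apostrophe, survive unchanged.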

I also tried $_ =~ s/\r//g; as Kevin suggested, but it did not remove the character.

This is the real content after extraction from InDesign (style-based elements):


<p>In North American popular imagination.</p>
<p>Today’s Inuit usually live in towns.</p>
<p>Welcome to ground zero on the road to environmental apocalypse:</p>
<p>Very suddenly, the Inuit.</p>

Below is the content after I used Perl to convert it to structured XML (DTD-based):


<p local-id="-1">In North American popular imagination.</p>

<p local-id="-2">Today’s Inuit usually live in towns.</p>

<p local-id="-3">Welcome to ground zero apocalypse:</p>

<p local-id="-4">Very suddenly, the Inuit.</p>

When I open my XML file in the DOS editor, it shows:

 for that junk.

I have tried many ways but can't find a solution. Please suggest one.

Thanks & Regards.
vishwa Ram.
Jun 2 '07 #8
savanm
85
If you want to remove a junk character, try this:

$line =~ s/(\n)(.{1})(<([^>]+)>)/$1$3/sg;

This removes any single character that sits after a newline and before the starting tag.
Jun 2 '07 #9
I have the same problem in C#/.NET. I have a 24 GB XML file in UTF-8 encoding, but it contains stray hexadecimal codes in places. I am able to locate them but can't remove them from the document. Can anybody help me with that? Because the file is so large, I can't use XMLDocument. At present I am trying SAXParser and XMLReader as well as StreamReader. Please help me in this regard.

The XML line is:
<ASIN>0140172572 𤀬 </ASIN>
and the error is:
Error 1 Character '�', hexdecimal value 0xd850 is illegal in XML documents.
Jul 16 '07 #10
