Hello.
I can't identify a junk character that appears at the beginning of every line in an XML file. It looks like a square, and it came from an InDesign file when the content was extracted. I think it is junk left over from a newline.
See below.
<p align="left"
<p align="left"
<p align="left"
I would like to delete the above symbol at the beginning of each line.
Thanks.
It could be a carriage return. Try:
$line =~ s/\r//g;
Hi vishwa,
Probably the simplest way to handle it doesn't involve Perl at all.
From what you have said, it looks like the problem is that InDesign (running on a Windows PC) created this file and then you transferred it to a Linux/Unix box to do whatever you plan to do next with it.
The first, and most obvious, fix is to transfer it in text mode rather than binary mode. That will take care of the handling of end-of-line characters. If, for some reason, you cannot do the transfer in text mode, then your Unix/Linux system almost certainly has a utility on it for exactly this purpose. It would probably be called "dos2unix", and you would use it to fix a file "foo.xml" using the command:
dos2unix foo.xml
If this doesn't work or if some of my other assumptions were incorrect, please let us know.
HTH,
Paul
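Since a Perl solution is what is being asked for, here is a minimal sketch of what dos2unix does, written in Perl. It assumes the only problem is CRLF line endings, which may not be the case for this file. The function name crlf_to_lf and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Convert DOS (CRLF) line endings to Unix (LF) line endings in a string.
sub crlf_to_lf {
    my ($text) = @_;
    $text =~ s/\r\n/\n/g;
    return $text;
}

# Hypothetical sample with DOS line endings.
print crlf_to_lf("<p>one</p>\r\n<p>two</p>\r\n");

# With two file names on the command line, convert the first file into the second.
if (@ARGV == 2) {
    my ($in_name, $out_name) = @ARGV;
    # :raw stops Perl on Windows from translating line endings on its own.
    open my $in,  '<:raw', $in_name  or die "Cannot read $in_name: $!";
    open my $out, '>:raw', $out_name or die "Cannot write $out_name: $!";
    local $/;                       # slurp mode: read the whole file at once
    my $content = <$in>;
    print {$out} crlf_to_lf($content);
    close $in;
    close $out or die "Cannot close $out_name: $!";
}
```

Opening both handles with :raw matters on Windows, where the default text layer already hides the carriage returns from the script.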
Dear KevinADC,
I tried the above code, but the character is not deleted.
Thanks
vishwa Ram.
Dear Paul,
Yes, that's right: the InDesign extraction was done on a Windows PC using UTF-8 format (this is the only option here). But I have to clean up the XML file, so please give a solution as a Perl program.
Thanks
vishwa Ram.
Vishwa,
Random junk at the beginning of the line isn't specific enough for us to be able to help you. It looks like it may be carriage returns, as Kevin suggested, but you say that the Perl substitution didn't work. I suggest that you open the file in a hex editor to determine what this character is, or write a Perl script that reads each line and converts it to its hex values so that you can determine what this character is.
Something like this should work:
foreach my $line (<IN>) {
    print join ' ', map {sprintf("%x", $_) . '-' . pack('c', $_)} unpack 'c*', $line;
}
Then you simply use a regex like before on the line to strip out the offending character.
Although obviously 45 probably isn't the hex number that is the problem.
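One caveat with the dump above: the 'c*' template unpacks signed bytes, so any byte of 0x80 or higher comes out as a negative number and prints sign-extended (for example ffffffe2 instead of e2). A minimal sketch of an unsigned variant follows. The function name hex_bytes and the sample line are made up for illustration.

```perl
use strict;
use warnings;

# Return a space-separated hex dump of the raw bytes in a string.
# 'C*' unpacks unsigned bytes (0..255), so 0xE2 prints as "e2" rather than
# the sign-extended value that the signed 'c*' template produces.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;
}

# Hypothetical sample: three non-ASCII bytes followed by a tag.
my $sample = "\xE2\x80\xA9<p>";
print hex_bytes($sample), "\n";    # prints: e2 80 a9 3c 70 3e

# To dump a real file, pass its name on the command line.
if (@ARGV) {
    # :raw reads the bytes exactly as they are stored on disk.
    open my $in, '<:raw', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    while (my $line = <$in>) {
        print hex_bytes($line), "\n";
    }
    close $in;
}
```

Reading with :raw keeps Perl from translating line endings, so a carriage return would show up as 0d in the dump if one is present.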
Alternatively, if your file really is UTF-8, then you need to convert it from that:
use Unicode::MapUTF8 qw(from_utf8);

# Convert a string in 'UTF8' encoding to encoding 'ISO-8859-1'
$line = from_utf8({ -string => $line, -charset => 'ISO-8859-1' });
Either way, it's up to you to determine what the exact problem is. We are not going to be able to decipher what the "junk" is without actual data.
Good luck,
- Miller
PS.
How did you come to the conclusion that Kevin's code did not work? What exactly did you do?
- Miller
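If Unicode::MapUTF8 is not installed, the Encode module that ships with Perl 5.8 and later can do the same kind of conversion. The sketch below is only an illustration of that alternative, not the code suggested above. The function name utf8_to_latin1 and the sample string are made up. Note that characters with no ISO-8859-1 equivalent are replaced by a substitution character, so this converts the text but does not cleanly delete anything.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Convert a UTF-8 byte string to ISO-8859-1 and return the converted copy.
sub utf8_to_latin1 {
    my ($octets) = @_;
    # from_to() converts its first argument in place; $octets is our own copy.
    from_to($octets, 'UTF-8', 'ISO-8859-1');
    return $octets;
}

# Hypothetical sample: "caf" followed by e-acute encoded as UTF-8 (c3 a9).
my $latin1 = utf8_to_latin1("caf\xC3\xA9");
print join(' ', map { sprintf '%02x', $_ } unpack 'C*', $latin1), "\n";
# prints: 63 61 66 e9
```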
Hi Miller.
foreach my $line (<IN>)
{
print join ' ', map {sprintf("%x", $_) . '-' . pack('c', $_)} unpack 'c*', $line;
}
Your code above returns the following hex values for the junk character: 'ffffffe2-â ffffff80-€ ffffffa9-©'
use Unicode::MapUTF8 qw(from_utf8);
$line = from_utf8({ -string => $line, -charset => 'ISO-8859-1' });
When executing the above code, I got this error:
Can't locate Unicode/MapUTF8.pm in @INC (@INC contains: C:/Perl/site/lib C:/Perl/lib .) at 1.pl line 5.
BEGIN failed--compilation aborted at 1.pl line 5.
I also tried the following, as Kevin suggested:
$_ =~ s/\r//g;
But it did not remove the junk.
This is the real content, after extraction from InDesign (style-based elements):
<p>In North American popular imagination.</p>
<p>Today’s Inuit usually live in towns.</p>
<p>Welcome to ground zero on the road to environmental apocalypse:</p>
<p>Very suddenly, the Inuit.</p>
Below is the content after I converted it with Perl to structured XML (DTD-based):
<p local-id="-1">In North American popular imagination.</p>
<p local-id="-2">Today’s Inuit usually live in towns.</p>
<p local-id="-3">Welcome to ground zero apocalypse:</p>
<p local-id="-4">Very suddenly, the Inuit.</p>
When I open my XML file in the DOS editor, it shows:

for that junk.
I have tried many ways but can't find a solution. Please suggest one.
Thanks & Regards.
vishwa Ram.
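The values in the dump posted above (ffffffe2, ffffff80, ffffffa9) are the sign-extended forms of the bytes e2 80 a9. That byte sequence is the UTF-8 encoding of U+2029 PARAGRAPH SEPARATOR, not a carriage return, which would explain why s/\r//g has no effect. A minimal sketch of removing it follows, assuming the file is valid UTF-8 and that U+2029 is the only unwanted character. The function names and the sample line are made up for illustration. Encode ships with Perl 5.8 and later, so nothing extra has to be installed.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Remove U+2029 PARAGRAPH SEPARATOR from a UTF-8 encoded byte string.
sub strip_paragraph_separator {
    my ($octets) = @_;
    my $text = decode('UTF-8', $octets);    # bytes -> characters
    $text =~ s/\x{2029}//g;                 # delete every U+2029
    return encode('UTF-8', $text);          # characters -> bytes
}

# Byte-level variant: delete the three-byte sequence directly, no decoding.
sub strip_paragraph_separator_bytes {
    my ($octets) = @_;
    $octets =~ s/\xE2\x80\xA9//g;
    return $octets;
}

# Hypothetical sample line with the junk bytes in front of the tag.
my $sample = "\xE2\x80\xA9<p>Very suddenly, the Inuit.</p>\n";
print strip_paragraph_separator($sample);

# With two file names on the command line, clean the first file into the second.
if (@ARGV == 2) {
    my ($in_name, $out_name) = @ARGV;
    open my $in,  '<:raw', $in_name  or die "Cannot read $in_name: $!";
    open my $out, '>:raw', $out_name or die "Cannot write $out_name: $!";
    while (my $line = <$in>) {
        print {$out} strip_paragraph_separator($line);
    }
    close $in;
    close $out or die "Cannot close $out_name: $!";
}
```

Both variants leave other non-ASCII characters, such as the curly apostrophe in "Today’s", untouched. The byte-level variant is safe for valid UTF-8 because the sequence e2 80 a9 cannot occur inside the encoding of any other character.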
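If you want to remove a junk character, try this:
$line =~ s/(\n)(.{1})(<([^>]+)>)/$1$3/sg;
This removes any single character that appears after a newline and before a start tag. For it to match, $line must contain the text together with its newlines (for example, the whole file read into one string).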
I have the same problem in C#/.NET. I have a 24 GB XML file in UTF-8 encoding that contains hexadecimal codes in some places. I am able to locate them, but I can't remove them from the document. Because the file is so large, I can't use XMLDocument; at present I am trying SAXParser and XMLReader as well as StreamReader. Can anybody help me with this?
The XML line is:
<ASIN>0140172572 𤀬 </ASIN>
and the error is:
Error 1 Character '', hexdecimal value 0xd850 is illegal in XML documents.