473,326 Members | 2,732 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,326 software developers and data experts.

Correct use of Unicode in RegExp

I am having great difficulty using Unicode characters in a Regular
Expression, I am trying to match extended Unicode characters.

I am wishing to split a large Dumpfile (containing only JPEGS) I have used
a hex editor to manually extract a file just to show it can be done, so I
know the input is intact.

Each JPEG starts with the Unicode characters \u00FF \u00D8 \u00FF \u00E1
and there are plenty of these to be found within the file.

open(DUMPFILE, "/pathtodumpfile");
my $line;
while(<DUMPFILE>) {
$line = $line.$_;
}
@files = split(/\x{00FF}\x{00D8}\x{00FF}\x{00E1}/, $line);

(As you may see from the above style I am relatively inexperienced to the
perl side of programming ;)

I have tried inserting the Unicode characters in various ways \xFF, \x{FF}
etc. It just doesn't seem to find the pattern. I am at a bit of a loss as
to whether it is my regexp that is wrong, my use of Unicode characters
or use of Extended Unicode characters.

many thanks for your help.

cheers
Mike

Jul 19 '05 #1
2 2120
On Thu, 22 Apr 2004 22:36:44 +0100, mike blamires scribbled furiously:
I am having great difficulty using Unicode characters in a Regular
Expression, I am trying to match extended Unicode characters.

I am wishing to split a large Dumpfile (containing only JPEGS) I have used
a hex editor to manually extract a file just to show it can be done, so I
know the input is intact.

Each JPEG starts with the Unicode characters \u00FF \u00D8 \u00FF \u00E1
and there are plenty of these to be found within the file.

open(DUMPFILE, "/pathtodumpfile");
my $line;
while(<DUMPFILE>) {
$line = $line.$_;
}
@files = split(/\x{00FF}\x{00D8}\x{00FF}\x{00E1}/, $line);

(As you may see from the above style I am relatively inexperienced to the
perl side of programming ;)

I have tried inserting the Unicode characters in various ways \xFF, \x{FF}
etc. It just doesn't seem to find the pattern. I am at a bit of a loss as
to whether it is my regexp that is wrong, my use of Unicode characters
or use of Extended Unicode characters.

many thanks for your help.

cheers
Mike


Apologies, incorrect newsgroup first time round. Please see above.
cheers
Mike

Jul 19 '05 #2
mike blamires <mi**@mysurname.co.uk> wrote in message news:<pa****************************@mysurname.co. uk>...
I am having great difficulty using Unicode characters in a Regular
Expression, I am trying to match extended Unicode characters.

I am wishing to split a large Dumpfile (containing only JPEGS) I have used
a hex editor to manually extract a file just to show it can be done, so I
know the input is intact.

Each JPEG starts with the Unicode characters \u00FF \u00D8 \u00FF \u00E1
and there are plenty of these to be found within the file.

open(DUMPFILE, "/pathtodumpfile");
my $line;
while(<DUMPFILE>) {
$line = $line.$_;
}
@files = split(/\x{00FF}\x{00D8}\x{00FF}\x{00E1}/, $line);

(As you may see from the above style I am relatively inexperienced to the
perl side of programming ;)

I have tried inserting the Unicode characters in various ways \xFF, \x{FF}
etc. It just doesn't seem to find the pattern. I am at a bit of a loss as
to whether it is my regexp that is wrong, my use of Unicode characters
or use of Extended Unicode characters.

many thanks for your help.

cheers
Mike


First of all, I've never worked with unicode characters.

I see you've tried to do something with \xFF and \x{FF} without
success. Have you tried \\xFF and \\x\{FF\} instead (notice the '\'
before all characters that aren't alphapetic or numeric)?

Good luck,
DNA
Jul 19 '05 #3

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

6
by: David Opstad | last post by:
I have a question about text rendering I'm hoping someone here can answer. Is there a way of doing linguistically correct rendering of Unicode strings in Python? In simple cases like Latin or...
1
by: marek | last post by:
trying this example to make print MatchObject reference. Fails (prints None). Does anybody know where I am wrong? # -*- coding: cp1251 -*- import re # pattern in Ukrainian ('привіт') p = ...
2
by: Ron Brennan | last post by:
I'm trying to test a character for not being uppercase, including Unicode Latin-1. The following JavaScript expression doesn't work. It is embedded in a servlet that targets IE 5 or later. if...
1
by: Mark Johnson | last post by:
I wonder if anyone has a solution? I wanted to use the web browser control as a 'zoom' box for a smaller textbox. I can format in the control, and save whatever formatting as HTML code back to the...
0
by: Mark Johnson | last post by:
The last reply got sort of cutoff. So here again: So for anyone interested, here's the simple regexp patterns for the substitutions required. The textbox control is being 'zoomed' in a popup...
4
by: Jon Maz | last post by:
Hi All, I want to strip the accents off characters in a string so that, for example, the (Spanish) word "práctico" comes out as "practico" - but ignoring case, so that "PRÁCTICO" comes out as...
4
by: possibilitybox | last post by:
I'm trying to make a unicode friendly regexp to grab sentences reasonably reliably for as many unicode languages as possible, focusing on european languages first, hence it'd be useful to be able...
0
by: Ben Wolfson | last post by:
I've got a db some of whose elements have been created automatically from filesystem data (whose encoding is iso-8859-1). If I try to select one of those elements using a standard SQL construct,...
1
by: deepusrp | last post by:
Helo, I have a text box, in which i want to display the unicode characters of a particular language; on pressing of a key event. How can i accomplish that. What i am doing is var...
0
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 6 Mar 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). In this month's session, we are pleased to welcome back...
1
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 6 Mar 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). In this month's session, we are pleased to welcome back...
1
by: PapaRatzi | last post by:
Hello, I am teaching myself MS Access forms design and Visual Basic. I've created a table to capture a list of Top 30 singles and forms to capture new entries. The final step is a form (unbound)...
1
by: CloudSolutions | last post by:
Introduction: For many beginners and individual users, requiring a credit card and email registration may pose a barrier when starting to use cloud servers. However, some cloud server providers now...
1
by: Defcon1945 | last post by:
I'm trying to learn Python using Pycharm but import shutil doesn't work
1
by: Shællîpôpï 09 | last post by:
If u are using a keypad phone, how do u turn on JavaScript, to access features like WhatsApp, Facebook, Instagram....
0
by: af34tf | last post by:
Hi Guys, I have a domain whose name is BytesLimited.com, and I want to sell it. Does anyone know about platforms that allow me to list my domain in auction for free. Thank you
0
by: Faith0G | last post by:
I am starting a new it consulting business and it's been a while since I setup a new website. Is wordpress still the best web based software for hosting a 5 page website? The webpages will be...
0
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 3 Apr 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome former...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.