Regex Matching on Readline()

Anyone have any trouble pattern matching on lines returned by
readline? Here's an example:

string = "Accounting - General"
pat = ".+\s-"

Should match on "Accounting -". However, if I read that string in from
a file it will not match. In fact, I can't get anything to match
except ".*".
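
For reference, the pattern does behave as described on an in-memory string. The following is a minimal Python 3 check, not the poster's actual script: the variable names mirror the post, and using `re.match` is an assumption, since the post doesn't say which call was used.

```python
import re

string = "Accounting - General"
pat = r".+\s-"  # raw string so \s reaches the regex engine unchanged

# re.match anchors at the start; the greedy .+ backtracks to the last "whitespace, hyphen" pair
m = re.match(pat, string)
matched = m.group(0) if m else None
print(matched)  # Accounting -
```

So the pattern itself is fine, which points at the data coming back from the file.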

I'm almost certain that it has something to do with the characters
that python returns from readline(). If I have this in a file:

Accounting - General

And do a:

line = f.readline()
print line

I get:

A c c o u n t i n g - G e n e r a l

Not sure why. I'm a nub at Python, so any help is appreciated. They
look like spaces to me, but aren't (I've tried matching on spaces too).
- james
Dec 20 '07 #1
On Dec 21, 6:50 am, jwwest <jww...@gmail.com> wrote:
> Not sure why, I'm a nub at Python so any help is appreciated. They
> look like spaces to me, but aren't (I've tried matching on spaces too)

To find out what the pseudo-spaces are, do this:

print repr(open("the_file", "rb").read()[:100])

and show us (copy/paste) what you get.

Also, tell us what platform you are running Python on, and how the
file was created (by what software, on what platform).
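
The one-liner above is Python 2. A rough Python 3 equivalent might look like the sketch below; `peek_bytes` is a hypothetical helper name, and the temporary file with arbitrary bytes only stands in for "the_file" so the example runs on its own.

```python
import os
import tempfile


def peek_bytes(path, count=100):
    """Return the first `count` raw bytes of a file, with no decoding applied."""
    with open(path, "rb") as f:
        return f.read(count)


# Hypothetical sample file; the bytes are arbitrary and only illustrate the output format
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"ab\x00\tcd\r\n")
    sample_path = tmp.name

head = peek_bytes(sample_path)
os.remove(sample_path)

# repr() shows non-printable bytes as escapes instead of rendering them as blanks
print(repr(head))  # b'ab\x00\tcd\r\n'
```

Opening in binary mode matters here: it bypasses any decoding and newline translation, so what you see is exactly what is on disk.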

Dec 20 '07 #2
On Dec 20, 2:13 pm, John Machin <sjmac...@lexicon.net> wrote:
> To find out what the pseudo-spaces are, do this:
> print repr(open("the_file", "rb").read()[:100])
> and show us (copy/paste) what you get.
> Also, tell us what platform you are running Python on, and how the
> file was created (by what software, on what platform).

Here's my output:
'A\x00c\x00c\x00o\x00u\x00n\x00t\x00i\x00n\x00g\x00 \x00-\x00 \x00G
\x00e\x00n\x00e\x00r\x00a\x00l\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00
\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00
\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00'

I'm running Python on Windows. The file was initially created as
output from SQL Management Studio. I've re-saved it using TextPad
which tells me it's Unicode and PC formatted.
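
The output above has a NUL byte after every character, which is what UTF-16 little-endian text looks like when read as plain bytes. The sketch below, written for Python 3, uses the first 40 bytes of that output to show why the pattern fails before decoding and works after. Decoding with `latin-1` is only an assumed stand-in for how the undecoded data looked to the regex engine in the original Python 2 script.

```python
import re

# First 40 bytes of the posted output (trailing padding omitted)
raw = (
    b"A\x00c\x00c\x00o\x00u\x00n\x00t\x00i\x00n\x00g\x00"
    b" \x00-\x00 \x00"
    b"G\x00e\x00n\x00e\x00r\x00a\x00l\x00"
)
pat = r".+\s-"

# Undecoded view: latin-1 maps each byte to one character, so every NUL survives.
# The space is followed by NUL, not by "-", so "\s-" can never match.
undecoded = raw.decode("latin-1")
before = re.match(pat, undecoded)

# Decoded view: each two-byte pair becomes one character and the NULs disappear
text = raw.decode("utf-16-le")
after = re.match(pat, text)

print(before)          # None
print(after.group(0))  # Accounting -
```

This also explains why only ".*" matched: it is the one pattern that doesn't care what sits between the visible characters.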
Dec 20 '07 #3
On Dec 21, 7:21 am, jwwest <jww...@gmail.com> wrote:
> I'm running Python on Windows. The file was initially created as
> output from SQL Management Studio. I've re-saved it using TextPad
> which tells me it's Unicode and PC formatted.

In Windows tools such as TextPad, "Unicode" means UTF-16; your output shows the low byte first, so it is little-endian.

Try this:

import codecs
f = codecs.open("the_file", "r", encoding="utf_16_le")
for uline in f:
    line = uline.encode('cp1252')  # or some other encoding if my guess isn't correct
    # proceed as usual
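
For readers on Python 3, where the built-in `open` accepts an `encoding` argument and strings are Unicode by default, the same idea could be sketched as follows. The temporary file and the second sample line ("Sales - Retail") are made up for illustration; only the first line comes from the thread.

```python
import os
import re
import tempfile

# Hypothetical stand-in for "the_file": UTF-16-LE text with Windows line endings, no BOM
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write("Accounting - General\r\nSales - Retail\r\n".encode("utf-16-le"))
    path = tmp.name

pat = re.compile(r".+\s-")
matches = []

# Text mode decodes each line to str, so no re-encoding step is needed before matching
with open(path, "r", encoding="utf-16-le") as f:
    for line in f:
        m = pat.match(line)
        if m:
            matches.append(m.group(0))

os.remove(path)
print(matches)  # ['Accounting -', 'Sales -']
```

If the file starts with a byte order mark, `encoding="utf-16"` is probably the better choice, since that codec consumes the BOM, whereas `utf-16-le` leaves it in the first line as U+FEFF.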

Cheers,
John
Dec 20 '07 #4
