Regular Expression Blues

Iasthaai
Hello everyone, I'm new to Python and also new to this forum and I have a question regarding regular expressions, so here goes:

I am attempting to read in a text and do some pattern matching before I store words in a dictionary. I basically want to read in the words and remove any numbers, punctuation, or other special characters from the beginning and end of the word. However, I do wish to keep all of the above in the word if the character is embedded between an A-Z or a-z character.
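One way to pin down the requirement is to write out a few expected inputs and outputs. The sketch below is only an illustration of the stated goal, not the code used later in this thread: `trim_token`, `EDGE_JUNK`, and the sample tokens are hypothetical names and data. It trims non-letters from both ends of a single token and leaves embedded characters alone.

```python
import re

# Hypothetical helper (not from the thread): remove any run of
# non-letters at the start or at the end of one token. Characters
# embedded between letters are left untouched.
EDGE_JUNK = re.compile(r"^[^A-Za-z]+|[^A-Za-z]+$")

def trim_token(token):
    return EDGE_JUNK.sub("", token)

# Made-up sample tokens covering the cases described above.
samples = ["haven't,", "(sub-expression)", "123$@#test$#@000", "1234"]
trimmed = [trim_token(s) for s in samples]
print(trimmed)
# -> ["haven't", 'sub-expression', 'test', '']
```

A token with no letters at all comes back as an empty string, so a caller would probably want to skip empty results before counting.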

I have tried many different regular expressions and I will list a few below:

    import re
    regword = re.compile(r"[A-Z]+(\S*[A-Z]+)*", re.I)
This one seems to work when I just use the following script:
    for line in open('text.txt', 'r'):
        for word in regword.finditer(line):
            print(word.group(0))
However, when I put these words into a dictionary, the first letter of each word gets chopped off! Here is the code I used to try to place the words into a dictionary:
    def manip_word(file):
        # findall() is applied to the whole file contents at once
        wordlist = regword.findall(open(file).read())
        wordfreq = [wordlist.count(p) for p in wordlist]
        dictionary = dict(zip(wordlist, wordfreq))
        aux = [(key, dictionary[key]) for key in dictionary]
        aux.sort()
        for a in aux: print(a)
After using this code, I realized that file is the name of a built-in, and thus I changed file in both the function's parameter list and the open() call to file1. This did seem to affect my output, but in an even worse way: it only listed words beginning with ' and -, i.e. 't as in haven't and -expression as in sub-expression.

I also attempted to explicitly state the characters I would accept between a-z characters...but to no avail.
    import re
    regword = re.compile(r"[A-Z]+([\.,\?,...]*[A-Z]+)*", re.I)

Does anyone have any suggestions? The fact that my first attempt worked when I simply listed the words out seems to point to the way I'm storing in a dictionary as incorrect.

Thanks for reading and any possible help! :-)
Feb 25 '07 #1
bvdet
2,851 Expert Mod 2GB
Do you have to use 're'?
    import re

    def manip_word(fn):
        # Keep only tokens that contain at least one letter
        patt = re.compile(r'[a-zA-Z]')
        wordlist = open(fn).read().strip().split()
        wordlist1 = []
        for w in wordlist:
            if patt.search(w):
                # Lowercase, then trim punctuation from both ends only
                wordlist1.append(w.lower().strip(".,:?!()[]/\\\n\"\'"))
        # Count over the cleaned list so words and counts stay aligned
        wordfreq = [wordlist1.count(p) for p in wordlist1]
        dictionary = dict(zip(wordlist1, wordfreq))
        aux = [(key, dictionary[key]) for key in dictionary]
        aux.sort()
        for a in aux: print(a)
        return aux
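As a side note on the counting step: calling count() once per word rescans the whole list every time, which gets slow on a large text. A possible alternative, sketched here under the assumption that collections.Counter is available (Python 2.7 or later), builds the same sorted (word, count) pairs in a single pass. The function name `word_frequencies` and the sample list are made up for illustration.

```python
from collections import Counter

def word_frequencies(words):
    """Return (word, count) pairs sorted alphabetically by word."""
    # Counter tallies every word in one pass over the list.
    return sorted(Counter(words).items())

# Made-up sample data standing in for the cleaned word list.
pairs = word_frequencies(["the", "cat", "the", "hat", "cat", "the"])
for pair in pairs:
    print(pair)
# -> ('cat', 2)
#    ('hat', 1)
#    ('the', 3)
```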
Feb 25 '07 #2
bartonc
6,596 Expert 4TB
Regexes drive me nuts. It takes hours of fiddling to get them to work just right.
Here is a site that might help with the regex part. Please post back here for help with the Python part of the problem.
Feb 26 '07 #3
ghostdog74
511 Expert 256MB
Assuming the input file sample looks like this:

    this is a 123$@#test$#@000 string
    with mulitple 3453line400##@*&

In this case, I expect to find 'test' and 'line', if I interpreted your requirements correctly.
I tested with this:
    >>> import re
    >>> pat = re.compile(r"[^a-zA-Z \n]+(.*?)[^a-zA-Z \n]+", re.I|re.M)
    >>> data = open("file3").read()
    >>> pat.findall(data)
    ['test', 'line']
Feb 26 '07 #4
Hey everyone! Thanks for all the responses, they were all very helpful. It turns out that I solved this one shortly after posting (that always happens to me!) and I did use a regular expression, mostly the same as my first one (it turns out that the + quantifier on the first [A-Z] was unnecessary for my purposes):

    import re
    regword = re.compile(r"[A-Z](\S*[A-Z]+)*", re.I)
The above solves all of my problems with defining the regular expression. My problem was in the way I was USING the regular expression, and I still don't fully understand why. Anyway, here is the code I used that definitely works!

    def word_manip(file1):
        wordlist = []
        for line in open(file1, 'r'):
            for word in regword.finditer(line):
                # group(0) is the whole match, not just the parenthesized part
                lowercase = word.group(0).lower()
                wordlist.append(lowercase)
        wordfreq = [wordlist.count(p) for p in wordlist]
        dictionary = dict(zip(wordlist, wordfreq))
        aux = [(key, dictionary[key]) for key in dictionary]
        aux.sort()
        for a in aux: print(a)
Earlier, I was using regword.findall(line) which returns a list. I didn't use the for loop and it was storing in the dictionary, except it stored all the words with the first letter chopped off of the front. I guess there must be a function to tell the dictionary where to start each word and I didn't specify.
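A likely explanation for the chopped-off letters has nothing to do with the dictionary: when a pattern contains a capturing group, re.findall() returns the text of that group rather than the whole match, whereas group(0) on a finditer() match is always the whole match. The sketch below uses a made-up sample line to compare the two patterns from this thread with a non-capturing variant, `(?:...)`, which is an assumption about what was wanted and not something the thread itself settled on.

```python
import re

line = "I haven't seen a sub-expression"  # made-up sample input

with_plus = re.compile(r"[A-Z]+(\S*[A-Z]+)*", re.I)       # first attempt
capturing = re.compile(r"[A-Z](\S*[A-Z]+)*", re.I)        # final pattern
non_capturing = re.compile(r"[A-Z](?:\S*[A-Z]+)*", re.I)  # same, no capture

# findall() with one capturing group returns only that group's text.
print(with_plus.findall(line))
# -> ['', "'t", '', '', '-expression']
print(capturing.findall(line))
# -> ['', "aven't", 'een', '', 'ub-expression']

# group(0) is the whole match, which is why the finditer() loop works.
print([m.group(0) for m in capturing.finditer(line)])
# -> ['I', "haven't", 'seen', 'a', 'sub-expression']

# Without a capturing group, findall() returns whole matches too.
print(non_capturing.findall(line))
# -> ['I', "haven't", 'seen', 'a', 'sub-expression']
```

The empty strings appear where the group matched nothing, for example on one-letter words.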

So that's that. Thanks again!
Feb 26 '07 #5
bartonc
6,596 Expert 4TB
I'm glad that you have kept us up to date. Thanks for that.
Welcome to TheScripts.com. I hope that you keep posting.
Feb 26 '07 #6
