Bytes | Software Development & Data Engineering Community

Regular Expression Blues

Iasthaai
Hello everyone, I'm new to Python and also new to this forum and I have a question regarding regular expressions, so here goes:

I am attempting to read in a text and do some pattern matching before I store words in a dictionary. I basically want to read in the words and remove any numbers, punctuation, or other special characters from the beginning and end of the word. However, I do wish to keep all of the above in the word if the character is embedded between an A-Z or a-z character.

I have tried many different regular expressions and I will list a few below:

    import re
    regword = re.compile(r"[A-Z]+(\S*[A-Z]+)*", re.I)

This one seems to work when I just use the following script:

    for line in open('text.txt', 'r'):
        for word in regword.finditer(line):
            print(word.group(0))

However, when I go to put these words into a dictionary, the first letter of each word is getting chopped off! Here is the following code that I used, attempting to place these words into a dictionary...

    def manip_word(file):
        wordlist = regword.findall(open(file).read())
        wordfreq = [wordlist.count(p) for p in wordlist]
        dictionary = dict(zip(wordlist, wordfreq))
        aux = [(key, dictionary[key]) for key in dictionary]
        aux.sort()
        for a in aux: print(a)

After using this code, I realized that file seemed to be a keyword, so I changed file to file1 in both the function definition and the open call. This did seem to affect my output, but in an even worse way: it only listed words beginning with ' and -, i.e., 't as in haven't and -expression as in sub-expression.

I also attempted to explicitly state the characters I would accept between a-z characters...but to no avail.

    import re
    regword = re.compile(r"[A-Z]+([\.,\?,...]*[A-Z]+)*", re.I)

Does anyone have any suggestions? The fact that my first attempt worked when I simply listed the words out seems to suggest that the way I'm storing them in a dictionary is incorrect.

Thanks for reading and any possible help! :-)
Feb 25 '07 #1
bvdet
Do you have to use 're'?

    import re

    def manip_word(fn):
        patt = re.compile(r'[a-zA-Z]')
        wordlist = open(fn).read().strip().split()
        wordlist1 = []
        for w in wordlist:
            # keep only tokens that contain at least one letter
            if patt.search(w):
                wordlist1.append(w.lower().strip(".,:?!()[]/\\\n\"\'"))
        # count over the cleaned list so words and frequencies stay aligned
        wordfreq = [wordlist1.count(p) for p in wordlist1]
        dictionary = dict(zip(wordlist1, wordfreq))
        aux = [(key, dictionary[key]) for key in dictionary]
        aux.sort()
        for a in aux: print(a)
        return aux
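
The strip list in the reply above covers common punctuation but not digits, which the original question also wanted removed from word edges. A minimal sketch of one way to widen it, assuming only ASCII punctuation and digits need trimming; strip_edges and the sample string are hypothetical names made up for illustration:

```python
import string

def strip_edges(word):
    """Trim ASCII punctuation and digits from both ends; interior characters are kept."""
    return word.strip(string.punctuation + string.digits)

words = "123$@#test$#@000 haven't sub-expression, 42".split()
# tokens that are nothing but digits/punctuation collapse to '' and are dropped
cleaned = [w for w in (strip_edges(w) for w in words) if w]
print(cleaned)
```

Because str.strip only removes characters from the two ends, the apostrophe in haven't and the hyphen in sub-expression survive.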
Feb 25 '07 #2
bartonc
Regexes drive me nuts. Hours of fiddling with them to get them to work just right.
Here is a site that might help with the regex part. Please post back here for help with the Python part of the problem.
Feb 26 '07 #3
ghostdog74
Assuming the input file looks like this:

    this is a 123$@#test$#@000 string
    with mulitple 3453line400##@*&

In this case, I expect to find 'test' and 'line', if I interpreted your requirements correctly.
I tested with this:

    >>> import re
    >>> pat = re.compile(r"[^a-zA-Z \n]+(.*?)[^a-zA-Z \n]+", re.I|re.M)
    >>> data = open("file3").read()
    >>> pat.findall(data)
    ['test', 'line']
Feb 26 '07 #4
Iasthaai
Hey everyone! Thanks for all the responses, they were all very helpful. It turns out that I solved this one shortly after posting (that always happens to me!) and I did use a regular expression, mostly the same as my first one (it turns out that the + quantifier on the first [A-Z] was unnecessary for my purposes):

    import re
    regword = re.compile(r"[A-Z](\S*[A-Z]+)*", re.I)

It turns out that the above solves all of my problems for defining the regular expression. My problem was in the way I was USING the regular expression, and I have yet to fully understand why. Anyway, here is the code I used that definitely works!

    def word_manip(file1):
        wordlist = []
        for line in open(file1, 'r'):
            for word in regword.finditer(line):
                # group(0) is the entire match, not just the parenthesized part
                lowercase = (word.group(0)).lower()
                wordlist.append(lowercase)
        wordfreq = [wordlist.count(p) for p in wordlist]
        dictionary = dict(zip(wordlist, wordfreq))
        aux = [(key, dictionary[key]) for key in dictionary]
        aux.sort()
        for a in aux: print(a)

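The counting step above calls wordlist.count once for every word, which rescans the whole list each time. A minimal sketch of the same tally with collections.Counter (added in Python 2.7, so newer than this thread) might look like the following; word_frequencies and the sample lines are hypothetical, and the regex is the one from this post:

```python
import re
from collections import Counter

regword = re.compile(r"[A-Z](\S*[A-Z]+)*", re.I)

def word_frequencies(lines):
    """Return sorted (word, count) pairs for every regex match in the given lines."""
    counts = Counter()
    for line in lines:
        for match in regword.finditer(line):
            # group(0) is the full match, lowercased so 'The' and 'THE' are merged
            counts[match.group(0).lower()] += 1
    return sorted(counts.items())

sample = ["The cat and the hat.", "Haven't seen THE cat!"]
result = word_frequencies(sample)
for pair in result:
    print(pair)
```

Passing an open file object instead of the sample list should work the same way, since both are iterated line by line.
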
Earlier, I was using regword.findall(line), which returns a list. I didn't use the for loop, and the words were stored in the dictionary, except that every word had its first letter chopped off. I guess there must be a function to tell the dictionary where to start each word and I didn't specify.
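
A likely explanation for the chopped first letters, offered as a sketch rather than a confirmed diagnosis of the original script: when a pattern contains a capturing group, re.findall returns the text captured by that group instead of the whole match, while finditer with group(0) returns the whole match. The dictionary is not involved. The names capturing and non_capturing and the sample text below are made up for illustration:

```python
import re

# Same pattern as in the post: the parentheses form a capturing group.
capturing = re.compile(r"[A-Z](\S*[A-Z]+)*", re.I)
# Identical pattern, but (?:...) groups without capturing.
non_capturing = re.compile(r"[A-Z](?:\S*[A-Z]+)*", re.I)

text = "hello haven't 123sub-expression!!"

# findall returns the group's text, i.e. everything after the first letter
group_only = capturing.findall(text)
# group(0) always means the entire match
whole_matches = [m.group(0) for m in capturing.finditer(text)]
# with no capturing group, findall returns entire matches too
fixed = non_capturing.findall(text)

print(group_only)
print(whole_matches)
print(fixed)
```

The same mechanism would account for the earlier "[A-Z]+" variant listing only fragments such as 't: there the leading letters are consumed outside the group, so the group holds only the tail that follows them.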

So that's that. Thanks again!
Feb 26 '07 #5
bartonc
I'm glad that you have kept us up to date. Thanks for that.
Welcome to TheScripts.com. I hope that you keep posting.
Feb 26 '07 #6
