python/regex question... hope someone can help

charonzen

I have a list of strings. These strings are previously selected
bigrams with underscores between them ('and_the', 'nothing_given', and
so on). I need to write a regex that will read another text string
that this list was derived from and replace selections in this text
string with those from my list. So in my text string, '... and the...
' becomes ' ... and_the...'. I can't figure out how to manipulate

re.sub(r'([a-z]*) ([a-z]*)', r'(????)', textstring)

Any suggestions?

Thank you if you can help!

Dec 9 '07 #1


John Machin

On Dec 9, 6:13 pm, charonzen <your.mas...@gmail.com> wrote:

> I have a list of strings. These strings are previously selected
> bigrams with underscores between them ('and_the', 'nothing_given', and
> so on). I need to write a regex that will read another text string
> that this list was derived from and replace selections in this text
> string with those from my list. So in my text string, '... and the...
> ' becomes ' ... and_the...'. I can't figure out how to manipulate
>
> re.sub(r'([a-z]*) ([a-z]*)', r'(????)', textstring)
>
> Any suggestions?

The usual suggestion is: Don't bother with regexes when simple string
methods will do the job.

>>> def ch_replace(alist, text):
...     # Plain substring replacement: no notion of word boundaries.
...     for bigram in alist:
...         original = bigram.replace('_', ' ')
...         text = text.replace(original, bigram)
...     return text
...
>>> print(ch_replace(
...     ['quick_brown', 'lazy_dogs', 'brown_fox'],
...     'The quick brown fox jumped over the lazy dogs.'
... ))
The quick_brown_fox jumped over the lazy_dogs.
>>> print(ch_replace(['red_herring'], 'He prepared herring fillets.'))
He prepared_herring fillets.
>>>

Another suggestion is to ensure that the job specification is not
overly simplified. How did you parse the text into "words" in the
prior exercise that produced the list of bigrams? Won't you need to
use the same parsing method in the current exercise of tagging the
bigrams with an underscore?

Cheers,
John

Dec 9 '07 #2

John Machin

On Dec 9, 6:13 pm, charonzen <your.mas...@gmail.com> wrote:

The following *may* come close to doing what your revised spec
requires:

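import re

def ch_replace2(alist, text):
    for bigram in alist:
        # \b anchors the match to whole words, so 'red herring' no
        # longer matches inside 'prepared herring'. re.escape keeps any
        # regex metacharacters in a token from being interpreted.
        pattern = r'\b' + re.escape(bigram.replace('_', ' ')) + r'\b'
        text = re.sub(pattern, bigram, text)
    return text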
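
As a rough check of the word-boundary idea, here is a self-contained sketch run against the two examples from the earlier post. The name tag_bigrams is made up for this illustration and is not from the thread; Python 3 is assumed.

```python
import re

def tag_bigrams(bigrams, text):
    """Join each listed bigram with an underscore, matching whole words only.

    Hypothetical helper for illustration; not part of the original thread.
    """
    for bigram in bigrams:
        phrase = bigram.replace('_', ' ')
        # re.escape guards against regex metacharacters inside a token.
        pattern = r'\b' + re.escape(phrase) + r'\b'
        # A function replacement avoids backslash processing of the bigram.
        text = re.sub(pattern, lambda m, b=bigram: b, text)
    return text

print(tag_bigrams(['red_herring'], 'He prepared herring fillets.'))
# He prepared herring fillets.
print(tag_bigrams(['quick_brown', 'lazy_dogs', 'brown_fox'],
                  'The quick brown fox jumped over the lazy dogs.'))
# The quick_brown fox jumped over the lazy_dogs.
```

Note the second result: once 'quick brown' has become 'quick_brown', the pattern for 'brown fox' no longer matches, because the underscore counts as a word character and so there is no \b in front of 'brown'. Whether overlapping bigrams should chain or not is a question for the job specification.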

Cheers,
John

Dec 9 '07 #3

charonzen

> Another suggestion is to ensure that the job specification is not
> overly simplified. How did you parse the text into "words" in the
> prior exercise that produced the list of bigrams? Won't you need to
> use the same parsing method in the current exercise of tagging the
> bigrams with an underscore?
>
> Cheers,
> John

Thank you John, that definitely puts things in perspective! I'm very
new to both Python and text parsing, and I often feel that I can't see
the forest for the trees. If you're asking, I'm working on a project
that utilizes Church's mutual information score. I tokenize my text,
split it into a list, derive some unigram and bigram dictionaries, and
then calculate a PMI dictionary based on x,y from the bigrams and
unigrams. The bigrams that pass my threshold then get put into my
list of x_y strings, and you know the rest. By modifying the original
text file, I can view 'x_y', z pairs as x,y and iterate until I
have some collocations that are worth playing with. So I think that
covers the question of using the same parsing method. I'm sure there
are more Pythonic ways to do it, but I'm on deadline :)
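
For readers following along, the pipeline described above might look roughly like the sketch below. It assumes Python 3, simple relative-frequency estimates for the probabilities, and the usual definition PMI(x, y) = log2(P(x, y) / (P(x) P(y))). The function names and the threshold parameter are illustrative, not taken from the actual project.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Return {(x, y): PMI} for every adjacent token pair in `tokens`."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi            # relative frequency of the pair
        p_x = unigrams[x] / n_uni      # relative frequency of each word
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

def select_bigrams(tokens, threshold):
    """Return 'x_y' strings for pairs whose PMI is at least `threshold`."""
    return ['%s_%s' % pair
            for pair, score in pmi_scores(tokens).items()
            if score >= threshold]
```

PMI estimated this way tends to favour rare pairs, so a minimum-count filter alongside the threshold is often worth considering.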

Thanks again!

Brandon

Dec 9 '07 #4

Gabriel Genellina

On Sun, 09 Dec 2007 16:45:53 -0300, charonzen <yo*********@gmail.com>
wrote:

>> [John Machin] Another suggestion is to ensure that the job
>> specification is not overly simplified. How did you parse the text
>> into "words" in the prior exercise that produced the list of
>> bigrams? Won't you need to use the same parsing method in the
>> current exercise of tagging the bigrams with an underscore?
>
> Thank you John, that definitely puts things in perspective! I'm very
> new to both Python and text parsing, and I often feel that I can't see
> the forest for the trees. If you're asking, I'm working on a project
> that utilizes Church's mutual information score. I tokenize my text,
> split it into a list, derive some unigram and bigram dictionaries, and
> then calculate a PMI dictionary based on x,y from the bigrams and
> unigrams. The bigrams that pass my threshold then get put into my
> list of x_y strings, and you know the rest. By modifying the original
> text file, I can view 'x_y', z pairs as x,y and iterate until I
> have some collocations that are worth playing with. So I think that
> covers the question of using the same parsing method. I'm sure there
> are more Pythonic ways to do it, but I'm on deadline :)

Looks like you should work with the list of tokens, collapsing consecutive
elements, not with the original text. Should be easier, and faster because
you don't regenerate the text and tokenize it again and again.
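
One way this suggestion could be sketched, assuming Python 3 and a greedy left-to-right merge (the thread does not specify how overlapping bigrams should be resolved); collapse_bigrams is a hypothetical name:

```python
def collapse_bigrams(tokens, bigrams):
    """Merge adjacent tokens whose pair is in `bigrams` into one 'x_y' token."""
    wanted = set(bigrams)          # set lookup keeps each check cheap
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            candidate = tokens[i] + '_' + tokens[i + 1]
            if candidate in wanted:
                merged.append(candidate)
                i += 2             # both tokens are consumed by the merge
                continue
        merged.append(tokens[i])
        i += 1
    return merged

tokens = 'the quick brown fox jumped over the lazy dogs'.split()
print(collapse_bigrams(tokens, ['quick_brown', 'lazy_dogs', 'brown_fox']))
# ['the', 'quick_brown', 'fox', 'jumped', 'over', 'the', 'lazy_dogs']
```

The returned list can be fed straight back into the counting step for the next iteration, so the text never has to be rebuilt and tokenized again.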

--
Gabriel Genellina

Dec 10 '07 #5
