
How to find all the same words in a text?

I need to find all the same words in a text.
What would be the best idea to do that?
I used string.find but it does not work properly for the words.
Let's suppose I want to find the number 324 in the text

'45 324 45324'

There is only one occurrence of the word 324, but string.find() finds 2
occurrences (in 45324 too).

Must I use regex?
Thanks for help
L.

Feb 10 '07 #1
On Sat, Feb 10, 2007 at 05:29:23AM -0800, Johny wrote:
> I need to find all the same words in a text.
> What would be the best idea to do that?
> I used string.find but it does not work properly for the words.
> Let's suppose I want to find the number 324 in the text
>
> '45 324 45324'
>
> There is only one occurrence of the word 324, but string.find() finds 2
> occurrences (in 45324 too).

>>> '45 324 45324'.split().count('324')
1
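
A small sketch, assuming Python 3, of why the two counts differ: str.count (like str.find) looks for the substring anywhere, while split() first breaks the text into whitespace-separated tokens and then compares whole tokens. The variable names are only for illustration.

```python
text = "45 324 45324"

# Substring search: also matches the "324" inside "45324".
substring_hits = text.count("324")

# Token comparison: split on whitespace, then count exact matches.
word_hits = text.split().count("324")

print(substring_hits, word_hits)  # 2 1
```

This only treats whitespace as a separator, so a token such as "324," would not count as the word 324.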
ciao
marco

--
reply to `python -c "print 'm********@itsuig.ocram'[::-1]"`


Feb 10 '07 #2
On Feb 10, 2:42 pm, Marco Giusti <marco.giu...@gmail.com> wrote:
> On Sat, Feb 10, 2007 at 05:29:23AM -0800, Johny wrote:
> > I need to find all the same words in a text.
> > What would be the best idea to do that?
> > I used string.find but it does not work properly for the words.
> > Let's suppose I want to find the number 324 in the text
> > '45 324 45324'
> > There is only one occurrence of the word 324, but string.find() finds 2
> > occurrences (in 45324 too).
>
> >>> '45 324 45324'.split().count('324')
> 1
>
> ciao

Marco,
Thank you for your help.
It works perfectly, but I forgot to say that I also need to find the
position of each occurrence of the word. Is that possible?
Thanks.
L

Feb 10 '07 #3
ZeD
Johny wrote:
> > > Let's suppose I want to find the number 324 in the text
> > > '45 324 45324'
> > > There is only one occurrence of the word 324, but string.find() finds 2
> > > occurrences (in 45324 too).
> >
> > >>> '45 324 45324'.split().count('324')
> > 1
> >
> > ciao
>
> Marco,
> Thank you for your help.
> It works perfectly, but I forgot to say that I also need to find the
> position of each occurrence of the word. Is that possible?

>>> [i for i, e in enumerate('45 324 45324'.split()) if e == '324']
[1]
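
Note that the result above is a position in the list of tokens, not a character offset in the original string. If the character offset is what is needed, one possible sketch (assuming Python 3 and that "word" means a run of non-whitespace characters; word_positions is a made-up helper name) is:

```python
import re

def word_positions(text, word):
    """Return (token_index, char_offset) pairs for tokens equal to word.

    A token here is any run of non-whitespace characters.
    """
    return [
        (i, m.start())
        for i, m in enumerate(re.finditer(r"\S+", text))
        if m.group() == word
    ]

print(word_positions("45 324 45324", "324"))  # [(1, 3)]
```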
--
Under construction
Feb 10 '07 #4
On Sat, Feb 10, 2007 at 06:00:05AM -0800, Johny wrote:
> On Feb 10, 2:42 pm, Marco Giusti <marco.giu...@gmail.com> wrote:
> > On Sat, Feb 10, 2007 at 05:29:23AM -0800, Johny wrote:
> > > I need to find all the same words in a text.
> > > What would be the best idea to do that?
> > > I used string.find but it does not work properly for the words.
> > > Let's suppose I want to find the number 324 in the text
> > > '45 324 45324'
> > > There is only one occurrence of the word 324, but string.find() finds 2
> > > occurrences (in 45324 too).
> >
> > >>> '45 324 45324'.split().count('324')
> > 1
> >
> > ciao
>
> Marco,
> Thank you for your help.
> It works perfectly, but I forgot to say that I also need to find the
> position of each occurrence of the word. Is that possible?

>>> li = '45 324 45324'.split()
>>> li.index('324')
1

Play with count and index, and take a look at the help for both.
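
Keep in mind that list.index returns only the first match and raises ValueError when there is none. A rough sketch, assuming Python 3, of combining it with its optional start argument to collect every position (all_indexes is an illustrative name, not a built-in):

```python
def all_indexes(items, target):
    """Collect every index at which target occurs in the list items."""
    positions = []
    start = 0
    while True:
        try:
            # The second argument tells index() where to start searching.
            found = items.index(target, start)
        except ValueError:
            # No further occurrence: we are done.
            return positions
        positions.append(found)
        start = found + 1

words = "324 45 324 45324 324".split()
print(all_indexes(words, "324"))  # [0, 2, 4]
```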

ciao
marco

--
reply to `python -c "print 'm********@itsuig.ocram'[::-1]"`


Feb 10 '07 #5
* Johny (10 Feb 2007 05:29:23 -0800)
> I need to find all the same words in a text.
> What would be the best idea to do that?
> I used string.find but it does not work properly for the words.
> Let's suppose I want to find the number 324 in the text
>
> '45 324 45324'
>
> There is only one occurrence of the word 324, but string.find() finds 2
> occurrences (in 45324 too).
>
> Must I use regex?

There are two approaches: one is the "solve once and forget" approach
where you code around this particular problem. Marco showed you one
solution for this.

The other approach would be to realise that your problem is a specific
case of two general problems: partitioning a sequence by a separator
and partitioning a sequence into equivalence classes. The bonus for this
approach is that you will have a /lot/ of problems that can be solved
with either one of these utils or a combination of them.

1>>a = '45 324 45324'
2>>quotient_set(part(a, [' ', ' '], 'sep'), ident)
2: {'324': ['324'], '45': ['45'], '45324': ['45324']}

The latter approach is much more flexible. Just imagine your problem
changes to a string that's separated by newlines (instead of spaces)
and you want to find words that start with the same character (instead
of using equality as the criterion).
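
The part and quotient_set utilities are not shown in this thread, so the following is only a guess at the idea, assuming Python 3: split the text by a separator, then group the pieces by a key function. group_by_key is a hypothetical stand-in, not the actual implementation of those utils.

```python
from collections import defaultdict

def group_by_key(items, key):
    """Group items into lists that share the same key(item) value."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    return dict(groups)

# Newline-separated text, as in the changed problem described above.
text = "45\n324\n45324\n45"
pieces = text.split("\n")

# Equivalence by identity: identical words end up together.
by_identity = group_by_key(pieces, lambda w: w)
# Equivalence by first character (assumes no empty pieces).
by_first_char = group_by_key(pieces, lambda w: w[0])

print(by_identity)    # {'45': ['45', '45'], '324': ['324'], '45324': ['45324']}
print(by_first_char)  # {'4': ['45', '45324', '45'], '3': ['324']}
```

Only the separator and the key function change between the two problems; the grouping code stays the same.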
Thorsten
Feb 10 '07 #6
"Johny" <py****@hope.cz> on 10 Feb 2007 05:29:23 -0800 didst step
forth and proclaim thus:
> I need to find all the same words in a text.
> What would be the best idea to do that?

I make no claims of this being the best approach:

====================
def findOccurances(a_string, word):
    """
    Given a string and a word, returns a pair:
    [0] = count, [1] = list of indexes where word occurs
    """
    import re
    count = 0
    indexes = []
    start = 0  # offset of the remaining text within the original string
    # re.escape keeps characters such as '.' or '*' in word literal
    pattern = re.compile(r'\b%s\b' % re.escape(word), re.I)

    while True:
        match = pattern.search(a_string)
        if not match:
            break
        count += 1
        indexes.append(match.start() + start)
        start += match.end()
        a_string = a_string[match.end():]

    return (count, indexes)
====================

Seems to work for me. No guarantees.

--
Sam Peterson
skpeterson At nospam ucdavis.edu
"if programmers were paid to remove code instead of adding it,
software would be much better" -- unknown
Feb 11 '07 #7
On 2007-02-10, Johny <py****@hope.cz> wrote:
> I need to find all the same words in a text.
> What would be the best idea to do that?
> I used string.find but it does not work properly for the words.
> Let's suppose I want to find the number 324 in the text
>
> '45 324 45324'
>
> There is only one occurrence of the word 324, but string.find() finds 2
> occurrences (in 45324 too).
>
> Must I use regex?
> Thanks for help

The first thing to do is to answer the question: What is a word?

The second thing to do is to design some code that can find
words in strings.

The last thing to do is to search those actual words for the word
you're looking for.
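
To illustrate why the first question matters, here is a rough sketch (assuming Python 3; count_word and the two patterns are just examples) in which the definition of a word is passed in as a regular expression. The same text gives different answers under different definitions:

```python
import re

def count_word(text, word, word_pattern):
    """Count tokens equal to word, where word_pattern defines what a word is."""
    tokens = re.findall(word_pattern, text)  # step 2: find the words
    return tokens.count(word)                # step 3: search those words

text = "324, 45-324 45324"

# Definition 1: a word is a run of non-whitespace characters.
# Tokens are "324,", "45-324", "45324", so nothing equals "324".
print(count_word(text, "324", r"\S+"))  # 0

# Definition 2: a word is a run of letters, digits or underscores.
# Tokens are "324", "45", "324", "45324".
print(count_word(text, "324", r"\w+"))  # 2
```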

--
Neil Cerutti
Feb 11 '07 #8
In order to find all the words in a text, you need to tokenize it first.
The rest is a matter of calling the count method on the list of
tokenized words. For tokenization look here:
http://nltk.sourceforge.net/lite/doc/en/words.html
A little bit of warning: depending on what exactly you need to do, the
seemingly trivial task of tokenizing a text can become quite complex.

Enjoy,

MaŽl

Feb 11 '07 #9
On Feb 11, 5:13 am, Samuel Karl Peterson
<skpeter...@nospam.please.ucdavis.edu> wrote:
> "Johny" <pyt...@hope.cz> on 10 Feb 2007 05:29:23 -0800 didst step
> forth and proclaim thus:
> > I need to find all the same words in a text.
> > What would be the best idea to do that?
>
> I make no claims of this being the best approach:
>
> ====================
> def findOccurances(a_string, word):
>     """
>     Given a string and a word, returns a pair:
>     [0] = count, [1] = list of indexes where word occurs
>     """
>     import re
>     count = 0
>     indexes = []
>     start = 0  # offset of the remaining text within the original string
>     # re.escape keeps characters such as '.' or '*' in word literal
>     pattern = re.compile(r'\b%s\b' % re.escape(word), re.I)
>
>     while True:
>         match = pattern.search(a_string)
>         if not match:
>             break
>         count += 1
>         indexes.append(match.start() + start)
>         start += match.end()
>         a_string = a_string[match.end():]
>
>     return (count, indexes)
> ====================
>
> Seems to work for me. No guarantees.


More concisely:

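import re

# target_string is the text to search; the thread's example is used here
target_string = '45 324 45324'
pattern = re.compile(r'\b324\b')
indices = [match.start() for match in pattern.finditer(target_string)]
print("Indices", indices)
print("Count: ", len(indices))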
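
If the word is not hard-coded, one possible generalisation (a sketch assuming Python 3; find_word and its ignore_case parameter are illustrative names) escapes the word before building the pattern, so that characters like '.' are matched literally:

```python
import re

def find_word(text, word, ignore_case=False):
    """Return the start offsets of whole-word occurrences of word in text."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape prevents regex metacharacters in word from being interpreted.
    pattern = re.compile(r"\b%s\b" % re.escape(word), flags)
    return [m.start() for m in pattern.finditer(text)]

print(find_word("45 324 45324 324", "324"))  # [3, 13]
# Without re.escape, the '.' would also match the 'x' in "axb".
print(find_word("a.b axb a.b", "a.b"))       # [0, 8]
```

Note that \b only matches between a word character and a non-word character, so this works as expected only for words that begin and end with a letter, digit or underscore.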

--
Cheers,
Steven

Feb 11 '07 #10
at*************@gmail.com on 11 Feb 2007 08:16:11 -0800 didst step
forth and proclaim thus:
> More concisely:
>
> import re
>
> pattern = re.compile(r'\b324\b')
> indices = [match.start() for match in pattern.finditer(target_string)]
> print("Indices", indices)
> print("Count: ", len(indices))
>
> --
> Cheers,
> Steven

Thank you, this is educational. I didn't realize that finditer
returned match objects instead of tuples.
--
Sam Peterson
skpeterson At nospam ucdavis.edu
"if programmers were paid to remove code instead of adding it,
software would be much better" -- unknown
Feb 12 '07 #11
