How to find all the same words in a text?

I need to find all the same words in a text.
What would be the best idea to do that?
I used string.find but it does not work properly for the words.
Let's suppose I want to find the number 324 in the text

'45 324 45324'

There is only one occurrence of the word 324, but string.find() finds 2
occurrences (in 45324 too).

Must I use regex?
Thanks for help
L.
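
For reference, the behaviour described above can be reproduced with a short sketch. It assumes the poster was using the plain substring methods `str.find` and `str.count`; the variable names are made up for illustration.

```python
text = '45 324 45324'

# str.count and str.find look for substrings, not whole words,
# so the '324' embedded in '45324' is matched as well.
substring_count = text.count('324')

# Collect every substring offset by restarting the search after each hit.
offsets = []
pos = text.find('324')
while pos != -1:
    offsets.append(pos)
    pos = text.find('324', pos + 1)

print(substring_count)  # 2
print(offsets)          # [3, 9]
```

The second offset, 9, is the match inside '45324' that the poster wants to exclude.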

Feb 10 '07 #1
On Sat, Feb 10, 2007 at 05:29:23AM -0800, Johny wrote:
> I need to find all the same words in a text.
> What would be the best idea to do that?
> I used string.find but it does not work properly for the words.
> Let's suppose I want to find the number 324 in the text
>
> '45 324 45324'
>
> There is only one occurrence of the word 324, but string.find() finds 2
> occurrences (in 45324 too).

>>> '45 324 45324'.split().count('324')
1
>>>
ciao
marco

--
reply to `python -c "print 'm********@itsuig.ocram'[::-1]"`


Feb 10 '07 #2
On Feb 10, 2:42 pm, Marco Giusti <marco.giu...@gmail.com> wrote:
> On Sat, Feb 10, 2007 at 05:29:23AM -0800, Johny wrote:
> > I need to find all the same words in a text.
> > What would be the best idea to do that?
> > I used string.find but it does not work properly for the words.
> > Let's suppose I want to find the number 324 in the text
> > '45 324 45324'
> > There is only one occurrence of the word 324, but string.find() finds 2
> > occurrences (in 45324 too).
> >>> '45 324 45324'.split().count('324')
> 1
> >>>
>
> ciao

Marco,
Thank you for your help.
It works perfectly but I forgot to say that I also need to find the
position of each occurrence of the word. Is that possible?
Thanks.
L

Feb 10 '07 #3
ZeD
Johny wrote:
> > > Let's suppose I want to find the number 324 in the text
> > > '45 324 45324'
> > > There is only one occurrence of the word 324, but string.find() finds 2
> > > occurrences (in 45324 too).
> > >>> '45 324 45324'.split().count('324')
> > 1
> > >>>
> >
> > ciao
> Marco,
> Thank you for your help.
> It works perfectly but I forgot to say that I also need to find the
> position of each occurrence of the word. Is that possible?

>>> [i for i, e in enumerate('45 324 45324'.split()) if e == '324']
[1]
>>>
--
Under construction
Feb 10 '07 #4
On Sat, Feb 10, 2007 at 06:00:05AM -0800, Johny wrote:
> On Feb 10, 2:42 pm, Marco Giusti <marco.giu...@gmail.com> wrote:
> > On Sat, Feb 10, 2007 at 05:29:23AM -0800, Johny wrote:
> > > I need to find all the same words in a text.
> > > What would be the best idea to do that?
> > > I used string.find but it does not work properly for the words.
> > > Let's suppose I want to find the number 324 in the text
> > > '45 324 45324'
> > > There is only one occurrence of the word 324, but string.find() finds 2
> > > occurrences (in 45324 too).
> > >>> '45 324 45324'.split().count('324')
> > 1
> > >>>
> >
> > ciao
> Marco,
> Thank you for your help.
> It works perfectly but I forgot to say that I also need to find the
> position of each occurrence of the word. Is that possible?

>>> li = '45 324 45324'.split()
>>> li.index('324')
1
>>>

Play with count and index, and take a look at the help for both.

ciao
marco
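
Note that `list.index` returns only the first match, so collecting every position means calling it repeatedly with its optional start argument. A minimal sketch of that idea follows; the helper name `all_indexes` and the sample text are made up for illustration and do not come from the thread.

```python
def all_indexes(items, target):
    """Return every index at which target occurs in the list items."""
    found = []
    start = 0
    while True:
        try:
            # list.index(value, start) searches from position start onward
            # and raises ValueError when there is no further match.
            i = items.index(target, start)
        except ValueError:
            return found
        found.append(i)
        start = i + 1

words = '45 324 45324 324'.split()
print(words.count('324'))         # 2
print(all_indexes(words, '324'))  # [1, 3]
```

These are positions in the list of words, not character offsets in the original string.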

--
reply to `python -c "print 'm********@itsuig.ocram'[::-1]"`


Feb 10 '07 #5
* Johny (10 Feb 2007 05:29:23 -0800)
> I need to find all the same words in a text.
> What would be the best idea to do that?
> I used string.find but it does not work properly for the words.
> Let's suppose I want to find the number 324 in the text
>
> '45 324 45324'
>
> There is only one occurrence of the word 324, but string.find() finds 2
> occurrences (in 45324 too).
>
> Must I use regex?

There are two approaches: one is the "solve once and forget" approach
where you code around this particular problem. Marco showed you one
solution for this.

The other approach would be to realise that your problem is a specific
case of two general problems: partitioning a sequence by a separator
and partitioning a sequence into equivalence classes. The bonus for this
approach is that you will have a /lot/ of problems that can be solved
with either one of these utils or a combination of them.

>>> a = '45 324 45324'
>>> quotient_set(part(a, [' ', ' '], 'sep'), ident)
{'324': ['324'], '45': ['45'], '45324': ['45324']}

The latter approach is much more flexible. Just imagine your problem
changes to a string that's separated by newlines (instead of spaces)
and you want to find words that start with the same character (instead
of being the same as criterion).
Thorsten
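
The helpers `part`, `quotient_set` and `ident` are not defined anywhere in the thread, so the following is only a guess at what such utilities might look like. The signatures are assumptions (in particular, the 'sep' mode argument from the session above is left out), and the second example uses invented data to illustrate the newline and first-character variation mentioned in the post.

```python
import re
from collections import defaultdict

def part(seq, separators):
    """Split a string on any of the given separators, dropping empty pieces.

    Hypothetical stand-in for the 'part' helper used in the post.
    """
    pattern = '|'.join(re.escape(s) for s in separators)
    return [piece for piece in re.split(pattern, seq) if piece]

def quotient_set(items, key):
    """Group items into equivalence classes keyed by key(item)."""
    classes = defaultdict(list)
    for item in items:
        classes[key(item)].append(item)
    return dict(classes)

def ident(x):
    """Identity function: items are equivalent only if they are equal."""
    return x

same_word = quotient_set(part('45 324 45324', [' ']), ident)
print(same_word)

# Changed problem: newline-separated words grouped by first character.
by_first = quotient_set(part('apple\navocado\nbanana', ['\n']),
                        lambda w: w[0])
print(by_first)
```

Only the separator list and the key function change between the two calls, which seems to be the flexibility the post is arguing for.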
Feb 10 '07 #6
"Johny" <py****@hope.cz> on 10 Feb 2007 05:29:23 -0800 didst step
forth and proclaim thus:
> I need to find all the same words in a text .
> What would be the best idea to do that?

I make no claims of this being the best approach:

====================
def findOccurances(a_string, word):
    """
    Given a string and a word, returns a double:
    [0] = count [1] = list of indexes where word occurs
    """
    import re
    count = 0
    indexes = []
    start = 0  # offset for successive passes
    pattern = re.compile(r'\b%s\b' % word, re.I)

    while True:
        match = pattern.search(a_string)
        if not match: break
        count += 1
        indexes.append(match.start() + start)
        start += match.end()
        a_string = a_string[match.end():]

    return (count, indexes)
====================

Seems to work for me. No guarantees.

--
Sam Peterson
skpeterson At nospam ucdavis.edu
"if programmers were paid to remove code instead of adding it,
software would be much better" -- unknown
Feb 11 '07 #7
On 2007-02-10, Johny <py****@hope.cz> wrote:
> I need to find all the same words in a text.
> What would be the best idea to do that?
> I used string.find but it does not work properly for the words.
> Let's suppose I want to find the number 324 in the text
>
> '45 324 45324'
>
> There is only one occurrence of the word 324, but string.find() finds 2
> occurrences (in 45324 too).
>
> Must I use regex?
> Thanks for help

The first thing to do is to answer the question: What is a word?

The second thing to do is to design some code that can find
words in strings.

The last thing to do is to search those actual words for the word
you're looking for.

--
Neil Cerutti
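
The three steps above can be sketched roughly as follows. The function name `find_word` and the default definition of a word (a run of letters, digits or underscores, `\w+`) are assumptions chosen for illustration, not something the thread settles on.

```python
import re

def find_word(text, target, word_pattern=r'\w+'):
    """Return the character offsets at which target occurs as a whole word.

    word_pattern is the answer to "what is a word?", expressed as a regex.
    """
    # Step 2: find the words in the string, remembering where each starts.
    tokens = [(m.group(), m.start()) for m in re.finditer(word_pattern, text)]
    # Step 3: search those words for the one we are looking for.
    return [start for word, start in tokens if word == target]

text = '45 324 45324 324-7'
print(find_word(text, '324'))          # [3, 13]
print(find_word(text, '324', r'\S+'))  # [3]
```

The two calls differ only in the definition of a word: with `\w+` the hyphen splits '324-7' into two words, while with `\S+` it stays one word and no longer matches.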
Feb 11 '07 #8
In order to find all the words in a text, you need to tokenize it first.
The rest is a matter of calling the count method on the list of
tokenized words. For tokenization look here:
http://nltk.sourceforge.net/lite/doc/en/words.html
A little bit of warning: depending on what exactly you need to do, the
seemingly trivial task of tokenizing a text can become quite complex.

Enjoy,

Maël
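
As a small illustration of that warning, here is a sketch that uses a plain regular expression as a stand-in tokenizer (not NLTK, whose API is not shown in this thread). The sample sentence is invented.

```python
import re
from collections import Counter

text = 'I saw 324, then 324. Not 45324!'

# Splitting on whitespace leaves punctuation glued to the tokens.
naive = text.split()
# A regex tokenizer that keeps only runs of letters, digits and underscores.
tokens = re.findall(r'\w+', text)

print(naive.count('324'))              # 0
print(tokens.count('324'))             # 2
print(Counter(tokens).most_common(1))  # [('324', 2)]
```

Whether `\w+` is the right tokenizer depends on the text: it would also split contractions and hyphenated words, which may or may not be wanted.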

Neil Cerutti wrote:
> On 2007-02-10, Johny <py****@hope.cz> wrote:
> > I need to find all the same words in a text.
> > What would be the best idea to do that?
> > I used string.find but it does not work properly for the words.
> > Let's suppose I want to find the number 324 in the text
> >
> > '45 324 45324'
> >
> > There is only one occurrence of the word 324, but string.find() finds 2
> > occurrences (in 45324 too).
> >
> > Must I use regex?
> > Thanks for help
>
> The first thing to do is to answer the question: What is a word?
>
> The second thing to do is to design some code that can find
> words in strings.
>
> The last thing to do is to search those actual words for the word
> you're looking for.
Feb 11 '07 #9
On Feb 11, 5:13 am, Samuel Karl Peterson
<skpeter...@nospam.please.ucdavis.edu> wrote:
> "Johny" <pyt...@hope.cz> on 10 Feb 2007 05:29:23 -0800 didst step
> forth and proclaim thus:
> > I need to find all the same words in a text .
> > What would be the best idea to do that?
>
> I make no claims of this being the best approach:
>
> ====================
> def findOccurances(a_string, word):
>     """
>     Given a string and a word, returns a double:
>     [0] = count [1] = list of indexes where word occurs
>     """
>     import re
>     count = 0
>     indexes = []
>     start = 0  # offset for successive passes
>     pattern = re.compile(r'\b%s\b' % word, re.I)
>
>     while True:
>         match = pattern.search(a_string)
>         if not match: break
>         count += 1
>         indexes.append(match.start() + start)
>         start += match.end()
>         a_string = a_string[match.end():]
>
>     return (count, indexes)
> ====================
>
> Seems to work for me. No guarantees.


More concisely:

import re

target_string = '45 324 45324'
pattern = re.compile(r'\b324\b')
indices = [match.start() for match in
           pattern.finditer(target_string)]
print("Indices", indices)
print("Count: ", len(indices))

--
Cheers,
Steven
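
For an arbitrary search word rather than a hard-coded 324, the same idea could be wrapped in a function. This is a sketch only: the name `find_occurrences` and the `ignore_case` parameter are made up here, and `re.escape` is added as a precaution that neither post above uses.

```python
import re

def find_occurrences(text, word, ignore_case=False):
    """Return (count, offsets) for whole-word matches of word in text."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape keeps characters such as '.' or '+' in word from being
    # interpreted as regex syntax.
    pattern = re.compile(r'\b%s\b' % re.escape(word), flags)
    offsets = [m.start() for m in pattern.finditer(text)]
    return len(offsets), offsets

print(find_occurrences('45 324 45324', '324'))  # (1, [3])
print(find_occurrences('Spam spam SPAM', 'spam', ignore_case=True))
```

One caveat: `\b` only marks a boundary between a word character and a non-word character, so this works as intended only when the search word itself starts and ends with a letter, digit or underscore.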

Feb 11 '07 #10
