What is best way to find 3-word groups in text?

I am writing a little script that will improve authors' writing skills by
finding repeated phrases in the text.

The text of a chapter will average about 10,000 words; however, I could
reduce the size of the files if it is better to do so.

So the idea is to search through a string and find repeats of any 3- or 4-word group.

So if the author has repeated the phrase "then I went" 6 times in the text, then this would be found and highlighted.

I am not sure where to start with this :o

Maybe it is best to start by converting the string into an array of all the words?

    $word_list = explode(" ", $text);

But I still don't know the best way to find these repeated 3- or 4-word phrases.

The other thing I want to provide is a list of all the words used (maybe excluding words like "and", "the", "a", etc.) and the number of times each is used.

Any good ideas on how I should proceed?

Nov 4 '09 #1
Maybe use regular expressions?
For example (just to show the general idea):

    // matches groups of 3 or 4 words, each word up to 5 letters long
    "#((?:\b\w{1,5}\b\s+){3,4})#"
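
A pattern like this finds word groups but does not by itself say which groups repeat. One alternative, sketched here as an assumption rather than anything proposed in the thread, is to split the text into words and slide a window of n words across the list, counting each phrase. The function name `count_ngrams` and the sample text are made up for illustration, and the arrow function needs PHP 7.4 or later.

```php
<?php
// Hypothetical helper (not from the thread): count every run of $n consecutive words
// and return only the phrases that occur more than once.
function count_ngrams(string $text, int $n): array
{
    // Lower-case the text, then split on anything that is not a letter,
    // digit or apostrophe, dropping empty pieces.
    $words = preg_split("/[^\p{L}\p{N}']+/u", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    $counts = [];
    // Slide a window of $n words across the list, one word at a time.
    for ($i = 0; $i + $n <= count($words); $i++) {
        $phrase = implode(' ', array_slice($words, $i, $n));
        $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
    }

    // Keep only phrases seen more than once, most frequent first.
    $repeats = array_filter($counts, fn($c) => $c > 1);
    arsort($repeats);
    return $repeats;
}

$text = "Then I went home. Then I went out again, and then I went to bed.";
print_r(count_ngrams($text, 3));
```

Because punctuation is discarded, phrases in this sketch can span sentence boundaries; splitting the text into sentences first would avoid that. Calling the function once with 3 and once with 4 covers both group sizes.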
Nov 4 '09 #2
I guessed it might require regex, but I left the question
open in case there is a method that is less cpu intensive.

Thanks for your example, it will be useful as I am still not all that
good with regex.

What would be the best approach to counting all the different words?
Nov 4 '09 #3
Even if there is, what if the follow-up processes eat up that saved memory/workload/whatever?

Get all the single words into an array.
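
One way to do that, as a rough sketch: `explode(" ", $text)` leaves punctuation attached to words and produces empty strings where there are double spaces, so splitting with `preg_split()` on non-word characters may be more convenient. The sample text below is invented for the example.

```php
<?php
$text = "Mary had a little lamb. She also loved lamb chops, you see!";

// Split on runs of anything that is not a letter, digit or apostrophe,
// so "lamb." and "lamb" end up as the same word. PREG_SPLIT_NO_EMPTY
// drops the empty strings that leading/trailing separators would leave.
$words = preg_split("/[^\p{L}\p{N}']+/u", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

print_r($words);
```

`str_word_count($text, 1)` is a built-in alternative that also returns the words as an array, though it only treats letters, apostrophes and hyphens as word characters by default.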
Nov 4 '09 #4
Thanks for the pointers :)

I will follow them up and get some code down.
Nov 4 '09 #5
I have been playing about with the resulting word list for a while, but I cannot work out how to get the number of times each word occurs in an array.

For example

    $words = "Mary Had A Little Lamb and She LOVED It So much she had a fit and killed the lamb. She also loved lamb chops you see";

First I would do this:

    $words = strtolower($words);
    // ...
    $list = explode(" ", $words);

From here what would you recommend I do to get this:

mary 1
little 1
it 1
so 1
much 1
fit 1
killed 1
also 1
chops 1
you 1
see 1

a 2
had 2
and 2
loved 2

lamb 3
she 3

Any ideas?
Nov 5 '09 #6
array_count_values() (did I mention that searching the manual is the first step?)
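
A minimal sketch of how that might look with the sentence from the question. The stop-word list is a made-up example and the step is optional; the split on non-word characters is an assumption added so that "lamb." is counted together with "lamb".

```php
<?php
$words = "Mary Had A Little Lamb and She LOVED It So much she had a fit and killed the lamb. She also loved lamb chops you see";

// Lower-case, then split on non-word characters so "lamb." counts as "lamb".
$list = preg_split("/[^\p{L}\p{N}']+/u", strtolower($words), -1, PREG_SPLIT_NO_EMPTY);

// Hypothetical stop-word list (optional step); extend it to suit the text.
$stop_words = ['a', 'and', 'the'];
$list = array_diff($list, $stop_words);

// array_count_values() returns an array of word => number of occurrences.
$counts = array_count_values($list);

// Sort by count, highest first, keeping the word keys.
arsort($counts);

foreach ($counts as $word => $count) {
    echo $word, ' ', $count, "\n";
}
```

Dropping the `array_diff()` line keeps the common words in the list, with their counts.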
Nov 5 '09 #7