
What is best way to find 3-word groups in text?

Familiar Sight
 
Join Date: Jan 2009
Posts: 165
#1: 3 Weeks Ago
I am writing a little script that will improve authors' writing skills by
finding repeated phrases in the text.

The text of a chapter will average about 10,000 words; however, I could
reduce the size of the files if it is better to do so.

So the idea is to search through a string and find repeats of any 3 or 4 word group.

So if the author has repeated the phrase "then I went" 6 times in the text, then this would be found and highlighted.

I am not sure where to start with this :o

Maybe it is best to start by converting the string into an array of all the words?
  $word_list = explode(" ", $text);
But I still don't know what the best way to find these repeated 3 or 4 word phrases is.

The other thing I want to provide is a list of all the words used (maybe I will exclude words like "and", "the", "a", etc.) and the number of times each is used.

Any good ideas on how I should proceed?

Thanks



Dormilich
Moderator
 
Join Date: Aug 2008
Location: Leipzig, Germany
Posts: 3,648
#2: 3 Weeks Ago

re: What is best way to find 3-word groups in text?


maybe using regular expressions?
like (to show the general idea)
  // matches 3 or 4 word groups up to 5 letters per word
  "#((?:\b\w{1,5}\b\s+){3,4})#"
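
To count how often each phrase repeats, one option is to swap the regular expression for a sliding window over the word array. The following is only a minimal sketch of that alternative, not the method suggested above: the helper name `count_phrases` and the sample sentence are made up for illustration, and it assumes PHP 7.4 or later and plain ASCII text.

```php
<?php
// Hypothetical helper: count every run of $n consecutive words in $text
// and return only the phrases that occur more than once.
function count_phrases(string $text, int $n = 3): array
{
    // Lowercase, then extract words (letters, digits, apostrophes) so that
    // punctuation such as "home." does not end up inside a token.
    preg_match_all("/[a-z0-9']+/", strtolower($text), $m);
    $words = $m[0];

    $counts = [];
    // Slide a window of $n words across the list, one word at a time.
    for ($i = 0, $last = count($words) - $n; $i <= $last; $i++) {
        $phrase = implode(' ', array_slice($words, $i, $n));
        $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
    }

    // Keep only phrases seen more than once, most frequent first.
    $repeats = array_filter($counts, fn($c) => $c > 1);
    arsort($repeats);
    return $repeats;
}

// Made-up sample text for the sketch.
$text = "Then I went home. Then I went out again, and then I went to bed.";
print_r(count_phrases($text, 3));
```

Calling it again with `4` as the second argument covers the 4-word case. Note that the window ignores sentence boundaries, so a phrase such as "went home then" is counted as well; splitting the text into sentences first would avoid that.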
Familiar Sight
 
Join Date: Jan 2009
Posts: 165
#3: 3 Weeks Ago

re: What is best way to find 3-word groups in text?


Yep,
I guessed it might require regex, but I left the question
open in case there is a method that is less cpu intensive.

Thanks for your example, it will be useful as I am still not all that
good with regex.

What would be the best approach to count up all the different words?
Dormilich
Moderator
 
Join Date: Aug 2008
Location: Leipzig, Germany
Posts: 3,648
#4: 3 Weeks Ago

re: What is best way to find 3-word groups in text?


Quote:

Originally Posted by jeddiki

Yep,
I guessed it might require regex, but I left the question
open in case there is a method that is less cpu intensive.

even if there is, what if the follow-up processes eat up that saved memory/workload/whatever?

Quote:

Originally Posted by jeddiki

What would be the best approach to count up all the different words?

get all single words into an array
(lowercase)
array_unique()
count()
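
The four steps above could look roughly like this. It is a sketch only: the sample sentence is made up, and `str_word_count()` is one possible way to get the words into an array (it is locale-dependent and works best on plain English text).

```php
<?php
// Made-up sample text for the sketch.
$text = "Mary had a little lamb and she loved it so much she had a fit";

// Lowercase first, then get all single words into an array.
$words = str_word_count(strtolower($text), 1);

// Drop duplicate words and count what is left.
$distinct = count(array_unique($words));

echo $distinct, "\n"; // number of different words in $text
```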
Familiar Sight
 
Join Date: Jan 2009
Posts: 165
#5: 3 Weeks Ago

re: What is best way to find 3-word groups in text?


Thanks for the pointers :)

I will follow them up and get some code down.
Familiar Sight
 
Join Date: Jan 2009
Posts: 165
#6: 2 Weeks Ago

re: What is best way to find 3-word groups in text?


Hi,

I have been playing about with the resulting word list for a while, but I cannot work out how to get the number of times each word occurs in an array.

For example

  $words = "Mary Had A Little Lamb and She LOVED It So much she had a fit and killed the lamb. She also loved lamb chops you see";
First I would do this:

  $words = strtolower($words);
  ...
  $list = explode(" ", $words);
From here what would you recommend I do to get this:

mary 1
little 1
it 1
so 1
much 1
fit 1
killed 1
the 1
also 1
chops 1
you 1
see 1

a 2
had 2
and 2
loved 2

lamb 3
she 3

Any ideas?
Dormilich
Moderator
 
Join Date: Aug 2008
Location: Leipzig, Germany
Posts: 3,648
#7: 2 Weeks Ago

re: What is best way to find 3-word groups in text?


array_count_values() (did I mention that searching the manual is the first step?)
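
A minimal sketch of how `array_count_values()` might be applied to the sample sentence from post #6. Two things here are assumptions rather than part of the thread: `str_word_count()` is used instead of `explode()` so that "lamb." and "lamb" are counted as the same word, and the stop-word list is purely illustrative.

```php
<?php
$words = "Mary Had A Little Lamb and She LOVED It So much she had a fit and killed the lamb. She also loved lamb chops you see";

// Lowercase the text and split it into words; str_word_count(..., 1)
// returns an array of words and leaves out the full stop.
$list = str_word_count(strtolower($words), 1);

// Map each word to the number of times it occurs.
$counts = array_count_values($list);

// Sort by count, lowest first, keeping the word => count association.
asort($counts);

// Optional: drop common words. This stop list is only an example.
$stop_words = ['and', 'the', 'a'];
$filtered = array_diff_key($counts, array_flip($stop_words));

foreach ($counts as $word => $n) {
    echo $word, ' ', $n, "\n";
}
```

Use `arsort()` instead of `asort()` to list the most frequent words first, and loop over `$filtered` rather than `$counts` to leave out the stop words.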