By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
428,816 Members | 2,151 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 428,816 IT Pros & Developers. It's quick & easy.

text analysis repeating phrases

P: 2
I am trying to figure out how, given a text file, you can determine if there are repeating phrases of any given length. I can get the word count, line count and other statistical information but no idea where to even start for trying to find repeating phrases. Any ideas? It does not have to be in C++, I just need some help with picking an algorithm. Looked around on google but nothing showed up. Thought you guys might be able to help. Thanks.
Jun 17 '07 #1
Share this Question
Share on Google+
2 Replies


Expert 10K+
P: 11,448
I am trying to figure out how, given a text file, you can determine if there are repeating phrases of any given length. I can get the word count, line count and other statistical information but no idea where to even start for trying to find repeating phrases. Any ideas? It does not have to be in C++, I just need some help with picking an algorithm. Looked around on google but nothing showed up. Thought you guys might be able to help. Thanks.
Do these repeating phrases have to be adjacent or can they occur anywhere in
the entire string? Finding a pattern/phrase in a string can be done in O(n) where
n is the length of the pattern. Checking all possible patterns in a string takes
O(n^3) for non-adjacent strings (O(n) for finding one phrase and O(n^2) for checking
all the phrases).

If the phrases are adjacent, this number goes down to O(n^2).

kind regards,

Jos
Jun 17 '07 #2

P: 2
Do these repeating phrases have to be adjacent or can they occur anywhere in
the entire string? Finding a pattern/phrase in a string can be done in O(n) where
n is the length of the pattern. Checking all possible patterns in a string takes
O(n^3) for non-adjacent strings (O(n) for finding one phrase and O(n^2) for checking
all the phrases).

If the phrases are adjacent, this number goes down to O(n^2).

kind regards,

Jos

I think the answer is non adjacent phrases. Let me explain with an example.

the fat cat jumped over the fat cat

Result would be:

the
the fat
the fat cat

I have been looking at regex library trying to find something that could make this task easier but no go. My attempts to work it out by hand have failed. My first attempt was:

take one word, search rest of string for match and if a match is found, search again for word + 1 == match + 1. You loop and stop when when there is no match for word + 1 == match + 1. What is left over is the longest repeating phrase starting at first word.

This process, however, needs to be done with every word in the sentence so it is not efficient at all

My second attempt was to do it incrementally but I got confused and quit L

Do you know of a better approach? Optimizations to the described approach would be useful (using regex to search will probably guarantee me the best search algorithm so I do not have to worry about search optimization). Regards, Max
Jun 17 '07 #3

Post your reply

Sign in to post your reply or Sign up for a free account.