
Need to mark similar phrases in two different texts

P: n/a
Hello!
I need to mark similar phrases in two different texts, for example by
using a <b> tag.

Example:

text 1:
Google Chrome is a browser that combines a minimal design with
sophisticated technology to make the web faster, safer, and easier.

text 2:
Hematology Analyzers Simple, Sophisticated Technology Serving All
Patients - Clinical Diagnostics Technology Spotlight - Medcompare.

After comparing the following should be shown:
Google Chrome is a browser that combines a minimal design with
<b>sophisticated technology</b> to make the web faster, safer, and
easier.

Hematology Analyzers Simple, <b>Sophisticated Technology</b> Serving
All Patients - Clinical Diagnostics Technology Spotlight - Medcompare.

Because "sophisticated technology" is repeated. But unfortunately I
don't know how to do it. Can you help me?
Sep 7 '08 #1
9 Replies


P: n/a
SuperNova wrote:
Hello!
I need to mark similar phrases in two different texts, for example by
using a <b> tag.

Example:

text 1:
Google Chrome is a browser that combines a minimal design with
sophisticated technology to make the web faster, safer, and easier.

text 2:
Hematology Analyzers Simple, Sophisticated Technology Serving All
Patients - Clinical Diagnostics Technology Spotlight - Medcompare.

After comparing the following should be shown:
Google Chrome is a browser that combines a minimal design with
<b>sophisticated technology</b> to make the web faster, safer, and
easier.

Hematology Analyzers Simple, <b>Sophisticated Technology</b> Serving
All Patients - Clinical Diagnostics Technology Spotlight - Medcompare.

Because "sophisticated technology" is repeated. But unfortunately I
don't know how to do it. Can you help me?
That's not quite enough to go on for effectively finding matches. It
would be trivial if you had a pre-determined list of phrases, or you
used a query from the user.

However, as you have it now, and since the phrase could be anything,
you'd end up making bold useless things like indefinite/definite
articles, prepositions, pronouns, etc.
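
To illustrate the "pre-determined list" case mentioned above, here is a
minimal sketch. The function name highlight_phrases and the phrase list
are made up for this example, and it assumes PHP 7.4 or later for the
arrow functions.

```php
<?php
// Hypothetical helper: wrap each known phrase in <b> tags, case-insensitively.
function highlight_phrases(string $text, array $phrases): string
{
    if (!$phrases) {
        return $text; // nothing to look for
    }
    // Longest phrases first, so a long phrase wins over a shorter one it contains.
    usort($phrases, fn($a, $b) => strlen($b) <=> strlen($a));
    // Escape regex metacharacters in every phrase before building the pattern.
    $alternatives = array_map(fn($p) => preg_quote($p, '/'), $phrases);
    $pattern = '/\b(' . implode('|', $alternatives) . ')\b/i';
    return preg_replace($pattern, '<b>$1</b>', $text);
}

$phrases = ['sophisticated technology'];
echo highlight_phrases('Simple, Sophisticated Technology Serving All', $phrases), "\n";
```

The hard part of the original question is producing that phrase list
automatically; this sketch only covers the step after that.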

--
Curtis
Sep 7 '08 #2

P: n/a
SuperNova wrote:
I need to mark similar phrases in two different texts, for example by
using a <b> tag.
Why do you want this?

This may work:
1) Make a list of words in each text.
2) Compute the intersection of these lists, so that the result is a list
with words which are present in both texts.
3) Filter this list to avoid common words such as 'it' and 'a'.
4) Mark all the words in the list bold in the texts.

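Something like this:

<?php
$text1 = 'Google Chrome[...]';
$text2 = 'Hematology Analyzers[...]';

// We don't want case sensitivity
$lower1 = strtolower($text1);
$lower2 = strtolower($text2);

// Array of words: split on runs of non-word characters, drop empty strings
$array1 = preg_split('/\W+/', $lower1, -1, PREG_SPLIT_NO_EMPTY);
$array2 = preg_split('/\W+/', $lower2, -1, PREG_SPLIT_NO_EMPTY);

// Intersect; array_unique so that each shared word is processed only once
$intersect = array_unique(array_intersect($array1, $array2));

// Filter out common words (extend this list as needed)
$filter = array('a', 'it');
$filtered = array_diff($intersect, $filter);

// Make bold; \b restricts the match to whole words,
// preg_quote escapes any regex metacharacters in the word
foreach ($filtered as $word) {
    $pattern = '/\b(' . preg_quote($word, '/') . ')\b/i';
    $text1 = preg_replace($pattern, '<b>$1</b>', $text1);
    $text2 = preg_replace($pattern, '<b>$1</b>', $text2);
}

echo $text1;
echo $text2;
?>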
Sep 7 '08 #3

P: n/a
Why do you want this?

This may work:
1) Make a list of words in each text.
2) Compute the intersection of these lists, so that the result is a list
with words which are present in both texts.
3) Filter this list to avoid common words such as 'it' and 'a'.
4) Mark all the words in the list bold in the texts.
Thank you for the code sample. It's a good thing to think about. But I
need to mark similar phrases, 2 or more words one after another. Your
code marks all the similar words, but I need to mark only 2 or more
words one after another.
Sep 7 '08 #4

P: n/a
SuperNova wrote:
>Why do you want this?

This may work:
1) Make a list of words in each text.
2) Compute the intersection of these lists, so that the result is a list
with words which are present in both texts.
3) Filter this list to avoid common words such as 'it' and 'a'.
4) Mark all the words in the list bold in the texts.

Thank you for the code sample. It's a good thing to think about. But I
need to mark similar phrases, 2 or more words one after another. Your
code marks all the similar words, but I need to mark only 2 or more
words one after another.
Then you can 'unmark' a word if it is only a single hit, i.e. it has no
marked neighbour.

This will leave marked only the words with 2 or more consecutive hits.

(Or am I missing something?)
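
A rough sketch of that idea, marking only runs of two or more
consecutive shared words. The function name mark_runs and the default
stop-word list are assumptions made for this example. One caveat: a run
of shared words in one text is not guaranteed to appear in the same
order in the other text, so this can over-mark.

```php
<?php
// Hypothetical helper: bold runs of 2+ consecutive words that also occur in $other.
function mark_runs(string $text, string $other, array $stopwords = ['a', 'it']): string
{
    // Lower-cased word set of the other text, minus stop words.
    $otherWords = preg_split('/\W+/', strtolower($other), -1, PREG_SPLIT_NO_EMPTY);
    $shared = array_flip(array_diff($otherWords, $stopwords));

    // Keep the separators: even indexes hold words, odd indexes hold separators.
    $parts = preg_split('/(\W+)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    $n = count($parts);
    $runStart = 0;
    $runLen = 0;
    // Going one step past the last word flushes a run that ends the text.
    for ($i = 0; $i <= $n + 1; $i += 2) {
        $hit = $i < $n && $parts[$i] !== '' && isset($shared[strtolower($parts[$i])]);
        if ($hit) {
            if ($runLen === 0) {
                $runStart = $i;
            }
            $runLen++;
            continue;
        }
        if ($runLen >= 2) {
            // "Unmarking" single hits simply means never marking them.
            $last = $runStart + 2 * ($runLen - 1);
            $parts[$runStart] = '<b>' . $parts[$runStart];
            $parts[$last] .= '</b>';
        }
        $runLen = 0;
    }
    return implode('', $parts);
}

echo mark_runs('with sophisticated technology to make', 'Simple, Sophisticated Technology Serving'), "\n";
// prints: with <b>sophisticated technology</b> to make
```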

--
Luuk
Sep 7 '08 #5

P: n/a
SuperNova wrote:
Thank you for the code sample. It's a good thing to think about. But I
need to mark similar phrases, 2 or more words one after another. Your
code marks all the similar words, but I need to mark only 2 or more
words one after another.
I am sure you can figure out how to make my example work with two words.
Although my previous post was elaborate and even included a working
example, I have no intention of writing code for you to solve your problem.
Sep 7 '08 #6

P: n/a
On Sep 8, 12:16 am, Sjoerd <sjoer...@gmail.com> wrote:
I have no intention of writing code for you to solve your problem.
I don't need code, I need an algorithm. The only thing I can think of
is to split the texts into arrays of words and compare them word by
word. If two words match, the following words are compared as well, and
if they match too, the phrase is marked. But I hoped there was a faster
algorithm.

Sep 8 '08 #7

P: n/a
On Sep 8, 5:55 am, SuperNova <SerafimPa...@gmail.com> wrote:
On Sep 8, 12:16 am, Sjoerd <sjoer...@gmail.com> wrote:
I have no intention of writing code for you to solve your problem.

I don't need code, I need an algorithm. The only thing I can think of
is to split the texts into arrays of words and compare them word by
word. If two words match, the following words are compared as well, and
if they match too, the phrase is marked. But I hoped there was a faster
algorithm.
You are probably looking for something along the lines of a dictionary
coder, the technique used in some compression algorithms. See
http://en.wikipedia.org/wiki/Dictionary_coder for how it works.
Instead of looking for characters, you will be looking for words.
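
This is not a real dictionary coder, only a loose sketch of the
"dictionary of words" idea: index every pair of adjacent words from one
text in a hash table, then look up the pairs of the other text. The
names split_words and shared_pairs are invented for this example.

```php
<?php
// Hypothetical helper: lower-cased words of a text.
function split_words(string $text): array
{
    return preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Hypothetical helper: all word pairs that occur in both texts.
function shared_pairs(string $a, string $b): array
{
    // Dictionary of every adjacent word pair in $b.
    $wb = split_words($b);
    $dict = [];
    for ($i = 0; $i + 1 < count($wb); $i++) {
        $dict[$wb[$i] . ' ' . $wb[$i + 1]] = true;
    }

    // One hash lookup per adjacent word pair in $a.
    $wa = split_words($a);
    $found = [];
    for ($i = 0; $i + 1 < count($wa); $i++) {
        $pair = $wa[$i] . ' ' . $wa[$i + 1];
        if (isset($dict[$pair])) {
            $found[$pair] = true; // used as a set, so duplicates collapse
        }
    }
    return array_keys($found);
}

print_r(shared_pairs(
    'a minimal design with sophisticated technology to make',
    'Simple, Sophisticated Technology Serving All'
));
```

Because each pair costs one hash lookup, the work grows roughly with the
combined length of the two texts rather than with their product. Longer
phrases show up as chains of overlapping pairs, which would still need
to be merged.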

Bill H
Sep 8 '08 #8

P: n/a
"SuperNova" <Se**********@gmail.com> wrote in message
news:66**********************************@d77g2000hsb.googlegroups.com...
On Sep 8, 12:16 am, Sjoerd <sjoer...@gmail.com> wrote:
>I have no intention of writing code for you to solve your problem.

I don't need code, I need an algorithm. The only thing I can think of
is to split the texts into arrays of words and compare them word by
word. If two words match, the following words are compared as well, and
if they match too, the phrase is marked. But I hoped there was a faster
algorithm.

Start by selecting two adjacent words in one sentence. Copy them and
search for them in the other sentence. If you don't find a match, advance
the word pointer by one, select the second and third words, and repeat
until you've reached the last two words (i.e. the pointer is at the
next-to-last word).

Every time you do find a match, try extending it to a longer match until
that fails. Highlight. Then advance the outer pointer not by one word, but
by the number of words found.

Add some boundary checking so that you don't fall off the end of a piece
of text.
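
The steps above could be sketched roughly as follows. Note that this
version compares arrays of words instead of using strpos or strstr, and
that the function name common_phrases and the $minWords parameter are
assumptions for this example.

```php
<?php
// Hypothetical helper: phrases of $minWords or more words found in both texts.
function common_phrases(string $a, string $b, int $minWords = 2): array
{
    $wa = preg_split('/\W+/', strtolower($a), -1, PREG_SPLIT_NO_EMPTY);
    $wb = preg_split('/\W+/', strtolower($b), -1, PREG_SPLIT_NO_EMPTY);
    $na = count($wa);
    $nb = count($wb);
    $phrases = [];
    $i = 0; // outer word pointer into the first text
    while ($i + $minWords <= $na) { // stop once fewer than $minWords words remain
        // Longest run starting at word $i that matches anywhere in the second text.
        $best = 0;
        for ($j = 0; $j < $nb; $j++) {
            $k = 0;
            // Boundary checks keep both indexes inside their arrays.
            while ($i + $k < $na && $j + $k < $nb && $wa[$i + $k] === $wb[$j + $k]) {
                $k++;
            }
            $best = max($best, $k);
        }
        if ($best >= $minWords) {
            $phrases[] = implode(' ', array_slice($wa, $i, $best));
            $i += $best; // skip the words that were just matched
        } else {
            $i++; // no phrase starts here, move on by one word
        }
    }
    return $phrases;
}

print_r(common_phrases(
    'combines a minimal design with sophisticated technology to make the web faster',
    'Hematology Analyzers Simple, Sophisticated Technology Serving All Patients'
));
```

The inner loops make this roughly quadratic in the number of words,
which is probably acceptable for short texts like the ones in the
example.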

Make sure you invest some time in selecting the fastest code for this job.
You will probably want to use strpos or strstr, depending on how you're
going to code this. strstr allows for some shortcuts, but perhaps a
solution using strpos is faster.

You may need to tweak this algorithm so that it finds more matches, which
may even be longer.

A: If some text starts with abc, then ...
B: if some text contains something else but a substring of some text starts
with abc, then ...

What do you highlight? "some text" and "starts with abc, then...", or "some
text starts with abc, then ..." or both? (better examples will exist, but
you probably got the point)
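
To make the difference concrete: a left-to-right greedy scan of A would
report "if some text" and "starts with abc then", while a search for the
single longest shared run of words reports the six-word phrase instead.
A minimal sketch of the latter, using a standard dynamic-programming
scan over words (the function name longest_common_phrase is made up for
this example):

```php
<?php
// Hypothetical helper: the single longest run of words shared by both texts.
function longest_common_phrase(string $a, string $b): string
{
    $wa = preg_split('/\W+/', strtolower($a), -1, PREG_SPLIT_NO_EMPTY);
    $wb = preg_split('/\W+/', strtolower($b), -1, PREG_SPLIT_NO_EMPTY);
    // $prev[$j + 1] = length of the shared run ending at the previous word
    // of $wa and at word $j of $wb.
    $prev = array_fill(0, count($wb) + 1, 0);
    $bestLen = 0;
    $bestEnd = 0; // index in $wa just past the best run
    foreach ($wa as $i => $word) {
        $curr = array_fill(0, count($wb) + 1, 0);
        foreach ($wb as $j => $other) {
            if ($word === $other) {
                // Extend the run that ended one word earlier in both texts.
                $curr[$j + 1] = $prev[$j] + 1;
                if ($curr[$j + 1] > $bestLen) {
                    $bestLen = $curr[$j + 1];
                    $bestEnd = $i + 1;
                }
            }
        }
        $prev = $curr;
    }
    return implode(' ', array_slice($wa, $bestEnd - $bestLen, $bestLen));
}

echo longest_common_phrase(
    'If some text starts with abc, then ...',
    'if some text contains something else but a substring of some text starts with abc, then ...'
), "\n";
// prints: some text starts with abc then
```

Which of the two results is the "right" highlight depends on what the
marking is for, so this is a design decision rather than a bug in either
approach.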

Sep 9 '08 #9

P: n/a
On Sep 9, 6:37 am, "mijn naam" <whate...@hotmail.invalid> wrote:
"SuperNova" <SerafimPa...@gmail.com> wrote in message news:66**********************************@d77g2000hsb.googlegroups.com...
Thanks Bill and Mijn for helping. Your ideas are good, and I think they
will help me.

Thanks!
Sep 9 '08 #10
