Bytes IT Community

Match exact word/phrase

P: 79
Hi,

I have a phrase like this: "Rna binding proteins", and I want to match this exact phrase. I have written code like this:

```perl
$sentence = "Overall, the participation of additional RNA binding proteins in controlling beta-F1-ATPase expression.";

$word = "RNA binding proteins";

if ($sentence =~ /\b$word\b/)
{
    print "matched";
    $sentence =~ s/(\b$word\b)/<spanstyle="background-color:#E1FF77">$1<\/span>/i;

    print "<br> sentence=$sentence<br>";
}
```
The phrase "Rna binding proteins" should then be highlighted in the sentence.

I have a problem: many sentences contain this same phrase, but it is not getting matched or highlighted!

How can I write the code so that it also picks up sentences with variants such as "Rna-binding protein" and "Rna binding protein"?

I don't understand why the sentences with "Rna binding proteins" are not being retrieved.

with regards
Archana
Aug 11 '08 #1
12 Replies


nithinpes
Expert 100+
P: 410
You can try this:
```perl
$word = "RNA[ -]binding proteins?";

if ($sentence =~ /\b$word\b/i)
{
    print "matched";
    ########
}
```
- The words RNA and binding may be separated by a space or a hyphen, so both are placed inside a character class, which matches exactly one of those characters.
- Also, from your description, the phrase may contain 'protein' or 'proteins'. The '?' matches zero or one occurrence of the character preceding it ('s' in this case).
- The /i option makes the pattern match case-insensitively (both RNA and Rna will match). You can also use the /g option if you want to find multiple occurrences within a line.
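
To make the three points above concrete, here is a small sketch that runs the same pattern against a few variants. The sub name has_phrase and the test strings are made up for illustration; only the pattern itself comes from the reply.

```perl
use strict;
use warnings;

# Pattern from the reply above, precompiled with qr// and the /i flag.
my $phrase = qr/\bRNA[ -]binding proteins?\b/i;

# Hypothetical helper: returns 1 if the sentence contains an accepted variant, else 0.
sub has_phrase {
    my ($sentence) = @_;
    return $sentence =~ $phrase ? 1 : 0;
}

for my $s ('Rna-binding protein', 'RNA binding proteins', 'RNA binding domain') {
    printf "%-22s => %s\n", $s, has_phrase($s) ? 'match' : 'no match';
}
```

Note that because of the leading \b, this pattern does not match inside "mRNA binding proteins"; that case needs an explicit optional prefix.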
Aug 11 '08 #2

KevinADC
Expert 2.5K+
P: 4,059
Also, there is an error in the HTML code you posted:

```
<spanstyle=
```
should be:

```
<span style=
```
Aug 11 '08 #3

P: 79
Hi,

Yes, I corrected that, but the result is still the same!

These are the two sentences:

```
Overall, the participation of additional RNA binding proteins in controlling beta-F1-ATPase expression and therefore, in defining the bioenergetic signature of the cancer cell is expected.

RNA chaperones are non-specific RNA binding proteins that help RNA folding by resolving misfolded structures or preventing their formation.
```
Only the "RNA binding proteins" in the second sentence is matched and highlighted; the first sentence is not matched.

Why is the phrase not matched?

With regards

Archana
Aug 11 '08 #4

KevinADC
Expert 2.5K+
P: 4,059
Are you reading lines from a file or are all the sentences one long string, or what?
Aug 11 '08 #5

Kelicula
Expert 100+
P: 176
Without the /g modifier, s/// replaces only the first match it finds in the string and then stops.

You need the "match global" modifier: add a "g" alongside your case-insensitive modifier (i).

```perl
$sentence = "Overall, the participation of additional RNA binding proteins in controlling beta-F1-ATPase expression.";
$word = "RNA binding proteins";

# /i on the test as well, so it agrees with the case-insensitive substitution
if ($sentence =~ /\b$word\b/i)
{
    print "matched";
    $sentence =~ s/(\b$word\b)/<span style="background-color:#E1FF77">$1<\/span>/ig;
}
print "<br> sentence=$sentence<br>";
```
That should do it.
You may need a loop.
For instance, to remove (nested (even deeply nested (like this))) remarks, you could use:

```perl
1 while s/\([^()]*\)//g; # This works on $_
```
That's right out of the "Programming Perl" book.
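
To see why the loop is needed there, here is a small sketch of the same idiom applied to a named variable instead of $_. The sub name strip_remarks and the sample string are made up for illustration. Each pass can only remove parentheses that contain no other parentheses, so the innermost remarks go first and the loop repeats until nothing is left to remove.

```perl
use strict;
use warnings;

# Hypothetical helper: strip parenthesised remarks, innermost first.
sub strip_remarks {
    my ($text) = @_;
    # s///g returns false once a pass removes nothing, which ends the loop.
    1 while $text =~ s/\([^()]*\)//g;
    return $text;
}

print strip_remarks('keep (drop (and (this too))) this'), "\n";
```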

Or, in your case, a loop along these lines:

```perl
while ($sentence =~ s/(?<!">)(\b$word\b)/<span style="background-color:#E1FF77">$1<\/span>/i) {
    print "matched";
}
print "<br> sentence=$sentence<br>";
```
Be careful with this form: the replacement text still contains the phrase, so a bare while loop around s/// would keep re-matching the text it has just wrapped and never terminate. The (?<!">) lookbehind skips any occurrence that already sits directly after the opening span tag. You also wouldn't need the first if statement. The loop will continue to match and substitute as long as it can, and if it can't right from the start, that's OK.

So the final result would be:

```perl
$sentence = "Overall, the participation of additional RNA binding proteins in controlling beta-F1-ATPase expression.";
$word = "RNA binding proteins";

while ($sentence =~ s/(?<!">)(\b$word\b)/<span style="background-color:#E1FF77">$1<\/span>/i) {
    print "matched";
}
print "<br> sentence=$sentence<br>";
```

Hope it helps!
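
A side note on the /g advice in this reply: in Perl, s///g returns the number of substitutions it made, so the return value can stand in for a separate if test. A minimal sketch follows; the sub name highlight_all and the sample strings are made up for illustration, and \Q...\E is added here on the assumption that the phrase should be treated as literal text rather than as a pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap every occurrence of $phrase in a highlight span.
# Returns the modified text and the number of substitutions made.
sub highlight_all {
    my ($text, $phrase) = @_;
    my $count = ($text =~ s/(\b\Q$phrase\E\b)/<span style="background-color:#E1FF77">$1<\/span>/gi);
    return ($text, $count || 0);    # s/// returns an empty string when nothing matched
}

my ($out, $n) = highlight_all('RNA binding proteins and rna binding proteins', 'RNA binding proteins');
print "$n substitution(s)\n$out\n";
```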
Aug 11 '08 #6

P: 79
Hi,

I tried it, but it didn't work!

I have paragraphs in a file. I split each paragraph into sentences and then match the word against each sentence.

Here is the code.
```perl
open(FH,"param.txt")|| die "cannot open file\n";

while(<FH>)
{
    $content.=$_;
    if($_=~/PMID:(.*)/)
    {
        $content_all{$_}=$content;
        $content="";
        @sentences=split("\n\n",$content_all{$_});
        for($i=0;$i<=$#sentences;$i++)
        {
            print "<br> ***** $sentences[$i] ---> $i <br>";
        }
        &subpassparam($content_all{$_});
    }
}
sub subpassparam
{
    $abs=$_[0];
    #print "<br>abstract passed = $abs <br>";
    @sentences=split("\n\n",$content_all{$_});
    @list=split /[.]\s+\W*/, $sentences[4];
    foreach(@list)
    {
        #print "<br>sentences = $_ <br>";
        if($_=~/\b$word\b/)
        {
            print "<br>yes its matching <br>";
            $_=~s/(\b$word\b)/<span style="background-color:#E1FF77">$1<\/span>/img;
            print "<br>sentences only at list = $_ <br> ";
        }
    }
}
```
Here is the input file.
```
Pull down experiments identified five novel TDRD3 interacting partners, most of which are potentially methylated RNA binding proteins

Here we develop the notion that mRNA regulation via RNA binding proteins, or ribonomics, also contributes to post-ischemic TA


PUF proteins comprise a highly conserved family of sequence-specific RNA binding proteins that regulate target mRNAs via binding directly to their 3'UTRs

We review here results arising from the systematic functional analysis of Nova, a neuron-specific RNA binding protein targeted in an autoimmune neurological disorder associated with cancer

A group of RNA binding proteins exerts their roles through the autonomous flowering pathway

We have previously identified and characterized two novel nuclear RNA binding proteins, p34 and p37, which have been shown to bind 5S rRNA in Trypanosoma brucei

These mRNAs encode RNA binding proteins, signaling molecules and a replication-independent histone


Among the observed 3'UTR RNA binding proteins, we have confirmed a 52 kDa protein as the human La autoantigen by using purified recombinant protein and a polyclonal La antibody

The modulation of mRNA binding proteins, therefore, illuminates a promising approach for the pharmacotherapy of those key pathologies mentioned above and characterized by a posttranscriptional dysregulation.
```
How can I match the exact phrase?
Is this approach wrong?
None of these sentences are picked up by the program!
How do I solve this problem?



With regards
Archana
Aug 12 '08 #7

Kelicula
Expert 100+
P: 176
The statement you have:
```perl
while(<FH>){
    1;
}
```
will automatically go through each "line" of the input file, loading them one at a time into $_. I'm not sure what this line does:

```perl
if($_=~/PMID:(.*)/){
```
But this code will find all matches and add the span around them. I assumed you also wanted to match "mRNA binding proteins".

```perl
use diagnostics;
use warnings;

my $word = 'RNA binding proteins';

open(FH, "param.txt") or die "Can't open file: $!";

while (<FH>) {
    if (/$word/) {
        # The optional leading "m" also catches "mRNA binding proteins".
        $_ =~ s/(\bm?$word\b)/<span style="background-color:#E1FF77">$1<\/span>/img;
    }
    print;    # print every line, highlighted or not
}

close(FH);
```
If you want to print ONLY the lines that contained a match, try this:

```perl
use diagnostics;
use warnings;

my $word = 'RNA binding proteins';

open(FH, "param.txt") or die "Can't open file: $!";

while (<FH>) {
    if (/$word/) {
        $_ =~ s/(\bm?$word\b)/<span style="background-color:#E1FF77">$1<\/span>/img;
        print "Match Found:\n";
        print "$_\n\n";    # only matching lines reach this point
    }
}

close(FH);
```
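
For trying the line-by-line approach without a param.txt on disk, the same loop can be pointed at an in-memory filehandle. This is only a sketch: the sub name matching_lines and the sample text are made up, and the lexical filehandle with three-argument open is a substitution for the bareword FH used above.

```perl
use strict;
use warnings;

my $word = 'RNA binding proteins';

# Hypothetical helper: return only the lines of $text that contain the phrase,
# with each occurrence (optionally prefixed by "m") wrapped in a span.
sub matching_lines {
    my ($text) = @_;
    my @hits;
    # Three-argument open on a scalar reference gives an in-memory filehandle.
    open(my $fh, '<', \$text) or die "Can't open in-memory handle: $!";
    while (my $line = <$fh>) {
        chomp $line;
        # s///g is true only if at least one substitution was made.
        if ($line =~ s/(\bm?$word\b)/<span style="background-color:#E1FF77">$1<\/span>/ig) {
            push @hits, $line;
        }
    }
    close($fh);
    return @hits;
}

my $sample = "A group of RNA binding proteins\nno phrase here\nThe modulation of mRNA binding proteins\n";
print "$_\n" for matching_lines($sample);
```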
Aug 12 '08 #8

P: 79
Hi,

I have found a few sentences that contain the phrase like this, and I could not match it in them:

```
eukaryotic type KH-domain, typical of the KH-domain type I superfamily of RNA
binding proteins, and both recombinant and native MOEP19 bind polynucleotides.

Karyopherinbeta2 (Kap beta2) or transportin imports numerous RNA binding
proteins into the nucleus.

many questions remain about how these mechanisms are regulated by RNA binding
proteins in the environment of differentiated cells and tissues.
```
The phrase is not matching in these sentences. In such cases, how should I match "RNA binding proteins"?

What conditions should I check for to match this phrase?

With regards
Archana
Aug 13 '08 #9

nithinpes
Expert 100+
P: 410
That is because you have a newline separating the words of the phrase. That needs to be included in the regex that you are using. To include all the conditions that you have mentioned so far, set $word as below:

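```perl
$word="RNA( |-|\n)binding( |\n)proteins?";
```
This does not work if you are reading one line at a time, because each half of the phrase then arrives in a different read. You need to change the input record separator so that more than one line is read at once. Setting it to the empty string switches to paragraph mode, which reads one blank-line-separated paragraph per iteration (use undef $/ instead if you want the entire file in a single read):

```perl
open(FH,"param.txt")|| die "cannot open file\n";
$/ ="";
```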
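
Here is a small sketch of the paragraph-mode idea. It uses \s+ between the words rather than listing space and newline separately, which is a slight generalisation of the pattern above and would also accept tabs or repeated spaces. The sub name count_in_paragraphs and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Allow a hyphen or any run of whitespace (including a newline) between words.
my $phrase = qr/\bRNA(?:-|\s+)binding\s+proteins?\b/i;

# Hypothetical helper: read $text paragraph by paragraph and count matches in each.
sub count_in_paragraphs {
    my ($text) = @_;
    local $/ = "";    # paragraph mode: records are separated by blank lines
    open(my $fh, '<', \$text) or die "Can't open in-memory handle: $!";
    my @counts;
    while (my $para = <$fh>) {
        my $n = () = $para =~ /$phrase/g;    # number of matches in this paragraph
        push @counts, $n;
    }
    close($fh);
    return @counts;
}

my $text = "imports numerous RNA binding\nproteins into the nucleus.\n\n"
         . "regulated by RNA\nbinding proteins and RNA-binding protein.\n";
print join(',', count_in_paragraphs($text)), "\n";
```

Using local keeps the change to $/ confined to the sub, so other reads in the program still see the default line-at-a-time behaviour.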
Aug 13 '08 #10

nithinpes
Expert 100+
P: 410
Also, while reading the entire text at once, you need to use the /g modifier to search for the pattern repeatedly in the input string.
The lines:

```perl
while(<FH>){
    if(/$word/){
```
should be changed to:

```perl
while(<FH>){
    while(/$word/g){
```
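
As an illustration of what the inner while with /g does: in scalar context each successful match resumes from where the previous one ended, so the loop body runs once per occurrence. A minimal sketch, with the sub name match_offsets and the sample string made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: list each match of $re in $text with its start offset.
sub match_offsets {
    my ($text, $re) = @_;
    my @found;
    # In scalar context, /g continues from the end of the previous match.
    while ($text =~ /($re)/g) {
        push @found, [ $1, pos($text) - length($1) ];
    }
    return @found;
}

my $text = "RNA binding proteins bind RNA; other RNA-binding proteins do too.";
for my $m (match_offsets($text, qr/RNA[ -]binding proteins?/i)) {
    print "$m->[0] at offset $m->[1]\n";
}
```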
Aug 13 '08 #11

P: 79
Hi,

I have one problem now.

So far I have used "RNA binding protein" as an example, but I actually have a text box that takes user input.

Here is the code:

```perl
$word=param('query');
```
Here are the input sentences:

```
IRES elements consist of cis-acting RNA structures that often operate in association with specific RNA-binding proteins to recruit the translational machinery.

Overall, the participation of additional RNA binding proteins in controlling beta-F1-ATPase expression and therefore, in defining the bioenergetic signature of the cancer cell is expected.

We describe here a complete scaffold-independent analysis of the RNA-binding protein of the four KH domains of KSRP.

RNA chaperones are non-specific RNA binding protein that help RNA folding by resolving misfolded structures or preventing their formation.
```
How can I match $word so that all of these sentences are retrieved?

I tried this, but it's not working!

```perl
if($word=~/(.*)[\s\-]?/)
```
It matches only the last sentence!

But I want $word to match all of the sentences!

How should I write the regular expression?

With regards
Archana
Aug 14 '08 #12

Kelicula
Expert 100+
P: 176
Just add a "g" for global search.

Try to understand all of this: Regular Expression Tutorial
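
For the text-box case asked about above, one possible approach (a sketch, not something proposed in this thread) is to build the pattern from the user's words: escape each word with quotemeta, join the words with a class that accepts hyphens or whitespace, and allow an optional trailing s. The sub name phrase_regex is hypothetical, and the optional plural is an assumption that only suits simple English plurals.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a user-typed phrase into a tolerant regex.
sub phrase_regex {
    my ($query) = @_;
    # Split on whitespace or hyphens and drop empty pieces.
    my @words = grep { length } split /[\s-]+/, $query;
    return unless @words;
    # Strip a trailing "s" from the last word so it can be made optional below.
    $words[-1] =~ s/(?<=.)s\z//i;
    # quotemeta keeps any regex metacharacters in the user input literal.
    my $body = join '[\s-]+', map { quotemeta } @words;
    return qr/\b${body}s?\b/i;
}

my $re = phrase_regex('RNA binding proteins');
for my $s ('specific RNA-binding proteins to recruit',
           'RNA binding protein that help',
           "RNA binding\nproteins into") {
    print(($s =~ $re ? 'match' : 'no match'), "\n");
}
```

Escaping the input matters here because $word comes straight from param('query'); without it, a user typing characters such as ( or * would change the meaning of the pattern or make it fail to compile.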
Aug 23 '08 #13
