#1
August 11th, 2008, 07:23 AM
Archanak (Member)
Match exact word/phrase

Hi,

I have a phrase like this: "Rna binding proteins", and I want to match this exact phrase. I have written code like this:

    $sentence="Overall, the participation of additional RNA binding proteins in controlling beta-F1-ATPase expression.";

    $word="RNA binding proteins";

    if($sentence=~/\b$word\b/)
    {
        print "matched";
        $sentence=~s/(\b$word\b)/<spanstyle="background-color:#E1FF77">$1<\/span>/i;
    }
    print "<br> sentence=$sentence<br>";

The phrase "Rna binding proteins" should be highlighted in the sentence.

I have a problem. Many sentences contain this same phrase, but it is not getting matched or highlighted!

How do I write the code so that it also picks up sentences containing variants such as "Rna-binding protein" and "Rna binding protein"?

I don't understand why the sentences with "Rna binding proteins" are not being retrieved.

with regards
Archana

#2
August 11th, 2008, 08:12 AM
nithinpes (Expert)

You can try this:

    $word="RNA[ -]binding proteins?";

    if($sentence=~/\b$word\b/i)
    {
        print "matched";
        # ... rest of your code ...
    }

- The words RNA and binding may be separated by a space or a hyphen, so both are placed inside a character class, which matches exactly one of those characters.
- Also, from your description, the phrase may contain 'protein' or 'proteins'. The '?' matches zero or one occurrence of the character preceding it ('s' in this case).
- The /i option makes the match case-insensitive (both RNA and Rna will be matched). You can also use the /g option if you want to find multiple occurrences within a line.
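
As a rough illustration of the three points above, here is a minimal sketch. The has_phrase helper and the test phrases are made up for this example and are not taken from the original data:

```perl
use strict;
use warnings;

# Pattern from the reply above: a space or hyphen after "RNA",
# and an optional plural "s" at the end.
my $word = 'RNA[ -]binding proteins?';

# Hypothetical helper: returns 1 when any variant of the phrase occurs in $text.
sub has_phrase {
    my ($text) = @_;
    return $text =~ /\b$word\b/i ? 1 : 0;
}

# Hypothetical test phrases. The last one has two spaces after "RNA",
# which the single-character class [ -] does not accept.
for my $text ('Rna-binding protein', 'RNA binding proteins',
              'rna binding protein', 'RNA  binding proteins') {
    printf "%-24s %s\n", "'$text'", has_phrase($text) ? 'matched' : 'no match';
}
```

Note that the \b in front of RNA also keeps the pattern from matching inside "mRNA", since there is no word boundary between "m" and "R".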

#3
August 11th, 2008, 08:15 AM
KevinADC (Expert)

Also, there is an error in the HTML code you posted:

    <spanstyle=

should be:

    <span style=

#4
August 11th, 2008, 09:56 AM
Archanak (Member)

Hi,

Yes, I corrected it, but the result is still the same!

These are the 2 sentences:

    Overall, the participation of additional RNA binding proteins in controlling beta-F1-ATPase expression and therefore, in defining the bioenergetic signature of the cancer cell is expected.

    RNA chaperones are non-specific RNA binding proteins that help RNA folding by resolving misfolded structures or preventing their formation.

Only the "RNA binding proteins" in the second sentence is matched and highlighted; the first sentence is not matched.

Why is the phrase not matched?

With regards

Archana

#5
August 11th, 2008, 06:09 PM
KevinADC (Expert)

Are you reading lines from a file or are all the sentences one long string, or what?

#6
August 11th, 2008, 11:28 PM
Kelicula (Expert)

Without the "g" modifier, s/// replaces only the first match in the string and then stops.

You need the "match global" modifier.
Add a "g" along with your "case-insensitive" modifier (i).

    $sentence="Overall, the participation of additional RNA binding proteins in controlling beta-F1-ATPase expression.";
    $word="RNA binding proteins";

    if($sentence=~/\b$word\b/i)
    {
        print "matched";
        $sentence=~s/(\b$word\b)/<span style="background-color:#E1FF77">$1<\/span>/ig;
    }
    print "<br> sentence=$sentence<br>";

That should do it.
You may need a loop.
For instance, to remove (nested (even deeply nested (like this))) remarks, you could use:

    1 while s/\([^()]*\)//g; # This works on $_

That's right out of the "Programming Perl" book.
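
To see why that one-liner needs the loop, here is a small sketch that applies it to a named variable instead of $_. The strip_remarks helper and the sample string are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: removes parenthesised remarks, innermost first.
# A single pass can only remove groups with no parentheses inside them,
# so the substitution is repeated until it stops matching.
sub strip_remarks {
    my ($text) = @_;
    1 while $text =~ s/\([^()]*\)//g;
    return $text;
}

print strip_remarks('keep (drop (this (too))) end'), "\n";
```

Each pass peels off one nesting level, and the loop ends because every successful pass makes the string shorter.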

Do not apply that loop idiom to your highlighting substitution, though. It works for the parentheses example because every pass removes text, so the pattern eventually stops matching. In your case the replacement puts the matched phrase back (wrapped in the span), so a loop like this one matches again on every pass and never terminates, with or without the "g" modifier:

    while($sentence=~s/(\b$word\b)/<span style="background-color:#E1FF77">$1<\/span>/i){   # never terminates
        print "matched";
    }

A single substitution with "g" is enough. s///g returns the number of replacements it made (false if there were none), so you also don't need a separate match in the if statement.

So the final result would be:

    $sentence="Overall, the participation of additional RNA binding proteins in controlling beta-F1-ATPase expression.";
    $word="RNA binding proteins";

    if($sentence=~s/(\b$word\b)/<span style="background-color:#E1FF77">$1<\/span>/ig){
        print "matched";
    }
    print "<br> sentence=$sentence<br>";

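
A minimal sketch of a global substitution that reports how many replacements it made. The highlight helper and the sample sentence are made up for this example; the span markup is the one used in the thread:

```perl
use strict;
use warnings;

# Hypothetical helper: wraps every occurrence of $word in a highlighting span.
# Returns the modified text and the number of replacements made.
sub highlight {
    my ($text, $word) = @_;
    my $count = ($text =~ s/(\b$word\b)/<span style="background-color:#E1FF77">$1<\/span>/gi);
    return ($text, $count || 0);
}

my ($html, $n) = highlight(
    'RNA binding proteins help other rna binding proteins.',
    'RNA binding proteins',
);
print "$n replacement(s)\n$html\n";
```

Because of /g, one call handles every occurrence in the string, and $1 keeps the original capitalisation of each match.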

Hope it helps!

#7
August 12th, 2008, 10:32 AM
Archanak (Member)

Hi,

I tried it, but it didn't work!

I have paragraphs in a file. I split each paragraph into sentences and then match the word against each sentence.

Here is the code.
    open(FH,"param.txt")|| die "cannot open file\n";

    while(<FH>)
    {
        $content.=$_;
        if($_=~/PMID:(.*)/)
        {
            $content_all{$_}=$content;
            $content="";
            @sentences=split("\n\n",$content_all{$_});
            for($i=0;$i<=$#sentences;$i++)
            {
                print "<br> ***** $sentences[$i] ---> $i <br>";
            }
            &subpassparam($content_all{$_});
        }
    }
    sub subpassparam
    {
        $abs=$_[0];
        #print "<br>abstract passed = $abs <br>";
        @sentences=split("\n\n",$content_all{$_});
        @list=split /[.]\s+\W*/, $sentences[4];
        foreach(@list)
        {
            #print "<br>sentences = $_ <br>";
            if($_=~/\b$word\b/)
            {
                print "<br>yes its matching <br>";
                $_=~s/(\b$word\b)/<span style="background-color:#E1FF77">$1<\/span>/img;
                print "<br>sentences only at list = $_ <br> ";
            }
        }
    }

Here is the input file.
    Pull down experiments identified five novel TDRD3 interacting partners, most of which are potentially methylated RNA binding proteins

    Here we develop the notion that mRNA regulation via RNA binding proteins, or ribonomics, also contributes to post-ischemic TA


    PUF proteins comprise a highly conserved family of sequence-specific RNA binding proteins that regulate target mRNAs via binding directly to their 3'UTRs

    We review here results arising from the systematic functional analysis of Nova, a neuron-specific RNA binding protein targeted in an autoimmune neurological disorder associated with cancer

    A group of RNA binding proteins exerts their roles through the autonomous flowering pathway

    We have previously identified and characterized two novel nuclear RNA binding proteins, p34 and p37, which have been shown to bind 5S rRNA in Trypanosoma brucei

    These mRNAs encode RNA binding proteins, signaling molecules and a replication-independent histone


    Among the observed 3'UTR RNA binding proteins, we have confirmed a 52 kDa protein as the human La autoantigen by using purified recombinant protein and a polyclonal La antibody

    The modulation of mRNA binding proteins, therefore, illuminates a promising approach for the pharmacotherapy of those key pathologies mentioned above and characterized by a posttranscriptional dysregulation.

How do I match the exact word?
Is this approach wrong?
None of these sentences are getting picked up by the program!
How can I solve this problem?



With regards
Archana

#8
August 12th, 2008, 02:32 PM
Kelicula (Expert)

The statement you have:

    while(<FH>){
        1;
    }

will automatically go through each "line" of the input file, loading them one at a time into $_. I'm not sure what this line does:

    if($_=~/PMID:(.*)/){

But this code will find all matches and add the span around them. I assumed you also wanted to match "mRNA binding proteins".

    use strict;
    use warnings;
    use diagnostics;

    my $word = 'RNA binding proteins';

    open(my $fh, '<', 'param.txt') or die "Can't open file: $!";

    while (<$fh>) {
        # "m?" also accepts "mRNA"; /g replaces every occurrence on the line
        s/(\bm?$word\b)/<span style="background-color:#E1FF77">$1<\/span>/ig;
        print;
    }

    close($fh);

If you also want to print ONLY the lines that contained a match try this.

    use strict;
    use warnings;
    use diagnostics;

    my $word = 'RNA binding proteins';

    open(my $fh, '<', 'param.txt') or die "Can't open file: $!";

    while (<$fh>) {
        # s///g returns the number of replacements, so it doubles as the test
        if (s/(\bm?$word\b)/<span style="background-color:#E1FF77">$1<\/span>/ig) {
            print "Match Found:\n";
            print "$_\n\n";
        }
    }

    close($fh);


#9
August 13th, 2008, 12:17 PM
Archanak (Member)

Hi,

I have found a few sentences that contain the phrase like this, but I could not match it in them:

    eukaryotic type KH-domain, typical of the KH-domain type I superfamily of RNA
    binding proteins, and both recombinant and native MOEP19 bind polynucleotides.

    Karyopherinbeta2 (Kap beta2) or transportin imports numerous RNA binding
    proteins into the nucleus.

    many questions remain about how these mechanisms are regulated by RNA binding
    proteins in the environment of differentiated cells and tissues.

The phrase is not matching in these sentences. In such cases, how should I match "RNA binding proteins"?

What conditions should I check for to match this phrase?

With regards
Archana

#10
August 13th, 2008, 02:13 PM
nithinpes (Expert)

That is because a newline separates the words of the phrase, and the regex you are using has to allow for it. To cover all the conditions that you have mentioned so far, set $word as below:

    $word="RNA( |-|\n)binding( |\n)proteins?";

This does not work if you are reading one line at a time, because the two halves of the phrase then arrive in separate reads. You need to change the input record separator so that more than one line is read at once:

    open(FH,"param.txt")|| die "cannot open file\n";
    $/ ="";

Setting $/ to the empty string selects paragraph mode: each read returns everything up to the next blank line. To read the entire file in one go, use undef $/ instead.
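
Here is a small sketch of paragraph-mode reading, using two of the wrapped sentences posted above. To keep it self-contained it reads from an in-memory filehandle rather than param.txt, and it uses \s (which matches a space or a newline) in place of the explicit alternations; both choices are assumptions of this example:

```perl
use strict;
use warnings;

# Sample text from the thread: the phrase is split across a line break.
my $data = <<'END';
Karyopherinbeta2 (Kap beta2) or transportin imports numerous RNA binding
proteins into the nucleus.

many questions remain about how these mechanisms are regulated by RNA binding
proteins in the environment of differentiated cells and tissues.
END

# \s covers spaces and newlines, so the phrase may wrap across lines.
my $word = 'RNA[\s-]+binding\s+proteins?';

my @hits;
{
    local $/ = "";    # paragraph mode: each record ends at a blank line
    open my $fh, '<', \$data or die "Can't open in-memory file: $!";
    while (my $para = <$fh>) {
        push @hits, $para if $para =~ /\b$word\b/i;
    }
    close $fh;
}
print scalar(@hits), " paragraph(s) matched\n";
```

Using local inside a block restores the previous value of $/ afterwards, so other reads in the program are not affected.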

#11
August 13th, 2008, 02:17 PM
nithinpes (Expert)

Also, when reading the entire text at once, you need to use the /g modifier to search for the pattern repeatedly in the input string.
The lines:

    while(<FH>){
    if(/$word/){

should be changed to:

    while(<FH>){
    while(/$word/g){

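
A rough sketch of what the inner while loop does with /g. The sample string is made up, and the pattern uses \s instead of the explicit space/newline alternatives:

```perl
use strict;
use warnings;

# Hypothetical input containing three variants, one of them wrapped across a line break.
my $text = "RNA binding proteins, RNA-binding protein and RNA binding\nproteins again.";
my $word = 'RNA[\s-]+binding\s+proteins?';

# In scalar context, each //g match resumes where the previous one ended,
# so the loop body runs once per occurrence.
my @found;
while ($text =~ /\b($word)\b/gi) {
    push @found, $1;
}
print scalar(@found), " match(es)\n";
```

With a plain if instead of while, only the first occurrence in each record would be seen.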

#12
August 14th, 2008, 09:54 AM
Archanak (Member)

Hi,

I have one more problem now.

So far I have used "RNA binding protein" as an example, but I actually have a text box that takes the user's input.

Here is the code.

    $word=param('query');

Here are the input sentences.

    IRES elements consist of cis-acting RNA structures that often operate in association with specific RNA-binding proteins to recruit the translational machinery.

    Overall, the participation of additional RNA binding proteins in controlling beta-F1-ATPase expression and therefore, in defining the bioenergetic signature of the cancer cell is expected.

    We describe here a complete scaffold-independent analysis of the RNA-binding protein of the four KH domains of KSRP.

    RNA chaperones are non-specific RNA binding protein that help RNA folding by resolving misfolded structures or preventing their formation.

How do I match $word so that all of these sentences are retrieved?

I tried this, but it's not working!

    if($word=~/(.*)[\s\-]?/)

It matches only the last sentence!

But I want to match $word against all the sentences!

How should I write the regular expression?

With regards
Archana

#13
August 24th, 2008, 12:35 AM
Kelicula (Expert)

Just add a "g" for global search.

Try to understand all of this: Regular Expression Tutorial
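
For the user-input case asked about above, one possible approach is to build the pattern from the query instead of hard-coding it. This is only a sketch under some assumptions: phrase_to_regex is a made-up helper, the query is given directly rather than read with param('query'), and the optional trailing "s" is a crude way to accept simple plurals:

```perl
use strict;
use warnings;

# Hypothetical helper: turns a user-supplied phrase into a tolerant regex.
# Words may be separated by any run of whitespace or hyphens.
sub phrase_to_regex {
    my ($phrase) = @_;
    my @parts = grep { length } split /[\s-]+/, $phrase;
    return unless @parts;
    # quotemeta keeps regex metacharacters in the user input from being interpreted.
    my $body = join '[\s-]+', map { quotemeta } @parts;
    return qr/\b${body}s?\b/i;
}

my $re = phrase_to_regex('RNA binding protein');

# The four sentences from the question.
my @sentences = (
    'IRES elements consist of cis-acting RNA structures that often operate in association with specific RNA-binding proteins to recruit the translational machinery.',
    'Overall, the participation of additional RNA binding proteins in controlling beta-F1-ATPase expression and therefore, in defining the bioenergetic signature of the cancer cell is expected.',
    'We describe here a complete scaffold-independent analysis of the RNA-binding protein of the four KH domains of KSRP.',
    'RNA chaperones are non-specific RNA binding protein that help RNA folding by resolving misfolded structures or preventing their formation.',
);

my @matched = grep { $_ =~ $re } @sentences;
print scalar(@matched), " of ", scalar(@sentences), " sentences matched\n";
```

One limitation of this sketch: if the user types the plural ("RNA binding proteins"), the singular form is not matched, so the query would need to be normalised first if that matters.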