Bytes IT Community

How can I delete contents between

P: 3
How can I delete the contents between <SEC-HEADER> and </SEC-HEADER> in an .htm file?
Why doesn't my code work?
Thanks!
    #!/usr/bin/perl

    # This is a program which can process the Edgar 10-k html file into a plain text
    # file without graphs and tables.

    $filename="H:/Test Data/wmt2004.htm";
    open IN, '<', $filename or die;
    @contents = <IN>;
    close IN;

    @contents = grep !/<SEC-HEADER>.*</SEC-HEADER>/ @contents;

    $filenameout="H:/Test Data/wmt2004-processed.htm";
    open OUT, '>', $filenameout or die;
    print OUT @contents;
    close OUT;
Jun 8 '10 #1
3 Replies


numberwhun
Expert Mod 2.5K+
P: 3,503
Have you taken a look at the perldoc page for grep in Perl? You will note that your grep statement should actually be coded as follows:

    @contents = grep {!/<SEC-HEADER>.*<\/SEC-HEADER>/} @contents;
As for "not working", can you please elaborate? What are you seeing that is going wrong and what are you expecting to see?
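
For reference, here is a minimal sketch of the two call forms that perldoc describes for grep: the expression form needs a comma before the list, while the block form does not. The sample data in @lines is made up for illustration, and m{...} is used as the delimiter so the slash in the closing tag does not need escaping.

```perl
use strict;
use warnings;

# Hypothetical sample input, one element per line.
my @lines = (
    "keep me\n",
    "<SEC-HEADER>foo</SEC-HEADER>\n",
    "keep me too\n",
);

# Expression form: a comma is required between the expression and the list.
my @expr_form = grep !m{<SEC-HEADER>.*</SEC-HEADER>}, @lines;

# Block form: no comma after the closing brace.
my @block_form = grep { !m{<SEC-HEADER>.*</SEC-HEADER>} } @lines;

print @block_form;
```

Both forms keep only the lines that do not contain a complete <SEC-HEADER>...</SEC-HEADER> pair on a single line.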

Regards,

Jeff
Jun 8 '10 #2

Expert
P: 70
Your code has syntax errors and does not compile. Please post the actual code you are running.

You should have also posted a small snippet of your input file. Here is my guess: your input file has start and end tags on different lines. Consider:

    use warnings;
    use strict;

    my @contents = <DATA>;
    @contents = grep { !/<SEC-HEADER>.*<\/SEC-HEADER>/ } @contents;
    print @contents;

    __DATA__
    <html>

    <SEC-HEADER>foo</SEC-HEADER>

    <SEC-HEADER>
    bar</SEC-HEADER>

    </html>
This prints out:

    <html>


    <SEC-HEADER>
    bar</SEC-HEADER>

    </html>
In any case, you really should use one of the HTML parser modules from CPAN instead of regular expressions.
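
If the tags do span several lines and a regular expression is still acceptable for a quick job, one possible approach is to work on the whole file as a single string rather than line by line. The sketch below assumes the blocks are not nested; the helper name strip_sec_header and the sample text are hypothetical, and a parser module remains the more robust choice.

```perl
use strict;
use warnings;

# Remove every <SEC-HEADER>...</SEC-HEADER> block from a string,
# including blocks whose tags sit on different lines.
# (strip_sec_header is a hypothetical helper name.)
sub strip_sec_header {
    my ($text) = @_;
    # /s lets . match newlines, .*? is non-greedy so each block is
    # removed separately, and /g repeats the substitution for every block.
    $text =~ s{<SEC-HEADER>.*?</SEC-HEADER>}{}gs;
    return $text;
}

# Sample input held in one string, standing in for a slurped file.
my $html = <<'END_HTML';
<html>

<SEC-HEADER>foo</SEC-HEADER>

<SEC-HEADER>
bar</SEC-HEADER>

</html>
END_HTML

my $cleaned = strip_sec_header($html);
print $cleaned;
```

To apply this to a real file, the contents would need to be read into one scalar first, for example with `my $html = do { local $/; <IN> };` on an opened filehandle, since undefining `$/` makes the read return the entire file at once.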
Jun 8 '10 #3

P: 3
Thank you guys, I figured it out. Thanks very much!
I am trying to get familiar with Perl.
Jun 9 '10 #4
