Bytes IT Community

regular expression inquiry

I'm processing the following sequence, which is more than 100k characters long:

1 cagatgctga taaaaaagtg tgttcctcat agcatttatt taattgaaat atttcaagaa
61 cttgaatgta ctaaaaattg agacaaacag tagcaaatca taaaaaaaaa ttgaagtgaa
121 ttttacaact ggattcatgt gcctaatatt ttcattggga agtggattca tgtttaacat
181 ttccattggg <snippet>

I wrote this program:

<?php
session_start();

if (isset($_POST['seq']))
    $seq = $_POST['seq'];
else
    $seq = $_GET['seq'];

$seq = preg_replace("/[\s\r\n0-9]/", "", $seq);
echo $seq;

?>

But the output consists of fragmented sequences (i.e. a partially processed
result). What is the problem?
Jul 27 '06 #1
3 Replies


vito wrote:
i wrote a program

$seq = preg_replace("/[\s\r\n0-9]/", "", $seq);

but it generates an output of fragmented sequences (i.e. partially
processed result), what is the problem?
Your regular expression may not be doing quite what you expect: because the
pattern is double-quoted, PHP converts \r and \n to their respective special
characters *before* the string reaches preg_replace(), while \s is left for
preg_replace() to interpret. You might try

$seq = preg_replace('/[[:space:]0-9]/', '', $seq);

which sidesteps the issue entirely by using [:space:]. Personally, I would
just use

$seq = preg_replace('/[^acgt]/', '', $seq);

which removes everything except the lowercase characters a, c, g, and t. This
works no matter what else happens to be present in the input.
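
A minimal sketch of the quoting point above, for anyone who wants to try it. The sample input is a hypothetical two-line excerpt of the posted sequence, and the helper names clean_blacklist() and clean_whitelist() are made up for illustration. It compares the double-quoted and single-quoted forms of the pattern as strings, then runs both, along with the [^acgt] variant, on the same input.

```php
<?php
// Hypothetical sample: two lines in the same layout as the posted sequence,
// one ending in CRLF and one in LF.
$raw = "1 cagatgctga taaaaaagtg\r\n61 cttgaatgta ctaaaaattg\n";

// Double-quoted: PHP replaces \r and \n with real CR/LF bytes, but leaves \s
// alone because \s is not a PHP string escape.
$doubleQuoted = "/[\s\r\n0-9]/";

// Single-quoted: every backslash sequence reaches PCRE untouched.
$singleQuoted = '/[\s\r\n0-9]/';

// Strip whatever the given pattern matches (illustrative helper).
function clean_blacklist(string $seq, string $pattern): string
{
    return preg_replace($pattern, '', $seq);
}

// Keep only lowercase a, c, g, t (illustrative helper).
function clean_whitelist(string $seq): string
{
    return preg_replace('/[^acgt]/', '', $seq);
}

// The two pattern strings are not byte-for-byte identical.
var_dump($doubleQuoted === $singleQuoted);

echo clean_blacklist($raw, $doubleQuoted), "\n";
echo clean_blacklist($raw, $singleQuoted), "\n";
echo clean_whitelist($raw), "\n";
```

On this sample all three calls should print the same cleaned string, since a literal CR or LF inside a character class still matches itself. If that holds for the real input too, the quoting style alone may not explain the fragmented output. Note that the whitelist version is case-sensitive, so uppercase bases would be stripped unless the class is widened or the i modifier is added.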

HTH,
--
Benjamin D. Esham

Jul 27 '06 #2

vito wrote:
I'm processing the following sequence with length more than 100k

1 cagatgctga taaaaaagtg tgttcctcat agcatttatt taattgaaat atttcaagaa
61 cttgaatgta ctaaaaattg agacaaacag tagcaaatca taaaaaaaaa ttgaagtgaa
121 ttttacaact ggattcatgt gcctaatatt ttcattggga agtggattca tgtttaacat
181 ttccattggg <snippet>

i wrote a program

<?php
session_start();

if (isset ($_POST['seq']) )
$seq = $_POST['seq'];
else
$seq= $_GET['seq'];

$seq = preg_replace("/[\s\r\n0-9]/", "", $seq);
echo $seq;

?>

but it generates an output of fragmented sequences (i.e. partially processed
result), what is the problem?
To quote from php.net's "Pattern Syntax" article:

"By default, a whitespace character (eg. \s) is any character that the
C library function isspace() recognizes, though it is possible to
compile PCRE with alternative character type tables. Normally isspace()
matches space, formfeed, newline, carriage return, horizontal tab, and
vertical tab."

Per that, including \r and \n in the class definition is redundant.

Anyway, what you should get with the code you wrote is an uninterrupted
sequence of a's, g's, t's, and c's. What output do you actually get when you
feed it the sequence above? And what output do you want?
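
The redundancy point can be checked directly. The sketch below is only an illustration: the helper names is_pcre_space() and strip_with() and the sample string are made up, and vertical tab is left out on purpose because its treatment by \s is questioned elsewhere in this thread. It tests whether \s alone matches CR and LF, then compares the original class with a shorter one that drops \r and \n.

```php
<?php
// True if the single character $char is matched by PCRE's \s (illustrative helper).
function is_pcre_space(string $char): bool
{
    return preg_match('/\A\s\z/', $char) === 1;
}

// Remove everything matched by $pattern from $seq (illustrative helper).
function strip_with(string $pattern, string $seq): string
{
    return preg_replace($pattern, '', $seq);
}

// Hypothetical sample with mixed whitespace: CRLF, a tab, spaces, and LF.
$sample = "1 cagatgctga\r\n61 cttgaatgta\tctaaaaattg\n";

foreach (["\r" => 'CR', "\n" => 'LF', "\t" => 'TAB', ' ' => 'space'] as $char => $name) {
    printf("%-5s matched by \\s: %s\n", $name, var_export(is_pcre_space($char), true));
}

// Original class versus the same class without the redundant \r and \n.
$withRedundant    = strip_with('/[\s\r\n0-9]/', $sample);
$withoutRedundant = strip_with('/[\s0-9]/', $sample);

var_dump($withRedundant === $withoutRedundant);
echo $withoutRedundant, "\n";
```

If both variants produce the same string, the extra \r and \n are harmless but unnecessary, which is consistent with the php.net passage quoted above.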

Jul 27 '06 #3

vito wrote:
I'm processing the following sequence with length more than 100k

1 cagatgctga taaaaaagtg tgttcctcat agcatttatt taattgaaat atttcaagaa
61 cttgaatgta ctaaaaattg agacaaacag tagcaaatca taaaaaaaaa ttgaagtgaa
121 ttttacaact ggattcatgt gcctaatatt ttcattggga agtggattca tgtttaacat
181 ttccattggg <snippet>
Nice genes.
i wrote a program

<?php
session_start();

if (isset ($_POST['seq']) )
$seq = $_POST['seq'];
else
$seq= $_GET['seq'];

$seq = preg_replace("/[\s\r\n0-9]/", "", $seq);
echo $seq;

?>
You provided an example of the input but not the output, so I'm not quite
sure. But...

Since you're using PCRE, why not:
/[\s\r\n\d]/

or even

/[^ctag]/

It's improbable that this matters here, but be aware that \s doesn't match
ASCII chr(11) (vertical tab), whereas [:space:] does.
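
Whether \s matches chr(11) may depend on the PCRE version PHP was built against, so it is safer to test than to assume. The following is a rough sketch: the helper class_matches() is a made-up name, and nothing here claims a particular result for \s. It prints the PCRE version and reports what \s and [:space:] each do with a vertical tab, plus a quick check that \d covers a digit.

```php
<?php
// ASCII 11, the vertical tab character discussed above.
$vt = chr(11);

// True if $char is matched by a character class containing $class
// (illustrative helper; $class is inserted between [ and ] as-is).
function class_matches(string $class, string $char): bool
{
    return preg_match('/\A[' . $class . ']\z/', $char) === 1;
}

// PCRE_VERSION is a constant provided by PHP's PCRE extension.
printf("PCRE version: %s\n", PCRE_VERSION);

printf("\\s matches chr(11):        %s\n", var_export(class_matches('\s', $vt), true));
printf("[:space:] matches chr(11): %s\n", var_export(class_matches('[:space:]', $vt), true));
printf("\\d matches '7':            %s\n", var_export(class_matches('\d', '7'), true));
```

If the two whitespace lines disagree on your installation, [[:space:]] or the whitelist pattern /[^ctag]/ is the more predictable choice for cleaning the sequence.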

HTH
C.
Jul 27 '06 #4
