Removing binary string from text

Hello,

I have an HTML text string like this:

" When  creating  a 
new  message,  Reset  occurs  when  '\' 
is  entered  in  Address.     
S-1a<BR>メール作成時アドレスに\を入れるとリセットす 襦

How can I find and remove strings like
メール作成時アドレスに\を入れるとリセットする from the text?

Thanks for any tips
Nov 17 '05 #1
Isa Janfada <is*********@comhem.se> wrote:
I have an HTML text string like this:

" When&nbsp; creating&nbsp; a&nbsp;
new&nbsp; message,&nbsp; Reset&nbsp; occurs&nbsp; when&nbsp; '\'&nbsp;
is&nbsp; entered&nbsp; in&nbsp; Address.&nbsp; &nbsp; &nbsp;
S-1a<BR>メール作成時アドレスに\を入れるとリセットす 襦

How can I find and remove strings like
メール作成時アドレスに\を入れるとリセットする from the text?


Well, there are two things to worry about here:

1) How do you want to distinguish between "real" data and "bad" data?
Should your real data always be in ASCII, for instance? If so, you
could create a StringBuilder, and then go through each character in the
string, appending it if its integer value is less than 128. (It would
be more efficient to append a whole substring at a time, but slightly
more complicated - unless efficiency is a concern, I'd go with the
"character at a time" route to start with.)

2) Why have you got bogus data in the first place?

The second point is a more important one, to my mind - if you find out
why you're getting data you don't want, you may find you've got a
problem higher up the food chain, or you may find a way of not
receiving the "bad" data in the first place, which is better than
filtering it out later.
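
For point 1, the character-at-a-time filter could be sketched roughly as follows. This is only a minimal sketch in Java (the thread doesn't name a language, and the same loop should carry over to anything with a StringBuilder-style class). The names AsciiFilter and keepAscii are made up for illustration, and it assumes the string has already been decoded to Unicode text.

```java
// Hypothetical helper, not from the thread: keeps only ASCII characters.
class AsciiFilter {
    static String keepAscii(String text) {
        // Pre-size the builder; the result can never be longer than the input.
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // ASCII covers code points 0 through 127 inclusive.
            if (c < 128) {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Note that a filter like this also drops the non-breaking space (U+00A0, value 160) that &nbsp; decodes to, so those would need handling separately if they should become ordinary spaces.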

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Nov 17 '05 #2


2) Why have you got bogus data in the first place?

The HTML page mixes two different character sets. I want to store only ASCII
characters in the database, so I must remove the bad characters.

Thank you Jon Skeet

Nov 17 '05 #3
Isa Janfada <is*********@comhem.se> wrote:
2) Why have you got bogus data in the first place?

The HTML page mixes two different character sets. I want to store only ASCII
characters in the database, so I must remove the bad characters.


So are you assuming that *none* of the data you're interested in will be
non-ASCII?

--
Jon Skeet - <sk***@pobox.com>
Nov 17 '05 #4
On Sun, 30 Oct 2005 19:59:29 +0100, "Isa Janfada"
<is*********@comhem.se> wrote:


2) Why have you got bogus data in the first place?

The HTML page mixes two different character sets. I want to store only ASCII
characters in the database, so I must remove the bad characters.

Thank you Jon Skeet


Some ideas:

- look for any characters outside the normal range of ASCII, this
might be anything above 255.

- does the different character set always start with the same
character? If so then you might be able to look for the first
occurrence of that character and ignore anything from then on.

- does the different character set always come at the end of the
string? If not then you are going to have to think of a way to pick
up normal ASCII again.
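
The "ignore anything from then on" idea might look like the sketch below. It's Java, the names AsciiPrefix and upToFirstNonAscii are hypothetical, and it takes 127 as the top of the ASCII range. It only fits if the unwanted text always comes at the end of the string; anything after the first foreign character is lost, including later ASCII.

```java
// Hypothetical helper, not from the thread: truncates at the first non-ASCII character.
class AsciiPrefix {
    static String upToFirstNonAscii(String text) {
        for (int i = 0; i < text.length(); i++) {
            // The first character above 127 marks the start of the unwanted text.
            if (text.charAt(i) > 127) {
                return text.substring(0, i);
            }
        }
        // No non-ASCII character found, so the whole string is kept.
        return text;
    }
}
```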

rossum

The ultimate truth is that there is no ultimate truth
Nov 17 '05 #5
rossum <ro******@coldmail.com> wrote:
Some ideas:

- look for any characters outside the normal range of ASCII, this
might be anything above 255.

Not quite - anything above 127. That's where ASCII ends.

- does the different character set always start with the same
character? If so then you might be able to look for the first
occurrence of that character and ignore anything from then on.

- does the different character set always come at the end of the
string? If not then you are going to have to think of a way to pick
up normal ASCII again.


If the character encodings are okay, that should be fine - hopefully
the text has been correctly decoded to Unicode, and it's just the
non-ASCII characters that need to be discarded.
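
To illustrate the "decode first, then discard" order, here is a rough Java sketch. The names DecodeThenFilter and asciiOnly are invented for the example, and it assumes you know which charset the raw bytes are really in (UTF-8 in the usage line is just an assumption). Decoding matters because in some multi-byte encodings, Shift_JIS for one, the trailing byte of a character can fall in the ASCII range, so filtering undecoded bytes could leave stray ASCII behind.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not from the thread: decodes bytes, then keeps only ASCII.
class DecodeThenFilter {
    static String asciiOnly(byte[] raw, Charset charset) {
        // Step 1: turn the bytes into Unicode text using the page's real encoding.
        String decoded = new String(raw, charset);
        // Step 2: discard every character outside the ASCII range (0 to 127).
        StringBuilder sb = new StringBuilder(decoded.length());
        for (int i = 0; i < decoded.length(); i++) {
            char c = decoded.charAt(i);
            if (c <= 127) {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Usage would be something like `DecodeThenFilter.asciiOnly(bytes, StandardCharsets.UTF_8)`, swapping in whatever charset the page actually uses.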

--
Jon Skeet - <sk***@pobox.com>
Nov 17 '05 #6
