
an idiot question: if ord() returns a value between 0 and 127, then the character is ASCII?


I know I'm missing something obvious, but I looked hard at this page
and did not see the format of the return specified:

http://us3.php.net/manual/en/function.ord.php
From the limited example I assume it is the decimal (not hex or binary)
value of the character being returned?

So long as the returned value is between 0 and 127, I can treat it as
ascii?

If, for some reason, I had to restrict my output to ascii, then any
time I encountered a value outside of 0 to 127, then I'd have to
replace it with something in range?

Despite all the terrific help I've gotten on comp.lang.php regarding
this issue, my RSS feeds continue to fail when users write weblog posts
in Microsoft Word and then copy and paste their post to a form and
input that. So I'm resorting to severe measures. I don't care if I end
up with a lot of garbage characters, that is fine, I just want the RSS
feed to validate.

Aug 18 '05 #1
lk******@geocities.com wrote:

: I know I'm missing something obvious, but I looked hard at this page
: and did not see the format of the return specified:

: http://us3.php.net/manual/en/function.ord.php

: From the limited example I assume it is the decimal (not hex or binary)
: value of the character being returned?

The value returned is a number; it has no format until you use it as a
string, at which point php has to decide how to format the value.

: So long as the returned value is between 0 and 127, I can treat it as
: ascii?

As ascii yes, but as a character no.

ord() returns the numeric value of a byte, not the byte itself.

You can use the numeric value to do compares, but if you try to put the
numeric value into a string then you get the formatted number representing
the value, not the original byte.

If you want to get the original byte then you use chr() to convert the
numeric value into a byte that php can insert as-is into a string.
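
A minimal sketch of the difference described above, using only ord() and
chr(). The variable names are just for illustration:

```php
<?php
// ord() reads the first byte of a string and returns it as an integer.
$value = ord('A');              // integer 65

// Putting the integer into a string formats it as decimal digits,
// not as the original byte.
$formatted = 'x' . $value;      // "x65"

// chr() converts the integer back into a one-byte string.
$restored = 'x' . chr($value);  // "xA"

echo $formatted, "\n", $restored, "\n";
```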

: If, for some reason, I had to restrict my output to ascii, then any
: time I encountered a value outside of 0 to 127, then I'd have to
: replace it with something in range?

Sure, I wouldn't use those words, but I guess you could say that.

: Despite all the terrific help I've gotten on comp.lang.php regarding
: this issue, my RSS feeds continue to fail when users write weblog posts
: in Microsoft Word and then copy and paste their post to a form and
: input that. So I'm resorting to severe measures. I don't care if I end
: up with a lot of garbage characters, that is fine, I just want the RSS
: feed to validate.

You need to do more than check ascii, you need to check for printable
characters. Binary data can contain ascii control codes, and those are
not allowed in xml.

Without checking all the details of the specs, a quick start would be to
check each character, and perhaps replace it with a . or ? whenever it
is below the value of a space char or greater than or equal to 127
(decimal).

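# untested
$len = strlen($the_input_data);
$space = ' ';
for ($i = 0; $i < $len; $i++)
{
    // Take one byte at a time and get its numeric value.
    $ch = substr($the_input_data, $i, 1);
    $value = ord($ch);
    // Keep printable ASCII (space through "~"); replace everything else.
    if ($value >= ord($space) and $value < 127)
    {
        echo chr($value);
    }
    else
    {
        echo '?';
    }
}
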
php has lots of functions that would do various parts of this, and
perhaps even the entire thing, so personally I would expect to do this
without any of the above. On the other hand, I know I sometimes need
to do it myself before I realize and properly appreciate what the
existing functions would be doing for me if I used them, so I show the
above.

--

This space not for rent.
Aug 18 '05 #2
On 18/08/2005 18:33, lk******@geocities.com wrote:
> I know I'm missing something obvious, but I looked hard at [the ord()
> function] and did not see the format of the return specified:

A number between 0 and 255. This is the range of characters represented
in PHP strings, as mentioned in the first paragraph of the description
of the string type.

[snip]
> So long as the returned value is between 0 and 127, I can treat it as
> ascii?

Not necessarily. It depends upon the origin of the string, and whether
that string is encoded.

Some encoding schemes have a direct correspondence between themselves
and ASCII in the 0 to 127 range, but not all. For instance, bytes in a
UTF-8 string may or may not map to 7-bit ASCII, whereas a UTF-16 string
will never do so (each two-byte (16-bit) word combines to form one code
point value).
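
To make the point about encodings concrete, here is a rough sketch that
prints the byte values ord() would see for the same kind of text in
different encodings. The helper name byte_values() is made up for this
example, and the strings are written as byte escapes so the result does not
depend on how the script file itself is saved:

```php
<?php
// Hypothetical helper: return the decimal value of every byte in $s.
function byte_values(string $s): array
{
    return array_map('ord', str_split($s));
}

$ascii = "e";          // plain ASCII, one byte
$utf8  = "\xC3\xA9";   // "é" encoded as UTF-8, two bytes, both above 127
$utf16 = "\x41\x00";   // "A" encoded as UTF-16LE, two bytes, both below 128

echo implode(' ', byte_values($ascii)), "\n";  // 101
echo implode(' ', byte_values($utf8)), "\n";   // 195 169
echo implode(' ', byte_values($utf16)), "\n";  // 65 0
```

Note that the UTF-16 example produces only values in the 0 to 127 range,
yet the second byte is not an ASCII character in its own right; it is half
of a 16-bit unit.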
> If, for some reason, I had to restrict my output to ascii, then any
> time I encountered a value outside of 0 to 127, then I'd have to
> replace it with something in range?

That would qualify as restricting the output, but I wouldn't say that
was sensible. On the face of it, you'd have no idea what you're
converting, nor what the consequences will be should you take that action.

Perhaps you should explain the background of this issue. What sort of
data is being received that's causing these problems? Characters from a
foreign language? Bizarre characters that seem out-of-place amongst
something entirely recognisable (a paragraph of English, for example)?
> [...] my RSS feeds continue to fail when users write weblog posts
> in Microsoft Word and then copy and paste their post to a form and
> input that. [...]

So exactly what's being sent that's invalid? Perhaps "smart" quotes that
haven't been converted to their Unicode counterparts (the characters
represented by the entities, &ldquo; and &rdquo;)?
> I don't care if I end up with a lot of garbage characters, that is
> fine, I just want the RSS feed to validate.


I would be more worried about the data. Validity might play a part in
that, but it's not the be-all and end-all. Indeed, well-formedness is
certainly critical with regards to XML, but what use is valid data if
it's complete garbage?

Mike

--
Michael Winter
Prefix subject with [News] before replying by e-mail.
Aug 18 '05 #3
> That would qualify as restricting the output, but I wouldn't say that
> was sensible. On the face of it, you'd have no idea what you're
> converting, nor what the consequences will be should you take that action.

RSS validators insist that output must declare a character encoding. I
decided I would go with UTF-8. However, sometimes my users write their
weblog posts in Microsoft Word, or Word Perfect, or MacWrite, or
OpenOffice, and then they copy and paste their entry to the input form,
and they hit input. What they've input is not UTF-8. And then my RSS
feeds fail validation, because they have characters in them that are
not UTF-8. I'm trying to get my RSS feeds to validate, no matter what
people input. I see that many services on the web have solved this
problem, but I haven't yet figured out how they do it. If I had more
resources, I could probably develop more extensive tests for character
encoding.

I keep trying to fix this problem but I never get it fixed, and I feel
like I've harassed people on comp.lang.php for quite a bit of help
already.

> Perhaps you should explain the background of this issue. What sort of
> data is being received that's causing these problems? Characters from a
> foreign language? Bizarre characters that seem out-of-place amongst
> something entirely recognisable (a paragraph of English, for example)?


Characters outside of my chosen character encoding are the problem. I'm
trying to scrub them out of the posts.
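
One possible alternative to scrubbing, sketched under two loud assumptions:
the mbstring extension is available, and anything that is not already valid
UTF-8 was pasted as Windows-1252 (the usual source of Word "smart" quotes).
Neither assumption comes from this thread, and the function name to_utf8()
is made up for the example:

```php
<?php
// Hypothetical helper: return $input as valid UTF-8.
// ASSUMPTION: input that is not valid UTF-8 is Windows-1252.
function to_utf8(string $input): string
{
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;  // already valid UTF-8, leave it untouched
    }
    // Re-encode the bytes from Windows-1252 to UTF-8.
    return mb_convert_encoding($input, 'UTF-8', 'Windows-1252');
}

$pasted = "\x93quoted\x94";  // Windows-1252 curly double quotes
$clean  = to_utf8($pasted);  // the same quotes, now as UTF-8 byte sequences
```

If the guess about the source encoding is wrong, the result will still be
valid UTF-8 but may show the wrong characters, so this only helps when the
real encoding is known or can be reasonably assumed.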

Aug 19 '05 #4
> I would be more worried about the data. Validity might play a part in
> that, but it's not the be-all and end-all. Indeed, well-formedness is
> certainly critical with regards to XML, but what use is valid data if
> it's complete garbage?


Frankly, I think I would be delighted if one of my users ended up with
an RSS feed where every single character was a garbage character. Maybe
then they'd finally listen to me and stop inputting stuff from
Microsoft Word, encoded in who-knows-what encoding.

Aug 19 '05 #5
You can do this far more efficiently with a regular expression. Something
like preg_replace('/[\x80-\xFF]/', '?', $s) should do.
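
A slightly extended sketch of the same idea that also replaces ASCII control
codes, since those were mentioned earlier in the thread as a problem for
XML. The function name ascii_only() is made up, and keeping tab, line feed
and carriage return is my own choice based on XML 1.0 permitting them:

```php
<?php
// Hypothetical helper: replace every byte that is not printable ASCII
// (space through "~"), except tab, line feed and carriage return.
// There is no /u modifier, so the pattern is applied byte by byte.
function ascii_only(string $s): string
{
    return preg_replace('/[^\x20-\x7E\t\n\r]/', '?', $s);
}

// UTF-8 "é" (two bytes), Windows-1252 curly quotes, and a BEL control code.
echo ascii_only("caf\xC3\xA9 \x93hi\x94\x07"), "\n";  // caf?? ?hi??
```

Because the replacement works on bytes, a multi-byte UTF-8 character turns
into one ? per byte rather than a single ?.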

Aug 19 '05 #6
