Test for contiguous alphabet in character set

Stefan Krah

Hello,

I am currently writing code where it is convenient to convert
char [a-zA-Z] to int [0-25]. The conversion function relies on
a character set with contiguous alphabets.

int set_mesg(Key *key, char *s)
{
char *x;

if (strlen(s) != 3)
return 0;

x = s;
while (*x != '\0') {
if (!isalpha(*x))
return 0;
*x = tolower(*x);
x++;
}

x = s;
/* l_mesg, m_mesg, r_mesg are int */
key->l_mesg = *x++ - 'a';
key->m_mesg = *x++ - 'a';
key->r_mesg = *x - 'a';

return 1;
}

According to K&R2 (p43) contiguous alphabets cannot be safely assumed.
This function would test the lowercase alphabet:

int cont_lower_alpha(void)
{
char a[26] = "abcdefghijklmnopqrstuvwxyz";
int i;

for (i = 0; i < 26; i++)
if (a[i] - 'a' != i)
return 0;

return 1;
}

Is there an easier way of doing this?
Stefan Krah

Nov 14 '05 #1

Eric Sosman

Stefan Krah wrote:

Hello,

I am currently writing code where it is convenient to convert
char [a-zA-Z] to int [0-25]. The conversion function relies on
a character set with contiguous alphabets.

int set_mesg(Key *key, char *s)
{
char *x;

if (strlen(s) != 3)
return 0;

x = s;
while (*x != '\0') {
if (!isalpha(*x))

FYI: Use `isalpha((unsigned char)*x)', and similarly
for the other <ctype.h> functions.

return 0;
*x = tolower(*x);
x++;
}

x = s;
/* l_mesg, m_mesg, r_mesg are int */
key->l_mesg = *x++ - 'a';
key->m_mesg = *x++ - 'a';
key->r_mesg = *x - 'a';

return 1;
}

According to K&R2 (p43) contiguous alphabets cannot be safely assumed.
This function would test the lowercase alphabet:

int cont_lower_alpha(void)
{
char a[26] = "abcdefghijklmnopqrstuvwxyz";
int i;

for (i = 0; i < 26; i++)
if (a[i] - 'a' != i)
return 0;

return 1;
}

Is there an easier way of doing this?

This tests contiguity of the lower-case alphabet, and
upper-case could be tested in the same way. But if the
test says "discontiguous," then what? Your real problem,
I think, is not to determine whether the alphabets are
contiguous, but to find some code that will work correctly
even if they are not contiguous.
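
As a side note, the contiguity test itself can be generalized so that one function covers the upper-case alphabet as well. The following is only an illustrative sketch; is_contiguous() is a made-up name, not something from the original post.

```c
#include <stddef.h>

/* Hypothetical helper: returns nonzero if the characters of s have
   consecutive code values in the execution character set. */
int is_contiguous(const char *s)
{
    size_t i;

    if (s[0] == '\0')
        return 1;               /* an empty string is trivially contiguous */

    for (i = 1; s[i] != '\0'; i++)
        if (s[i] - s[0] != (int)i)
            return 0;

    return 1;
}
```

Calling it once with "abcdefghijklmnopqrstuvwxyz" and once with "ABCDEFGHIJKLMNOPQRSTUVWXYZ" checks both alphabets. The digits '0' to '9' are the one range the C standard guarantees to be contiguous, so is_contiguous("0123456789") should hold everywhere.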

One way would be to use the position of the character
in a reference string instead of the character's code. For
example, you could write

const char alphabet[] = "abcdefghijklmnopqrstuvwxyz";
...
key->l_mesg = strchr(alphabet, *x++) - alphabet;

As written, this is unsafe if there's any chance that
the target character might not be found in alphabet[], because
strchr() would then return NULL and `NULL - alphabet' is
nonsense. (You may think this cannot happen since the code
has eliminated all non-alphabetics and converted everything
to lower-case, but keep in mind that isalpha() and tolower()
are locale-dependent. Letters like å, ç, ñ, and þ are found
in many character sets, and may be considered lower-case
alphabetics in some locales -- so they would pass through the
earlier portions of your code only to be found missing from
the alphabet[] array.) You could call strchr() and check the
result for NULL before trying to subtract, or you could make
sure that strchr() always finds the target character:

char alphabetplus[] = "abcdefghijklmnopqrstuvwxyz?";
int pos;
...
alphabetplus[26] = *x;
pos = strchr(alphabetplus, *x++) - alphabetplus;
if (pos < 26)
key->l_mesg = pos;
else
return 0; /* unknown lower-case alphabetic */
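
The other option mentioned above, checking the result of strchr() for NULL before subtracting, might look roughly like the sketch below. The function name letter_pos() and the -1 error value are assumptions made for illustration only.

```c
#include <string.h>

/* Hypothetical helper: returns 0..25 for 'a'..'z', or -1 otherwise. */
int letter_pos(char c)
{
    static const char alphabet[] = "abcdefghijklmnopqrstuvwxyz";
    const char *p;

    /* strchr() treats the terminating '\0' as part of the string,
       so it would "find" it at index 26; reject it explicitly. */
    if (c == '\0')
        return -1;

    p = strchr(alphabet, c);
    if (p == NULL)
        return -1;              /* not one of the 26 letters */

    return (int)(p - alphabet);
}
```

The caller would then write something like key->l_mesg = letter_pos(*x++) and return 0 if the result is negative.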

If you intend to compute a large number of these message
codes, though, it is probably better to use a table:

static char code[1+UCHAR_MAX];
if (code['a'] == 0) {
/* initialize table on the first call */
const char alpha[] = "abcdefghijklmnopqrstuvwxyz";
int i;
for (i = 0; alpha[i] != '\0'; ++i) {
code[alpha[i]] = i + 1;
code[toupper(alpha[i])] = i + 1;
}
}
...
if (code[(unsigned char)*x] == 0)
return 0;
key->l_mesg = code[(unsigned char)*x] - 1;

(Note 1: Yes, I warned you to cast the argument of <ctype.h>
functions, yet I did not do so in the toupper() call. This
happens to be safe because I know that all the characters in
a..z are in the "basic execution" character set, and all these
are guaranteed to have non-negative code values. When you don't
have such knowledge of the input string, though, you must cast --
as in the references to the code[] array, although only the first
of those two is strictly necessary.)

(Note 2: Observe that the table-based method eliminates
the need to weed out non-alphabetics and convert case. All
letters outside a..z and A..Z will be detected by virtue of
their zero code[] values, and for all the rest you will have
code['a'] == code['A'], code['b'] == code['B'], and so on.)
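
For reference, here is one way the table-based fragment could be folded into a complete set_mesg(). This is a sketch under assumptions: the Key layout is guessed (the thread only says the three fields are int), and an upper-case string literal is used in place of toupper() so that the table does not depend on the locale.

```c
#include <limits.h>
#include <string.h>

/* Assumed layout: the thread only states that these fields are int. */
typedef struct {
    int l_mesg, m_mesg, r_mesg;
} Key;

/* Returns 1 and fills *key if s is exactly three letters, else 0. */
int set_mesg(Key *key, const char *s)
{
    static unsigned char code[1 + UCHAR_MAX];  /* 0 means "not a letter" */
    static const char lower[] = "abcdefghijklmnopqrstuvwxyz";
    static const char upper[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    int v[3];
    int i;

    if (code[(unsigned char)'a'] == 0) {
        /* initialize the table on the first call */
        for (i = 0; lower[i] != '\0'; ++i) {
            code[(unsigned char)lower[i]] = (unsigned char)(i + 1);
            code[(unsigned char)upper[i]] = (unsigned char)(i + 1);
        }
    }

    if (strlen(s) != 3)
        return 0;

    /* validate all three characters before touching *key */
    for (i = 0; i < 3; ++i) {
        v[i] = code[(unsigned char)s[i]];
        if (v[i] == 0)
            return 0;
    }

    key->l_mesg = v[0] - 1;
    key->m_mesg = v[1] - 1;
    key->r_mesg = v[2] - 1;
    return 1;
}
```

Unlike the original set_mesg(), this version does not modify the input string, which is why the parameter can be const.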

--
Er*********@sun.com

Nov 14 '05 #2

Malcolm

"Stefan Krah" <sf**@bigfoot.com> wrote

int cont_lower_alpha(void)
{
char a[26] = "abcdefghijklmnopqrstuvwxyz";
int i;

for (i = 0; i < 26; i++)
if (a[i] - 'a' != i)
return 0;

return 1;
}

Is there an easier way of doing this?

You could do conditional compilation with
#if 'z' == 'a' + 25

Strictly speaking, it isn't foolproof, because a perverse implementer could
make the intervening letters non-contiguous. But there are only a few
character sets out there, and it is probably much more likely that your
program will break in some other way than that someone will bring out a
character set that breaks your program.
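
A sketch of how that conditional compilation might be combined with a portable fallback follows. The macro LOWER_LOOKS_CONTIGUOUS and the function lower_index() are hypothetical names. Note two caveats: the test only compares the endpoints, and the value of a character constant inside #if is implementation-defined with respect to its value in ordinary expressions.

```c
#include <string.h>

/* Compile-time endpoint check, as suggested above. */
#if 'z' == 'a' + 25
#define LOWER_LOOKS_CONTIGUOUS 1
#else
#define LOWER_LOOKS_CONTIGUOUS 0
#endif

/* Hypothetical helper: 0..25 for 'a'..'z', -1 for anything else. */
int lower_index(char c)
{
#if LOWER_LOOKS_CONTIGUOUS
    /* fast path: plain arithmetic on the character code */
    if (c >= 'a' && c <= 'z')
        return c - 'a';
    return -1;
#else
    /* fallback: look the character up in a reference string */
    static const char alphabet[] = "abcdefghijklmnopqrstuvwxyz";
    const char *p = (c != '\0') ? strchr(alphabet, c) : NULL;
    return p ? (int)(p - alphabet) : -1;
#endif
}
```

Both branches return the same results for the 26 basic letters, so callers do not need to know which one was compiled in.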

Nov 14 '05 #3

Stefan Krah

* Eric Sosman <er*********@sun.com> wrote:
[I'm quoting a lot, it will be needed later]

Stefan Krah wrote:
Hello,

I am currently writing code where it is convenient to convert
char [a-zA-Z] to int [0-25]. The conversion function relies on
a character set with contiguous alphabets.

int set_mesg(Key *key, char *s)
{
char *x;

if (strlen(s) != 3)
return 0;

x = s;
while (*x != '\0') {
if (!isalpha(*x))

FYI: Use `isalpha((unsigned char)*x)', and similarly
for the other <ctype.h> functions.

return 0;
*x = tolower(*x);
x++;
}

x = s;
/* l_mesg, m_mesg, r_mesg are int */
key->l_mesg = *x++ - 'a';
key->m_mesg = *x++ - 'a';
key->r_mesg = *x - 'a';

return 1;
}

According to K&R2 (p43) contiguous alphabets cannot be safely assumed.
This function would test the lowercase alphabet:

int cont_lower_alpha(void)
{
char a[26] = "abcdefghijklmnopqrstuvwxyz";
int i;

for (i = 0; i < 26; i++)
if (a[i] - 'a' != i)
return 0;

return 1;
}

Is there an easier way of doing this?

This tests contiguity of the lower-case alphabet, and
upper-case could be tested in the same way. But if the
test says "discontiguous," then what? Your real problem,
I think, is not to determine whether the alphabets are
contiguous, but to find some code that will work correctly
even if they are not contiguous.

Your observations are absolutely correct. My vague idea was to use
the test as a safety net for the (potentially) rare cases in which
a discontiguous alphabet is encountered.

Obviously it is silly not to write portable code in the first place.

One way would be to use the position of the character
in a reference string instead of the character's code. For
example, you could write

const char alphabet[] = "abcdefghijklmnopqrstuvwxyz";
...
key->l_mesg = strchr(alphabet, *x++) - alphabet;

As written, this is unsafe if there's any chance that
the target character might not be found in alphabet[], because
strchr() would then return NULL and `NULL - alphabet' is
nonsense. (You may think this cannot happen since the code
has eliminated all non-alphabetics and converted everything
to lower-case, but keep in mind that isalpha() and tolower()
are locale-dependent. Letters like å, ç, ñ, and þ are found
in many character sets, and may be considered lower-case
alphabetics in some locales -- so they would pass through the
earlier portions of your code only to be found missing from
the alphabet[] array.) You could call strchr() and check the

I was aware of the locale issue, but I thought the default locale
is "C" unless explicitly set:

SETLOCALE(3) (Linux Programmer's Manual):
| On startup of the main program, the portable "C" locale
| is selected as default.

C Standard Draft -- August 3, 1998:
| At program startup, the equivalent of
| setlocale(LC_ALL, "C");
| is executed.

isalpha() is equivalent to (isupper(c) || islower(c)) and those
two (in the C locale) only return true for [a-zA-Z].

So isalpha() shouldn't return false positives or am I missing something?

If you intend to compute a large number of these message
codes, though, it is probably better to use a table:

static char code[1+UCHAR_MAX];
if (code['a'] == 0) {
/* initialize table on the first call */
const char alpha[] = "abcdefghijklmnopqrstuvwxyz";
int i;
for (i = 0; alpha[i] != '\0'; ++i) {
code[alpha[i]] = i + 1;
code[toupper(alpha[i])] = i + 1;
}
}
...
if (code[(unsigned char)*x] == 0)
return 0;
key->l_mesg = code[(unsigned char)*x] - 1;

I like this one even though conversion doesn't have to be fast in my case.

(Note 1: Yes, I warned you to cast the argument of <ctype.h>
functions, yet I did not do so in the toupper() call. This
happens to be safe because I know that all the characters in
a..z are in the "basic execution" character set, and all these
are guaranteed to have non-negative code values. When you don't
have such knowledge of the input string, though, you must cast --
as in the references to the code[] array, although only the first
of those two is strictly necessary.)

(Note 2: Observe that the table-based method eliminates
the need to weed out non-alphabetics and convert case. All
letters outside a..z and A..Z will be detected by virtue of
their zero code[] values, and for all the rest you will have
code['a'] == code['A'], code['b'] == code['B'], and so on.)

Thanks a lot for your detailed comments,

Stefan Krah

Nov 14 '05 #4

Jens.Toerring

Stefan Krah <sf**@bigfoot.com> wrote:

I was aware of the locale issue, but I thought the default locale
is "C" unless explicitly set:

Well, until you use some library that changes it behind your back.
And, of course, that never happens on your own machine, but always
on a machine some 1000 km away. This resulted in a _lot_ of head
scratching until I figured out what was going on.... It's astonishing
in which places a wrongly set locale can mess things up ;-)
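
One defensive measure, sketched here under assumptions, is to query the current LC_CTYPE locale before relying on isalpha() and friends. The helper name ctype_locale_is_c() is made up, and the comparison assumes the implementation reports the C locale under the name "C", which is common but not strictly guaranteed.

```c
#include <locale.h>
#include <string.h>

/* Hypothetical helper: nonzero if the current LC_CTYPE locale is
   reported as "C". */
int ctype_locale_is_c(void)
{
    /* Passing NULL queries the current locale without changing it. */
    const char *name = setlocale(LC_CTYPE, NULL);

    return name != NULL && strcmp(name, "C") == 0;
}
```

Code that depends on the C-locale behaviour of the <ctype.h> functions could call this first and fall back to a locale-independent lookup when it returns 0.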

Regards, Jens
--
\ Jens Thoms Toerring ___ Je***********@physik.fu-berlin.de
\__________________________ http://www.toerring.de

Nov 14 '05 #5

SM Ryan

Stefan Krah <sf**@bigfoot.com> wrote:
# Hello,
#
# I am currently writing code where it is convenient to convert
# char [a-zA-Z] to int [0-25]. The conversion function relies on
# a character set with contiguous alphabets.

#include <limits.h>
#include <string.h>

char map0[CHAR_MAX-CHAR_MIN+1]; memset(map0,26,sizeof map0);
char *map = map0-CHAR_MIN; /* pointer into map0, so negative char values index safely */

int i; for (i=0; i<=25; i++) map["abcdefghijklmnopqrstuvwxyz"[i]] = i;

Anywhere you want to map letters to the integers, use map[lettercharacter].
Whether the letters are contiguous or positive or negative can be ignored.
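
The same table idea can be wrapped in a function, as in the sketch below. Here the CHAR_MIN offset is applied at lookup time rather than through an offset pointer. The names map_letter() and NOT_A_LETTER are assumptions for illustration; 26 is the filler value used in the fragment above.

```c
#include <limits.h>
#include <string.h>

#define NOT_A_LETTER 26   /* hypothetical name for the filler value */

static char map0[CHAR_MAX - CHAR_MIN + 1];
static int map_ready = 0;

/* Hypothetical wrapper: 0..25 for 'a'..'z', NOT_A_LETTER otherwise. */
int map_letter(char c)
{
    if (!map_ready) {
        static const char alphabet[] = "abcdefghijklmnopqrstuvwxyz";
        int i;

        /* fill with the sentinel, then overwrite the 26 letter slots */
        memset(map0, NOT_A_LETTER, sizeof map0);
        for (i = 0; i < 26; i++)
            map0[alphabet[i] - CHAR_MIN] = (char)i;
        map_ready = 1;
    }

    /* c - CHAR_MIN lies in 0..CHAR_MAX-CHAR_MIN whether or not plain
       char is signed, so the index is always in range. */
    return map0[c - CHAR_MIN];
}
```

Upper-case letters map to NOT_A_LETTER here, as in the fragment above; adding a second loop over "ABCDEFGHIJKLMNOPQRSTUVWXYZ" would fold case.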

--
SM Ryan http://www.rawbw.com/~wyrmwif/
So basically, you just trace.

Nov 14 '05 #6
