How to allow special characters like ï, ù, acute e, etc.

Dear All,
I am working on a module that validates the provided CSV data in a text
format, which must be in a predefined format.
We check for the :

1. Number of fields provided in the text file,

2. Text checks for max. length of the field & whether the field is
mandatory or optional
Example:
Text('Description', 100, optional=True)
Parameters: "Name of the field" ='Description'
"Max length " =100
"Optional" ='True' (the field is not mandatory)

3. valid-text expressions,
Example:
ValidText('Minor', '[yYnN]')

Parameters:
name =field name
regex =the regular expression y/Y for Yes & n/N for No

Recently we have been getting data where the name contains non-English
characters like: ' ATHUMANIù ', ' LUCIANA S. SENGïONGO '...etc

Using the Text function, these names fail validation because they contain
special characters or non-English characters (ï,ù). But the data is
correct.
Is there any function that can allow such special characters but not
numbers...?

Secondly, if I were to get the data in Russian text, are there any
(lingual) packages available so that I can use the same module for
validation?
Such that I just have to import the package and the module can be used
for validating Russian text or Japanese text....

Regards,
Sonal.

Sep 5 '06 #1
sonald wrote:
> Dear All,
> I am working on a module that validates the provided CSV data in a text
> format, which must be in a predefined format.
> We check for the :

[snip]

> 3. valid-text expressions,
> Example:
> ValidText('Minor', '[yYnN]')
>
> Parameters:
> name =field name
> regex =the regular expression y/Y for Yes & n/N for No
>
> Recently we have been getting data where the name contains non-English
> characters like: ' ATHUMANIù ', ' LUCIANA S. SENGïONGO '...etc

The offending characters are (unusually) lowercase in otherwise
uppercase strings; is this actual data or are you typing what you think
you see instead of copy/paste?

> Using the Text function, these names fail validation because they contain
> special characters or non-English characters (ï,ù). But the data is
> correct.

It would help a great deal if you were to tell us (1) what is the regex
that you are using (2) what encoding you believe/know the data is
written in (3) does your app call locale.setlocale() at start-up? If
the following guesses are wrong, please say so.

Guess (1) (a) you are using the pattern "[A-Za-z]" to check for
alphabetic characters (b) you are using the "\w" pattern to check for
alphanumeric characters and then using "[\d_]" to reject digits and
underscores.
Guess (2): "cp1252" or "latin1" or "unknown" -- all pretty much
equivalent :-)
Guess (3): No.

If guess (1b) is correct: the so-called "special" characters are not
being interpreted as alphabetic because what \w matches in the re module
depends on the LOCALE and UNICODE flags, and you are presumably using
neither. Here is what the re docs have to say:
"""
\w
When the LOCALE and UNICODE flags are not specified, matches any
alphanumeric character and the underscore; this is equivalent to the
set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus
whatever characters are defined as alphanumeric for the current locale.
If UNICODE is set, this will match the characters [0-9_] plus whatever
is classified as alphanumeric in the Unicode character properties
database.
"""

If you are not using (1b) or something like it, you need to move in
that direction.
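
To make the quoted documentation concrete, here is a minimal sketch. It
assumes Python 3, where str patterns are Unicode-aware by default and the
re.ASCII flag restores the plain [a-zA-Z0-9_] behaviour; on the Python 2
of this thread you would instead pass re.UNICODE and match against
unicode objects. The sample name is taken from the question.

```python
import re

# One of the names from the question; \u00ef is the letter i-diaeresis.
name = "SENG\u00efONGO"

# ASCII-only \w: equivalent to [a-zA-Z0-9_], so the match stops at the
# first non-ASCII letter.
ascii_word = re.compile(r"\w+", re.ASCII)

# Unicode-aware \w: the default for str patterns in Python 3.
unicode_word = re.compile(r"\w+")

ascii_match = ascii_word.match(name).group()
unicode_match = unicode_word.match(name).group()

print(repr(ascii_match))    # only the leading ASCII letters
print(repr(unicode_match))  # the whole name
```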

Please bear this in mind: the locale is meant to be an
attribute/property of the *user* of your application; it is *not* meant
to be an attribute of the input data. Read the docs of the locale
module -- switching locales on the fly is *not* a good idea.

> Is there any function that can allow such special characters but not
> numbers...?

The righteous way of handling your problem is:
(1) decode each field in the incoming 8-bit string data to Unicode,
using what you know/guess to be the correct encoding setting. Then
string methods like isalpha() and isdigit() will use the Unicode
character properties and your "special" characters will be recognised
for what they are.
(2) use the UNICODE flag in re.
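
The two steps above can be sketched roughly as follows. This is only an
illustration under stated assumptions: Python 3, an input encoding
guessed to be cp1252, and hypothetical names (LETTERS_ONLY,
is_alphabetic_field) that are not part of the poster's module. The
pattern [^\W\d_] is a common idiom for "whatever \w matches, minus
digits and underscore".

```python
import re

# Letters only: everything \w accepts except digits and the underscore.
# re.UNICODE is already the default for str patterns in Python 3; it is
# spelled out here to mirror step (2).
LETTERS_ONLY = re.compile(r"[^\W\d_]+", re.UNICODE)


def is_alphabetic_field(raw: bytes, encoding: str = "cp1252") -> bool:
    """Step (1): decode the 8-bit field; step (2): test it with a Unicode regex."""
    text = raw.decode(encoding).strip()
    return bool(text) and LETTERS_ONLY.fullmatch(text) is not None


# b"\xf9" is u-grave in cp1252 (and in latin1).
accented_ok = is_alphabetic_field(b" ATHUMANI\xf9 ")
digit_rejected = is_alphabetic_field(b"ATHUMANI9")

# The decoded string's own methods use the Unicode character properties too.
isalpha_ok = b"ATHUMANI\xf9".decode("cp1252").isalpha()

print(accented_ok, digit_rejected, isalpha_ok)
```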

> Secondly, if I were to get the data in Russian text, are there any
> (lingual) packages available so that I can use the same module for
> validation?

If you are getting the data as 8-bit strings, then the above approach
should still "work" at the basic level ... you decode it using 'cp1251'
or whatever, and the Cyrillic letter equivalents of "Ivanov" would pass
muster as alphabetic.
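
As a small illustration of that point, assuming Python 3 and assuming
the data really is cp1251 (the bytes below are produced in the script
itself purely to simulate an incoming 8-bit field):

```python
import re

# Simulate an 8-bit field as it might arrive from a cp1251-encoded file.
raw = "Иванов".encode("cp1251")

# Decode with the encoding you know (or guess) the data to be in.
text = raw.decode("cp1251")

# Both the string method and a Unicode-aware regex accept Cyrillic letters.
passes_isalpha = text.isalpha()
passes_regex = re.fullmatch(r"[^\W\d_]+", text) is not None

print(len(raw), text, passes_isalpha, passes_regex)
```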

> Such that I just have to import the package and the module can be used
> for validating Russian text or Japanese text....

Chinese, Japanese and Korean ("CJK") names are written natively in
characters that are not alphabetic in the linguistic sense. The number
of characters that could possibly be written in a name is rather large.
However the CJK characters are classified as Unicode category "Lo"
(Letter, other) and do actually match \w in re.
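
A quick way to check this for yourself, assuming Python 3 and using an
arbitrary Japanese name written in kanji as sample data:

```python
import re
import unicodedata

# Sample data only: a Japanese name written in four kanji.
name = "山田太郎"

# Look up the Unicode general category of each character.
categories = [unicodedata.category(ch) for ch in name]

# \w is Unicode-aware by default for str patterns in Python 3.
matches_word = re.fullmatch(r"\w+", name) is not None

print(categories, matches_word, name.isalpha())
```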

So with a minimal amount of work, you can provide a basic level of
validation across the board. Anything fancier needs local knowledge
[not a c.l.py topic].

Some points for consideration:
(1) You may wish not to reject digits irrevocably -- some jurisdictions
do permit people to change their legal name to "4567" or whatever.
(2) You are of course allowing space, hyphen and apostrophe as valid
characters in "English" names e.g. "mac Intyre", "O'Brien-Smith". Bear
in mind that other punctuation characters may be valid in other
languages -- see "local knowledge" above.
(3) If you are given data encoded as utf16* or utf32, you won't be able
to use the csv module (neither the ObjectCraft one nor the Python one
(read the docs)) directly. You will need to recode the file as UTF8,
read it using the csv module, and *then* decode each text field from
utf8.
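
Points (2) and (3) can be sketched together. Everything named here
(NAME, read_names, the sample rows) is hypothetical. The sketch assumes
Python 3, where the csv module works on text rather than 8-bit strings,
so the "recode to UTF8, read, then decode" sequence described for the
Python of this thread reduces to decoding the data before handing it to
csv.

```python
import csv
import io
import re

# Runs of letters, joined by a single space, hyphen or apostrophe.
# Digits, underscores and all other punctuation are rejected.
NAME = re.compile(r"[^\W\d_]+(?:[ '\-][^\W\d_]+)*")


def read_names(data: bytes, encoding: str = "utf-16"):
    """Decode the whole input first, then let csv split it into fields.

    Yields (name, is_valid) for the first field of every non-empty row.
    """
    text = data.decode(encoding)
    for row in csv.reader(io.StringIO(text, newline="")):
        if not row:
            continue
        name = row[0].strip()
        yield name, NAME.fullmatch(name) is not None


# Simulated UTF-16 input; encode("utf-16") writes a byte order mark,
# which decode("utf-16") consumes again.
sample = "O'Brien-Smith,Y\r\nmac Intyre,N\r\nR2D2,Y\r\n".encode("utf-16")
results = list(read_names(sample))
print(results)
```

Note that a name such as "LUCIANA S. SENGïONGO" from the question would
be rejected by this pattern because of the full stop, which is exactly
the kind of extra punctuation that point (2) warns about.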

HTH,
John

Sep 5 '06 #2
