
How to allow special characters like ï, ù, acute e, etc.

Dear All,
I am working on a module that validates the provided CSV data in a text
format, which must be in a predefined format.
We check for:

1. Number of fields provided in the text file,

2. Text checks for max. length of the field & whether the field is
mandatory or optional
Example:
Text('Description', 100, optional=True)
Parameters: "Name of the field" = 'Description'
"Max length" = 100
"Optional" = True (the field is not mandatory)

3. valid-text expressions,
Example:
ValidText('Minor', '[yYnN]')

Parameters:
name =field name
regex =the regular expression y/Y for Yes & n/N for No

Recently we have been getting data where the name contains non-English
characters, for example ' ATHUMANIù ' and ' LUCIANA S. SENGïONGO '.

Using the Text function, these names are not validated as they contain
special or non-English characters (ï, ù). But the data is
correct.
Is there any function that can allow such special characters but not
numbers?

Secondly, if I were to get the data in Russian text, are there any
(language) packages available so that I can use the same module for
validation, such that I just have to import the package and the module
can be used for validating Russian or Japanese text?

Regards,
Sonal.

Sep 5 '06 #1
sonald wrote:
> Dear All,
> I am working on a module that validates the provided CSV data in a text
> format, which must be in a predefined format.
> We check for:
[snip]
> 3. valid-text expressions,
> Example:
> ValidText('Minor', '[yYnN]')
>
> Parameters:
> name =field name
> regex =the regular expression y/Y for Yes & n/N for No
>
> Recently we are getting data, where, the name contains non-English
> characters like: ' ATHUMANIù ', ' LUCIANA S. SENGïONGO '...etc

The offending characters are (unusually) lowercase in otherwise
uppercase strings; is this actual data or are you typing what you think
you see instead of copy/paste?

> Using the Text function, these names are not validated as they contain
> special characters or non-English characters (ï,ù). But the data is
> correct.

It would help a great deal if you were to tell us (1) what is the regex
that you are using (2) what encoding you believe/know the data is
written in (3) does your app call locale.setlocale() at start-up? If
the following guesses are wrong, please say so.

Guess (1) (a) you are using the pattern "[A-Za-z]" to check for
alphabetic characters (b) you are using the "\w" pattern to check for
alphanumeric characters and then using "[\d_]" to reject digits and
underscores.
Guess (2): "cp1252" or "latin1" or "unknown" -- all pretty much
equivalent :-)
Guess (3): No.

If guess (1b) is correct: the so-called "special" characters are not
being interpreted as alphabetic because \w in the re module is
ASCII-only unless the LOCALE or UNICODE flag is set. Here is what the
re docs have to say:
"""
\w
When the LOCALE and UNICODE flags are not specified, matches any
alphanumeric character and the underscore; this is equivalent to the
set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus
whatever characters are defined as alphanumeric for the current locale.
If UNICODE is set, this will match the characters [0-9_] plus whatever
is classified as alphanumeric in the Unicode character properties
database.
"""
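
A quick way to see the difference the quoted passage describes. This is only a sketch and it assumes Python 3, where str patterns are Unicode-aware by default and the re.ASCII flag restores the [a-zA-Z0-9_] behaviour; on the Python 2 of this thread you would instead pass re.UNICODE with a unicode pattern. The sample name is one of those from the question.

```python
import re

name = "ATHUMANIù"

# ASCII-only \w: the match stops just before the accented letter.
ascii_match = re.match(r"\w+", name, re.ASCII).group()

# Unicode-aware \w (the Python 3 default for str patterns): the whole name matches.
unicode_match = re.match(r"\w+", name).group()

print(ascii_match)
print(unicode_match)
```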

If you are not using (1b) or something like it, you need to move in
that direction.

Please bear this in mind: the locale is meant to be an
attribute/property of the *user* of your application; it is *not* meant
to be an attribute of the input data. Read the docs of the locale
module -- switching locales on the fly is *not* a good idea.

> Is there any function that can allow such special characters but not
> numbers...?

The righteous way of handling your problem is:
(1) decode each field in the incoming 8-bit string data to Unicode,
using what you know/guess to be the correct encoding setting. Then
string methods like isalpha() and isdigit() will use the Unicode
character properties and your "special" characters will be recognised
for what they are.
(2) use the UNICODE flag in re.
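
The two steps above can be sketched roughly as follows, assuming Python 3 and assuming the data really is cp1252 (guess (2) above). The names NAME_RE and is_valid_name are made up for illustration and are not part of the poster's module. The class [^\W\d_] means "any word character that is not a digit or an underscore"; the extra punctuation (space, full stop, apostrophe, hyphen) is an assumption about what a name may contain.

```python
import re

# Word characters other than digits and underscore, plus a little name punctuation.
# Hypothetical pattern: adjust the punctuation set to suit your data.
NAME_RE = re.compile(r"(?:[^\W\d_]|[ .'-])+")


def is_valid_name(raw, encoding="cp1252"):
    """Decode one 8-bit field, then check it against NAME_RE."""
    try:
        text = raw.decode(encoding)  # step (1): bytes -> Unicode
    except UnicodeDecodeError:
        return False
    # step (2): str patterns are Unicode-aware by default in Python 3
    return NAME_RE.fullmatch(text.strip()) is not None


print(is_valid_name(b" ATHUMANI\xf9 "))            # ù is byte 0xF9 in cp1252
print(is_valid_name(b" LUCIANA S. SENG\xefONGO "))  # ï is byte 0xEF in cp1252
print(is_valid_name(b"JOHN123"))
```

After decoding, the plain string methods behave the same way: "ù".isalpha() is true and "ù".isdigit() is false.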

> Secondly, if I were to get the data in Russian text, are there any
> (lingual) packages available so that I can use the same module for
> validation.

If you are getting the data as 8-bit strings, then the above approach
should still "work" at the basic level ... you decode it using 'cp1251'
or whatever, and the Cyrillic letter equivalents of "Ivanov" would pass
muster as alphabetic.
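
A small sketch of that, assuming Python 3; the bytes are built in the script itself as a stand-in for an incoming cp1251 field. It also shows why "work" is in quotes: decoding the same bytes with the wrong codec still passes an alphabetic check, so this kind of validation cannot tell you that you guessed the encoding correctly.

```python
# Stand-in for an 8-bit field as it would arrive in a cp1251 file.
raw = "Иванов".encode("cp1251")

text = raw.decode("cp1251")   # the right codec gives back the Cyrillic name
wrong = raw.decode("latin1")  # the wrong codec gives accented Latin letters

print(text, text.isalpha())
print(wrong, wrong.isalpha())
```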

> Such that I just have to import the package and the module can be used
> for validating Russian text or Japanese text....

Chinese, Japanese and Korean ("CJK") names are written natively in
characters that are not alphabetic in the linguistic sense. The number
of characters that could possibly be written in a name is rather large.
However the CJK characters are classified as Unicode category "Lo"
(Letter, other) and do actually match \w in re.
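
This is easy to check with the unicodedata module. A minimal sketch, assuming Python 3; the sample name is just an illustrative Japanese name written in kanji.

```python
import re
import unicodedata

name = "山田太郎"

# Each ideograph is reported as category "Lo" (Letter, other).
categories = [unicodedata.category(ch) for ch in name]

# Because they count as letters, \w and isalpha() both accept them.
matches_word = re.fullmatch(r"\w+", name) is not None

print(categories)
print(matches_word, name.isalpha())
```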

So with a minimal amount of work, you can provide a basic level of
validation across the board. Anything fancier needs local knowledge
[not a c.l.py topic].

Some points for consideration:
(1) You may wish not to reject digits irrevocably -- some jurisdictions
do permit people to change their legal name to "4567" or whatever.
(2) You are of course allowing space, hyphen and apostrophe as valid
characters in "English" names e.g. "mac Intyre", "O'Brien-Smith". Bear
in mind that other punctuation characters may be valid in other
languages -- see "local knowledge" above.
(3) If you are given data encoded as utf16* or utf32, you won't be able
to use the csv module (neither the ObjectCraft one nor the Python one
(read the docs)) directly. You will need to recode the file as UTF8,
read it using the csv module, and *then* decode each text field from
utf8.
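
For point (3), the recode-then-decode dance applies to the Python 2 csv module, which reads 8-bit strings. If you can assume Python 3, the csv module reads text, so the decoding is done by the text layer before csv sees the data. A sketch under that assumption; read_utf16_csv and the sample data are made up for illustration.

```python
import csv
import io


def read_utf16_csv(data):
    """Parse UTF-16 encoded CSV bytes into rows of already-decoded text."""
    # newline="" leaves line-ending handling to the csv module.
    stream = io.TextIOWrapper(io.BytesIO(data), encoding="utf-16", newline="")
    return list(csv.reader(stream))


# The "utf-16" codec writes a byte order mark, which the reader consumes.
sample = "name,minor\r\nLUCIANA S. SENGïONGO,N\r\n".encode("utf-16")
rows = read_utf16_csv(sample)
print(rows)
```

For a file on disk the equivalent would be open(path, encoding="utf-16", newline="") passed straight to csv.reader.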

HTH,
John

Sep 5 '06 #2
