split string with hieroglyphs

Belize

Hi.
Essence of problem in the following:
Here is lines in utf8 of this form "BZ?ãƒ„ãƒ¼ãƒªTV%ãƒ„ã‚*DVD"
Is it possible to split them into the fragments that contain only latin
printable symbols (aplhabet + "?#" etc)
and fragments with the hieroglyphs, so it could be like this
['BZ?', '\xe3\x83\x84\xe3\x83\xbc\xe3\x83\xaa', 'TV%',
'\xe3\x83\x84\xe3\x82\xad', 'DVD'] ?
Then, after translate of hieroglyphs, necessary to join line, so it
could be like this
"BZ? navigation TV% display DVD"
Thanks.

Dec 24 '06 #1


Steven D'Aprano

On Sat, 23 Dec 2006 19:28:48 -0800, Belize wrote:

> Hi.
> Essence of problem in the following:
> Here is lines in utf8 of this form "BZ???TV%??DVD"
> Is it possible to split them into the fragments that contain only latin
> printable symbols (alphabet + "?#" etc)

Of course it is possible, but there probably isn't a built-in function to
do it. Write a program to do it.

> and fragments with the hieroglyphs, so it could be like this
> ['BZ?', '\xe3\x83\x84\xe3\x83\xbc\xe3\x83\xaa', 'TV%',
> '\xe3\x83\x84\xe3\x82\xad', 'DVD'] ?

def split_fragments(s):
    """Split a string s into Latin and non-Latin fragments."""
    # Warning -- untested.
    fragments = []  # hold the string fragments
    latin = []  # temporary accumulator for Latin fragment
    nonlatin = []  # temporary accumulator for non-Latin fragment
    for c in s:
        if islatin(c):
            if nonlatin:
                fragments.append(''.join(nonlatin))
                nonlatin = []
            latin.append(c)
        else:
            if latin:
                fragments.append(''.join(latin))
                latin = []
            nonlatin.append(c)
    # Flush whichever accumulator still holds the final fragment.
    if latin:
        fragments.append(''.join(latin))
    elif nonlatin:
        fragments.append(''.join(nonlatin))
    return fragments

I leave it to you to write the function islatin.

Hints:

There is a Perl module to guess the encoding:
http://search.cpan.org/~dankogai/Enc...ncode/Guess.pm

You might like to read this too:
http://effbot.org/pyfaq/what-does-un...e-128-mean.htm

I also recommend you read this recipe:
http://aspn.activestate.com/ASPN/Coo.../Recipe/251871

And look at the module unicodedata.
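
As a rough illustration of what unicodedata offers here (Python 3 sketch): every assigned character has an official name, and the name usually begins with its script. The helper names describe and is_katakana are made up for this example, and matching on the name prefix is a heuristic, not an official script test.

```python
import unicodedata


def describe(ch):
    """Return the official Unicode name of ch, or 'UNNAMED' if it has none."""
    return unicodedata.name(ch, 'UNNAMED')


def is_katakana(ch):
    """Guess whether ch is katakana by looking at its Unicode name."""
    return describe(ch).startswith('KATAKANA')


for ch in 'Zツー':
    print(repr(ch), describe(ch), is_katakana(ch))
```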

> Then, after translate of hieroglyphs, necessary to join line, so it
> could be like this
> "BZ? navigation TV% display DVD"

def join_fragments(fragments):
    accumulator = []
    for fragment in fragments:
        if islatin(fragment):
            accumulator.append(fragment)
        else:
            accumulator.append(translate_hieroglyphics(fragment))
    # Join with spaces so the result matches the requested output.
    return ' '.join(accumulator)

I leave it to you to write the function translate_hieroglyphics.
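
A minimal sketch of the translate-and-join step in Python 3, using an in-memory dictionary as a stand-in for a real translation source. TRANSLATIONS is hypothetical: its two entries are simply the pairings implied by the example in the question, not verified translations, and islatin again assumes printable ASCII.

```python
# Hypothetical lookup table; the two entries are the pairings implied by the
# example in the question, not verified translations.
TRANSLATIONS = {'ツーリ': 'navigation', 'ツキ': 'display'}


def islatin(text):
    """Assumption: 'Latin' means printable ASCII, space through tilde."""
    return all(' ' <= ch <= '~' for ch in text)


def translate_hieroglyphics(fragment):
    """Look the fragment up; fall back to the original text if unknown."""
    return TRANSLATIONS.get(fragment, fragment)


def join_fragments(fragments):
    """Translate the non-Latin fragments and join everything with spaces."""
    return ' '.join(
        f if islatin(f) else translate_hieroglyphics(f) for f in fragments
    )


fragments = ['BZ?', 'ツーリ', 'TV%', 'ツキ', 'DVD']
print(join_fragments(fragments))  # BZ? navigation TV% display DVD
```

Falling back to the untranslated fragment keeps unknown words visible in the output instead of silently dropping them.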

--
Steven.

Dec 24 '06 #2

Belize

Steven, thanks! Very nice algorithm.
Here is the code:
#!/usr/bin/env python
# -*- coding: utf_8 -*-

# Thanks Steven D'Aprano for hints

import unicodedata
import MySQLdb

#MySQL variables
mysql_host = "localhost"
mysql_user = "dict"
mysql_password = "passwd"
mysql_db = "dictionary"

try:
mysql_conn = MySQLdb.connect(mysql_host, mysql_user, mysql_password,
mysql_db)
cur = mysql_conn.cursor()
cur.execute("""SET NAMES UTF8""")
except:
print "unable insert to MySQL, check connection"

jap_text = "BZãƒ„ãƒ¼ãƒªTVãƒ„ã‚*DVD?"
jap_text = unicode(jap_text, 'utf-8') # fight with
full-width, half-width katakana madness :-)
jap_text = unicodedata.normalize('NFKC', jap_text) #
jap_text = jap_text.encode('utf-8') #

def translate_hieroglyph(jap_text):
    eng_text = ""
    # Pass the text as a query parameter rather than formatting it into the SQL.
    mysql_translate_query = ("SELECT Eng FROM dictionary WHERE Jis=%s "
                             "COLLATE utf8_unicode_ci LIMIT 1")
    cur.execute(mysql_translate_query, (jap_text,))
    mysql_trans_data = cur.fetchall()
    for line in mysql_trans_data:
        eng_text = line[0]
    if not eng_text:
        # No translation found: keep the original text.
        eng_text = jap_text
    return eng_text

def islatin(s):
    # A byte string counts as Latin if it decodes as pure ASCII.
    try:
        unicode(s, 'ascii')
    except UnicodeError:
        return False
    else:
        return True

def split_fragments(s):
    fragments = []
    latin = []
    nonlatin = []
    for c in s:
        if islatin(c):
            if nonlatin:
                fragments.append(''.join(nonlatin))
                nonlatin = []
            latin.append(c)
        else:
            if latin:
                fragments.append(''.join(latin))
                latin = []
            nonlatin.append(c)
    # without this we lose last fragment
    if latin:
        fragments.append(''.join(latin))
    elif nonlatin:
        fragments.append(''.join(nonlatin))
    return fragments

fragments = split_fragments(jap_text)

def join_fragments(fragments):
    accumulator = []
    for fragment in fragments:
        if islatin(fragment):
            accumulator.append(fragment)
        else:
            accumulator.append(translate_hieroglyph(fragment))
    return ' '.join(accumulator)

print join_fragments(fragments)

home@my ~/Src/Code $ python translate.py
BZ navigation TV display DVD?

Works as needed :-) Thanks again!
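
For readers wondering what the NFKC step in the script buys, here is a small Python 3 sketch. NFKC folds compatibility variants such as half-width katakana and full-width Latin letters into their standard forms, so a dictionary lookup only has to store one spelling. The helper name normalize_width is invented for this example.

```python
import unicodedata


def normalize_width(text):
    """Fold full-width and half-width variants into their standard forms."""
    return unicodedata.normalize('NFKC', text)


half_width = '\uff82'        # HALFWIDTH KATAKANA LETTER TU
full_width = '\uff34\uff36'  # FULLWIDTH LATIN CAPITAL LETTERS T and V

print(normalize_width(half_width) == '\u30c4')  # True: ordinary katakana TU
print(normalize_width(full_width) == 'TV')      # True: plain ASCII letters
```

Normalizing before splitting also matters for the Latin test: full-width "ＴＶ" is not ASCII until NFKC has been applied.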

Dec 24 '06 #3
