
Python for Vcard Parsing in UTF16

Greetings -

A recent Perl experiment hasn't turned out so well, which has piqued my
interest in Python. The project is this: take a Vcard file exported from
Apple's Addressbook and use a language that is good at parsing text to convert
it into a mutt alias file. There are better ways to use Mutt with Mac's
addressbook, but I want to be able to periodically convert my working
addressbook file into an alias file I can then transfer across all my different
machines - two Macs, two Linux, and one FreeBSD. It's basically a couple of
regexes that look for FN: followed by a name and convert all the words of the
name into a single structure separated by underscores, followed by the email
addresses. You would wind up with

alias Linus_Torvalds Linus Torvalds <lt@linux.com>
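
The conversion described above can be sketched in Python 3 roughly as follows. This is only an illustration under assumptions: the function name `vcard_to_aliases` and both regexes are made up for this example, the input is assumed to be already decoded to text, and email lines are assumed to look like `EMAIL;type=INTERNET:address`, optionally with a group prefix such as `item1.`.

```python
import re

# Assumed line shapes: "FN:Full Name" and "[group.]EMAIL;params:address".
FN_RE = re.compile(r'^FN:(.+)$', re.IGNORECASE)
EMAIL_RE = re.compile(r'^(?:[\w-]+\.)?EMAIL[^:]*:(.+)$', re.IGNORECASE)


def vcard_to_aliases(text):
    """Return one mutt alias line per email address found in vCard text."""
    aliases = []
    name = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith('BEGIN:VCARD'):
            name = None  # a new card starts; forget the previous name
            continue
        m = FN_RE.match(line)
        if m:
            name = m.group(1).strip()
            continue
        m = EMAIL_RE.match(line)
        if m and name:
            key = '_'.join(name.split())  # "Linus Torvalds" -> "Linus_Torvalds"
            aliases.append('alias %s %s <%s>' % (key, name, m.group(1).strip()))
    return aliases
```

A contact with several addresses yields several lines with the same alias key, so a real script would have to decide which address wins or number the keys.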

To me this was a natural task for Perl. Turns out however, there's a catch.
Apple exports the file in UTF-16 to ensure anyone with Chinese characters in
their addressbook gets a legitimate Vcard file. And of course Perl somewhat
chokes on UTF. I've found several ways to do it that involve complicated
downloads and installations of Perl modules, but that defeats the purpose of
making it simple. In an ideal world you should be able to say "try this cool
script" and be done with it. Once you have to say "go to CPAN, download and
compile this module, then ..." it gets less exciting.

I know nothing about Python except that it interests me and has interested me
since I first learned the Rekall database frontend (Linux) runs on it. I just
ordered Learning Python and if that works out satisfactorily I'm going to go
back for Programming Python. In the meantime, I thought I would pose the
question to this newsgroup: would Python be useful for a parsing exercise like
this one?
Apr 21 '07 #1
4 Replies


R Wood <rw***@therandymon.com> wrote:
> ...
> alias Linus_Torvalds Linus Torvalds <lt@linux.com>
>
> To me this was a natural task for Perl. Turns out however, there's a catch.
> Apple exports the file in UTF-16 to ensure anyone with Chinese characters in
> their addressbook gets a legitimate Vcard file. And of course Perl somewhat
> chokes on UTF. I've found several ways to do it that involve complicated
> downloads and installations of Perl modules, but that defeats the purpose of
> making it simple. In an ideal world you should be able to say "try this cool
> script" and be done with it. Once you have to say "go to CPAN, download and
> compile this module, then ..." it gets less exciting.
>
> I know nothing about Python except that it interests me and has interested me
> since I first learned the Rekall database frontend (Linux) runs on it. I just
> ordered Learning Python and if that works out satisfactorily I'm going to go
> back for Programming Python. In the meantime, I thought I would pose the
> question to this newsgroup: would Python be useful for a parsing exercise like
> this one?

Sure, Python and Perl (and Ruby) should be equally suitable for the
task, so, if Python appears more suitable by having built-in unicode
capabilities, go for it. I'm a bit uncertain about the UTF-16 export
though; I know some applications do use it (e.g., Microsoft Entourage),
but I thought Apple's Address Book didn't, and, having just tried a
VCard export from mine, it looks quite ASCII to me. Maybe you've set
some kind of preference, or...?
Alex
Apr 22 '07 #2

Alex Martelli wrote:
> R Wood <rw***@therandymon.com> wrote:
> ...
>> alias Linus_Torvalds Linus Torvalds <lt@linux.com>
>>
>> To me this was a natural task for Perl. Turns out however, there's a
>> catch. Apple exports the file in UTF-16 to ensure anyone with Chinese
>> characters in their addressbook gets a legitimate Vcard file. And of
>> course Perl somewhat chokes on UTF.
>
> Sure, Python and Perl (and Ruby) should be equally suitable for the
> task, so, if Python appears more suitable by having built-in unicode
> capabilities, go for it. I'm a bit uncertain about the UTF-16 export
> though; I know some applications do use it (e.g., Microsoft Entourage),
> but I thought Apple's Address Book didn't, and, having just tried a
> VCard export from mine, it looks quite ASCII to me. Maybe you've set
> some kind of preference, or...?
> Alex

I did the same thing. Apple's clever. If your address book doesn't have any
higher characters, i.e. nothing but ASCII, it will export your address book in
ASCII. But if you have anything else (in my case, Spanish, French, and
Italian) it goes for UTF-16. I first thought it was UTF-8, but realized that
since Apple supports all sorts of Asian languages really well they need UTF-16
to deal with it, and importing the exported file into jEdit using UTF-16
encoding confirmed that's what it is.
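
The same check can be done without an editor by looking at the first bytes of the file. The Python 3 sketch below is only an illustration: the function name `guess_vcard_encoding` is made up, and it assumes the UTF-16 export starts with a byte order mark, which this thread does not confirm. UTF-32 is not handled.

```python
import codecs


def guess_vcard_encoding(data):
    """Guess the encoding of raw vCard bytes from a leading byte order mark."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return 'utf-16'      # this codec reads the BOM and strips it
    if data.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'   # UTF-8 with a leading BOM
    try:
        data.decode('ascii')
    except UnicodeDecodeError:
        return None          # no BOM and not plain ASCII: cannot tell
    return 'ascii'
```

Reading the file with `open(path, 'rb').read()` and passing the bytes in gives a codec name that can be handed to `bytes.decode`, or `None` when no guess is possible.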

Apr 22 '07 #3

On Apr 21, 7:28 pm, R Wood <r...@therandymon.com> wrote:
> I know nothing about Python except that it interests me and has interested me
> since I first learned the Rekall database frontend (Linux) runs on it. I just
> ordered Learning Python and if that works out satisfactorily I'm going to go
> back for Programming Python. In the meantime, I thought I would pose the
> question to this newsgroup: would Python be useful for a parsing exercise like
> this one?

Here's a little function that takes some `str`-type data (i.e. what
you'd get from doing open(...).read()) and, assuming it's a Vcard,
detects its encoding and converts it to a canonical `unicode` object.

def fix_encoding(s):
    # Python 2: s is a byte string (str); the result is a unicode object.
    m = u'BEGIN:VCARD'
    for c in ('ascii', 'utf_16_be', 'utf_16_le', 'utf_8'):
        try:
            u = unicode(s, c)
        except UnicodeDecodeError:
            continue
        if m in u:
            return u
    return None
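
The function above relies on the Python 2 `unicode` built-in. A rough Python 3 equivalent, offered only as a sketch, takes `bytes` and returns `str`. One deliberate difference, which is an assumption about what the caller wants: it strips a leading byte order mark, which the explicit `utf_16_be` and `utf_16_le` codecs would otherwise leave in the text.

```python
def fix_encoding(data):
    """Decode raw vCard bytes by trying a few codecs in turn (Python 3)."""
    marker = 'BEGIN:VCARD'
    for codec in ('ascii', 'utf_16_be', 'utf_16_le', 'utf_8'):
        try:
            text = data.decode(codec)
        except UnicodeDecodeError:
            continue
        # A wrong UTF-16 byte order can decode without error, so the
        # marker check is what actually confirms the guess.
        if marker in text:
            return text.lstrip('\ufeff')  # drop a leading BOM, if any
    return None
```

Typical use would be `fix_encoding(open(path, 'rb').read())`, where `path` is a placeholder for the exported file; the result is `None` when no candidate codec produces text containing the marker.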

Apr 24 '07 #4

On Apr 21, 7:28 pm, R Wood <r...@therandymon.com> wrote:
> To me this was a natural task for Perl. Turns out however, there's a catch.
> Apple exports the file in UTF-16 to ensure anyone with Chinese characters in
> their addressbook gets a legitimate Vcard file.

Here's a function that, given a `str` containing a vcard in some
encoding, guesses the encoding and returns a canonical representation
as a `unicode` object.

def fix_encoding(s):
    # Python 2: s is a byte string (str); the result is a unicode object.
    m = u'BEGIN:VCARD'
    for c in ('ascii', 'utf_16_be', 'utf_16_le', 'utf_8'):
        try:
            u = unicode(s, c)
        except UnicodeDecodeError:
            continue
        if m in u:
            return u
    return None

Apr 24 '07 #5
