
reading hebrew text file

I have a Hebrew text file that I want to read in Python.
I don't know which encoding I need to use or how to do that.

thanks,
hagai

Oct 17 '05 #1
4 Replies


<ha*****@gmail.com> wrote:
I have a hebrew text file, which I want to read in python
I don't know which encoding I need to use & how I do that


As for the "how", look to the codecs module -- but if you don't know
which codec the text file is written in, I know of no way to guess from
here!-)
Alex
Oct 17 '05 #2

I looked for "VAV" in the files in the "encodings" directory
(/usr/lib/python2.4/encodings/*.py on my machine). I found that the following
character encodings seem to include Hebrew characters:
cp1255
cp424
cp856
cp862
iso8859-8
A file containing Hebrew text might be in any one of these encodings, or
in any Unicode-based encoding such as UTF-8 or UTF-16.
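One way to narrow the list down is to try decoding the raw bytes with each candidate and keep the ones that both succeed and produce Hebrew letters. The sketch below assumes Python 3 (the thread itself uses Python 2); the names CANDIDATES, is_hebrew_letter and plausible_encodings are made up for illustration and are not part of any library. Several single-byte code pages place Hebrew at the same byte values, so more than one candidate can pass, and a human still has to look at the decoded text.

```python
# Candidate codecs named in the post above, plus UTF-8 as one Unicode-based option.
CANDIDATES = ["utf-8", "cp1255", "iso8859-8", "cp862", "cp856", "cp424"]


def is_hebrew_letter(ch):
    # The Hebrew letters alef..tav occupy U+05D0..U+05EA.
    return "\u05d0" <= ch <= "\u05ea"


def plausible_encodings(raw):
    """Return the candidates under which `raw` decodes without error
    and yields at least one Hebrew letter (a heuristic, not a proof)."""
    hits = []
    for name in CANDIDATES:
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
        if any(is_hebrew_letter(ch) for ch in text):
            hits.append(name)
    return hits


# Hypothetical sample: the word "shalom" encoded as cp1255 bytes.
sample = "\u05e9\u05dc\u05d5\u05dd".encode("cp1255")
print(plausible_encodings(sample))
```

In real use, `raw` would come from reading the file in binary mode, for example `open(path, "rb").read()`.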

To open an encoded file for reading, use
import codecs
f = codecs.open(filename, 'r', encoding='...')
Now, calls like 'f.readline()' will return unicode strings.

Here's an example, using a file in UTF-8 I have lying around:
f = codecs.open("/users/jepler/txt/UTF-8-demo.txt", "r", "utf-8")
for i in range(5): print repr(f.readline())

...
u'UTF-8 encoded sample plain-text file\n'
u'\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e \u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u 203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u20 3e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e \u203e\u203e\u203e\n'
u'\n'
u'Markus Kuhn [\u02c8ma\u02b3k\u028as ku\u02d0n] <mk***@acm.org> \u2014 1999-08-20\n'
u'\n'
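For readers on Python 3, a rough equivalent of the session above is sketched here. It assumes Python 3, where the built-in open() accepts an encoding argument and returns already-decoded strings. The file name and its contents are stand-ins created by the example itself, not the demo file from the post.

```python
import os
import tempfile

# Create a small UTF-8 file to read back (a stand-in for the real file).
path = os.path.join(tempfile.mkdtemp(), "hebrew-demo.txt")
with open(path, "w", encoding="utf-8") as out:
    out.write("\u05e9\u05dc\u05d5\u05dd\nsecond line\n")

# The built-in open() decodes on the fly when given an encoding.
with open(path, "r", encoding="utf-8") as f:
    lines = [f.readline() for _ in range(2)]

for line in lines:
    # ascii() shows non-ASCII characters as \u escapes, much like
    # repr() did for unicode strings in Python 2.
    print(ascii(line))
```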

Jeff


Oct 17 '05 #3

ha*****@gmail.com wrote:
I have a hebrew text file, which I want to read in python
I don't know which encoding I need to use

that's not a good start. but maybe it's one of these:

http://sites.huji.ac.il/tex/hebtex_fontsrep.html

?

how I do that


f = open(myfile)
text = f.readline()

followed by one of

text = text.decode("iso-8859-8")
text = text.decode("cp1255")
text = text.decode("cp862")

alternatively, use:

f = codecs.open(myfile, "r", encoding)

to get a stream that decodes things on the fly.
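The two approaches above (decode by hand, or let the stream decode) can be compared side by side. This is a minimal sketch assuming Python 3, where a file opened in binary mode yields bytes and the built-in open() takes the encoding directly. The ENCODING value and the sample file are assumptions made for the demonstration.

```python
import os
import tempfile

ENCODING = "cp1255"  # assumption: substitute whichever codec the file really uses

# Write a one-line sample file as raw bytes in the chosen encoding.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as out:
    out.write("\u05e2\u05d1\u05e8\u05d9\u05ea\n".encode(ENCODING))

# Approach 1: read raw bytes, then decode explicitly.
with open(path, "rb") as f:
    decoded_by_hand = f.readline().decode(ENCODING)

# Approach 2: let the file object decode on the fly.
# newline="" turns off newline translation so both results match exactly.
with open(path, "r", encoding=ENCODING, newline="") as f:
    decoded_by_stream = f.readline()

print(decoded_by_hand == decoded_by_stream)
```

Decoding on the fly is usually the more convenient choice for line-by-line reading. Decoding by hand is handy when the same bytes need to be tried against several encodings.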

</F>

Oct 17 '05 #4

Really, thanks.

hagai

Oct 18 '05 #5
