Bytes | Software Development & Data Engineering Community

Python 2.7 re is not able to find characters öÖäÄåÅ

103 100+
OS: Windows 7, 64-bit
Python 2.7

    import re

    text="""AU  - Huang, Zhipeng
    AU  - Geyer, Nadine
    AU  - Werner, Peter
    AU  - de Boor, Johannes
    AU  - Gösele, Ulrich
    TI  - Metal-Assisted Chemical Etching of Silicon: A Review"""

    auths=re.findall('AU  \- [öÖäÄåÅa-zA-Z.,\s]+', text)
    print(auths)
I am getting the result: ['AU  - Huang, Zhipeng', 'AU  - Geyer, Nadine', 'AU  - Werner, Peter', 'AU  - de Boor, Johannes', 'AU  - G']
Python is not able to match "Gösele".
Oct 7 '14 #1

✓ answered by bvdet

bvdet
2,851 Expert Mod 2GB
I am not having that issue in Python 2.7.2. Try this:
    # coding=utf-8
    import re

    text="""AU  - Huang, Zhipeng
    AU  - Geyer, Nadine
    AU  - Werner, Peter
    AU  - de Boor, Johannes
    AU  - Gösele, Ulrich
    TI  - Metal-Assisted Chemical Etching of Silicon: A Review"""

    auths=re.findall('AU  \- [öÖäÄåÅa-zA-Z, ]+', text)
    for item in auths:
        print item
The results:
    >>> AU  - Huang, Zhipeng
    AU  - Geyer, Nadine
    AU  - Werner, Peter
    AU  - de Boor, Johannes
    AU  - Gösele, Ulrich
    >>> 
    >>> print auths
    ['AU  - Huang, Zhipeng', 'AU  - Geyer, Nadine', 'AU  - Werner, Peter', 'AU  - de Boor, Johannes', 'AU  - G\xc3\xb6sele, Ulrich']
    >>>
Oct 7 '14 #2
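
A note on the `G\xc3\xb6sele` in the printed list above: this is not corruption. In Python 2, a plain string literal in a source file declared as UTF-8 is a byte string, so ö is stored as the two bytes `\xc3\xb6`. `print item` writes those bytes out as they are, while printing the whole list shows each item's repr with the escapes. A minimal sketch of the relationship follows. It is meant to run on both Python 2.7 and Python 3, and the variable names are illustrative rather than taken from the thread:

```python
# -*- coding: utf-8 -*-
# Sketch only: the names below are illustrative, not from the thread.

name = u"Gösele"                 # text: a sequence of characters
encoded = name.encode("utf-8")   # bytes: what a UTF-8 encoded file stores

print(len(name))      # 6 characters
print(len(encoded))   # 7 bytes, because ö takes two bytes in UTF-8
print(repr(encoded))  # the repr contains the \xc3\xb6 escape pair

# Decoding the bytes gives back the original text.
assert encoded.decode("utf-8") == name
```

As long as the pattern and the text are both byte strings, or both unicode strings, the character class can match ö. Problems tend to appear when one side is bytes and the other is unicode.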
gintare
103 100+
Sorry for the misinformation. The text is actually not a string literal but is read from a file. The error appears when the text comes from the file:
    fcit=codecs.open('C:/Users/Gintare/Downloads/Citations.txt','r',encoding='utf-8')
    text=fcit.readlines()
Thanks for the note. For now I am just copying the file contents into the Python script, and everything works. But if you know how to read the file correctly, could you please post it?
Oct 7 '14 #3
bvdet
2,851 Expert Mod 2GB
You are calling readlines(), which returns a list of lines, whereas re.findall() needs a single string. Use read() instead, or iterate over the list returned by readlines(). If the file is saved with UTF-8 encoding, this should work:
    # coding=utf-8
    import re
    import codecs

    # codecs.open() decodes the file, so read() returns one unicode string.
    fcit=codecs.open("data.txt", encoding="utf-8")
    text=fcit.read()
    fcit.close()
    # Decode the pattern too, so that pattern and text are both unicode.
    auths=re.findall(codecs.decode('AU  \- [öÖäÄåÅa-zA-Z, ]+', "utf-8"), text)

    for item in auths:
        print item
Oct 8 '14 #4
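
The same idea can be sketched in a form that runs unchanged on Python 2.7 and Python 3, by writing the pattern as a unicode literal and reading the file with `io.open`. Everything here is an assumption made for illustration: the helper name `find_authors`, the throwaway temp file that stands in for the real citations file, and the shortened sample text.

```python
# -*- coding: utf-8 -*-
# Sketch only: find_authors and the temp file are hypothetical stand-ins.
import io
import os
import re
import tempfile

SAMPLE = (
    u"AU  - de Boor, Johannes\n"
    u"AU  - Gösele, Ulrich\n"
    u"TI  - Metal-Assisted Chemical Etching of Silicon: A Review\n"
)

# A unicode pattern, so each accented letter is one item in the class.
AUTHOR_RE = re.compile(u"AU  - [öÖäÄåÅa-zA-Z., ]+")


def find_authors(path):
    """Return all 'AU  - ...' entries from a UTF-8 encoded file."""
    with io.open(path, encoding="utf-8") as handle:
        # read() returns the whole file as one unicode string.
        return AUTHOR_RE.findall(handle.read())


# Write a small sample file so the sketch is self-contained.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(sample_path, "w", encoding="utf-8") as handle:
    handle.write(SAMPLE)

authors = find_authors(sample_path)
os.remove(sample_path)

print(len(authors))  # number of author lines found in the sample
```

The character class contains a plain space rather than `\s`, so a match cannot run across the line break into the next record.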
gintare
103 100+
Thanks, it works with file.read()
Oct 9 '14 #5
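
For completeness, the other option mentioned in the thread, iterating over the list returned by readlines(), might look like the sketch below. The in-memory `lines` list is a stand-in for the readlines() result, and it assumes the list was handed to the regex directly, which the thread does not show explicitly.

```python
# -*- coding: utf-8 -*-
# Sketch only: `lines` stands in for what readlines() would return.
import re

AUTHOR_RE = re.compile(u"AU  - [öÖäÄåÅa-zA-Z., ]+")

lines = [
    u"AU  - de Boor, Johannes\n",
    u"AU  - Gösele, Ulrich\n",
    u"TI  - Metal-Assisted Chemical Etching of Silicon: A Review\n",
]

# Passing the list itself raises TypeError, because the re functions
# expect a single string rather than a list of strings.
try:
    AUTHOR_RE.findall(lines)
    list_accepted = True
except TypeError:
    list_accepted = False

# Matching each line separately works.
authors = []
for line in lines:
    authors.extend(AUTHOR_RE.findall(line))

print(list_accepted)  # False
print(len(authors))   # 2
```

Reading line by line keeps memory use low for large files, while read() is simpler when the file is small.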
