OS: Windows 7, 64-bit
Python 2.7

    text="""AU - Huang, Zhipeng
    AU - Geyer, Nadine
    AU - Werner, Peter
    AU - de Boor, Johannes
    AU - Gösele, Ulrich
    TI - Metal-Assisted Chemical Etching of Silicon: A Review"""

    auths=re.findall('AU \- [öÖäÄåÅa-zA-Z.,\s]+', text)
    print(auths)

I am getting the result:

    ['AU - Huang, Zhipeng', 'AU - Geyer, Nadine', 'AU - Werner, Peter', 'AU - de Boor, Johannes', 'AU - G']

Python is not able to find "Gösele".
bvdet 2,851
Expert Mod 2GB
I am not having that issue in Python 2.7.2. Try this:

    # coding=utf-8
    import re

    text="""AU - Huang, Zhipeng
    AU - Geyer, Nadine
    AU - Werner, Peter
    AU - de Boor, Johannes
    AU - Gösele, Ulrich
    TI - Metal-Assisted Chemical Etching of Silicon: A Review"""

    auths=re.findall('AU \- [öÖäÄåÅa-zA-Z, ]+', text)
    for item in auths:
        print item

The results:

    >>> AU - Huang, Zhipeng
    AU - Geyer, Nadine
    AU - Werner, Peter
    AU - de Boor, Johannes
    AU - Gösele, Ulrich
    >>>
    >>> print auths
    ['AU - Huang, Zhipeng', 'AU - Geyer, Nadine', 'AU - Werner, Peter', 'AU - de Boor, Johannes', 'AU - G\xc3\xb6sele, Ulrich']
    >>>
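A note on the `\xc3\xb6` in the printed list: it is not corruption. When a list is printed, Python 2 shows the repr of each byte string, and "ö" is stored as two UTF-8 bytes. A minimal sketch of the round trip, assuming the script and console both use UTF-8 (the variable names are only for illustration):

```python
# -*- coding: utf-8 -*-
# "ö" is one character but two bytes in UTF-8; repr() shows those bytes escaped.
name = u"Gösele"

encoded = name.encode("utf-8")     # the bytes a Python 2 str literal would hold
decoded = encoded.decode("utf-8")  # back to the readable text

print(repr(encoded))
print(decoded)
```

Printing the items one by one, as in the loop above, displays the readable form instead of the escaped bytes.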
I am sorry for the misinformation. Actually, the text is not a string literal in the script; it is read from a file. The error appears when the text comes from the file:

    fcit=codecs.open('C:/Users/Gintare/Downloads/Citations.txt','r',encoding='utf-8')
    text=fcit.readlines()

Thanks for the reply. For now I am just copying the file contents into the Python script, and everything works. But if you know how to read the file correctly, could you please post it?
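One plausible explanation for the match stopping at "G" once the text comes from `codecs.open`: the file is decoded to unicode text, while the pattern in the script is still a byte string, so the "ö" in the character class is two separate UTF-8 bytes and neither equals the single character "ö". The sketch below imitates that mismatch by decoding the UTF-8 bytes as Latin-1; this is an assumption about the cause, not something confirmed in the thread.

```python
# -*- coding: utf-8 -*-
import re

text = u"AU - Gösele, Ulrich"

# Imitate a byte-string character class: each UTF-8 byte of the umlauts
# becomes its own character (this is the assumed source of the mismatch).
byte_like_class = u"öÖäÄåÅ".encode("utf-8").decode("latin-1")
byte_like_pattern = re.compile(u"AU - [" + byte_like_class + u"a-zA-Z, ]+")

# The same pattern written as unicode text throughout.
unicode_pattern = re.compile(u"AU - [öÖäÄåÅa-zA-Z, ]+")

truncated = byte_like_pattern.findall(text)
complete = unicode_pattern.findall(text)

print(truncated)
print(complete)
```

Under this assumption, keeping the pattern and the text in the same type (both unicode) avoids the truncation.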
bvdet 2,851
Expert Mod 2GB
You are calling readlines(), which returns a list of the lines rather than a single string. You should either use read() or iterate over the list returned by readlines(). If the file is saved with UTF-8 encoding, this should work:

    # coding=utf-8
    import re
    import codecs

    # Decode the file to unicode text while reading it.
    fcit=codecs.open("data.txt", encoding="utf-8")
    text=fcit.read()
    fcit.close()

    # Decode the pattern as well, so pattern and text are both unicode.
    auths=re.findall(codecs.decode('AU \- [öÖäÄåÅa-zA-Z, ]+', "utf-8"), text)

    for item in auths:
        print item
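Both options mentioned above (read() on the whole file, or iterating line by line) can be sketched as follows. This is an illustrative variant, not the code from the thread: it uses `io.open` and a unicode pattern literal so that it runs on Python 2.7 and Python 3, and the function names are hypothetical.

```python
# -*- coding: utf-8 -*-
import io
import re

# Unicode pattern, so it compares correctly against decoded file text.
AUTHOR_RE = re.compile(u"AU - [öÖäÄåÅa-zA-Z, ]+")


def authors_from_file(path):
    """Read the whole file as one unicode string and search it once."""
    with io.open(path, encoding="utf-8") as fcit:
        return AUTHOR_RE.findall(fcit.read())


def authors_line_by_line(path):
    """Iterate over the lines and search each one separately."""
    found = []
    with io.open(path, encoding="utf-8") as fcit:
        for line in fcit:
            found.extend(AUTHOR_RE.findall(line))
    return found
```

Because the character class contains no newline, both functions return one entry per "AU" line; the line-by-line version avoids holding a large file in memory.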
Thanks, it works with file.read()