I'm trying to import text from an OpenOffice document (saved as .sxw,
reading the data from content.xml inside the sxw archive using
ElementTree and similar tools).
The encoding that gives me the least problems seems to be cp1252,
however it's not completely perfect because there are still characters
in it like \93 or \94. Has anyone handled this before? I'd rather not
reinvent the wheel and start translating strings 'by hand'.
Anton
Apr 26 '06
Anton Vredegoor wrote: Serge Orlov wrote:
Anton Vredegoor wrote: In fact there are a lot of printable things that haven't got a text attribute, for example some items with tag (xxxx)s.
In my sample file I see <text:s text:c="2"/>. Is that what you're talking about? Since my file is small, I can say for sure this tag represents two space characters.
Or for example in firefox:
<text:s/> in Amsterdam <text:s/>
So, probably yes. If it doesn't have a text attribute when you iterate over it using OOoPy, for example:
    o = OOoPy(infile=fname)
    c = o.read('content.xml')
    for x in c.getiterator():
        if x.text:
Then we know for sure you have recreated my other problem.
I'm tweaking a small test file and see that
<text:s/> is one space character
<text:s text:c="2"/> is two space characters
<text:s text:c="3"/> is three space characters
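The mapping observed above can be sketched in Python. This is only a minimal sketch under stated assumptions, not what OOoPy does: it uses the standard library's xml.etree.ElementTree on an inline sample paragraph rather than a real content.xml, it assumes the text namespace URI in .sxw files is http://openoffice.org/2000/text, and `flatten` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the "text:" prefix in .sxw files.
TEXT_NS = "http://openoffice.org/2000/text"

# Inline stand-in for one paragraph of content.xml.
SAMPLE = (
    '<text:p xmlns:text="%s">Anton<text:s/>in Amsterdam'
    '<text:s text:c="3"/>today</text:p>' % TEXT_NS
)


def flatten(elem):
    """Return the text of elem, expanding <text:s> into spaces."""
    parts = []
    if elem.tag == "{%s}s" % TEXT_NS:
        # text:c is the repeat count; a missing attribute means one space.
        count = int(elem.get("{%s}c" % TEXT_NS, "1"))
        parts.append(" " * count)
    elif elem.text:
        parts.append(elem.text)
    for child in elem:
        parts.append(flatten(child))
        # Text that follows a child element is stored in its tail.
        if child.tail:
            parts.append(child.tail)
    return "".join(parts)


paragraph = ET.fromstring(SAMPLE)
result = flatten(paragraph)
print(repr(result))
```

Checking only `x.text` while iterating would skip both the `<text:s>` elements and the tail text after them, which may be why some printable content seems to have no text attribute.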
Martin v. Löwis wrote: So if that is the case: What is the problem then? If you interpret the document as cp1252, and it contains \x93 and \x94, what is it that you don't like about that? In yet other words: what actions are you performing, what are the results you expect to get, and what are the results that you actually get?
Well, where do these cp1252 codes come from? The xml-file claims it's
utf-8.
I just tried out some random decodings and cp1252 seemed to work. I
don't like to have to guess this way. I think John wouldn't even allow
it :-)
Anton
Anton Vredegoor wrote:
Martin v. Löwis wrote: So if that is the case: What is the problem then? If you interpret the document as cp1252, and it contains \x93 and \x94, what is it that you don't like about that? In yet other words: what actions are you performing, what are the results you expect to get, and what are the results that you actually get?
Well, where do these cp1252 codes come from? The xml-file claims it's utf-8.
Ah. Then the document is most likely right: \x94 can very well occur
in a UTF-8 file.
I just tried out some random decodings and cp1252 seemed to work. I don't like to have to guess this way. I think John wouldn't even allow it :-)
Well, if the document is UTF-8, you should decode it as UTF-8, of
course.
Regards,
Martin
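The point about \x93 and \x94 can be illustrated with a short Python sketch. It assumes, for illustration only, that the characters in question are the curly quotes U+201C and U+201D; the thread does not say which characters the document actually contains.

```python
# Curly double quotes, assumed here to be the characters behind \x93 and \x94.
left, right = "\u201c", "\u201d"

# In cp1252 each quote is a single byte: 0x93 and 0x94.
cp1252_bytes = (left + right).encode("cp1252")
print(cp1252_bytes)   # b'\x93\x94'

# In UTF-8 each quote takes three bytes.
utf8_bytes = (left + right).encode("utf-8")
print(utf8_bytes)     # b'\xe2\x80\x9c\xe2\x80\x9d'

# The byte 0x94 can still appear in valid UTF-8 as a continuation byte,
# for example in the encoding of U+2014.
dash_utf8 = "\u2014".encode("utf-8")
print(0x94 in dash_utf8)   # True

# Decoding UTF-8 bytes with the wrong codec does not fail here,
# it just produces different characters.
mojibake = left.encode("utf-8").decode("cp1252")
print(ascii(mojibake))
```

So the bytes \x93 and \x94 show up when text that was correctly decoded as UTF-8 is re-encoded as cp1252 for output, not because the file itself is cp1252.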
Martin v. Löwis wrote: Well, if the document is UTF-8, you should decode it as UTF-8, of course.
Thanks. This and: http://en.wikipedia.org/wiki/UTF-8
solved my problem with understanding the encoding.
Anton
proof that I understand it now (please anyone, prove me wrong if you can):
    from zipfile import ZipFile, ZIP_DEFLATED

    def by80(seq):
        # Yield successive 80-character slices of seq.
        for start in range(0, len(seq), 80):
            yield seq[start:start + 80]

    def utfCheck(infn):
        zin = ZipFile(infn, 'r', ZIP_DEFLATED)
        # content.xml declares UTF-8, so decode it as UTF-8.
        data = zin.read('content.xml').decode('utf-8')
        for line in by80(data):
            # Re-encode as cp1252 only for printing.
            print line.encode('cp1252')

    def test():
        infn = "xxx.sxw"
        utfCheck(infn)

    if __name__ == '__main__':
        test()
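For readers on Python 3, the same check might look like the sketch below. It is not the poster's code: `chunks` and `read_content` are hypothetical helper names, and an in-memory archive with a made-up content.xml stands in for a real .sxw file so the example runs on its own.

```python
import io
import zipfile


def chunks(text, width=80):
    """Yield successive slices of text, each at most width characters."""
    for start in range(0, len(text), width):
        yield text[start:start + width]


def read_content(archive):
    """Read content.xml from a zip archive and decode it as UTF-8."""
    with zipfile.ZipFile(archive) as zin:
        return zin.read("content.xml").decode("utf-8")


# Build a stand-in archive in memory (hypothetical content, not a real .sxw).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zout:
    zout.writestr("content.xml", "<p>\u201cquoted\u201d</p>".encode("utf-8"))
buffer.seek(0)

text = read_content(buffer)

# Re-encode each chunk as cp1252; characters outside cp1252 become "?".
lines = [chunk.encode("cp1252", errors="replace") for chunk in chunks(text)]
print(lines)
```

Passing a file name such as "xxx.sxw" to `read_content` instead of the buffer should work the same way, since `zipfile.ZipFile` accepts either a path or a file-like object.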