not quite 1252

I'm trying to import text from an OpenOffice document (saved as .sxw,
reading the data from content.xml inside the sxw archive using
elementtree and similar tools).

The encoding that gives me the fewest problems seems to be cp1252,
but it isn't perfect: the text still contains characters like \x93
or \x94. Has anyone handled this before? I'd rather not reinvent the
wheel and start translating strings 'by hand'.

Anton
Apr 26 '06
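
For readers following along, here is a minimal Python 3 sketch of the approach described in the question: open the .sxw file as a zip archive and let ElementTree parse content.xml. The helper name `read_content_xml` and the in-memory sample archive are illustrative assumptions, not something from the original post.

```python
import io
import xml.etree.ElementTree as ET
from zipfile import ZipFile


def read_content_xml(sxw):
    """Return the parsed root element of content.xml from an .sxw archive.

    `sxw` may be a file name or a seekable file-like object. The raw bytes
    are handed to ElementTree, which honours the encoding named in the XML
    declaration, so no manual decoding step is needed.
    """
    with ZipFile(sxw) as archive:
        return ET.fromstring(archive.read("content.xml"))


# Build a tiny stand-in archive in memory (hypothetical sample data).
sample = io.BytesIO()
with ZipFile(sample, "w") as archive:
    xml = ('<?xml version="1.0" encoding="UTF-8"?>'
           '<doc><p>in \u201cAmsterdam\u201d</p></doc>')
    archive.writestr("content.xml", xml.encode("utf-8"))
sample.seek(0)

root = read_content_xml(sample)
first_paragraph = root.find("p").text
print(ascii(first_paragraph))
```

Passing bytes rather than an already decoded string keeps the encoding decision with the XML parser, which is what avoids guessing at code pages.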

Anton Vredegoor wrote:
Serge Orlov wrote:
Anton Vredegoor wrote:
In fact there are a lot of printable things that haven't got a text
attribute, for example some items with tag (xxxx)s.


In my sample file I see <text:s text:c="2"/>. Is that what you're
talking about? Since my file is small, I can say for sure that this tag
represents two space characters.


Or for example in firefox:

<text:s/>
in Amsterdam
<text:s/>

So, probably yes. If it doesn't have a text attribute when you iterate
over it using OOoPy, for example:

o = OOoPy (infile = fname)
c = o.read ('content.xml')
for x in c.getiterator() :
    if x.text:

Then we know for sure you have recreated my other problem.


I'm tweaking a small test file and see that
<text:s/> is one space character
<text:s text:c="2"/> is two space characters
<text:s text:c="3"/> is three space characters

Apr 28 '06 #21
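
The observation above (a `<text:s/>` element stands for one space, and `text:c` gives a repeat count) can be sketched as a small text extractor. This is only an illustration: the function name `flatten_text` is made up, and the namespace URI is an assumption that should be replaced by whatever URI the `text:` prefix is bound to in your own content.xml.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace URI used by the sample below; check the xmlns:text
# declaration in your own content.xml and adjust if it differs.
TEXT_NS = "http://openoffice.org/2000/text"
SPACE_TAG = "{%s}s" % TEXT_NS
COUNT_ATTR = "{%s}c" % TEXT_NS


def flatten_text(elem):
    """Collect the text under elem, expanding <text:s/> into spaces."""
    parts = []
    if elem.tag == SPACE_TAG:
        # A missing text:c attribute means a single space.
        parts.append(" " * int(elem.get(COUNT_ATTR, "1")))
    elif elem.text:
        parts.append(elem.text)
    for child in elem:
        parts.append(flatten_text(child))
        # Text that follows a child element lives in its .tail, not .text.
        if child.tail:
            parts.append(child.tail)
    return "".join(parts)


sample = (
    '<text:p xmlns:text="%s">'
    'in<text:s/>Amsterdam<text:s text:c="3"/>today'
    '</text:p>' % TEXT_NS
)
paragraph = ET.fromstring(sample)
flattened = flatten_text(paragraph)
print(repr(flattened))
```

Checking only `x.text` while iterating misses both the space elements and any `.tail` text, which is one plausible reason printable content seems to have "no text attribute".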
Martin v. Löwis wrote:
So if that is the case: What is the problem then? If you interpret
the document as cp1252, and it contains \x93 and \x94, what is
it that you don't like about that? In yet other words: what actions
are you performing, what are the results you expect to get, and
what are the results that you actually get?


Well, where do these cp1252 codes come from? The xml-file claims it's
utf-8.

I just tried out some random decodings and cp1252 seemed to work. I
don't like to have to guess this way. I think John wouldn't even allow
it :-)

Anton
Apr 28 '06 #22
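
Rather than trying decodings at random, one option is to read the encoding the file itself declares. The sketch below is a simplification under stated assumptions: it only inspects an ASCII-compatible XML declaration at the very start of the data, ignores byte order marks, and falls back to UTF-8 when no encoding is declared. The name `declared_encoding` is hypothetical.

```python
import re

# Matches the encoding pseudo-attribute of a leading XML declaration.
_DECLARATION = re.compile(
    rb"^<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z0-9._-]+)[\"']"
)


def declared_encoding(data, default="utf-8"):
    """Return the encoding named in the XML declaration of `data` (bytes)."""
    match = _DECLARATION.match(data)
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()


print(declared_encoding(b'<?xml version="1.0" encoding="UTF-8"?><a/>'))
```

In practice an XML parser does this work for you when it is given the raw bytes, so a helper like this is mainly useful for diagnostics.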
Anton Vredegoor wrote:
So if that is the case: What is the problem then? If you interpret
the document as cp1252, and it contains \x93 and \x94, what is
it that you don't like about that? In yet other words: what actions
are you performing, what are the results you expect to get, and
what are the results that you actually get?
Well, where do these cp1252 codes come from? The xml-file claims it's
utf-8.


Ah. Then the document is most likely right: \x94 can very well occur
in a UTF-8 file.
I just tried out some random decodings and cp1252 seemed to work. I
don't like to have to guess this way. I think John wouldn't even allow
it :-)


Well, if the document is UTF-8, you should decode it as UTF-8, of
course.

Regards,
Martin
Apr 29 '06 #23
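
To illustrate the point that bytes such as \x93 and \x94 are legitimate inside UTF-8 data, here is a short Python 3 sketch. The choice of the en dash and em dash as examples is an assumption made for illustration; the thread does not say which characters the document actually contained.

```python
en_dash = "\u2013"
em_dash = "\u2014"

# Each dash becomes three bytes in UTF-8; the last byte is 0x93 or 0x94.
en_bytes = en_dash.encode("utf-8")
em_bytes = em_dash.encode("utf-8")
print(en_bytes, em_bytes)

# Decoding the same bytes as cp1252 "works" without an error, but it turns
# one character into three unrelated ones.
garbled = em_bytes.decode("cp1252")
print(ascii(garbled))

# Decoding with the declared encoding recovers the original character.
restored = em_bytes.decode("utf-8")
print(ascii(restored))
```

This is why cp1252 can appear to give "the least problems": it maps almost every byte to some character, so it rarely fails, even when the result is wrong.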
Martin v. Löwis wrote:
Well, if the document is UTF-8, you should decode it as UTF-8, of
course.


Thanks. This and:

http://en.wikipedia.org/wiki/UTF-8

solved my problem with understanding the encoding.

Anton

proof that I understand it now (please anyone, prove me wrong if you can):

from zipfile import ZipFile, ZIP_DEFLATED

def by80(seq):
    # Yield successive 80-character slices; the loop ends with the data.
    for start in range(0, len(seq), 80):
        yield seq[start:start + 80]

def utfCheck(infn):
    zin = ZipFile(infn, 'r', ZIP_DEFLATED)
    # content.xml declares UTF-8, so decode it as UTF-8.
    data = zin.read('content.xml').decode('utf-8')
    for line in by80(data):
        # Raises UnicodeEncodeError for characters that cp1252 lacks.
        print(line.encode('1252'))

def test():
    infn = "xxx.sxw"
    utfCheck(infn)

if __name__=='__main__':
    test()
Apr 29 '06 #24
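
One caveat about the check in the post above: encoding to cp1252 only succeeds if every character in the document happens to exist in that code page. A hedged Python 3 sketch for listing the characters that would fail follows; the helper name `unencodable` and the sample string are invented for illustration.

```python
def unencodable(text, encoding="cp1252"):
    """Return the sorted distinct characters of text that encoding lacks."""
    bad = set()
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.add(ch)
    return sorted(bad)


# Accented letters and curly quotes exist in cp1252; the arrow and the
# Greek capital delta do not.
sample = "caf\u00e9 \u201cquoted\u201d \u2192 \u0394"
missing = unencodable(sample)
print([ascii(ch) for ch in missing])
```

If the list is empty for a given document, the round trip through cp1252 is lossless for that document, but that says nothing about other documents.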
