Bug 63756 - non UTF-8/UTF-16/ISO-8859 XML files cannot be opened on Windows
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: filters and storage
Version: 4.0.0.3 release (earliest affected)
Hardware: All Windows (All)
Importance: medium major
Assignee: David Tardon
URL:
Whiteboard: BSA target:4.4.0 target:4.3.1
Keywords:
Duplicates: 50012 59788 64676 65005 69163 71782 71831 81461
Depends on:
Blocks:
 
Reported: 2013-04-20 16:34 UTC by Michal
Modified: 2014-07-24 12:32 UTC
CC List: 17 users

See Also:
Crash report or crash signature:


Attachments
Example file (windows-1250 encoded). (1.96 KB, text/xml)
2013-04-20 16:34 UTC, Michal
Example file (utf-8 encoded) - this works. (1.95 KB, text/plain)
2013-04-20 16:35 UTC, Michal
Example file (windows-1250 encoded). (1.90 KB, text/plain)
2013-04-23 05:24 UTC, Andras Timar

Description Michal 2013-04-20 16:34:09 UTC
Created attachment 78276 [details]
Example file (windows-1250 encoded).

Problem description: 
LibreOffice is unable to open 2003 XML files that are not encoded in UTF-8. I attached a valid file encoded in windows-1250 (Polish). The same file converted to UTF-8 opens without problems.

Steps to reproduce:
1. Save a file in 2003 XML format,
2. Open it with Notepad++ (or something similar),
3. Change the encoding,
4. Edit the XML header in the file to declare the new encoding,
5. Try to open it in LibreOffice (general I/O error).
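
For anyone who wants to reproduce this without Notepad++, the steps above can be sketched in Python. This is only a minimal sketch: the workbook content is a hypothetical stand-in for the attachment, and the file names are made up. It writes the same document twice, once as windows-1250 and once as UTF-8, each with a matching XML declaration.

```python
import os
import tempfile

# Hypothetical minimal 2003 XML spreadsheet; the real attachment is larger.
# The Polish sample text encodes differently in windows-1250 and UTF-8.
TEMPLATE = (
    '<?xml version="1.0" encoding="{enc}"?>\n'
    '<?mso-application progid="Excel.Sheet"?>\n'
    '<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"\n'
    '          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">\n'
    '  <Worksheet ss:Name="Arkusz1"><Table><Row><Cell>\n'
    '    <Data ss:Type="String">Zażółć gęślą jaźń</Data>\n'
    '  </Cell></Row></Table></Worksheet>\n'
    '</Workbook>\n'
)


def write_sample(path, encoding):
    """Write the sample so that the bytes and the XML declaration agree."""
    data = TEMPLATE.format(enc=encoding).encode(encoding)
    with open(path, "wb") as f:
        f.write(data)
    return data


out_dir = tempfile.mkdtemp()
cp1250_bytes = write_sample(os.path.join(out_dir, "sample-cp1250.xml"), "windows-1250")
utf8_bytes = write_sample(os.path.join(out_dir, "sample-utf8.xml"), "UTF-8")
```

Opening both files in LibreOffice on Windows should then show whether only the windows-1250 variant is rejected.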

Current behavior:
Fails to open.

Expected behavior:
The file should open. Last known working version was OpenOffice 3.0.
Operating System: Windows 7
Version: 4.0.1.2 release
Comment 1 Michal 2013-04-20 16:35:14 UTC
Created attachment 78277 [details]
Example file (utf-8 encoded) - this works.
Comment 2 Joel Madero 2013-04-20 18:34:14 UTC
Are the two files identical apart from the encoding? I don't get an I/O error, but one opens in Spreadsheet with just A1 filled in while the other opens in Writer as plain XML.

Let us know if these are just the same file saved in two different encodings
Comment 3 Michal 2013-04-20 19:53:47 UTC
I attached both files if you need to check.
The UTF-8 file was saved as UTF-8 without BOM via Notepad++; the second was saved in the ANSI windows-1250 encoding. The files also differ in the XML header.

UTF-8 file has header:
<?xml version="1.0" encoding="UTF-8"?>

windows-1250 file has header:
<?xml version="1.0" encoding="windows-1250"?>

Internet Explorer opens both files and displays the characters correctly, so I assume the XML is OK.

Regards,
Michal
Comment 4 Michal 2013-04-20 19:56:32 UTC
Another suggestion: if the file contains <?mso-application progid="Excel.Sheet"?>, it should be opened in LO Calc, not Writer, when you simply choose LO as the opening program.

Regards,
Michal
Comment 5 Michal 2013-04-20 20:02:46 UTC
Most important: yes, the files are identical in content; the only difference is the XML header.

Regards,
Michal
Comment 6 Joel Madero 2013-04-21 05:43:49 UTC
Okay I can confirm this behavior but I think it's an enhancement request as I can't find documentation saying we actually support windows-1250 encoding. 

If you find documentation that says that it's supposed to be supported please feel free to change this but for now marking as:

New (confirmed)
Enhancement (not a bug with any feature, you just want the ability to support windows-1250 encoding)
Low - as this is the only bug I can find related to this encoding it doesn't seem like many people use it, seems appropriate setting.


Thanks!
Comment 7 Michal 2013-04-21 14:47:21 UTC
Joel, I don't think the problem is with this particular encoding at all. windows-1250 is the default on all Polish versions of Windows up to XP (I think Vista replaced it with UTF-8, but I'm not sure), so it's really not rare. If I convert the file to ISO-8859-2 (the ISO standard for Central Europe), I can't open it either. The problem seems to be UTF-8 versus non-UTF-8 files (but only those that contain national characters). LO seems to have problems opening files whose XML header is anything other than <?xml version="1.0" encoding="UTF-8"?> or <?xml version="1.0"?>

As I said in an earlier comment, this worked perfectly with OpenOffice 3.0. So if someone once added this encoding, why has it been removed now?

Regards,
Michal
Comment 8 Andras Timar 2013-04-22 09:30:20 UTC
Michal,  "Example file (windows-1250 encoded)" is in utf-8, too, only the header is different. Anyway, I think I could reproduce your problem, when I followed the steps you described. I got "General Error. General input/output error." 
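
A mislabeled attachment like this one (UTF-8 bytes under a windows-1250 header) can be spotted with a small check. The following is a rough heuristic sketch, not something LibreOffice does; the function names are made up for illustration, and it only handles ASCII-compatible encodings without a BOM.

```python
import re

# Matches the encoding name inside an XML declaration at the start of the file.
DECL_RE = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(data):
    """Return the encoding named in the XML declaration, or None."""
    m = DECL_RE.match(data)
    return m.group(1).decode("ascii") if m else None


def looks_like_utf8(data):
    """True if the bytes are valid UTF-8 and contain non-ASCII characters."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return any(ord(ch) > 127 for ch in text)


def header_mismatch(data):
    """Flag files that declare a non-UTF-8 encoding but whose bytes are UTF-8."""
    enc = declared_encoding(data)
    if enc is None or enc.lower() in ("utf-8", "utf8"):
        return False
    return looks_like_utf8(data)
```

Text in a single-byte encoding can occasionally happen to be valid UTF-8 as well, so a positive result is a hint rather than proof.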

Removing the support for legacy encodings might have been a side effect of some code optimizations and/or program startup optimizations.
Comment 9 Michal 2013-04-22 12:16:00 UTC
Andras, great that you can reproduce it. I hope this will be fixed.

Personally I think that disabling non-UTF-8 files in the XML parser is a bad idea. Just an example from my company: our ERP system creates some reports as 2003 XML files. These files can be opened in MSO 2007 and 2010 without problems, but LO crashes when opening them. All files are generated in windows-1250 because the database is also encoded in windows-1250. I believe I'm not alone with this problem.

If you need to optimize, then maybe a good solution would be to use only those code pages that are installed in the system? Most systems have UTF-8 plus 1-3 code pages installed.


Regards,
Michal
Comment 10 Andras Timar 2013-04-23 05:23:50 UTC
Kohei, you fixed similar issues in the past. Could you please have a look? Interestingly, it works in AOO 3.4.1 but not in Go-OO 3.2.1
Comment 11 Andras Timar 2013-04-23 05:24:40 UTC
Created attachment 78355 [details]
Example file (windows-1250 encoded).
Comment 12 Urmas 2013-05-17 04:31:31 UTC
*** Bug 64676 has been marked as a duplicate of this bug. ***
Comment 13 Michael Meeks 2013-05-22 09:50:46 UTC
This is for the legacy Office 2003 XML format ? IIRC this was implemented with some XSLT filters, which we re-wrote (Peter did anyhow) to use libxslt and libxml2 instead of some Java monster [ though I may mis-remember ]. It is possible that that is related ...

In general using non-utf-8 encodings is (IMHO) a bad idea wherever you see it - but of course, we should try to look into that / patches appreciated etc.
Comment 14 David Tardon 2013-05-23 08:16:36 UTC
Works fine on Linux. I guess libxslt cannot convert the character set because we do not distribute iconv.dll.
Comment 15 Mat M 2013-05-23 22:41:17 UTC
(In reply to comment #14)
> Works fine on Linux. I guess libxslt cannot convert the character set
> because we do not distribute iconv.dll.

libxslt is built against iconv (default for WIN/MSC)
libxml2 is built with iconv=0 sax1=1 (for WIN/MSC)

Is there any inconsistency there ?
Comment 16 David Tardon 2013-05-27 14:42:19 UTC
*** Bug 65005 has been marked as a duplicate of this bug. ***
Comment 17 Michael Stahl (allotropia) 2013-05-28 17:58:54 UTC
it appears that libxml2 has built-in support for just a
few standard encodings like UTF-8/UTF-16/ISO 8859-*
and everything else is handled by an optional iconv dependency.
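
To make the distinction concrete, here is a small sketch that classifies an encoding name the way this comment describes it. The list of built-in families is an assumption copied from the comment, not taken from the libxml2 source, and the function name is hypothetical.

```python
import re

# Assumption from the comment above, not verified against libxml2:
# these families are handled without iconv or ICU.
BUILTIN_PATTERNS = (
    r"utf-?8",
    r"utf-?16(le|be)?",
    r"iso-?8859-\d{1,2}",
)


def needs_converter(encoding_name):
    """True if the encoding would need an external converter such as iconv."""
    name = encoding_name.strip().lower()
    return not any(re.fullmatch(p, name) for p in BUILTIN_PATTERNS)
```

Under that assumption, needs_converter("windows-1250") is true while needs_converter("ISO-8859-2") is not.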

LO does not bundle libiconv on Windows; on Linux the bundled
libxml2 will pick up iconv on the system, on Mac we use the
system libxml2 which should have iconv support.

this problem only affects XSLT based filters which are not
very popular (mainly for legacy MSO 2003 XML formats and XHTML).

hmm... i don't think that adding support for obscure encodings
used in obscure formats only on Windows is a good use of resources.

as a workaround there is the "old" XSLT import filter based
on Saxon which presumably (since this is a regression) supported
more encodings, now available as an extension from

https://github.com/dtardon/xslt2-transformer
Comment 18 David Tardon 2013-05-28 18:08:40 UTC
Another possibility is to convert the XML to UTF-8 by an external tool. (Of course, that implies that the user knows it is the encoding that causes the failure. Patches for detection of that situation and improvement of the error message welcome.)
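
As an illustration of that workaround, the conversion could look roughly like this in Python. It is a sketch under stated assumptions: the source encoding is ASCII-compatible, it is named in the XML declaration, and Python knows a codec for it. The function name is made up.

```python
import re

# Captures the text before, inside, and after the declared encoding name.
DECL_ENCODING_RE = re.compile(
    rb'(<\?xml[^>]*?encoding\s*=\s*["\'])([A-Za-z][A-Za-z0-9._-]*)(["\'])'
)


def convert_to_utf8(data):
    """Re-encode an XML document as UTF-8 and update its declaration.

    Documents without a declared encoding, or already declared as UTF-8,
    are returned unchanged.
    """
    m = DECL_ENCODING_RE.match(data)
    if m is None:
        return data
    source = m.group(2).decode("ascii")
    if source.lower() in ("utf-8", "utf8"):
        return data
    # The declaration itself is plain ASCII, so only the rest needs recoding.
    # decode() raises LookupError if Python has no codec for the declared name.
    rest = data[m.end():].decode(source).encode("utf-8")
    return m.group(1) + b"UTF-8" + m.group(3) + rest
```

Reading the input in binary mode and writing the result in binary mode avoids any second, implicit conversion by the file layer.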
Comment 19 Urmas 2013-05-28 19:01:21 UTC
Yes, because people have too much free time on their hands which they can use to convert their valid documents between encodings to pleasure your shitty software.
Comment 20 Joel Madero 2013-05-28 19:02:35 UTC
Urmas - you have been politely warned once about language, insulting, etc...
Comment 21 David Tardon 2013-05-28 19:32:46 UTC
(In reply to comment #19)
> Yes, because people have too much free time on their hands which they can
> use to convert their valid documents between encodings to pleasure your
> shitty software.

<sarcasm>Thank you for your constructive response. This is exactly the reaction I expected from you.</sarcasm>

In case you have not noticed it, this project is open source. If you have problems with our decision to not waste time on something that we consider a marginal problem, you are free to fix it yourself and send a patch. Or you can go over to xmlsoft.org and try to convince Daniel, in your typical diplomatic way, to add internal support for cp1250 into libxml2.

Btw., this is perfectly valid behavior for an XML processor. The only encodings required by XML 1.1 are UTF-8 and UTF-16 (see section 2.2 of the standard).
Comment 22 Anton Derbenev 2013-05-29 06:54:23 UTC
(In reply to comment #17)
> as a workaround there is the "old" XSLT import filter based
> on Saxon which presumably (since this is a regression) supported
> more encodings, now available as an extension from
> 
> https://github.com/dtardon/xslt2-transformer

Great, I asked dtardon to upload pre-compiled version to extension-center.

This would be the second bug I'm working around with an extension (the other is with encodings in BIFF5 / Excel 95).

Next step will be to deploy the extension company-wide.
Comment 23 Anton Derbenev 2013-07-08 09:13:19 UTC
It seems doomed: xslt2-transformer did not help.

To relieve poor users' pain, I've written a small AutoHotkey script to convert files.



#NoEnv

If %0%
    Loop %0%
	ConvertFiles(%A_Index%)
Else {
    FileSelectFile srcFilesList, M,, Files to convert, XML files (*.xml)
    Loop Parse, srcFilesList, `n
    {
	If A_Index = 1
	    srcDir=%A_LoopField%\
	Else
	    ConvertFiles(srcDir . A_LoopField)
    }
}
    
Exit

ConvertFiles(srcMask) {
    Loop %srcMask%
    {
	LoopFileDir=
	If A_LoopFileDir
	    LoopFileDir=%A_LoopFileDir%\
	dstName=%LoopFileDir%%A_LoopFileName%.UTF-8.xml
	srcName=%A_LoopFileFullPath%
	
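	; Read the source as Windows-1251 (change CP1251 to match your files).
	FileEncoding CP1251
	FileRead srcXML, %srcName%

	; Write UTF-8 without a BOM and update the XML declaration to match.
	FileEncoding UTF-8-RAW
	StringReplace srcXML, srcXML, encoding="Windows-1251", encoding="UTF-8"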

	IfExist %dstName%
	{
	    If Not ReplaceSilently
	    {
		MsgBox 35, Saving the processed file, File already exists`, replace?`n`n%dstName%
		IfMsgBox Cancel
		    Exit

		IfMsgBox No
		    continue

		; IfMsgBox Yes 
		If Not AskedOnce
		{
		    AskedOnce=1
		} Else {
		    AskedOnce=-1
		    MsgBox 36, Saving processed files, Replace all files without further prompts?
		    IfMsgBox Yes
			ReplaceSilently=1
		}
	    }

	    FileDelete %dstName%
	}
	FileAppend %srcXML%, %dstName%
    }
}
Comment 24 Urmas 2013-09-10 13:57:25 UTC
*** Bug 69163 has been marked as a duplicate of this bug. ***
Comment 25 Urmas 2013-09-17 16:53:25 UTC
*** Bug 59788 has been marked as a duplicate of this bug. ***
Comment 26 Urmas 2013-09-27 10:14:26 UTC
*** Bug 50012 has been marked as a duplicate of this bug. ***
Comment 27 Maxim Monastirsky 2013-11-19 13:34:53 UTC
*** Bug 71782 has been marked as a duplicate of this bug. ***
Comment 28 Maxim Monastirsky 2013-11-20 11:11:37 UTC
*** Bug 71831 has been marked as a duplicate of this bug. ***
Comment 29 David Tardon 2014-07-18 15:52:48 UTC
It is possible to build libxml2 with ICU support as an alternative to iconv. Let's see if I can make this work.
Comment 30 Kevin Suo 2014-07-18 15:57:22 UTC
Good to hear that :-)
Then I move this to MAB4.2 instead, as MAB4.0 is closed already.
Comment 31 Michael Stahl (allotropia) 2014-07-18 16:03:13 UTC
if this were really a "most annoying" bug it would hardly have been WONTFIX'ed in the first place
Comment 32 tommy27 2014-07-18 17:06:19 UTC
I do not wanna contest your opinion but I see quite a bunch of duplicates from many different users, maybe the WONTFIX was done because of technical incompatibilities.
Comment 33 David Tardon 2014-07-20 08:28:08 UTC
(In reply to comment #32)
> I do not wanna contest your opinion but I see quite a bunch of duplicates
> from many different users, maybe the WONTFIX was done because of technical
> incompatibilities.

The same solution was possible at that time, but while this was only breaking the XSLT filters--which are effectively unmaintained anyway--nobody cared enough to look if there actually might be a solution. Now this started breaking other filters too and that is unacceptable :-)
Comment 34 David Tardon 2014-07-20 08:28:55 UTC
*** Bug 81461 has been marked as a duplicate of this bug. ***
Comment 35 Commit Notification 2014-07-20 08:30:19 UTC
David Tardon committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=7515b1a90fac9e31733c0fdcc1156adadf0e6f99

fdo#63756 build libxml2 with ICU support



The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.
Comment 36 Commit Notification 2014-07-21 08:58:36 UTC
David Tardon committed a patch related to this issue.
It has been pushed to "libreoffice-4-3":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=23b4b764ade82cf3a5835a7b7f35fb5e45cd6cc9&h=libreoffice-4-3

fdo#63756 build libxml2 with ICU support


It will be available in LibreOffice 4.3.1.

Comment 37 Michael Stahl (allotropia) 2014-07-23 22:05:19 UTC
the fix breaks the installation of Java based extensions on Windows.

reason is that URE/bin/uno.exe cannot load URE/bin/javavmlo.dll, which is linked against URE/bin/libxml2.dll, which is linked against program/icuucd53.dll and of course URE binaries don't have program dir on path.
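
The chain can be illustrated with a toy model. The directory and file names come from this comment; the lookup rule (search only the listed directories) is a deliberate simplification of how Windows resolves DLLs, and the function name is hypothetical.

```python
# Toy model of the install layout described above.
FILES = {
    "URE/bin": {"uno.exe", "javavmlo.dll", "libxml2.dll"},
    "program": {"icuucd53.dll"},
}
DEPENDS_ON = {
    "uno.exe": ["javavmlo.dll"],
    "javavmlo.dll": ["libxml2.dll"],
    "libxml2.dll": ["icuucd53.dll"],
    "icuucd53.dll": [],
}


def unresolved(binary, search_dirs):
    """Return transitive dependencies of `binary` not found in `search_dirs`."""
    missing, seen, stack = [], set(), [binary]
    while stack:
        current = stack.pop()
        for dep in DEPENDS_ON.get(current, []):
            if dep in seen:
                continue
            seen.add(dep)
            if any(dep in FILES.get(d, set()) for d in search_dirs):
                stack.append(dep)  # found, so follow its own dependencies
            else:
                missing.append(dep)
    return missing
```

In this model, unresolved("uno.exe", ["URE/bin"]) reports icuucd53.dll as missing, and adding "program" to the search directories resolves everything.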
Comment 38 Commit Notification 2014-07-23 22:41:57 UTC
Michael Stahl committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=057613c6864204ac5c09260e93a8f14cc9768b90

icu: un-break installation of Java extensions on Windows (rel. fdo#63756)



Comment 39 Commit Notification 2014-07-24 12:32:12 UTC
Michael Stahl committed a patch related to this issue.
It has been pushed to "libreoffice-4-3":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=3012156bab9dc0504a61fa7062f8e7cbd677bad4&h=libreoffice-4-3

icu: un-break installation of Java extensions on Windows (rel. fdo#63756)


It will be available in LibreOffice 4.3.1.
