Bug 82418 - FILEOPEN: CSV (Text) Import defaults to UTF-16 resulting in garbled text and can freeze LibreOffice
Status: RESOLVED WONTFIX
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc
Version (earliest affected): 4.3.0.4 release
Hardware: Other Linux (All)
Importance: medium normal
Assignee: David Tardon
URL:
Whiteboard: BSA target:4.4.0 target:4.3.2
Keywords:
Duplicates: 84338 88909 93021
Depends on:
Blocks:
 
Reported: 2014-08-10 12:51 UTC by Tom
Modified: 2015-12-10 12:28 UTC

See Also:
Crash report or crash signature:


Attachments
- screenshots, bash scripts to generate CSV samples, gdbtrace.log (67.18 KB, application/gzip), 2014-08-10 12:51 UTC, Tom
- Sample CSV file 100 rows of data, 11 columns, ASCII (2.39 KB, text/csv), 2014-08-11 17:49 UTC, Tom
- Sample CSV file 10k rows of data, 11 columns, ASCII (252.88 KB, text/csv), 2014-08-11 17:49 UTC, Tom
- Single line with 'aaa' string, ASCII, 131072 characters long (128.00 KB, text/csv), 2014-08-17 18:48 UTC, Tom
- Single line with 'aaa' string, ASCII, 65535 characters long (64.00 KB, text/csv), 2014-08-17 18:48 UTC, Tom
- Single line with 'aaa' string, UTF-16 with BOM, 65535 characters long (128.00 KB, text/csv), 2014-08-17 18:49 UTC, Tom
- Screenshot with LO 4.3.3.2 (see comment 15) (102.47 KB, image/png), 2014-10-30 19:06 UTC, Jim Avera

Description Tom 2014-08-10 12:51:28 UTC
Created attachment 104381
screenshots, bash scripts to generate CSV samples, gdbtrace.log

Problem description: 

CSV/text data import on a fresh installation defaults to UTF-16 (at least in my case), resulting in garbled text in the data preview. This is a change in LibreOffice (LO) behaviour from previous releases and can be rather confusing to less savvy users. Also, if you accidentally okay the dialogue, LO will try to import the file this way: small files freeze LO for a considerable amount of time, and large files may freeze it indefinitely.

Steps to reproduce:
1. Create some test csv files (see attachment, you can use gensample1.sh, gensample2.sh scripts to create 100 row and 10k row files)
2. Open samplecsv-100.csv (2.4K) with LO
3. If the character set is UTF-16, the field preview will be garbled (see attachment)
4. Change the character set to UTF-8: the field preview will be as expected
5. Change back to UTF-16 and press OK: the file will be imported, but it will take at least a couple of seconds (for a 2.4K file), and the imported data will be crammed into a single cell as a 'random' string
6. Now try to open the samplecsv-10k.csv (253K) file
7. Check the field preview with UTF-16 and UTF-8
8. Change back to UTF-16 and press OK
9. Have some tea, go out, go on holiday, alternatively pkill -15 soffice.bin ;)

Does it mean that there is no sanity check on what LO is trying to import? Is LO attempting to import the whole 253K file as a single cell?

Expected behaviour:
- Default 'Character set' should be set to UTF-8, maybe 'System'?
- Text import should do some (more) sanity checks before trying to actually import the data
- There should be a progress bar and there must be a possibility to abort the import (if the import freezes, all currently opened LO windows/documents will freeze too)

Problem impairs:
usability / user experience, may freeze LibreOffice


Operating System: Linux (Other)
Version: 4.3.0.4 release
Last worked in: 4.1.3.2 release
Comment 1 penttila 2014-08-11 16:38:34 UTC
On LinuxMint 17 Cinnamon, LO 4.2.4.2 text import still defaults to Unicode (UTF-8) in Calc. As far as I can see, this dialogue is also used by Writer, which may explain the change of this default value.

On the other hand in the small 'preview window' - at the bottom of import dialogue - you can easily verify or test the outcome!
Comment 2 Tom 2014-08-11 17:45:21 UTC
Thanks penttila, it could be a recent change to the default value. I hope someone else can also confirm this, and I will try to get hold of a PC running Windows to see what the current behaviour of text import is there.

I appreciate that there is the field preview and personally I am quite okay with that. However, the problem I see here is twofold:
1. Usability / (fresh) user experience
2. No sanity check on imported data

Usability: don't get me wrong, the text import in LO is a really powerful and useful tool, and I am glad we have it. However, I believe that less savvy users just want to click 'OK' and have the spreadsheet pop up so they can work on it. Both you and I are familiar with the interface, and we know that if text looks corrupted, it could be an encoding issue. But many fresh or less savvy users will think either that the file is corrupted or that there is a problem with Calc/Writer/LO, or, worst case, they may just okay the dialogue without paying much attention to the preview. As a result the whole application will freeze: there is no way to abort the import, no progress bar, and no warning that something is not quite right.

The second issue is more serious: importing text data, even with an incorrect encoding, should never cause the software to hang or crash. LO attempted to cram the small file into a single cell. I guess the freeze happens because, when you try to open the large file, LO is trying to do just that: import 129k 2-byte characters into a single cell.
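One plausible way to see why the preview is garbled and the rows collapse into one cell is to decode a tiny ASCII CSV as UTF-16 by hand. The snippet below is only a sketch: it assumes little-endian byte order for BOM-less input (the report does not say which order LO assumes) and it does not reproduce LO's import code.

```python
# Sketch: decode a tiny ASCII CSV as UTF-16 (little-endian assumed, no BOM).
ascii_csv = b"a,b\nc,d\n"  # 8 bytes: two rows, two columns

# Every pair of ASCII bytes is fused into a single 16-bit code unit.
misread = ascii_csv.decode("utf-16-le")

print(len(misread))                      # half as many characters as bytes
print([hex(ord(ch)) for ch in misread])  # code points of the fused characters
print("," in misread, "\n" in misread)   # are any separators left?
```

Because the comma and newline bytes are absorbed into unrelated 16-bit characters, a parser working on the decoded text finds no field or row separators, which matches the "single cell of 'random' string" result described in the report.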
Comment 3 Tom 2014-08-11 17:49:00 UTC
Created attachment 104452
Sample CSV file 100 rows of data, 11 columns, ASCII
Comment 4 Tom 2014-08-11 17:49:34 UTC
Created attachment 104453
Sample CSV file 10k rows of data, 11 columns, ASCII
Comment 5 Tom 2014-08-11 17:59:37 UTC
Hi penttila,

I have attached two simple CSV test files for convenience (one with 100 rows, the other with 10,000 rows).

Please would you mind trying to import these as UTF-16 into your LO 4.2.4.2 (please ignore the fact that the text is corrupted in the preview). Just be aware that with the large file you will probably need to kill the process.

I am just wondering if there is any difference between 4.2.4.2 and 4.3.0.4 in how this is handled.

Thank you
Comment 6 David Tardon 2014-08-13 14:33:30 UTC
So... We use the saved encoding if there is one. If there is not, we detect whether the file contains a UTF-16 or UTF-8 BOM (the majority of UTF-8 files do not contain a BOM, though). If that fails too, we use the system encoding (which is UTF-16 on Windows and UTF-8 everywhere else, I suppose).

I think we could add another special case: set encoding to UTF-8 if system encoding is UTF-16 (as that would have already been detected from the file itself). (Of course adding real encoding detection would be even better, but that is outside of the scope of this bug.)

Btw, this has not changed since 4.2 at all.
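The fallback order described in this comment, including the proposed special case, can be sketched as follows. This is an illustration only: `pick_default_encoding` and its parameters are hypothetical names, and the real logic lives in LibreOffice's C++ code, which is not shown in this report.

```python
import codecs
from typing import Optional


def pick_default_encoding(head: bytes, saved: Optional[str], system: str) -> str:
    """Hypothetical helper; mirrors the fallback order described above."""
    # 1. A previously saved encoding wins if there is one.
    if saved:
        return saved
    # 2. Otherwise look for a byte order mark at the start of the file.
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # 3. Fall back to the system encoding. Proposed special case: a real
    #    UTF-16 file would normally have been caught by its BOM in step 2,
    #    so prefer UTF-8 when the system encoding is UTF-16.
    if system.lower().startswith("utf-16"):
        return "utf-8"
    return system


# A BOM-less ASCII file on a system whose encoding is reported as UTF-16.
print(pick_default_encoding(b"a,b\n", None, "UTF-16"))
```

Note that step 1 still takes priority, so a stale saved encoding would override both the BOM check and the special case.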
Comment 7 David Tardon 2014-08-13 15:03:00 UTC
Okay, let's try this...
Comment 8 Commit Notification 2014-08-13 15:08:12 UTC
David Tardon committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=a8525fe5cf2ba834ae39e7bfe078911d94957a70

fdo#82418 prefer UTF-8 over UTF-16



The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.
Comment 9 Commit Notification 2014-08-13 20:07:55 UTC
David Tardon committed a patch related to this issue.
It has been pushed to "libreoffice-4-3":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=e11d89ce6404194f7d38c1e8e8f7af62297ca91b&h=libreoffice-4-3

fdo#82418 prefer UTF-8 over UTF-16


It will be available in LibreOffice 4.3.2.

Comment 10 Tom 2014-08-17 18:47:14 UTC
Thanks David,

Just to confirm I have tested it with the current daily build (2014-08-16) and the encoding now defaults to UTF-8 as expected.


Would you suggest opening separate cases for the other two issues?
1. Introducing encoding detection
2. Looking into the problem that LO may freeze on import of incorrectly formed and/or encoded text files

Regarding the second issue, I wanted to find out the string length limit for a single cell, so I created a text file with a 128KiB long 'a' string in ASCII (a-128K.csv).

On importing ASCII/UTF-8 LO warned me that "The data could not be loaded completely because the maximum number of characters per cell was exceeded".
Once the file is imported, the string length is reduced to 65535, which is what I would expect. I also created another file (a-64K.csv) with a 65535-character 'a' string, and this time LO imported it without any warning. So no issue here.

However, as you may guess, I also attempted to import the same ASCII-encoded file (a-64K.csv) as UTF-16; this time LO froze for good (I gave up waiting after about 5 minutes).

Then I did another test: after converting the file encoding to UTF-16 (a-64K-utf16.csv), the file opens correctly in the import dialogue, i.e. the encoding is set automatically to UTF-16 thanks to the correct BOM. Importing this file as UTF-16 works fine.
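For anyone who wants to repeat these tests without downloading the attachments, the three single-line files can be regenerated roughly like this. It is a sketch under assumptions: no trailing newline, and little-endian byte order for the UTF-16 file (neither is stated in the report); `make_samples` is a hypothetical helper name.

```python
import codecs
import tempfile
from pathlib import Path


def make_samples(out_dir: Path) -> dict:
    """Write the three single-line samples; return {filename: size in bytes}."""
    samples = {
        "a-128K.csv": ("a" * 131072).encode("ascii"),
        "a-64K.csv": ("a" * 65535).encode("ascii"),
        # Explicit BOM plus payload; the byte order is an assumption.
        "a-64K-utf16.csv": codecs.BOM_UTF16_LE + ("a" * 65535).encode("utf-16-le"),
    }
    for name, data in samples.items():
        (out_dir / name).write_bytes(data)
    return {name: len(data) for name, data in samples.items()}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name, size in make_samples(Path(tmp)).items():
            print(f"{name}: {size} bytes")
```

With these assumptions the UTF-16 file comes out at 131072 bytes (2-byte BOM plus 65535 two-byte characters), which is consistent with the 128.00 KB size listed for that attachment.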

However, if, while importing the true UTF-16 file (a-64K-utf16.csv), you change the encoding ('character set' drop-down) to, say, UTF-8, the 'text import' interface becomes unresponsive for at least a couple of seconds (3-4 seconds in my case). This repeats every time you change any of the parameters in the 'Separator Options' and/or 'Other Options' sections.
Moreover, if you OK the 'text import' dialogue it can take at least 8-10 seconds to import the file which doesn't feel right for 64KiB worth of data. Increasing the file size to 128KiB (2 rows of data) gives about 6-8 seconds freeze in the import dialogue, and around 15-20 seconds to import the file after hitting 'OK'.

I did this test, with same results, on:
4.1.3.2 release
4.3.0.4 release
daily build from 2014-08-16

Would you agree that something is not quite right here?
Shall I file these as separate bugs?
Comment 11 Tom 2014-08-17 18:48:17 UTC
Created attachment 104771
Single line with 'aaa' string, ASCII, 131072 characters long
Comment 12 Tom 2014-08-17 18:48:48 UTC
Created attachment 104772
Single line with 'aaa' string, ASCII, 65535 characters long
Comment 13 Tom 2014-08-17 18:49:33 UTC
Created attachment 104773
Single line with 'aaa' string, UTF-16 with BOM, 65535 characters long
Comment 14 David Tardon 2014-08-24 08:30:59 UTC
(In reply to comment #10)
> Thanks David,
> 
> Just to confirm I have tested it with the current daily build (2014-08-16)
> and the encoding now defaults to UTF-8 as expected.
> 
> 
> Would you suggest to open separate cases for the other two issues?
> 1. Introducing encoding detection

Yes, that would be nice. Btw, I should add that when I said "detection", I really meant "guess". Because that is the best that can be done.

> 2. Looking into the problem that LO may freeze on import of incorrectly
> formed and/or encoded text files
> 
> ...
> 
> Would you agree that something is not quite right here?

Yes, I would agree. What is not right is the expectation that a text file can be reliably opened using a different encoding. It more-or-less works for one-byte encodings: the worst thing that might happen is that some characters are not shown correctly. But, for multi-byte encodings, all bets are off.
Comment 15 Jim Avera 2014-10-30 19:05:44 UTC
Comment 9 says it should be in LibreOffice 4.3.2, but the problem is still there in 4.3.3. Can someone confirm whether or not the fix should be in 4.3.3?

Please see attached screen-shot showing UTF-16 as default charset.

P.S. I recently upgraded from Ubuntu 13.10 to 14.04 and started seeing the problem, so I suspect some interaction is at the root of this. However, the LANG* environment variables are set for UTF-8 (see screenshot).
Comment 16 Jim Avera 2014-10-30 19:06:36 UTC
Created attachment 108703
Screenshot with LO 4.3.3.2 (see comment 15)
Comment 17 petrelharp 2014-11-18 19:12:37 UTC
Some possibly irrelevant observations: in 4.3.1.2 (linux), the default encoding seems to be whatever was used last.  So, if I open a UTF-16 .csv, next time I try to open a UTF-8 .csv, it will look like gibberish, and conversely.

And, I don't know about automatically detecting decoding, but
  file data.csv
at the command line correctly tells me the encoding of the test files.  If it is not possible to tell for sure, then it would be better to e.g. check if most of the characters in the first few rows look more valid in either UTF-8 or UTF-16, or some such?  It would probably be easy to find a good set of characters that occur commonly when treating UTF-8 as UTF-16 but are unlikely to appear otherwise, and vice-versa.
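A rough version of the heuristic suggested here could look like the sketch below. It is not how `file` or LO works; `guess_utf8_or_utf16` is a hypothetical name, the 0.4 threshold is arbitrary, and the NUL-byte test only makes sense for text that is mostly ASCII/Latin.

```python
def guess_utf8_or_utf16(sample: bytes) -> str:
    """Hypothetical guesser: 'utf-16-le', 'utf-16-be', 'utf-8' or 'unknown'."""
    if not sample:
        return "unknown"
    pairs = max(len(sample) // 2, 1)
    even_nuls = sample[0::2].count(0)  # NUL bytes at offsets 0, 2, 4, ...
    odd_nuls = sample[1::2].count(0)   # NUL bytes at offsets 1, 3, 5, ...
    # Mostly-ASCII text stored as UTF-16 has a NUL in every other byte.
    if odd_nuls > even_nuls and odd_nuls / pairs > 0.4:
        return "utf-16-le"  # low byte first, zero high byte second
    if even_nuls > odd_nuls and even_nuls / pairs > 0.4:
        return "utf-16-be"  # zero high byte first
    try:
        sample.decode("utf-8")  # strict decoding rejects malformed sequences
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown"


print(guess_utf8_or_utf16("a,b\n".encode("utf-16-le")))
print(guess_utf8_or_utf16("naïve,café\n".encode("utf-8")))
```

A sample cut in the middle of a multi-byte UTF-8 character would be reported as 'unknown', so a real implementation would need to trim or tolerate an incomplete tail.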
Comment 18 David Tardon 2014-12-27 09:42:38 UTC
(In reply to petrelharp from comment #17)
> Some possibly irrelevant observations: in 4.3.1.2 (linux), the default
> encoding seems to be whatever was used last.  So, if I open a UTF-16 .csv,
> next time I try to open a UTF-8 .csv, it will look like gibberish, and
> conversely.

The last-used encoding is only restored if it was picked manually by the user. I guess we need a "Remember selected encoding" checkbox to be able to turn this behaviour off/on as needed, but that has nothing to do with this bug.

> 
> And, I don't know about automatically detecting decoding, but
>   file data.csv
> at the command line correctly tells me the encoding of the test files.

UTF-16LE/UTF-16BE/UTF-8 are easy. Now try it with an 8-bit encoding, e.g., one of the iso-8859-* or windows-12* families.
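To illustrate why 8-bit encodings are hard to tell apart, the sketch below decodes the same bytes with three single-byte codecs. The Hungarian sample word is an arbitrary choice for the example; every decode succeeds, so validity alone cannot pick the right one.

```python
# The same bytes are valid text in several 8-bit encodings.
raw = "árvíztűrő".encode("iso-8859-2")  # sample text, chosen for illustration

candidates = ("iso-8859-1", "iso-8859-2", "cp1252")
decoded = {name: raw.decode(name) for name in candidates}

for name, text in decoded.items():
    print(f"{name:12} {text}")
```

No decode raises an error, yet only one result is the intended word; distinguishing them needs language statistics or a dictionary, which is guessing rather than detection.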
Comment 19 Maxim Monastirsky 2015-07-30 11:47:56 UTC
*** Bug 88909 has been marked as a duplicate of this bug. ***
Comment 20 Maxim Monastirsky 2015-07-30 11:49:08 UTC
*** Bug 84338 has been marked as a duplicate of this bug. ***
Comment 21 Maxim Monastirsky 2015-07-30 11:50:46 UTC
*** Bug 93021 has been marked as a duplicate of this bug. ***
Comment 22 Maxim Monastirsky 2015-07-30 12:01:53 UTC
That's actually a regression of:

commit 13ae10691d1362cc9a7b3438b1fa392f6e5517eb
Author: Caolán McNamara <caolanm@redhat.com>
Date:   Thu Apr 24 18:34:46 2014 +0100

    make 'Unicode' less-attractive to pick vs UTF-8
    
    for people guessing the encoding of a .csv
    
    Change-Id: Ie1b0a51bd2beb60351c244f97583a48ce596fbcc

We're remembering the last selected charset by storing its index inside the listbox. And because the list is sorted, the index of the UTF-8 item was changed in the mentioned commit. So if someone has a pre-4.3 user profile, and used UTF-8 in the past, the stored index is 61, while in 4.3 and later the correct index should be 60, and 61 is for UTF-16.
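The effect can be reproduced in miniature. The lists below are hypothetical and heavily shortened (the real listbox has far more entries and the real indices are 61 and 60); they only show how a position saved against one ordering points at a different item once the ordering changes.

```python
# Hypothetical, shortened charset lists before and after the label change.
before = ["System", "Unicode", "Unicode (UTF-7)", "Unicode (UTF-8)"]
after = ["System", "Unicode (UTF-7)", "Unicode (UTF-8)", "Unicode (UTF-16)"]

# An old profile remembers the user's choice as a position in the list.
stored_index = before.index("Unicode (UTF-8)")
print("old profile stored index:", stored_index)
print("same index in new list  :", after[stored_index])

# For contrast: a stored name (or other stable identifier) survives reordering.
stored_name = "Unicode (UTF-8)"
print("lookup by name          :", after[after.index(stored_name)])
```

Switching the stored format to a stable identifier would itself change how existing profiles are read, which is the kind of compatibility cost this thread goes on to mention.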
Comment 23 Maxim Monastirsky 2015-12-10 12:28:42 UTC
Let's hope pre-4.3 user profiles aren't so common nowadays (and IMHO there is nothing that can be done here without breaking compatibility even further).