Bug 156507 - Ability to remove non-printing/"atypical" characters in a stretch of text
Status: RESOLVED WONTFIX
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice
Version (earliest affected): Inherited From OOo
Hardware: All All
Importance: medium enhancement
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: Paste
Reported: 2023-07-28 20:00 UTC by Eyal Rozenberg
Modified: 2023-08-31 12:50 UTC (History)

See Also:
Crash report or crash signature:


Description Eyal Rozenberg 2023-07-28 20:00:28 UTC
Stretches of text contain "usual" characters - the ones that one typically produces using the keyboard (not using a numeric charcode sequence): letters, punctuation marks, digits, spaces; but they may also contain more exotic characters, like directionality marks, non-breaking spaces, zero-width joiners, and other oddities. The latter are most often invisible, and it is thus almost impossible to know whether a stretch of text has them, other than by moving the cursor, one character at a time, over this stretch (and maybe even then you might miss some, I don't know).

When we open a document created by another person, or program - we might encounter such problematic stretches of text; the same is true when we copy some text from a rendered PDF document or another application which exposes text, but renders it under a lot of specific constraints.

It would be useful if we could select a piece of text and clean it of these "special" characters, leaving only the ones with a visible-glyph effect.

The question of which characters to delete is not trivial, and it could be either a wider or narrower set, or even several possible sets which the user chooses among. You (=readers of this bug) are kindly requested to help with suggestions regarding what could be the relevant set of characters we should allow purges of.
Comment 1 V Stuart Foote 2023-08-12 14:29:21 UTC
Understand the request, but not clear it is the right way to handle what are in essence fundamental components of Unicode based text runs within a PS that can be laid down by locale specific keyboard keysyms but also a wide range of IME.

Rather than needing a way to strip out NPC from a text selection (a span of word, sentence, paragraph length) instead I see this more as our mishandling of marking NPC which needs more work. 

So extend/continue the work done for bug 58434 where the table in attachment 112798 [details] remains a good reference.  

Unicode and IUC libs handling suggest that additional Unicode control codepoints need our NPC shading as a visual cue to their presence.
Comment 2 V Stuart Foote 2023-08-12 14:30:45 UTC
s/IUC/ICU
Comment 3 Eyal Rozenberg 2023-08-12 14:56:14 UTC
(In reply to V Stuart Foote from comment #1)

That's an interesting bug... because even though bug 58434 is marked as fixed, some/many non-printing characters aren't displayed with Ctrl+F10. Also, some characters are not mentioned in the table (https://bugs.documentfoundation.org/attachment.cgi?id=112798), e.g. RLO, LRO, RLE, LRE, PDF and that's just the ones I can mention off the top of my head. So, one wonders how that's marked as fixed.

Having said that - even if these characters were somehow marked - this feature would still be useful, because:

1. Many users would not be able to distinguish between the markings
2. Many users don't know what these characters mean, even if told their names.
3. The user might not want to retain all of these characters. In this sense, removing non-printing/non-keyboard-insertable/atypical characters is a bit like clearing direct formatting, or resetting capitalization to small caps etc. It's not something you necessarily want to do, but sometimes it's just the right tool.
Comment 4 Eyal Rozenberg 2023-08-12 14:58:31 UTC
As for alternatives/complements to my suggestion, we could also think of:

* Having an option to sanitize input filters of such characters.
* Ability to traverse just these characters with some kind of tracking dialog (possibly even via find-replace, but maybe a bespoke dialog).
Comment 5 Heiko Tietze 2023-08-16 13:37:08 UTC
This should be rather solved with the F&R dialog, ideally per regular expression. Any hard-coded procedure is likely to fail for some reason, however carefully it was implemented. And such a function requires some expertise about what "usual characters" are, depending on the document and language. My take: WF.
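
For illustration, the regular-expression approach suggested here can be sketched outside of LibreOffice in Python. The character class below is a hypothetical, deliberately narrow choice made only for this example (it is not a set proposed in this bug), and the equivalent pattern in the Find & Replace dialog has not been tested here.

```python
import re

# Hypothetical purge set, chosen only for illustration:
# soft hyphen, U+200B..U+200F (ZWSP, ZWNJ, ZWJ, LRM, RLM),
# U+202A..U+202E (bidi embeddings/overrides), word joiner,
# U+2066..U+2069 (bidi isolates) and the byte order mark.
INVISIBLE = re.compile(
    "[\u00ad\u200b-\u200f\u202a-\u202e\u2060\u2066-\u2069\ufeff]"
)

def strip_invisible(text: str) -> str:
    """Delete every character matched by INVISIBLE."""
    return INVISIBLE.sub("", text)

sample = "in\u200bvisible\u200e text\ufeff"
print(strip_invisible(sample))
```

Whether such a fixed set is appropriate depends on the script and language of the text, which is the point disputed in the rest of this thread.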
Comment 6 Eyal Rozenberg 2023-08-16 16:45:13 UTC
(In reply to Heiko Tietze from comment #5)
> This should be rather solved with the F&R dialog, ideally per regular
> expression.

That's only because you're thinking about the advanced users. I have actually not tried addressing this via F&R; but let's suppose this is doable. That still does not at all help solve the problem that more novice users are facing: They paste text from somewhere into LO and get a bunch of invisible characters which mess with their document's behavior and with keyboard navigation. We should do something to address that scenario - help the novices, and also the experts when doing something simple, so that they don't have to waste a lot of time on it.

> ... And such a function requires some expertise
> about what "usual characters" are, depending on the document and language.

It requires so much expertise, or rather - shallow expertise on a lot of kinds of characters, that even most advanced users won't have it. I'm kind of an advanced user, but I only know some of them. Only a Unicode expert (i.e. not an LO expert user...) would probably be able to do this correctly.


> Any hard-coded procedure is likely to fail for some reason, however
> carefully it was implemented.

I challenge that claim - considering that one can choose to either apply it or not apply it. We choose an expansive set of characters to remove; and those who want to keep some and remove some - those are the advanced users who can be left to take care of it "on their own". But - show me some characters which should both be in that standard set for scrubbing/removing, and at the same time have a good argument for not being in it.
Comment 7 ⁨خالد حسني⁩ 2023-08-24 09:15:53 UTC
Someone’s exotic character is another one’s essential part of the text. Removing ZWNJ from Persian text alters its meaning, removing ZWJ from Emoji sequences alters their meaning, removing Unicode Variation Selectors from CJK text alters its meaning. Removing BiDi control characters from text changes its intended rendering and possibly the meaning.

The notion here is fundamentally flawed, there is no such thing as exotic characters in a multilingual and multicultural piece of software, this is a very monolingual way of thinking.

Also, don’t copy text from PDF, that is your actual problem. PDF is not a text exchange format, despite what PDF stakeholders want to sell people. If you get “exotic” characters copying text from any other source, there is more than a 90% chance these are an essential part of the text.
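
The ZWJ and ZWNJ cases mentioned in this comment can be made concrete with a small Python sketch. The two sample strings and the `naive_strip` helper are assumptions introduced only for this illustration: a "woman technologist" emoji sequence (woman, ZWJ, laptop) and the Persian word می‌خواهم ("I want"), which is written with a ZWNJ after its prefix.

```python
ZWJ = "\u200d"   # zero-width joiner
ZWNJ = "\u200c"  # zero-width non-joiner

# Woman + ZWJ + laptop: rendered as a single emoji where supported.
technologist = "\U0001F469" + ZWJ + "\U0001F4BB"
# Persian word written with a ZWNJ between prefix and stem.
persian = "\u0645\u06cc" + ZWNJ + "\u062e\u0648\u0627\u0647\u0645"

def naive_strip(text: str) -> str:
    """Remove ZWJ and ZWNJ unconditionally (the risky approach)."""
    return text.replace(ZWJ, "").replace(ZWNJ, "")

for original in (technologist, persian):
    stripped = naive_strip(original)
    # The stripped string is shorter and no longer equal to the original:
    # the emoji falls apart into two glyphs, and the Persian letters
    # on either side of the removed ZWNJ may now join.
    print(len(original), len(stripped), original == stripped)
```

In both cases the characters removed are invisible, yet the visible result changes.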
Comment 8 V Stuart Foote 2023-08-24 17:27:41 UTC
Agree with Khaled as to the impact on text runs of the multitude of scripts that might be in use. Stripping NPC from text runs does not make sense.

Still think, as in comment 1, better approach is continued work of bug 58434 and expanded coverage for toggle display of non-printing and formatting Unicode via <Ctrl>+F10
Comment 9 Eyal Rozenberg 2023-08-24 18:56:15 UTC
(In reply to ⁨خالد حسني⁩ from comment #7)
> The notion here is fundamentally flawed, there is no such thing as exotic
> characters in a multilingual and multicultural piece of software, this is
> a very monolingual way of thinking.

It's not about what's not in a particular language. It's about what people don't generate with their keyboards. Yes, that could be different for every language; and if you can generate ZWNJ with a Persian keyboard - then ZWNJ is a poor example, or it should be considered atypical only in certain contexts.

> Also, don’t copy text from PDF, that is your actual problem. PDF is not a
> text exchange format, despite what PDF stakeholders want to sell people.

But users have to copy text from PDFs, because they are provided PDFs rather than original editable documents, and they need the text. (In fact, sometimes, they need to edit the PDF in place.)

Indeed, PDF is not a text exchange format, and PDF export code may and does add all sorts of non-printing characters, which users need to filter out. That's the motivating use case. 


(In reply to V Stuart Foote from comment #8)
> expanded coverage for toggle display of non-printing and formatting
> Unicode via <Ctrl>+F10

That is unlikely to help - unless you start indicating which non-printing character is which, and then you have something like a source view, or a few-characters-per-line view, which doesn't look like the original text.

I agree with Khaled's observation that the question of what to filter out is non-trivial and language-dependent. But it is a (somewhat) common use-case to want to perform such filtering.
Comment 10 ⁨خالد حسني⁩ 2023-08-24 20:18:17 UTC
If users understand what these characters are, they can use advanced search and replace to strip them, if they don’t understand what these are then we shouldn’t be giving them functionality that is more likely than not to do the wrong thing.

WONTFIX from me.
Comment 11 Eyal Rozenberg 2023-08-24 20:31:20 UTC
(In reply to ⁨خالد حسني⁩ from comment #10)

They understand enough to know they're getting some junk in addition to the text they want; keyboard traversal is weird, and the "junk" may be affecting rendering behavior.

Just because it's easier if users simply don't do this, does not mean that they don't (or that they shouldn't). You could argue that there is no reasonable way this could work; perhaps, but - I very much doubt it. While we have not demonstrated that there can be no reasonable choice of characters to filter (say, a language-specific or locale-specific choice) - we should entertain the possibility that it does exist. And assuming it exists - I believe we should offer it, to make this kind of work easier for users.
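
One possible shape for such a language-specific choice is sketched below in Python. Everything in it is hypothetical: the default purge set, the `KEEP_BY_LANGUAGE` table and the `purge` function are invented for illustration, and the single Persian exception is taken from the ZWNJ example in comment 7. It is a sketch of the idea, not a proposal for the actual sets.

```python
# Hypothetical default purge set: ZWSP, ZWNJ, ZWJ, LRM, RLM,
# word joiner and byte order mark.
DEFAULT_PURGE = frozenset("\u200b\u200c\u200d\u200e\u200f\u2060\ufeff")

# Hypothetical per-language exceptions (characters to keep).
KEEP_BY_LANGUAGE = {
    "fa": frozenset("\u200c"),  # Persian relies on ZWNJ (see comment 7)
}

def purge(text: str, language: str) -> str:
    """Drop default-set characters unless the language keeps them."""
    keep = KEEP_BY_LANGUAGE.get(language, frozenset())
    drop = DEFAULT_PURGE - keep
    return "".join(ch for ch in text if ch not in drop)

word = "\u0645\u06cc\u200c\u062e\u0648\u0627\u0647\u0645"
print(purge(word, "fa") == word)  # ZWNJ kept for Persian
print(purge(word, "en") == word)  # ZWNJ removed under the default set
```

A language tag alone is a coarse key; mixed-language paragraphs or emoji sequences inside otherwise plain text would still need contextual handling.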
Comment 12 V Stuart Foote 2023-08-25 02:32:35 UTC
(In reply to Eyal Rozenberg from comment #11)
> (In reply to ⁨خالد حسني⁩ from comment #10)
> 
> They understand enough to know they're getting some junk in addition to the
> text they want; keyboard traversal is weird, and the "junk" may be affecting
> rendering behavior.
> 
> Just because it's easier if users simply don't do this, does not mean that
> they don't (or that they shouldn't). You could argue that there is no
> reasonable way this could work; perhaps, but - I very much doubt it. While
> we have not demonstrated that there can be no reasonable choice of
> characters to filter (say, a language-specific or locale-specific choice) -
> we should entertain the possibility that it does exist. And assuming it
> exists - I believe we should offer it, to make this kind of work easier for
> users.

With <Ctrl>+F10 exposing NPC, an <Alt>+X toggle will show Unicode for the specific NPC at the text cursor--and then toggle it back.  Then knowing the Unicode, it is trivial to find/delete (or edit) via Find-Replace dialog.  It is not dynamic (requiring linear progression of codepoints being removed from the text) but it is already functional.
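
As a rough analogue of that inspect-then-remove workflow outside of LibreOffice, the Python sketch below lists the code point and name of every character in Unicode general category Cf (format characters) in a string. The function name and the choice of Cf as the filter are assumptions made for this example; Cf does not cover everything discussed in this bug (non-breaking spaces and variation selectors belong to other categories).

```python
import unicodedata

def list_format_chars(text: str):
    """Return (index, 'U+XXXX', name) for each Unicode 'Cf' character."""
    found = []
    for index, char in enumerate(text):
        if unicodedata.category(char) == "Cf":
            name = unicodedata.name(char, "<unnamed>")
            found.append((index, f"U+{ord(char):04X}", name))
    return found

sample = "abc\u200bdef\u202egh"
for entry in list_format_chars(sample):
    print(entry)
```

The resulting list of code points is what a user would then feed, one by one, into a find-and-replace step.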

Otherwise I don't see a need for providing a new dialog as a core capability as it is very much a corner case, and dev effort is not justified. 

More on point would be dev work to complete the residual NPC toggle exposure from bug 58434

And if the use case is only for parsing PDF--a new Writer paragraph oriented "reflow" implementation, replacing Justin's text box 'combine' based PDF import done for bug 32249, is really the ask. 

And implementing a new PDF parser/reflow would be the opportunity to selectively clean up any NPC or malformed text runs from PDF source.  While completing the per word /ActualText support of bug 117428 would improve the quality of our PDF exports (optionally, as it has a real performance and size cost). 

Either dupe this to bug 32249, or more simply it becomes => WF
Comment 13 Eyal Rozenberg 2023-08-25 08:45:58 UTC
(In reply to V Stuart Foote from comment #12)
> With <Ctrl>+F10 exposing NPC, an <Alt>+X toggle will show Unicode for the
> specific NPC at the text cursor--and then toggle it back.  Then knowing the
> Unicode, it is trivial to find/delete (or edit) via Find-Replace dialog.  It
> is not dynamic (requiring linear progression of codepoints being removed
> from the text) but it is already functional.

Well, Ctrl+F10 doesn't actually expose all non-printing characters, but if it did, and given Alt+X, and given that the user knows about both of them, and if we assume there is no need for applying contextual removal logic (which perhaps we should since not assuming that makes my ask more difficult too), then yes, this could be done by figuring out all relevant codepoints.

Ok, conceded, but those are quite a few IF's. And within the set of users who work with text from sources "sullied" with undesirable NPCs - the fraction who both know about Ctrl+F10 and Alt+X and would conceive of this removal process is low. I, for one, didn't even know Alt+X existed... is that on the menus anywhere?

And then there's forcing every user to figure this out and go through a rather complex procedure. It's a bit like removing the "distribute width evenly among columns" feature because it can be done manually with care...

But I will also concede that there is a legitimate doubt regarding the extent of potential use. My motivating use case is text coming from PDFs - which I claim is a significant case in terms of number of affected users. There may be other use cases, but that's already speculation.

This would not be solved, however, by PDF parsing - since even though the text may originally have come from a PDF - it does not necessarily come from there directly. I would hope to see something like this as an option during the pasting of text, and when opening a PDF. Thanks for the link to the bug regarding ActualText.
Comment 14 Heiko Tietze 2023-08-31 12:50:51 UTC
We discussed the topic in the design meeting and decided to resolve WF, as suggested.