Bug 32249 - Make it easier to edit text in imported PDFs
Summary: Make it easier to edit text in imported PDFs
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice
Version (earliest affected): 3.3.0 RC1
Hardware: All All
Importance: medium enhancement
Assignee: Not Assigned
URL: https://www.youtube.com/watch?v=ie7Jb...
Whiteboard:
Keywords:
Duplicates: 38084 84712 91896 93039 105274 119070 125838 151577 152143
Depends on:
Blocks: PDF-Import-Draw
Reported: 2010-12-08 22:51 UTC by grigoreflorin1985
Modified: 2024-03-20 12:25 UTC
CC List: 22 users

See Also:
Crash report or crash signature:


Attachments
PDF_import_testDoc.odg: exploring what combining textboxes could look like (21.61 KB, application/vnd.oasis.opendocument.graphics)
2019-06-27 18:38 UTC, Justin L
Draw-add-option-to-consolidate-multiple-textboxes.patch (15.71 KB, patch)
2019-07-03 13:02 UTC, Justin L
pdf test file for too many textboxes also in LO 7132 (3.54 MB, application/pdf)
2021-05-17 15:25 UTC, paulystefan

Description grigoreflorin1985 2010-12-08 22:51:29 UTC
When I import a PDF with text in it, I can only edit each paragraph separately, one by one. What I need and want (as all users would expect) is to edit all the paragraphs the way I normally edit an office document. Could there be an option to merge the paragraphs at import, so they can be edited easily from the start? That is, unify all paragraphs on a page so they are editable as a single body of text. It is time-consuming to click and edit every paragraph one at a time. I tried Union from the right-click menu on them and got a plain graphic whose text cannot be edited with the Text tool (the big T icon).

I hope this small option makes it into the final release.
Comment 1 Rainer Bielefeld Retired 2010-12-09 08:58:10 UTC
I have also sometimes wished for such a feature, but I'm afraid it would cost too much manpower. Compared to other needs, its Importance is definitely no higher than "Medium", and I doubt that it will ever be integrated.
Comment 2 Samuele Kaplun 2011-05-06 07:30:12 UTC
Hi,

I am a developer of digital library software and, with digital preservation in mind, I was thinking of using LibreOffice's wonderful PDF import filter to archive an .odt document next to each original .pdf (the .odt document should provide more value for future retrieval and use of the document).

Indeed I also find this a very nice feature to have, and it should be possible to implement it via some heuristic such as merging consecutive lines that are not too far from each other (e.g. no farther apart than the height of a character).
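
A minimal sketch of the kind of heuristic described above, not code from the LibreOffice import filter. The `Line` type, its fields, and the `max_gap_factor` parameter are hypothetical names chosen for illustration; it assumes each imported line comes with a top coordinate (increasing downward) and a character height.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical stand-in for one imported line of text."""
    text: str
    top: float      # y coordinate of the top edge, increasing downward
    height: float   # character height of the line


def merge_lines(lines, max_gap_factor=1.0):
    """Group lines into paragraphs.

    Two consecutive lines stay in the same paragraph when the vertical gap
    between them is at most max_gap_factor times the previous line's height.
    """
    paragraphs = []
    current = []
    for line in sorted(lines, key=lambda l: l.top):
        if current:
            prev = current[-1]
            gap = line.top - (prev.top + prev.height)
            if gap > max_gap_factor * prev.height:
                # Gap too large: close the current paragraph.
                paragraphs.append(" ".join(l.text for l in current))
                current = []
        current.append(line)
    if current:
        paragraphs.append(" ".join(l.text for l in current))
    return paragraphs
```

The threshold of one character height follows the suggestion in the comment; multi-column layouts would need the horizontal position taken into account as well, which this sketch ignores.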

If no one has time to work on it, I'd be glad to give it a try in my spare time, if someone could be so kind as to point me to the most appropriate source code files that would need to be touched.

Cheers!
Comment 3 Rainer Bielefeld Retired 2011-05-06 08:17:14 UTC
@Samuele Kaplun:
That would be great. 

I'm afraid that won't be easy. I do not know how it works on other OSes, but on Windows I have to install the "Oracle PDF Import Extension" from 
<http://extensions.services.openoffice.org/en/search/node/pdf import>, which itself, AFAIK, uses XPDF <http://foolabs.com/xpdf/about.html> as its text extractor.

That's all I can contribute.

BTW: Version is for the first version where the problem has been observed!
Comment 4 Samuele Kaplun 2011-05-06 08:34:08 UTC
(In reply to comment #3)
> @Samuele Kaplun:
> I'm afraid that won't be easy. I do not know how that works for other OS, but
> for WIN I have to install the "Oracle PDF Import Extension" from 
> <http://extensions.services.openoffice.org/en/search/node/pdf import>, what
> itself afaik uses XPDF <http://foolabs.com/xpdf/about.html> as text extractor.

From <http://www.libreoffice.org/features/extensions/> I understand that finally this extension is part of the core LibreOffice source tree. Is this so?

> That's all I can contribute.
> 
> BTW: Version is for the first version where the problem has been observed!

Sorry for this!! That makes perfect sense!
Comment 5 Rainer Bielefeld Retired 2011-05-06 10:50:21 UTC
> From <http://www.libreoffice.org/features/extensions/> I understand that
> finally this extension is part of the core LibreOffice source tree. Is this so?

I thought so, too, but for my 3.4 I definitely had to download the extension. Please see 
<https://bugs.freedesktop.org/show_bug.cgi?id=35604#c6>
Comment 6 Björn Michaelsen 2011-12-23 11:33:26 UTC Comment hidden (obsolete)
Comment 7 Rainer Bielefeld Retired 2011-12-23 23:33:18 UTC
It was NEW for good reasons. But the question is whether there is a realistic chance of getting this enhancement.
Comment 8 vilpan 2013-05-01 18:41:14 UTC
*** Bug 38084 has been marked as a duplicate of this bug. ***
Comment 9 sophie 2014-10-09 11:21:36 UTC
*** Bug 84712 has been marked as a duplicate of this bug. ***
Comment 10 QA Administrators 2014-10-23 17:31:40 UTC Comment hidden (obsolete)
Comment 11 Gerry 2015-04-23 16:57:59 UTC
I can confirm this bug in LO 4.4.2.2. on Windows 7
Comment 12 Jean-Baptiste Faure 2015-07-01 18:06:35 UTC
*** Bug 91896 has been marked as a duplicate of this bug. ***
Comment 13 Hendrik Maryns 2015-11-22 08:38:23 UTC
Is there no bug voting?  This is a major turn-off!
Comment 14 m_a_riosv 2017-01-13 09:30:38 UTC
*** Bug 105274 has been marked as a duplicate of this bug. ***
Comment 15 m_a_riosv 2017-01-13 09:31:54 UTC
*** Bug 93039 has been marked as a duplicate of this bug. ***
Comment 16 V Stuart Foote 2017-01-13 14:05:25 UTC
LibreOffice has provided functional filter import of PDF into Draw (default Open action), and into Impress and Writer or also Draw by import filter selection.

With each filter selected, the rendering to respective document canvas follows the structure of the document as recorded within the PDF and text elements are rendered into styled Text box or Frames. 

The PDF filter(s) do not "reflow" text into Paragraph objects. That would require a very complex treatment of the PDF structure to reliably extract syntax and layout--at the expense of fidelity rendering the PDF document.

Replacing or supplementing the PDF filters to provide "reflow" back into paragraphs is seen as out-of-scope for the project as we are not a PDF editor.

The core PDF filters and function are sufficient to our needs of high rendering fidelity.

This is fertile ground for an extension.
Comment 17 Justin L 2019-06-21 18:29:39 UTC
*** Bug 125838 has been marked as a duplicate of this bug. ***
Comment 18 Justin L 2019-06-27 18:38:15 UTC
Created attachment 152450 [details]
PDF_import_testDoc.odg: exploring what combining textboxes could look like

I agree with Stuart's conclusion that monkeying with import to make larger textboxes would be disastrous. So I only see one reasonable option and that is a function that allows a user to combine selected textboxes into one textbox.

However, the results won't be pretty. Each character attribute change (size, bold, font, etc.) becomes a separate textbox, and there is no way to identify whether that ends the paragraph or not, although some content analysis guesswork could approximate the majority of cases I guess. In any case, a LOT of cleanup would be needed to reformat the text, since each character run is treated as a separate paragraph and all paragraph spacing information is missing.
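
A rough sketch of the "content analysis guesswork" mentioned above, under loud assumptions: the function names are hypothetical, the input is just the text of each imported run in reading order, and the rule (a run ends a paragraph if it ends with sentence-final punctuation and the next run starts with an uppercase letter) is deliberately crude. It is not what the Consolidate Text patch does, and it drops all character attributes.

```python
import re

# Sentence-final punctuation, optionally followed by closing quotes/brackets.
SENTENCE_END = re.compile(r"[.!?:][\"')\]]*$")


def ends_paragraph(run: str, next_run: str) -> bool:
    """Guess whether `run` is the last run of a paragraph."""
    run = run.rstrip()
    nxt = next_run.lstrip()
    if not run or not nxt:
        return True
    return bool(SENTENCE_END.search(run)) and nxt[0].isupper()


def consolidate(runs):
    """Join runs into paragraphs using the guess above."""
    paragraphs = []
    current = ""
    for i, run in enumerate(runs):
        current = f"{current} {run.strip()}".strip()
        is_last = i == len(runs) - 1
        if is_last or ends_paragraph(run, runs[i + 1]):
            if current:
                paragraphs.append(current)
            current = ""
    return paragraphs
```

This rule over-splits whenever a sentence boundary happens to coincide with a run boundary inside a paragraph, which is exactly the sort of case where manual cleanup would still be needed.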

The other option is to force the user to create their own textbox and copy/paste the text from the PDF itself, but in that case all the character properties are lost. So there does still seem to be an advantage of consolidating textboxes into one, even if many excess paragraph markers need to be deleted.
Comment 19 V Stuart Foote 2019-06-27 23:56:42 UTC
(In reply to Justin L from comment #18)
> ... So I only see one reasonable option and that
> is a function that allows a user to combine selected textboxes into one
> textbox.
> 

Yes, agree that would be an acceptable way to handle PDF source text runs extracted from BT/ET blocks, or where /ActualText annotation is present.

But why first extract the text runs into Draw Text boxes, and then merge them into one or more non-formattable Draw Text boxes? Seems like a different filter import of the PDF text runs is needed.

Dumping the strings out to a Writer Paragraph object, either in bulk or interactively, would be more functional.  And text runs dumped into a Paragraph object, would allow assignment of direct formatting or style, with text validation and word and line break cleanup.
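
As an illustration of the "word and line break cleanup" step, here is a small sketch that joins extracted line strings into one paragraph string and undoes end-of-line hyphenation. The function name `reflow` and the rule used (a trailing hyphen followed by a lowercase letter is treated as a split word) are assumptions for the example, not part of any LibreOffice filter.

```python
import re


def reflow(lines):
    """Join text lines into one paragraph string.

    A line ending in '-' followed by a line starting with a lowercase
    letter is treated as a word hyphenated across the line break.
    """
    text = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if text.endswith("-") and line[0].islower():
            text = text[:-1] + line      # drop the hyphen, join the halves
        elif text:
            text += " " + line
        else:
            text = line
    # Collapse any remaining runs of whitespace.
    return re.sub(r"\s+", " ", text)
```

Genuine compounds that happen to break at the hyphen (for example "well-" / "known") would be joined incorrectly by this rule, so a dictionary check or user validation would still be needed.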

A more efficient UI could probably evolve if this were done as a pop-out dialog for picking the Draw Text box snippets, but one could also spin up a full Writer session and do the same.

More often than not, folks simply want to reflow the text strings back into their lexicographically correct sequence without too great a concern as to original formatting of the source document generating the PDF.

We can't do that with much fidelity to the original source--so why bother?

Our other 'pdfium' based "insert" filter provides the text runs to document canvas with good fidelity to the original layout. Though the object "break" there has similar issues to the 'poppler' based import filter for text handling.
Comment 20 Justin L 2019-06-28 05:24:00 UTC
(In reply to V Stuart Foote from comment #19)
> But why first extract the text runs into Draw Text boxes? Seems like a
> different filter import of the PDF text runs is needed.
Yes, that sounds like it would be perfect, but 100x more complex to code.
Comment 21 Justin L 2019-07-03 13:02:54 UTC
Created attachment 152531 [details]
Draw-add-option-to-consolidate-multiple-textboxes.patch

(In reply to Justin L from comment #18)
> I only see one reasonable option and that is a function that
> allows a user to combine selected textboxes into one textbox.

Proposed patch: https://gerrit.libreoffice.org/75043
    Draw: add option to consolidate multiple textboxes into one
Comment 22 Justin L 2019-07-22 05:11:36 UTC
(In reply to Justin L from comment #21)
This patch has landed in LO 6.4. Please use bug 118370 to follow the implementation of a "Shapes - Consolidate Text" function that gives the user a tool to combine multiple textboxes into one.

For this bug report, let's keep the discussion to the bigger request to add a "text content focused" PDF import, rather than a layout focused import as discussed in comment 19.
Comment 23 paulystefan 2021-05-17 15:25:52 UTC
Created attachment 172095 [details]
pdf test file for too many textboxes also in LO 7132

LO 7.1.3.2 win64

Automatic detection of blocks should be improved.
It is better than before, but there is room for improvement.

The selection of fonts and their sizes could also be improved.

Perhaps AI is the solution here for choosing the best fonts and sizes and for detecting blocks.
Comment 24 paulystefan 2021-08-05 20:36:20 UTC
Shapes - Consolidate Text changes the position of the text.

In most cases the text needs more space in height and width.

This also happens with LO 7.2.0.2.
Comment 25 paulystefan 2022-06-16 12:30:54 UTC
Too many text boxes are still created, and characters in other fonts are positioned incorrectly, in LO 7.3.4.2.

Version: 7.3.4.2 (x64) / LibreOffice Community
Build ID: 728fec16bd5f605073805c3c9e7c4212a0120dc5
CPU threads: 8; OS: Windows 10.0 Build 19044; UI render: Skia/Raster; VCL: win
Locale: en-US (de_DE); UI: de-DE
Calc: CL

One improvement is the language detection for spell checking.
In the example PDF, English is now detected immediately.

In 7.2.7.2 there are only red wavy underlines under the text, because it is checked against the primary language, German.

Version: 7.2.7.2 (x64) / LibreOffice Community
Build ID: 8d71d29d553c0f7dcbfa38fbfda25ee34cce99a2
CPU threads: 8; OS: Windows 10.0 Build 19044; UI render: Skia/Raster; VCL: win
Locale: de-DE (de_DE); UI: de-DE
Calc: CL
Comment 26 m_a_riosv 2022-10-17 02:27:50 UTC
*** Bug 151577 has been marked as a duplicate of this bug. ***
Comment 27 m_a_riosv 2022-10-17 02:34:42 UTC
Nowadays it's possible to join the text of boxes on the same page.
Select the boxes, right-click, and choose Consolidate Text.
Comment 28 Eyal Rozenberg 2022-10-17 08:23:05 UTC
I'm the author of bug 151577. 

I want to bring up the question of whether there should be a single bug about this issue for importing into Writer and into Draw.

In Draw, we expect drawing objects which can be manipulated independently - although paragraph-level rather than line-level boxes would indeed be preferable whenever applicable. In Writer, however, we would like long continuous runs of text, across paragraphs and pages, which aren't drawing objects at all. 

Also, the code for the two input filters, while similar, is different: They're two independent filters.

Finally, I would say that while in Draw this issue may be considered as an enhancement - in Writer it is a proper bug: The current Writer import filter produces what is essentially a Draw document - a bunch of disconnected drawing objects on separate pages - opened in Writer.

What say you? :-)
Comment 29 Eyal Rozenberg 2022-10-17 08:33:07 UTC
(In reply to m.a.riosv from comment #27)
> Nowadays there it's possible to join the text for the boxes on the same page.
> Select the boxes and righ-click Consolidate text.

... but only in Draw, it seems. Just filed bug 151598 about having it in Writer as well.
Comment 30 Eyal Rozenberg 2022-10-17 08:58:51 UTC
(In reply to V Stuart Foote from comment #16)
> Replacing or supplementing the PDF filters to provide "reflow" back into
> paragraphs is seen as out-of-scope for the project as we are not a PDF
> editor.


A reflow is often not necessary. That is, the text in a contiguous paragraph without changes to the formatting is saved in a single object stream. So, what actually happens is that our import filters _artificially_ break up the text into lines.

Also, LibreOffice is actually a PDF editor: It satisfies the dictionary definition [1] of an editor for PDFs, and is used by many to edit PDFs. True, it does not directly manipulate the structure of PDFs - it imports-from and exports-to PDFs - but that is also the case for OOXML documents and many other formats - and we still consider LO an editor for those. Certainly, LO may not be the ideal software platform for editing PDFs, but there's no reason it couldn't be a half-decent editor for not-very-complex PDFs. I've recently had this discussion with Stuart on bug 151552.

For this reason, improving the editability of imported PDFs, e.g. by importing text as complete paragraphs, is entirely within the scope of the project.


[1] : https://www.dictionary.com/browse/editor
Comment 31 V Stuart Foote 2022-10-17 22:04:08 UTC
*** Bug 151607 has been marked as a duplicate of this bug. ***
Comment 32 V Stuart Foote 2022-11-20 19:10:33 UTC
*** Bug 152143 has been marked as a duplicate of this bug. ***
Comment 33 Eyal Rozenberg 2022-12-22 18:53:27 UTC
Note this recent (and relatively popular) YouTube video by "The Linux Experiment", decrying several usability issues with modern Linux distributions. In the section on PDFs, it explains how, to edit a PDF, you use LibreOffice Draw. Then it goes on to complain about how bad it is as an editor, and particularly: How the text is broken up into separate lines.

The (very common) use-case described in the video: Signing a PDF by adding a signature image to it.

https://www.youtube.com/watch?v=0re63X2nY0s

This illustrates that, at the bottom line, LO is a PDF editor, and in fact, is _the_ PDF editor for users who aren't experts in locating software.

It also illustrates how addressing this issue will make both LO and FOSS desktop environments more attractive to users.
Comment 34 V Stuart Foote 2022-12-23 00:33:29 UTC
(In reply to Eyal Rozenberg from comment #33)

>...
> This illustrates that, at the bottom line, LO is a PDF editor, and in fact,
> is _the_ PDF editor for users who aren't experts in locating software.
> 
> It also illustrates how addressing this issue will make both LO and FOSS
> desktop environments more attractive to users.

Nope! It again illustrates the bottom line that PDF (ISO 32000-1:2008, or 2:2020) is NOT an editable format, it is a presentation/publication format. 

Also, it demonstrates reality that LibreOffice is not a PDF editor as we will only ever read content of a PDF to filter import to an ODF XML compliant document canvas.

Fidelity of the filter import varies (poppler to sd Drawing objects, or pdfium as image/meta streams as image to a vcl canvas), but in no sense do we do more than read from PDF.

Export/print from ODF module to PDF is then a completely different process with a different set of export filters.

And it highlights the project's need to scrupulously manage user expectations reinforcing that PDF is not an editable format, and that LibreOffice is NOT a PDF "editor".

Improvements can be made to LO filter handling as a PDF reader to import content--witness the adoption of pdfium libs for the insert as image filter paths.

But simply put, the internals of the presentation optimized text runs within PDF do not support extraction with the lexical syntax of the original source document from which a PDF was generated.

We can provide tools to better organize results of the import filters, reformating them into either paragraph objects or sd draw objects--but there are very real limits to what the project can or should do.
Comment 35 Eyal Rozenberg 2022-12-23 10:56:12 UTC
(In reply to V Stuart Foote from comment #34)
> Nope! It again illustrates the bottom line that PDF (ISO 32000-1:2008, or
> 2:2020) is NOT an editable format, it is a presentation/publication format. 

People need to edit PDFs all the time - hence its featuring prominently in a video describing common tasks which need catering to by desktop apps. You get a PDF of a form - typically scanned or printed from a word processor - and you need to put text and/or a signature on it. That's PDF editing, and millions of people do it every day. Ok, maybe not millions every day, let's say millions every week.

> Also, it demonstrates reality that LibreOffice is not a PDF editor as we
> will only ever read content of a PDF to filter import to an ODF XML
> compliant document canvas.

Nobody said LO needs to represent the PDF structure as-is and perform surgical edits. In that sense, LO isn't a .doc and .docx editor either: It only ever reads their contents via an import filter; and it is certainly a .doc and .docx editor. But - we've had this argument already. Why are you repeating a rebutted point?

> And it highlights the project's need to scrupulously manage user
> expectations reinforcing that PDF is not an editable format, and that
> LibreOffice is NOT a PDF "editor".

You keep saying that, despite it having been demonstrated to you both in principle and empirically that it is. What LO needs to manage is perhaps people's insistence on sticking their heads in the sand and ignoring an important use of our suite. I'll bet you there are more people using LO as a PDF editor than users of LO Base, for example. (No offense to the LO Base folks!)

But anyway, let's focus on the practicality and the scope of this bug.

> Improvements can be made to LO filter handling as a PDF reader to import
> content--witness the adoption of pdfium libs for the insert as image filter
> paths.

That's a step in the right direction - as was the resolution of bug 104597. But there's a very long way to go.

> But simply put, the internals of the presentation optimized text runs within
> PDF do not support extraction with the lexical syntax of the original source
> document from which a PDF was generated.

That's true, and we can never hope to restore what's not saved in a PDF. But:

1. We can avoid losing the information and styling that _is_ represented in the PDF, so that importing-then-saving would result in a PDF with no noticeable distortions, or almost none. At least - for PDFs of typical documents which don't use the more esoteric features of PDFs. Of course the PDF's internal structure will likely show a lot more differences, but the observed result will be pleasing.

2. We can use reasonable assumptions to constitute paragraphs, define styles, have structural elements/features like columns, tables, annotations, comments, etc. Yes, each of these may be a lot of work and nobody expects this to happen overnight, but if we set this as an explicit goal and have some development resources assigned to working towards that goal then things will gradually improve. By the way, this is mostly, even if not entirely, orthogonal to making sure we don't mess up the PDF on import-then-export.

3. For the specific case of LO being the originator of the PDF, we could consider - and that is out-of-scope here I suppose - embedding auxiliary information into the PDF which allows for perfect or near-perfect reconstitution of the original LO document.


> but there are very real limits to what the project can or should do.

Certainly, but these limits depend in part on what the project defines as a goal or an important feature. Recognition of the use of LO as a PDF editor rather than its denial will allow for setting these limits farther.
Comment 36 V Stuart Foote 2022-12-23 17:42:55 UTC
(In reply to Eyal Rozenberg from comment #35)

> 3. For the specific case of LO being the originator of the PDF, we could
> consider - and that is out-of-scope here I suppose - embedding auxiliary
> information into the PDF which allows for perfect or near-perfect
> reconstitution of the original LO document.
> 

Actually we already provide that with the source ODF inserted as a data stream to our PDF export--we call it a "Hybrid PDF". 

There is, though, some potential to improve the visibility of the ODF data stream beyond LibreOffice's own handling, as in bug 95328, e.g. by making the ODF a proper PDF attachment.
Comment 37 wtambellini 2023-03-01 04:10:22 UTC
cf
https://bugs.documentfoundation.org/show_bug.cgi?id=153888

On our side, we are not interested in a PDF editor, so the question of whether LO is or should be a PDF editor is not relevant to me. That being said, Eyal is right to say many users like that feature (importing/converting PDF into ODF), and I can testify that many surely install LO for that powerful feature (PDF import). 

Anyway, our need is just importing/converting a PDF into the ODF formats with decent preservation of the original formatting/fonts/visuals. We are not interested in UI editing, just the ability to retouch some text in the ODF file by simply editing the content XML files inside it.
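
To illustrate the kind of retouching described above: an ODF file is a ZIP package whose main body is stored in `content.xml`, so a text edit can be scripted with the Python standard library alone. The sketch below is an assumption-laden example, not a recommended tool: the function name is hypothetical, it does a plain string replacement on the XML-escaped text, and it only works when the text to replace sits inside a single text node (which, given the fragmented text boxes discussed in this bug, often will not be the case).

```python
import io
import zipfile
from xml.sax.saxutils import escape


def replace_text_in_odf(odf_bytes: bytes, old: str, new: str) -> bytes:
    """Return a copy of an ODF package with `old` replaced by `new`
    in content.xml. All other package members are copied unchanged."""
    out_buf = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(odf_bytes)) as src, \
            zipfile.ZipFile(out_buf, "w") as dst:
        # Iterate in the original order so that 'mimetype' stays first.
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "content.xml":
                xml = data.decode("utf-8")
                # Compare against the escaped form, as stored in the XML.
                xml = xml.replace(escape(old), escape(new))
                data = xml.encode("utf-8")
            # Passing the ZipInfo keeps each member's compression method.
            dst.writestr(item, data)
    return out_buf.getvalue()
```

A usage example would read the bytes of an .odg or .odt file, call the function, and write the result to a new file; a more robust version would parse the XML and edit text nodes rather than the raw string.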
Comment 38 Eyal Rozenberg 2023-03-01 20:31:47 UTC
(In reply to wtambellini from comment #37)
> Anyway, our need is just importing/converting a pdf into the odf formats
> with decent conservation of the original formatting/font/visual. We are not
> interested by UI editing but just the capability to retouch some texts in
> the odf file, by simply retouching the contents xml files in the odf file.

Ok, but - this particular bug is specifically about editing text. Naturally, better, more accurate import allows in turn for easier/better editing, but still. I've commented on the other bug. 


Actually, this last sentence makes me think that maybe PDF-Import-Draw and PDF-Import-Writer should be blocking this bug rather than the other way around. What do you all think?
Comment 39 Eyal Rozenberg 2023-03-01 22:09:46 UTC
(In reply to Eyal Rozenberg from comment #38)
And another question to the CC list members: I believe this bug may need to be split up, because there are several possible "asks": 

1. Constitute sub-line-level text runs into full-line text boxes
2. Constitute line-level/sub-line-level text runs into single-paragraph text boxes
3. Constitute multiple paragraphs into single text boxes - e.g. all paragraphs in a frame or contiguous in the body of a page
4. Put text into the document body, unless there is cause to place it in a separate box. (And this can still maintain the separation into pages, by having page breaks at the end of each page).
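
As a hedged illustration of the first "ask" in the list above (sub-line-level runs into full lines), the sketch below groups runs that share approximately the same baseline and orders them left to right. The `Run` type, its fields, and the `tolerance` parameter are hypothetical; nothing here reflects how the existing import filters represent text.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical stand-in for one sub-line text run from the importer."""
    text: str
    x: float         # left edge of the run
    baseline: float  # y coordinate of the text baseline


def runs_to_lines(runs, tolerance=1.0):
    """Group runs whose baselines differ by at most `tolerance` into lines."""
    lines = []  # list of (baseline of first run in the line, [runs])
    for run in sorted(runs, key=lambda r: (r.baseline, r.x)):
        if lines and abs(run.baseline - lines[-1][0]) <= tolerance:
            lines[-1][1].append(run)
        else:
            lines.append((run.baseline, [run]))
    return [
        " ".join(r.text for r in sorted(group, key=lambda r: r.x))
        for _, group in lines
    ]
```

Joining with a single space is a simplification: runs that split in the middle of a word would need the horizontal gap between them checked before inserting a space. Asks 2 to 4 would build on the output of a step like this.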

Right now the bug title is very vague, not even specific to text run consolidation, while the discussion is a bit all over the place.
Comment 40 wtambellini 2023-03-02 01:22:27 UTC
Splitting ("divide and conquer") is a good idea Eyal and I see you did it so congrats.
Tks
Comment 41 Hendrik Maryns 2023-03-20 08:39:50 UTC
(In reply to Eyal Rozenberg from comment #39)
> (In reply to Eyal Rozenberg from comment #38)
> And another question to the CC list members: I believe this bug may need to
> be split up, because there are several possible "asks": 
> 
> 1. Constitute sub-line-level text runs into full-line text boxes
> 2. Constitute line-level/sub-line-level text runs into single-paragraph text
> boxes
> 3. Constitute multiple paragraphs into single text boxes - e.g. all
> paragraphs in a frame or contiguous in the body of a page
> 4. Put text into the document body, unless there is cause to place it in a
> separate box. (And this can still maintain the separation into pages, by
> having page breaks at the end of each page).
> 
> Right now the bug title is very vague, not even specific to text run
> consolidation, while the discussion is a bit all over the place.

I do not fully understand what you write above, but I think you are restating in technical terms what I want: to import a PDF and be able to copy and edit the text.  Keeping the layout intact is of secondary importance.
Comment 42 Eyal Rozenberg 2023-04-06 15:04:27 UTC
Linked to an overview by "The Linux Experiment" (a YouTube channel/vlog with 254K subscribers) on working with PDFs on Linux:

https://www.youtube.com/watch?v=ie7Jb1KiIBM

Notes about the video:

* LO is the first app presented for working with PDFs other than viewers
* ... followed by Inkscape.
* LO (and Inkscape) are said to "kind of suck" in editing PDFs. And still, they are what's suggested as FOSS PDF editors.
* Only LO Draw is recognized in the video, i.e. even a person who has spent some time researching the subject has not noticed that PDFs can be opened in Writer.
* Different utilities are suggested for different specific PDF-related tasks, like annotation, stamping an image, rearranging/removing pages etc.
* The bottom line is that if you _really_ want to edit your PDF, you'll need to go with a commercial app.
Comment 43 Stéphane Guillou (stragu) 2024-03-20 12:25:57 UTC
*** Bug 119070 has been marked as a duplicate of this bug. ***