Bug 153888 - Very bad formatting when importing pdf
Summary: Very bad formatting when importing pdf
Status: RESOLVED DUPLICATE of bug 99746
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Draw
Version (earliest affected): 7.4.5.1 release
Hardware: All All
Importance: medium enhancement
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: PDF-Import-Draw
Reported: 2023-02-28 20:21 UTC by wtambellini
Modified: 2023-03-01 22:21 UTC (History)

See Also:
Crash report or crash signature:


Attachments
input.pdf (48.95 KB, application/pdf), 2023-02-28 20:22 UTC, wtambellini
libreoffice (269.16 KB, image/png), 2023-02-28 20:23 UTC, wtambellini
draw7.5 (280.25 KB, image/png), 2023-02-28 20:38 UTC, wtambellini
inputpdf1.6 (43.61 KB, application/pdf), 2023-02-28 20:43 UTC, wtambellini
sample doc with the 1.6 and 1.4 sample PDFs as pdfium filter imports (5.23 MB, application/vnd.oasis.opendocument.graphics), 2023-03-01 02:48 UTC, V Stuart Foote

Description wtambellini 2023-02-28 20:21:14 UTC
Description:
Importing a PDF produces very bad formatting.
See the attached input PDF.
See the attached screenshot of the import result.
 

Steps to Reproduce:
1. Open the attached PDF with LibreOffice Draw
2. Observe the very messy formatting


Actual Results:
messy formatting, bad text box positions, ...

Expected Results:
Text box positions at least similar to those in the original PDF


Reproducible: Always


User Profile Reset: Yes

Additional Info:
We are a professional company, so we are open to contracting someone to fix it ASAP.
Comment 1 wtambellini 2023-02-28 20:22:24 UTC
Created attachment 185649 [details]
input.pdf
Comment 2 wtambellini 2023-02-28 20:23:42 UTC
Created attachment 185650 [details]
libreoffice
Comment 3 wtambellini 2023-02-28 20:38:35 UTC
Created attachment 185651 [details]
draw7.5
Comment 4 wtambellini 2023-02-28 20:43:44 UTC
Created attachment 185652 [details]
inputpdf1.6
Comment 5 wtambellini 2023-02-28 20:55:31 UTC
It doesn't matter whether the PDF standard is 1.3, 1.4, 1.5 or 1.6; the import is bad in every case.
I have uploaded a 1.6 PDF file.
Comment 6 V Stuart Foote 2023-03-01 02:48:46 UTC
Created attachment 185656 [details]
sample doc with the 1.6 and 1.4 sample PDFs as pdfium filter imports

First thing, LibreOffice is *NOT* a PDF editor.

That said there are two import filter paths for handling the PS markup embedded in PDF. 

One, using the pdfium libs, will fully parse PDF pages, rendering each full page as a raster image of appropriate size and scale with very high fidelity to the source PDF.  The attachment is a two-page ODG Draw document with each of the two "input" PDFs inserted on its own page.

The second, a C++ PDF filter parses the individual elements described in the PDF and lays them out (to Draw, Writer, or Impress canvas depending on filter selected) as sets of drawing objects--Text boxes, shapes, grids, raster images.

The text, taken from PS text runs that have no contextual syntax, will be assembled into drawing object text boxes.  Fonts that are not installed on the system, or that have an unrecognizable name in the PDF, will receive some other fall-back font at some arbitrary size.  Multiple text box objects will be assembled into single text box runs--but beyond that there is no reference to the source material used to prepare the PDF.  In Draw, text runs from those individual text boxes can be "consolidated" (see bug 32249) into a single text box, and the resulting text formatted, or copied and pasted into a proper paragraph object depending on need.  The point is that, for this filter import, the content LibreOffice extracts from the PDF is not intended to have high fidelity to the original source used to generate the PDF.
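
To illustrate what "consolidating" text runs could look like, here is a minimal Python sketch. It is not LibreOffice's filter code: the `TextRun` record, the `consolidate` function and the `line_gap_factor` threshold are all hypothetical, and the only assumption is that each extracted run carries a position and a font size. Runs are sorted top to bottom and joined into one paragraph until the vertical gap to the previous line gets noticeably larger than the font size.

```python
from dataclasses import dataclass


@dataclass
class TextRun:
    """Hypothetical record for one extracted line of text."""
    text: str
    x: float     # left edge, in points
    y: float     # baseline position, in points (growing downward)
    size: float  # font size, in points


def consolidate(runs, line_gap_factor=1.5):
    """Join runs into paragraphs.

    A new paragraph starts when the vertical distance to the previous
    run exceeds line_gap_factor times the previous run's font size.
    """
    paragraphs = []
    current = []
    prev = None
    for run in sorted(runs, key=lambda r: (r.y, r.x)):
        if prev is not None and run.y - prev.y > line_gap_factor * prev.size:
            paragraphs.append(" ".join(current))
            current = []
        current.append(run.text.strip())
        prev = run
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

A real importer would likely also need to look at indentation, font changes and column layout; the single gap threshold only shows the basic idea.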

You can have high fidelity with the pdfium based filter, or you can extract some percentage of the PDF content and render to drawing shape/text--but you can't do both with LibreOffice.
Comment 7 V Stuart Foote 2023-03-01 02:52:25 UTC
Take your pick of the BZ issues from META bug 99746 as to the import filter rendering of PDF PS elements as ODF drawing object on document canvas.

If you need fidelity, insert. If you need to edit, look elsewhere: LibreOffice is not a PDF editor, and cannot be made into one that provides greater fidelity to the source for any PDF document.
Comment 8 V Stuart Foote 2023-03-01 02:58:38 UTC
There are potential improvements to the import filter making use of the pdfium libs and writing that out to ODF vector graphics (SVG text and shapes on a skia canvas and taking those elements into ODF text and drawing shapes), but that would be a major refactoring of the pdfio filter framework for what is not a core requirement of LibreOffice.

LibreOffice is *NOT* a PDF editor.
Comment 9 wtambellini 2023-03-01 04:30:21 UTC
Tks for the reply Stuart.
1 Thanks for the ODG/image conversion, but we are not interested in such a solution: it does not allow retouching the text afterwards, as these are just pixel images.

2 I have verified that the fonts used in the PDF are available, so the misformatting is not due to missing fonts.

3 Globally, the LO board has to clarify its position on PDF import: if you don't want users to import PDFs, or don't even want to support the PDF importer, then remove PDF import from LO. If you keep PDF import, then it is legitimate for users to ask for a decent import/conversion to ODF.

4 As of today, the issue is sadly not resolved: importing PDF to ODF/ODG simply does not preserve formatting. We are not asking for a PDF editor, just for the import algorithm(s) to be fixed. If you are not interested in fixing the importer, then just ignore this ticket rather than closing it, so that other contributors can comment and we can move forward. In other words, could you please leave the ticket open?
Comment 10 Eyal Rozenberg 2023-03-01 08:19:40 UTC
(In reply to wtambellini from comment #9)
> Tks for the reply Stuart.

Don't thank Stuart - he is gaslighting you. It has already been demonstrated to Stuart that LO _is_ a PDF editor, with his arguments to the contrary refuted. 

In short: LibreOffice can open/import PDF files, let the user make changes to their contents, and save/export PDF files - just like it can do with DOCX files for example. The difference is that the importation is not very good, and improving it requires a lot of coding effort and applying heuristics to reconstitute structure which is often lost when creating PDFs.

LibreOffice is, in fact, one of the most popular PDF editors - as probably millions of people use it to edit PDFs. It's just not a very good PDF editor.

It is highly inappropriate that Stuart continues to go around telling people the opposite - effectively trying to suppress bug reports about the aspects of PDF handling.

I'm linking to the bug where this discussion was held.
Comment 11 V Stuart Foote 2023-03-01 12:36:25 UTC
@Heiko, please take this to ESC. Getting tired of rehashing resolved issues with Eyal and others regards the scope of our filter handling of PDFs source. We do what we can with what is an "un-editable" format. It is untenable for the project to suggest otherwise.
Comment 12 Heiko Tietze 2023-03-01 13:49:43 UTC
(In reply to V Stuart Foote from comment #11)
> @Heiko, please take this to ESC. Getting tired of rehashing resolved issues
> with Eyal and others regards the scope of our filter handling of PDFs
> source. We do what we can with what is an "un-editable" format. It is
> untenable for the project to suggest otherwise.

The developers don't care about naming it "un-editable" or "badly working". I totally agree with your POV, but that does not spare us from fixing import issues, provided they are reported in an actionable way, which is not the case here. Something like "word wrap not correctly applied" could work (to my knowledge we cannot read the font from PDF and it will never be pixel-perfect aka not an editor; PDF is a lossy format).

As for the UX I don't see what we could contribute.

@Eyal, please keep an open and friendly tone in comments.
Comment 13 Eyal Rozenberg 2023-03-01 19:24:10 UTC
(In reply to V Stuart Foote from comment #11)
> @Heiko, please take this to ESC. Getting tired of rehashing resolved issues
> with Eyal and others regards the scope of our filter handling of PDFs
> source. We do what we can with what is an "un-editable" format. It is
> untenable for the project to suggest otherwise.

Again with that tired argument? We've already been through this. LO does not need to be able to represent a PDF exactly and perfectly to be used as a PDF editor. Nor is it expected to be able to import PDFs perfectly. It is expected to make a decent effort and not fail on straightforward PDF contents.

Now, raster rendering is not relevant; this is LibreOffice, not GIMP or krita. You're using that as a straw-man argument: "If you want decent rendering, there's the raster option". No. The import filter is simply deficient and could stand to improve a lot, until it reaches a point where you could question the marginal utility of further work on it.

Specifically, on the sample PDF:

1. The overall scale _may_ be a bit off. It's very different from what GNOME document viewer shows at 100% zoom, and a little different from what Inkscape shows. So, maybe a bug and maybe not but merits looking into

2. The fonts are mis-identified. The body font is Minion Pro - Bold and also Italic; the footer is Myriad Pro; the subtitle is Myriad; the title seems to be Myriad Pro Semibold Condensed. I have all of these on my system - but none of them are used. Now, there may be an issue with "Myriad Pro" vs "MyriadPro" - but that's definitely a heuristic that the PDF import filter, or LO in general, should be applying when looking for fonts - and it isn't.
 
3. The paragraphs are broken up into individual lines. And if one might say "oh, but we can't know that it's a single paragraph" - well, Inkscape knows, when opening the file for editing, so LO can know this too. I'm not even sure it involves any heuristics though I might be wrong.
The paragraphs on the left should be justified, but aren't. 

4. The widths of text runs are very far from what they are in the PDF. Even if LO is not certain it has found the right font, it should make an effort to adjust, using: Font width & height/size adjustments; inter-character spacing; inter-word spacing. And specifically:

4.1. The paragraphs should be justified. But - it's debatable whether this should be counted as a problem, since the fundamental problem is forming them

4.2. The centered text in the frame is not centered at the horizontal center of the frame.

4.3. Text in the frame exceeds the edges of the frame, while in the PDF it does not. That's the worst problem with text run dimensions.
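
As a rough illustration of the adjustments mentioned in item 4, the arithmetic could look like the Python sketch below. The function names are hypothetical and this is not how the LibreOffice filter works today; it only assumes that the importer knows the width a run occupies in the PDF (`target_width`) and the width the same text gets with the substituted font (`natural_width`).

```python
def fit_char_spacing(text, natural_width, target_width):
    """Extra spacing to add after every character except the last
    so that the run spans target_width (same unit as the widths).
    A negative result means the characters must be tightened."""
    gaps = len(text) - 1
    if gaps <= 0:
        return 0.0
    return (target_width - natural_width) / gaps


def fit_horizontal_scale(natural_width, target_width):
    """Horizontal scaling factor for the font, as an alternative
    to changing the inter-character spacing."""
    if natural_width <= 0:
        raise ValueError("natural_width must be positive")
    return target_width / natural_width
```

For example, a five-character run that is 50 pt wide with the fallback font but 46 pt wide in the PDF would need -1 pt of spacing per gap, or a horizontal scale of 0.92.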


All that being said - there are other bugs which cover some of these issues. So, what I would suggest that William do is look for them (through the meta-bug); link them as related to this bug; open specific bugs for the issues that don't already have open bugs; add this PDF as an attachment for those new issues; then finally close this bug as a dupe of bug 99746, because that one is enough to track the individual issues.

I will try and find time to help with some of this.
Comment 14 Eyal Rozenberg 2023-03-01 22:13:36 UTC
So:

* Consolidating text runs into full-paragraph or larger objects: Bug 32249
* Font family mis-identification in PDFs incl. variants : Bug 143095
* Font family mis-identification generally, w.r.t. spaces vs no-spaces: Bug 153907
* Alignment recognition: Bug 49705

I think that covers it, resolving as PDF-Import-Draw. Reporter, if I've missed something, feel free to reopen AFAIAC.
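
For the spaces vs no-spaces matching tracked in bug 153907, a name-normalizing lookup could be sketched in Python as follows. This is an illustration only, not the LibreOffice implementation: `normalize_font_name` and `find_installed_font` are hypothetical helpers, and the list of installed fonts is assumed to be available as plain family names. The one PDF-specific detail used is that subset fonts carry a prefix of six uppercase letters followed by "+".

```python
import re


def normalize_font_name(name):
    """Lowercase the name and drop spaces, hyphens and underscores."""
    return re.sub(r"[\s\-_]+", "", name).lower()


def find_installed_font(pdf_font_name, installed_fonts):
    """Return the installed font whose normalized name matches the
    PDF font name, or None if there is no match."""
    # Strip a subset tag such as "ABCDEF+" from the PDF font name.
    base = re.sub(r"^[A-Z]{6}\+", "", pdf_font_name)
    wanted = normalize_font_name(base)
    for candidate in installed_fonts:
        if normalize_font_name(candidate) == wanted:
            return candidate
    return None
```

With this, a PDF font named "MyriadPro" would resolve to an installed "Myriad Pro". Style suffixes such as "-Bold" are not split off here and would need separate handling.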

*** This bug has been marked as a duplicate of bug 99746 ***
Comment 15 wtambellini 2023-03-01 22:21:21 UTC
Tks gentlemen.