Bug 159078 - Support for Apache Parquet input for Calc and Base
Summary: Support for Apache Parquet input for Calc and Base
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc
Version (earliest affected): 7.6.4.1 release
Hardware: All All
Importance: medium enhancement
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: Format-Filters
Reported: 2024-01-09 06:19 UTC by Simon Aubert
Modified: 2024-01-11 01:56 UTC (History)
CC List: 4 users

See Also:
Crash report or crash signature:


Attachments
sample 1 (2.86 KB, application/octet-stream)
2024-01-09 11:54 UTC, Xisco Faulí
sample 2 (12.34 MB, application/octet-stream)
2024-01-09 11:54 UTC, Xisco Faulí
sample 3 (39.08 KB, application/octet-stream)
2024-01-09 11:55 UTC, Xisco Faulí

Description Simon Aubert 2024-01-09 06:19:48 UTC
Description:
Hello all,
Apache Parquet is becoming more and more popular in the data community. It comes from Big Data on Hadoop but is now supported by all major software vendors. Its main benefit is that it's fast... however, sometimes I would just like to open it like a table file.
Best regards,
Simon

Actual Results:
You can't connect to an Apache Parquet File

Expected Results:
You can read an Apache Parquet File


Reproducible: Always


User Profile Reset: No

Additional Info:
None
Comment 1 Mike Kaganski 2024-01-09 07:07:37 UTC
https://qa.blog.documentfoundation.org/2023/11/22/qa-dev-report-october-2023/

> 40. Kohei Yoshida upgraded liborcus and added support for conditional loading of
> Apache Parquet files into Calc
Comment 2 Buovjaga 2024-01-09 07:09:08 UTC
Already done for Calc for version 24.2 with b14583ba37a6d7ce398ccd3cf339f954785b03d8

Kohei: anything to comment here? About Base, what about loading the file into Calc and using the Calc file as a data source in Base?
Comment 3 Buovjaga 2024-01-09 07:13:09 UTC
Also, maybe the info could be added to https://wiki.documentfoundation.org/ReleaseNotes/24.2#Orcus-based_filters ?
Comment 4 Simon Aubert 2024-01-09 08:08:41 UTC
Hello. First of all, thanks for the fast answer. I tried to open some samples from https://www.tablab.app/datasets/sample/parquet and it doesn't seem to work and Apache Parquet doesn't seem to be present in the file formats that are available.

Version: 7.6.4.1 (X86_64) / LibreOffice Community
Build ID: e19e193f88cd6c0525a17fb7a176ed8e6a3e2aa1
CPU threads: 8; OS: Windows 10.0 Build 19045; UI render: Skia/Raster; VCL: win
Locale: fr-FR (fr_FR); UI: fr-FR
Calc: threaded

Best regards,

Simon
Comment 5 Xisco Faulí 2024-01-09 08:31:57 UTC
(In reply to Simon Aubert from comment #4)
> Hello. First of all, thanks for the fast answer. I tried to open some
> samples from https://www.tablab.app/datasets/sample/parquet and it doesn't
> seem to work and Apache Parquet doesn't seem to be present in the file
> formats that are available.
> 
> Version: 7.6.4.1 (X86_64) / LibreOffice Community
> Build ID: e19e193f88cd6c0525a17fb7a176ed8e6a3e2aa1
> CPU threads: 8; OS: Windows 10.0 Build 19045; UI render: Skia/Raster; VCL:
> win
> Locale: fr-FR (fr_FR); UI: fr-FR
> Calc: threaded
> 
> Best regards,
> 
> Simon

You should try with LibreOffice 24.2. You can download it from https://es.libreoffice.org/descarga/libreoffice/?type=rpm-x86_64&version=24.2.0
Comment 6 Mike Kaganski 2024-01-09 11:07:06 UTC
(In reply to Buovjaga from comment #2)
> Already done for Calc for version 24.2 with
> b14583ba37a6d7ce398ccd3cf339f954785b03d8

The commit states explicitly that the support depends on "orcus has been built with the parquet import filter enabled". With current master on Windows, I can't open files from the resource mentioned in comment 4.
Comment 7 Simon Aubert 2024-01-09 11:47:31 UTC
So I tried with 24.2, and it still doesn't work: Parquet is still absent from the supported file formats.

Am I missing something?

Best regards,

Simon
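
One way to rule out a bad download before blaming the import filter is to check that a sample really is a Parquet file. The Parquet format puts the 4-byte marker "PAR1" at both the start and the end of an unencrypted file. The following is a minimal sketch under that assumption; the function name `looks_like_parquet` is hypothetical and not part of LibreOffice or orcus.

```python
from pathlib import Path

# 4-byte marker found at both the start and the end of an unencrypted Parquet file
PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    data_path = Path(path)
    # Smallest possible layout: magic + 4-byte footer length + magic = 12 bytes.
    if data_path.stat().st_size < 12:
        return False
    with data_path.open("rb") as handle:
        head = handle.read(4)
        handle.seek(-4, 2)  # 2 == os.SEEK_END: jump to the last four bytes
        tail = handle.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

A file that passes this check but still does not open points at the application build rather than at the sample. The check only inspects the markers, so it does not prove the file is otherwise valid.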
Comment 8 Xisco Faulí 2024-01-09 11:54:26 UTC
Reproduced in

Version: 24.8.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: 60150ef4b8fc1d0a30f20c3d9ed6ba0725da16a5
CPU threads: 8; OS: Linux 6.1; UI render: default; VCL: gtk3
Locale: es-ES (es_ES.UTF-8); UI: en-US
Calc: threaded
Comment 9 Xisco Faulí 2024-01-09 11:54:37 UTC
Created attachment 191819
sample 1
Comment 10 Xisco Faulí 2024-01-09 11:54:48 UTC
Created attachment 191820
sample 2
Comment 11 Xisco Faulí 2024-01-09 11:55:00 UTC
Created attachment 191821
sample 3
Comment 12 Buovjaga 2024-01-09 14:02:27 UTC
(In reply to Xisco Faulí from comment #8)
> Reproduced in
> 
> Version: 24.8.0.0.alpha0+ (X86_64) / LibreOffice Community
> Build ID: 60150ef4b8fc1d0a30f20c3d9ed6ba0725da16a5
> CPU threads: 8; OS: Linux 6.1; UI render: default; VCL: gtk3
> Locale: es-ES (es_ES.UTF-8); UI: en-US
> Calc: threaded

So did you build orcus with the parquet filter enabled?
Comment 13 Kohei Yoshida 2024-01-11 01:41:03 UTC
Allow me to give you guys some clarification...

In the current state on the master branch, the internal orcus is built without the parquet filter support.  The change referenced by the commit only introduces all necessary hooks to enable Parquet support when orcus is built with the parquet filter enabled, but that commit itself is not adequate to load parquet files.

Now, to enable parquet filter in orcus, you first need to build the Apache Arrow library since that becomes orcus's new dependency.  And to build the Apache Arrow library, you need to build the libraries that the Arrow library itself depends on.  Depending on how many features of Parquet you want to enable (Parquet can support multiple compression algorithms), you may need to build a few extra libraries or even more.  So, even in a minimal configuration, we are talking about 3-4 extra libraries that need to be built before we can turn on the parquet filter support in orcus.

Here is the main obstacle.  Most of these libraries use CMake as their only build system.  So if we want to build all of them as part of the regular TDF build, we first need to find a way to either integrate CMake support into our GNU Make based build system, or somehow have them built outside of our core build system and only reference them (or something).

Unfortunately I was not able to come up with a good solution for integrating these libraries, which is the reason why the internal orcus is built without parquet support at the moment...

Having said that, if someone wants to experiment with this, the easiest way to enable Parquet support is to build orcus outside of the libreoffice build along with all of its parquet related dependencies, and use --with-system-orcus to treat it as a system-provided orcus library when building libreoffice.
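
As a rough illustration of the build ordering described above, the dependency chain can be modelled as a small graph and sorted so that each library is built after everything it needs. This is only a sketch, not the actual TDF build logic: the Arrow dependencies named here (thrift, snappy, zstd) are illustrative assumptions, since the exact set depends on which Parquet features are enabled, and `build_order` is a hypothetical helper.

```python
from graphlib import TopologicalSorter

# Each key depends on the libraries in its value set.
# The Arrow dependencies listed (thrift, snappy, zstd) are assumptions for
# illustration; the real set depends on the chosen Parquet configuration.
BUILD_DEPENDENCIES = {
    "arrow": {"thrift", "snappy", "zstd"},
    "orcus (parquet filter enabled)": {"arrow"},
    "libreoffice (--with-system-orcus)": {"orcus (parquet filter enabled)"},
}


def build_order(dependencies):
    """Return a build order in which every library follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for step, name in enumerate(build_order(BUILD_DEPENDENCIES), start=1):
        print(f"{step}. {name}")
```

Whatever the exact dependency set turns out to be, LibreOffice always ends up last in the order, which is why the parquet-enabled orcus has to exist as a system library before the LibreOffice build is configured.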
Comment 14 Kohei Yoshida 2024-01-11 01:56:15 UTC
(In reply to Kohei Yoshida from comment #13)

> Having said that, if someone wants to experiment with this, the easiest way
> to enable Parquet support is to build orcus outside of the libreoffice build
> along with all of its parquet related dependencies, and use
> --with-system-orcus to treat it as a system-provided orcus library when
> building libreoffice.

This may be a perfectly doable strategy for distro builds, though I'm not sure whether it's desirable for the distro builds to have features the TDF build lacks.