159078 – Support for Apache Parquet input for Calc and Base

Summary: Support for Apache Parquet input for Calc and Base

Status:	NEW

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Calc
Version: (earliest affected)	7.6.4.1 release
Hardware:	All All

Importance:	medium enhancement
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:	Format-Filters

Reported:	2024-01-09 06:19 UTC by Simon Aubert
Modified:	2024-01-11 01:56 UTC
CC List:	4 users

See Also:
Crash report or crash signature:

Attachments
sample 1 (2.86 KB, application/octet-stream) 2024-01-09 11:54 UTC, Xisco Faulí
sample 2 (12.34 MB, application/octet-stream) 2024-01-09 11:54 UTC, Xisco Faulí
sample 3 (39.08 KB, application/octet-stream) 2024-01-09 11:55 UTC, Xisco Faulí

Description Simon Aubert 2024-01-09 06:19:48 UTC

Description:
Hello all,
Apache Parquet is becoming more and more popular in the data community. It originated in Big Data on Hadoop but is now supported by all major software vendors. Its main benefit is that it is fast; however, sometimes I would just like to open a Parquet file like a table file.
Best regards,
Simon

Actual Results:
You can't connect to an Apache Parquet File

Expected Results:
You can read an Apache Parquet File


Reproducible: Always


User Profile Reset: No

Additional Info:
None

Comment 1 Mike Kaganski 2024-01-09 07:07:37 UTC

https://qa.blog.documentfoundation.org/2023/11/22/qa-dev-report-october-2023/

> 40. Kohei Yoshida upgraded liborcus and added support for conditional loading of
> Apache Parquet files into Calc

Comment 2 Buovjaga 2024-01-09 07:09:08 UTC

Already done for Calc for version 24.2 with b14583ba37a6d7ce398ccd3cf339f954785b03d8

Kohei: anything to comment here? About Base, what about loading the file into Calc and using the Calc file as a data source in Base?

Comment 3 Buovjaga 2024-01-09 07:13:09 UTC

Also, maybe the info could be added to https://wiki.documentfoundation.org/ReleaseNotes/24.2#Orcus-based_filters ?

Comment 4 Simon Aubert 2024-01-09 08:08:41 UTC

Hello. First of all, thanks for the fast answer. I tried to open some samples from https://www.tablab.app/datasets/sample/parquet and it doesn't seem to work and Apache Parquet doesn't seem to be present in the file formats that are available.

Version: 7.6.4.1 (X86_64) / LibreOffice Community
Build ID: e19e193f88cd6c0525a17fb7a176ed8e6a3e2aa1
CPU threads: 8; OS: Windows 10.0 Build 19045; UI render: Skia/Raster; VCL: win
Locale: fr-FR (fr_FR); UI: fr-FR
Calc: threaded

Best regards,

Simon

Comment 5 Xisco Faulí 2024-01-09 08:31:57 UTC

(In reply to Simon Aubert from comment #4)
> Hello. First of all, thanks for the fast answer. I tried to open some
> samples from https://www.tablab.app/datasets/sample/parquet and it doesn't
> seem to work and Apache Parquet doesn't seem to be present in the file
> formats that are available.
> 
> Version: 7.6.4.1 (X86_64) / LibreOffice Community
> Build ID: e19e193f88cd6c0525a17fb7a176ed8e6a3e2aa1
> CPU threads: 8; OS: Windows 10.0 Build 19045; UI render: Skia/Raster; VCL:
> win
> Locale: fr-FR (fr_FR); UI: fr-FR
> Calc: threaded
> 
> Best regards,
> 
> Simon

You should try with LibreOffice 24.2. You can download it from https://es.libreoffice.org/descarga/libreoffice/?type=rpm-x86_64&version=24.2.0

Comment 6 Mike Kaganski 2024-01-09 11:07:06 UTC

(In reply to Buovjaga from comment #2)
> Already done for Calc for version 24.2 with
> b14583ba37a6d7ce398ccd3cf339f954785b03d8

The commit states explicitly that the support depends on "orcus has been built with the parquet import filter enabled". With current master on Windows, I can't open files from the resource mentioned in comment 4.
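
To rule out the sample files themselves as the cause, a quick sanity check can confirm that a download has the layout of a Parquet file. The following is a minimal sketch, not part of LibreOffice or orcus: it relies only on the Parquet format convention that an unencrypted file starts and ends with the 4-byte marker "PAR1", with a 4-byte little-endian footer length just before the trailing marker. The function name looks_like_parquet is hypothetical.

```python
import struct
from pathlib import Path

MAGIC = b"PAR1"  # 4-byte marker at both ends of an unencrypted Parquet file


def looks_like_parquet(path):
    """Return True if the file carries the Parquet magic at both ends
    and a footer length that fits inside the file.

    This only inspects the framing bytes; it does not parse the footer
    metadata or validate the column data.
    """
    size = Path(path).stat().st_size
    # Smallest possible layout: magic + 4-byte footer length + magic.
    if size < 12:
        return False
    with open(path, "rb") as fh:
        head = fh.read(4)
        fh.seek(-8, 2)  # last 8 bytes: footer length, then magic
        tail = fh.read(8)
    footer_len = struct.unpack("<I", tail[:4])[0]
    return head == MAGIC and tail[4:] == MAGIC and footer_len <= size - 12
```

If this returns True for a sample that Calc still refuses to open, the file framing is probably fine and the missing piece is more likely the import filter in the build.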

Comment 7 Simon Aubert 2024-01-09 11:47:31 UTC

So I tried with 24.2 and it still does not work; Parquet is also absent from the supported file formats.

Am I missing something?

Best regards,

Simon

Comment 8 Xisco Faulí 2024-01-09 11:54:26 UTC

Reproduced in

Version: 24.8.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: 60150ef4b8fc1d0a30f20c3d9ed6ba0725da16a5
CPU threads: 8; OS: Linux 6.1; UI render: default; VCL: gtk3
Locale: es-ES (es_ES.UTF-8); UI: en-US
Calc: threaded

Comment 9 Xisco Faulí 2024-01-09 11:54:37 UTC

Created attachment 191819
sample 1

Comment 10 Xisco Faulí 2024-01-09 11:54:48 UTC

Created attachment 191820
sample 2

Comment 11 Xisco Faulí 2024-01-09 11:55:00 UTC

Created attachment 191821
sample 3

Comment 12 Buovjaga 2024-01-09 14:02:27 UTC

(In reply to Xisco Faulí from comment #8)
> Reproduced in
> 
> Version: 24.8.0.0.alpha0+ (X86_64) / LibreOffice Community
> Build ID: 60150ef4b8fc1d0a30f20c3d9ed6ba0725da16a5
> CPU threads: 8; OS: Linux 6.1; UI render: default; VCL: gtk3
> Locale: es-ES (es_ES.UTF-8); UI: en-US
> Calc: threaded

So did you build orcus with parquet filter enabled?

Comment 13 Kohei Yoshida 2024-01-11 01:41:03 UTC

Allow me to give you guys some clarification...

In the current state on the master branch, the internal orcus is built without the parquet filter support. The change referenced by the commit only introduces all necessary hooks to enable Parquet support when orcus is built with the parquet filter enabled, but that commit itself is not adequate to load parquet files.

Now, to enable parquet filter in orcus, you first need to build the Apache Arrow library since that becomes orcus's new dependency. And to build the Apache Arrow library, you need to build the libraries that the Arrow library itself depends on. Depending on how many features of Parquet you want to enable (Parquet can support multiple compression algorithms), you may need to build a few extra libraries or even more. So, even in a minimal configuration, we are talking about 3-4 extra libraries that need to be built before we can turn on the parquet filter support in orcus.

Here is the main obstacle. Most of these libraries use CMake as their only build system. So if we want to build all of them as part of the regular TDF build, we first need to find a way to either integrate CMake support into our GNU Make based build system, or somehow have them built outside of our core build system and only reference them (or something).

Unfortunately I was not able to come up with a good solution for integrating these libraries, which is the reason why the internal orcus is built without parquet support at the moment...

Having said that, if someone wants to experiment with this, the easiest way to enable Parquet support is to build orcus outside of the libreoffice build along with all of its parquet related dependencies, and use --with-system-orcus to treat it as a system-provided orcus library when building libreoffice.
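
For anyone trying that experiment, the two ends of the chain could be sketched roughly as below. This is only an illustration under stated assumptions, not a tested recipe: the Arrow options ARROW_PARQUET and ARROW_WITH_<CODEC> are Arrow's own CMake switches, while the directory names, install prefix and helper function names are made up for the example. The step that builds orcus itself against Arrow is deliberately left out, since the exact orcus option is not named in this report.

```python
import shlex


def arrow_cmake_args(source_dir, build_dir, prefix, codecs=()):
    """Assemble a CMake configure command for Arrow with Parquet enabled.

    Each extra codec (for example "snappy" or "zstd") turns on one more
    ARROW_WITH_<CODEC> switch and pulls in the matching library.
    """
    args = [
        "cmake",
        "-S", source_dir,
        "-B", build_dir,
        f"-DCMAKE_INSTALL_PREFIX={prefix}",
        "-DARROW_PARQUET=ON",
    ]
    for codec in codecs:
        args.append(f"-DARROW_WITH_{codec.upper()}=ON")
    return args


def libreoffice_autogen_args():
    """Configure LibreOffice against an orcus built outside the core tree."""
    return ["./autogen.sh", "--with-system-orcus"]


if __name__ == "__main__":
    # Hypothetical paths; adjust to the local checkout and prefix.
    steps = [
        arrow_cmake_args("arrow/cpp", "arrow-build", "/opt/parquet-deps",
                         codecs=("snappy", "zstd")),
        libreoffice_autogen_args(),
    ]
    for step in steps:
        print(shlex.join(step))
```

The codecs argument mirrors the point above about compression algorithms: a minimal configuration passes none, and every codec added there is one more library to build.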

Comment 14 Kohei Yoshida 2024-01-11 01:56:15 UTC

(In reply to Kohei Yoshida from comment #13)

> Having said that, if someone wants to experiment with this, the easiest way
> to enable Parquet support is to build orcus outside of the libreoffice build
> along with all of its parquet related dependencies, and use
> --with-system-orcus to treat it as a system-provided orcus library when
> building libreoffice.

This may be a completely doable strategy for distro builds, though I'm not sure whether it's desirable for the distro builds to have features the TDF build lacks.