wfau / gaia-dmp

Gaia data analysis platform
GNU General Public License v3.0

Parquet analyser (command line) #32

Open Zarquan opened 4 years ago

Zarquan commented 4 years ago

Notes on how to use the Hadoop parquet-tools to inspect Parquet files: https://stackoverflow.com/questions/36140264/inspect-parquet-from-command-line

I suspect we will end up using these a lot as we evaluate different ways to partition the data into Parquet files.

The StackOverflow question is about how to do this from the command line, but as this is a Java component, we can probably also use it from a Zeppelin notebook or a Spark job?

Zarquan commented 4 years ago

The code for the Java toolkit is here https://github.com/apache/parquet-mr/tree/master/parquet-tools

Zarquan commented 4 years ago

This issue overlaps with issue #15. We will need to be able to do both: analyse the data from inside a notebook, and from the command line inside a virtual machine or container.

Zarquan commented 4 years ago

The parquet-mr toolkit will need the Java runtime installed on the host, so it might be useful to create a minimal Docker container with just enough components installed to run the parquet-mr toolkit. To analyse a set of Parquet files, run the container with the directory containing the Parquet files mounted as a read-only volume.
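
As a rough illustration of the container approach, the sketch below builds a `docker run` command line that mounts a directory as a read-only volume and passes a parquet-tools subcommand to the container. The image name `parquet-tools-minimal`, the `/data` mount point, the example paths, and the assumption that the image's entrypoint is the parquet-tools launcher are all hypothetical; no such image has been built yet.

```python
import shlex
from pathlib import PurePosixPath

# Hypothetical image name; nothing in this issue defines a real one.
IMAGE = "parquet-tools-minimal"
# Hypothetical mount point inside the container.
MOUNT = PurePosixPath("/data")


def docker_command(host_dir, subcommand, filename):
    """Build a docker run command that inspects one Parquet file.

    Assumes the image entrypoint is the parquet-tools launcher, so the
    trailing arguments are a parquet-tools subcommand (for example
    'schema', 'meta' or 'head') followed by the path of the file.
    """
    return [
        "docker", "run", "--rm",
        # The ':ro' suffix mounts the host directory as a read-only volume.
        "--volume", f"{host_dir}:{MOUNT}:ro",
        IMAGE,
        subcommand,
        str(MOUNT / filename),
    ]


if __name__ == "__main__":
    # Placeholder directory and file name, for illustration only.
    cmd = docker_command("/path/to/parquet", "schema", "part-00000.parquet")
    # Print the command rather than running it, since the image is hypothetical.
    print(shlex.join(cmd))
```

Building the command as a list keeps quoting explicit, and the list can be passed to `subprocess.run` once a real image exists.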

Zarquan commented 4 years ago

Do we need this?

Zarquan commented 4 years ago

Make sure we have documented the tools in our notes and then close the issue.

Zarquan commented 3 years ago

If we make the parquet-mr toolkit available on the %sh interpreter then that might be all we need.

Zarquan commented 3 years ago

If we have a problem with the file mounts, having something to check the contents of a Parquet file is extremely useful. If possible, we should add the parquet-mr toolkit to the test Pods associated with the Manila shares.
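
Until the toolkit is available in the test Pods, a much cruder check might be enough to tell whether a mount is serving real data. The sketch below is not a substitute for parquet-tools: it only relies on the Parquet file layout, where a plain (unencrypted footer) file starts and ends with the 4-byte marker `PAR1` and stores the footer length as a 4-byte little-endian integer just before the trailing marker. The function name `check_parquet` is made up for this example.

```python
import struct
from pathlib import Path

# 4-byte marker at the start and end of a plain Parquet file.
MAGIC = b"PAR1"


def check_parquet(path):
    """Return the footer length in bytes if the file looks like Parquet.

    Raises ValueError if the file is too short, the markers are missing,
    or the recorded footer length does not fit inside the file.
    """
    path = Path(path)
    size = path.stat().st_size
    # Smallest possible layout: leading marker + footer length + trailing marker.
    if size < 12:
        raise ValueError(f"{path}: too short to be a Parquet file")
    with path.open("rb") as handle:
        head = handle.read(4)
        handle.seek(-8, 2)  # 8 bytes before the end of the file
        tail = handle.read(8)
    if head != MAGIC or tail[4:] != MAGIC:
        raise ValueError(f"{path}: missing PAR1 marker")
    (footer_len,) = struct.unpack("<I", tail[:4])
    if footer_len > size - 12:
        raise ValueError(f"{path}: footer length {footer_len} exceeds file size")
    return footer_len
```

Calling `check_parquet` on each file under a mounted share would catch empty, truncated or non-Parquet files, but it says nothing about the schema or row contents; that still needs the parquet-mr toolkit or a Spark read.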