Zarquan opened this issue 4 years ago
The code for the Java toolkit is here: https://github.com/apache/parquet-mr/tree/master/parquet-tools
This issue overlaps with issue #15. We will need to be able to do both: analyse the data from inside a notebook, and from the command line inside a virtual machine or container.
The `parquet-mr` toolkit will need the Java runtime installed on the host, so it might be useful to create a minimal Docker container with just enough components installed to run the `parquet-mr` toolkit.
To analyse a set of Parquet files, run the container with the directory containing the Parquet files mounted as a read-only volume.
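A minimal sketch of what that could look like, assuming a hypothetical image called `parquet-tools` whose entrypoint runs the toolkit, and Parquet files under `/tmp/data` on the host:

```sh
# Build the minimal image (Dockerfile not shown; the image name is a placeholder).
docker build --tag parquet-tools .

# Run the toolkit against the host directory, mounted read-only at /data,
# and print the schema of one of the files.
docker run --rm \
    --volume /tmp/data:/data:ro \
    parquet-tools \
    schema /data/example.parquet
```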
Do we need this?
Make sure we have documented the tools in our notes and then close the issue.
If we make the `parquet-mr` toolkit available on the `%sh` interpreter, then that might be all we need.
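If we go that route, a Zeppelin paragraph might look something like this (the jar path and file path are placeholders for wherever we end up installing things):

```sh
%sh
# Inspect one of the Parquet files using the parquet-tools jar.
# The jar path and file path are placeholders.
java -jar /opt/parquet-tools/parquet-tools.jar schema /data/example.parquet
java -jar /opt/parquet-tools/parquet-tools.jar head -n 5 /data/example.parquet
```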
If we have a problem with the file mounts, having something that can check the contents of a Parquet file will be extremely useful.
If possible, we should add the `parquet-mr` toolkit to the test Pods associated with the Manila shares.
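If we do add it, checking a file on a share becomes a single exec into the Pod. A sketch, with the namespace, Pod name, jar path and mount path all placeholders:

```sh
# Print the metadata of a Parquet file on the mounted Manila share.
kubectl exec --namespace test parquet-test-pod -- \
    java -jar /opt/parquet-tools/parquet-tools.jar meta /share/example.parquet
```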
Notes on how to use the `parquet-tools` from Hadoop to inspect Parquet files: https://stackoverflow.com/questions/36140264/inspect-parquet-from-command-line

I suspect we will end up using these a lot as we evaluate different ways to partition the data into Parquet files.
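For reference, the commands described there look roughly like this when run through the Hadoop launcher (the jar version and file paths are placeholders):

```sh
# Show the column schema of a Parquet file.
hadoop jar parquet-tools-<version>.jar schema /path/to/example.parquet

# Show the file metadata, including row groups and column statistics.
hadoop jar parquet-tools-<version>.jar meta /path/to/example.parquet

# Print the first five records.
hadoop jar parquet-tools-<version>.jar head -n 5 /path/to/example.parquet
```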
The StackOverflow question is about how to do this from the command line, but as this is a Java component, we can probably use these from a Zeppelin notebook or a Spark job?