Open Zarquan opened 5 years ago
I suspect we will end up using this a lot as we evaluate different ways to partition the Gaia data into Parquet files.
The code for a Java toolkit is here: https://github.com/apache/parquet-mr/tree/master/parquet-tools
This issue overlaps with issue #32. We will need to be able to do both: analyse the data from the command line and from inside a Zeppelin notebook.
There are probably lots of ways of doing this, so this issue may become a task of finding them and writing them up as example notebooks that we can give to our end users.
Nigel has found some Python code that lists the schema:
```python
%pyspark
df = sqlContext.read.parquet("/hadoop/gaia/parquet/gdr2/gaia_source/*.parquet")
df.printSchema()
```
This could be useful to check that a spatial index has been created by something like AXS.
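As a rough sketch of such a check, the snippet below looks in a list of schema field names (e.g. from `df.schema.fieldNames()`) for the partitioning columns that AXS is believed to add. The column names `zone` and `dup` are assumptions here, not confirmed against the AXS documentation; adjust them to whatever the real indexed schema contains.

```python
# Sketch: check whether an AXS-style spatial index appears in a schema.
# The expected column names ("zone", "dup") are assumptions about what
# AXS adds when it partitions a dataset; adjust to the real schema.
AXS_INDEX_COLUMNS = {"zone", "dup"}

def has_spatial_index(field_names):
    """Return True if all of the assumed AXS index columns are present."""
    return AXS_INDEX_COLUMNS.issubset(field_names)

# Example: field names as they might come back from df.schema.fieldNames()
print(has_spatial_index(["source_id", "ra", "dec", "zone", "dup"]))  # True
print(has_spatial_index(["source_id", "ra", "dec"]))                 # False
```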
If we make the parquet-mr toolkit available on the %sh interpreter then that might be all we need.
A Zeppelin notebook that demonstrates how to unpack a Parquet dataset and extract its metadata, displaying things like the number of files, file format, compression, number of columns, block size, indexing, etc.
Three use cases:
1) Point this notebook at an unknown dataset to learn how it is formatted.
2) Each individual step in the notebook can be used as an example of how to extract a metadata property from a dataset.
3) A derivative of this notebook could be used to check that our datasets are formatted as expected.
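For use case 3, one possible sketch is to compare the (name, type) pairs reported by the notebook (e.g. PySpark's `df.dtypes` returns exactly that shape) against an expected schema. The `EXPECTED` entries below are illustrative only; the real gaia_source schema has many more columns:

```python
# Sketch for use case 3: verify a dataset matches an expected schema.
# The EXPECTED entries are illustrative, not the real gaia_source layout.
EXPECTED = [("source_id", "bigint"), ("ra", "double"), ("dec", "double")]

def schema_mismatches(observed, expected=EXPECTED):
    """Return a list of human-readable differences between two schemas.

    `observed` is a list of (column_name, type_string) pairs, e.g. the
    result of PySpark's df.dtypes.
    """
    problems = []
    obs = dict(observed)
    exp = dict(expected)
    for name, dtype in expected:
        if name not in obs:
            problems.append("missing column: %s" % name)
        elif obs[name] != dtype:
            problems.append("wrong type for %s: %s (expected %s)"
                            % (name, obs[name], dtype))
    for name, _ in observed:
        if name not in exp:
            problems.append("unexpected column: %s" % name)
    return problems
```

An empty result means the dataset matches; anything else is a readable report that could be displayed directly in a notebook cell.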