wfau / gaia-dmp

Gaia data analysis platform
GNU General Public License v3.0

Parquet analyser (notebook) #15

Zarquan opened this issue 5 years ago

Zarquan commented 5 years ago

A Zeppelin notebook that demonstrates how to unpack a Parquet data set and extract the metadata, displaying properties such as the number of files, file format, compression, number of columns, block size, indexing, etc.
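
As a rough sketch of the kind of step the notebook would contain (assuming `pyarrow` is available on the interpreter, and using a hypothetical file path), the Parquet footer of a single file can be read directly to show the format version, row groups, column count and compression:

```python
%pyspark
# Hedged sketch: read the footer metadata of one Parquet file with pyarrow.
# The path below is illustrative -- point it at one file from the dataset.
import pyarrow.parquet as pq

path = "/hadoop/gaia/parquet/gdr2/gaia_source/part-00000.parquet"
meta = pq.ParquetFile(path).metadata

print("format version :", meta.format_version)
print("created by     :", meta.created_by)
print("columns        :", meta.num_columns)
print("rows           :", meta.num_rows)
print("row groups     :", meta.num_row_groups)

# Compression and encodings are recorded per column chunk within each row group.
col = meta.row_group(0).column(0)
print("first column   :", col.path_in_schema)
print("compression    :", col.compression)
print("encodings      :", col.encodings)
```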

Three use cases:

1. Point this notebook at an unknown dataset to learn how it is formatted.
2. Each individual step in the notebook can be used as an example of how to extract a metadata property from a dataset.
3. A derivative of this notebook could be used to check that our datasets are formatted as expected.

Zarquan commented 4 years ago

I suspect we will end up using this a lot as we evaluate different ways to partition the Gaia data into Parquet files.

Zarquan commented 4 years ago

The code for the parquet-mr Java command-line toolkit is here: https://github.com/apache/parquet-mr/tree/master/parquet-tools

Zarquan commented 4 years ago

This issue overlaps with issue #32. We will need to be able to do both: analyse the data from the command line and from inside a Zeppelin notebook.

Zarquan commented 4 years ago

There will probably be lots of ways of doing this, so this issue will likely become a task of finding them and writing them up as example notebooks that we can give to our end users.

Nigel has found some Python code that lists the schema:

```python
%pyspark
df = sqlContext.read.parquet("/hadoop/gaia/parquet/gdr2/gaia_source/*.parquet")
df.printSchema()
```

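That paragraph only prints the column schema. A hedged extension of the same idea (assuming the Zeppelin-provided `spark` session is available) could also report how many Parquet files make up the dataset and how many rows it holds:

```python
%pyspark
# Sketch: schema plus simple file and row counts for the same dataset.
from pyspark.sql import functions as F

df = spark.read.parquet("/hadoop/gaia/parquet/gdr2/gaia_source/*.parquet")
df.printSchema()

print("columns :", len(df.columns))
print("rows    :", df.count())
print("files   :", df.select(F.input_file_name()).distinct().count())
```
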
Zarquan commented 3 years ago

This could be useful to check that a spatial index has been created by something like AXS.
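
One possible check, sketched below, assumes the spatial index column is called `zone` (a hypothetical name; adjust to whatever AXS actually produces) and looks at the per-file min/max of that column to see whether the data appears to have been partitioned by it:

```python
%pyspark
# Hedged sketch: test for the presence of an assumed spatial index column
# and summarise its range within each Parquet file.
from pyspark.sql import functions as F

df = spark.read.parquet("/hadoop/gaia/parquet/gdr2/gaia_source/*.parquet")

if "zone" in df.columns:
    stats = (
        df.withColumn("file", F.input_file_name())
          .groupBy("file")
          .agg(F.min("zone").alias("zone_min"), F.max("zone").alias("zone_max"))
    )
    stats.orderBy("zone_min").show(20, truncate=False)
else:
    print("No 'zone' column found -- spatial index probably not applied")
```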

Zarquan commented 3 years ago

If we make the parquet-mr toolkit available in the %sh interpreter, then that might be all we need.
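
A minimal sketch of what that could look like, assuming the parquet-tools jar has been built from parquet-mr and placed at an illustrative location on the workers, and using a hypothetical file path:

```sh
%sh
# Hedged sketch: print the column schema and the footer metadata
# (row groups, compression, encodings) for a single Parquet file.
# Jar location and file path below are illustrative assumptions.
hadoop jar /opt/parquet-tools.jar schema /hadoop/gaia/parquet/gdr2/gaia_source/part-00000.parquet
hadoop jar /opt/parquet-tools.jar meta   /hadoop/gaia/parquet/gdr2/gaia_source/part-00000.parquet
```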