xitongsys / parquet-go

pure golang library for reading/writing parquet file
Apache License 2.0
1.27k stars 293 forks

How to do schema inference with lowercase field names / periods in schema namespace? #416

Closed by bdavs3 3 years ago

bdavs3 commented 3 years ago

Hi @xitongsys. I have a use case for this project that requires reading lowercase field names from parquet. Up to this point, we've been manually declaring the schema before reading the parquet file, but this is not ideal: others must wait for us to adjust the schema whenever it changes upstream. I should also mention that it would be quite difficult for us to change our field names to uppercase, since they are used in several processes.

We'd like to implement schema-on-read by passing a nil schema to NewParquetReader and eventually calling ReadByNumber. But doing this results in the following error:

[NextRowGroup] Column not found: Com.my.namespace.ExampleColumn # exampleColumn in the parquet schema

I understand in theory why capitalization is important due to exporting variables, but are lowercase field names supposed to be handled in the code? In #233, it looks like you added the ability to prefix a field name with a leading underscore, so I'm assuming that lowercase field names should be handled. Is this something to do with using a namespace? How would you suggest handling this problem? I'm happy to try to make a PR if it's straightforward.

bdavs3 commented 3 years ago

Actually, I'm not sure that this is an issue with lowercase field names. I'm stepping through the code and finding an issue in the NextRowGroup function of columnbuffer.go. When it does the following check, there's a problem:

if self.PathStr == common.PathToStr(path) { ...
// self.PathStr = Com.my.namespace.ExampleColumn
// common.PathToStr(path) = Com.my.namespace.my.namespace.ExampleColumn

Seems like path is being built incorrectly:

path := make([]string, 0)
path = append(path, self.SchemaHandler.GetRootInName()) // Appends Com.my.namespace
path = append(path, columnChunks[i].MetaData.GetPathInSchema()...) // Appends my.namespace.ExampleColumn

I'm not quite sure how to fix this, because I think the second append can be traced back to something happening in thrift.
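To make the mismatch concrete, here is a minimal, self-contained sketch of the comparison described above. It does not use parquet-go itself: pathToStr is a hypothetical stand-in for common.PathToStr that joins with "." only because that is how the paths are printed in this comment, and the way the path-in-schema is split into elements is an assumption for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// pathToStr is a hypothetical stand-in for common.PathToStr.
// Assumption: it joins with "." to match the strings shown in this thread.
func pathToStr(path []string) string {
	return strings.Join(path, ".")
}

// buildColumnPath mirrors the two appends quoted above: the root in-name
// first, then whatever the column chunk metadata reports as its path.
func buildColumnPath(rootInName string, pathInSchema []string) []string {
	path := make([]string, 0, 1+len(pathInSchema))
	path = append(path, rootInName)
	path = append(path, pathInSchema...)
	return path
}

func main() {
	root := "Com.my.namespace"
	want := "Com.my.namespace.ExampleColumn" // value of self.PathStr

	// Observed case: the metadata path repeats part of the namespace.
	observed := buildColumnPath(root, []string{"my", "namespace", "ExampleColumn"})
	fmt.Println(pathToStr(observed), pathToStr(observed) == want)
	// prints: Com.my.namespace.my.namespace.ExampleColumn false

	// Case the comparison needs: the metadata path holds only the column name.
	needed := buildColumnPath(root, []string{"ExampleColumn"})
	fmt.Println(pathToStr(needed), pathToStr(needed) == want)
	// prints: Com.my.namespace.ExampleColumn true
}
```

With the duplicated namespace segments the string comparison can never succeed, which would explain the "Column not found" error.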

Here's a peek at my parquet schema:

message com.my.namespace {
  required binary exampleColumn (STRING);
  required int64 anotherColumn;
  optional binary aThirdColumn (STRING);
  ...

bdavs3 commented 3 years ago

Welp, this was resolved by running go get (I didn't realize my version was out of date). Looks like that issue was fixed in this PR.

bombsimon commented 1 year ago

This is still an issue regarding lowercase names, right? I'm inferring schemas and want to keep the name of each column exactly as it was defined in Parquet. I also found out that the struct fields need uppercase names to be exported, but maybe the inferred struct could carry struct tags with the original name, or maybe the field here could hold the original name, with schema inference responsible for ensuring it's capitalized? https://github.com/xitongsys/parquet-go/blob/206c5012fe1053681d6af15696f7b24a3857a1c9/parquet/parquet.go#L3662-L3673

Afaict there's no way for me to figure out if the field was named my_field or My_field after inference, is this correct?

bombsimon commented 1 year ago

I found a way to solve this; on the schema.SchemaHandler we have access to MapIndex and IndexMap to get the index from a name and vice versa. We also have InPathToExPath and ExPathToInPath to go between the names.

https://github.com/xitongsys/parquet-go/blob/b6d7d8771e2852091fc503f1fbd82463ff6ec75f/schema/schemahandler.go#L49-L60

The index can also be used together with GetExName (or GetInName).

https://github.com/xitongsys/parquet-go/blob/b6d7d8771e2852091fc503f1fbd82463ff6ec75f/schema/schemahandler.go#L152-L158

This is not super trivial when using ReadByNumber, because you only get an anonymous struct. My solution ended up being to recurse down the struct with reflection and build the InPath so that I can look up the ExName for that path. I just have to append each field name recursively and remember to add []string{"List", "Element"} if the type is reflect.Slice.