Closed bdavs3 closed 3 years ago
Actually, I'm not sure that this is an issue with lowercase field names. I'm stepping through the code and finding an issue in the NextRowGroup function of columnbuffer.go. When it does the following check, there's a problem:
if self.PathStr == common.PathToStr(path) { ...
// self.PathStr = Com.my.namespace.ExampleColumn
// common.PathToStr(path) = Com.my.namespace.my.namespace.ExampleColumn
Seems like path is being built incorrectly:
path := make([]string, 0)
path = append(path, self.SchemaHandler.GetRootInName()) // Appends Com.my.namespace
path = append(path, columnChunks[i].MetaData.GetPathInSchema()...) // Appends my.namespace.ExampleColumn
I'm not quite sure how to fix this, because I think the second append can be traced back to something happening in thrift.
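To make the mismatch concrete, here is a minimal sketch that reproduces it with plain strings, outside of parquet-go. The values are the ones from the debugger comments above. `pathToStr` and `buildPath` are hypothetical stand-ins (not library functions), and joining with "." is an assumption based on the printed output.

```go
package main

import (
	"fmt"
	"strings"
)

// pathToStr is a hypothetical stand-in for common.PathToStr. It is assumed
// to join path elements with ".", as the debugger output suggests.
func pathToStr(path []string) string {
	return strings.Join(path, ".")
}

// buildPath mirrors the two appends from the snippet above: the root name
// first, then the column chunk's path in schema.
func buildPath(rootInName string, pathInSchema []string) []string {
	path := make([]string, 0, 1+len(pathInSchema))
	path = append(path, rootInName)
	path = append(path, pathInSchema...)
	return path
}

func main() {
	root := "Com.my.namespace"
	pathStr := "Com.my.namespace.ExampleColumn" // value of self.PathStr

	// Observed: part of the dotted root name shows up again in the column path.
	observed := pathToStr(buildPath(root, []string{"my", "namespace", "ExampleColumn"}))

	// Needed for the check to pass: only the column name after the root.
	needed := pathToStr(buildPath(root, []string{"ExampleColumn"}))

	fmt.Println(observed, observed == pathStr)
	fmt.Println(needed, needed == pathStr)
}
```

The first comparison fails and the second succeeds, so the check can only match when the path in schema does not repeat pieces of the root name.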
Here's a peek at my parquet schema:
message com.my.namespace {
required binary exampleColumn (STRING);
required int64 anotherColumn;
optional binary aThirdColumn (STRING);
...
Welp, this was resolved by running go get (I didn't realize my version was out of date). Looks like that issue was fixed in this PR.
This is still an issue regarding lowercase names, right? I'm inferring schemas and want to keep each column name exactly as it was defined in Parquet. I also found out that the struct fields need uppercase names to be exported. But maybe the generated struct could carry struct tags with the original name, or maybe the field here could hold the original name and schema inference could be responsible for ensuring it's capitalized? https://github.com/xitongsys/parquet-go/blob/206c5012fe1053681d6af15696f7b24a3857a1c9/parquet/parquet.go#L3662-L3673
Afaict there's no way for me to figure out whether the field was named my_field or My_field after inference, is this correct?
I found a way to solve this: on the schema.SchemaHandler we have access to MapIndex and IndexMap to get the index from a name and vice versa. We also have InPathToExPath and ExPathToInPath to go between the names. The index can also be used together with GetExName (or GetInName).
This is not super trivial when using ReadByNumber because you only get an anonymous struct, but my solution ended up being to recurse down the struct with reflection and build the InPath so that I can look up the ExName based on my path. I just have to append each field name recursively and remember to add []string{"List", "Element"} if the type is reflect.Slice.
Hi @xitongsys. I have a use case for this project that requires reading lowercase field names from parquet. Up to this point, we've been manually declaring schema before reading the parquet, but this is not ideal, as others must wait for us to adjust the schema if it ever needs to be changed upstream. And I should mention, it would be quite difficult for us to change our field names to uppercase since they are used in several processes.
We'd like to implement schema-on-read by passing a nil schema to NewParquetReader and eventually calling ReadByNumber. But doing this results in the following error:

I understand in theory why capitalization is important due to exporting variables, but are lowercase field names supposed to be handled in the code? In #233, it looks like you added the ability to prefix a field name with a leading underscore, so I'm assuming that lowercase field names should be handled. Is this something to do with using a namespace? How would you suggest handling this problem? I'm happy to try to make a PR if it's straightforward.