This feature would be great. I have struggled with the same kind of structure, so this PR would be really helpful.
I agree it would be very useful. @stephane-moreau could you please share a possible workaround if you used one? Thanks in advance :-)
Unfortunately, I could not find any workaround - there is a PR (#357) to implement it. The only other way would be to go through JSON serialization, which is not the most efficient process for large datasets.
One way I implemented it was to copy your marshaler file and set it up in the parquet writer as the marshal func instead of the default one.
I tried serializing my map into JSON with json.Marshal() and changed the struct field accordingly:
type ResultTest struct {
	Result []byte `parquet:"name=Result, type=BYTE_ARRAY, convertedtype=JSON, repetitiontype=REPEATED"`
}
and got this error on WriteStop(): "WriteStop error runtime error: invalid memory address or nil pointer dereference"
I also tried using []uint8 instead of []byte, with type=UINT_8 in the parquet tag, as follows:
type ResultTest struct {
	Result []uint8 `parquet:"name=Result, type=UINT_8, convertedtype=JSON, repetitiontype=REPEATED"`
}
and got this on WriteStop(): "Can't create parquet writer type UINT_8: not a valid Type string"
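(For reference, my understanding is that BYTE_ARRAY columns normally map to a Go string field in parquet-go, so a single JSON column would presumably be declared without REPEATED, along the lines below - though I haven't confirmed this works with convertedtype=JSON:)

type ResultTest struct {
	// assumption: one JSON document per row, stored in a single BYTE_ARRAY column
	Result string `parquet:"name=Result, type=BYTE_ARRAY, convertedtype=JSON"`
}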
Am I missing something? Thanks
If you're using JSON serialization, then you should have a dedicated writer and reader and not alter the "schema" of the written data; this would not work correctly with parquet columnar storage, as far as I understand it. See: https://github.com/xitongsys/parquet-go/blob/master/example/json_write.go
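Condensed from that example, the write path looks roughly like this (schema and values shortened here):

package main

import (
	"log"

	"github.com/xitongsys/parquet-go-source/local"
	"github.com/xitongsys/parquet-go/writer"
)

func main() {
	// The schema is a JSON string describing every column up front.
	jsonSchema := `{
	  "Tag": "name=parquet_go_root, repetitiontype=REQUIRED",
	  "Fields": [
	    {"Tag": "name=name, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED"},
	    {"Tag": "name=age, type=INT32, repetitiontype=REQUIRED"}
	  ]
	}`

	fw, err := local.NewLocalFileWriter("json.parquet")
	if err != nil {
		log.Fatal(err)
	}
	pw, err := writer.NewJSONWriter(jsonSchema, fw, 4)
	if err != nil {
		log.Fatal(err)
	}

	// Each record is handed over as a JSON string and parsed against the
	// schema - this is the per-row serialization roundtrip mentioned above.
	if err = pw.Write(`{"name": "Student Name", "age": 18}`); err != nil {
		log.Fatal(err)
	}
	if err = pw.WriteStop(); err != nil {
		log.Fatal(err)
	}
	fw.Close()
}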
One way I implemented it was to copy your marshaler file and set it up in the parquet writer as the marshal func instead of the default one.
If you are using the "marshaller" from my PR, then you have to define the schema of the parquet file with the exact expected columns that your map[string]interface{} will provide (just like it's done in the json-write example)
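For instance (a made-up two-column case, written in the schema notation of the json_write example), a record like map[string]interface{}{"id": int64(1), "name": "x"} needs a schema declaring exactly those two columns:

var mapSchema = `{
  "Tag": "name=parquet_go_root, repetitiontype=REQUIRED",
  "Fields": [
    {"Tag": "name=id, type=INT64, repetitiontype=REQUIRED"},
    {"Tag": "name=name, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=OPTIONAL"}
  ]
}`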
Yes, you're right. In my case I generate a dynamic schema from the DB, based on table metadata. The mapping of data types is another chapter.
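(A rough illustration of that mapping chapter - the helper name and the type coverage here are invented for the sketch, not taken from the PR:)

package main

import "strings"

// sqlTypeToParquetTag is a hypothetical helper: it translates a SQL column
// type (e.g. read from information_schema.columns) into a parquet-go tag
// fragment. Real coverage would need many more cases.
func sqlTypeToParquetTag(sqlType string) string {
	switch strings.ToUpper(sqlType) {
	case "BIGINT", "INTEGER", "SMALLINT":
		return "type=INT64"
	case "DOUBLE PRECISION", "FLOAT", "REAL":
		return "type=DOUBLE"
	case "TIMESTAMP":
		return "type=INT64, convertedtype=TIMESTAMP_MILLIS"
	case "BOOLEAN":
		return "type=BOOLEAN"
	default: // VARCHAR, TEXT, ...
		return "type=BYTE_ARRAY, convertedtype=UTF8"
	}
}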
Why has this not been merged in?
I've created a PR with a couple of changes that would fit my needs; don't hesitate to comment or let me know if something is wrong with my use case / suggested change.
Would love an example, if you have one, of going from map[string]interface{} to parquet. I implemented your PR changes but am not sure how to take it all the way just yet. TIA
Here is a quick overview of how I'm using it:
var testSchema = []Column{
{Path: "RecordID", Type: TYPE_INT64},
{Path: "Email", Type: TYPE_STRING},
{Path: "FirstName", Type: TYPE_STRING},
{Path: "LastName", Type: TYPE_STRING},
{Path: "LastVisit", Type: TYPE_DATETIME},
{Path: "Products", Type: TYPE_ARRAY},
{Path: "Products.items", Type: TYPE_GROUP},
{Path: "Products.items.SKU", Type: TYPE_STRING},
{Path: "Products.items.Price", Type: TYPE_FLOAT64},
{Path: "Products.items.Currency", Type: TYPE_STRING},
{Path: "Products.items.Stock", Type: TYPE_INT8},
}
type O = map[string]interface{}
type A = []interface{}
localFile, err = os.Create(testFile)
w, err := writer.NewParquetWriterFromWriter(localFile, convertToParquetSchema(testSchema), 1)
err = w.Write(O{
"RecordID": int64(1),
"Email": "a@b.c",
"FirstName": "Henry",
"LastName": "Chi",
"Products": A{
O{
"SKU": "123",
"Price": 4.5,
"Stock": int8(6),
},
O{
"SKU": "456",
"Price": 10.11,
},
O{
"SKU": "789",
"Stock": int8(24),
},
},
})
err = w.Write(O{
"RecordID": int64(2),
"Email": "e@b.c",
"FirstName": "Harry",
"LastName": "Cover",
"Products": A{
O{
"SKU": "123",
"Price": 4.5,
},
O{
"SKU": "789",
"Price": 10.11,
"Stock": int8(12),
},
},
})
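(To complete the sketch - the standard flush and close, omitted above:)

if err = w.WriteStop(); err != nil {
	log.Println("WriteStop error", err)
}
localFile.Close()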
Hope this helps you understand the aim - convertToParquetSchema just iterates over testSchema to create the proper column definitions in the parquet format (utility functions provided internally to abstract the file format description).
Hello,
I have a requirement to build parquet files where the structure is dynamically defined through some data collection processes. I'm able to generate the slice of SchemaElements to describe the "tree" structure of the data (it can be a direct entry in the map, a sub-map, or even an array of sub-maps). When I use the Write method to push my collected data, it comes as a map[string]interface{}, and in the end nothing is stored in the file.
After some debugging, it seems to be a "limitation", or a not-yet-supported use case, for the marshallers.
I could eventually use the JSONWriter but would like to avoid the JSON string roundtrip for each row I need to write to the file.
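(That roundtrip would look roughly like this per row, assuming pw is a writer.JSONWriter and encoding/json is imported:)

row := map[string]interface{}{"RecordID": int64(1), "Email": "a@b.c"}
buf, _ := json.Marshal(row)    // serialize the map to a JSON string...
err = pw.Write(string(buf))    // ...which the JSONWriter immediately re-parses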
Hoping my requirements are clear enough, I'm looking forward to advice.
Stephane MOREAU
PS: In the meantime, I've created a PR with a couple of changes that would fit my needs; don't hesitate to comment or let me know if something is wrong with my use case / suggested change.