xitongsys / parquet-go

pure golang library for reading/writing parquet file
Apache License 2.0

Would it be possible to write map[string]interface{} object directly without string serialization #356

Closed stephane-moreau closed 2 years ago

stephane-moreau commented 3 years ago

Hello,

I have a requirement to build parquet files where the structure is dynamically defined through some data collection processes. I'm able to generate the slice of SchemaElements to describe the "tree" structure of the data (an entry can be a direct value in the map, a sub-map, or even an array of sub-maps). When I use the Write method to push my collected data, it comes in as a map[string]interface{} and in the end nothing is stored in the file

After some debugging, it seems to be a "limitation", or a not-yet-supported use case, for the marshallers

I could eventually use the JSONWriter, but would like to avoid a JSON string roundtrip for each row I need to write to the file

Hoping my requirements are clear enough, I'm looking forward to your advice

Stephane MOREAU PS: In the meantime, I've created a PR with a couple of changes that would fit my needs; don't hesitate to comment or let me know if something is wrong with my use case / suggested change

janvaca92 commented 3 years ago

This feature would be great. I have struggled with the same kind of structure so this PR would be really helpful.

eladman7 commented 3 years ago

I agree it would be very useful. @stephane-moreau could you please share about a possible workaround if you used one? Thanks in advance :-)

stephane-moreau commented 3 years ago

Unfortunately, I could not find any workaround; there is a PR (#357) to implement it. The only other way would be to go through JSON serialization, which is not the most efficient process for large datasets

janvaca92 commented 3 years ago

The way I implemented it was to copy your marshaler file and set it up in the parquet writer as the marshal function instead of the default one.

eladman7 commented 3 years ago

I tried serializing my map into JSON with json.Marshal() and changed the struct field accordingly:

type ResultTest struct {
    Result []byte `parquet:"name=Result, type=BYTE_ARRAY, convertedtype=JSON, repetitiontype=REPEATED"`
}

got this error on writeStop(): "WriteStop error runtime error: invalid memory address or nil pointer dereference"

I also tried using uint8 instead of []byte, with type=UINT_8 in the parquet tag, as follows:

type ResultTest struct {
    Result []uint8 `parquet:"name=Result, type=UINT_8, convertedtype=JSON, repetitiontype=REPEATED"`
}

and got this on writeStop(): "Can't create parquet writer type UINT_8: not a valid Type string"

Am I missing something? Thanks

stephane-moreau commented 3 years ago

If you're using JSON serialization then you should have a dedicated writer and reader, and not alter the "schema" of the written data; this would not work correctly with parquet columnar storage, as far as I understand it. See: https://github.com/xitongsys/parquet-go/blob/master/example/json_write.go

stephane-moreau commented 3 years ago

The way I implemented it was to copy your marshaler file and set it up in the parquet writer as the marshal function instead of the default one.

If you are using the "marshaller" from my PR, then you have to define the schema of the parquet file with the exact expected columns that your map[string]interface{} will provide (just like it's done in the json-write example)

janvaca92 commented 3 years ago

The way I implemented it was to copy your marshaler file and set it up in the parquet writer as the marshal function instead of the default one.

If you are using the "marshaller" from my PR, then you have to define the schema of the parquet file with the exact expected columns that your map[string]interface{} will provide (just like it's done in the json-write example)

Yes, you're right. In my case I generate a dynamic schema from the DB, based on table metadata. The mapping of data types is another chapter.

fpasomeillan commented 3 years ago

Why has this not been merged in?

fpasomeillan commented 3 years ago

In the meantime, I've created a PR with a couple of changes that would fit my needs, don't hesitate to comment or let me know if something is wrong with my use case / suggested change

Would love an example, if you have one, of going from map[string]interface{} to parquet. I implemented your PR changes but I'm not sure how to take it all the way just yet. TIA

stephane-moreau commented 3 years ago

Here is a quick overview of how I'm using it:

var testSchema = []Column{
    {Path: "RecordID", Type: TYPE_INT64},
    {Path: "Email", Type: TYPE_STRING},
    {Path: "FirstName", Type: TYPE_STRING},
    {Path: "LastName", Type: TYPE_STRING},
    {Path: "LastVisit", Type: TYPE_DATETIME},
    {Path: "Products", Type: TYPE_ARRAY},
    {Path: "Products.items", Type: TYPE_GROUP},
    {Path: "Products.items.SKU", Type: TYPE_STRING},
    {Path: "Products.items.Price", Type: TYPE_FLOAT64},
    {Path: "Products.items.Currency", Type: TYPE_STRING},
    {Path: "Products.items.Stock", Type: TYPE_INT8},
}

type O = map[string]interface{}
type A = []interface{}

localFile, err := os.Create(testFile)

w, err := writer.NewParquetWriterFromWriter(localFile, convertToParquetSchema(testSchema), 1)

err = w.Write(O{
    "RecordID":  int64(1),
    "Email":     "a@b.c",
    "FirstName": "Henry",
    "LastName":  "Chi",
    "Products": A{
        O{
            "SKU":   "123",
            "Price": 4.5,
            "Stock": int8(6),
        },
        O{
            "SKU":   "456",
            "Price": 10.11,
        },
        O{
            "SKU":   "789",
            "Stock": int8(24),
        },
    },
})

err = w.Write(O{
    "RecordID":  int64(2),
    "Email":     "e@b.c",
    "FirstName": "Harry",
    "LastName":  "Cover",
    "Products": A{
        O{
            "SKU":   "123",
            "Price": 4.5,
        },
        O{
            "SKU":   "789",
            "Price": 10.11,
            "Stock": int8(12),
        },
    },
})

err = w.WriteStop()

Hope this helps understand the aim. convertToParquetSchema just iterates over testSchema to create the proper column definitions in the parquet format (utility functions are provided internally to abstract the file format description)
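The shape of that conversion can be sketched without the library: the dotted Path entries are grouped into a tree (relying on parents being listed before their children, as in testSchema above), and each node would then become one parquet schema element. The Column and node types here are assumptions for illustration, not the PR's actual definitions:

```go
package main

import (
	"fmt"
	"strings"
)

// Column mirrors the schema entries from the comment above; the exact
// type is assumed, not taken from the PR.
type Column struct {
	Path string
	Type string
}

// node is one vertex of the schema tree recovered from the dotted paths.
type node struct {
	name     string
	typ      string
	children []*node
}

// buildTree groups dotted Path entries under their parents, relying on
// every parent path being declared before its children.
func buildTree(cols []Column) *node {
	root := &node{name: "root"}
	index := map[string]*node{"": root}
	for _, c := range cols {
		parts := strings.Split(c.Path, ".")
		parentKey := strings.Join(parts[:len(parts)-1], ".")
		n := &node{name: parts[len(parts)-1], typ: c.Type}
		index[parentKey].children = append(index[parentKey].children, n)
		index[c.Path] = n
	}
	return root
}

func main() {
	cols := []Column{
		{Path: "RecordID", Type: "INT64"},
		{Path: "Products", Type: "ARRAY"},
		{Path: "Products.items", Type: "GROUP"},
		{Path: "Products.items.SKU", Type: "STRING"},
	}
	root := buildTree(cols)
	fmt.Println(len(root.children))                            // 2 (RecordID, Products)
	fmt.Println(root.children[1].children[0].children[0].name) // SKU
}
```

From a tree like this, emitting the flat parquet SchemaElement slice is a pre-order walk where each group node records its child count.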