xitongsys / parquet-go

pure golang library for reading/writing parquet file
Apache License 2.0

Can't parse INT96 type in parquet #408

Closed RichardFlyBird closed 2 years ago

RichardFlyBird commented 3 years ago

My code is as follows:

    step := 1
    for i := int64(0); i < count; i += int64(step) {
        res, err := pr.ReadByNumber(step)
        if err != nil {
            appInit.Logger.Errorf("Can't read: %s", err)
            return err
        }
        jsonBs, err := json.Marshal(res) // maybe i shouldn't use json.Marshal???
        if err != nil {
            appInit.Logger.Errorf("Can't to json: %s", err)
            return err
        }
        mp := make([]map[string]interface{}, step)
        json.Unmarshal(jsonBs, &mp)

        values := make([]interface{}, 0, len(clmList))
        for i, _ := range mp {
            for j, item := range clmList {
                values = append(values, utils.GetValue(mp[i][item.Nm], clmList[j]))
            }
            _, err = stmt.Exec(values...)
            if err != nil {
                appInit.Logger.Errorf("Exec insert sql fail: %s", err)
                return err
            }
        }
    }

When I execute the following code:

    res, err := pr.ReadByNumber(step)
    jsonBs, err := json.Marshal(res)

I already know that one field is of type INT96. When I look at that field in jsonBs, the value is 18 bytes long, not 12 bytes. And when I call time := types.INT96ToTime(int96_value.(string)), the result is obviously wrong. Is my usage incorrect? I'm in a hurry, thanks so much.

xitongsys commented 3 years ago

INT96 is stored in parquet-go as a string (the []byte contents), which holds the encoded bytes. If your INT96 is just a number, you should convert it to a readable string yourself. If your INT96 is a timestamp, you can use types.INT96ToTime. This function is compatible with Spark output. You can refer to the example here.

RichardFlyBird commented 3 years ago

> INT96 is stored in parquet-go as a string (the []byte contents), which holds the encoded bytes. If your INT96 is just a number, you should convert it to a readable string yourself. If your INT96 is a timestamp, you can use types.INT96ToTime. This function is compatible with Spark output. You can refer to the example here.

My INT96 is a timestamp, and I know its real value is '2021/9/18 18:31:46'. But when I use types.INT96ToTime to convert the INT96 (held as a string of raw bytes), the result is '1925-01-01 00:12:21'. I suspect that the data returned by pr.ReadByNumber(step) is wrong, but I don't know why this would happen.

xitongsys commented 3 years ago

I'm not sure. How did you generate this parquet file? Could you provide a sample file?