xitongsys / parquet-go

pure golang library for reading/writing parquet file
Apache License 2.0

Memory usage increased by more than 4x until OOM(140G) when upgrading from v1.5.4 to the latest commit #581

Open h27771420 opened 8 months ago

h27771420 commented 8 months ago

Hello there. The issue is as described in the title.

For various reasons, I need to switch from S3 to the gocloud/blob implementation in order to support other storage providers. I initially tested gocloud/blob with the github.com/xitongsys/parquet-go v1.5.4 release, but found that gocloud/blob sometimes failed to write/read data in that version. I therefore moved to the latest release (github.com/xitongsys/parquet-go v1.6.2) and updated the parquet tags to match it, e.g., changing the tag below from type=UTF8 to type=BYTE_ARRAY, convertedtype=UTF8:

PlayerID            string    `parquet:"name=player_id, type=UTF8"` 

to

PlayerID            string    `parquet:"name=player_id, type=BYTE_ARRAY, convertedtype=UTF8"`

The read/write tests all passed at first, but when I needed to write dozens of gigabytes (over 100G) of data, I found that the updated version caused OOM problems. At first I suspected an issue in the gocloud/blob implementation, so I switched back to S3 for testing, but the problem persisted.

After the latest release there were several fixes that seemed to be memory-related, so I upgraded to the latest commit (github.com/xitongsys/parquet-go v1.6.3-0.20231102094431-8ca067b2bd32), but the OOM problem is still not solved. 😭 😔

==============================

So I'm raising my hand here. Does anyone know what might cause 4x (or more) memory usage after upgrading from v1.5.4 to v1.6.2/v1.6.3-0.20231102094431-8ca067b2bd32?

When writing dozens of gigabytes of data, memory usage grows as follows:

  1. v1.5.4 with S3 - 35G (works nicely)
  2. (UPDATE): v1.6.0 with S3 - 35G (works nicely)
  3. (UPDATE - 2024/04/10): v1.6.0 with gocloud/blob - 35G (works nicely)
  4. v1.6.2 with S3 - 140G (OOM, so possibly more than 140G)
  5. v1.6.2 with gocloud/blob - 140G (OOM, so possibly more than 140G)
  6. v1.6.3-0.20231102094431-8ca067b2bd32 with S3 - 140G (OOM, so possibly more than 140G)

My writer parameters (this code is unchanged from v1.5.4):

    fw, err := s3.NewS3FileWriter(ctx, bucket, s3key,
        []func(*s3manager.Uploader){
            func(u *s3manager.Uploader) {
                u.PartSize = 64 * 1024 * 1024 // 64MB per part
            }},
        awsConfig,
    )
    if err != nil {
        // do something
    }

    parquetWriter, err := writer.NewParquetWriter(fw, new(ParquetStruct), 1)
    if err != nil {
        // do something
    }
    parquetWriter.CompressionType = parquet.CompressionCodec_GZIP // gzip to reduce the file size
    parquetWriter.PageSize = 1 * 1024 * 1024                      // larger pages for better compression
    parquetWriter.RowGroupSize = 128 * 1024 * 1024                // default

h27771420 commented 8 months ago

(UPDATE): S3 + v1.6.0 looks good so far; memory usage is almost the same as v1.5.4, but I have not yet tested it with gocloud/blob.

h27771420 commented 7 months ago

(Update): Adding some information from earlier debugging: the .parquet files generated with S3 + v1.5.4, v1.6.0, and v1.6.2 can all be read without any issue. However, v1.6.2 and v1.6.3-0.20231102094431-8ca067b2bd32 (the latest commit) run into a huge memory usage problem.

The two logs below show memory usage (inuse_space) from pprof.

v1.6.2: encoding.WritePlainBYTE_ARRAY seems to consume a lot (exported just before reaching OOM). (screenshot for the caller chains)

MacBook-Pro-CH ~ % go tool pprof -inuse_space dtcd-ab-240319-after-v2.prof 
File: playerid_cli
Type: inuse_space
Time: Mar 19, 2024 at 1:01pm (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 10
Showing nodes accounting for 60854.17MB, 97.88% of 62170.46MB total
Dropped 301 nodes (cum <= 310.85MB)
Showing top 10 nodes out of 30
      flat  flat%   sum%        cum   cum%
49100.89MB 78.98% 78.98% 49100.89MB 78.98%  github.com/xitongsys/parquet-go/encoding.WritePlainBYTE_ARRAY
...<REDACTED>

v1.5.4: encoding.WritePlainBYTE_ARRAY ranks fourth (exported during stable writing).

MacBook-Pro-CH ~ % go tool pprof -inuse_space dtcd-240326.prof
File: playerid_cli
Type: inuse_space
Time: Mar 26, 2024 at 10:44am (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 10
Showing nodes accounting for 12675.05MB, 97.93% of 12942.45MB total
Dropped 253 nodes (cum <= 64.71MB)
Showing top 10 nodes out of 39
      flat  flat%   sum%        cum   cum%
<REDACTED>...
  858.03MB  6.63% 92.99%   858.03MB  6.63%  github.com/xitongsys/parquet-go/encoding.WritePlainBYTE_ARRAY
...<REDACTED>

I am currently using v1.6.0, which works fine with gocloud/blob and uses almost the same memory as v1.5.4, so I will pin the version there for a while. I think the change that increased memory usage must be in v1.6.1 or v1.6.2. I hope someone can give me some tips.