(UPDATE): S3 + `v1.6.0` looks good so far; the memory usage is almost the same as with `v1.5.4`, but I have not yet tested it with `gocloud/blob`.
(Update): Adding some information from previous debugging sessions: the `.parquet` files generated with S3 + `v1.5.4`, `v1.6.0`, and `v1.6.2` can all be read without any issue. However, `v1.6.2` and `v1.6.3-0.20231102094431-8ca067b2bd32` (the latest commit) run into a huge memory usage issue.
The two logs below show the memory usage (`inuse_space`) captured via pprof.

With `v1.6.2`, `encoding.WritePlainBYTE_ARRAY` seems to consume a lot (profile exported just before reaching OOM; see the screenshot for the caller chains):
```
MacBook-Pro-CH ~ % go tool pprof -inuse_space dtcd-ab-240319-after-v2.prof
File: playerid_cli
Type: inuse_space
Time: Mar 19, 2024 at 1:01pm (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 10
Showing nodes accounting for 60854.17MB, 97.88% of 62170.46MB total
Dropped 301 nodes (cum <= 310.85MB)
Showing top 10 nodes out of 30
      flat  flat%   sum%        cum   cum%
49100.89MB 78.98% 78.98% 49100.89MB 78.98%  github.com/xitongsys/parquet-go/encoding.WritePlainBYTE_ARRAY
...<REDACTED>
```
With `v1.5.4`, `encoding.WritePlainBYTE_ARRAY` ranked fourth (profile exported during stable writing):
```
MacBook-Pro-CH ~ % go tool pprof -inuse_space dtcd-240326.prof
File: playerid_cli
Type: inuse_space
Time: Mar 26, 2024 at 10:44am (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 10
Showing nodes accounting for 12675.05MB, 97.93% of 12942.45MB total
Dropped 253 nodes (cum <= 64.71MB)
Showing top 10 nodes out of 39
      flat  flat%   sum%        cum   cum%
<REDACTED>...
  858.03MB  6.63% 92.99%   858.03MB  6.63%  github.com/xitongsys/parquet-go/encoding.WritePlainBYTE_ARRAY
...<REDACTED>
```
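A minimal sketch of how an `inuse_space` heap profile like the ones above can be dumped from Go code (this is just illustrative; it is not necessarily how these profiles were captured):

```go
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
)

// dumpHeapProfile writes a heap profile that can be inspected with
// `go tool pprof -inuse_space <path>`, as in the sessions above.
func dumpHeapProfile(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	// Run a GC first so the heap statistics are up to date.
	runtime.GC()
	return pprof.WriteHeapProfile(f)
}

func main() {
	if err := dumpHeapProfile("heap.prof"); err != nil {
		panic(err)
	}
}
```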
Currently I am using `v1.6.0`, which works fine with `gocloud/blob`, and its memory usage is almost the same as `v1.5.4`, so I will pin the version there for a while. I suspect the change that causes the increased memory usage was introduced in `v1.6.1` or `v1.6.2`. I hope someone can give me some tips.
Hello there, as the title says.
Due to some requirements, I need to switch from `s3` to the `gocloud/blob` implementation in order to support other storage providers. I initially used the release `github.com/xitongsys/parquet-go v1.5.4` for testing `gocloud/blob`, but found that `gocloud/blob` sometimes failed to write/read data with this version, so I later switched to the latest release (`github.com/xitongsys/parquet-go v1.6.2`) and also updated the parquet tags to match the new release (e.g., changing the tag below from `type=UTF8` to `type=BYTE_ARRAY, convertedtype=UTF8`).
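For illustration, the tag change looks roughly like this (the package, struct, and field names are made up; only the tag syntax is the point):

```go
package schema // hypothetical package, just to show the tag migration

// Record is a made-up struct illustrating the tag change described above.
type Record struct {
	// parquet-go v1.5.4 style tag:
	// Name string `parquet:"name=name, type=UTF8"`

	// parquet-go v1.6.x style tag:
	Name string `parquet:"name=name, type=BYTE_ARRAY, convertedtype=UTF8"`
}
```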
The read/write tests were all good at first, but when I needed to write dozens of gigabytes (over 100 GB) of data, I found that the updated version caused OOM problems. At first I thought there might be an issue with the `gocloud/blob` implementation, so I changed back to `s3` for testing, but the problem was still there. I saw that there were several fixes after the latest release that seemed to be related to memory, so I upgraded to the latest commit (`github.com/xitongsys/parquet-go v1.6.3-0.20231102094431-8ca067b2bd32`), but the OOM problem is still not solved.

==============================

So I'm here to raise my hand: does anyone know of any reason that may cause 4x (or more) memory usage after upgrading from `v1.5.4` to `v1.6.2` / `v1.6.3-0.20231102094431-8ca067b2bd32`?

When writing dozens of gigabytes of data, memory usage grows as follows:
- `v1.5.4` with S3 - 35G (works nicely)
- `v1.6.0` with S3 - 35G (works nicely)
- `v1.6.0` with `gocloud/blob` - 35G (works nicely)
- `v1.6.2` with S3 - 140G (OOM, so it might be more than 140G)
- `v1.6.2` with `gocloud/blob` - 140G (OOM, so it might be more than 140G)
- `v1.6.3-0.20231102094431-8ca067b2bd32` with S3 - 140G (OOM, so it might be more than 140G)

My writer parameters (there isn't any code change here from `v1.5.4`):
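(The original parameter snippet is not reproduced in this extract. For context, a typical parquet-go writer setup looks roughly like the sketch below; the local-file destination and all parameter values are illustrative assumptions, not the exact configuration used here.)

```go
package main

import (
	"log"

	"github.com/xitongsys/parquet-go-source/local"
	"github.com/xitongsys/parquet-go/parquet"
	"github.com/xitongsys/parquet-go/writer"
)

// Record is a made-up schema, matching the tag style shown earlier.
type Record struct {
	Name string `parquet:"name=name, type=BYTE_ARRAY, convertedtype=UTF8"`
}

func main() {
	// Illustrative destination; in this issue the file writer is backed by
	// S3 / gocloud/blob rather than the local filesystem.
	fw, err := local.NewLocalFileWriter("output.parquet")
	if err != nil {
		log.Fatal(err)
	}
	defer fw.Close()

	// 4 write goroutines; the concurrency level is an example value.
	pw, err := writer.NewParquetWriter(fw, new(Record), 4)
	if err != nil {
		log.Fatal(err)
	}

	// Typical tuning knobs exposed by parquet-go (example values).
	pw.RowGroupSize = 128 * 1024 * 1024 // 128 MB
	pw.PageSize = 8 * 1024              // 8 KB
	pw.CompressionType = parquet.CompressionCodec_SNAPPY

	if err := pw.Write(Record{Name: "example"}); err != nil {
		log.Fatal(err)
	}
	if err := pw.WriteStop(); err != nil {
		log.Fatal(err)
	}
}
```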