zensqlmonitor / influxdb-zabbix

Gather data from Zabbix back-end and load to InfluxDB in near real-time for enhanced performance and easier usage with Grafana.
MIT License

Spool the result to disk instead of memory #5

Closed: menardorama closed this issue 7 years ago

menardorama commented 7 years ago

Hi,

An issue came up when I tried to sync 6 months of Zabbix historical data.

The SQL query for history_uint returns several billion rows. I think it would be a good option to be able to spool the result to a file instead of holding it in memory.

In my case it does not work: the OOM killer kills the process.

menardorama commented 7 years ago

Alternatively, limiting the query to a number of records and keeping the last record timestamp in the JSON file could be an option.

In that case it would not consume all the memory and it would synchronize bucket by bucket.
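
A rough sketch of that suggestion in Go (the project's language), assuming a PostgreSQL back-end and the standard Zabbix `history_uint` columns; the checkpoint file name, connection string, and batch size are illustrative and not part of influxdb-zabbix:

```go
// Hypothetical sketch: pull Zabbix history in bounded batches and persist the
// last processed timestamp to a small JSON checkpoint file between runs.
package main

import (
	"database/sql"
	"encoding/json"
	"fmt"
	"os"

	_ "github.com/lib/pq" // assuming a PostgreSQL back-end
)

type checkpoint struct {
	LastClock int64 `json:"last_clock"` // Unix timestamp of the last row loaded
}

func loadCheckpoint(path string) (checkpoint, error) {
	var cp checkpoint
	data, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		return cp, nil // first run: start from 0
	}
	if err != nil {
		return cp, err
	}
	return cp, json.Unmarshal(data, &cp)
}

func saveCheckpoint(path string, cp checkpoint) error {
	data, _ := json.Marshal(cp)
	return os.WriteFile(path, data, 0644)
}

func main() {
	db, err := sql.Open("postgres", "dbname=zabbix sslmode=disable")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	const batchSize = 100000
	cp, err := loadCheckpoint("history_uint.json")
	if err != nil {
		panic(err)
	}

	for {
		// Fetch only the next bucket of rows, ordered by clock, instead of the full table.
		rows, err := db.Query(
			`SELECT itemid, clock, value FROM history_uint
			 WHERE clock > $1 ORDER BY clock LIMIT $2`, cp.LastClock, batchSize)
		if err != nil {
			panic(err)
		}

		count := 0
		for rows.Next() {
			var itemid, clock, value int64
			if err := rows.Scan(&itemid, &clock, &value); err != nil {
				panic(err)
			}
			// ... build the InfluxDB point from itemid/clock/value and write it here ...
			cp.LastClock = clock
			count++
		}
		rows.Close()

		if err := saveCheckpoint("history_uint.json", cp); err != nil {
			panic(err)
		}
		fmt.Printf("loaded %d rows, last clock %d\n", count, cp.LastClock)
		if count < batchSize {
			break // caught up
		}
	}
}
```

Each run only holds one batch in memory and resumes from the last saved `clock`, so an OOM or crash in the middle of a long backfill does not force a restart from scratch.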

zensqlmonitor commented 7 years ago

@menardorama, could you please check whether it now works as needed? You can now configure the rows per batch for each table (input/output).

menardorama commented 7 years ago

The issue with the fix you propose is that it won't reduce the DB footprint of those queries.

I would suggest that instead of defining a limit, it would be better to define a time range.

In that case, if you are using partitioning on the historical tables, you would benefit from the partition pruning mechanism.

Is this something you would accept?

Instead of having a limit, you would just fetch the data with ##ENDDATE## = ##STARTDATE## + #certain amount of time#.
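
For illustration, a minimal Go sketch of such a time-window loop, again assuming a PostgreSQL back-end; the table/column names, connection string, and window size are placeholders, and the real ##STARTDATE##/##ENDDATE## substitution would happen in the tool's own query templates:

```go
// Hypothetical sketch: advance a fixed-size clock window so that only the
// matching partitions of history_uint are scanned (partition pruning),
// instead of limiting by row count.
package main

import (
	"database/sql"
	"time"

	_ "github.com/lib/pq" // assuming a PostgreSQL back-end
)

func loadWindowed(db *sql.DB, start time.Time, window time.Duration) error {
	for start.Before(time.Now()) {
		end := start.Add(window) // ##ENDDATE## = ##STARTDATE## + window
		rows, err := db.Query(
			`SELECT itemid, clock, value FROM history_uint
			 WHERE clock >= $1 AND clock < $2`,
			start.Unix(), end.Unix())
		if err != nil {
			return err
		}
		for rows.Next() {
			var itemid, clock, value int64
			if err := rows.Scan(&itemid, &clock, &value); err != nil {
				rows.Close()
				return err
			}
			// ... forward the row to InfluxDB here ...
		}
		rows.Close()
		start = end // move to the next window
	}
	return nil
}

func main() {
	db, err := sql.Open("postgres", "dbname=zabbix sslmode=disable")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Backfill six months of history in one-day windows.
	start := time.Now().AddDate(0, -6, 0)
	if err := loadWindowed(db, start, 24*time.Hour); err != nil {
		panic(err)
	}
}
```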

zensqlmonitor commented 7 years ago

Have you created the indexes as specified in the readme file? That definitely reduces the DB footprint, as you only scan a subset of the tables.

Also, I am using partitioning (not real partitioning with psql like it is done in MS SQL Server) and it works as it should on my side.

The solution is designed like an ETL. With SSIS, you define the input/output rows per batch. The process starts, the first/top rows are queried and sent to the output, split or not into multiple subsets, and then the next range is queried, and so on.
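
A small, hypothetical Go sketch of the output side of that batching model: a fetched input batch is split into smaller sub-batches before being written out, so the writer never holds the whole result set at once. `writeBatch` here stands in for the actual InfluxDB write and is not the project's API:

```go
// Hypothetical sketch: split an input batch into smaller output batches
// before writing, mirroring the input/output rows-per-batch idea.
package main

import "fmt"

// writeBatch stands in for the actual InfluxDB line-protocol write.
func writeBatch(points []string) {
	fmt.Printf("writing %d points\n", len(points))
}

// flushInChunks writes the points in slices of at most outputRowsPerBatch.
func flushInChunks(points []string, outputRowsPerBatch int) {
	for start := 0; start < len(points); start += outputRowsPerBatch {
		end := start + outputRowsPerBatch
		if end > len(points) {
			end = len(points)
		}
		writeBatch(points[start:end])
	}
}

func main() {
	// Dummy points standing in for InfluxDB line-protocol strings.
	points := make([]string, 250000)
	for i := range points {
		points[i] = "zabbix measurement line"
	}
	flushInChunks(points, 100000) // 250k input rows become three output writes
}
```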

I made several loads during the day, and the memory footprint of the process was less than 2 GB for the 4 tables with 100k and 200k rows per batch.