widdix / aws-s3-virusscan

Antivirus for Amazon S3
https://bucketav.com/
Apache License 2.0
525 stars 127 forks

ClamAV stopped scanning files #93

Open nidhigwari opened 2 years ago

nidhigwari commented 2 years ago

We use ClamAV to scan files for our application, via an S3, SQS, and ClamAV integration. It seems to have stopped working suddenly. Adding: ClamAV version: ClamAV 0.103.6

andreaswittig commented 2 years ago

Sorry, we are not providing support for this free/open-source project. Check out our solution bucketAV with professional support included: https://bucketav.com

rmerrellgr commented 2 years ago

Oddly enough, we had this happen to us yesterday as well. Every instance of the s3-virusscan that we had running on a t3.micro suddenly died at the same time. Log inspection led us to find that they all ran out of RAM and the OOM killer killed clamd, but when systemd tried restarting it, it couldn't. I don't know enough about how clamd works when it phones home to get signature updates, but one theory is that it pulled an update yesterday that maxed out all the RAM on the smaller instances. We fixed it just by launching new instances.
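For anyone wanting to confirm the same failure mode on their instances, a quick check might look like this (assuming a systemd-based distro such as Amazon Linux 2; log paths and unit names may differ on your setup):

```shell
# Look for OOM-killer activity in the kernel ring buffer
dmesg -T | grep -i "out of memory"

# A killed clamd shows up with a line like "Killed process <pid> (clamd)"
grep -i "killed process.*clamd" /var/log/messages

# Check why systemd could not bring the daemon back up
journalctl -u clamd --no-pager | tail -n 50
```

If the first two commands match anything, the instance hit the same RAM exhaustion described above.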

andreaswittig commented 2 years ago

@nidhigwari Sorry, I was too fast and harsh.

@rmerrellgr Thanks for providing more context.

nidhigwari commented 2 years ago

Thanks @rmerrellgr! We have launched new instances, but the service is still not working. We also see a freshclam-related error: "WARNING: FreshClam previously received error code 429 or 403 from the ClamAV Content Delivery Network (CDN). This means that you have been rate limited or blocked by the CDN."

andreaswittig commented 2 years ago

@nidhigwari ClamAV introduced very strict throttling limits. We have been running into those limits as well and are now hosting our own mirror of the malware database.
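For anyone else mirroring the signature database themselves, freshclam can be pointed at a private mirror through its configuration. A minimal sketch (the mirror hostname below is a placeholder, not a real endpoint):

```
# /etc/freshclam.conf (excerpt)
# PrivateMirror takes precedence over the default database.clamav.net mirror.
# clamav-mirror.example.com is a placeholder for your own mirror host.
PrivateMirror https://clamav-mirror.example.com

# Number of update checks per day; lowering this also reduces CDN traffic
# if you are not using a private mirror.
Checks 12
```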

andreaswittig commented 2 years ago

> Oddly enough, we had this happen to us yesterday as well. Any instance of the s3-virusscan that we had running on a t3.micro all suddenly died at the same. Log inspection lead us to find that they all ran out of RAM and OOM killer killed clamd, but when systemd tried restarting it, it couldn't. I don't know enough about how clamd works when it phones home to get signature updates, but one theory is that it pulled an update yesterday that maxed out all the ram on the smaller instances. We fixed it just by launching new instances.

Is it possible that you tried to scan a "large" S3 object? Did you check the dead-letter queue?
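One way to check the dead-letter queue depth is via the AWS CLI (the queue URL below is a placeholder; substitute the DLQ created by your stack):

```shell
# Queue URL is a placeholder -- replace with your stack's dead-letter queue URL
aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/s3-virusscan-dlq \
  --attribute-names ApproximateNumberOfMessages \
  | jq -r '.Attributes.ApproximateNumberOfMessages'
```

A non-zero count means some objects failed scanning and were moved to the DLQ.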

michaelwittig commented 2 years ago

@rmerrellgr what is the value of the SwapSize parameter?

awsnicolemurray commented 2 years ago

> @nidhigwari ClamAV introduced very strict throttling limits. We have been running into those limits as well and are now hosting our own mirror of the malware database.

What is the recommendation? How does the customer determine if the issue is because of throttling? Currently no files are being scanned and the issue impacts dev, stag, and prod environments. All appear to have been impacted on the same day.

Please help us understand what changes were made since July 15th so we can determine the best course of action for troubleshooting.

rmerrellgr commented 2 years ago

@andreaswittig Nope, no large file scans (no scans at all for some time before the crash, actually). But as I suspected, this is what we found in the logs

At which point it just loops forever in this state of trying to start back up, but it can't. At this point, I just decided it would be easier to just launch replacement instances and be done with it.

I think it's safe to say that this isn't a Widdix problem. We have production-level instances running on larger instance types and they did not suffer the same fate. I just found it peculiar that our dev servers died unexpectedly and then someone else reported that theirs did as well. I do not believe any action needs to be taken on your part, however.

And to answer your other question, these t3.micro instances have SwapSize set to 2 in the CloudFormation config.
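Worth noting for anyone else tuning this: whether the configured swap actually came up can be verified directly on the instance with standard Linux tooling (nothing specific to this stack):

```shell
# Show active swap devices/files and their sizes (empty output = no swap active)
swapon --show

# Memory and swap usage summary in MiB; the "Swap:" row should be non-zero
free -m
```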

andreaswittig commented 2 years ago

@awsnicolemurray I'd recommend checking the logs.

andreaswittig commented 2 years ago

@rmerrellgr Interesting, I haven't observed anything like this before.