wordfence / wordfence-cli

Wordfence malware and vulnerability scanner command line utility.
https://www.wordfence.com/products/wordfence-cli/
GNU General Public License v3.0
100 stars 20 forks source link

Non-ASCII characters in filenames can result in encoding errors #259

Closed akenion closed 2 months ago

akenion commented 2 months ago
Error: 'ascii' codec can't encode character '\u030c' in position 56: ordinal not in range(128) 
NetVicious commented 2 months ago

Same problem here. It will be very useful to show on the error messages which file caused the problem.

And a parameter to skip those files with errors and continue with the scan instead of stop the scan will be a must too.

akenion commented 2 months ago

@NetVicious we should be able to address this in such a way the these filenames do not cause errors at all and can be scanned successfully, so I don't think we'll need any additional error messages or parameters. If that ends up being problematic (it shouldn't), then the approach you described would make sense, but I don't anticipate that being necessary.

youradds commented 2 months ago

Same issue here:

'/home/zakelijk/web/site.nl/public_html/wp-content/cache/db/singletables/9c0/90c/9c090c3a9a680baac22ab2aaf06e6a7d.php'
Error: 'ascii' codec can't encode character '\u200f' in position 92: ordinal not in range(128)

It kills the process. I thought it was another issue (with the terminal output not being set to utf8), but I still get the error on a cron (and it died)

Thanks! Cool system otherwise :) (using it on other servers)

Cheers

Andy

mmaunder commented 2 months ago

Milestoning this as 4.0.2 and it'll go out in the next week or two. Prioritizing this because it's a pain in the behind when a long running scan crashes like this.

NetVicious commented 2 months ago

At this moment, when we get this error we don't see on which file it was the problem. You need to check the last processed file, go to that folder and see which it's the next file in alphabetic order.

And I discovered that because I saw this issue on which I saw it was a problem with non-ascii filenames.

That's why I said the application should give more information about on which file it stopped.

akenion commented 2 months ago

@NetVicious In this case, the code that was triggering the error is intended to track the last file processed for the purpose of reporting to the user in case an unexpected error occurs. Since this code was the problematic piece, it couldn't report the file on which it failed.

When verbose logging is enabled, CLI will output a message in the form of Processing file: <file> before it begins processing each file. I will move this before the aforementioned logic for tracking the last file so there's a better chance a message is logged if a similar error were to occur in the future. Ultimately, CLI should be able to handle non-ascii strings gracefully which is what I'm implementing to address this issue.

davidnuzik commented 2 months ago

v4.0.2rc2

SUMMARY: QA validation PASSED. Testing was executed on multiple Linux and environments as well as on MacOS and automation tests were updated with a few non-ascii filenames. All automation tests pass and targeted manual regression tests pass.

TEST STEPS First, I confirmed I could reproduce the issue as described. Then I validated by doing the following:

  1. Create various non-ascii filenames on disk to be scanned. Add a .php file extension to them.
  2. Execute all relevant subcommands on such files on disk, such as: malware-scan, vuln-scan (requires WP install at path), and remediate. count-sites also checked just in case, although I don't think it would read in such files on disk.
  3. For each relevant subcommand, test various options/methods. For example, with remediate try to remediate a single file (won't actually remediate since it's not in the repo) and also try to remediate an entire folder with a non-ascii filename in the directory. For malware-scan try a path that points directly to the file or the folder it is within. For vuln-scan simply have a non-ascii file in the WP install directory.
  4. Test various relevant options, such as --output-path and ensure when the file gets mentioned in such output it appears correct in the file (possible this is not directly related to the fix here but can't hurt to check this). Also, for malware-scan test with and without the vectorscan match-engine. List of non-exhaustive additional areas checked in NOTES section below. Also tested that outputing to a non-ascii filename works as expected.

NOTES: Additional tests performed: Malware-scan

Vuln-scan

Remediate

Example file scanned now works with this change:

VERBOSE: File added to scan queue: /home/david/qa/malware-samples/smallSet/non-ascii-filenames/こんにちは.php                                                                                                                                                                                                                                                                                                                                                       
VERBOSE: Processing file: /home/david/qa/malware-samples/smallSet/non-ascii-filenames/こんにちは.php 
youradds commented 1 month ago

I'm not sure if this is a problem still related to this past issue - but when running:L

wordfence malware-scan /home --email andy@mysite.com -L ERROR --output-format human --output-path /home/wordfence-cli-scan.csv

The cron emails me:

Error: 'bytes' object has no attribute 'encode'

I have turned off output logging as it was just giving me a load of warnings about files not existing any more (mostly W3totalcache tmp files). I don't get an email - so I'm not sure if its dying, or just didn't find anything that needed emailing?

akenion commented 1 month ago

@youradds Thanks for the report, I went ahead and opened a new issue #263 to address this specific error (tracking it as a new issue fits the project workflow better than re-opening an existing one).

I'm going to go ahead and close out the discussion here. Any additional encoding-related errors can be reported as new issues.