webrecorder / warcio

Streaming WARC/ARC library for fast web archive IO
https://pypi.python.org/pypi/warcio
Apache License 2.0
387 stars 58 forks source link

"warcio check" does not warn of illegal characters in field names or values, including LF #158

Closed acidus99 closed 1 year ago

acidus99 commented 1 year ago

"warcio check" does not warn if an illegal character is present in either a field name or value.

See the attached WARC. (To keep things simple, this is just a gzipped file, not a per-record gzipped WARC) illegal-chars.warc.gz

Running warcio check illegal-chars.warc returns no warnings or errors. However if you look at the WARC in a hex editor, you will see a custom field which contains the BEL characters (0x07) in both the field name and field value:

image

This problem actually becomes worse if the illegal character is a LF

Per the WARC spec, a LF is not allowed in the field of a record. However, as discussed in #157, warcio treats a LF as a CRLF. This means if a field name or value contains a LFLF, warcio treats this as the separator between the header and block, and immediately starts reading N bytes as the block bytes, where N is the value of the Content-Length field. Since this was not actually the separator between the header and the block, some of the block is left over, and warcio returns the follow error:

    WARNING: Record not followed by newline, perhaps Content-Length is invalid
    Offset: 545
    Remainder: b'mple" \r\n'

You can see this behavior with this WARC (again, just a plain gzipped file) lf-in-field.warc.gz

This error message is confusing because there is nothing wrong with the Content-Length. The problem is instead with illegal characters in the field! If warcio reported illegal characters in the field name, it would be much more clear.

I found this issue while troubleshooting a WARC creation library that was using custom fields on WARC records. It was outputting error messages into the custom field, and included a LFLF from a stack trace. warcio read this LFLF as the header/block separator. Once I figured out what was happening, I checked warcio with other illegal characters like 0x07, and found no illegal characters were flagged.

Expected behavior: warcio check should tell you if the field names or values contain illegal characters

CorentinB commented 1 year ago

Hi @acidus99, just my 2 cents regarding this issue and the couple other you made recently about warcio check (#157 & #156), I would strongly suggest to not rely on "warcio check", warcio itself is not spec compliant and therefore using it to verify the validity of WARCs is a bad idea. I advised on #74 that the command should be removed until the issues mentioned belowed are fixed, but I am starting to loose hope that it will ever be fixed.

That being said, of course you're free to do the experimentations and work you want, it's just my personal opinion based on observations by various members of the archiving community.

If you want to learn more about why warcio is not spec compliant, there are multiple issues opened on the subject:

https://github.com/webrecorder/warcio/issues/74 https://github.com/webrecorder/warcio/issues/129 https://github.com/webrecorder/warcio/issues/128

Have a nice week-end!

ikreymer commented 1 year ago

Hi @acidus99 thanks for reporting this - we will try to get to this when we can! We are working on a lot of different tools and will try to address this when we can.

@CorentinB As you know, this is an open source project, and we do our best to fix things when we can, and we are not being paid for this work and have to balance priorities as best we can. Not everyone has the resources to help maintain open source tools, but since you describe yourself as 'digital archivist and open source enthusiast. Software Engineer @ the Internet Archive' - would you like to contribute a PR to fix some of these issues, in the spirit of open source? Or, perhaps you would like to provide financial support for fixing particular issues, we do have OpenCollective and GitHub Sponsorships. Happy to discuss more, if you'd like to be involved in a more constructive way.

The 'warcio check' feature was contributed by @wumpus, who is also an unpaid contributor, and perhaps this was outside the original scope of the tool, which was creating and writing WARC files. We are all trying our best here.

wumpus commented 1 year ago

warcio check was only intended to check digests. I double-checked the documentation and it's pretty clear. @CorentinB it would be nice if you would stop adding to the noise in this discussion.

'warcio test' is intended to check the whole standard. It's an unfinished PR. Please give it a look.

I would appreciate it if people would stop claiming that their opinion is the only one regarding the standard. #74 suffers from this problem. Both @ikreymer and I have expressed our opinions about a way forward for this problem.

@acidus99 if you'd like to refile this bug, there are 2 possibilities: One bug would be if warcio is capable of creating such an invalid warc.

The second would be for warcio to notice and complain about this problem.

And if you think the documentation for warcio check could be improved so that people don't have unrealistic expectations, I'm all ears.

wumpus commented 1 year ago

(Closing this because warcio check only checks digests, not anything else -- @acidus99 please open another bug for these other scenarios.)

CorentinB commented 1 year ago

Hi @ikreymer & @wumpus, open source is not only code, it's also discussion, sharing views, filling issues, reporting bugs, that kind of things. Open source is also about making people aware of the various issues they might encounter when using software X or Y.

I have no interest in contributing code to warcio (for now), because I have been able to see in previous discussions that you have apparently no interest in following the WARC standard. Although, if your decision on that change in the future, then of course I would be more than happy to spend time contributing if and when needed.

Regarding the various comments you made:

"I would appreciate it if people would stop claiming that their opinion is the only one regarding the standard." The whole point of a standard is that people shouldn't have interpretations of it, they should just follow it, or not.

"warcio check was only intended to check digests." This is exactly what I think the issue is, considering that warcio does not implement payload-digest properly. If you execute warcio check on a standard-compliant WARC file, it fails the check.

"Closing this because warcio check only checks digests, not anything else" no it doesn't, and I would appreciate if you would all stop claiming that it does and remove that feature.

That being said, I understand your frustration from me coming back to those issues every few months, so I won't talk about it anymore, and I'll cross my fingers that you will seriously consider fixing warcio one day or another.

acidus99 commented 1 year ago

@ikreymer @wumpus My apologies for not understanding intended the scope of warcio check. I had thought it was more akin to a validator / linter, hence my comment about "expected behavior." Looking at the help output now, it clearly states "WARC digest checker" so I don't think my bugs apply.

@acidus99 if you'd like to refile this bug, there are 2 possibilities: One bug would be if warcio is capable of creating such an invalid warc.

warcio doesn't seem to create non-compliant WARCs like those in this or #157. No need for a bug here.

The second would be for warcio to notice and complain about this problem.

This seems reasonable, but both this and #157 sound like they would be more appropriate as part of @wumpus's warcio test. I suggest the stock behavior for warcio should err on the "be liberal in what you accept" side of things. I don't think there is a reason to refile a bug here either.

More broadly @ikreymer I can certainly sympathize with running an open source project where everyone has opinions but orders of magnitude fewer are able or willing to contribute. Your work has been immensely helpful. while I'm not a python developer, I have contributed a one-off donation of $100. Thank you for your hard work.

wumpus commented 1 year ago

@acidus99 I'm happy to try to move my 2019 warc test not-yet-ready-to-merge code into something more visible that more people might work on. Do you think that would be useful?

I spent a lot of time writing it, and it's a shame that it got derailed by an astronomy contract job before it was done.

acidus99 commented 1 year ago

That would be amazing. I am not a python developer, or I would offer to help you write the code. That said, I'm completely willing to help you document it, test it, and give feedback on any linting behavior you want to implement.