whikloj / BagItTools

A PHP BagIt library
MIT License
11 stars 5 forks source link

Check for wrapped lines first, then check for new tags. #18

Closed whikloj closed 3 years ago

whikloj commented 3 years ago

Resolves: #17

Reverse the tests to assume lines starting with at least 2 spaces or a tab character are the continuation of the previous line. Otherwise start checking for a new tag.

@henning-gerhardt does this sound reasonable?

codecov[bot] commented 3 years ago

Codecov Report

Merging #18 into main will increase coverage by 0.45%. The diff coverage is 79.54%.

Impacted file tree graph

@@             Coverage Diff              @@
##               main      #18      +/-   ##
============================================
+ Coverage     78.35%   78.81%   +0.45%     
- Complexity      456      461       +5     
============================================
  Files             7        7              
  Lines          1266     1284      +18     
============================================
+ Hits            992     1012      +20     
+ Misses          274      272       -2     
Impacted Files Coverage Δ Complexity Δ
src/Bag.php 79.37% <79.54%> (+0.73%) 276.00 <1.00> (+5.00)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 667ae0b...64c88da. Read the comment docs.

jens-st commented 3 years ago

Dear @whikloj,

the RFC mentions

It is RECOMMENDED that lines not exceed 79 characters in length. Long values MAY be continued onto the next line by inserting a LF, CR, or CRLF, and then indenting the next line with one or more linear white space characters (spaces or tabs). Except for linebreaks, such padding does not form part of the value.

(https://tools.ietf.org/html/rfc8493#section-2.2.2)

Best regards, JS

henning-gerhardt commented 3 years ago

I tried your changes with a really long value (3 lines text with some colons between): creating and even reading the info bag was successful!

whikloj commented 3 years ago

Hi @jens-st, I don't understand your comment. Could you help explain what you are trying to direct my attention to?

whikloj commented 3 years ago

@henning-gerhardt could you give this one more test. I made a couple changes to support reading bags with newlines in the bag-info.txt tags without removing those newlines. It should all be good, but I'd appreciate a once over and then I'll cut a point release.

henning-gerhardt commented 3 years ago

I checked your latest changes and I'm unsure. If I put a very long value into a bag info tag, which is wrapped inside bag-info.txt into multiple lines, then I would assume on reading this bag back that I'm getting this bag info tag without any formation and / or added new lines back.

whikloj commented 3 years ago

Okay that is a fair point. To properly wrap lines on word break before we hit 79 characters makes the actual line length dependent on the word length. So it would be hard to tell if the newline was an auto-wrapping or a deliberate carriage return.

So maybe the trade-off is we leave your files on disk alone when reading, but if you execute $bag->update() then we are reformatting your lines. Which is unfortunate but I don't really see away around it right now. I'll revert those changes.

whikloj commented 3 years ago

Actually, I may have talked myself out of this. The section of the RFC that @jens-st referenced above says

...Except for linebreaks, such padding does not form part of the value.

Which I read to mean that linebreaks ARE part of the value. So maybe instead of removing them by default I should be storing them, but providing a way to retrieve either the original value (with line breaks) or a clean value (without line breaks). Thoughts?

henning-gerhardt commented 3 years ago

I will discuss this with my teammates (one is maintaining a Perl based BagIt implementation) on Monday or Tuesday. As I'm only creating the bag but don't read it later I have only limited view on this.

Edit: the bag info read method on the Perl implementation is located here: https://metacpan.org/release/Archive-BagIt/source/lib/Archive/BagIt/Base.pm#L657 ... but I have only a limited knowledge in Perl.

henning-gerhardt commented 3 years ago

Ok, we discussed this: adding line breaks and providing them on read is ok. But added padding white spaces must be removed on read / providing the bag info tag value.

whikloj commented 3 years ago

Sorry for the delay on this @henning-gerhardt. So I've gone with lines over 70 characters are considered to have auto-wrapped. So I remove the newline characters. Lines under that, I keep the newline.

I'm looking at storing all the newlines and providing a way to access it with or without. So the data is pure but you can access it as a single string. But that will come later.

Let me know if this still works for you.

henning-gerhardt commented 3 years ago

Sorry for my late reply but I was in vacation. Changes looks good for my use case.

whikloj commented 3 years ago

You never have to apologize for taking vacation 😄