wshito / asciidoctor-chunker

The utility to create chunked HTML files from the single HTML generated by Asciidoctor.
MIT License

Alternative Chunked Files Naming Scheme #29

Closed tajmone closed 2 years ago

tajmone commented 2 years ago

Currently chunked HTML files are named index.html, chap1.html ... chap10.html, etc. This is a reasonable default when publishing a single book as a website, either in the site root folder or in a subfolder.

In some contexts, multiple documents might be hosted in the same website folder (or in another type of HTML publication). For this kind of use it would be more practical if asciidoctor-chunker could adopt a different naming scheme, as described below.

Given a source (unsplit) document SomeBook.html, generate the chunked files SomeBook.html, SomeBook_01.html ... SomeBook_10.html, etc. The base filename is preserved in every chunk, with no counter for the Title-Page and a same-width counter suffix (_<n>) in all other files, so that the files are listed asciibetically in the folder (i.e. take into account the number of digits of the highest chunk number and add leading zeros so that every counter has the same width). In a publication with fewer than 10 chunked files the counter would be a single digit (_n); with 10-99 files, two digits (_nn); with 100-999 files, three digits (_nnn), etc.

This would make it possible to host multiple publications in a single folder and to manually add a custom index.html file providing links to the various publications.

Even in cases where a single publication is being hosted on the website, preserving the original base-name of the file is desirable, both in terms of highlighting the publication name and for Search Engine Optimization.

An additional CLI option should allow users to decide whether the Title-Page should be generated using the original basename (without counter) or be named index.html instead (the latter being useful when the publication is the only content shown in the website). I think that the base-name should be the default, and using index.html should require an extra option, but as long as the option is available it doesn't really make a huge difference which one is the default.
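
To make the proposal concrete, here is a minimal sketch of the naming scheme and the Title-Page option described above. The function name chunkFileNames and the useIndex flag are hypothetical and used only for illustration; they are not part of asciidoctor-chunker, and the sketch assumes chunkCount is the number of chunked files excluding the Title-Page.

```javascript
// Hypothetical helper, NOT part of asciidoctor-chunker.
// sourceFile: name of the unsplit HTML file, e.g. "SomeBook.html"
// chunkCount: number of chunks excluding the Title-Page (assumption)
// useIndex:   name the Title-Page "index.html" instead of the base name
function chunkFileNames(sourceFile, chunkCount, useIndex = false) {
  const base = sourceFile.replace(/\.html$/i, "");
  // Counter width follows the number of digits of the highest chunk number.
  const width = String(chunkCount).length;
  const names = [useIndex ? "index.html" : `${base}.html`];
  for (let n = 1; n <= chunkCount; n++) {
    names.push(`${base}_${String(n).padStart(width, "0")}.html`);
  }
  return names;
}

console.log(chunkFileNames("SomeBook.html", 3));
// [ 'SomeBook.html', 'SomeBook_1.html', 'SomeBook_2.html', 'SomeBook_3.html' ]
```

With 10 chunks the same call yields SomeBook_01.html through SomeBook_10.html, which sort correctly in a plain directory listing.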

Since it's not possible to rename the chunked files without also updating all their cross-reference links, these options would be extremely useful in contexts where multiple chunked publications need to coexist in the same folder without filename clashes.

wshito commented 2 years ago

This is in the realm of deployment. You can achieve it with shell or any other scripts much more easily.

tajmone commented 2 years ago

> This is in the realm of deployment.

Not entirely. One might argue that naming the split files chap*.html is an opinionated semantic decision (appendices are not chapters). It also affects the possibility of storing multiple publications in the same folder.

> You can achieve it with shell or any other scripts much more easily.

Every renamed HTML file breaks cross-reference links, and there isn't a single page that doesn't have at least one link from the TOC sidebar. Coordinating the renaming and the link fixing across every HTML file is not trivial, and the only "easy" solution would be to rely on regular expressions (which is not a good alternative to having it done by a dedicated tool that works on the DOM directly).

wshito commented 2 years ago

I see your point. It is absolutely an opinionated decision. But it is really simple to post-process the files. You could probably finish implementing a first draft while waiting for this discussion to conclude.

  1. The file names will be listed in order. You can just rename them with a numbering scheme while keeping a mapping of old to new names.
  2. Use sed to replace all the old filenames that occur in the files with new ones. You can easily find sample snippets for find and replace on the net. See here.
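
The two steps above could be sketched roughly as follows in Node.js instead of sed. All three helper names are hypothetical and not part of asciidoctor-chunker. The sketch assumes the chunks are named chapN.html and that links appear as double-quoted href attributes. Note that a plain lexicographic listing puts chap10.html before chap2.html, so the names are sorted numerically first.

```javascript
// Hypothetical helpers, NOT part of asciidoctor-chunker.

// Sort "chapN.html" names by their numeric part (assumes every name has digits).
function sortChunks(names) {
  const num = (name) => Number(name.match(/\d+/)[0]);
  return [...names].sort((a, b) => num(a) - num(b));
}

// Step 1: map each old name to a new, zero-padded name, keeping the order.
function buildRenameMap(oldNames, base) {
  const width = String(oldNames.length).length;
  const map = new Map();
  oldNames.forEach((oldName, i) => {
    const counter = String(i + 1).padStart(width, "0");
    map.set(oldName, `${base}_${counter}.html`);
  });
  return map;
}

// Step 2: replace old names inside href attributes, preserving #fragments.
// This is a regex-based approach, with the fragility that implies.
function rewriteLinks(html, map) {
  return html.replace(
    /href="([^"#]+)(#[^"]*)?"/g,
    (match, file, fragment = "") =>
      map.has(file) ? `href="${map.get(file)}${fragment}"` : match
  );
}

const map = buildRenameMap(
  sortChunks(["chap1.html", "chap10.html", "chap2.html"]),
  "SomeBook"
);
console.log(rewriteLinks('<a href="chap2.html#sec">Next</a>', map));
// <a href="SomeBook_2.html#sec">Next</a>
```

A real script would also have to rename the files on disk and apply rewriteLinks to every chunk, including the Title-Page.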

tajmone commented 2 years ago

> Use sed to replace all the old filenames that occur in the files with new ones.

In the end, the easiest solution is to just take the asciidoctor-chunker.js distro file and tweak the single occurrence of "chap" directly in the source file (sure, this could also be done via sed, but it's a simple one-time operation).

The downside is that it hinders auto-updating the tool, since the toolchain now relies on a custom-modded version of the chunker instead of the Node.js package; but after having discussed it with my collaborators we decided it's worth the price — arbitrarily changing the original publication filename is a big "No! No!" in the editorial world, since authors are quite protective of their works.

I hope you might reconsider the idea of keeping the original HTML filename in the chunked docs (except index.html, of course). I believe that preserving whatever original filename the author decided to give to the publication is the safest default option, and I don't see any added value in replacing it with the arbitrarily chosen "chap" prefix (which, by the way, only makes sense if the publication is in English).