wf49670 / ppgen

Post-processing generator for DP

In some situations .pn directives generate invalid <span> elements #21

Closed davem2 closed 10 years ago

davem2 commented 10 years ago

If a .pn directive is followed by a .if h block, the <span> element that is generated in the HTML fails W3C validation with a "document type does not allow element "span" here" error. The page numbers still show up fine in the HTML (at least in Firefox), but the file fails W3C validation.

Here is a ppgen code snippet that produces the error:

// 254.png
.pn +1

.if h
.de hr.tight { border:none;border-bottom:1px solid;margin-top: 0.5em;margin-bottom:0.5em;margin-left:45%; width:10%; margin-right:45% }
.li
<hr class="tight"/>
.li-
.if-
.if t
.hr 10%
.if-

.in +4
.ti -4
.nr psb 0em
.fs 110%
THE AFRICAN WANDERERS;

The resulting HTML:

<span class='pageno' title='235' id='Page_235'></span><hr class="tight"/>

<p class='c019'>THE AFRICAN WANDERERS;</p>

A full version of a ppgen project that illustrates the issue further can be found here: (there are 6 errors similar to this that occur in it)

https://dl.dropboxusercontent.com/u/59728548/sr/glimpses-of-nature/glimpses-of-nature-html.zip

ghost commented 10 years ago

For this to work, ppgen has to look inside and understand a literal block that it didn't generate. Worse, it has to change the user's literal block to accommodate the page bump. By design, ppgen treats literal user code as sacrosanct and would never do that.

Is there a reason the .pn +1 cannot be placed after the .if conditional block? Seems like it would be there in practice anyway, which works as designed. I don't see a way to accommodate this requested construction short of a horrible bandaid (sliding the .pn past a contiguous .li).

Please consider if we really want to change ppgen in this manner.

wf49670 commented 10 years ago

I would vote not to make that change, Roger.

However, it would be nice to detect the dangling .pn when encountering the .if block, and to warn of the problem instead of leaving it for validation to catch.

And I'm not sure the "bandaid" is that horrible. You already slide .pn specifications down to the proper spot, and it's simply a matter of noticing the .li and ignoring lines until the .li-. It should be almost the same code I provided yesterday to avoid looking for lang specifications inside literal blocks. (And I could provide that, if you'd like.)

ghost commented 10 years ago

Here's where I see a distinction. As Dave originally coded it, ppgen has to assume that the page number transition is in effect when the horizontal rule is generated. The .hr is a displayable element, and that's key. As Dave coded it, the page number should show up next to the horizontal rule--the next displayable element. I don't want to slide around it and override the PPer's intent. If Dave insists this needs to work, I suppose I can detect an unclothed span and just protect it in a div but I'm not crazy about that either.

ghost commented 10 years ago

I just checked. His code generates: <span class='pageno' title='8' id='Page_8'></span><hr class="tight"/> If I clothe it like this, it will validate: <div><span class='pageno' title='8' id='Page_8'></span></div><hr class="tight"/> I don't know if that's something we want to do or not.
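A minimal Python sketch of the "clothing" described above. It is not ppgen's actual code: the function name, the regex, and the rule "a span at the very start of an output line is unclothed" are all assumptions made for illustration.

```python
import re

# Assumed shape of the empty page-number span quoted in this thread.
PAGENO_SPAN = re.compile(r"<span class='pageno' title='\d+' id='Page_\d+'></span>")


def clothe_pageno(line: str) -> str:
    """Wrap a page-number span in a <div> when it starts the line.

    Assumption: a span at the very start of an output line is not inside
    any block element. A span that follows an opening tag such as <p> is
    left alone.
    """
    match = PAGENO_SPAN.match(line)
    if match is None:
        return line
    return "<div>" + match.group(0) + "</div>" + line[match.end():]


if __name__ == "__main__":
    print(clothe_pageno("<span class='pageno' title='8' id='Page_8'></span><hr class=\"tight\"/>"))
```

The open question in the comment still stands: the wrapper validates, but whether ppgen should emit it is a design decision, not a coding one.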

windymilla commented 10 years ago

At work, but quick comment. I agree a page number shouldn't be skipped over some literal HTML (which could potentially be a substantial size). It's very common for people to just have

[123]

when the pagenum is not easily attachable to something else (or at least Guiguts can't do it). So using an otherwise empty <div> or <p> around a pagenum span if a literal block is encountered would be perfectly acceptable to PPV.

On 10 October 2014 12:11, rfrank notifications@github.com wrote:

Here's where I see a distinction. As Dave originally coded it, ppgen has to assume that the page number transition is in effect when the horizontal rule is generated. The .hr is a displayable element, and that's key. As Dave coded it, the page number should show up next to the horizontal rule--the next displayable element. I don't want to slide around it and override the PPer's intent. If Dave insists this needs to work, I suppose I can detect an unclothed span and just protect it in a div but I'm not crazy about that either.


ghost commented 10 years ago

Okay we should try to code that. Nigel's point about the .li potentially containing text means we have to allow the .pn before and we have to handle it before the .li starts. I was thinking too small, about the .hr only.

It feels like this has to happen in the final generation phase. I need to search the document for page number spans that need to be surrounded by <div> .. </div>. There needs to be a way to determine if a page number span isn't already inside a block element. I'm not sure how to do that. Any ideas?
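One possible answer to the question above, sketched with Python's standard html.parser. The class name and the list of block tags are assumptions, and the sketch only works if the generated HTML (including anything passed through from .li blocks) has properly closed block elements, which literal user HTML does not guarantee.

```python
from html.parser import HTMLParser

# Assumed set of elements that count as "clothing" for a page-number span.
BLOCK_TAGS = {"p", "div", "td", "th", "li", "blockquote",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class UnclothedPagenoFinder(HTMLParser):
    """Collect the ids of pageno spans that are not inside any block element."""

    def __init__(self):
        super().__init__()
        self.open_blocks = []  # stack of currently open block tags
        self.unclothed = []    # ids of spans found with no enclosing block

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in BLOCK_TAGS:
            self.open_blocks.append(tag)
        elif tag == "span" and attributes.get("class") == "pageno":
            if not self.open_blocks:
                self.unclothed.append(attributes.get("id"))

    def handle_endtag(self, tag):
        # Only pop when the closing tag matches the innermost open block.
        if self.open_blocks and self.open_blocks[-1] == tag:
            self.open_blocks.pop()


def find_unclothed(html: str):
    finder = UnclothedPagenoFinder()
    finder.feed(html)
    finder.close()
    return finder.unclothed
```

Fed the HTML from the original report, this would flag the span that precedes the literal hr and skip spans that sit inside a paragraph.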

wf49670 commented 10 years ago

If this is the only situation we've found where we generate an unclothed page-number span, wouldn't it be simpler to just generate it in its own <div> when we encounter the .li? To me it feels like an after-the-fact detection of the unclothed span would be much more difficult, and I would vote for the simpler solution.

davem2 commented 10 years ago

Wouldn't it be simpler to just generate it in its own <div> when we encounter the .li?

This is the first solution that came to mind. However, I have a feeling there may be some situation(s) where inserting a <div> or <p> may cause unwanted side effects / breakage in the HTML (middle of a table? ...). I can't really think of a specific example, but since .li is so flexible there must be a case like this.

Something else to consider, IIRC just enclosing a pageno span within a <div> or <p> has a minor side effect of inserting some extra vertical space. In my first PP project I used this technique and vaguely remember there being a small side effect (not 100% sure if/what).

wf49670 commented 10 years ago

On 10/10/2014 7:08 PM, davem2 wrote:

Wouldn't it be simpler to just generate it in its own <div> when
we encounter the .li?

This is the first solution that came to mind. However, I have a feeling there may be some situation(s) where inserting a <div> or <p> may cause unwanted side effects / breakage in the HTML (middle of a table? ...). I can't really think of a specific example, but since .li is so flexible there must be a case like this.

You can already have .pn in the middle of a table, and as I recall it works correctly. (At least, it did for the cases I tried previously.)

Something else to consider, IIRC just enclosing a pageno span within a <div> or <p> has a minor side effect of inserting some extra vertical space. In my first PP project I used this technique and vaguely remember there being a small side effect (not 100% sure if/what).

That's certainly a consideration, but it should be controllable with appropriate CSS specifications on that div.

If a simple solution using <div> as we encounter that .pn doesn't work for some reason, I think it would be better to issue a warning message when the .li is encountered, saying that the HTML is unlikely to validate, and suggesting that the PPer move the .pn or the .li.

The idea of post-scanning the HTML to find page-number specifications that are not enclosed, while having to ignore HTML inserted via .li, seems very complex to me, and not worth the effort for a simple case where we could warn the PPer to move a bit of code around.

So if simply using a <div> doesn't work, I think ppgen should just issue a warning and be done with it.

Walt

ghost commented 10 years ago

I worked on coding this a lot today trying several different approaches. I was unsuccessful.

If the PPer puts a .pn before a literal block, that block might be a paragraph of text. The PPer rightly assumes that the page number will show on the first line of the paragraph, so it needs to go into the paragraph as a span. But if the first thing in the .li isn't a block, then it still needs to go in but as a div. Deciphering what the user has put in the literal block has stymied my efforts today.

I don't want to give up just yet. Just saying "Warning: this isn't going to validate" isn't the solution I'm hoping for.

ghost commented 10 years ago

No success again through the night.

If the user chooses to escape to HTML, then they are ultimately responsible for integrating their code with ppgen's code. That's always been the case. Perhaps it's time to just accept that. This is just one case where that general rule would apply.

For this particular case, I would expect the user to include the page number in their literal block and not have the .pn +1 before it at all. Then, to keep things in sync, the next page break after the literal block the user would use an explicit ".pn 72" or whatever the next page is (or they could do a ".pn +2").

I will put in a warning if ppgen encounters a ".li" when trying to place a page number unless someone jumps in and has a better idea.
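A rough Python sketch of that warning check, not the code that was merged. The function name is hypothetical, and the sketch assumes that any non-blank line which is neither a dot directive nor a // comment is ordinary text that absorbs the pending page number.

```python
def find_dangling_pn(lines):
    """Return (pn_line, li_line) pairs where a .li is reached while a .pn is pending.

    Line numbers are 1-based. Assumption: ordinary text lines consume the
    pending page number; dot directives and // comments do not.
    """
    dangling = []
    pending = None  # line number of a .pn that has not been placed yet
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if text.startswith(".pn "):
            pending = number
        elif text == ".li":
            if pending is not None:
                dangling.append((pending, number))
                pending = None
        elif text and not text.startswith(".") and not text.startswith("//"):
            pending = None
    return dangling


if __name__ == "__main__":
    source = ["// 254.png", ".pn +1", "", ".if h", ".li",
              '<hr class="tight"/>', ".li-", ".if-"]
    for pn_line, li_line in find_dangling_pn(source):
        print(f"Warning: .pn on line {pn_line} is pending at .li on line {li_line}; "
              "the HTML may not validate")
```

Run against the snippet from the original report, the .if h and .de lines do not clear the pending page number, so the .li is flagged.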

wf49670 commented 10 years ago

On 10/11/2014 12:40 AM, rfrank wrote:

If the PPer puts a .pn before a literal block, that block might be a paragraph of text. The PPer rightly assumes that the page number will show on the first line of the paragraph, so it needs to go into the paragraph as a span. But if the first thing in the .li isn't a block, then it still needs to go in but as a div. Deciphering what the user has put in the literal block has stymied my efforts today.

I started to say that if the PPer wanted a paragraph of text then it's not necessary to use a .li block, but then I thought that maybe the PPer wants to style that paragraph of text without needing to figure out the cxxx class name that ppgen will assign to it.

Even worse, the .li might be generating a complex table, and ppgen might need to apply the page number to it.

Here's a different approach. Suppose we just provide a way for the PPer to tell ppgen a safe location to emit the page number span within the .li block. We would need ppgen to scan the code within the .li block, but it would have something specific to look for.

Some possibilities: (1) Rather than placing the .pn command before the .li block, the PPer might place it inside the block, still starting in column 1. The .pn command would then be the only thing that ppgen looks for within the .li block (except .li-, of course), and it would be the PPer's responsibility to ensure that it is placed where it will be properly enclosed by something. After emitting that page-number span, ppgen would continue to scan in case the .li block generates more than one page.

(2) We might allow a new trigger string within a .li block, perhaps something like (with any operand allowed by .pn), which the PPer would place where the number belongs. PPgen would scan the .li block looking for that construct, and emit the page-number span there. As with (1), ppgen would keep scanning in case there are multiple pages generated by the .li block.

Of these, I think (1) is sufficient, because line breaks are not significant in HTML. So if the PPer is generating something such as a paragraph (or even a table cell) and needs to specify the page number at some specific location, he can simply insert a line break, the .pn command, and another line break, and then continue with that paragraph or table cell without any noticeable effect. It should also be simpler code than for (2), and it keeps us with one mechanism for specifying page numbers, rather than introducing a new one.
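Possibility (1) could be sketched in Python roughly like this. The function name is hypothetical, the span format is copied from the HTML quoted earlier in this thread, and the sketch assumes only decimal operands (an absolute number or +N).

```python
def expand_literal_block(block_lines, current_page):
    """Copy the lines of a .li block, replacing column-1 .pn lines with spans.

    Returns the output lines and the updated page counter. Everything that
    is not a .pn line is passed through untouched.
    """
    output = []
    for line in block_lines:
        if line.startswith(".pn "):
            operand = line[4:].strip()
            if operand.startswith("+"):
                current_page += int(operand[1:])  # relative bump, e.g. ".pn +1"
            else:
                current_page = int(operand)       # absolute page, e.g. ".pn 72"
            output.append(
                f"<span class='pageno' title='{current_page}' "
                f"id='Page_{current_page}'></span>"
            )
        else:
            output.append(line)
    return output, current_page
```

Because the counter is returned, a block that spans several pages keeps later .pn +1 directives outside the block in sync.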

Walt

ghost commented 10 years ago

Clever, indeed. I would vote against it, though. I don't want the PPer to have to learn anything new here; ppgen is complex enough. I want the PPer to know that their literal block is to contain standard HTML and that they are responsible for getting it right. That would apply here and for all the other situations we have yet to encounter.

Both your solution and mine will generate the same code, so there is no PPV implication. So it goes back to Dave, who has this scenario (and I think one other slightly different, based on his test file) in a real book. I'm not clutching this to my chest; which solution we use is still open. Dave, your thoughts?

ghost commented 10 years ago

One other thought. In your solution, Walt, I think it presumes that there will be a place to put in a pageno span. But it's possible, perhaps even common, for the .li to contain no block elements. So aren't we right back to having to determine programmatically if it's a div or a span?

wf49670 commented 10 years ago

On 10/11/2014 10:07 AM, rfrank wrote:

One other thought. In your solution, Walt, I think it presumes that there will be a place to put in a pageno span. But it's possible, perhaps even common, for the .li to contain no block elements. So aren't we right back to having to determine programmatically if it's a div or a span?

Good point, Roger, but I don't think so.

With my suggestion, I think you would still need the warning if ppgen encounters .pn then a .li block before it has found a spot to place the page-number span. Presumably, then, the PPer will either have to delete the .pn and incorporate the page-number span inside the HTML he's generating, or move either the .pn or the .li block.

What the PPer should have done (if we go with my suggestion) is to move the .pn inside the .li block at the proper spot. Or, if there is no proper spot, then he's back to moving either the .pn or the .li block to eliminate the warning. Note that in this case there is no place for him to put the page-number span within the block, either.

I am somewhat concerned, with your approach, of two things: (1) The PPer needs to learn the format of the page-number spans that ppgen generates, so he can incorporate one within his .li block if that is the proper spot for the page number. Yes, someone using .li blocks for HTML needs to know HTML, but this approach forces him to also know details of what ppgen does, which conceivably might change (as they have in this area once already). So for each new release of ppgen that he uses, the PPer will need to examine the ppgen-created HTML and see what the spans look like. Usually they will be the same as for the last project, but the PPer will need to check.

(2) Conceivably, the PPer will discover that something is wrong with his page numbering somewhere in the book, possibly before the .li block(s) in which he has embedded the page-number span(s). If the PPer corrects that by adjusting his .pn commands, he then has to remember to find the code he incorporated within his .li blocks and adjust it, too. I expect this situation to be rare, but it's a trap just waiting to catch the unwary.

I don't think it's much extra burden for someone who already knows enough HTML that he can use .li blocks to also learn where he can safely place a .pn command. And it does give the PPer one method of setting page numbers (.pn) whether he's inside or outside a .li block. So it's less learning, as I see it.

Walt

windymilla commented 10 years ago

I think I agree with Roger's suggestion that the PPer needs to consider page numbers themselves if they escape to literal HTML. There's a danger and additional complexity if ".li" no longer means "literal".

It is not like previously discussed issues such as trying to override the CSS for a particular item. Then, it was undesirable because you might need to generate an HTML with ppgen, find out the internal class used, then modify it, with a risk of the class name changing when you made subsequent edits earlier in the file and regenerating, etc.

In this case, the PPer knows the page number (from the scan) before they've ever run ppgen and so they can specify it manually, both inside the literal block for as many pages as necessary, and also afterwards to get the page counter back in sync.

I think the situation is that ppgen can already be used by experts to create any HTML they want. What we need to avoid is making it offputting to non-experts. I think as technically knowledgeable PPers we have a greater responsibility to the newcomer and the nervous than we do to people like ourselves.

Nigel

windymilla commented 10 years ago

My post crossed with Walt's.

I can see your point (1), Walt. However, I think having to specify a pagenum inside a literal will be a rare occurrence anyway, unless someone is doing so much literal coding that one might question whether ppgen is the best tool for that project.

Regarding point (2), I don't agree. The PPer will know to specify "page 72" in their literal code by looking at the scan and seeing that it is page 72. The only way page numbering could conceivably go wrong would be if you had the wrong number of ".pn +1" type directives and so ppgen got out of sync with the scans. If that was the case, then you would need to fix it regardless of use of literal, and your page numbers from your literal section onwards would automatically be brought into sync by using either ".pn 73" or ".pn +2" without any further adjustment needed to the literal block.

wf49670 commented 10 years ago

Truth be told, I'm not unhappy with simply warning the PPer, and it was (iirc) one of my early suggestions.

And it does keep things simpler in the ppgen code.

Walt

ghost commented 10 years ago

Great. Then I'll take your suggestion, Walt, and make it a warning.

ghost commented 10 years ago

Done and merged with develop.

wf49670 commented 10 years ago

Hi, Roger,

I've coded an enhancement to ppgen, based on the discussion in the forums from the PPer who wanted ppgen to translate Footnote and Illustration to his book's language. That was handled with ".nr Footnote" and ".nr Illustration", but he also wanted ppgen to generate "[Footnote 1 : ...]" with a space before the ":".

I'm not sure I agree with doing that, but it bothered me a bit that if he is going to do it he'll need to manually edit the text output files after ppgen creates them. With that after-ppgen step there's always the possibility that the PPer will neglect to do it before finally submitting the book.

So, for fun, I implemented a .sr directive.

The .sr directive gathers search/replace regular expressions that will be applied during the postprocessing phase of ppgen to make changes to the generated output.

Syntax: .sr <which> /search/replace/

Arguments:
  which is a string containing some combination of
      ulth (UTF-8, Latin-1, Text, HTML)
  search is a reg-ex search string
  replace is a reg-ex replace string
  / is any character not contained within either search or replace

The s/r strings are gathered during preprocessCommon and saved for use
during post-processing. Messages are issued telling how many lines
contained each search string, and how many times the replacement was applied.

Example:
  .sr t /(Footnote \d+):/\1 :/
     This will apply to UTF-8 and Latin-1 text generation, and will
     replace the string Footnote <number>:
                   with Footnote <number> :

Note: The user must understand Python reg-ex syntax (e.g., \1 not $1), but the strings are treated as raw strings (no Python-specific escapes are needed).

The -dd command line option will supply some additional debugging
information if needed.

I'm not sure what uses it might have other than that one. But it should handle any legitimate search and replace strings that re.search and re.sub(n) understand. One needs to be careful using this, of course, especially against the HTML output files.
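For readers who want to see the idea in miniature, here is a Python sketch of parsing and applying one .sr directive. It is not the code in the PostGenSR branch; the function names are hypothetical, and it assumes the directive is well formed, with the delimiter appearing exactly three times.

```python
import re


def parse_sr(directive):
    """Split '.sr <which> /search/replace/' into (which, search, replace).

    The delimiter is whatever character starts the third field; it must not
    occur inside the search or replace strings.
    """
    _, which, rest = directive.split(None, 2)
    rest = rest.rstrip()
    delimiter = rest[0]
    parts = rest.split(delimiter)
    if len(parts) != 4 or parts[0] != "" or parts[3] != "":
        raise ValueError("malformed .sr directive: " + directive)
    return which, parts[1], parts[2]


def apply_sr(lines, search, replace):
    """Apply one search/replace to every line.

    Returns the new lines, the number of lines that matched, and the total
    number of replacements made, mirroring the two counts in the messages
    described above.
    """
    pattern = re.compile(search)
    new_lines, lines_matched, replacements = [], 0, 0
    for line in lines:
        new_line, count = pattern.subn(replace, line)
        if count:
            lines_matched += 1
        replacements += count
        new_lines.append(new_line)
    return new_lines, lines_matched, replacements


if __name__ == "__main__":
    which, search, replace = parse_sr(r".sr t /(Footnote \d+):/\1 :/")
    print(apply_sr(["[Footnote 1: text]"], search, replace))
```

Using re.subn rather than re.sub gives the replacement count for free, which is what makes the per-directive messages cheap to produce.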

I have not given you a pull request for this yet, as it's something totally unexpected that I wasn't sure you'd want to integrate. But it was an interesting learning exercise, even if you don't integrate it into the main branch of ppgen, and it has given me some ideas for another tool.

If interested, you can view it at https://github.com/wf49670/ppgen/tree/PostGenSR

Regards, (and thanks, again, for ppgen!) Walt

windymilla commented 10 years ago

I'm very disappointed it doesn't have the \C...\E feature of Guiguts regexps, in order to allow execution of arbitrary python code. :)

Seriously, it looks interesting - another of those advanced features which means ppgen could do almost anything, but that new PPers would not need to know about.

Splitting the ppgen documentation into beginner/advanced would be good for newcomers. Several of the existing sections, and several parts of some commands could be moved into the "advanced" area. Just looking down the sections, the following would be candidates, IMHO: Centering text, Conditionals, Comments (form 2), Cover image, Division, Drop caps, Emdashes(?), Greek, some of Illustrations, Macros, Mapping, Named Registers, Temporary Indent.

While looking through I also saw the .dt command under Special Situations. I think it is in fact required for every book. The comment about only for PPing convenience isn't true at DP, I don't think.

Nigel

wf49670 commented 10 years ago

Thanks for the comments, Nigel.

I will look into adding \C...\E and the other extensions; I may know a way to do it, but it will take some experimentation. I'm not sure how much it should really be needed, though. That canonical example in the GG manual seems to be increasing or decreasing page numbers by some amount in the HTML, but with ppgen wouldn't it be better to use regexes in the editor you're using to maintain the program and change the .pn statements appropriately, then regenerate?

Being run as they are, they affect what ppgen has generated, not the source file, and most such things are probably better changed in the source, aren't they?

Discussion probably belongs elsewhere, not in the long-closed issue. It only accidentally ended up here when I misdirected an email. If Roger is interested in this at all I suppose we could discuss it in the team thread. That's also a good spot to discuss any revamping of the doc. I'm not sure whether it's better to change the reference manual or do some more simple tutorials. Tutorials, along with better workflow and auxiliary tool suggestions may be more appropriate.

(But one tool that I'm starting to work on is something to apply the GG scanno regexes.)

ghost commented 10 years ago

Walt, I think Nigel was joking about \C...\E. I hope so, at least. More to the point, your .sr mod looks interesting but in your opinion do we need it? I think you see it for fixing one specific situation for the wording used in a LOTE footnote. Would that be more naturally served by a named register? Or just editing the file after generation. Ppgen is not meant to be a master format, after all. I'm struggling to justify adding a general-purpose regex at runtime based on what I know now. It's quite different from GG regexs because they change the source permanently, as in any editor. It's a different thing. Can you tell me more why we need it? I want to support what you do for ppgen, but I know complexity, even if completely optional, is scary to many. I guess what I am saying is that if you want the .sr dot command in there and are willing to document it, then I'll put it in. I don't have a strong reason not to. Just a fear of creeping featurism. You've paid your dues: you get to make this call. Let me know.

Related: Nigel suggests a user-friendly manual of the most common ppgen directives and constructions. I get the feeling that what we have is pretty overwhelming.

windymilla commented 10 years ago

I'm very sorry Walt. I was joking about \C...\E. I should know better than to post a joking message without flagging it much more explicitly - sorry for the confusion.

I think I agree with Roger. We could keep the code changes safely tucked away somewhere, in case other situations that need it arise, but I think it does add an additional air of technical complexity to ppgen. Anything we can do to open ppgen up to less confident PPers is worth doing.

On documentation, I wondered about transclusion (though I'm not really sure if/how it works) as a way of having two manuals based on one set of documentation. Each command on a separate wiki page (some commands split into basic/advanced features on separate pages). The basic manual just transcludes basic commands. The full manual transcludes all.

wf49670 commented 10 years ago

On 10/20/2014 11:04 PM, rfrank wrote:

Walt, I think Nigel was joking about \C...\E. I hope so, at least.

Ah, yes, I seem to have missed the smiley and irony. Nonetheless, while avoiding the ability to execute arbitrary code, it's an interesting challenge that I'll still probably take on :)

More to the point, your .sr mod looks interesting but in your opinion do we need it? I think you see it for fixing one specific situation for the wording used in a LOTE footnote. Would that be more naturally served by a named register? Or just editing the file after generation. Ppgen is not meant to be a master format, after all. I'm struggling to justify adding a general-purpose regex at runtime based on what I know now. It's quite different from GG regexs because they change the source permanently, as in any editor. It's a different thing. Can you tell me more why we need it? I want to support what you do for ppgen, but I know complexity, even if completely optional, is scary to many. I guess what I am saying is that if you want the .sr dot command in there and are willing to document it, then I'll put it in. I don't have a strong reason not to. Just a fear of creeping featurism. You've paid your dues: you get to make this call. Let me know.

I'm not convinced it's needed. (I'm also not convinced that a separate .nr is needed for this.)

I guess I do tend to think of ppgen as providing a master format, and I suspect that others might also view it that way. But at this point we don't have a good use case showing this is needed, so I'm fine with treating it as an educational exercise and leaving it on the shelf until a need is found. It's separate enough that integrating it later, if it turns out to be needed, should be simple to do.

Related: Nigel suggests a user-friendly manual of the most common ppgen directives and constructions. I get the feeling that what we have is pretty overwhelming.

Yes, he could be right about that. And transclusion may be a reasonable approach to having two manuals. The difficulty might be figuring out what is truly basic and what is advanced.

Walt