Handle \{ and \} escapes

GoogleCodeExporter commented 9 years ago

It appears ASS has no way to produce the "{" character. For this reason libass 
added "\{" and "\}" escape sequences 3 years ago as libass-specific features.

Maybe it would be a good idea to add these escapes in xy-vsfilter? Also see 
https://code.google.com/p/libass/issues/detail?id=98.

Original issue reported on code.google.com by nfxjfg@googlemail.com on 22 Jun 2013 at 8:51

GoogleCodeExporter commented 9 years ago

I'd be concerned that it would break the rare script which placed a strange 
"comment" within brackets with the intention of not being displayed. What was 
it that made you notice this difference in the first place?

VSFilter will display { just fine as long as it's not followed by a }, and vice 
versa.
Only when both opening and closing brackets are used together is everything 
between them eliminated.

The attached scripts using brackets with your "com" example will render just 
fine in VSFilter:

c｛o｝m <-- In a pinch you could always use full-width brackets ｛ ｝ and 
these will be displayed by VSFilter. Libass renders it too wide.
c{om <-- Libass does not render this correctly, only showing "c", VSFilter 
displays normally
co}m <-- Both VSFilter & Libass display normally
c{o\}m <-- Both VSFilter & Libass display as "cm"
c\{om <-- Libass treats this as an escape and shows "c{om"; VSFilter 
displays it as-is
co\}m <-- Libass treats this as an escape and shows "co}m"; VSFilter 
displays it as-is
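
As a rough illustration of the VSFilter behaviour described above, here is a minimal Python sketch. It only models the rule "a { with a later } hides everything between them"; it is not VSFilter's actual parser, and the name strip_blocks is made up for this example.

```python
import re

# One block: "{", then anything that is not "}", then "}".
BLOCK = re.compile(r"\{[^}]*\}")

def strip_blocks(text):
    """Drop every {...} block; a lone { or } is left in place."""
    return BLOCK.sub("", text)

# Print how each sample line would come out under this model.
for line in ["c{om", "co}m", r"c{o\}m", r"c\{om"]:
    print(repr(line), "->", repr(strip_blocks(line)))
```

Under this model a backslash has no special meaning next to a bracket, which is why "c{o\}m" collapses to "cm".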

Even though it had good intentions, the current behavior in Libass seems like 
it would result in more broken scripts than correct ones.

Considering VSFilter's present behavior, supporting \{ and \} escapes 
independently is not an option. It would only make sense if a \{ escape were 
accepted only when a \} escape follows. Otherwise they should be displayed 
as-is, and everything between brackets eliminated. For example, after such a 
change:

\} <--- should continue to be displayed as "\}" (currently "}" in Libass)
\{ <--- should continue to be displayed as "\{" (currently "{" in Libass)
c\{o}m <-- should continue to be displayed as "c\m" (currently "c{o}m" in 
Libass)
c\{o\}m <-- only this should display as "c{o}m" (currently "c\m" in VSFilter)
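
To make the four cases above concrete, here is a minimal Python sketch of the paired-escape idea. It is only a model of this proposal, not code from VSFilter or libass; the function name and the placeholder characters are assumptions made for the example.

```python
import re

# A \{ ... \} pair with no other brackets in between.
PAIRED = re.compile(r"\\\{([^{}]*?)\\\}")
# An ordinary {...} block, which is removed from the output.
BLOCK = re.compile(r"\{[^}]*\}")

# Private-use placeholders; assumes the input never contains them.
OPEN, CLOSE = "\ue000", "\ue001"

def render_paired(text):
    # 1. Protect escapes that come as a matched \{ ... \} pair.
    text = PAIRED.sub(lambda m: OPEN + m.group(1) + CLOSE, text)
    # 2. Eliminate normal {...} blocks, as is done today.
    text = BLOCK.sub("", text)
    # 3. Turn the protected pairs into literal brackets.
    return text.replace(OPEN, "{").replace(CLOSE, "}")

for line in ["\\}", "\\{", "c\\{o}m", "c\\{o\\}m"]:
    print(repr(line), "->", repr(render_paired(line)))
```

A lone \{ or \} never matches the pair pattern, so it falls through to the existing behaviour.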

Thoughts? This would seem to be the only way to minimize potential breakage, 
while allowing normal opening/closing {} brackets to be displayed together. 
Though I would need to check with YuZhuoHuang if it's even practical to parse 
scripts in this fashion, since I'm unfamiliar with how VSFilter currently 
handles elimination of brackets.

Original comment by cyber.sp...@gmail.com on 22 Jun 2013 at 11:53

Attachments:

brackets.ass

GoogleCodeExporter commented 9 years ago

Uh no, that's way too complicated, and doesn't strictly reduce breakage.

libass probably has this because some of its users want to render any text with 
it, even if it doesn't originate from an ASS script, and { } were the only 
things that couldn't be properly escaped. (The original use case was probably 
converting and rendering other subtitle formats with libass? I can only 
speculate.)

Then here are some alternative proposals:

1. Add special cased ASS tags to generate { and }. This could be as simple as 
interpreting \{ _inside_ style overrides. For example, {\{} would produce {.

2. Add ASS tags to generate arbitrary unicode characters, like {\unicode1234} 
produces U+1234. Any character can be generated by this, including { and }. (I 
don't like this so much because it requires extra code to convert the code to 
UTF-8 and UTF-16, and it's not clear how characters outside of the BMP (i.e. 
U+10000 and beyond) should behave... what if users start to encode surrogate 
pairs as separate \unicode commands?)

3. Add an ASS tag that includes quoted text literally. For example, the tag 
could be named \lit, and {\lit"{}"} would generate {}.

4. Alternative form of \lit: include a byte length of the text that should be 
reproduced literally. For example, {\lit1:}} would generate }. I like this 
best, because it makes programmatically escaping text trivial and efficient. On 
the other hand, byte lengths are hard to explain to users, and if the script is 
recoded to another charset, it will break hard. (IMO ASS scripts should always 
be in UTF-8, but I have seen scripts in UTF-16 too.)
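
To show why proposal 4 makes programmatic escaping trivial, here is a minimal Python sketch. The \lit tag is hypothetical (it exists in no renderer), the function names are made up, and the decoder only understands \lit and passes every other byte through untouched.

```python
import re

# "{\lit", a decimal byte count, then ":" (the payload follows).
LIT = re.compile(rb"\{\\lit(\d+):")

def escape_lit(text, encoding="utf-8"):
    """Wrap text in a length-prefixed \\lit tag; no scanning of the text needed."""
    n = len(text.encode(encoding))
    return "{\\lit%d:%s}" % (n, text)

def render_lit(data):
    """Replace each well-formed {\\litN:...} in a byte string by its N payload bytes."""
    out = bytearray()
    pos = 0
    while True:
        m = LIT.search(data, pos)
        if m is None:
            out += data[pos:]
            return bytes(out)
        out += data[pos:m.start()]
        end = m.end() + int(m.group(1))
        if data[end:end + 1] == b"}":
            out += data[m.end():end]   # copy the payload literally
            pos = end + 1
        else:
            out += data[m.start():m.end()]  # malformed: pass the prefix through
            pos = m.end()

print(escape_lit("}"))
print(render_lit(escape_lit("{}").encode("utf-8")))
```

The count is in bytes of the chosen encoding, so the same tag stops being valid if the file is later converted to UTF-16, which is the fragility mentioned above.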

Original comment by nfxjfg@googlemail.com on 23 Jun 2013 at 12:26

GoogleCodeExporter commented 9 years ago

> 1. Add special cased ASS tags to generate { and }. This could be as simple as
> interpreting \{ _inside_ style overrides. For example, {\{} would produce {.

This sounds reasonable, but I would suggest using a text tag of some kind to 
create these curly brackets. {\{} and {\}} would likely cause havoc for 
existing parsers.

Something like {\fbl} {\fbr} or {\fb1} {\fb0} for "font brace left (open)" & 
"font brace right (close)" respectively. This would minimize breakage when 
encountered by an existing parser, and limit the purpose of such a tag to 
displaying curly brackets only.

> 2. Add ASS tags to generate arbitrary unicode characters
> ...I don't like this so much

I'm not fond of this idea either. It seems overkill just to support curly 
brackets, when every other unicode character could be displayed as-is with 
proper file encoding.

> 3. Add an ASS tag that includes quoted text literally. 
> For example, the tag could be named \lit, and {\lit"{}"} would generate {}.

What usefulness would this have outside of displaying lines with curly 
brackets? Placing random strings under a \lit override would cause existing 
parsers to not display these lines remotely correctly.

> 4. Alternative form of \lit: include a byte length of the text that should be 
> reproduced literally.

Same objections as 3).

Original comment by cyber.sp...@gmail.com on 23 Jun 2013 at 2:30

GoogleCodeExporter commented 9 years ago

Overall I believe {\fb1}text{\fb0} would make most sense, functioning as 
required open & close tags to display text within curly brackets.

Original comment by cyber.sp...@gmail.com on 23 Jun 2013 at 2:55

GoogleCodeExporter commented 9 years ago

>> 3. Add an ASS tag that includes quoted text literally. 
>> For example, the tag could be named \lit, and {\lit"{}"} would generate {}.
>
>What usefulness would this have outside of displaying lines with curly 
>brackets? Placing random strings under a \lit override would cause existing 
>parsers to not display these lines remotely correctly.
>
>
>> 4. Alternative form of \lit: include a byte length of the text that should be 
>> reproduced literally.
>
>Same objections as 3).

It would be nice to have a simple way to escape text. In fact, the first 
proposal is missing \, which from what I know can only be produced with a 
trick: emit \ followed by a word joiner (U+2060) special character.

So if we change something anyway, why not improve the situation in general? 4. 
would be the simplest solution for code which has to escape ASS. Actually, let 
me suggest 5:

5. Like 4, but with a hack to improve backward compatibility. For example 
{\lit4}{}\N would produce {}\N. After the closing  } following a \litN tag 
(where N is an integer), N bytes are copied literally without parsing tags or 
escapes. But I don't really see any improvements over 4, other than 
questionable backwards compatibility that "sometimes" works.

While suggestions 2 and 3 were somewhat "fancy" (with the intention to propose 
something elegant, but which also happens to solve the problem), 4 is quite 
straightforward and trivial to implement.

>Overall I believe {\fb1}text{\fb0} would make most sense

Fine by me. But why the parameter 1/0? This is normally used to enable a style 
or a mode, which is not the case here. They just emit { and }, and they won't 
necessarily be matched. Similarly, {\fb1\fb1} emits two characters (again, 
because \fb is not enabling/disabling state).

As I said above, I want to request another tag that emits \ if we go with this 
solution.

So I would suggest:

\fbl produces {
\fbr produces }
\backslash produces \
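
A minimal Python sketch of how these three tags could be used, assuming they are written inside override blocks as discussed. The tags are only a proposal, the function names are made up, and the renderer model simply drops every other tag.

```python
import re

# Proposed tags from this thread; none of them exists in a real renderer.
TAG_FOR_CHAR = {"{": r"{\fbl}", "}": r"{\fbr}", "\\": r"{\backslash}"}
CHAR_FOR_TAG = {"fbl": "{", "fbr": "}", "backslash": "\\"}

BLOCK = re.compile(r"\{([^}]*)\}")
TAG_NAME = re.compile(r"\\([A-Za-z]+)")

def escape_for_ass(text):
    """Replace each special character by the override block that emits it."""
    return "".join(TAG_FOR_CHAR.get(ch, ch) for ch in text)

def render(text):
    """Toy renderer: a block outputs one character per \\fbl, \\fbr or \\backslash tag."""
    def emit(m):
        names = TAG_NAME.findall(m.group(1))
        return "".join(CHAR_FOR_TAG.get(name, "") for name in names)
    return BLOCK.sub(emit, text)

escaped = escape_for_ass(r"a{b}\N")
print(escaped)
print(render(escaped))
```

Because the tags emit characters rather than toggle state, {\fbl\fbl} produces two opening brackets in this model.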

Original comment by nfxjfg@googlemail.com on 23 Jun 2013 at 3:31

GoogleCodeExporter commented 9 years ago

So are you ok with \fbl \fbr \backslash?

Original comment by nfxjfg@googlemail.com on 28 Jun 2013 at 4:39

GoogleCodeExporter commented 9 years ago

> Fine by me. But why the parameter 1/0? This is normally used to enable a 
> style or a mode, which is not the case here. They just emit { and }, and 
> they won't necessarily be matched. 

Well my original thought was for it to essentially act as a style override for 
a text segment, with enable/disable states, but thinking about it a bit more, I 
now see that there would be a potential need for tags which display these 
separately, in order to ensure scripts are otherwise parsed correctly.

> \fbl produces {
> \fbr produces }

That said, this will probably be fine.

> \backslash

At first I wasn't so sure about this, but I guess it would be needed as well if 
someone actually wanted to display \N or \n without a line break. Though it 
should probably be shortened somehow, maybe \bksl (backslash) or \fbksl (font 
backslash) to keep with the current short naming scheme for tags, unless 
someone has a better naming idea.

Any changes such as this will need to be approved by an Aegisub dev as well 
before a final decision is made. Go ahead and contact Plorkyeran 
(tgoyne) or jfs (jiifurusu) and point one of them towards this issue to express 
any comments, suggestions, or objections they may have.

Original comment by cyber.sp...@gmail.com on 28 Jun 2013 at 11:59

GoogleCodeExporter commented 9 years ago

Well, some html tags, e.g. , can not be displayed either...

Original comment by YuZhuoHu...@gmail.com on 3 Jul 2013 at 12:38

GoogleCodeExporter commented 9 years ago

Does vsfilter interpret HTML tags? libass doesn't.

Original comment by nfxjfg@googlemail.com on 3 Jul 2013 at 1:10

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

> Does vsfilter interpret HTML tags? libass doesn't.

I remember a subtitle format that uses XML or something similar. VSFilter supports 
it. And (un)fortunately, VSFilter uses the same code to parse the dialogue 
content for almost all text-based subtitles. So those tags can be used in ASS. 
You should be able to use ASS tags in srt too.

And there's a trick to display :"<{}b>"
similar to this "display \" trick:
> emit \ followed by a word joiner (U+2060) special character.
I'm not encouraging it. But if special characters are only {, } and \, this 
would be fine enough
> \backslash produces \
> \fbl produces {
> \fbr produces } (as long as we have \backslash and \fbl, \fbr is not a 
requirement?)

Original comment by YuZhuoHu...@gmail.com on 3 Jul 2013 at 2:08

GoogleCodeExporter commented 9 years ago

>as long as we have \backslash and \fbl, \fbr is not a requirement?

Not strictly. So if you want we can omit this.

Original comment by nfxjfg@googlemail.com on 3 Jul 2013 at 4:01

GoogleCodeExporter commented 9 years ago

I'd prefer to have \fbr for consistency despite it not being strictly required.

I like that {\fbl} degrades well with older renderers, but introducing 
printable characters to override blocks would cause a lot of issues. What 
happens if \fbl is in an override block that also contains other tags? If this 
isn't allowed then everything currently existing which collapses adjacent 
override blocks is broken, and the process for doing that gets a lot more 
complex. If it is allowed, then manipulating the contents of override blocks 
gets a lot more complex (currently you can eliminate duplicate tags in a block 
and rearrange them pretty willy-nilly, which a lot of code takes advantage of 
(not just in Aegisub, but also in scripts other people have written)).

One option would be to use a character other than \ to begin it and require 
that it be in a block with no \ characters (i.e. a comment block). I suspect 
this would turn out to be more complex to support than something like \U007B 
outside of blocks.

> (IMO ASS scripts should always be in UTF-8, but I have seen scripts in UTF-16 
> too.)

I've been told that Chinese software not designed with i18n in mind tends to be 
UTF-16-only since it's actually easier to deal with on Windows than GBK.

Original comment by tgoyne on 16 Jul 2013 at 10:47

GoogleCodeExporter commented 9 years ago

So would you be more comfortable with the \U007B  syntax (used outside of style 
overrides)? That would be ok with me. The sequence \U alone should be extremely 
rare, so there should be no compatibility concerns.

We have to make sure that nobody thinks you're supposed to split \U into 
surrogate pairs, though.

Original comment by nfxjfg@googlemail.com on 17 Jul 2013 at 3:52

GoogleCodeExporter commented 9 years ago

I suppose so. It's overly general, but should be fairly trivial to support and 
is a better fit for how ASS already works than the other suggestions.

Most effective way to avoid people trying to encode surrogate pairs would 
probably be to have VSFilter explicitly drop/mangle/whatever codepoints 
U+D800-U+DFFF to avoid having surrogate pairs work by coincidence.

Original comment by tgoyne on 17 Jul 2013 at 4:59

GoogleCodeExporter commented 9 years ago

How should codepoints beyond the BMP be encoded? The syntax should be 
unambiguous.

I suggest: \U123B0\

Or in other words, terminate it with a second \ .
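
A minimal Python sketch of this terminated form, for illustration only. The limit of one to six hex digits and the decision to leave NUL, surrogates and out-of-range values untouched are assumptions of the example, as is the function name.

```python
import re

# \U, one to six hex digits, then a terminating backslash.
TERMINATED = re.compile(r"\\U([0-9A-Fa-f]{1,6})\\")

def decode_terminated(text):
    def repl(m):
        cp = int(m.group(1), 16)
        # Assumption: NUL, surrogates and values past U+10FFFF stay as written.
        if cp == 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            return m.group(0)
        return chr(cp)
    return TERMINATED.sub(repl, text)

print(decode_terminated("a\\U7B\\b"))
print(hex(ord(decode_terminated("\\U123B0\\"))))
```

The closing backslash is consumed along with the escape, so the digits can never run into following text.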

Original comment by nfxjfg@googlemail.com on 17 Jul 2013 at 5:21

GoogleCodeExporter commented 9 years ago

\u for 4 digits and \U for 8 digits is something I'm mildly a fan of, but 
ultimately I don't care too much what the syntax is as long as it's 
unambiguous. Terminating \ looks a little funny, but would be fine with me.

Original comment by tgoyne on 17 Jul 2013 at 8:25

GoogleCodeExporter commented 9 years ago

Whatever is done, it should:

A) Be ignored by current and past versions of VSFilter & Libass script parsers.

B) Only enable the display of control characters which are unable to be 
displayed currently in plaintext.

Am I misunderstanding something about this recent suggestion?

Original comment by cyber.sp...@gmail.com on 17 Jul 2013 at 10:26

GoogleCodeExporter commented 9 years ago

Then I suggest that \u is interpreted in normal text. This is done only in the 
text body of subtitle events, and not anywhere else in the ASS file structure. 
Further, they are not interpreted inside of comment/tag sections (i.e. inside of 
{...}). Behavior inside of drawing mode is unspecified.

\uXXXX

X is a hex digit, i.e. one character of 0-9, a-f, A-F. End-of-string (as well 
as embedded NULs) behaves as if non-hex-digits are encountered. The number 
formed by XXXX represents the unicode codepoint of the character the subtitle 
parser is supposed to output in place of the \u escape. If one of the X 
characters is not a hex digit, the parser must not interpret the escape, but 
pass it through to the renderer without changing it.

Examples: (input -> what is rendered)

"abc\u00512345" -> "abcQ2345" (U+0051, which is Q)
"abc\u51-def" -> "abc\u51-def" (not interpreted because "-" is not a hex 
digit)

\UXXXXXX

X is defined as above. This represents a unicode codepoint. The allowed range 
goes from U+0001 to (including) U+10FFFD, excluding invalid/reserved 
codepoints. (U+10FFFE and U+10FFFF are noncharacters, and there are no code 
points above U+10FFFF.) Like the short variant 
\u, \U must not be interpreted if any of the 6 X characters is not a hex digit.

Examples:

"abc\U01039cde" -> "abc𐎜de" (the six digits 01039c give U+1039C UGARITIC LETTER U)
"\abc\U0051" -> "\abc\U0051" (not interpreted because there are invalid/not 
enough characters)

If an \u or \U escape encodes invalid or reserved codepoints, the 
implementation should reject these and should not interpret the escape. This 
includes U+0000, and all codepoints defined as invalid or reserved by the 
Unicode standard (such as anything past U+10FFFD).

In particular, it is NOT allowed to encode UTF-16 surrogate pairs using this 
scheme (like JSON actually does, curse it). The codepoints U+D800 to 
(including) U+DFFF must be rejected by any implementation so that nobody 
tries to encode surrogate pairs as escape sequences. (Note that the codepoints 
used for surrogates are defined as reserved by Unicode.)
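
The rules above can be sketched in a few lines of Python. This is a minimal illustration of the proposal, not code from any renderer: the function names are made up, {...} sections are found with a simple pattern, and "invalid or reserved" is approximated as NUL, surrogates, noncharacters and anything past U+10FFFF (unassigned codepoints are not checked).

```python
import re

# \u + exactly 4 hex digits, or \U + exactly 6 hex digits.
ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{6})")
# Comment/tag sections, in which escapes are not interpreted.
BLOCK = re.compile(r"(\{[^}]*\})")

def is_allowed(cp):
    if cp == 0 or cp > 0x10FFFF:
        return False                      # NUL or beyond Unicode
    if 0xD800 <= cp <= 0xDFFF:
        return False                      # UTF-16 surrogates
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return False                      # noncharacters
    return True

def _decode(m):
    cp = int(m.group(1) or m.group(2), 16)
    # Rejected codepoints are passed through unchanged.
    return chr(cp) if is_allowed(cp) else m.group(0)

def decode_event_text(text):
    parts = BLOCK.split(text)             # odd indices are {...} sections
    for i in range(0, len(parts), 2):
        parts[i] = ESCAPE.sub(_decode, parts[i])
    return "".join(parts)

for s in [r"abc\u00512345", r"c\u007Bo\u007Dm", r"\uD800", r"{\u0051}\u0051"]:
    print(repr(s), "->", repr(decode_event_text(s)))
```

Decoding happens after the text has been split into sections, so a { produced by \u007B is plain output and does not start a new tag section.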

I tried to be as precise as possible.

> A) Ignored by current and past versions of VSFilter & Libass script parsers.

You mean no output at all? I don't see the advantage. They will not interpret 
the escape and display it as text, and that's just as broken as not displaying 
anything at all. It's not the same as override tags, where you _can_ get 
degraded but useful display of the text.

> B) Only enable the display of control characters which are unable to be 
> displayed currently in plaintext.

I don't see any advantage here either. It'd be easier for an implementation, 
though.

Original comment by nfxjfg@googlemail.com on 18 Jul 2013 at 4:44

xiaowan3 / xy-vsfilter

Handle \{ and \} escapes #149