Open GoogleCodeExporter opened 9 years ago
I'd be concerned that it would break the rare script which placed a strange
"comment" within brackets with the intention of not being displayed. What was
it that made you notice this difference in the first place?
VSFilter will display { just fine as long as it's not followed by a } and vice
versa.
Only when both opening and closing brackets are used together is everything
between eliminated.
The attached scripts using brackets with your "com" example will render just
fine in VSFilter:
c{o}m <-- In a pinch you could always use full-width brackets { } and
these will be displayed by VSFilter. Libass renders it too wide.
c{om <-- Libass does not render this correctly, only showing "c", VSFilter
displays normally
co}m <-- Both VSFilter & Libass display normally
c{o\}m <-- Both VSFilter & Libass display as "cm"
c\{om <-- Libass treats this as an escape, only showing "c{om", VSFilter
displays normally
co\}m <-- Libass treats this as an escape, only showing "co}m", VSFilter
displays normally
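For reference, the VSFilter rule described above (a matched { ... } pair is dropped together with its contents, a lone brace is kept) can be approximated in a few lines of Python. This is only a sketch of the behaviour as stated in this comment, not VSFilter's actual parser, and the function name is made up for illustration.

```python
import re

# Hypothetical helper: models the rule described in this comment,
# not VSFilter's real parsing code.
_OVERRIDE_BLOCK = re.compile(r"\{[^}]*\}")

def visible_text_vsfilter_like(line: str) -> str:
    """Drop every '{...}' pair; leave unmatched braces untouched."""
    return _OVERRIDE_BLOCK.sub("", line)

for sample in ["c{o}m", "c{om", "co}m", "c{o\\}m"]:
    print(repr(sample), "->", repr(visible_text_vsfilter_like(sample)))
```

Note that a backslash before } does not protect the brace in this model, which matches the c{o\}m case listed above.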
Even though it had good intentions, the current behavior in Libass seems like
it would result in more broken scripts than correct ones.
Considering VSFilter's present behavior, supporting \{ and \} escapes
independently is not an option. It would only make sense if a \{ escape was
only accepted if a \} escape followed. Otherwise they should be displayed
as-is, and everything between brackets eliminated. For example, after such a
change:
\} <--- should continue to be displayed as "\}" (currently "}" in Libass)
\{ <--- should continue to be displayed as "\{" (currently "{" in Libass)
c\{o}m <-- should continue to be displayed as "c\m" (currently "c{o}m" in
Libass)
c\{o\}m <-- only this should display as "c{o}m" (currently "c\m" in VSFilter)
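A rough Python sketch of what this paired-escape rule could look like, assuming a single left-to-right pass is acceptable. The names are hypothetical and this is not how either renderer is implemented; it only reproduces the four cases listed above.

```python
import re

# Hypothetical sketch of the proposed rule.
# First alternative: a paired \{ ... \} escape. Second: an ordinary {...} block.
_PAIRED_OR_BLOCK = re.compile(r"\\\{(.*?)\\\}|\{[^}]*\}")

def render_with_paired_escapes(line: str) -> str:
    def repl(m):
        if m.group(1) is not None:
            # \{ was followed by \}: show the braces and what is between them.
            return "{" + m.group(1) + "}"
        # Plain {...} block: eliminated, as in VSFilter today.
        return ""
    return _PAIRED_OR_BLOCK.sub(repl, line)

for sample in ["\\}", "\\{", "c\\{o}m", "c\\{o\\}m"]:
    print(repr(sample), "->", repr(render_with_paired_escapes(sample)))
```

Nested or overlapping cases (for example a plain {...} block inside a \{ ... \} pair) are not handled here and would need a real decision.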
Thoughts? This would seem to be the only way to minimize potential breakage,
while allowing normal opening/closing {} brackets to be displayed together.
Though I would need to check with YuZhuoHuang if it's even practical to parse
scripts in this fashion, since I'm unfamiliar with how VSFilter currently
handles elimination of brackets.
Original comment by cyber.sp...@gmail.com
on 22 Jun 2013 at 11:53
Uh no, that's way too complicated, and doesn't strictly reduce breakage.
libass probably has this because some of its users want to render any text with
it, even if it doesn't originate from an ASS script, and { } were the only
things that couldn't be properly escaped. (The original use case was probably
converting and rendering other subtitle formats with libass? I can only
speculate.)
Then here are some alternative proposals:
1. Add special cased ASS tags to generate { and }. This could be as simple as
interpreting \{ _inside_ style overrides. For example, {\{} would produce {.
2. Add ASS tags to generate arbitrary unicode characters, like {\unicode1234}
produces U+1234. Any character can be generated by this, including { and }. (I
don't like this so much because it requires extra code to convert the code to
UTF-8 and UTF-16, and it's not clear how characters outside of the BMP (i.e.
U+10000 and beyond) should behave... what if users start to encode surrogate
pairs as separate \unicode commands?)
3. Add an ASS tag that includes quoted text literally. For example, the tag
could be named \lit, and {\lit"{}"} would generate {}.
4. Alternative form of \lit: include a byte length of the text that should be
reproduced literally. For example, {\lit1:}} would generate }. I like this
best, because it makes programmatically escaping text trivial and efficient. On
the other hand, byte lengths are hard to explain to users, and if the script is
recoded to another charset, it will break hard. (IMO ASS scripts should always
be in UTF-8, but I have seen scripts in UTF-16 too.)
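To illustrate why proposal 4 makes programmatic escaping trivial, here is a minimal Python sketch. It assumes the hypothetical {\litN:text} form from the example above, with N counted in UTF-8 bytes; \lit is only a proposal in this thread and no renderer implements it.

```python
def lit_escape(text: str) -> str:
    """Wrap arbitrary text in the proposed {\\litN:...} tag (hypothetical syntax).

    N is the byte length of the text when encoded as UTF-8, so the text
    itself never has to be inspected or modified.
    """
    n = len(text.encode("utf-8"))
    return "{\\lit%d:%s}" % (n, text)

print(lit_escape("}"))    # the single-brace case from the proposal
print(lit_escape("{}"))
```

The downside mentioned above shows up directly: the count depends on the encoding, so lit_escape("é") yields a length of 2 in UTF-8 and would be wrong after recoding the script to another charset.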
Original comment by nfxjfg@googlemail.com
on 23 Jun 2013 at 12:26
> 1. Add special cased ASS tags to generate { and }. This could be as simple as
> interpreting \{ _inside_ style overrides. For example, {\{} would produce {.
This sounds reasonable, but I would suggest using a text tag of some kind to
create these curly brackets. {\{} and {\}} would likely cause havoc for
existing parsers.
Something like {\fbl} {\fbr} or {\fb1} {\fb0} for "font brace left (open)" &
"font brace right (close)" respectively. This would minimize breakage when
encountered by an existing parser, and limit purpose of such a tag to
displaying curly brackets only.
> 2. Add ASS tags to generate arbitrary unicode characters
> ...I don't like this so much
I'm not fond of this idea either. It seems overkill just to support curly
brackets, when every other unicode character could be displayed as-is with
proper file encoding.
> 3. Add an ASS tag that includes quoted text literally.
> For example, the tag could be named \lit, and {\lit"{}"} would generate {}.
What usefulness would this have outside of displaying lines with curly
brackets? Placing random strings under a \lit override would cause existing
parsers to not display these lines remotely correctly.
> 4. Alternative form of \lit: include a byte length of the text that should be
> reproduced literally.
Same objections as 3).
Original comment by cyber.sp...@gmail.com
on 23 Jun 2013 at 2:30
Overall I believe {\fb1}text{\fb0} would make most sense, functioning as
required open & close tags to display text within curly brackets.
Original comment by cyber.sp...@gmail.com
on 23 Jun 2013 at 2:55
>> 3. Add an ASS tag that includes quoted text literally.
>> For example, the tag could be named \lit, and {\lit"{}"} would generate {}.
>
>What usefulness would this have outside of displaying lines with curly
>brackets? Placing random strings under a \lit override would cause existing
>parsers to not display these lines remotely correctly.
>
>
>> 4. Alternative form of \lit: include a byte length of the text that should
>> be reproduced literally.
>
>Same objections as 3).
It would be nice to have a simple way to escape text. In fact, the first
proposal is missing \, which from what I know can only be produced with a
trick: emit \ followed by a zero-width word joiner (U+2060) special character.
So if we change something anyway, why not improve the situation in general? 4.
would be the simplest solution for code which has to escape ASS. Actually, let
me suggest 5:
5. Like 4, but with a hack to improve backward compatibility. For example
{\lit4}{}\N would produce {}\N. After the closing } following a \litN tag
(where N is an integer), N bytes are copied literally without parsing tags or
escapes. But I don't really see any improvements over 4, other than
questionable backwards compatibility that "sometimes" works.
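A small Python sketch of how a parser might handle variant 5, working on raw bytes because N is a byte count. The {\litN} syntax is only the hypothetical form suggested here, and the function and segment names are invented for the example.

```python
import re

# Hypothetical {\litN} tag from suggestion 5; matched on bytes since N counts bytes.
_LIT_TAG = re.compile(rb"\{\\lit(\d+)\}")

def split_literals(raw: bytes):
    """Split an event text into ("parsed", ...) and ("literal", ...) segments."""
    segments = []
    pos = 0
    while True:
        m = _LIT_TAG.search(raw, pos)
        if m is None:
            break
        if m.start() > pos:
            segments.append(("parsed", raw[pos:m.start()]))
        n = int(m.group(1))
        # The N bytes after the closing } are copied without tag or escape parsing.
        segments.append(("literal", raw[m.end():m.end() + n]))
        pos = m.end() + n
    if pos < len(raw):
        segments.append(("parsed", raw[pos:]))
    return segments

print(split_literals(b"{\\lit4}{}\\Nrest"))
```

An old renderer would treat {\lit4} as an unknown tag and then parse the following bytes as usual, which is the "sometimes works" compatibility mentioned above.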
While suggestions 2 and 3 were somewhat "fancy" (with the intention to propose
something elegant, but which also happens to solve the problem), 4 is quite
straightforward and trivial to implement.
>Overall I believe {\fb1}text{\fb0} would make most sense
Fine by me. But why the parameter 1/0? This is normally used to enable a style
or a mode, which is not the case here. They just emit { and }, and they won't
necessarily be matched. Similarly, {\fb1\fb1} emits two characters (again,
because \fb is not enabling/disabling state).
As I said above, I want to request another tag that emits \ if we go with this
solution.
So I would suggest:
\fbl produces {
\fbr produces }
\backslash produces \
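With these three tags, escaping arbitrary text for an ASS event would be a plain character substitution. A minimal Python sketch, assuming the tag names suggested above (they are proposals from this thread, not existing ASS tags):

```python
# Hypothetical tag names from this proposal; no renderer supports them today.
_ESCAPES = {
    "{": "{\\fbl}",
    "}": "{\\fbr}",
    "\\": "{\\backslash}",
}

def escape_ass_text(text: str) -> str:
    """Replace each special character with its own override block."""
    return "".join(_ESCAPES.get(ch, ch) for ch in text)

print(escape_ass_text("a{b}\\N"))
```

Each special character becomes its own override block, so \N in the input ends up as a literal backslash followed by N rather than a line break.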
Original comment by nfxjfg@googlemail.com
on 23 Jun 2013 at 3:31
So are you ok with \fbl \fbr \backslash?
Original comment by nfxjfg@googlemail.com
on 28 Jun 2013 at 4:39
> Fine by me. But why the parameter 1/0? This is normally used to enable a
> style or a mode, which is not the case here. They just emit { and }, and they
> won't necessarily be matched.
Well my original thought was for it to essentially act as a style override for
a text segment, with enable/disable states, but thinking about it a bit more, I
now see that there would be a potential need for tags which display these
separately, in order to ensure scripts are otherwise parsed correctly.
> \fbl produces {
> \fbr produces }
That said, this will probably be fine.
> \backslash
At first I wasn't so sure about this, but I guess it would be needed as well if
someone actually wanted to display \N or \n without a line break. Though it
should probably be shortened somehow, maybe \bksl (backslash) or \fbksl (font
backslash) to keep with the current short naming scheme for tags, unless
someone has a better naming idea.
Any changes such as this will need to be approved by an Aegisub dev as well
before a final decision is made. Go ahead and contact Plorkyeran
(tgoyne) or jfs (jiifurusu) and point one of them towards this issue to express
any comments, suggestions, or objections they may have.
Original comment by cyber.sp...@gmail.com
on 28 Jun 2013 at 11:59
Well, some html tags, e.g. , can not be displayed either...
Original comment by YuZhuoHu...@gmail.com
on 3 Jul 2013 at 12:38
Does vsfilter interpret HTML tags? libass doesn't.
Original comment by nfxjfg@googlemail.com
on 3 Jul 2013 at 1:10
[deleted comment]
> Does vsfilter interpret HTML tags? libass doesn't.
I remember a subtitle format uses XML or something similar. VSFilter supports
it. And (un)fortunately, VSFilter uses the same code to parse the dialogue
content for almost all text-based subtitles. So those tags can be used in ASS.
You should be able to use ASS tags in SRT too.
And there's a trick to display :"<{}b>"
similar to this "display \" trick:
> emit \ followed by a zero-width word joiner (U+2060) special character.
I'm not encouraging it. But if special characters are only {, } and \, this
would be fine enough
> \backslash produces \
> \fbl produces {
> \fbr produces } (as long as we have \backslash and \fbl, \fbr is not a
requirement?)
Original comment by YuZhuoHu...@gmail.com
on 3 Jul 2013 at 2:08
>as long as we have \backslash and \fbl, \fbr is not a requirement?
Not strictly. So if you want we can omit this.
Original comment by nfxjfg@googlemail.com
on 3 Jul 2013 at 4:01
I'd prefer to have \fbr for consistency despite it not being strictly required.
I like that {\fbl} degrades well with older renderers, but introducing
printable characters to override blocks would cause a lot of issues. What
happens if \fbl is in an override block that also contains other tags? If this
isn't allowed then everything currently existing which collapses adjacent
override blocks is broken, and the process for doing that gets a lot more
complex. If it is allowed, then manipulating the contents of override blocks
gets a lot more complex (currently you can eliminate duplicate tags in a block
and rearrange them pretty willy-nilly, which a lot of code takes advantage of
(not just in Aegisub, but also in scripts other people have written)).
One option would be to use a character other than \ to begin it and require
that it be in a block with no \ characters (i.e. a comment block). I suspect
this would turn out to be more complex to support than something like \U007B
outside of blocks.
> (IMO ASS scripts should always be in UTF-8, but I have seen scripts in UTF-16
> too.)
I've been told that Chinese software not designed with i18n in mind tends to be
UTF-16-only since it's actually easier to deal with on Windows than GBK.
Original comment by tgoyne
on 16 Jul 2013 at 10:47
So would you be more comfortable with the \U007B syntax (used outside of style
overrides)? That would be ok with me. The sequence \U alone should be extremely
rare, so there should be no compatibility concerns.
We have to make sure that nobody thinks you're supposed to split \U into
surrogate pairs, though.
Original comment by nfxjfg@googlemail.com
on 17 Jul 2013 at 3:52
I suppose so. It's overly general, but should be fairly trivial to support and
is a better fit for how ASS already works than the other suggestions.
Most effective way to avoid people trying to encode surrogate pairs would
probably be to have VSFilter explicitly drop/mangle/whatever codepoints
U+D800-U+DFFF to avoid having surrogate pairs work by coincidence.
Original comment by tgoyne
on 17 Jul 2013 at 4:59
How should codepoints beyond the BMP be encoded? The syntax should be
unambiguous.
I suggest: \U123B0\
Or in other words, terminate it with a second \ .
Original comment by nfxjfg@googlemail.com
on 17 Jul 2013 at 5:21
\u for 4 digits and \U for 8 digits is something I'm mildly a fan of, but
ultimately I don't care too much what the syntax is as long as it's
unambiguous. Terminating \ looks a little funny, but would be fine with me.
Original comment by tgoyne
on 17 Jul 2013 at 8:25
Whatever is done, it should:
A) Be ignored by current and past versions of VSFilter & Libass script parsers.
B) Only enable the display of control characters which are unable to be
displayed currently in plaintext.
Am I misunderstanding something about this recent suggestion?
Original comment by cyber.sp...@gmail.com
on 17 Jul 2013 at 10:26
Then I suggest that \u is interpreted in normal text. This is done only in the
text body of subtitle events, and not anywhere else in the ASS file structure.
Further, they are not interpreted inside of comment/tag sections (i.e. inside of
{...}). Behavior inside of drawing mode is unspecified.
\uXXXX
X is a hex digit, i.e. one character of 0-9, a-f, A-F. End-of-string (as well
as embedded NULs) behaves as if non-hex-digits are encountered. The number
formed by XXXX represents the unicode codepoint of the character the subtitle
parser is supposed to output in place of the \u escape. If one of the X
characters is not a hex digit, the parser must not interpret the escape, but
pass it through to the renderer without changing it.
Examples: (input -> what is rendered)
"abc\u00512345" -> "abcQ2345" (U+0051, which is Q)
"abc\u51xyz" -> "abc\u51xyz" (not interpreted because "51xy" contains non-hex
characters)
\UXXXXXX
X is defined as above. This represents a unicode codepoint. The allowed range
goes from U+0001 to (including) U+10FFFD, excluding invalid/reserved
codepoints. (The only higher code points, U+10FFFE and U+10FFFF, are
noncharacters.) Like the short variant
\u, \U must not be interpreted if any of the 6 X characters is not a hex digit.
Examples:
"abc\U01039Cde" -> "abc𐎜de" (the codepoint is U+1039C UGARITIC LETTER U)
"\abc\U0051" -> "\abc\U0051" (not interpreted because there are invalid/not
enough characters)
If an \u or \U escape encodes invalid or reserved codepoints, the
implementation should reject these and should not interpret the escape. This
includes U+0000, and all codepoints defined as invalid or reserved by the
Unicode standard (such as anything past U+10FFFD).
In particular, it is NOT allowed to encode UTF-16 surrogate pairs using this
scheme (like JSON actually does, curse it). The codepoints U+D800 to
(including) U+DFFF must be rejected by any implementation to avoid that anyone
tries to encode surrogate pairs as escape sequences. (Note that the codepoints
used for surrogates are defined as reserved by Unicode.)
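To make the rules concrete, here is a minimal Python sketch of a decoder for the text body of an event. It assumes exactly 4 hex digits after \u and exactly 6 after \U, leaves {...} blocks untouched, and rejects U+0000, surrogates, noncharacters and anything past U+10FFFD. It does not check for unassigned codepoints (that would need a Unicode database), and the function names are made up for the example.

```python
import re

_TOKEN = re.compile(
    r"(\{[^}]*\})"            # comment/tag block: passed through untouched
    r"|\\u([0-9A-Fa-f]{4})"   # short form, exactly 4 hex digits
    r"|\\U([0-9A-Fa-f]{6})"   # long form, exactly 6 hex digits
)

def _is_acceptable(cp: int) -> bool:
    """Reject U+0000, surrogates, noncharacters and anything past U+10FFFD."""
    if cp == 0 or cp > 0x10FFFD:
        return False
    if 0xD800 <= cp <= 0xDFFF:
        return False
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return False
    return True

def decode_escapes(text: str) -> str:
    def repl(m):
        if m.group(1) is not None:
            return m.group(0)          # inside {...}: not interpreted
        cp = int(m.group(2) or m.group(3), 16)
        # Rejected codepoints leave the escape in the output unchanged.
        return chr(cp) if _is_acceptable(cp) else m.group(0)
    return _TOKEN.sub(repl, text)

print(decode_escapes("\\u007Bnot a tag\\u007D"))
print(decode_escapes("\\uD83D\\uDE00"))   # surrogate pair attempt, left as-is
```

Because the output of one escape is never rescanned, a decoded { cannot start an override block, which is the whole point of the exercise.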
I tried to be as precise as possible.
> A) Ignored by current and past versions of VSFilter & Libass script parsers.
You mean no output at all? I don't see the advantage. They will not interpret
the escape and display it as text, and that's just as broken as not displaying
anything at all. It's not the same as override tags, where you _can_ get
degraded but useful display of the text.
> B) Only enable the display of control characters which are unable to be
> displayed currently in plaintext.
I don't see any advantage here either. It'd be easier for an implementation,
though.
Original comment by nfxjfg@googlemail.com
on 18 Jul 2013 at 4:44
Original issue reported on code.google.com by
nfxjfg@googlemail.com
on 22 Jun 2013 at 8:51