Closed GoogleCodeExporter closed 8 years ago
Issue 16 has been merged into this issue.
Original comment by chrisisbeef
on 2 Dec 2009 at 7:53
Issue 21 has been merged into this issue.
Original comment by chrisisbeef
on 2 Dec 2009 at 7:54
Introduction:
-------------
In dealing with the codec's internals I started thinking of ways that
the architecture might be improved. I was going to wait until after 2.0
but Jim encouraged me to do it now. My suggestions/proposal here is for
a advanced user API and a SPI. As such it doesn't affect the current
Encoder interface that has received a fair bit of attention of late and
also needs help.
I've probably gone overboard on this but I haven't figured out a way to
make it simpler without having the functionality I would like to see. Any
suggestions for simplification without reduction of functionality would
be appreciated (especially when it comes to input needing pushback).
It's also worth mentioning that I'm willing to head up implementing
this and that I certainly am not proposing something I want others
to implement;)
Goals:
------
Simple to use codecs and implement codecs yet flexible enough to allow
easy expansion and custom features.
Bases:
------
There are advantages and disadvantages to interfaces vs. abstract classes
as the base. 1.4 uses a interface and 2.0 currently uses a abstract
class. I tend to like the in between of a interface and a abstract base
class that implements it.
One thing that I think should be done is to split up the codec
interface so that a encoder can be implemented without requiring a
decoder. Additionally, the encode and decode do not need to separate
interfaces either as both convert from one type to another. In these
thoughts I refer to one as a Codec that codes but other names may be
preferred (Transformer that transforms).
Such combination leads to four Codec types:
Bytes to Bytes
Characters to Characters
html, css and js encode and decode
Characters to Bytes
hex and base64 decode
base64 decode
Bytes to Characters
hex and base64 encode
Technically Bytes to Bytes is only really needed but not providing it
and requiring dealing with character encodings at each steep wouldn't
be a good idea.
Additionally, there are two general methods of usage for codecs:
buffer to buffer
stream/reader to stream/writer
Currently ESAPI only supports the first one. Either can be implemented
via the other (eg: in a abstract base class). Implementing streaming
with buffers in a generic way is challenging with out buffering the
whole stream into memory (where do you chunk the input). My suggestions
below tend toward the latter method with base methods that implement
the former. This may seem an excessive amount of object wrapping but
it should be pointed out that the compiler itself wraps most string
concatenation in StringBuilder these days.
Methods of the streaming type would read from the input and preform
coding to the output until EOF is received on the input.
API vs. SPI:
------------
Making one public API interface with a public SPI seems a logical choice
here. Some of my suggestions on interfaces below are based on JAXP which
is similar. This also allows simpler SPI interfaces.
Byte Streaming Ins and Outs:
----------------------------
InputStream and OutputStream are the obvious choices here.
Char Streaming Ins and Outs:
----------------------------
On the input there isn't much of a choice: Reader.
Output could be Writer but I'm favoring Appendable as you can plug
in both a Writer or a StringBuilder/Buffer. The latter is really
convenient except for the nasty IOExceptions. The API could include
specific StringBuilder/Buffer methods that wrap the IOException in a
IllegalStateException as they never throw the IOException anyway (this
however would not work if we choose to throw encoding/decoding exceptions
as a subclass of IOException).
Byte Buffer Ins and Outs:
-------------------------
There isn't much to choose here except byte[] which is sufficient. I
think though that offset and length should always be available to as this
prevents copying buffers around all over the place. Methods that take
byte[] without offset and length should be provided in the API though.
Output to a byte[] is something I think I would pass on for now as
this gets complicated. Not only would the API have to return the amount
actually written to the output but it would also need to be able to handle
the case where there is more input than will fit in the output. This
may be something worth adding in the future if the need presents itself.
Char Buffer Ins and Outs:
-------------------------
Clearly input and output of Strings needs to be supported. Is it worth
including offset and length on input here as well? As String is immutable
so it must be returned so offset and length aren't really feasible.
char[] may also be worth supporting too though conversion back and forth
between char[] and String is fairly easy.
One recurring issue is whether to base methods on char, Character or
something else. 1.4 uses char. 2.0 currently uses Character. I suggest
using int. char is simple and fast. Character is faster in 1.5+ than it
was and makes Set's and Map's possible for implementations.
Both char and Character have the drawback though that they can't handle
unicode code points above 0xFFFF making full internationalization very
difficult (encode(char ch...) isn't possible as you may need two chars
for the full code point). This is one of the frustrating effects of the
unicode spec changing after Java adopted it.
Exceptions:
-----------
What should be done if a input cannot be encoded/decoded into an
output. This is probably more obviously a issue with decoding (xml &
with no ';') than encoding but it is still a problem with encoding to (a
null character cannot be encoded for CSS as the CSS spec clearly states
that null characters are not supported).
Input of characters also has the possibility of issues with surrogate
pairs. There is nothing that prevents surrogate pair values from being
in invalid sequences in a String. Similarly, should invalid/reserved
code points throw exceptions as well?
How to report such issues is another question. My suggestion is to throw
runtime exceptions here.
Lookahead:
----------
One of the short comings of the current codec implementation is that
there is only one character lookahead. While this is sufficient for
most of the codecs, it is not for HTML. Only recently was it discovered
that the HTML codec couldn't handle entities that started with the same
characters. The previous implementation would happily read characters
until it found a match in a Map. This caused &piv to always be treated
as &pi. When this was originally fixed, using the provided one character
lookahead seemed sufficient until theta and thetasym were found. This
required hacking around the one character lookahead.
An additional issue here is that the lookahead is provided by a internal
class making interoperability difficult. Someone parsing input with
their own lookahead system would have to deal with the using both
at the same time. My suggestion here is to just use the JRE provided
PushbackInputStream and PushbackReader. These can wrap other InputStreams
and Readers if needed and allow users of the standard classes to have
no difficulty working with ESAPI.
This does however bring another issue to the table: how big a push
back buffer is needed? Sadly the answer to this depends on the codec
itself. Here I think the codecs needing pushback could provide a method
to find out how much they need.
Lastly, is lookahead needed for encoding? I can only think of surrogate
pairs missing but that would be an error condition not a lookahead
problem.
+---+
|API|
+---+
The general API methods would be the abstract factory pattern used to
create individual codecs. Utility classes and methods are not listed
below but would include such functionality as allowing a codec to be
used in a input or output chain.
The Codec interfaces below are fairly extensive in the methods that are
provided. The idea here is to make the codecs easy to use. Most of these
would be implemented in the framework and call the far simpler SPICodecs
described afterwards.
public interface CodecConstants
{
public static final String HTML_ENCODER = "HTML Encoder";
public static final String HTML_ENCODER_FEATURE_USE_ENTITY_NAMES = "Use
Entity Names";
public static final String HTML_DECODER = "HTML Decoder";
public static final String XML_ENCODER = "XML Encoder";
// ...
}
public abstract class CodecFactory
{
public static BytesTosByteCodecFactory newByteByteCodecFactory(String name);
public static BytesTosByteCodecFactory newByteByteCodecFactory(String name,
String clsName, ClassLoader loader);
public static CharsTosCharCodecFactory newCharCharCodecFactory(String name);
public static CharsTosCharCodecFactory newCharCharCodecFactory(String name,
String clsName, ClassLoader loader);
public static CharsTosByteCodecFactory newCharByteCodecFactory(String name);
public static CharsTosByteCodecFactory newCharByteCodecFactory(String name,
String clsName, ClassLoader loader);
public static BytesTosCharCodecFactory newByteCharCodecFactory(String name);
public static BytesTosCharCodecFactory newByteCharCodecFactory(String name,
String clsName, ClassLoader loader);
public Object getAttribute(String name);
public void setAttribute(String name, Object value);
public boolean getFeature(String name);
public void setFeature(String name);
}
public abstract class BytesToBytesCodecFactory
{
public BytesToBytesCodec newInstance();
}
public abstract class CharsToCharsCodecFactory
{
public CharsToCharsCodec newInstance();
}
public abstract class BytesToCharsCodecFactory
{
public BytesToCharsCodec newInstance();
}
public abstract class CharsToBytesCodecFactory
{
public CharsToBytesCodec newInstance();
}
public interface Codec
{
public boolean isPushbackCodec();
public int getMinPushbackSize();
}
public interface BytesToBytesCodec extends Codec
{
// buff to buff
public byte[] code(byte[] in, int off, int len);
public byte[] code(byte[] in);
public byte[] code(int b);
// buff to stream
public OutputStream code(byte[] in, int off, int len, OutputStream out)
throws IOException;
public OutputStream code(byte[] in, OutputStream out) throws IOException;
public OutputStream code(int b, OutputStream out) throws IOException;
// stream to buff
public byte[] code(InputStream in) throws IOException;
public byte[] code(PushbackInputStream in) throws IOException;
}
public interface CharsToCharsCodec extends Codec
{
// buff to buff
public String code(char[] in, int off, int len);
public String code(char[] in);
public String code(CharSequence in, int off, int len);
public String code(CharSequence in);
public String code(int ch);
// buff to stream
public Appendable code(char[] in, int off, int len, Appendable out) throws
IOException;
public Appendable code(char[] in, Appendable out) throws IOException;
public Appendable code(CharSequence in, int off, int len, Appendable out)
throws IOException;
public Appendable code(CharSequence in, Appendable out) throws IOException;
public Appendable code(int ch, Appendable out) throws IOException;
public StringBuilder code(char[] in, int off, int len, StringBuilder out);
public StringBuilder code(char[] in, StringBuilder out);
public StringBuilder code(CharSequence in, int off, int len, StringBuilder
out);
public StringBuilder code(CharSequence in, StringBuilder out);
public StringBuilder code(int ch, StringBuilder out);
public StringBuffer code(char[] in, int off, int len, StringBuffer out)
public StringBuffer code(char[] in, StringBuffer out)
public StringBuffer code(CharSequence in, int off, int len, StringBuffer out)
public StringBuffer code(CharSequence in, StringBuffer out)
public StringBuffer code(int ch, StringBuffer out)
// stream to buff
public String code(PushbackReader in) throws IOException;
public String code(Reader in) throws IOException;
}
public interface BytesToCharsCodec extends Codec
{
// buff to buff
public String code(byte[] in, int off, int len);
public String code(byte[] in);
public String code(b ch);
// buff to stream
public Appendable code(byte[] in, int off, int len, Appendable out) throws
IOException;
public Appendable code(byte[] in, Appendable out) throws IOException;
public Appendable code(int b, Appendable out) throws IOException;
public StringBuilder code(byte[] in, int off, int len, StringBuilder out);
public StringBuilder code(byte[] in, StringBuilder out);
public StringBuilder code(int b, StringBuilder out);
public StringBuffer code(byte[] in, int off, int len, StringBuffer out)
public StringBuffer code(byte[] in, StringBuffer out)
public StringBuffer code(int b, StringBuffer out)
// stream to buff
public String code(PushbackInputStream in) throws IOException;
public String code(InputStream in) throws IOException;
}
public interface CharsToBytesCodec extends Codec
{
// buff to buff
public byte[] code(char[] in, int off, int len);
public byte[] code(char[] in);
public byte[] code(CharSequence in, int off, int len);
public byte[] code(CharSequence in);
public byte[] code(int ch);
// buff to stream
public OutputStream code(char[] in, int off, int len, OutputStream out)
throws IOException;
public OutputStream code(char[] in, OutputStream out) throws IOException;
public OutputStream code(CharSequence in, int off, int len, OutputStream out)
throws IOException;
public OutputStream code(CharSequence in, OutputStream out) throws
IOException;
public OutputStream code(int ch, OutputStream out) throws IOException;
// stream to buff
public byte[] code(PushbackReader in) throws IOException;
public byte[] code(Reader in) throws IOException;
// stream to stream
public OutputStream code(Reader in, OutputStream out) throws IOException;
public OutputStream code(PushbackReader in, OutputStream out) throws
IOException;
}
+----+
|SPI:|
+----+
The interfaces for the SPI follow. There are far more SPICodec interfaces
than API ones to allow implementation to be fairly simple. The richer
API interfaces are provided by the framework.
Factory classes for the codecs here are also wrapped by the API.
Base:
-----
public interface SPICodecFactory
{
public Object getAttribute(String name);
public void setAttribute(String name, Object value);
public boolean getFeature(String name);
public void setFeature(String name);
}
public interface SPICodec
{
}
public interface PushbackSPICodec extends Codec
{
public int getMinPushbackSize();
}
Chars To Chars:
---------------
public interface PushbackCharsToCharsSPICodecFactory extends SPICodecFactory
{
public PushbackCharsToCharsSPICodec newInstance();
}
public interface PushbackCharsToCharsSPICodec extends PushbackSPICodec
{
public Appendable tansform(PushbackReader in, Appendable out) throws
IOException;
}
public interface CharsToCharsSPICodecFactory extends SPICodecFactory
{
public CharsToCharsSPICodec newInstance();
}
public interface CharsToCharsSPICodec extends SPICodec
{
public Appendable tansform(Reader in, Appendable out) throws IOException;
}
public interface CharToCharsSPICodecFactory extends SPICodecFactory
{
public CharToCharsSPICodec newInstance();
}
public interface CharToCharsSPICodec extends SPICodec
{
public Appendable tansform(int ch, Appendable out) throws IOException;
}
Bytes To Bytes:
---------------
public interface PushbackBytesToBytesSPICodecFactory extends SPICodecFactory
{
public PushbackBytesToBytesSPICodec newInstance();
}
public interface PushbackBytesToBytesSPICodec extends SPICodec
{
public OutputStream tansform(PushbackInputStream in, OutputStream out) throws
IOException;
}
public interface BytesToBytesSPICodecFactory extends SPICodecFactory
{
public BytesToBytesSPICodec newInstance();
}
public interface BytesToBytesSPICodec extends SPICodec
{
public OutputStream tansform(InputStream in, OutputStream out) throws
IOException;
}
public interface ByteToBytesSPICodecFactory extends SPICodecFactory
{
public ByteToBytesSPICodec newInstance();
}
public interface ByteToBytesSPICodec extends SPICodec
{
public OutputStream tansform(int b, OutputStream out) throws IOException;
}
Bytes To Chars:
---------------
public interface PushbackBytesToCharsSPICodecFactory extends SPICodecFactory
{
public PushbackBytesToCharsSPICodec newInstance();
}
public interface PushbackBytesToCharsSPICodec extends PushbackSPICodec
{
public Appendable tansform(PushbackInputStream in, Appendable out) throws
IOException;
}
public interface BytesToCharsSPICodecFactory extends SPICodecFactory
{
public BytesToCharsSPICodec newInstance();
}
public interface BytesToCharsSPICodec extends SPICodec
{
public Appendable tansform(InputStream in, Appendable out) throws
IOException;
}
public interface ByteToCharsSPICodecFactory extends SPICodecFactory
{
public ByteToCharsSPICodec newInstance();
}
public interface ByteToCharsSPICodec extends SPICodec
{
public Appendable tansform(int b, Appendable out) throws IOException;
}
Chars To Bytes:
---------------
public interface PushbackCharsToBytesSPICodecFactory extends SPICodecFactory
{
public PushbackCharsToBytesSPICodec newInstance();
}
public interface PushbackCharsToBytesSPICodec extends PushbackSPICodec
{
public OutputStream tansform(PushbackReader in, OutputStream out) throws
IOException;
}
public interface CharsToBytesSPICodecFactory extends SPICodecFactory
{
public CharsToBytesSPICodec newInstance();
}
public interface CharsToBytesSPICodec extends SPICodec
{
public OutputStream tansform(Reader in, OutputStream out) throws IOException;
}
public interface CharToBytesSPICodecFactory extends SPICodecFactory
{
public CharToBytesSPICodec newInstance();
}
public interface CharToBytesToCharsSPICodec extends SPICodec
{
public OutputStream tansform(int ch, OutputStream out) throws IOException;
}
Original comment by manico.james@gmail.com
on 19 Jan 2010 at 12:27
Original comment by chrisisbeef
on 20 Nov 2010 at 9:53
Original comment by manico.james@gmail.com
on 29 May 2012 at 3:19
Hi,
The bug I can reproduce seems quite related to this issue, so instead of
creating a new bug report I first start by commenting here. Tell me if I should
submit a new bug report instead.
After doing some small test with ESAPI 2.0.1, it seems to me that the HTML
encoder does not properly handle surrogate pairs when converting string to HTML
entity.
I would expect the following behavior :
assertEquals("💩", ESAPI.encoder().encodeForHTML("\uD83D\uDCA9"));
assertEquals("💩", ESAPI.encoder().encodeForHTMLAttribute("\uD83D\uDCA9"))
But in both case the string is incorrectly encoded as follow : "��"
The javascript encoder seems OK.
I did not test other encoder such as the XML
I have attached a small JSP to check behavior inside browser.
Original comment by olivier....@gmail.com
on 28 Oct 2013 at 11:52
Attachments:
I've attached a patch based on ESAPI-surrogate-encoder[0]. Please review and
apply if deemed appropriate.
[0]: https://github.com/kei-51/ESAPI-surrogate-encoder
Original comment by fe...@oghina.com
on 1 Apr 2014 at 10:51
Attachments:
The API is changing in 3.0 and the RI will be offloaded to independant
implemntation components. This will not be applicable.
Original comment by chrisisbeef
on 18 Sep 2014 at 8:26
Original issue reported on code.google.com by
chrisisbeef
on 2 Dec 2009 at 7:51