PreparedStatement setBytes()

danap commented 8 years ago

Hello,

Somewhere between Release 3.8.10.2 and current code the PreparedStatement setBytes() has been changed that results in what appears to be twice the byte size being stored in the database. I'm not sure if getBytes() is the issue. This results in corrupt BLOB type being stored.

A cursory look appears it maybe around the changes in the Kaufmann, (character > U+FFFF) commit with the JDBC3PreparedStatement. I suppose it might also have to do with the native library. Currently I'm using the Linux/64bit one. I have not tested on Windows or 32 bit machines.

danap.

xerial commented 8 years ago

Thanks for reporting. This sounds serious. Could you give me a sample query (and code) to reproduce this issue? Thanks.

danap commented 8 years ago

Hello,

The test case code I have created to provide is not duplicating the behavior like the application code. The only difference that i can determine at this time is the temporary HashMap that is used to store the bytes for the user before election to dump to a file.

Without changing anything other then the build jars, 3.8.10.1 to the current exempts the behavior described above. I will not be able to investigate further for a day or two. It may be a problem with the current code jar build created on my end. The environment is JRE 7.

I would not classify this as a BUG at this time without further work on my end.

danap.

danap commented 8 years ago

Hello,

After further analysis, I have determined this is not related to the PreparedStatements setBytes(). Rather in seems the ResultSet getBytes() retrieval.

I have created a test case that demonstrates the issue. I do not believe this would be defined as a BUG, but a new inconsistency introduced after build 3.8.10.1/2 in retrieving the field data more then once using getString() then getBytes().

The application uses what has been a general safe getString() to determine if the content is NULL. If not normally processing takes place with reading a second time the content of the field with getBytes(). In so doing the returned bytes are 2X the size of the stored BLOB.

I have ran into a similar issue with Oracle, HyperSQL, and MSAccess. MSAccess only allows ResultSets to be read once, then they become null I believe. None of the other databases my application supports restrict or deform the data in retrieving a ResultSet more then once.

The application has been updated to insure this is no longer an issue, but I would say in the sqlite-jdbc it could snare other users who may also read ResultSet more then once. I learned long ago what you don't think the user will do in, will often be done.

Test Case: http://dandymadeproductions.com/temp/ File: SQLite_JDBC.java

Thank you for your work with the sqlite-jdbc driver. By the way the pull request I had opened for Date, Time, Literals, #33 can be closed. The current code is no longer causing an issue, the Date changes must have fixed the problem I was having with my application.

danap. Dana M. Proctor MyJSQLView Project Manager.

mkauf commented 8 years ago

I agree that this is a change in behavior of sqlite-jdbc. Sorry about that. But keep in mind that JDBC drivers don't need to support getBytes() for strings at all. Look at the table B-6 in the JDBC 4.1 specification ("Use of ResultSet getter Methods to Retrieve JDBC Data Types")

Behavior when reading a string before commit a4cf82d:

getBytes(); => the result is in the database's native encoding (see "PRAGMA encoding")
getString(); getBytes(); => always UTF-8

Behavior when reading a string after commit a4cf82d:

getBytes(); => the result is in the database's native encoding (see "PRAGMA encoding")
getString(); getBytes(); => always UTF-16

Maybe it would make sense to always return UTF-8 when getBytes() is called on a string?

getBytes(); => UTF-8
getString(); getBytes(); => UTF-8

I have created a pull request for this.

danap commented 8 years ago

Hello Michael,

The test case was only using strings to simplify the code to demonstrate the behavior. As I said the code for my application only uses getString() to perform a preliminary test for NULL content on all data types. I use getBytes() for BLOB content, and getString(); getBytes(); => UTF-8 would just again produce malformed data when working with Binary content, images and such.

The issue I think is the ResultSet read, first with getString() then getBytes() does not seem to be resetting the storage container to reflect the last call getBytes()?

In each of the databases my application supports, other then the ones mentioned, for instance PostgreSQL I can before the first call for testing a NULL entry with getString() and if not NULL then use the correct data type call, BLOB for instance getBytes() and not have malformed data returned. This is a non-issue for me since I have modified the code to just read the BLOB data type content once, which it really should have been doing to begin with for efficiency.

I would not recommend the new pull request => getString(); getBytes(); => UTF-8

danap.

mkauf commented 8 years ago

getString(); getBytes(); => UTF-8 would just again produce malformed data when working with Binary content, images and such

In my proposed pull request, the conversion to UTF-8 is done only for TEXT. Binary data (BLOB) is never modified. SQLite does not perform any conversions that would produce malformed data. See https://sqlite.org/c3ref/column_blob.html

The question is: What character encoding shall getBytes() use for TEXT? I propose to always use UTF-8, because that's what your original code expected. It's also the native format for databases created by sqlite-jdbc.

danap commented 8 years ago

Hello Michael,

In my proposed pull request, the conversion to UTF-8 is done only for TEXT. Binary data (BLOB) is never modified. SQLite does not perform any conversions that would produce malformed data. See https://sqlite.org/c3ref/column_blob.html

I'm sure it does not since I have already demonstrated with the test case that the data stored is unmodified. Its the retrieval of the data as defined by the test case provided.

In the test case I provided there is a method testBlobs() with a commented call to writeRead_BinaryFile(). It calls for the selection of a file for import to fill the BLOB field defined by the database. Run it, choose an image for data and you will see the same result, with a call to getString() then getBytes(), returns the same failed results. By the way a call to getObject() then getBytes() does not produce a failure.

As far as BLOB fields I have no idea what a user may wish to place in them in a database. They may arbitrarily just place UTF-8 characters instead of data outside that range. All I know is when an inital call to getString() then getBytes(), returns data that does not seem to match the stored content.

As indicated, somewhere after sqlite-jdbc 3.8.10.1/2 this behavior turned up. It matters not to me at this point since it no longer effects my application, MyJSQLView, because I now only do a single getBytes() call for a defined BLOB field. Works fine! I find it to be nothing more then an oddity to be dealt with like my other supported databases and their JDBC drivers.

The question is: What character encoding shall getBytes() use for TEXT? I propose to always use UTF-8, because that's what your original code expected. It's also the native format for databases created by sqlite-jdbc.

Then that is the answer to your question for a TEXT defined field.

danap.

mkauf commented 8 years ago

Hi danap,

thank you for the example program and the explanations. I have been able to reproduce the problem. "getString(); getBytes();" on a BLOB indeed changes the blob's data.

I will try to create a bugfix.

Regards, Michael

danap commented 8 years ago

Hi danap,

thank you for the example program and the explanations. I have been able to reproduce the problem. "getString(); getBytes();" on a BLOB indeed changes the blob's data.

I will try to create a bugfix.

Regards, Michael

Hello Michael,

I started thinking overnight about this issue. I did look at the sqlite-jdbc code at one point when I had an issue with temporal values.

It appeared to me that the driver does a test on content by a typeof to determine the type. I realize that even though a table may define a column to be BLOB anything can go in the column, so the driver needs to figure that out before processing. This is the issue I never liked about WEAK TYPE scripting languages and in this case SQLite data storage. It provides a very versatile ability to store whatever you want in the 'VARIABLE', but processing can be hazards.

I briefly looked to see if my application could try to do a typeof, but did not think it was possible or feasible. I rely on the defined Database Meta Data TYPE & CLASS results. So if a user defines the table TYPE to be BLOB I have to assume that is the type they will store there, so thereby use getBtyes(). The use of a TEST, generic getString(), for NULL has been safe in most of the database to not throw an error for any column, but poor choice. Since it seems the driver in this case thinks I may wish to get TEXT for the column and that is what stored there? So appears the issue I guess.

I really think the current code is acceptable, from my limited knowledge since as I have said seen this type of behavior in a couple of other databases. I now use a single read of getBytes() test for NULL then process accordingly.

The only possible issue that I can point out is, after reviewing the SQLite docs this morning, TEXT can be stored as "(UTF-8, UTF-16BE or UTF-16LE)" Java uses UTF-16 for strings, TEXT, and therefore if a column TYPE is defined as BLOB and the user puts TEXT there then it may not be limited to UTF-8 only.

Dana M. Proctor MyJSQLView Project Manager

xerial / sqlite-jdbc

PreparedStatement setBytes() #78