zillow / ctds

Python DB-API 2.0 library for MS SQL Server
MIT License
83 stars 12 forks source link

VARCHAR and NVARCHAR string encoding #67

Open kafran opened 4 years ago

kafran commented 4 years ago

Why does ctds.SqlVarChar receives byte strings while ctds.SqlNVarChar receives just strings? Shouldn't this be the other way around, with ctds.SqlNVarChar passing byte strings unchanged to the database, once the NVarChar is the encoded type on the database?

I always get confuse about encoding strings when bulk inserting.

joshuahlang commented 4 years ago

ctds converts all byte Python objects to a SQL BINARY type. The ctds.SqlVarChar wrapper allows the client to explicitly specify a SQL VARCHAR type to ctds. Note that in Python 2, the str type is really more like bytes. Likewise, for Python 3 str types (and Python 2 unicode), ctds infers a SQL NVARCHAR type. This is done for usability and relies on the fact that SQL Server will convert the NVARCHAR type to VARCHAR as needed on the server.

For bulk_insert, FreeTDS doesn't do any encoding of the column data on the client side and sends the raw column data to the server for insertion. ctds encodes all string data to utf-8 prior to passing it off to FreeTDS unless the underlying Python object is already a bytes-like object. Hence this section regarding bulk_insert and strings.

kafran commented 4 years ago

@joshuahlang Thanks. I think I better understand now.

kafran commented 4 years ago

@joshuahlang I think VARCHAR and NVARCHAR names are really switched. This is causing me confusion. Am I wrong?

For example, in one script I have these two functions to convert strings to ctds types before bulk inserting. The problem is, the columns which are nvarchar at the database table (where collation is Latin1_General_CI_AS) I need to wrapped it on ctds.SqlVarChar and the columns which are varchar at the database I can wrap it to ctds.SqlNVarChar.

Don't you think this weird? If I change and use ctds.SqlNVarChar to wrap values to NVARCHAR columns (collation Latin1_General_CI_AS) the data gets encoded wrong on the database. I think ctds.SqlVarChar should be name ctds.SqlNVarChar and vice-versa.

def _to_nvarchar(txt: str) -> ctds.VARCHAR:
    if txt == "null":
        return None
    return ctds.SqlVarChar(txt.encode("utf-16le"))

def _to_varchar(txt: str) -> ctds.VARCHAR:
    if txt == "null":
        return None
    return ctds.SqlNVarChar(txt)
joshuahlang commented 4 years ago

When passing in a python str or ctds.SqlNVarChar type, ctds will encode the value as UTF-8 and pass the raw bytes along with the SQL type of NVARCHAR to freetds. This works as expected typically, and freetds gets the data to SQL Server as expected. This, however, is not the case for the bulk_* extensions that FreeTDS provides. I don't remember the exact details (it was a while ago now), but the bulk_* APIs in freetds would always pass the raw encoded string bytes (UTF-8) to SQL server. As I recall, this was incorrect since the multi-byte characters should actually be UTF-16 encoded on the wire (per TDS protocol). I assumed this was just another bug in the bulk_* FreeTDS APIS. I vaguely remember builk update not supporting the NVARCHAR type, but I'd have to go back and look again.

Anyway, the current ctds behavior of passing encoded bytes to the ctdsSqlVarChar type is simply to allow the client to encode the data properly (as expected by SQL Server), and still specify the desired type of VARCHAR. Without wrapping the encode bytes in ctds.SqlVarChar, ctds would infer a BINARY type for that parameter.

kafran commented 4 years ago

I think the ctdsNvarchar should be the one supporting encode, since on the database side the Nvarchar columns are the ones which supports Unicode.