Open kafran opened 4 years ago
ctds
converts all byte
Python objects to a SQL
BINARY
type. The ctds.SqlVarChar
wrapper allows the client to explicitly specify a SQL
VARCHAR
type to ctds
. Note that in Python 2, the str
type is really more like bytes
. Likewise, for Python 3 str
types (and Python 2 unicode
), ctds
infers a SQL
NVARCHAR
type. This is done for usability and relies on the fact that SQL Server will convert the NVARCHAR
type to VARCHAR
as needed on the server.
For bulk_insert
, FreeTDS doesn't do any encoding of the column data on the client side and sends the raw column data to the server for insertion. ctds
encodes all string data to utf-8
prior to passing it off to FreeTDS unless the underlying Python object is already a bytes
-like object. Hence this section regarding bulk_insert
and strings.
@joshuahlang Thanks. I think I better understand now.
@joshuahlang I think VARCHAR and NVARCHAR names are really switched. This is causing me confusion. Am I wrong?
For example, in one script I have these two functions to convert strings to ctds types before bulk inserting. The problem is, the columns which are nvarchar at the database table (where collation is Latin1_General_CI_AS) I need to wrapped it on ctds.SqlVarChar and the columns which are varchar at the database I can wrap it to ctds.SqlNVarChar.
Don't you think this weird? If I change and use ctds.SqlNVarChar to wrap values to NVARCHAR columns (collation Latin1_General_CI_AS) the data gets encoded wrong on the database. I think ctds.SqlVarChar should be name ctds.SqlNVarChar and vice-versa.
def _to_nvarchar(txt: str) -> ctds.VARCHAR:
if txt == "null":
return None
return ctds.SqlVarChar(txt.encode("utf-16le"))
def _to_varchar(txt: str) -> ctds.VARCHAR:
if txt == "null":
return None
return ctds.SqlNVarChar(txt)
When passing in a python str
or ctds.SqlNVarChar
type, ctds
will encode the value as UTF-8 and pass the raw bytes along with the SQL type of NVARCHAR
to freetds. This works as expected typically, and freetds gets the data to SQL Server as expected. This, however, is not the case for the bulk_*
extensions that FreeTDS provides. I don't remember the exact details (it was a while ago now), but the bulk_*
APIs in freetds would always pass the raw encoded string bytes (UTF-8) to SQL server. As I recall, this was incorrect since the multi-byte characters should actually be UTF-16 encoded on the wire (per TDS protocol). I assumed this was just another bug in the bulk_*
FreeTDS APIS. I vaguely remember builk update not supporting the NVARCHAR
type, but I'd have to go back and look again.
Anyway, the current ctds
behavior of passing encoded bytes to the ctdsSqlVarChar
type is simply to allow the client to encode the data properly (as expected by SQL Server), and still specify the desired type of VARCHAR
. Without wrapping the encode bytes in ctds.SqlVarChar
, ctds
would infer a BINARY
type for that parameter.
I think the ctdsNvarchar should be the one supporting encode, since on the database side the Nvarchar columns are the ones which supports Unicode.
Why does ctds.SqlVarChar receives byte strings while ctds.SqlNVarChar receives just strings? Shouldn't this be the other way around, with ctds.SqlNVarChar passing byte strings unchanged to the database, once the NVarChar is the encoded type on the database?
I always get confuse about encoding strings when bulk inserting.