wintersrd / pipelinewise-tap-mssql

Pipelinewise tap for Microsoft SQL Server
GNU Affero General Public License v3.0
14 stars 50 forks source link

Supporting different SQL Server info schemas - Singer Decimal #65

Closed s7clarke10 closed 8 months ago

s7clarke10 commented 8 months ago

Overview: The output of Singer Decimal is a string for a numeric column (with hints about the precision and scale) so it can be reconstructed correctly in the target. Singer Decimal - that is output as a string is done to avoid rounding issue in Python and the Singer Framework. It is a safer method as there is no loss in precision.

Problem:

  1. The output of Singer Decimal is a string for a numeric based columns. The old code did not support string as a value data_type when Singer Decimal was enabled. This cause JSON Schema validation to fail.
  2. The precision is not being correctly determined in later versions of SQL Server. The results from the SQL Server Information Schema can vary from database version to database version. In earlier database versions a int / decimal will have the value populated in the character_maximum_length in later database versions the maximum allow size in the numeric_precision field.

Ingestion into target-jsonl would fail due to schema validation failing. Loading into a custom Target Snowflake would fail due to the precision being set to None which is an invalid precision.

Changes:

  1. Add 'string' to the list of data types which can be outputted when singer decimal is enabled.
  2. Try to obtain the precision from the character_maximum_length field in the information schema. If the result is None then obtain the precision from the numeric_precision field.

Results:

Target JSONL

image

Target Snowflake

image image

s7clarke10 commented 8 months ago

Happy with these changes - approving

Perhaps it would be good to review in more detail in future:

https://stackoverflow.com/questions/12728243/retrieve-size-integer-values-sql-server-table-with-information-schema-columns-vi

It seems there are alternative approaches such as making use of the sys.columns view which has specific columns named prec/scale.

Thank you for the review 👍 . I looked at the suggestion around the sys objects rather than the information schema. I think the risk that we have is from a DBA perspective often objects owned by sys are not always granted to most users where as a public view like the information schema is available. As you mentioned we may be able to do some conditional logic to use sys.columns if available. Most likely a public view is more future proof. We can consider other approaches in a future PR.