xnuinside / simple-ddl-parser

Simple DDL Parser to parse SQL (HQL, TSQL, AWS Redshift, BigQuery, Snowflake and other dialects) ddl files to json/python dict with full information about columns: types, defaults, primary keys, etc. & table properties, types, domains, etc.
MIT License
179 stars 40 forks source link

Support for non-numeric column type parameters #171

Closed blazewicz closed 1 year ago

blazewicz commented 1 year ago

Parser fails on parametrized column type definitions using non-numeric parameters, type parameter is always assumed to be a size associated with the type and parser expects it to be an int.

PostGIS, a pupular PostgreSQL extension for spatial data adds some extra types, for example Geometry, which are parametrized with a subtype (docs).

Example table definition:

CREATE TABLE t1 (
    p Geometry(MultiPolygon, 26918)
);

Tool fails with following traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../site-packages/simple_ddl_parser/parser.py", line 345, in run
    self.tables = self.parse_data()
  File ".../site-packages/simple_ddl_parser/parser.py", line 257, in parse_data
    self.process_line(num != len(lines) - 1)
  File ".../site-packages/simple_ddl_parser/parser.py", line 287, in process_line
    self.process_statement()
  File ".../site-packages/simple_ddl_parser/parser.py", line 292, in process_statement
    self.parse_statement()
  File ".../site-packages/simple_ddl_parser/parser.py", line 300, in parse_statement
    _parse_result = yacc.parse(self.statement)
  File ".../site-packages/ply/yacc.py", line 333, in parse
    return self.parseopt_notrack(input, lexer, debug, tracking, tokenfunc)
  File ".../site-packages/ply/yacc.py", line 1120, in parseopt_notrack
    p.callable(pslice)
  File ".../site-packages/simple_ddl_parser/dialects/sql.py", line 283, in p_column
    self.set_column_size(p_list, p)
  File ".../site-packages/simple_ddl_parser/dialects/sql.py", line 291, in set_column_size
    p[0]["size"] = self.get_size(p_list)
  File ".../site-packages/simple_ddl_parser/dialects/sql.py", line 249, in get_size
    value_0 = int(p_list[-3])
ValueError: invalid literal for int() with base 10: 'MultiPolygon'

Describe the solution you'd like

The assumption about size being the only parameter should be replaced with a list of types for which this is known to be true. In all cases the parser could return an additional field with a list of the parameters associated with the type. This shouldn't be a breaking change.

Example output:

        "tables": [
            {
                "columns": [
                    {
                        "name": "p",
                        "type": "geometry",
                        "type_params": ["MultiPolygon", 26918],
                        "size": None,
                        "references": None,
                        "unique": False,
                        "nullable": True,
                        "default": None,
                        "check": None,
                        "on_update": None,
                    }
                ],
                ...
            }
        ],
blazewicz commented 1 year ago

Here's a sample test case:

from simple_ddl_parser import DDLParser

def test_postgis_geometry():
    ddl = """CREATE TABLE t1 (
        p Geometry (MultiPolygon, 26918)
    );"""
    result = DDLParser(ddl).run(group_by_type=True)
    expected = {
        "tables": [
            {
                "columns": [
                    {
                        "name": "p",
                        "type": "Geometry",
                        "type_params": ["MultiPolygon", 26918],
                        "size": None,
                        "references": None,
                        "unique": False,
                        "nullable": True,
                        "default": None,
                        "check": None,
                    },
                ],
                "primary_key": [],
                "alter": {},
                "checks": [],
                "index": [],
                "partitioned_by": [],
                "tablespace": None,
                "schema": None,
                "table_name": "t1",
            }
        ],
        "types": [],
        "sequences": [],
        "domains": [],
        "schemas": [],
        "ddl_properties": [],
    }
    assert expected == result

and the simplest POC that would make it work:

diff --git a/simple_ddl_parser/dialects/sql.py b/simple_ddl_parser/dialects/sql.py
index 9166bd8..017c8a1 100644
--- a/simple_ddl_parser/dialects/sql.py
+++ b/simple_ddl_parser/dialects/sql.py
@@ -288,7 +288,14 @@ class Column:
             and bool(re.match(r"[0-9]+", p_list[-1]))
             or p_list[-1] == "max"
         ):
-            p[0]["size"] = self.get_size(p_list)
+            try:
+                p[0]["size"] = self.get_size(p_list)
+            except ValueError:
+                p[0]["type_params"] = [
+                    int(param) if param.isnumeric() else param
+                    for param in p_list[2:]
+                    if param != ","
+                ]

     @staticmethod
     def set_property(p: List) -> List:
xnuinside commented 1 year ago

@blazewicz you can just open PR with same changes - feel free ) just check that all unit tests are passed & add new test case - I will merge it

xnuinside commented 1 year ago

@blazewicz hi, I released version 0.29.0 (https://pypi.org/project/simple-ddl-parser/) with support non-numeric column type parameters https://github.com/xnuinside/simple-ddl-parser/blob/main/tests/test_simple_ddl_parser.py#L3139 - here is a test case for your query. If will be needed anything else - feel free to open new issue!