Built with ply (lex & yacc in Python). A lot of samples can be found in 'tests/'.
Yes, the library already has about 9000+ downloads per day - https://pypistats.org/packages/simple-ddl-parser.
As the maintainer, I guarantee that backward incompatible changes will not be made in patch or minor versions. But! Pay attention that output keywords can sometimes change in a minor version because of fixing previously wrong behaviour.
The full list of updates can be found in the Changelog below (at the end of the README).
Version 1.0.0 was released due to significant changes in the output structure and a stricter approach regarding the scope of the produced output. Now, you must provide the argument 'output_mode=name_of_your_dialect' if you wish to see arguments/properties specific to a particular dialect
Parser supports:
You can check the dialect sections after the Supported Statements
section to get more information about which statements from each dialect are already supported by the parser. If you need more statements or new dialects - feel free to open an issue.
Pay attention that I add functional tests for all supported statements, so if your statement fails and you don't see it in the tests, it's 99.9% certain that I didn't have a sample with such an SQL statement - so feel free to open an issue and I will add support for it.
If you need some statement that is not supported by the parser yet, please provide a DDL example & information about which SQL dialect or DB it comes from.
The types used in your DB do not matter, so the parser should work successfully with any DDL for an SQL DB. The parser is NOT case sensitive; it does not expect all queries to be in upper case or lower case. So you can write statements like this:
Alter Table Persons ADD CONSTRAINT CHK_PersonAge CHECK (Age>=18 AND City='Sandnes');
It will be parsed as is without errors.
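For example, a minimal sketch (with a hypothetical table) showing that mixed-case DDL is parsed the same way as upper- or lower-case DDL:
from simple_ddl_parser import DDLParser

# mixed-case keywords are accepted, per the note above (hypothetical table)
ddl = """
Create Table If Not Exists persons (
    Age int,
    City varchar(50)
);
"""
print(DDLParser(ddl).run())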
If you have samples that cause an error, please open an issue (but don't forget to add the DDL example); I will be glad to fix it.
A lot of statements and output results can be found in the tests on GitHub - https://github.com/xnuinside/simple-ddl-parser/tree/main/tests .
pip install simple-ddl-parser
In some dialects like HQL there is a lot of additional information about the table, for example, whether it is an external table, STORED AS, LOCATION, etc. These properties will always be empty in 'classic' SQL DBs like PostgreSQL or MySQL, which is why this information is 'hidden' by default. Also some fields are hidden in HQL because they simply do not exist in HIVE, for example 'deferrable_initially'. To get these HQL-specific details about a table in the output, use the 'output_mode' argument in the run() method.
example:
ddl = """
CREATE TABLE IF NOT EXISTS default.salesorderdetail(
SalesOrderID int,
ProductID int,
OrderQty int,
LineTotal decimal
)
PARTITIONED BY (batch_id int, batch_id2 string, batch_32 some_type)
LOCATION 's3://datalake/table_name/v1'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
STORED AS TEXTFILE
"""
result = DDLParser(ddl).run(output_mode="hql")
print(result)
And you will get output with additional keys 'stored_as', 'location', 'external', etc.
# additional keys examples
{
...,
'location': "'s3://datalake/table_name/v1'",
'map_keys_terminated_by': "'\\003'",
'partitioned_by': [{'name': 'batch_id', 'size': None, 'type': 'int'},
{'name': 'batch_id2', 'size': None, 'type': 'string'},
{'name': 'batch_32', 'size': None, 'type': 'some_type'}],
'primary_key': [],
'row_format': 'DELIMITED',
'schema': 'default',
'stored_as': 'TEXTFILE',
...
}
If you run the parser from the command line, add the flag '-o=hql' or '--output-mode=hql' to get the same result.
Possible output_modes: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databricks', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']
from simple_ddl_parser import DDLParser
parse_results = DDLParser("""create table dev.data_sync_history(
data_sync_id bigint not null,
sync_count bigint not null,
sync_mark timestamp not null,
sync_start timestamp not null,
sync_end timestamp not null,
message varchar(2000) null,
primary key (data_sync_id, sync_start)
); """).run()
print(parse_results)
from simple_ddl_parser import parse_from_file
result = parse_from_file('tests/sql/test_one_statement.sql')
print(result)
simple-ddl-parser is installed to the environment as the command sdp
sdp path_to_ddl_file
# for example:
sdp tests/sql/test_two_tables.sql
You will see the output in the schemas folder, in a file with the name test_two_tables_schema.json.
If you also want to see the output in the console, use the -v flag for verbose output.
sdp tests/sql/test_two_tables.sql -v
If you don't want to dump the schema to a file and just want to print the result to the console, use the --no-dump flag:
sdp tests/sql/test_two_tables.sql --no-dump
You can provide a target path where you want to dump the result with the argument -t or --target:
sdp tests/sql/test_two_tables.sql -t dump_results/
If you want to get the output as JSON in stdout, you can use the argument json_dump=True in the .run() method:
from simple_ddl_parser import DDLParser
parse_results = DDLParser("""create table dev.data_sync_history(
data_sync_id bigint not null,
sync_count bigint not null,
); """).run(json_dump=True)
print(parse_results)
Output will be:
[{"columns": [{"name": "data_sync_id", "type": "bigint", "size": null, "references": null, "unique": false, "nullable": false, "default": null, "check": null}, {"name": "sync_count", "type": "bigint", "size": null, "references": null, "unique": false, "nullable": false, "default": null, "check": null}], "primary_key": [], "alter": {}, "checks": [], "index": [], "partitioned_by": [], "tablespace": null, "schema": "dev", "table_name": "data_sync_history"}]
DDLParser(ddl).run()
The .run() method has several arguments that change the output result. As you saw above, there is the argument output_mode
that allows you to set the dialect and get more fields in the output relative to the chosen dialect, for example 'hql'. Possible output_modes: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databricks', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']
The .run() method also has the argument group_by_type
(by default: False). By default the parser output looks like a list of dicts where each dict == one entity from the DDL (table, sequence, type, etc.). To understand which entity a dict describes you need to check its keys: if 'table_name' is in the dict it is a table, if 'type_name' it is a type, etc.
To make this a little easier you can set group_by_type=True and you will get output already grouped by type, like:
{
'tables': [all_parsed_tables],
'sequences': [all_parsed_sequences],
'types': [all_parsed_types],
'domains': [all_parsed_domains],
...
}
For example:
ddl = """
CREATE TYPE "schema--notification"."ContentType" AS
ENUM ('TEXT','MARKDOWN','HTML');
CREATE TABLE "schema--notification"."notification" (
content_type "schema--notification"."ContentType"
);
CREATE SEQUENCE dev.incremental_ids
INCREMENT 10
START 0
MINVALUE 0
MAXVALUE 9223372036854775807
CACHE 1;
"""
result = DDLParser(ddl).run(group_by_type=True)
# result will be:
{'sequences': [{'cache': 1,
'increment': 10,
'maxvalue': 9223372036854775807,
'minvalue': 0,
'schema': 'dev',
'sequence_name': 'incremental_ids',
'start': 0}],
'tables': [{'alter': {},
'checks': [],
'columns': [{'check': None,
'default': None,
'name': 'content_type',
'nullable': True,
'references': None,
'size': None,
'type': '"schema--notification"."ContentType"',
'unique': False}],
'index': [],
'partitioned_by': [],
'primary_key': [],
'schema': '"schema--notification"',
'table_name': '"notification"'}],
'types': [{'base_type': 'ENUM',
'properties': {'values': ["'TEXT'", "'MARKDOWN'", "'HTML'"]},
'schema': '"schema--notification"',
'type_name': '"ContentType"'}]}
Right now support is added only for ALTER statements with FOREIGN KEY.
For example, if your DDL has ALTER TABLE statements after the table definitions (CREATE TABLE statements) like this:
ALTER TABLE "material_attachments" ADD FOREIGN KEY ("material_id", "material_title") REFERENCES "materials" ("id", "title");
These statements will be parsed and the information about them put inside the 'alter' key of the table's dict, as shown in the sketch below. For more examples, please check the alter statement tests - tests/test_alter_statements.py
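A minimal sketch of how this looks in practice (hypothetical tables; the parsed ALTER information lands under the 'alter' key of the "material_attachments" table dict):
from simple_ddl_parser import DDLParser

ddl = """
CREATE TABLE "materials" ("id" int, "title" varchar(200));
CREATE TABLE "material_attachments" ("material_id" int, "material_title" varchar(200));
ALTER TABLE "material_attachments" ADD FOREIGN KEY ("material_id", "material_title") REFERENCES "materials" ("id", "title");
"""
result = DDLParser(ddl).run()
# the second table's dict now carries the parsed ALTER under its 'alter' key
print(result[1]["alter"])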
You can find more examples in the tests/ folder.
To dump the result to JSON, use the argument .run(dump=True).
You can also provide a path where you want the schema dumps to be written with the argument .run(dump_path='folder_that_use_for_dumps/').
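A minimal sketch combining both arguments (hypothetical folder name):
from simple_ddl_parser import DDLParser

ddl = "create table dev.users (id bigint not null);"
# writes the parsed schema as JSON into the given folder instead of only returning it
DDLParser(ddl).run(dump=True, dump_path="schema_dumps/")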
By default the parser does not raise an error if some statement cannot be parsed - it just skips it & produces empty output.
To change this behaviour you can pass the 'silent=False' argument to the main parser class, like:
DDLParser(.., silent=False)
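A minimal sketch (with intentionally broken, hypothetical DDL); with silent=False an exception is raised instead of silently producing empty output:
from simple_ddl_parser import DDLParser

broken_ddl = "create tabel users (id int);"  # intentional typo in TABLE
try:
    DDLParser(broken_ddl, silent=False).run()
except Exception as exc:  # the exact exception type depends on the parser version
    print(f"parsing failed: {exc}")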
Use the DDLParser(.., normalize_names=True) flag to change the output of the parser: if the flag is True (default 'False'), then all identifiers will be returned without '[', '"' and other delimiters that are used in different SQL dialects to separate custom names from reserved words & statements. For example, if the flag is set to 'True' and you pass this input:
CREATE TABLE [dbo].[TO_Requests]( [Request_ID] [int] IDENTITY(1,1) NOT NULL, [user_id] [int]
In the output you will have names like 'dbo' and 'TO_Requests', not '[dbo]' and '[TO_Requests]' (see the sketch below).
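A minimal sketch of normalize_names in action, reusing the MSSQL-style input above (completed so it is runnable):
from simple_ddl_parser import DDLParser

ddl = """
CREATE TABLE [dbo].[TO_Requests](
    [Request_ID] [int] IDENTITY(1,1) NOT NULL,
    [user_id] [int]
);
"""
result = DDLParser(ddl, normalize_names=True).run()
print(result[0]["schema"], result[0]["table_name"])  # expected: dbo TO_Requests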
CREATE [OR REPLACE] TABLE [ IF NOT EXISTS ] + columns definition, columns attributes: column name + type + type size(for example, varchar(255)), UNIQUE, PRIMARY KEY, DEFAULT, CHECK, NULL/NOT NULL, REFERENCES, ON DELETE, ON UPDATE, NOT DEFERRABLE, DEFERRABLE INITIALLY, GENERATED ALWAYS, STORED, COLLATE
STATEMENTS: PRIMARY KEY, CHECK, FOREIGN KEY in table definitions (in create table();)
ALTER TABLE STATEMENTS: ADD CHECK (with CONSTRAINT), ADD FOREIGN KEY (with CONSTRAINT), ADD UNIQUE, ADD DEFAULT FOR, ALTER TABLE ONLY, ALTER TABLE IF EXISTS; ALTER .. PRIMARY KEY; ALTER .. USING INDEX TABLESPACE; ALTER .. ADD; ALTER .. MODIFY; ALTER .. ALTER COLUMN; etc
PARTITION BY statement
CREATE SEQUENCE with words: INCREMENT [BY], START [WITH], MINVALUE, MAXVALUE, CACHE
CREATE TYPE statement: AS TABLE, AS ENUM, AS OBJECT, INTERNALLENGTH, INPUT, OUTPUT
LIKE statement (in this and only this case a 'like' key is added to the output, with information about the table we did LIKE from - 'like': {'schema': None, 'table_name': 'Old_Users'}).
TABLESPACE statement
COMMENT ON statement
CREATE SCHEMA [IF NOT EXISTS] ... [AUTHORIZATION] ...
CREATE DOMAIN [AS]
CREATE [SMALLFILE | BIGFILE] [TEMPORARY] TABLESPACE statement
CREATE DATABASE + Properties parsing
ENCODE column property
SORTKEY, DISTSTYLE, DISTKEY, ENCODE table properties
CREATE TEMP / TEMPORARY TABLE
including syntax with a LIKE statement:
create temp table tempevent(like event);
To get logging output to a file, provide the parser with the 'log_file' argument containing a path or file name:
DDLParser(ddl, log_file='parser221.log').run(group_by_type=True)
To set the logging level, provide the argument 'log_level':
DDLParser(ddl, log_level=logging.INFO).run(group_by_type=True)
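A minimal sketch combining both logging arguments (hypothetical file name):
import logging

from simple_ddl_parser import DDLParser

ddl = "create table dev.users (id bigint not null);"
result = DDLParser(ddl, log_file="parser.log", log_level=logging.INFO).run(group_by_type=True)
print(result["tables"])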
The biggest 'Thanks' ever goes for contributions to the parser: https://github.com/dmaresma https://github.com/cfhowes https://github.com/swiatek25 https://github.com/slurpyb https://github.com/PBalsdon
Big thanks for the involvement & contributions with test cases, DDL samples & opened issues go to:
for help with debugging & testing support for BigQuery dialect DDLs:
v1.7.1
v1.7.0
v1.6.1
v1.6.0
In this version there are some output changes & fixes that can break your code.
Now all arguments inside brackets are parsed as separate strings in the list.
For example:
file_format = (TYPE=JSON NULL_IF=('field')
this was parsed like 'NULL_IF': "('field')",
now it will be: 'NULL_IF': ["'field'"],
Added separate tokens for EQ =
and IN (previously they were also parsed as IDs) - internal info, for contributors.
Some check statements in columns are now parsed validly, and IN statements are parsed as normal lists. So this statement: include_exclude_ind CHAR(1) NOT NULL CONSTRAINT chk_metalistcombo_logicalopr CHECK (include_exclude_ind IN ('I', 'E')),
will produce this output:
{'check': {'constraint_name': 'chk_metalistcombo_logicalopr', 'statement': {'in_statement': {'in': ["'I'", "'E'"], 'name': 'include_exclude_ind'}}},
v1.5.4
pattern token for external table statements, and improved location rendering
v1.5.3
v1.5.2
v1.5.1
v1.5.0
unique is set on a column only if there was only one column in the unique constraint/index. Issue - https://github.com/xnuinside/simple-ddl-parser/issues/255
v1.4.0
temp & is_global. Added support for create global temporary table - https://github.com/xnuinside/simple-ddl-parser/issues/182
v1.3.0
PostgreSQL:
*_TRUNC PARTITIONS like DATETIME_TRUNC, TIMESTAMP_TRUNC, etc; the second argument moved to the arg 'trunc_by'
PostgreSQL:
BigQuery:
v1.2.1
MySQL:
MySQL:
Oracle:
PostgreSQL:
v1.1.0
MySQL:
Snowflake:
v1.0.4
DEFAULT and CHECK statements. https://github.com/xnuinside/simple-ddl-parser/issues/240
v1.0.3
CREATE OR REPLACE SCHEMA.
stage_ fileformat option value equal to a single string, as FIELD_OPTIONALLY_ENCLOSED_BY = '\"', FIELD_DELIMITER = '|'
v1.0.2
v1.0.1
normalize_names=True does not remove [] from types like decimal(21)[].
normalize_names=True ensures that "complex"."type" style names convert to complex.type.
v1.0.0
Important changes were made to the output structure that can, in theory, break code.
All custom table properties that are defined after the column definitions in a 'CREATE TABLE' statement and relate to only one dialect (only for SparkSQL, or HQL, etc.), for example, like here:
https://github.com/xnuinside/simple-ddl-parser/blob/main/tests/dialects/test_snowflake.py#L767 or https://github.com/xnuinside/simple-ddl-parser/blob/main/tests/dialects/test_spark_sql.py#L133 will now be saved in the property table_properties as a dict.
Previously they were placed on the same level of the table output as columns, alter, etc. Now they are grouped and moved to the key table_properties.
Formatting of the parser result is now represented by 2 classes - Output & TableData - which makes it more strict and readable.
The output mode now functions more strictly. If you want to obtain output fields specific to a certain dialect, use output_mode='snowflake' for Snowflake or output_mode='hql' for HQL, etc. Previously, some keys appeared in the result without being filtered by dialect. For example, if 'CLUSTER BY' was in the DDL, it would show up in the 'cluster_by' field regardless of the output mode. However, now all fields that only work in certain dialects and are not part of the basic SQL notation will only be shown if you choose the correct output_mode.
The full list of supported dialects can be found in the dict supported_dialects:
from simple_ddl_parser import supported_dialects
Currently supported: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databricks', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']
If you don't see the dialect that you want to use, open an issue with a description and links to the database docs, or use one of the existing dialects.
Dataset used instead of schema in BigQuery dialect.
v0.32.1
v0.32.0
create table statement and alter.
For example, if in create table you use quotes like "schema_name"."table_name", but in alter it was schema_name.table_name - previously it didn't work, but now the parser understands that it is the same table.
v0.31.3
AS () statement - https://github.com/xnuinside/simple-ddl-parser/issues/218
v0.31.2
ORDER|NOORDER statement - https://github.com/xnuinside/simple-ddl-parser/issues/213
v0.31.1
WITH TAG statement.
v0.31.0
Snowflake Table DDL support of WITH MASKING POLICY column definition - https://github.com/xnuinside/simple-ddl-parser/issues/201
v0.30.0
The IN TABLESPACE IBM DB2 statement is now parsed into the 'tablespace' key. Issue: https://github.com/xnuinside/simple-ddl-parser/issues/194. INDEX IN is also parsed to the 'index_in' key. Added support for the ORGANIZE BY statement.
Added support for PostgreSQL INHERITS statement. Issue: https://github.com/xnuinside/simple-ddl-parser/issues/191
v0.29.1
v0.29.0
v0.28.1 Improvements:
Fixes:
v0.28.0
Important Changes (Pay attention):
Amazing innovation:
Improvements: MariaDB:
Common:
/* */ syntax.
group_by_type= arg in parser.
v0.27.0
Fixes:
Improvements:
New features:
v0.26.5
Fixes:
v0.26.4
v0.26.3
Improvements:
v0.26.2
Fixes:
Improvements:
v0.26.1
Fixes:
v0.26.0 Improvements:
New features:
Fixes:
v0.25.0
CREATE TABLE [dbo].[TO_Requests]( [Request_ID] [int] IDENTITY(1,1) NOT NULL, [user_id] [int]
In output you will have names like 'dbo' and 'TO_Requests', not '[dbo]' and '[TO_Requests]'.
v0.24.2
v0.24.1
v0.24.0
v0.23.0
Big refactoring: less code complexity & increased code coverage. Radon added to pre-commit hooks.
PostgreSQL:
v0.22.5
v0.22.4
v0.22.3
v0.22.2
v0.22.1