zzzprojects / sqlfiddle4

New version of SQL Fiddle enhanced by AI
https://sqlfiddle.com/
0 stars 0 forks source link

Extracting SQL Snippets #4

Closed mrigger closed 3 months ago

mrigger commented 4 months ago

Is there any way in how we can (and are allowed to) extract the SQL snippets that users decided to share (i.e., they have a public URL)?

We are researchers aiming to make database systems and other data-centric systems more reliable (e.g., we have developed https://github.com/sqlancer/sqlancer) and the SQL snippets would be useful for various purposes, such as seed inputs for fuzzers, as stand-alone test cases, or for training models. This would likely require some curation to prepare the snippets as a data set and we'd be excited to work/collaborate on this.

A somewhat related study: https://faculty.washington.edu/billhowe/publications/pdfs/jain2016sqlshare.pdf.

odys-z commented 4 months ago

HI,

I can't wrap my head around your "hypothesis" in your paper, and hence it is difficult to understand the problem context, TLDR. As to "extract SQL snippets", I have done this with Antlr4 https://www.antlr.org/. It also comes with a bunch of grammar definitions. I have used them to build a tool that can parse SQL statement strings into AST, then generate valid SQL statements committable to real DBMS.

Best

On Thu, Jun 13, 2024 at 2:47 PM Manuel Rigger @.***> wrote:

Is there any way in how we can (and are allowed to) extract the SQL snippets that users decided to share (i.e., they have a public URL)?

We are researchers aiming to make database systems and other data-centric systems more reliable (e.g., we have developed https://github.com/sqlancer/sqlancer) and the SQL snippets would be useful for various purposes, such as seed inputs for fuzzers, as stand-alone test cases, or for training models. This would likely require some curation to prepare the snippets as a data set and we'd be excited to work/collaborate on this.

A somewhat related study: https://faculty.washington.edu/billhowe/publications/pdfs/jain2016sqlshare.pdf .

— Reply to this email directly, view it on GitHub https://github.com/zzzprojects/sqlfiddle4/issues/4, or unsubscribe https://github.com/notifications/unsubscribe-auth/AKSIDLNFPLYDZU3QTG2BJXDZHE6BZAVCNFSM6AAAAABJHXNLX2VHI2DSMVQWIX3LMV43ASLTON2WKOZSGM2TAMRZHEYDMNI . You are receiving this because you are subscribed to this thread.Message ID: @.***>

mrigger commented 4 months ago

Thanks for your quick response, @odys-z!

I can't wrap my head around your "hypothesis" in your paper [...]

The paper linked above is actually not ours.

[...] and hence it is difficult to understand the problem context

The problem we are trying to solve is the lack of publicly-available large and diverse SQL data sets, which could be used for various purposes (e.g., we would be interested in mutating them for finding bugs in database systems).

As to "extract SQL snippets", I have done this with Antlr4

Yes, the technical aspect of parsing the SQL snippets might be relatively simple (but probably quite a bit of hassle due to the differences in SQL dialects).

I was unclear whether we could get access to the SQL snippets that were made sharable. If we do, we could clean the snippets (e.g., removing redundant or simple ones), analyze them (e.g., what are the statement distributions), and then provide them as a dataset for us and others to build upon, if that aligns with your goals.

mrigger commented 4 months ago

What do you mean by "get access to the SQL snippets"? I don't think it is a question of accessing privilege.

I might misunderstand something, but based on my understanding, the SQL snippets from https://sqlfiddle.com/ are stored server-side, so that they can be shared. The IDs part of the URL (e.g., https://sqlfiddle.com/sqlite/online-compiler?id=fc8e8f12-a1ce-4244-9c92-476e04954412) don't seem to follow a clear pattern, so it is unclear to me how the snippets can be extracted in an automatic fashion. Is there some way they can be extracted?

As for now, let me guess your goal.

The goal is to create the curated data set, but the application you outlined would be one of the potential applications the data set could be used for.

The AST, or semantics, are almost the same.

Differences within SQL dialects have been a major challenge in our research, but we also consider more diverse database systems with special constraints (e.g., time-series ones). We have a study under review that suggests that the differences are even significant for SQLite, MySQL, and PostgreSQL in both syntax and semantics, even for common features. Below are some examples for semantic differences.

PostgreSQL disagrees:

mysql> SELECT '\a' LIKE '\a'; -- true
postgres> SELECT '\a' LIKE '\a'; -- false
sqlite> SELECT '\a' LIKE '\a'; -- true

MySQL disagrees:

mysql> SELECT 5/2 -- 2.5000
postgresql> SELECT 5/2; -- 2
sqlite> SELECT 5/2; -- 2

SQLite disagrees:

mysql> SELECT 1 = '1'; -- true
postgresql> SELECT 1 = '1'; -- true
sqlite> SELECT 1 = '1'; -- false

Thanks again for the feedback and your interest!

odys-z commented 4 months ago

I think I have some wrong configuration on Github, and have on intention to add misinformation to the question. I know nothing about Sqlfiddle. Sorry about my rushing into the thread.

best regards

mrigger commented 4 months ago

I see, no worries!

JonathanMagnan commented 4 months ago

Hello @mrigger ,

Indeed, all fiddles are stored in a database, and I could share them in multiple ways.

I would not mind sharing a few hundred/thousands of SQL codes that are usually very easy to find out since they are shared in articles / posted on social networks, but sharing the whole database will make me very uncomfortable as some people (even if created public) would probably not be happy with this decision and I would 100% understand them.

I suggest you to contact us directly here: info@zzzprojects.com

Best Regards,

Jon