Open kenshaw opened 3 years ago
I have been working through some design ideas on what can be possible for a `usql` version of `\copy`. In conjunction with the plans for a post-1.0 version, I have come up with what I believe will be a reasonable implementation.
The syntax for the `\copy` command will not follow `psql`, nor will any attempt be made to implement the SQL-esque syntax parsing that exists in `psql`'s `\copy`. While this goes against some of the thinking around `usql`, after careful contemplation I do not find it feasible, for a number of reasons (which I will lay out in a different comment). That said, here is what I think `usql`'s copy should be:
```
\copy [-all] [[FROM] [TO]] [SOURCE_QUERY] [DESTINATION_QUERY]
```
To expand on this, each of the optional `[FROM]` and `[TO]` would support two forms:

1. A connection URL/DSN (`driver+transport://user:pass@host/dbname?opt1=a&opt2=b`), OR
2. A named connection, which starts with the `@` character and follows standard Go identifier rules (`@my_connection`)

For number 2, this will be defined in a number of ways: in the standard `.usqlpass` file, with the addition of the `@identifier` and a single space (` `) as a prefix to a line entry. This new form will also enable defining standard URL-style entries. For example:
```
$ cat ~/.usqlpass
@my_connection postgres:*:*:*:postgres:P4ssw0rd
@my_url_dsn sqlite:/path/to/my_database.db
```
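To make the proposed line format concrete, here is a minimal sketch of how such a named-connection entry could be recognized. The function name `parseNamedConn` and the exact validation are hypothetical, not usql's actual parser; it only demonstrates the "`@identifier` + single space + remainder" convention described above:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// parseNamedConn recognizes a named-connection line in ~/.usqlpass:
// an "@" prefix, a Go-identifier name, a single space, then the
// remainder (either a classic colon-separated entry or a URL-style DSN).
// Hypothetical sketch, not usql's real implementation.
func parseNamedConn(line string) (name, rest string, ok bool) {
	if !strings.HasPrefix(line, "@") {
		return "", "", false
	}
	body := line[1:]
	i := strings.IndexByte(body, ' ')
	if i <= 0 {
		return "", "", false
	}
	name = body[:i]
	// enforce Go identifier rules: letter or _ first, then letters/digits/_
	for j, r := range name {
		if r == '_' || unicode.IsLetter(r) || (j > 0 && unicode.IsDigit(r)) {
			continue
		}
		return "", "", false
	}
	return name, body[i+1:], true
}

func main() {
	for _, line := range []string{
		"@my_connection postgres:*:*:*:postgres:P4ssw0rd",
		"@my_url_dsn sqlite:/path/to/my_database.db",
		"localhost:5432:*:postgres:P4ssw0rd", // regular entry, not named
	} {
		if name, rest, ok := parseNamedConn(line); ok {
			fmt.Printf("%s => %s\n", name, rest)
		}
	}
}
```

Lines without the `@` prefix would continue to be treated as ordinary `.usqlpass` entries.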
These entries will be parsed/loaded at startup and made available/set via a new `\cset` command. Entries configured in the `~/.usqlpass` file or created via the `\cset` command will be made available to the `\c`/`\connect` commands:
```
# list connections:
(not connected)=> \cset
@my_connection "postgres://postgres@localhost:5432"
@my_url_dsn "sqlite:/path/to/my_database.db"
# note in the above that the values for the named connection @my_connection were interpolated as the _default_ values for the URL's protocol type
# create a new named connection
(not connected)=> \cset my_new_dsn postgres://user:pass@localhost/mydb?opt...
# connect to a named connection
(not connected)=> \connect @my_new_dsn
(not connected)=> \connect @my_connection
```
Additionally, that will require a `\cunset` command:

```
(not connected)=> \cunset my_connection
```
In the interest of user flexibility and robustness, the same functionality will also be made available for the `\set` and `\unset` commands, where any identifier prefixed with `@` will be treated as a named connection:

```
(not connected)=> \set @my_other_conn "driver://name/..."
(not connected)=> \unset @my_other_conn
```
Thus, the proposed `\copy` form will support the following:

```
# copy from named connection 1 to named connection 2
(not connected)=> \copy @my_new_dsn @my_connection SOURCE_QUERY DESTINATION_QUERY
(not connected)=> \copy pg:// @my_connection SOURCE_QUERY DESTINATION_QUERY
# copy from current active connection to a named connection
pg:postgres@=> \copy @my_connection SOURCE_QUERY DESTINATION_QUERY
# copy from current active connection to current active connection
pg:postgres@=> \copy SOURCE_QUERY DESTINATION_QUERY
```
Which brings us to the second part: the `SOURCE_QUERY` and `DESTINATION_QUERY`. It is expected that a `SOURCE_QUERY` will be any well-formed query (such as a `SELECT` query) that can be executed against the `FROM` connection and returns one or more result sets, with each result set having 0 or more rows. Similarly, the `DESTINATION_QUERY` should be a well-formed query (such as an `INSERT` query) that can be executed against the `TO` connection:
```
# copy ONE result set from the source query to the destination query -- more on query syntax later
pg:postgres@=> \copy 'SELECT * FROM test' 'INSERT INTO ...'
```
By default, the `\copy` command would only execute against the first result set returned by the `SOURCE_QUERY`, unless the optional parameter `-all` was specified:

```
# copy ALL result sets from the source query to the destination query
pg:postgres@=> \copy -all 'SELECT ...' 'INSERT ...'
```
The way that the `SOURCE_QUERY` and `DESTINATION_QUERY` are applied is as follows (in pseudo-code), with the `SOURCE_QUERY` executed against the `FROM` connection and the `DESTINATION_QUERY` executed against the `TO` connection:

```
RESULT := FROM.query SOURCE_QUERY
loop
  for ROW in RESULT
    COPY_RESULT := TO.query DESTINATION_QUERY, ROW
    if ERROR
      break
    end
    if COPY_RESULT.has_output
      write STDOUT, COPY_RESULT
    end
  end
  if not ALL
    break
  end
  RESULT = RESULT.next
  if not RESULT
    break
  end
end
```
In English: for every row returned from the source query, execute the destination query using the row's columns as parameters. Repeat for all result sets (if applicable).
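The row loop above can be sketched in Go. The interfaces (`rowSource`, `execer`) and function names below are illustrative stand-ins for the `FROM` and `TO` connections, not usql's actual API; a real implementation would sit on `database/sql`:

```go
package main

import "fmt"

// rowSource stands in for a result set on the FROM connection.
type rowSource interface {
	Next() ([]any, bool) // next row, and whether one was available
}

// execer stands in for the TO connection.
type execer interface {
	Exec(query string, args ...any) error
}

// copyRows executes destQuery once per source row, passing the row's
// columns as parameters, and stops on the first error (matching the
// proposal's "terminate on error" behavior).
func copyRows(src rowSource, dst execer, destQuery string) (int, error) {
	n := 0
	for {
		row, ok := src.Next()
		if !ok {
			return n, nil
		}
		if err := dst.Exec(destQuery, row...); err != nil {
			return n, err
		}
		n++
	}
}

// In-memory stand-ins for demonstration only.
type sliceRows struct{ rows [][]any }

func (s *sliceRows) Next() ([]any, bool) {
	if len(s.rows) == 0 {
		return nil, false
	}
	r := s.rows[0]
	s.rows = s.rows[1:]
	return r, true
}

type printExec struct{}

func (printExec) Exec(q string, args ...any) error {
	fmt.Println(q, args)
	return nil
}

func main() {
	src := &sliceRows{rows: [][]any{{1, "one"}, {2, "two"}}}
	n, err := copyRows(src, printExec{}, "INSERT INTO dest (a, b) VALUES ($1, $2)")
	fmt.Println(n, err)
}
```

Handling multiple result sets (the `-all` case) would wrap this loop in an outer loop that advances to the next result set, as in the pseudo-code.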
This brings us to the next part of the proposal: the addition of defined queries. Similar to using `@identifier` for named connections, an additional concept of defined queries is introduced, making use of the `%` character. These would be accessible via a `%identifier` value, and created with a new `\define`/`\end` command set:
```
(not connected)=> \define my_select_query
(not connected)=> SELECT one AS a, two AS b FROM test
(not connected)'> \end
```
These defined queries would similarly have an `\undefine` command, and would not perform variable interpolation until the time of execution. They could be used anywhere within the standard query buffer or with commands:
```
(not connected)=> \echo %my_select_query
SELECT one AS a, two AS b FROM test
(not connected)=> \p
Query buffer is empty
(not connected)=> %my_select_query
(not connected)-> \p
SELECT one AS a, two AS b FROM test
(not connected)-> \r
Query buffer reset (cleared).
(not connected)=> \c pg://
Connected with driver postgres (PostgreSQL 13.1 (Debian 13.1-1.pgdg100+1))
pg:postgres@=> %my_select_query \g
```
As with named connections, named queries could be `\unset`:

```
pg:postgres@=> \unset %my_select_query
```
Care will need to be taken to ensure that named queries work with standard commands and with "expected" results. For example, the `\edit`/`\e` and `\print`/`\p` commands:
```
# opens query in external editor, saving the result back to the named query my_new_query
pg:postgres@=> \e %my_new_query
# prints the named query
pg:postgres@=> \p %my_new_query
SELECT * FROM my_table
```
While this behavior has not yet been determined for other commands, it is expected that this will be done prior to a full release of this functionality.
Named queries also have an important component, which is that they work as a full `usql` script when executed. That is, they will capture both SQL statements and `usql` commands:
```
(not connected)=> \define a_select_query
(not connected)=> \echo a b c
(not connected)-> \end
```
This has the added ability of being able to modify the results using `usql`'s own functionality or output. For example, using the `\crosstabview` command, or other functionality to be introduced at a later date:

```
(not connected)=> \define crosstab_select_query
(not connected)=> select * from test \crosstabview v h c
(not connected)-> \end
```
Note that a named query will be able to call `\define` within the named query itself, so long as the `\end`s balance. This behavior will follow the conventions established for the `\if`/`\elif`/`\else`/`\endif` commands.
With the ability to define named queries, the `\copy` command would be able to work like the following:

```
(not connected)=> \copy @my_conn1 @my_conn2 %my_select_query %my_insert_query
```
Which brings us to interpolation and interpretation rules for the `DESTINATION_QUERY`. Essentially, since a `\define` captures the user's raw query, and execution is deferred until the time of execution, the final part of the proposal would be to create a new variable scope, setting a named variable for each column of the resulting query. For illustration, consider the following query:

```sql
SELECT one AS a, two AS b FROM test
```
The above would return two columns, `a` and `b`. These would in turn be placed into a new variable scope that could be used/referred to in the `DESTINATION_QUERY`. For example, one might `\define` a `DESTINATION_QUERY` with the following:

```sql
INSERT INTO destination_table (one, two) VALUES (:'a', :'b')
```
The above would be executed in the same way as if the user had manually run it, with `usql` interpolating it as usual. In essence, this is functionally similar to executing a `\gset` on a row-by-row basis over the result set, and then executing the query against the second database connection.
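A rough sketch of that per-row interpolation, assuming the `:'name'` form interpolates as a single-quoted, escaped literal (as it does for ordinary usql variables). The function `interpolateRow` is hypothetical, not usql's actual interpolation code:

```go
package main

import (
	"fmt"
	"strings"
)

// interpolateRow binds each column of a source row under its column
// name and replaces :'name' references in the destination query with
// single-quoted, escaped literals -- roughly what \gset followed by
// normal variable interpolation would produce. Hypothetical sketch.
func interpolateRow(query string, cols []string, row []string) string {
	repl := make([]string, 0, len(cols)*2)
	for i, c := range cols {
		// :'name' interpolates as a quoted literal; double any embedded quotes
		quoted := "'" + strings.ReplaceAll(row[i], "'", "''") + "'"
		repl = append(repl, ":'"+c+"'", quoted)
	}
	return strings.NewReplacer(repl...).Replace(query)
}

func main() {
	cols := []string{"a", "b"}
	row := []string{"1", "it's"}
	q := "INSERT INTO destination_table (one, two) VALUES (:'a', :'b')"
	fmt.Println(interpolateRow(q, cols, row))
}
```

A real implementation would more likely pass the row values as bound parameters rather than splicing literals into the SQL text; the sketch only illustrates the variable-scope idea.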
The expected behavior when both a `FROM` and `TO` connection are specified would be to connect to the `FROM` connection first (which then becomes the primary connection in `usql`, effectively the same as a `\connect`). After the execution of a `\copy`, the connection to the `FROM` database will stay open. In this way, subsequent `\copy`'s (or other database queries) would continue to be performed against the primary database connection.
Additionally, the `TO` database connection would be closed immediately following the end of a `\copy` call. Any errors encountered would be output to `usql`'s handler's `stderr`, in the same way as any other errors that are encountered. For the standard `\copy` command, if any errors are encountered on the secondary connection, the `\copy` command's execution will be terminated and the connection to the secondary database will be closed. Eventually, a variant of (or option to) the `\copy` command might be added to force a single transaction on the second connection, or to instead provide a special query that is executed only once for all rows in a result set.
Can we start with a most basic implementation, that would only have:

- an `INSERT` statement

This should not require too much work upfront, and would help us figure out usability issues before jumping to adding new `usql` features like new variable syntax.
I also would like to first focus on what kind of interfaces need to be added to drivers to allow them to use a more efficient copy implementation wherever possible, with a fallback to a generic loop with `INSERT` statements. There are some interesting examples in lib/pq and pgx:
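One way to shape that driver interface, as a hypothetical sketch: drivers that support an efficient bulk path (such as PostgreSQL's COPY protocol, exposed by lib/pq and pgx) implement an optional interface, and everything else falls back to the generic per-row loop. The names `BulkCopier`, `Inserter`, and `doCopy` are illustrative, not usql's API:

```go
package main

import "fmt"

type row = []any

// BulkCopier is an optional capability a driver can implement to take
// the efficient bulk path (e.g. PostgreSQL COPY). Hypothetical name.
type BulkCopier interface {
	BulkCopy(table string, rows []row) (int, error)
}

// Inserter is the baseline capability every destination must provide.
type Inserter interface {
	Insert(table string, r row) error
}

// doCopy prefers the bulk path when the destination supports it, and
// otherwise falls back to a generic INSERT loop.
func doCopy(dst Inserter, table string, rows []row) (int, error) {
	if bc, ok := dst.(BulkCopier); ok {
		return bc.BulkCopy(table, rows)
	}
	n := 0
	for _, r := range rows {
		if err := dst.Insert(table, r); err != nil {
			return n, err
		}
		n++
	}
	return n, nil
}

// genericDst is a stand-in destination that only supports single INSERTs.
type genericDst struct{ count int }

func (g *genericDst) Insert(table string, r row) error { g.count++; return nil }

func main() {
	g := &genericDst{}
	n, err := doCopy(g, "dest", []row{{1}, {2}, {3}})
	fmt.Println(n, err)
}
```

The optional-interface type assertion mirrors how Go's `database/sql` probes drivers for optional capabilities.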
Need a `\copy` implementation. Since `\copy` is its own beast, open to ideas on what can be done.