xo / usql

Universal command-line interface for SQL databases
MIT License

`\copy` proposal #146

Open kenshaw opened 3 years ago

kenshaw commented 3 years ago

Need a \copy implementation.

Since \copy is its own beast, open to ideas on what can be done.

kenshaw commented 3 years ago

I have been working through some design ideas for a usql version of \copy. In conjunction with the plans for a post-1.0 version, I have come up with what I believe will be a reasonable implementation.

The syntax for the \copy command will not follow psql's, nor will any attempt be made to implement the SQL-esque syntax parsing that exists in psql's \copy. While this goes against some of the thinking around usql, after careful contemplation I do not find it feasible, for a number of reasons (which I will lay out in a different comment).

That said, here is what I think usql's copy should be:

\copy [-all] [[FROM] [TO]] [SOURCE_QUERY] [DESTINATION_QUERY]

To expand on this, each of the optional [FROM] and [TO] would support two forms:

  1. Standard URL syntax used elsewhere (driver+transport://user:pass@host/dbname?opt1=a&opt2=b), OR:
  2. A new "named connection" syntax that will always begin with the @ character and follows standard Go identifier rules (@my_connection)
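A minimal sketch (not usql's actual code) of distinguishing the two forms might look like the following; the helper name and the ASCII-only simplification of Go identifier rules are illustrative assumptions:

```go
package main

import (
	"fmt"
	"regexp"
)

// namedConnRE matches the proposed @identifier form, using standard
// Go identifier rules (simplified here to ASCII for illustration).
var namedConnRE = regexp.MustCompile(`^@[A-Za-z_][A-Za-z0-9_]*$`)

// isNamedConn reports whether s is a named connection reference
// rather than a standard URL/DSN.
func isNamedConn(s string) bool {
	return namedConnRE.MatchString(s)
}

func main() {
	fmt.Println(isNamedConn("@my_connection")) // true
	fmt.Println(isNamedConn("pg://"))          // false
}
```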

For number 2, named connections could be defined in a number of ways. In the standard .usqlpass file, an entry would be prefixed with the @identifier and a single space ( ). This new form would also allow defining standard URL-style entries. For example:

$ cat ~/.usqlpass
@my_connection postgres:*:*:*:postgres:P4ssw0rd
@my_url_dsn sqlite:/path/to/my_database.db
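Parsing such an entry is straightforward; the sketch below (function name and exact format assumptions are mine, not usql's) splits a prefixed line into its name and value:

```go
package main

import (
	"fmt"
	"strings"
)

// parsePassEntry splits a proposed ~/.usqlpass line of the form
// "@name value" into its name and value. The function name and the
// exact file format here are illustrative, not usql's implementation.
func parsePassEntry(line string) (name, value string, ok bool) {
	if !strings.HasPrefix(line, "@") {
		return "", "", false // not a named-connection entry
	}
	i := strings.IndexByte(line, ' ')
	if i < 0 {
		return "", "", false // malformed: no space separator
	}
	return line[1:i], line[i+1:], true
}

func main() {
	name, value, ok := parsePassEntry("@my_url_dsn sqlite:/path/to/my_database.db")
	fmt.Println(name, value, ok) // my_url_dsn sqlite:/path/to/my_database.db true
}
```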

These entries will be parsed/loaded at startup and made available/set via a new \cset command. Entries configured in the ~/.usqlpass or created via the \cset command will be made available to the \c/\connect commands:

# list connections:
(not connected)=> \cset
@my_connection "postgres://postgres@localhost:5432"
@my_url_dsn "sqlite:/path/to/my_database.db"
# note that for the named connection @my_connection above, the missing URL components were interpolated with the _default_ values for the URL's protocol type

# create a new named connection
(not connected)=> \cset my_new_dsn postgres://user:pass@localhost/mydb?opt...

# connect to a named connection
(not connected)=> \connect @my_new_dsn
(not connected)=> \connect @my_connection

Additionally, this will require a \cunset command:

(not connected)=> \cunset my_connection

In the interest of user flexibility and robustness, the same functionality will also be made available via the \set and \unset commands, where any identifier prefixed with @ will be treated as a named connection:

(not connected)=> \set @my_other_conn "driver://name/..."
(not connected)=> \unset @my_other_conn

Thus, the proposed \copy form will support the following:

# copy from named connection 1 to named connection 2
(not connected)=> \copy @my_new_dsn @my_connection SOURCE_QUERY DESTINATION_QUERY
(not connected)=> \copy pg:// @my_connection SOURCE_QUERY DESTINATION_QUERY

# copy from current active connection to a named connection
pg:postgres@=> \copy @my_connection SOURCE_QUERY DESTINATION_QUERY

# copy from current active connection to current active connection
pg:postgres@=> \copy SOURCE_QUERY DESTINATION_QUERY

Which brings us to the second part: the SOURCE_QUERY and DESTINATION_QUERY. It is expected that SOURCE_QUERY will be any well-formed query (such as a SELECT) that can be executed against the FROM connection and that returns one or more result sets, each with 0 or more rows. Similarly, DESTINATION_QUERY should be a well-formed query (such as an INSERT) that can be executed against the TO connection:

# copy ONE result set from the source query to the destination query -- more on query syntax later
pg:postgres@=> \copy 'SELECT * FROM test' 'INSERT INTO ...'

By default, the \copy command would only execute against the first result set returned by the SOURCE_QUERY, unless the optional parameter -all was specified:

# copy ALL result sets from the source query to the destination query
pg:postgres@=> \copy -all 'SELECT ...' 'INSERT ...'

The way that the SOURCE_QUERY and DESTINATION_QUERY are applied is as follows (in pseudo-code):

RESULT := FROM.query SOURCE_QUERY
loop
  for ROW in RESULT
    COPY_RESULT := TO.query DESTINATION_QUERY, ROW
    if ERROR
      break
    end
    if COPY_RESULT.has_output
      write STDOUT, COPY_RESULT
    end
  end
  if not ALL
    break
  end
  RESULT := RESULT.next
  if not RESULT
    break
  end
end

In English: for every row returned from the source query, execute the destination query using the row's columns as parameters. Repeat for all result sets (if applicable).
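The generic (non-driver-optimized) fallback for this loop could be sketched in Go as below. To keep the sketch self-contained and runnable, the database handles are modeled as small interfaces with in-memory fakes; in a real implementation these would be database/sql handles, with per-row column scanning via rows.Scan. All names here are illustrative:

```go
package main

import "fmt"

// row models one result row; source and dest model the minimal
// surface the copy loop needs from the FROM and TO connections.
type row []interface{}

type source interface {
	Query(q string) ([]row, error)
}

type dest interface {
	Exec(q string, args ...interface{}) error
}

// copyRows runs srcQuery against from and, for each returned row,
// runs dstQuery against to with the row's columns as parameters,
// stopping at the first destination error. It returns the number
// of rows successfully copied.
func copyRows(from source, to dest, srcQuery, dstQuery string) (int, error) {
	rows, err := from.Query(srcQuery)
	if err != nil {
		return 0, err
	}
	n := 0
	for _, r := range rows {
		if err := to.Exec(dstQuery, r...); err != nil {
			return n, err
		}
		n++
	}
	return n, nil
}

// In-memory fakes, for illustration only.
type memSource struct{ rows []row }

func (m memSource) Query(string) ([]row, error) { return m.rows, nil }

type memDest struct{ got []row }

func (m *memDest) Exec(_ string, args ...interface{}) error {
	m.got = append(m.got, row(args))
	return nil
}

func main() {
	src := memSource{rows: []row{{1, "a"}, {2, "b"}}}
	dst := &memDest{}
	n, err := copyRows(src, dst, "SELECT ...", "INSERT ...")
	fmt.Println(n, err, len(dst.got)) // 2 <nil> 2
}
```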

This brings us to the next part of the proposal: defined queries. Similar to using @identifier for named connections, an additional concept of defined queries is introduced, making use of the % character. These would be accessible via a %identifier reference, and created with a new \define / \end command pair:

(not connected)=> \define my_select_query
(not connected)=> SELECT one AS a, two AS b FROM test
(not connected)'> \end

These defined queries would similarly have a \undefine command, and would defer variable interpolation until time of execution. They could be used anywhere within the standard query buffer or with commands:

(not connected)=> \echo %my_select_query
SELECT one AS a, two AS b FROM test
(not connected)=> \p
Query buffer is empty
(not connected)=> %my_select_query
(not connected)-> \p
SELECT one AS a, two AS b FROM test
(not connected)-> \r
Query buffer reset (cleared).
(not connected)=> \c pg://
Connected with driver postgres (PostgreSQL 13.1 (Debian 13.1-1.pgdg100+1))
pg:postgres@=> %my_select_query \g
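Expanding a %identifier reference in the query buffer could be sketched as below; the storage map, token regexp, and function name are assumptions for illustration, not usql's code:

```go
package main

import (
	"fmt"
	"regexp"
)

// defined holds \define'd named queries; the storage and token
// syntax here are an illustrative sketch of the proposal.
var defined = map[string]string{
	"my_select_query": "SELECT one AS a, two AS b FROM test",
}

var nameRE = regexp.MustCompile(`%([A-Za-z_][A-Za-z0-9_]*)`)

// expand replaces each %identifier token with its defined query
// body, leaving unknown names untouched.
func expand(buf string) string {
	return nameRE.ReplaceAllStringFunc(buf, func(m string) string {
		if q, ok := defined[m[1:]]; ok {
			return q
		}
		return m
	})
}

func main() {
	fmt.Println(expand("%my_select_query")) // SELECT one AS a, two AS b FROM test
}
```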

As with named connections, named queries could be \unset:

pg:postgres@=> \unset %my_select_query

Care will need to be taken to ensure that named queries work with standard commands and produce "expected" results. For example, the \edit/\e and \print/\p commands:

# opens query in external editor, saving the result back to the named query my_new_query
pg:postgres@=> \e %my_new_query

# prints the named query
pg:postgres@=> \p %my_new_query
SELECT * FROM my_table

While this behavior has not yet been determined for other commands, it is expected that this will be done prior to a full release of this functionality.

Named queries also have an important property: they work as a full usql script when executed. That is, they will capture both SQL statements and usql commands:

(not connected)=> \define a_select_query
(not connected)=> \echo a b c
(not connected)-> \end

This adds the ability to modify the results using usql's own functionality or output, for example using \crosstabview, or other functionality to be introduced at a later date:

(not connected)=> \define crosstab_select_query
(not connected)=> select * from test \crosstabview v h c
(not connected)-> \end

Note that a named query will be able to call \define within the named query itself, so long as each \define has a matching \end. This behavior will follow the conventions established for the \if/\elif/\else/\endif commands.

With the ability to define named queries, the \copy command would be able to work like the following:

(not connected)=> \copy @my_conn1 @my_conn2 %my_select_query %my_insert_query

Which brings us to the interpolation and interpretation rules for the DESTINATION_QUERY. Essentially, since \define captures the user's raw query and interpolation is deferred until execution time, the final part of the proposal is to create a new variable scope, setting a named variable for each column of the source query's current row. For illustration, consider the following query:

SELECT one AS a, two AS b FROM test

The above would return two columns, a and b. These would in turn be placed into a new variable scope that could be referred to in the DESTINATION_QUERY. For example, one might \define a DESTINATION_QUERY as follows:

INSERT INTO destination_table (one, two) VALUES (:'a', :'b')

The above would then be executed exactly as if the user had typed it manually. In essence, this is functionally similar to executing \gset on a row-by-row basis against the source result set, and then executing the query against the second database connection.
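The per-row binding could be sketched as below. Assumptions to note: the function name is hypothetical, values are treated as strings, and literal quoting is done naively for illustration, whereas a real implementation would defer to the driver's own quoting or placeholders:

```go
package main

import (
	"fmt"
	"strings"
)

// bindRow sets one variable per column of the current row, then
// substitutes :'name' references in the destination query, quoting
// each value as a SQL string literal. This is a simplified sketch
// of the proposed per-row scope (roughly a row-by-row \gset).
func bindRow(dstQuery string, cols []string, vals []string) string {
	out := dstQuery
	for i, c := range cols {
		// naive literal quoting for illustration only
		quoted := "'" + strings.ReplaceAll(vals[i], "'", "''") + "'"
		out = strings.ReplaceAll(out, ":'"+c+"'", quoted)
	}
	return out
}

func main() {
	q := bindRow(
		"INSERT INTO destination_table (one, two) VALUES (:'a', :'b')",
		[]string{"a", "b"},
		[]string{"1", "x"},
	)
	fmt.Println(q) // INSERT INTO destination_table (one, two) VALUES ('1', 'x')
}
```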

Connection and error behavior

The expected behavior when both a FROM and a TO connection are specified is to connect to the FROM connection (which then becomes the primary connection in usql, effectively the same as a \connect). After the execution of a \copy, the connection to the FROM database will stay open. In this way, subsequent \copy's (or other database queries) would continue to be performed against the primary database connection.

Additionally, the TO database connection would be closed immediately following the end of a \copy call. Any errors encountered would be output to usql's handler's stderr, in the same way as any other errors that are encountered. For the standard \copy command, if any errors are encountered on the secondary connection, the \copy command's execution will be terminated and the connection to the secondary database will be closed. Eventually, a variant (or option) to the \copy command might be made to force a single transaction on the second connection, or to instead provide a special query that is only executed once for all rows in a result set.

nineinchnick commented 3 years ago

Can we start with the most basic implementation, which would only have:

This should not require too much work upfront and would help us figure out usability issues before jumping to adding new usql features like new variable syntax.

I also would like to first focus on what kind of interfaces need to be added to drivers to allow them to use a more efficient copy implementation, wherever possible, with a fallback to a generic loop with INSERT statements. There are some interesting examples in lib/pq and pgx: