yihui / knitr

A general-purpose tool for dynamic report generation in R
https://yihui.org/knitr/
2.39k stars 878 forks source link

SQL temp function being recognized as update query, returning no result #2359

Open AndhikaWB opened 3 months ago

AndhikaWB commented 3 months ago

Some SQL engines such as BigQuery provide a way to define temporary user function using the CREATE TEMP FUNCTION syntax. It's similar as create temporary table with the WITH table_name AS syntax but for function

Here I'm replicating WIDTH_BUCKET function from PostgreSQL which isn't available natively in BigQuery:

{sql}

CREATE TEMP FUNCTION WIDTH_BUCKET(
  value NUMERIC,
  min_value NUMERIC,
  max_value NUMERIC,
  num_buckets NUMERIC
) AS (
  RANGE_BUCKET(
    value,
    GENERATE_ARRAY(
      min_value,
      max_value,
      (max_value - min_value) / num_buckets
    )
  )
);

SELECT
  x,
  WIDTH_BUCKET(x, 0, MAX(x) OVER(), 4) as rank_equ_range,
  NTILE(4) OVER(ORDER BY x) AS rank_equ_member
FROM
  UNNEST([1, 1, 2, 5, 6, 7, 8, 9, 48, 100]) AS x

However, it's being executed with dbExecute because it match the CREATE keyword here https://github.com/yihui/knitr/blob/fb1f4231d2df7b156e314686f39fc040ce807513/R/engine.R#L561 https://github.com/yihui/knitr/blob/fb1f4231d2df7b156e314686f39fc040ce807513/R/engine.R#L618

Hence returning no result if I use SQL code chunk directly

Currently, I'm using this workaround to execute it directly in R

{r}

library(DBI)
library(bigrquery)

cred_file <- paste0(
  'docker/root/.config/gcloud/legacy_credentials/',
  Sys.getenv('USER_EMAIL'), '/adc.json'
)

Sys.setenv(GOOGLE_APPLICATION_CREDENTIALS = cred_file)
bq_auth()

con <- dbConnect(
  bigquery(), 
  dataset = 'sales_data', 
  project = 'mumu-431300', 
  use_legacy_sql = FALSE
)

query = 'CREATE TEMP FUNCTION WIDTH_BUCKET(
  value NUMERIC,
  -- min_value is inclusive
  min_value NUMERIC,
  -- max_value is exclusive
  max_value NUMERIC,
  num_buckets NUMERIC
) AS (
  RANGE_BUCKET(
    value,
    GENERATE_ARRAY(
      min_value,
      max_value,
      (max_value - min_value) / num_buckets
    )
  )
);

SELECT
  x,
  WIDTH_BUCKET(x, 0, MAX(x) OVER(), 4) as rank_equ_range,
  NTILE(4) OVER(ORDER BY x) AS rank_equ_member
FROM
  UNNEST([1, 1, 2, 5, 6, 7, 8, 9, 48, 100]) AS x'

dbGetQuery(con, query)

Which has an actual output instead of empty like dbExecute

# A tibble: 10 × 3
       x rank_equ_range rank_equ_member
   <int>          <int>           <int>
 1     1              1               1
 2     1              1               1
 3     2              1               1
 4     5              1               2
 5     6              1               2
 6     7              1               2
 7     8              1               3
 8     9              1               3
 9    48              2               4
10   100              5               4

The possible fix is to exclude the CREATE FUNCTION or CREATE TEMP FUNCTION keyword from the matcher, something like this (I'm bad at regex) CREATE[\s]+(TEMP[\s]+FUNCTION|FUNCTION) but using exclude rule

Or perhaps making a new cell option to force treat it as non-update query, which I'm not sure how to do it

AndhikaWB commented 3 months ago

After a few hours of tinkering, I finally created a working hook to force show the result using this code:

``` {r}
#| cache: false

knitr::knit_hooks$set(force_result = function(before, options) {
  if (before & options$force_result) {
    knitr::opts_hooks$set(sql = options$code)
    knitr::opts_hooks$set(code = c(''))
  } else if (options$force_result) {
    sql <- paste(knitr::opts_hooks$get('sql'), collapse = '\n')
    result <- dbGetQuery(options$connection, sql)
    paste(knitr::kable(result), collapse = '\n')
  }
})

Then call my previous SQL query using the new cell option
```` sql
``` {sql}
#| force_result: true

CREATE TEMP FUNCTION WIDTH_BUCKET(
  value NUMERIC,
  -- min_value is inclusive
  min_value NUMERIC,
  -- max_value is exclusive
  max_value NUMERIC,
  num_buckets NUMERIC
) AS (
  RANGE_BUCKET(
    value,
    GENERATE_ARRAY(
      min_value,
      max_value,
      (max_value - min_value) / num_buckets
    )
  )
);

SELECT
  x,
  WIDTH_BUCKET(x, 0, MAX(x) OVER(), 4) as rank_equ_range,
  NTILE(4) OVER(ORDER BY x) AS rank_equ_member
FROM
  UNNEST([1, 1, 2, 5, 6, 7, 8, 9, 48, 100]) AS x

And the output will now appear

But still, a native solution would be great