Function blackmarbler::bm_extract is not downloading

worldbank / blackmarbler

Georeferenced Rasters and Statistics of Nighttime Lights from NASA Black Marble

https://worldbank.github.io/blackmarbler/

Other

16 stars 3 forks source link

Function blackmarbler::bm_extract is not downloading #10

Open alexyshr opened 5 months ago

alexyshr commented 5 months ago

Can you please advise me on why the function blackmarbler::bm_extract is not downloading the files as expected? Despite the system indicating that the files were processed 100%, no downloads are appearing in my designated folders. I used a local Administrator account on Windows 10 to run the R script, but no files are being written to the disk drive. I did not encounter any error messages or warnings during the execution of the script. I just installed the development version of blackmarbler package from GitHub, but It didn't work with the CRAN version either. I'm running the last R version [4.3.3]. Any assistance in resolving this issue would be greatly appreciated. I am using my NASA-EARTHDATA-TOKEN to define the bearer variable.

library(sf)
library(dplyr)
library(ggplot2)
library(purrr)
library(lubridate)
bearer <- "NASA-EARTHDATA-TOKEN"
#test
roi_sf <- gadm(country = "GHA", level = 1, path = tempdir()) |> st_as_sf()
blackmarbler::bm_extract(
  roi_sf = roi_sf,
  product_id = "VNP46A2",
  date = seq.Date(
    from = lubridate::ymd("2023-01-01"),
    to = lubridate::ymd("2023-01-02"), by = 1
  ),
  bearer = bearer,
  output_location_type = "file",
  file_dir = file.path(getwd())
)

This is the outcome: error

alexyshr commented 4 months ago

Just to clarify, the function is always downloading to temporal files but not to the folder defined in file_dir

ramarty commented 4 months ago

@alexyshr Thanks for flagging this! The issue should be fixed in the dev version (# install.packages("devtools") devtools::install_github("worldbank/blackmarbler"), but let me know if you're still running into issues.

alexyshr commented 4 months ago

Thank you @ramarty. After reinstalling from GitHub. I am using the function as below:

  ntl_counties_df <- bm_extract(
    roi_sf = counties,
    product_id = "VNP46A2",
    date = days_to_get_ntl,
    bearer = bearer,
    variable = variable,
    quality_flag_rm = quality_flag_rm,
    output_location_type = "file",
    file_dir = file_dir,
    file_skip_if_exists = TRUE
  )

Inside file_dir I have all the hdf5 files. After running the code the bm_extract command placed all the Rds (e.g. VNP46A2_t2017_08_21.Rds) files, but their size is only 3k.

The process fails. I tried multiple times. The content of ntl_counties_df is empty or null. After the second execution of the code, the system does not try to download files anymore.

Thank you!

alexyshr commented 4 months ago

The content of the Rds files is OK, but the combination of those results in long format inside the variable ntl_counties_df does not work.

ramarty commented 4 months ago

@alexyshr This is all helpful feedback. A couple things:

1/ After running the below code, do you get values for ntl_counties_df?

ntl_counties_df <- bm_extract( roi_sf = counties, product_id = "VNP46A2", date = days_to_get_ntl, bearer = bearer, variable = variable, quality_flag_rm = quality_flag_rm )

2/ I'll update the documentation to make this more clear (or just update what the package does to give a more expected result). But when:

output_location_type = "file", both bm_extract and bm_raster only output data to the directory specified in file_dir. The functions don't return data to R. (Right now I tibble of 0 rows is returned, but I should change this to NULL -- or maybe have the function load all the Rds files, append, then return that -- in addition to saving the Rds files.
When output_location_type = "memory" (the default), the functions return data. So here, ntl_counties_df should then have data -- but no Rds files are saved.

I've used output_location_type = "file" if I want to download data for a lot of dates, then I'll do something like: list.files(file_dir, full.names = F, pattern = "*.Rds") |> map_df(readRDS) to get the appended dataframe.

Happy to clarify more, or let me know if the above code still doesn't work.

Thanks!

alexyshr commented 4 months ago

@ramarty Thank you!

1/ It returns data in ntl_counties_df after downloading all the HDF5 files to temporal places.

When using these options:

ntl_counties_df <- bm_extract(
roi_sf = counties,
product_id = "VNP46A2",
date = days_to_get_ntl, # as.character "2018-05-25",#,
bearer = bearer,
variable = variable,
quality_flag_rm = quality_flag_rm,
output_location_type = "memory",
file_dir = file_dir,
file_skip_if_exists = TRUE
)

After re-downloading all the HDF5 files to temporary locations, the result is stored in ntl_counties_df.

When using these options:

  ntl_counties_df <- bm_extract(
    roi_sf = counties,
    product_id = "VNP46A2",
    date = days_to_get_ntl, # as.character "2018-05-25",#,
    bearer = bearer,
    variable = variable,
    quality_flag_rm = quality_flag_rm,
    output_location_type = "file",
    file_dir = file_dir,
    file_skip_if_exists = TRUE
  )
  ntl_counties_df <- file_dir |>
    list.files(pattern = "*.Rds", full.names = TRUE) |>
    purrr::map_df(readRDS)

I was able to gather all the rds inside ntl_counties_df. The commands did not initiate any additional downloads of files.

There are multiple issues with previous approach (3)

4.1

If I change the variable to read from HDF5 files, the system does not replace existing Rds files

VNP46A2_t2017_08_29.Rds" already exists; skipping.
VNP46A2_t2017_08_30.Rds" already exists; skipping
...

4.2

What if we have other Rds (not related) inside that folder. The integrated Rds will be wrong.

For instance, an additional Rds created as below. Those files will provoke a bad result.

  # # Append daily-level datasets into one file

  file_dir |>
    list.files(pattern = "*.Rds", full.names = TRUE) |>
    purrr::map_df(readRDS) |>
    saveRDS(file.path(
      file_dir,
      paste0(hurricanename, "_", as.character(year), "_ntl_daily.Rds")
    ))

4.3 What if we change the geometry of roi_sf? The Rds need to be recreated.

4.4 What if we change other parameters of the package commands? The Rds need to be recreated.

I think a better solution could be:

RDS creation should always occur within a temporary folder or within the specified file_dir with guaranteed unique names. This implies that the commands will not reuse RDS files (unless you have a way to prove that the exact command outcome was already created -- then the reuse of Rds will be valid)
Our package functions should internally utilize the purrr::map_df operation to consistently return the data after executing the functions, regardless of the chosen file or memory options (output_location_type).
In the event of creating temporary folders, perhaps the command's output could report the path of the temporary folder to the user.

Thank you!

alexyshr commented 4 months ago

I thought that removing the Rds would be the solution to avoid re-downloading HDF5 files, but it seems that once no Rds are located inside file_dir (only hdf5 files are inside it), the next code is again downloading all HDF5 files. The step that I want to avoid is to download HDF5 files multiple times.

  ntl_counties_df <- bm_extract(
    roi_sf = counties,
    product_id = "VNP46A2",
    date = days_to_get_ntl, # as.character "2018-05-25",#,
    bearer = bearer,
    variable = variable,
    quality_flag_rm = quality_flag_rm,
    output_location_type = "file",
    file_dir = file_dir,
    file_skip_if_exists = TRUE
  )

Thank you for your time. I am not in a hurry. Take your time. I know that this package is just starting. I will try working with HDF5 files directly using starsor terraand sf.

ramarty commented 4 months ago

@alexyshr These are excellent comments — thanks so much. Great point about making variable names more unique—such as incorporating the variable in the name.

And yep, definitely agree about returning the appended dataframe even when output_location_type = "file" is specified. Agree that would make more sense for the user.

Helpful point too on wanting to keep the HDF5 files to, for example, extract multiple variables but avoid re-downloading the data (which can be slow). I'll think on the best way to implement this, but figure could add a function that just downloads HDF5 files to a directory, then could add an option in bm_raster and bm_extract to read data from the HDF5 files instead of downloading those files.

Thanks again — will work on making these improvements!

alexyshr commented 4 months ago

Hi @ramarty ,

I have edited the current version in GitHub to avoid downloading HDF5 files if they exist in file_dir.

I know you are changing multiple things in the package (migration to TERRA) but I would like you to consider the changes that can be useful for the user. Please find the code attached.

I did all the changes testing the function bm_extract. I think the function bm_raster is also working by default, but some changes in the messages and warnings need to be made (in case you decide to accept some or all the proposed changes).

Main considerations:

check_all_tiles_exist: If TRUE the code does not stop but reports the missing tiles (messages and warnings). Currently, it also generates a CSV file with the report of missing tiles. If FALSE it does not stop or report missing tiles.
file_skip_if_exists: It is related only to override or not Rds files.
In your original version, it seems add_n_pixels = FALSE does not work when interpol_na = TRUE (not fixed here).
I have tested all the variable values inside the function bm_raster and it seems to work.
bm_extract always returns the data in long format (independent of file_dir value)

You can test this version with the code below (those dates in the code have one missing tile for the Puerto Rico area)

source("blackmarbler.R")

#library(blackmarbler)
library(geodata)
library(sf)
library(stars)
library(raster)
library(ggplot2)
library(dplyr)
library(tools)
library(terra)
library(zoo)
library(lubridate)

library(purrr) #blackmarbler.R
library(httr)
library(exactextractr)
library(tidyr)
library(knitr)
library(readr)
library(hdf5r)
library(tidyr)
library(stringr)

#libraries for the blackmarbler function
library(stringr)

bearer <- "..."

variable <- "Gap_Filled_DNB_BRDF-Corrected_NTL"
#[1]:"DNB_BRDF-Corrected_NTL"
#[2]:"DNB_Lunar_Irradiance"
#[3]:"Gap_Filled_DNB_BRDF-Corrected_NTL"
#[4]:"Latest_High_Quality_Retrieval"
#[5]:"Mandatory_Quality_Flag"
#[6]:"QF_Cloud_Mask"
#[7]:"Snow_Flag"

file_dir <- "C:/temp/bm"

the_dates  <- c("09/27/2017", "09/30/2017")
start_date <- as.Date(the_dates[1], format = "%m/%d/%Y")

stop_date <- as.Date(the_dates[2], format = "%m/%d/%Y")

days_to_get_ntl <- seq(start_date, stop_date, by = "day")

#Retrieve GADM polygons of Puerto Rico (PRI) (4326)
#
the_geometry <- gadm(
  country = "PRI",
  level = 1,
  path = "C:/temp/bm",
) |>
  st_as_sf()

#filter the field NAME to get only "Guaynabo" and "Arroyo"
the_geometry <- the_geometry %>%
  filter(NAME_1 %in% c("Guaynabo", "Arroyo"))

aggregation_fun <- c("mean")

values_df_long <- bm_extract(
  roi_sf = the_geometry,
  product_id = "VNP46A2",
  date = days_to_get_ntl,
  bearer = bearer,
  aggregation_fun = aggregation_fun,
  add_n_pixels = TRUE,
  variable = variable,
  quality_flag_rm = 2,
  check_all_tiles_exist = TRUE,
  interpol_na = FALSE,
  output_location_type = "file", # memory, file
  file_dir = file_dir,
  file_prefix = NULL,
  file_skip_if_exists = FALSE, #TRUE do not replace
  quiet = FALSE
)
[blackmarbler.zip](https://github.com/worldbank/blackmarbler/files/14875276/blackmarbler.zip)

#bm_raster is working but some updates are needed
# raster_stack <- bm_raster(
#   roi_sf = the_geometry,
#   product_id = "VNP46A2",
#   date = days_to_get_ntl,
#   bearer = bearer,
#   variable = variable,
#   quality_flag_rm = 2,
#   check_all_tiles_exist = FALSE,
#   interpol_na = FALSE,
#   output_location_type = "file", # memory, file
#   file_dir = file_dir,
#   file_prefix = NULL,
#   file_skip_if_exists = FALSE, #TRUE do not replace
#   quiet = FALSE
# )

ramarty commented 4 months ago

@alexyshr thanks so much -- these changes are great & I agree makes sense. Aiming to go through this week, so should have an update by end of week!

ramarty commented 4 months ago

@alexyshr Apologies that its taken me forever to make changes, but few updates (on the development version -- not on cran yet)

Switched from raster to terra
Added an h5_dir parameter. If NULL (default), h5 files get saved to a working directory & deleted. However, if you set a file path, h5 files get saved to that file path and are not deleted -- and the function first checks if needed h5 files are in the directory before downloading
When output_location_type is set to file, the function still returns expected output (raster or dataframe)
output_location_type is set to file, the filenames default to more unique names (eg, the variable is included)

I'll do some additional tests before updating these to CRAN -- but definitely let me know if you run into more issues or if have more feedback!

(Also working on allowing one to use bm_extract using terra::extract -- potentially an option to use either that or exact_extract).

alexyshr commented 4 months ago

That is wonderful. I will check this tomorrow. Thank you!

alexyshr commented 2 months ago

Hi there!

The function does not download R files to file_dir. I installed it from GitHub. hdf5 files are downloaded to h5_dir.

library(geodata)
library(sf)
library(terra)
library(ggplot2)
library(tidyterra)
library(lubridate)
library(stringr)
library(dplyr)

library(blackmarbler)

bearer <- "THE_BEARER_KEY"

# ROI

# # # # Retrieve GADM polygons of USA states (4326)
roi_sf <- gadm(
  country = "USA",
  level = 1,
  path = "./data/"
) |>
  st_as_sf()

# # #filter roi_sf to get florida state (ISO_1 == "US-FL") (dplyr)
roi_sf <- roi_sf %>%
  filter(ISO_1 == "US-FL")

##################################################################
# Structure of the folders to store the data:
#
dirname_base <- "U:/Ohio2021-1/DataBlackMarble/"

# organization of folders inside dirname_base
#  ./bm_files/daily/
# Create directories to store R data (not hdf5 files)
dir.create(file.path(dirname_base, "data_temp", "bm_files"))
dir.create(file.path(dirname_base, "data_temp", "bm_files", "daily"))

##################################################################
# Extract daily data: "VNP46A2"

the_df <- bm_extract(
  roi_sf = roi_sf,
  product_id = "VNP46A2",
  variable = "Gap_Filled_DNB_BRDF-Corrected_NTL",
  date = seq.Date(from = ymd("2017-09-15"), to = ymd("2017-09-29"), by = 1),
  bearer = bearer,
  output_location_type = "file",
  file_dir = file.path(dirname_base, "data_temp", "bm_files", "daily"),
  # file_skip_if_exists = FALSE, #default TRUE
  # file_return_null = TRUE, #default FALSE
  h5_dir = file.path(dirname_base, "h5_temp2")
)

ramarty commented 2 months ago

Ah thanks for flagging! I'll look into the code and fix in the next day or so!

alexyshr commented 1 month ago

Hi,

As the code is working in a tryCatch it doesn't show errors.

My errors were:

I didn't load all the needed libraries

library(geodata)
library(sf)
library(terra)
library(stars)
library(ggplot2)
library(tidyterra)
library(lubridate)
library(stringr)
library(dplyr)
library(purrr)
library(hdf5r)

Is it possible to load the libraries in the initial part of each function?

I had to change this line:

h5_data <- h5file(the_h5_file, "r") #r+

in order to make next line to work

outr <- terra::rast(out,
    crs = myCrs,
    extent = c(xMin, xMax, yMin, yMax)
  )

My bearer was failing, so I had to renew it.

I have an additional question. What is the purpose of this nodata_val <- NA and the next line out[out == nodata_val] <- NA. It seems it is doing nothing.

The function file_to_raster can be more or less replaced by outr <- terra::rast(f) as terra understand h5 files, read the metadata and define correctly CRS, bounding box, etc. Please note that the use of f as variable name is not good for debugging purposes considering that f is finish while you are debugging.

Thanks

Alexys