yonghah / esri2sf

Scrape features from ArcGIS Server REST API and create simple features dataframe
Other
137 stars 37 forks source link

Feature request: esriIndex function to list all folders and services on ArcGIS server #39

Open elipousson opened 2 years ago

elipousson commented 2 years ago

I think there it may be useful to allow users to return a list of all folders and services hosted by an ArcGIS server. I came up with a function (tentatively named esriIndex although esriStructure could also work) that does this:

esriIndex <- function(url, parent = NULL, recurse = TRUE) {
  urlServer <-
    stringr::str_extract(url, ".+rest/services")

  urlInfo <-
    jsonlite::read_json(paste0(url, "?f=json"), simplifyVector = TRUE)

  folders <-
    tibble::tibble(
      name = urlInfo[["folders"]] |> as.character(),
      type = "Folder",
      url = paste0(urlServer, "/", urlInfo[["folders"]]),
      parent = parent
    )

  services <-
    tibble::tibble(
      name = urlInfo[["services"]][["name"]] |> as.character(),
      type = urlInfo[["services"]][["type"]] |> as.character(),
      url = paste0(urlServer, "/", urlInfo[["services"]][["name"]], "/", urlInfo[["services"]][["type"]], recycle0 = TRUE),
      parent = parent
    )

  urlIndex <-
    dplyr::bind_rows(
      folders,
      services
    )

  if (recurse) {
    if (length(folders[["name"]]) > 0) {
      urlIndex <-
        dplyr::bind_rows(
          urlIndex,
          purrr::pmap_dfr(
            folders,
            ~ esriIndex(url = ..3, parent = ..1)
          )
        )
    }
  }

  urlIndex
}

esriIndex(url = "https://carto.nationalmap.gov/arcgis/rest/services")
#> # A tibble: 11 × 4
#>    name                      type           url                           parent
#>    <chr>                     <chr>          <chr>                         <chr> 
#>  1 Utilities                 Folder         https://carto.nationalmap.go… <NA>  
#>  2 contours                  MapServer      https://carto.nationalmap.go… <NA>  
#>  3 geonames                  MapServer      https://carto.nationalmap.go… <NA>  
#>  4 govunits                  MapServer      https://carto.nationalmap.go… <NA>  
#>  5 map_indices               MapServer      https://carto.nationalmap.go… <NA>  
#>  6 selectable_polygons       MapServer      https://carto.nationalmap.go… <NA>  
#>  7 structures                MapServer      https://carto.nationalmap.go… <NA>  
#>  8 transportation            MapServer      https://carto.nationalmap.go… <NA>  
#>  9 Utilities/Geometry        GeometryServer https://carto.nationalmap.go… Utili…
#> 10 Utilities/PrintingTools   GPServer       https://carto.nationalmap.go… Utili…
#> 11 Utilities/RasterUtilities GPServer       https://carto.nationalmap.go… Utili…

Created on 2021-11-28 by the reprex package (v2.0.1)

A couple of notes:

elipousson commented 2 years ago

Happy new year, @jacpete. This isn't an urgent issue but I figured you wouldn't mind if I tagged you to check if you had a chance to consider this feature request. Suggestions or feedback are welcome!

jacpete commented 2 years ago

Hey @elipousson,

Thanks for the ping, end of the year got pretty crazy and I had forgotten about this issue. I will try to get you a response with my thoughts tonight after I get re-established in where I left off. Happy you're around to discuss stuff with.

jacpete commented 2 years ago

The new function looks great. It was something I had written down to add and I am glad someone beat me to it. Tested it out on a couple url's I use regularly and it worked as expected. My only note for the function in the current form is the inclusion of the new base pipe |>. While I personally don't have a problem with its inclusion as I always tend to run the latest version of R on my machines, we would have to bump the minimum R version for the package to 4.1 which could hurt users where their organization limits the version they have available. I don't see this as a deal breaker, but its something we should discuss before implementing. Other options would be to de-pipe the code or to switch to the {magrittr} pipe and add it as a dependency.

In response to the other comments:

  • esriUrl_isValid only validates MapServer and FeatureServer URLs so the url cannot be passed to esrimeta() without returning an error and the base url can't be extracted by the esri2sf::esriUrl_parseUrl() function for the same reason. It may be helpful to allow these functions to accept server or folder URLs.

I see what you mean here and I will work on a pull request to fix this.

  • I'd love to add the option of returning a list of layers for all services as well as the folders and services. I think this should be relatively straightforward with the esriLayers() function but I ran into some performance issues when testing it out. Suggestions on how to incorporate this are welcome.

I would also love to see this functionality combined in. Would you send an example of where it works and better yet where it doesn't and we can try to root out a fix so this functionality can be added in.

  • The original esri2sf and esri2df functions used the convention of all lowercase function names. I noticed that more recent additions use a camel case convention that matches the conventions of ArcGIS web services. I stuck with the latter in this draft but wasn't sure if all lowercase may be preferred.

This is definitely been influenced by my personal taste. I tend to mix camelCase with underscores being used for separating parts of a name. For example the esriUrl_ prefixed functions all deal with the same type of data (ESRI url strings) and then the actual function name is base of the full function name (esriUrl_isValid, esriUrl_parseUrl). I know this doesn't agree with the original esri2sf function that you rightly noted was all lowercase, but it does follow the camelCase of the functions in the zzz.R file that were there before I started adding stuff to this repo. Totally up for debate on what we decide and use as standards moving forward, just want to make sure we don't change too much of pieces that already exist as exported functions so we don't break peoples existing code (we can create aliases and use deprecation warnings if we do end up changing an exported function). My current (biased) suggestion is to use camelCase for new functions except for the core original function esri2sf and its direct counterpart esri2df. My main reason for avoiding lowercase names is readability but the '2' in these function names solves that issue. @yonghah feel free to chime in on any thoughts you have here as well. I am happy to keep adding functionality and fixing bugs in the package but when it comes to stylistic decisions I want to make sure your'e on board as well.

elipousson commented 2 years ago

Here is an updated version of esriIndex that includes a layers option (and no more pipes). Unfortunately, the layers option definitely keeps kicking back errors when I try to run it on a whole server.

esriIndex <- function(url, parent = NULL, recurse = TRUE, layers = FALSE) {
  urlServer <-
    stringr::str_extract(url, ".+rest/services")

  urlInfo <-
    jsonlite::read_json(paste0(urlServer, "?f=json"), simplifyVector = TRUE)

  folders <-
    tibble::tibble(
      name = as.character(urlInfo[["folders"]]),
      type = "Folder",
      url = paste0(urlServer, "/", urlInfo[["folders"]]),
      parent = parent
    )

  services <-
    tibble::tibble(
      name = as.character(urlInfo[["services"]][["name"]]),
      type = as.character(urlInfo[["services"]][["type"]]),
      url = paste0(urlServer, "/", urlInfo[["services"]][["name"]], "/", urlInfo[["services"]][["type"]], recycle0 = TRUE),
      parent = parent
    )

  urlIndex <-
    dplyr::bind_rows(
      folders,
      services
    )

  if (layers) {
    layers <- purrr::map2_dfr(
      index$url, index$name,
      ~ layerIndex(url = .x, parent = .y)
    )

    urlIndex <-
      dplyr::bind_rows(
        urlIndex,
        layers
      )
  }

  if (recurse) {
    if (length(folders[["name"]]) > 0) {
      urlIndex <-
        dplyr::bind_rows(
          urlIndex,
          purrr::pmap_dfr(
            folders,
            ~ esriIndex(url = ..3, parent = ..1)
          )
        )
    }
  }

  urlIndex
}

Here is the layerIndex function that I added to esriIndex:

layerIndex <- function(url, parent = NULL) {
  layerInfo <- jsonlite::read_json(paste0(url, "/layers?f=json"), simplifyVector = TRUE)

  if (!is.null(layerInfo) && (class(layerInfo[["layers"]]) == "data.frame")) {
    layers <- dplyr::select(
      layerInfo[["layers"]],
      id, name, type, geometryType
    )

    dplyr::mutate(
      layers,
      url = paste0(url,"/", id),
      parent = parent
    ) 
  } else {
    NULL
  }
}

I could set up a pull request for the esriIndex function without the layers parameter now or just wait until we have a chance to work out the problems.

jacpete commented 2 years ago

I am doing a lot of work to rewrite and generalize the esriUrl_* functions right now that I hope to have done tonight or tomorrow. Wait on your pull request until I do a pull request for those changes. Just want to make sure we aren't messing up each other in the process. Once I get that fixed I will take a deeper look into the esriLayers function and see if I can spot what's going on.

jacpete commented 2 years ago

@elipousson Would you be able to give more details about this comment:

  • I'd love to add the option of returning a list of layers for all services as well as the folders and services. I think this should be relatively straightforward with the esriLayers() function but I ran into some performance issues when testing it out. Suggestions on how to incorporate this are welcome.

What kind of performance issues did you get? Were there unexpected errors in the R code or was the server returning error codes? I was trying to test out what the issue could be but I have no examples of the issue you were having. Just as a note the '/layer' subpage is only available for MapServer and FeatureServer urls. If a feature url for one of these services is provided it will truncate the url to the service type.

#All Valid entries
esriLayers("https://sampleserver1.arcgisonline.com/ArcGIS/rest/services/Demographics/ESRI_Census_USA/MapServer/3")
esriLayers("https://sampleserver1.arcgisonline.com/ArcGIS/rest/services/TaxParcel/AssessorsBasemap/MapServer")
esriLayers("https://carto.nationalmap.gov/arcgis/rest/services/contours/MapServer")

#Not Valid (current return shown below)-but better error handling (explicit check for '/(FeatureServer|MapServer)/?$' needs added)
esriLayers("https://sampleserver1.arcgisonline.com/ArcGIS/rest/services/Elevation/ESRI_Elevation_World/GPServer")
# $error
# $error$code
# [1] 400
# 
# $error$message
# [1] "Unable to complete  operation."
# 
# $error$details
# [1] "GPTask 'layers' does not exist or is inaccessible."
# 
# 
elipousson commented 2 years ago

Hello @jacpete! I expect you've been too busy to return to this project (I know I've had a hectic winter and spring) but wanted to check if you need any help with issue #41 or #40 or #43. I can't recall if all three need to be resolved in order to make it even possible to create an esriIndex function or if there is some way to make some more modest changes to get this working even if it stays experimental.

jacpete commented 2 years ago

Hey @elipousson. Thanks again for the prod. I finally got time to finish #40. And I believe it fixes the urlencoding and the token issues mentioned here. I know this was just the tip of the iceberg for your feature request, but I am happy to start tackling it piece by piece now. I think #39 should focus on the esriIndex function you proposed. Can you attach a latest draft of the function after updating the package to include the changes in #40? I'd love to get this added into the package. I am going to close #41 since I think the latest updates should have addressed some of the issues with esriLayers(), but if you are still having trouble with speed in your implementation of esriIndex we can discuss it further here or we can reopen it.

elipousson commented 2 years ago

That all sounds great! Thanks so much for making time for these updates.

I should have time next week or the following weekend to rewrite the esriIndex() function. I started doing something similar with the get_esri_layers function from my {overedge} package which takes a service URL and 1 or more layer integers then returns a named list (with the names based on the layer names from esri2sf::esrimeta()).

I'd also love to add support for buffer distances (since the FeatureServer API has native support for it) although I think that should be tracked in a separate issue and come in a separate pull request.

elipousson commented 2 years ago

I should have waited until next week but instead I stayed up later to update esriIndex and esriLayerIndex. I added a "return" parameter to optionally limit the index to services, folders, layers, or tables.

Updated (2022-05-12): This is a good start but it needs some work on how the URLs are being built and around error handling if the server doesn't have the type of resource included in the "return" parameter. I need to avoid the temptation to keep working on this but I do plan to return in the next week or so.

library(esri2sf)

esriLayerIndex <- function(url, parent = NULL, return = c("layers", "tables")) {
  return <- match.arg(return, c("layers", "tables"), several.ok = TRUE)

  layerIndex <- NULL

  if (grepl(pattern = "(MapServer|FeatureServer)$", url)) {
    serviceInfo <- esriLayers(url)

    if (!is.null(serviceInfo)) {
      if ("layers" %in% return) {
        layerIndex <-
          dplyr::select(
            serviceInfo$layers,
            id, name, type, geometryType
          )
      }

      if ("tables" %in% return) {
        layerIndex <-
          dplyr::bind_rows(
            layerIndex,
            dplyr::select(
              serviceInfo$tables,
              id, name, type
            )
          )
      }

      layerIndex <-
        dplyr::mutate(
          layerIndex,
          url = paste0(url, "/", id)
        )
    }
  }

  return(layerIndex)
}

esriIndex <- function(url, parent = NULL, return = c("services", "folders", "layers", "tables"), recurse = TRUE, layers = FALSE) {
  return <- match.arg(return, c("services", "folders", "layers", "tables"), several.ok = TRUE)

  if (esriUrl_isValidRoot(url) | esriUrl_isValidFolder(url)) {
    rootUrl <- url
  } else {
    parsed <-
      esriUrl_parseUrl(url)

    rootUrl <-
      paste0(parsed$scheme, paste0(c(parsed$host, parsed$instance, parsed$restIndicator), collapse = "/"))
  }

  urlInfo <-
    esrimeta(rootUrl)

  folders <- NULL

  if ("folders" %in% return) {
    folders <- urlInfo$folders

    folders <-
      dplyr::bind_cols(
        name = as.character(folders),
        type = "Folder",
        url = paste0(rootUrl, folders),
        parent = parent
      )
  }

  services <- NULL

  if ("services" %in% return) {
    services <- urlInfo$services

    if (is.null(parent)) {
      services_url <- paste0(rootUrl, services$name, "/", services$type, recycle0 = TRUE)
    } else {
      services_url <- paste0(rootUrl, "/", services$type, recycle0 = TRUE)
    }

    services <-
      dplyr::bind_cols(
        services,
        url = services_url,
        parent = parent
      )
  }

  urlIndex <-
    dplyr::bind_rows(
      folders,
      services
    )

  if (any(c("layers", "tables") %in% return) && !is.null(services) && (nrow(services) > 0)) {
    ValidServices <- vapply(services$url, esriUrl_isValidService, TRUE)

    layerIndex <-
      purrr::map2(
        services$url[ValidServices], services$name[ValidServices],
        ~ esriLayerIndex(url = .x, parent = .y, return = return)
      )

    urlIndex <-
      dplyr::bind_rows(
        urlIndex,
        layerIndex
      )
  }

  if (recurse && !is.null(folders) && (nrow(folders) > 0)) {
    urlIndex <-
      dplyr::bind_rows(
        urlIndex,
        purrr::map2_dfr(
          folders$url, folders$name,
          ~ esriIndex(url = .x, parent = .y, return = return)
        )
      )
  }

  return(urlIndex)
}

testIndex <- esriIndex(url = "https://sampleserver1.arcgisonline.com/ArcGIS/rest/services/")

dplyr::glimpse(testIndex)
#> Rows: 43
#> Columns: 4
#> $ name   <chr> "Demographics", "Elevation", "Locators", "Louisville", "Network…
#> $ type   <chr> "Folder", "Folder", "Folder", "Folder", "Folder", "Folder", "Fo…
#> $ url    <chr> "https://sampleserver1.arcgisonline.com/ArcGIS/rest/services/De…
#> $ parent <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Demographics", "De…

testIndex <- esriIndex("https://geodata.md.gov/imap/rest/services", return = "folders")
#> Error in value[[3L]](cond): Url is not a valid ESRI Service Url.
#> Error code: 403
#> Message: Access to this resource is not allowed

dplyr::glimpse(testIndex)
#> Rows: 43
#> Columns: 4
#> $ name   <chr> "Demographics", "Elevation", "Locators", "Louisville", "Network…
#> $ type   <chr> "Folder", "Folder", "Folder", "Folder", "Folder", "Folder", "Fo…
#> $ url    <chr> "https://sampleserver1.arcgisonline.com/ArcGIS/rest/services/De…
#> $ parent <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Demographics", "De…

testIndex <- esriIndex("https://geodata.md.gov/imap/rest/services/Transportation", return = c("services", "layers"))

dplyr::glimpse(testIndex)
#> Rows: 33
#> Columns: 3
#> $ name <chr> "Transportation/MD_AlternativeFuel", "Transportation/MD_Alternati…
#> $ type <chr> "FeatureServer", "MapServer", "FeatureServer", "MapServer", "Feat…
#> $ url  <chr> "https://geodata.md.gov/imap/rest/services/TransportationTranspor…

Created on 2022-05-12 by the reprex package (v2.0.1)

elipousson commented 2 years ago

This is taking things in a bit of a different direction but I fixed the issues with the prior draft of the index function and effectively rebuilt esriLayers and esrimeta using the {httr2} package. I started on this side track because I wanted to try out the "sitemap" parameter of the Catalog API service (which I think is what the esrimeta function is actually using) and {httr2} supports an XML response. I didn't test it but I think it may also be a little faster than the current versions of these functions and I think it may have options for handling authorization and throttling requests.

Obviously, switching to {httr2} would require a separate issue and likely a dedicated development branch but it may be worth exploring.

# Create an index of folders, services, layers, and tables for an ArcGIS Server
esriIndex <- function(url, parent = NULL, recurse = FALSE, ...) {
  esriResp <- esriCatalog(url, ...)

  index <- NULL
  urlIndex <- url

  if (!!length(esriResp[["folders"]])) {
    folders <-
      dplyr::bind_cols(
        "name" = unlist(esriResp$folders),
        "index" = "folder",
        "type" = NA
      )

    index <-
      dplyr::bind_rows(
        index,
        folders
      )
  }

  if (!!length(esriResp[["services"]])) {
    services <-
      dplyr::bind_rows(esriResp$services)

    services <-
      dplyr::bind_cols(
        services,
        "index" = "service"
      )

    index <-
      dplyr::bind_rows(
        index,
        services
      )
  }

  if (is.null(index)) {
    return(index)
  }

  urlbase <-
    regmatches(
      urlIndex,
      regexpr(pattern = ".+(?=/)", text = urlIndex, perl = TRUE)
    )

  index <-
    dplyr::mutate(
      index,
      url = NULL,
      url = dplyr::if_else(
        grepl(pattern = "/", x = name),
        urlbase,
        urlIndex
      ),
      url = dplyr::case_when(
        (index == "folder") ~ paste0(url, "/", name),
        TRUE ~ paste0(url, "/", name, "/", type)
      )
    )

  if (!is.null(parent)) {
    index <-
      dplyr::bind_cols(
        index,
        "parent" = parent
      )
  }

  if (recurse) {
    folderIndex <- subset(index, index == "folder")

    if (nrow(folderIndex) > 0) {
      folderIndex <-
        purrr::map2_dfr(
          folderIndex$url,
          folderIndex$name,
          ~ esriIndex(url = .x, parent = .y, recurse = TRUE)
        )

      index <-
        dplyr::bind_rows(
          index,
          folderIndex
        )
    }

    layerIndex <- subset(index, type %in% c("MapServer", "FeatureServer"))

    if (nrow(layerIndex) > 0) {
      layerIndex <-
        purrr::map2_dfr(
          layerIndex$url,
          layerIndex$name,
          ~ esriLayers(url = .x, parent = .y)
        )

      index <-
        dplyr::bind_rows(
          index,
          layerIndex
        )
    }
  }

  return(index)
}

# Create an index of layers and tables from an ArcGIS Service
esriLayers <- function(url, parent = NULL, ...) {
  esriResp <- esriCatalog(url, ...)

  index <- NULL

  if (!!length(esriResp[["layers"]])) {
    layers <-
      dplyr::bind_cols(
        dplyr::bind_rows(esriResp$layers),
        "index" = "layer"
      )

    index <-
      dplyr::bind_rows(
        index,
        layers
      )
  }

  if (!!length(esriResp[["tables"]])) {
    tables <-
      dplyr::bind_cols(
        dplyr::bind_rows(esriResp$tables),
        "index" = "table"
      )

    index <-
      dplyr::bind_rows(
        index,
        tables
      )
  }

  if (is.null(index)) {
    return(index)
  }

  index <-
    dplyr::bind_cols(
      index,
      "parent" = parent,
      "url" = paste0(url, "/", index$id)
    )

  return(index)
}

# Get a catalog of folders, services, tables, and layers
esriCatalog <- function(url, format = "json", option = NULL, outSR = NULL, ...) {
  format <- match.arg(format, c("json", "html", "kmz", "sitemap", "geositemap"))
  req <- httr2::request(url)
  req <- httr2::req_url_query(req = req, f = format)

  resp <- httr2::req_perform(req = req)

  if (format == "json") {

    if (!is.null(option) && (option == "footprints")) {

      req <- httr2::req_url_query(req = req, option = option)

      if (!is.null(outSR)) {
        req <- httr2::req_url_query(req = req, outSR = outSR)
      }
    }

    json <- httr2::resp_body_json(resp = resp, check_type = FALSE, ...)

    return(json)
  } else if (format %in% c("sitemap", "geositemap")) {
    sitemap <- httr2::resp_body_xml(resp, ...)

    sitemap <- xml2::as_list(sitemap)

    sitemap <- dplyr::bind_rows("url" = unlist(sitemap, use.names = FALSE))

    return(sitemap)
  }
}
jacpete commented 2 years ago

Got a chance to start looking through this. I started with esriCatalog() since it will become a basis for esriLayers and esriIndex. All three of these will need a token parameter added so that they work with locked down servers. This one one of the things I added to the package with #40. An example is like:

# Get a catalog of folders, services, tables, and layers
esriCatalog <- function(url, format = "json", token = "", option = NULL, outSR = NULL, ...) {
  format <- match.arg(format, c("json", "html", "kmz", "sitemap", "geositemap"))
  req <- httr2::request(url)
  req <- httr2::req_url_query(.req = req, f = format, token = token)

  resp <- httr2::req_perform(req = req)

  if (format == "json") {

    if (!is.null(option) && (option == "footprints")) {

      req <- httr2::req_url_query(req = req, option = option)

      if (!is.null(outSR)) {
        req <- httr2::req_url_query(req = req, outSR = outSR)
      }
    }

    json <- httr2::resp_body_json(resp = resp, check_type = FALSE, ...)

    return(json)
  } else if (format %in% c("sitemap", "geositemap")) {
    sitemap <- httr2::resp_body_xml(resp, ...)

    sitemap <- xml2::as_list(sitemap)

    sitemap <- dplyr::bind_rows("url" = unlist(sitemap, use.names = FALSE))

    return(sitemap)
  }
}

I have access to a secure server at through work that host some MapServers privately behind authorization and can confirm that esriCatalog(url = 'https://<host>/arcgis/rest/services', format = 'sitemap', token = "") gives a shorter list of URL's than esriCatalog(url = 'https://<host>/arcgis/rest/services', format = 'sitemap', token = generateToken(server = 'https://<host>', uid = '<uid>', pwd = '<pwd>')) (relying on the generateToken() function from the currently unmerged #48). This tells me that een though token is not listed as a parameter in Catalog API Service it is still valid and will provide access to the secure services information.

Notes: 1) Still need to walk through and test the modified esriLayers and esriIndex fxns. 2) I like the idea of another issue/pull request to move the rest of the package to httr2 from httr. I am fine making the modifications to the functions as you have above with httr2 in the pull request for esriIndex/esriCatalog and the separate issue to just modernize the rest of the package.

elipousson commented 2 years ago

That all sounds great. Thanks for taking a look at this so quickly. Would it be helpful for me to open a draft pull request for this (and incorporate the token parameter throughout) so we can test the functions as part of the package and incorporate the URL validity check functions where necessary? Or share updated code here but wait until you merge #48 before opening a new branch?

jacpete commented 2 years ago

I'd say go ahead and start a pull request and we can troubleshoot and check things together there. I did merge #48 so go ahead and rebase from master before you do your pull request. My thoughts currently:

elipousson commented 2 years ago

Sounds great! Excited to dig into this but may need to chip away a little bit at a time over the next couple weeks. I think the difference between the layers endpoint and the catalog API service could also explain the slight performance difference (if it actually exists and isn't just my imagination).

elipousson commented 2 years ago

I'll fix up the documentation and add an example to the README so this will be ready to close whenever #50 can be merged.