wojtekmach / req

Req is a batteries-included HTTP client for Elixir.
https://hexdocs.pm/req
922 stars 102 forks source link

Difference in (url encoding) behaviour compared with HTTPoison #270

Open thbar opened 7 months ago

thbar commented 7 months ago

While migrating a part of our snapshot crawler (https://github.com/etalab/transport-site/pull/3585), I did some largish scale testing, comparing the behaviour of HTTPoison and Req in detail.

One thing that came out is that urls with pipes | will result in errors, while HTTPoison for some reason (probably linked to how hackney works underneath) download them just fine.

Here is a reproduction on production urls:

data = [
  "https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=TIC_URB|TIC_INT|ALLOTIC&dataFormat=Netex&dataProfil=OPENDATA",
  "https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=TIC_URB|TIC_INT|ALLOTIC&dataFormat=Netex&dataProfil=OPENDATA",
  "https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=COROLIS_URB|COROLIS_INT&dataFormat=NETEX&dataProfil=OPENDATA",
  "https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=TIC_URB|TIC_INT|ALLOTIC&dataFormat=Netex&dataProfil=OPENDATA",
  "https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=COROLIS_URB|COROLIS_INT&dataFormat=NETEX&dataProfil=OPENDATA"
]

data
|> Enum.each(fn x ->
  IO.puts("==========================")
  IO.puts(x)
  IO.inspect(Req.get(x, compressed: false, decode_body: false))
  IO.inspect(Req.get(x |> String.replace("|", URI.encode("|")), compressed: false, decode_body: false))
end)

Typically Req without the replacement will return:

{:error,
 %Mint.HTTPError{
   reason: {:invalid_request_target,
    "/dataflow/offre-tc/download?provider=COROLIS_URB|COROLIS_INT&dataFormat=NETEX&dataProfil=OPENDATA"},
   module: Mint.HTTP1
 }}

While it will download the file just fine after a replacement.

I'm not sure who is in the right, but this could catch HTTPoison users off-guard!

wojtekmach commented 7 months ago

Thank you for the report. | is not allowed verbatim in URLs and needs to be escaped.

Req should have crashed sooner and not give invalid URL to Mint. It will when we switch from URI.parse to URI.new!, https://github.com/wojtekmach/req/issues/245. In this particular case we'd have received unhelpful error message:

iex> URI.new!("http://localhost/A|B")
** (URI.Error) cannot parse due to reason invalid_uri: ":"

and I have reported it upstream, https://github.com/erlang/otp/issues/7862.

Hackney probably has more relaxed URI parser.

I'll do a survey how other clients deal with this, whether they require escaping, because it might be more pragmatic in Req to let it slide.

thbar commented 7 months ago

Ok, thanks for the confirmation! I have implemented a quick fix and we will generalise it.

Apparently we have other things needed even for HTTPoison anyway (https://github.com/etalab/transport-site/blob/02c9c8c213758019cfb6525aac8cb9bfde09d383/apps/transport/lib/transport/import_data.ex#L334-L359).

Indeed I wonder how of a good idea it would be to allow this. curl and browsers will gently download the files.

It could catch people migrating to Req a bit off guard !

wojtekmach commented 7 months ago

I need to find URL examples where this would be undesirable but seems URL-encoding the whole thing might work too, i.e. assume they are not encoded:

iex> URI.encode("https://example.com/A|B")
"https://example.com/A%7CB"
thbar commented 7 months ago

I quite seem to recall I stumbled on cases where this lead to troubles. I will try to report back with précise urls.

Le mar. 14 nov. 2023 à 21:22, Wojtek Mach @.***> a écrit :

I need to find URL examples where this would be undesirable but seems URL-encoding the whole thing might work too, i.e. assume they are not encoded:

iex> URI.encode("https://example.com/A|B") "https://example.com/A%7CB"

— Reply to this email directly, view it on GitHub https://github.com/wojtekmach/req/issues/270#issuecomment-1811187935, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAACPHNZU53BOEY4PQUWAXLYEPHIHAVCNFSM6AAAAAA7LKZBSKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMJRGE4DOOJTGU . You are receiving this because you authored the thread.Message ID: @.***>

thbar commented 7 months ago

I have extracted a list of urls from our pool of resources, filtering where "encoded url" != "the url itself" (and filtering out the "pipe" case which we have already discussed).

Sorry, this is very raw, but at least provide a large set of urls in the wild with various servers implementations. Maybe it can be useful during testing, unsure yet!

I have not yet tested which urls are correctly handled by HTTPoison vs Req (did that a few weeks back, would have to run that again).

[
  {true, 973, "7d162b4d-db84-49e9-b4ab-bd4f9f87b004",
   "https://data.ampmetropole.fr/api/explore/v2.1/catalog/datasets/voies-exceptionnelles-zfe-m-marseille/exports/geojson?lang=fr&timezone=Europe%2FBerlin",
   "https://data.ampmetropole.fr/api/explore/v2.1/catalog/datasets/voies-exceptionnelles-zfe-m-marseille/exports/geojson?lang=fr&timezone=Europe%252FBerlin"},
  {true, 688, "3349a69a-bc9a-4de4-a2c2-9f9d32e4923c",
   "https://notify.ratpdev.com/api/networks/RD%20TPM/alerts/gtfsrt",
   "https://notify.ratpdev.com/api/networks/RD%2520TPM/alerts/gtfsrt"},
  {true, 948, "7f0f6bcc-77a1-4f52-957a-95060a1c6ac7",
   "https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=TIC_URB%7CTIC_INT%7CALLOTIC&dataFormat=Netex&dataProfil=OPENDATA",
   "https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=TIC_URB%257CTIC_INT%257CALLOTIC&dataFormat=Netex&dataProfil=OPENDATA"},
  {true, 566, "1d919c39-41d4-4453-998c-9417c45b2f28",
   "https://notify.ratpdev.com/api/networks/VIENNE%20MOBI/alerts/gtfsrt",
   "https://notify.ratpdev.com/api/networks/VIENNE%2520MOBI/alerts/gtfsrt"},
  {true, 956, "7a6bc31b-1507-46cb-8281-9946afbffdeb",
   "https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=COROLIS_URB%7CCOROLIS_INT&dataFormat=NETEX&dataProfil=OPENDATA",
   "https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=COROLIS_URB%257CCOROLIS_INT&dataFormat=NETEX&dataProfil=OPENDATA"},
  {true, 952, "2de1872e-64dd-4234-af73-1eb4cb48b116",
   "https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=COROLIS_URB%7CCOROLIS_INT&dataFormat=NETEX&dataProfil=OPENDATA",
   "https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=COROLIS_URB%257CCOROLIS_INT&dataFormat=NETEX&dataProfil=OPENDATA"},
  {true, 959, "4b3466a1-b586-408b-a262-e9694dbd72fb",
   "https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=TIC_URB%7CTIC_INT%7CALLOTIC&dataFormat=Netex&dataProfil=OPENDATA",
   "https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=TIC_URB%257CTIC_INT%257CALLOTIC&dataFormat=Netex&dataProfil=OPENDATA"},
  {true, 825, "afbe14f0-aef5-45f1-8e92-9757d2cf770a",
   "https://thapaas-prd-storage.thalys.com/datagouv/gtfs-realtime.bin?sv=2021-10-04&st=2023-03-09T14%3A40%3A38Z&se=2050-03-10T14%3A40%3A00Z&sr=b&sp=r&sig=2xTfLbvsxTzlLavx%2BI1TJ2capp085ArXJYDA7i4IT04%3D",
   "https://thapaas-prd-storage.thalys.com/datagouv/gtfs-realtime.bin?sv=2021-10-04&st=2023-03-09T14%253A40%253A38Z&se=2050-03-10T14%253A40%253A00Z&sr=b&sp=r&sig=2xTfLbvsxTzlLavx%252BI1TJ2capp085ArXJYDA7i4IT04%253D"},
  {true, 825, "cf7adb62-bbfe-4f1f-93f7-dbbad9fd60e4",
   "https://thapaasblobsprod.blob.core.windows.net/datagouv/gtfs_static.zip?sv=2021-08-06&st=2023-01-11T11%3A12%3A00Z&se=2050-01-13T11%3A12%3A00Z&sr=b&sp=r&sig=1CLeDy4QoLgKwRx63BMp%2BDFSnqH1IUi14k8qg1auk%2FU%3D",
   "https://thapaasblobsprod.blob.core.windows.net/datagouv/gtfs_static.zip?sv=2021-08-06&st=2023-01-11T11%253A12%253A00Z&se=2050-01-13T11%253A12%253A00Z&sr=b&sp=r&sig=1CLeDy4QoLgKwRx63BMp%252BDFSnqH1IUi14k8qg1auk%252FU%253D"},
  {true, 957, "4e91e313-b1ef-444d-a303-898d3e148c67",
   "https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=TIC_URB%7CTIC_INT%7CALLOTIC&dataFormat=Netex&dataProfil=OPENDATA",
   "https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=TIC_URB%257CTIC_INT%257CALLOTIC&dataFormat=Netex&dataProfil=OPENDATA"}
]

For my own reference (since this cannot be easily ran in this current form, as it refers to parts of our application), the code used to generate this is:

#! mix run
Code.require_file(__DIR__ <> "/irve/req_custom_cache.exs")

import Ecto.Query

defmodule Downloader do
  def cache_dir, do: Path.join(__ENV__.file, "../cache-dir") |> Path.expand()

  def get!(url) do
    req = Req.new() |> CustomCache.attach()
    %{body: body, status: 200} = Req.get!(req, url: url, custom_cache_dir: cache_dir())
    body
  end
end

base_url = "https://www.data.gouv.fr"

task = fn dataset ->
  url = base_url <> "/api/1/datasets/#{dataset.datagouv_id}/"
  # IO.puts url
  {dataset.id, Downloader.get!(url)}
end

DB.Dataset
|> where([d], d.is_active == true)
|> DB.Repo.all()
# |> Enum.take(2)
|> Task.async_stream(task, max_concurrency: 20, timeout: :infinity)
|> Enum.map(fn {:ok, {dataset_id, response}} ->
  response["resources"]
  |> Enum.map(fn r ->
    url = r["url"]
    url = String.replace(url, "|", URI.encode("|"))
    encoded_url = URI.encode(url)
    {encoded_url != url, dataset_id, r["id"], url, encoded_url}
  end)
end)
|> List.flatten()
|> Enum.filter(fn {diff, _, _, _, _} -> diff end)
|> IO.inspect(IEx.inspect_opts())
wojtekmach commented 4 months ago

@thbar could you double-check this? Maybe Mint changed in the meantime? I can't reproduce the failure anymore:

iex> Req.get!("https://httpbin.org/anything?a[b]=1|2").body["args"]
%{"a[b]" => "1|2"}
thbar commented 4 months ago

@wojtekmach I did a bit of investigation (first in our app & with the code at the top of this issue), but ultimately I bisected it here (see below).

The change is not related to Mint but, I believe, to Req 0.4.10 switch to URI.parse.

Script

version_specifier = System.fetch_env!("REQ_VERSION")

IO.puts("Starting with req #{version_specifier}")

Mix.install([
  {:req, version_specifier},
  # added because 0.4.10 cannot be installed without plug
  {:plug, "~> 1.15"}
])

Req.get!("https://httpbin.org/anything?a[b]=1|2").body["args"]

IO.puts("OK")

Behaviour on Req 0.4.9

❯ REQ_VERSION=0.4.9 elixir scripts/req.exs
Starting with req 0.4.9
** (Mint.HTTPError) invalid request target: "/anything?a[b]=1|2"
    (req 0.4.9) lib/req.ex:978: Req.request!/2
    scripts/req.exs:10: (file)

Behaviour on Req 0.4.10

❯ REQ_VERSION=0.4.10 elixir scripts/req.exs
Starting with req 0.4.10
OK
thbar commented 4 months ago

@wojtekmach maybe it would be worth just adding a test in the codebase with "https://httpbin.org/anything?a[b]=1|2", so that we ensure it does not change unexpectedly in the future? Just an idea.

All in all, it is an interesting thing to note for us, I will do more testing while upgrading.

wojtekmach commented 4 months ago

Sure I can add the regression test. Do you think we can close this issue then or you'd rather do some more testing in this area?

Btw I cannot reproduce:

$ elixir -e 'Mix.install([{:req, "0.4.9"}]) ; %{status: 200} = Req.get!("https://httpbin.org/anything?a[b]=1") ; IO.puts("ok")'
ok

Perhaps you had a cache with prior finch or mint version? Could you run the following? Remember in your script you had a dependency on :plug so include it too, it's part of the cache key,

$ elixir -e 'Mix.install([{:req, "0.4.9"}]) ; IO.inspect(finch: Application.spec(:finch, :vsn), mint: Application.spec(:mint, :vsn))'
[finch: ~c"0.17.0", mint: ~c"1.5.2"]
AntoineAugusti commented 4 months ago

@wojtekmach This is not the same URL, the issue happens when there is a pipe.

elixir -e 'Mix.install([{:req, "0.4.9"}]); %{status: 200} = Req.get!("https://httpbin.org/anything?a[b]=1|2"); IO.puts("ok")'
** (Mint.HTTPError) invalid request target: "/anything?a[b]=1|2"
    (req 0.4.9) lib/req.ex:978: Req.request!/2
    (stdlib 3.17.2.4) erl_eval.erl:685: :erl_eval.do_apply/6
    (stdlib 3.17.2.4) erl_eval.erl:446: :erl_eval.expr/5
    (stdlib 3.17.2.4) erl_eval.erl:123: :erl_eval.exprs/5
    (elixir 1.15.5) lib/code.ex:543: Code.validated_eval_string/3
thbar commented 4 months ago

Yes, the change of behaviour is linked to the pipe indeed, as @AntoineAugusti pointed out! (thank you!)

wojtekmach commented 4 months ago

Right, my bad. I'm able to reproduce this locally:

~% elixir -e 'Mix.install([{:req, "0.4.9"}]); %{status: 200} = Req.get!("https://httpbin.org/anything?a[b]=1|2"); IO.puts("ok")'
** (Mint.HTTPError) invalid request target: "/anything?a[b]=1|2"
    (req 0.4.9) lib/req.ex:978: Req.request!/2
    (stdlib 5.2) erl_eval.erl:750: :erl_eval.do_apply/7
    (stdlib 5.2) erl_eval.erl:494: :erl_eval.expr/6
    (stdlib 5.2) erl_eval.erl:136: :erl_eval.exprs/6
~% elixir -e 'Mix.install([{:req, "0.4.11"}]); %{status: 200} = Req.get!("https://httpbin.org/anything?a[b]=1|2"); IO.puts("ok")'
ok

FWIW it seems you're good on the latest version but yeah I'm really curious what's the root cause. I'll post here when I find out.

thbar commented 4 months ago

Right, my bad. I'm able to reproduce this locally:

No problem - the url includes too many weird characters for a url anyway :-)

Happy to help out if we can btw!

Thanks for tracking this.

wojtekmach commented 4 months ago

OK, this is pretty wild. Turns out this is difference between HTTP/1.1 and HTTP/2. Req since 0.4.11 starts on HTTP/1.1 and tries to switch to HTTP/2 using ALPN and on HTTP/2 this is not a problem. If we force using HTTP/1 we see the crash. So it's something in Mint after all.

iex> elixir -e 'Mix.install([{:req, "0.4.11"}]) ; %{status: 200} = Req.get!("https://httpbin.org/anything?a[b]=1|2", connect_options: [protocols: [:http1]])'
** (Mint.HTTPError) invalid request target: "/anything?a[b]=1|2"
    (req 0.4.11) lib/req.ex:978: Req.request!/2
    (stdlib 5.2) erl_eval.erl:750: :erl_eval.do_apply/7
    (stdlib 5.2) erl_eval.erl:494: :erl_eval.expr/6
    (elixir 1.17.0-dev) lib/code.ex:568: Code.validated_eval_string/3
thbar commented 4 months ago

@wojtekmach thanks for diving into this! Pretty wild indeed, and I would be happy to ensure we have some form of non-regression testing on that.

FYI we've upgraded on our side (see above), if anything problematic arises (given that we crawl a decently large number of urls), I will mention this back here!

Happy to help out in any case.