zerovm / zerocloud

Swift middleware for Zerocloud
Apache License 2.0

CGI support #83

Open pkit opened 10 years ago

pkit commented 10 years ago

Some CGI variables are exposed to the application to make it aware of the environment.

Currently it looks like this:

CONTENT_LENGTH="558"
CONTENT_TYPE="application/x-pickle"
HTTP_X_OBJECT_META_KEY2="val2"
HTTP_ETAG="48ffde7b7ae928b65147f7281530a4e3"
HTTP_X_TIMESTAMP="1404134925.00605"
HTTP_X_OBJECT_META_KEY1="val1"
DOCUMENT_ROOT="/dev/stdin"
SERVER_NAME="localhost"
SERVER_SOFTWARE="zerocloud"
GATEWAY_INTERFACE="CGI/1.1"
SCRIPT_FILENAME="swift://a/c/exe2"
SCRIPT_NAME="http"
REQUEST_METHOD="POST"
HTTP_HOST="localhost:80"
PATH_INFO="/a/c/o3"
SERVER_PORT="80"
SERVER_PROTOCOL="HTTP/1.0"
QUERY_STRING="param1=v1&param2=v2"
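A job sees these variables through its ordinary process environment. As a quick illustration (plain Python; the assignments only simulate values from the listing above, which are already present inside a real job):

```python
import os

# Simulate a few variables from the listing above; in a real
# Zerocloud job these are already set by the middleware.
os.environ["REQUEST_METHOD"] = "POST"
os.environ["PATH_INFO"] = "/a/c/o3"
os.environ["QUERY_STRING"] = "param1=v1&param2=v2"
os.environ["HTTP_X_OBJECT_META_KEY1"] = "val1"

method = os.environ["REQUEST_METHOD"]
path = os.environ["PATH_INFO"]
query = os.environ["QUERY_STRING"]

# Object metadata arrives as HTTP_X_OBJECT_META_* variables.
prefix = "HTTP_X_OBJECT_META_"
meta = {key[len(prefix):].lower(): value
        for key, value in os.environ.items()
        if key.startswith(prefix)}
```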

We can divide the vars into groups.

  1. Static ones, usually never change.

    SERVER_SOFTWARE="zerocloud"
    GATEWAY_INTERFACE="CGI/1.1"
  2. Related to the Swift setup; these will change if Swift is installed differently.

    SERVER_PORT="80"
    SERVER_NAME="localhost"
    SERVER_PROTOCOL="HTTP/1.0"
  3. Related to the request. The values are taken verbatim from the HTTP request

    QUERY_STRING
    HTTP_HOST
    REMOTE_ADDR
    REMOTE_USER
    HTTP_USER_AGENT
    HTTP_REFERER
    HTTP_ACCEPT
    HTTP_ACCEPT_ENCODING
    HTTP_ACCEPT_LANGUAGE
  4. All others. These need the most attention as we are "emulating" things here.

    SCRIPT_NAME - right now it's a path to executable taken from job description. I want to alter it and use the actual name/path of the executable here (like "python" or "/bin/wc" for example).

    SCRIPT_FILENAME - right now unused. I want it to be the path from job description (like "swift://account/cont/app.nexe" or "file://python:python")

    PATH_INFO - path to account if not connected to a swift object, path to the swift object if attached to an object ("/account" in the former case, "/account/container/object" in the latter)

    REQUEST_METHOD - it's "GET" for requests that do not have attached data files (the ones that have only "stdout" or "output" channels, and maybe network ones), and it's "POST" for requests that have attached data ("stdin", "input", "image" and so on)

    DOCUMENT_ROOT - device name of the attached object (if attached, otherwise unset)

    CONTENT_LENGTH - size of the attached object (if attached, otherwise unset)

    CONTENT_TYPE - content-type of the attached object (if attached, otherwise unset)

    HTTP_X_TIMESTAMP, HTTP_ETAG, HTTP_CONTENT_ENCODING, HTTP_X_OBJECT_META_* - metadata from attached object (if attached, otherwise unset)

  5. Additional things.

    command line args - probably we need to pass them as env variable also (very useful for daemon mode)
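The REQUEST_METHOD rule from group 4 could be sketched roughly like this (an illustrative helper based on the channel names in the text above, not Zerocloud's actual code):

```python
# Channels that carry attached data into the job, per the issue text.
INPUT_CHANNELS = {"stdin", "input", "image"}

def request_method(channels):
    """Return "POST" when the job description declares an input
    channel, "GET" when it only has output (or network) channels.
    `channels` is an iterable of channel names."""
    if INPUT_CHANNELS.intersection(channels):
        return "POST"
    return "GET"
```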

mgeisler commented 10 years ago

Constantine Peresypkin notifications@github.com writes:

  1. Static ones, usually never change.

    SERVER_SOFTWARE="zerocloud"
    GATEWAY_INTERFACE="CGI/1.1"

Looks good to me.

  2. Related to the Swift setup; these will change if Swift is installed differently.

    SERVER_PORT="80"
    SERVER_NAME="localhost"
    SERVER_PROTOCOL="HTTP/1.0"

Also fine.

  3. Related to the request. The values are taken verbatim from the HTTP request

    QUERY_STRING
    HTTP_HOST
    REMOTE_ADDR
    REMOTE_USER
    HTTP_USER_AGENT
    HTTP_REFERER
    HTTP_ACCEPT
    HTTP_ACCEPT_ENCODING
    HTTP_ACCEPT_LANGUAGE

Looks good, but we are missing other HTTP headers sent by the client.

I'm testing how Apache does it right now, and so far I can see that a Foo header appears as an HTTP_FOO environment variable. This is consistent with section 4.1.18 on Protocol-Specific Meta-Variables in the CGI standard:

http://www.ietf.org/rfc/rfc3875

When I try adding a Range header, well then Apache seems to handle it! That is, my little CGI script is invoked and the output of it is then chopped according to the byte range I specify in my Range header.

I had expected it to be the CGI script that should handle the Range header, but that's apparently not how old-fashioned CGI works.

I've read questions about the header with PHP and judging from some SO answers, I believe that mod_php lets the script handle the header.

I'll have to test this some more.

  4. All others. These need the most attention as we are "emulating" things here.

    SCRIPT_NAME - right now it's a path to executable taken from job description. I want to alter it and use the actual name/path of the executable here (like "python" or "/bin/wc" for example).

So you want the name as seen inside the sandbox? That is probably okay. In real CGI it is something like

/~mg/cgi-bin/test.py

I would say that it is mostly used to generate URLs that refer back to the same script for future requests.

`SCRIPT_FILENAME` - right now unused. I want it to be the path from job description (like "swift://account/cont/app.nexe" or "file://python:python")

I don't see this header in RFC 3875.

`PATH_INFO` - path to account if not connected to a swift object, path to the swift object if attached to an object ("/account" in the former case, "/account/container/object" in the latter)

In traditional CGI, this contains extra path components. So if I request

http://localhost/~mg/cgi-bin/test.py/hey/extra/stuff

I see that PATH_INFO is set to '/hey/extra/stuff'. So for the /open/ protocol, I think this should be the name of the object that was requested (we're then treating 'open' as the script being executed, like test.py above).
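The split Apache performs can be sketched with a tiny helper (hypothetical, for illustration only):

```python
def split_path_info(request_path, script_name):
    """Classic CGI split: PATH_INFO is whatever follows the script
    portion of the request path (RFC 3875). E.g. the script
    '/~mg/cgi-bin/test.py' requested as
    '/~mg/cgi-bin/test.py/hey/extra/stuff' yields '/hey/extra/stuff'."""
    if request_path.startswith(script_name):
        return request_path[len(script_name):]
    return ""
```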

`REQUEST_METHOD` - it's "GET" for requests that do not have attached data files (the ones that have only "stdout" or "output" channels, and maybe network ones), and it's "POST" for requests that have attached data ("stdin", "input", "image" and so on)

Sounds good.

`DOCUMENT_ROOT` - device name of the attached object (if attached, otherwise unset)

This is a little weird. The document root is normally a quite static thing and not something that changes depending on what script you request and certainly not depending on what arguments you send with your request.

Will a script not always "know" where the attached object is? In the jobs I've written, I knew that I needed to read from /dev/input. Are there other possible input paths inside the sandbox?

`CONTENT_LENGTH` - size of the attached object (if attached, otherwise unset)

`CONTENT_TYPE` - content-type of the attached object (if attached, otherwise unset)

I (only) see these two variables defined if I POST to my little test script. Here you want to reuse them to describe the attached object instead of the data coming from the user. That's non-standard, but it sounds okay.

I hope there cannot be a situation where we have both POST data coming in on stdin and an attached object?

`HTTP_X_TIMESTAMP`, `HTTP_ETAG`, `HTTP_CONTENT_ENCODING`, `HTTP_X_OBJECT_META_*` - metadata from attached object (if attached, otherwise unset)

Yeah, it will be nice to have the metadata directly there.

  5. Additional things.

    command line args - probably we need to pass them as env variable also (very useful for daemon mode)

I was first thinking that this should be passed in the QUERY_STRING... but that doesn't make sense since the cmdline args are internal to the job being executed (they're part of the job description).

pkit commented 10 years ago

Looks good, but we are missing other HTTP headers sent by the client.

I fear that most current web servers have no real standard on which headers are handled by the server and which ones by the CGI script...

I've read questions about the header with PHP and judging from some SO answers, I believe that mod_php lets the script handle the header.

Yeah, exactly that kind of thing.

I don't see this header in RFC 3875.

But it's visible in most current servers, AFAIK.

So for the /open/ protocol, I think this should be the name of the object that was requested (we're then treating 'open' as the script being executed, like test.py above).

Yep, this is the exact value it has right now. But we will have a problem when we try to implement "arbitrary RESTful interfaces" on Zerocloud.

Will a script not always "know" where the attached object is? In the jobs I've written, I knew that I needed to read from /dev/input. Are there other possible input paths inside the sandbox?

Because the job description is quite static and can be compared to a specific mod_* config in apache.conf, the /dev/something part is really static until you change the config.

I hope there cannot be a situation where we have both POST data coming in on stdin and an attached object?

Probably that will never happen, but it depends on how the "arbitrary RESTful interfaces" are going to be implemented. Probably the POST data from the REST API frontend will be first materialized as an object and then ZeroVM will act upon it in another session. But there are other ways to do it, for example packing it in tar on-the-fly.

I was first thinking that this should be passed in the QUERY_STRING... but that doesn't make sense since the cmdline args are internal to the job being executed (they're part of the job description).

Not quite right, you can obviously change any part of the job description at runtime. This is the problem with current "standard" interfaces like CGI: they don't quite fit the Zerocloud execution paradigm. But inventing a good new API for that is something I still would like to avoid; it will probably be too complex.

mgeisler commented 10 years ago

Constantine Peresypkin notifications@github.com writes:

Looks good, but we are missing other HTTP headers sent by the client.

I fear that most current web servers have no real standard on which headers are handled by the server and which ones by the CGI script...

There seems to be a difference between handling the header and exposing it to the script. We can take all headers sent by the client and put them into the environment: add an 'HTTP_' prefix, replace '-' with '_' and uppercase the whole thing. That's described in the RFC, which also says that the server can leave out some connection-oriented headers if it likes.
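That mapping is small enough to sketch directly (the skip list is illustrative; servers commonly withhold credential-bearing headers, but this is not Zerocloud's actual behavior):

```python
def header_to_env(name):
    """Map an HTTP header name to its RFC 3875 meta-variable name,
    e.g. 'Accept-Encoding' -> 'HTTP_ACCEPT_ENCODING'."""
    return "HTTP_" + name.upper().replace("-", "_")

def headers_to_environ(headers, skip=("authorization", "proxy-authorization")):
    """Expose client headers in the environment, leaving out a few
    sensitive ones (illustrative choice of skip list)."""
    return {header_to_env(name): value
            for name, value in headers.items()
            if name.lower() not in skip}
```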

I've read questions about the header with PHP and judging from some SO answers, I believe that mod_php lets the script handle the header.

Yeah, exactly that kind of thing.

Some more testing shows that Apache will "help" the CGI script if it doesn't seem to know what it's doing :)

More concretely, if it doesn't send out a Status header, Apache will add a '200 OK' header. If the client did a range request, Apache will then also take care of slicing the output.

If the script does add a Status header, Apache will do less. If I add a '206 Partial Content' Status header, Apache stops messing with the output and the script is then in full control. With a '200 OK', Apache will both content-encode the output and slice it according to what the Range header says.

So all in all, I think we can expose more headers in the environment. We should also be able to make ZeroCloud add a '200 OK' Status header only if the CGI script hasn't already added a Status header.
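That last idea could be sketched as follows (illustrative Python, assuming LF-separated message/cgi headers; not ZeroCloud's actual code):

```python
def ensure_status(cgi_output):
    """Add a default 'Status: 200 OK' line to a message/cgi response
    only when the script did not emit a Status header itself."""
    headers, sep, body = cgi_output.partition("\n\n")
    lines = headers.splitlines()
    if not any(line.lower().startswith("status:") for line in lines):
        lines.insert(0, "Status: 200 OK")
    return "\n".join(lines) + sep + body
```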

pkit commented 10 years ago

There seems to be a difference between handling the header and exposing it to the script. We can take all headers sent by the client and put them into the environment: add an 'HTTP_' prefix, replace '-' with '_' and uppercase the whole thing. That's described in the RFC, which also says that the server can leave out some connection-oriented headers if it likes.

We can do that, but it will make things problematic, i.e. we will need to put safeguards against abusing headers. :)

So all in all, I think we can expose more headers in the environment. We should also be able to make ZeroCloud add a '200 OK' Status header only if the CGI script hasn't already added a Status header.

We are supporting CGI and CGI NPH right now. This means that if the job claims to support message/http we will not mess with the Status header; if the job claims only message/cgi support we will add the HTTP/1.1 200 OK stuff. On the other hand, we do not pass response headers "as-is": we extract only Content-Type and X-Object-Meta-*; all other headers are ignored right now. If we want to stop ignoring them, we need to differentiate between the "single node, return stuff directly to the user, run CGI" case and all other cases.

mgeisler commented 10 years ago

Constantine Peresypkin notifications@github.com writes:

There seems to be a difference between handling the header and exposing it to the script. We can take all headers sent by the client and put them into the environment: add an 'HTTP_' prefix, replace '-' with '_' and uppercase the whole thing. That's described in the RFC, which also says that the server can leave out some connection-oriented headers if it likes.

We can do that, but it will make things problematic, i.e. we will need to put safeguards against abusing headers. :)

Yes, I suggest we take a look at how other servers do this and implement something similar.

So all in all, I think we can expose more headers in the environment. We should also be able to make ZeroCloud add a '200 OK' Status header only if the CGI script hasn't already added a Status header.

We are supporting CGI and CGI NPH right now.

Thanks for that keyword, I did not know that this behavior had a name (NPH: Non-Parsed Headers).

Which means that if the job claims to support message/http we will not mess with the Status header, if the job claims only message/cgi support we will add the HTTP/1.1 200 OK stuff.

Yeah, I see this documented in the Servlet.md document.

On the other hand we do not pass response headers "as-is": we extract only Content-Type and X-Object-Meta-*; all other headers are ignored right now.

If we want to stop ignoring them we need to differentiate between "single node, return stuff directly to user, run CGI" case, and all other cases.

Is this because of the fan-out execution that we have behind the ZeroCloud proxy? I mean, you send an HTTP request and this can trigger more than one request in ZeroCloud, all of which can set headers.

I actually think we should handle this case already today. Something like defining that a ZeroCloud job can have at most one node that sends data to stdout (= data is returned to the client who issues the GET or POST).

Today, we get a concatenation of the output from the stdout. It seems that the names of the groups determine the concatenation of the outputs, but I didn't see this specified anywhere.

It might make more sense to say that we send back at most one output. Users who wish to concatenate outputs would then have to make an additional fan-in group with a single node. The cost of that might be okay when we consider the clarity it brings: the single node that delivers output to the client will also be responsible for outputting any headers to the client.

Martin Geisler

http://google.com/+MartinGeisler

pkit commented 10 years ago

Yes, I suggest we take a look at how other servers do this and implement something similar.

It means reverse engineering apache/nginx source code, yuck. :)

Is this because of the fan-out execution that we have behind the ZeroCloud proxy?

Yes. But not only that. You can save the output of message/cgi or message/http as an object. And the object server doesn't know about that; it only knows how to parse the output and send it back. And when you save it, all the headers get lost, apart from metadata and content-type (maybe also Content-Disposition, or whatever it's called?)

Today, we get a concatenation of the output from the stdout. It seems that the names of the groups determine the concatenation of the outputs, but I didn't see this specified anywhere.

Yep, this probably needs to be specifically stated in the docs. Now it's more like "send a job with multiple nodes without a path and see what happens". :)

It might make more sense to say that we send back at most one output.

That was considered, but it really makes life more miserable. Earlier we also supported multiple channels concatenated in one node output, but dropped it. Right now some people still miss that feature (even without knowing it ever existed).

I think we need to make a clear distinction here. And I think we probably need to make some "CGI mode" available. I.e. a mode that behaves a lot like classical CGI (with some Zerocloud added bonuses). And it will probably mean that its job description will be a fixed one. All in all I think we can divide the task into the following sub-tasks:

  1. Make message/cgi and message/http behave more like real CGI outputs.
  2. Implement proper decision making in the proxy middleware regarding CGI headers: sent to the object? sent to the user?
  3. Implement a CGI mode for job descriptions. I.e. probably: if you POST/GET/etc. to a specific version (or with a specific header) and the target of the URL is a ZeroVM executable (or a ZeroVM-supported script), invoke a specific job description that sets up stuff correctly, executes that one task as a CGI job, and then correctly returns the result (to the user or otherwise).

mgeisler commented 10 years ago

Constantine Peresypkin notifications@github.com writes:

Yes, I suggest we take a look at how other servers do this and implement something similar.

It means reverse engineering apache/nginx source code, yuck. :)

Apache adds most headers, it only filters out Authorization and Proxy-Authorization headers:

https://github.com/apache/httpd/blob/53823ebd5c/modules/generators/mod_cgi.c#L805 https://github.com/apache/httpd/blob/454a28fac3/server/util_script.c#L194

Is this because of the fan-out execution that we have behind the ZeroCloud proxy?

Yes. But not only that. You can save the output of message/cgi or message/http as an object. And the object server doesn't know about that; it only knows how to parse the output and send it back. And when you save it, all the headers get lost, apart from metadata and content-type (maybe also Content-Disposition, or whatever it's called?)

It sounds like we might just want to consider that a configuration error and refuse to execute the job? Like, we would demand that a stdout channel with those content types has no path associated.

Today, we get a concatenation of the output from the stdout. It seems that the names of the groups determine the concatenation of the outputs, but I didn't see this specified anywhere.

Yep, probably need to be specifically stated in docs. Now it's more like "send job with multiple nodes without a path and see what happens". :)

Guess how I tested this before writing my previous mail :)

It might make more sense to say that we send back at most one output.

That was considered, but it really makes life more miserable.

Well, the current situation is only non-miserable if the behavior makes sense and can be relied upon :-) In my brief testing, it seems like the order of the outputs matches the alphabetical order of the group names. Is that true or was it just a coincidence?

Earlier we also supported multiple channels concatenated in one node output, but dropped it. Right now some people still miss that feature (even without knowing it ever existed).

I think we need to make a clear distinction here. And I think we probably need to make some "CGI mode" available. I.e. a mode that behaves a lot like classical CGI (with some Zerocloud added bonuses). And it will probably mean that its job description will be a fixed one.

I'm not sure I follow you here. What is a "fixed" job description?

All in all I think we can divide the task into following sub-tasks:

  1. Make message/cgi and message/http behave more like real CGI outputs.

Up until now, I had not even noticed that you could attach these content types to the stdout device. I had just declared the device without any content type and printed stuff. I think this means that 'message/cgi' is an implicit default content type for stdout?

  2. Implement proper decision making in the proxy middleware regarding CGI headers: sent to the object? sent to the user?

This was why I wanted to restrict the output to a single node -- then it makes sense to talk about HTTP headers. It's confusing to me if the nodes "internal" to the job also produce HTTP headers, since they're not strictly part of an HTTP request :)

  3. Implement a CGI mode for job descriptions. I.e. probably: if you POST/GET/etc. to a specific version (or with a specific header) and the target of the URL is a ZeroVM executable (or a ZeroVM-supported script), invoke a specific job description that sets up stuff correctly, executes that one task as a CGI job, and then correctly returns the result (to the user or otherwise).

I think I would try not to have different modes. My thinking is that if we create a CGI-mode that "makes sense", then what about the other mode? Will it be a mode that does not make sense? :)

I think it would be better overall if we can define some sensible semantics of what it means when a node issues HTTP headers for a device that doesn't go back to the user. The CGI-mode you talk about would then be a subset of that -- the subset where you happen to restrict yourself to something that looks like old-fashioned CGI. Does that make sense?


pkit commented 10 years ago

I think it would be better overall if we can define some sensible semantics of what it means when a node issues HTTP headers for a device that doesn't go back to the user.

That's an easy one. If the user wants to write an object back to Swift but the object's MIME type is only known at runtime, the job will use the CGI interface to produce a proper Content-Type header, and that header will be used when the object is PUT. Same thing for other object-related headers, like metadata or encoding.

The CGI-mode you talk about would then be a subset of that -- the subset where you happen to restrict yourself to something that looks like old-fashioned CGI. Does that make sense?

That would be a super-set. I.e. a CGI app can produce a bunch of headers; only specific ones will be used for the PUT, and all the other ones can be transferred to the user, if the response is to be transferred to the user.

Apache adds most headers, it only filters out Authorization and Proxy-Authorization headers:

And Content-Length + Content-Type "for no specific reason". :)

It sounds like we might just want to consider that a configuration error and refuse to execute the job?

Not good, see the first paragraph. :)

In my brief testing, it seems like the order of the outputs match the alphabetical order of the group names. Is that true or was it just a coincidence?

Not a coincidence. But the guarantees are that the order will be constant between invocations and deterministic. For deterministic reasons.

What is a "fixed" job description?

It means that you don't send it with the request; it's implied from other things. Like the current GET behavior.

I think this means that 'message/cgi' is an implicit default content type for stdout?

Obviously not. :) The default Content-Type for stdout is text/plain. Now the difference would be:

text/plain:

Hello
World!

message/cgi:

Status: 200 OK
Content-Type: text/plain

Hello
World!

But obviously the end result for the user will be the same, as the proxy will add its own headers anyway, to both of them; it just adds more headers to the former than to the latter.
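For comparison, a script whose stdout is declared message/cgi would emit that framing itself; a minimal sketch (the helper name is made up for illustration):

```python
def cgi_response(body, content_type="text/plain", status="200 OK"):
    """Build a message/cgi response: Status and Content-Type headers,
    a blank line, then the body (matching the example above)."""
    return ("Status: %s\n" % status +
            "Content-Type: %s\n" % content_type +
            "\n" + body)
```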

mgeisler commented 10 years ago

Constantine Peresypkin notifications@github.com writes:

Apache adds most headers, it only filters out Authorization and Proxy-Authorization headers:

And Content-Length + Content-Type "for no specific reason". :)

They are actually not filtered -- they're just added by hand for no specific reason :)

It sounds like we might just want to consider that a configuration error and refuse to execute the job?

Not good, see the first paragraph. :)

I cannot see that paragraph any longer in your mail :) And GitHub doesn't put proper In-Reply-To headers into the mails, so it appears that all the mails are replies to the first mail sent...

In other words, this discussion format is a little primitive.

In my brief testing, it seems like the order of the outputs match the alphabetical order of the group names. Is that true or was it just a coincidence?

Not a coincidence. But the guarantees are that the order will be constant between invocations and deterministic. For deterministic reasons.

Okay, I had not considered that the determinism would apply to this level too.

What is a "fixed" job description?

It means that you don't send it with the request; it's implied from other things. Like the current GET behavior.

Okay, thanks!

I think this means that 'message/cgi' is an implicit default content type for stdout?

Obviously not. :)

The default Content-Type for stdout is text/plain. Now the difference would be:

text/plain:

Hello
World!

message/cgi:

Status: 200 OK
Content-Type: text/plain

Hello
World!

Okay, but what is the 'message/http' content type then?

I thought 'message/http' was the content type that told ZeroCloud that the script wanted to output HTTP headers -- that the script is an NPH script, to use that terminology.

But obviously the end result for the user will be the same, as the proxy will add its own headers anyway, to both of them; it just adds more headers to the former than to the latter.

Yeah, that makes sense.


pkit commented 10 years ago

Okay, but what is the 'message/http' content type then?

message/http:

HTTP/1.1 200 OK
Content-Type: text/plain
Content-Length: 12

Hello
World!

mgeisler commented 10 years ago

Constantine Peresypkin notifications@github.com writes:

Okay, but what is the 'message/http' content type then?

message/http:

HTTP/1.1 200 OK
Content-Type: text/plain
Content-Length: 12

Hello
World!

Okay, thanks!


pkit commented 9 years ago

This is the current typical environment for Zerocloud on Zebra. I think it looks good enough and can be documented.

{
  "GATEWAY_INTERFACE": "CGI/1.1", 
  "HTTP_ACCEPT": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", 
  "HTTP_ACCEPT_ENCODING": "gzip, deflate", 
  "HTTP_ACCEPT_LANGUAGE": "en-US,en;q=0.5", 
  "HTTP_CACHE_CONTROL": "no-cache", 
  "HTTP_CONNECTION": "close", 
  "HTTP_HOST": "zebra.zerovm.org", 
  "HTTP_PRAGMA": "no-cache", 
  "HTTP_REFERER": "https://zebra.zerovm.org/index.html?account=user_id:user_name", 
  "HTTP_USER_AGENT": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0", 
  "HTTP_X_TRANS_ID": "txf593505dfd3a435581fa2-0053eb7205", 
  "HTTP_X_ZEROCLOUD_ID": "147cfb5673d7350f9", 
  "HTTP_X_ZEROVM_EXECUTE": "1.0", 
  "HTTP_X_ZEROVM_TIMEOUT": "125", 
  "LOCAL_CONTENT_LENGTH": "110", 
  "LOCAL_CONTENT_TYPE": "text/x-python; charset=UTF-8", 
  "LOCAL_DOCUMENT_ROOT": "/dev/input", 
  "LOCAL_HTTP_ETAG": "19816b82edb733f2572a722d5653c826", 
  "LOCAL_HTTP_X_TIMESTAMP": "1407937552.86101", 
  "LOCAL_OBJECT": "on", 
  "LOCAL_PATH_INFO": "/AUTH_account/test_env/test_env.py", 
  "PATH_INFO": "/AUTH_account", 
  "REMOTE_USER": "user_name:user_id,user_name,AUTH_account", 
  "REQUEST_METHOD": "POST", 
  "REQUEST_URI": "/v1/AUTH_account", 
  "SCRIPT_FILENAME": "file://python:python", 
  "SCRIPT_NAME": "python", 
  "SERVER_NAME": "zebra.zerovm.org", 
  "SERVER_PORT": "80", 
  "SERVER_PROTOCOL": "HTTP/1.0", 
  "SERVER_SOFTWARE": "zerocloud"
}
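Given this layout, a job can separate the attached-object variables (the LOCAL_* prefix) from the request-level CGI variables; a small sketch (helper name is illustrative, not part of Zerocloud):

```python
import os

def split_environ(environ):
    """Split LOCAL_* variables (describing the attached object) from
    the request-level CGI variables, per the dump above."""
    local, request = {}, {}
    for key, value in environ.items():
        if key.startswith("LOCAL_"):
            local[key[len("LOCAL_"):]] = value
        else:
            request[key] = value
    return local, request

local, request = split_environ(os.environ)
# Per the dump above, LOCAL_OBJECT is "on" when an object is attached.
object_attached = local.get("OBJECT") == "on"
```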