xquery-mode / cider-any

Evaluate any buffer in cider – Replaced by Oook

make binary transfers work #17

Open m-g-r opened 7 years ago

m-g-r commented 7 years ago

An XQuery might return a binary file (e.g., a gzip of an XML document). Currently, this does not seem to be handled.

proofit404 commented 7 years ago

It is tricky for normal-mode to understand what this content is.

Does the receiving part work correctly?

It would be great if you could provide a minimal reproduction example.

m-g-r commented 7 years ago

Reproduction example:

In a shell:
cd /tmp
echo hello world > hello.txt
gzip hello.txt

XQuery 1:
xquery version "1.0-ml";
xdmp:document-load("/tmp/hello.txt.gz")

XQuery 2:
xquery version "1.0-ml";
doc("/tmp/hello.txt.gz")

Files: hello.sh.txt xquery1.xqy.txt xquery2.xqy.txt

proofit404 commented 7 years ago

I'm finally getting around to investigating this problem.

proofit404 commented 7 years ago

Looks like it works fine with gzipped data.

I did the following steps:

  1. Check that the documents are not present in the database yet
doc("/tmp/hello.txt.gz")
doc("/tmp/hello1.txt")

Both of these queries return the empty sequence.

  2. Prepare the documents for loading into the database
$ cd /tmp
$ echo '"hello world"' > hello.txt
$ cp hello.txt hello1.txt
$ gzip hello.txt
$ cat /tmp/hello1.txt
"hello world"
$ cat /tmp/hello.txt.gz
oE
  Xhello.txtS(QG?
$ zcat /tmp/hello.txt.gz
"hello world"

  3. Load the documents into the database
xdmp:document-load("/tmp/hello.txt.gz")
xdmp:document-load("/tmp/hello1.txt")

Both of these queries return the empty sequence.

  4. Read the documents from the database
doc("/tmp/hello1.txt")

returns

"hello world"

as expected.

doc("/tmp/hello.txt.gz")

returns

[B@223eb7be

This result was produced by the following Clojure form:

(do
  (require '[uruk.core :as uruk])
  (set! *print-length* nil)
  (set! *print-level* nil)
  (let [host "localhost"
        port 8889
        db {:host "localhost" :port "8889" :user "proofit404" :password "<pass>" :content-base "TutorialDB"}]
    (with-open [session (uruk/create-default-session (uruk/make-hosted-content-source host port db))]
      (doall (map str (uruk/execute-xquery session "doc(\"/tmp/hello.txt.gz\")"))))))

If you copy-paste it into the CIDER REPL directly, it produces a string of the same form as its output:

("[B@60d43a83")

So it seems to me that everything works as expected.

m-g-r commented 7 years ago

But "[B@60d43a83" is clearly not the binary contents of hello.txt.gz. It is not even 42 bytes long. It is the printed representation of the Clojure object, a byte array: "[B" is the JVM type name for a byte array, and "@60d43a83" is the object's identity hash code in hex, which says nothing about its contents. That is not what we want.

The whole Clojure form is part of cider-any-uruk-eval-form in cider-any-uruk.el. The form maps str over the result returned by the XQuery, and that is the problem. The result might be binary data, and that should just be sent to Emacs untampered with. (Maybe we want that even for textual data. Better not to have the encoding possibly changed by Clojure.)
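To illustrate the point about str, here is a minimal sketch, independent of uruk and MarkLogic, of what str does to a byte array and how the bytes can be inspected instead. The sample bytes (the start of a gzip header) and the helper name `byte-array?` are assumptions made for illustration only.

```clojure
;; A stand-in for the value the query returns: the first bytes of a gzip
;; header (illustrative data, not fetched from MarkLogic).
(def data (byte-array (map unchecked-byte [0x1F 0x8B 0x08 0x08])))

;; `str` calls Object.toString, which for a byte array yields the type
;; name "[B", an "@", and the identity hash code in hex. The contents
;; do not appear in the string at all.
(str data)                        ; something like "[B@60d43a83"

;; The contents are still available on the array itself.
(alength data)                    ; number of bytes
(mapv #(bit-and % 0xFF) data)     ; the bytes as unsigned integers

;; Hypothetical helper: detect a byte array before deciding how to
;; render a result.
(defn byte-array? [x]
  (instance? (Class/forName "[B") x))
```

A check like `byte-array?` is what lets the evaluation form treat binary results differently from strings and numbers.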

Here, evaluate this in your CIDER REPL:

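(require '[clojure.java.io :as io])

;; IOUtils comes from Apache Commons IO, which must be on the classpath.
(defn gunzip
  "Decompress gzipped data (anything io/input-stream accepts, e.g. a byte array) into a String."
  ([data]
   (gunzip data "UTF-8"))
  ([data encoding]
   ;; with-open closes both streams, even if decompression throws
   (with-open [in  (java.util.zip.GZIPInputStream. (io/input-stream data))
               out (java.io.ByteArrayOutputStream.)]
     (org.apache.commons.io.IOUtils/copy in out)
     (.toString out encoding))))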

(defn maybe-unzip [data]
  (if (= (type data) (Class/forName "[B"))
    ;; is byte-array => assume gzipped data and unzip
    (gunzip data)
    ;; something else (string, number, ...) => convert to String using str
    (str data)))

Replace the "map str" in your form with "map maybe-unzip", and observe that you can now open gzip files directly in Emacs by evaluating the XQuery "doc("/tmp/hello.txt.gz")".

Of course, this is just a quick and dirty hack: it assumes that every piece of binary data is a gzipped file and decodes it into a string in Clojure, so that it can be transferred to Emacs as a string.

It would be better to be able to retrieve binary data as such and hand it over to Emacs as binary data. E.g., there might be a PNG image or whatever stored in the MarkLogic database and we want to open it, as Emacs can open a lot of file formats and also has a binary mode.
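One way to hand binary data over unaltered would be to encode it as text for the trip to Emacs and decode it there, for example with Emacs's built-in `base64-decode-string`. The sketch below covers only the Clojure half, using `java.util.Base64` from the JDK (Java 8 or later). The names `result->wire` and `wire->bytes` and the `{:type ... :data ...}` map shape are hypothetical conventions for this sketch, not part of cider-any or uruk.

```clojure
(import 'java.util.Base64)

(defn byte-array? [x]
  (instance? (Class/forName "[B") x))

(defn result->wire
  "Turn one query result into a map that is safe to print as text.
  Byte arrays are Base64-encoded; everything else goes through `str`.
  The map shape is a hypothetical convention for this sketch."
  [x]
  (if (byte-array? x)
    {:type :binary
     :data (.encodeToString (Base64/getEncoder) ^bytes x)}
    {:type :text
     :data (str x)}))

(defn wire->bytes
  "Inverse of the :binary case, useful for checking the round trip."
  [{:keys [data]}]
  (.decode (Base64/getDecoder) ^String data))
```

Base64 grows the payload by about a third, which may matter for large documents, but it keeps every byte intact regardless of how the transport treats character encodings.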

Cheers, Max

Updated: improved version of maybe-unzip

proofit404 commented 7 years ago

Oh, now I understand what this issue was all about. Binary data transfer is way simpler than the unzip process on the Clojure side. Also, it doesn't require the Apache Java stack to be installed.

normal-mode is fine when you try to display a sequence of XML documents. With images it won't work. I think we can implement something similar to ein cells on top of eieio multimethods.

Can you provide an XQuery example with image operations? I need to understand your workflow so that I do it the right way.

m-g-r commented 7 years ago

> Oh, now I understand what this issue was all about. Binary data transfer is way simpler than the unzip process on the Clojure side. Also, it doesn't require the Apache Java stack to be installed.

Cool!

Well, when I retrieve a binary file, I expect the XQuery to return just one reply. You can enforce that with {:shape :single!} in the call to uruk/execute-xquery. Just hand the untouched bytes over to Emacs, I'd say.

It could be a special command, just for retrieving single files.

MarkLogic's QConsole also spits out a replacement text if you query multiple documents and one is a binary file:

doc("/tmp/hello.txt"),doc("/tmp/hello.txt.gz")
hello world
--------------------------------------------------------------------------------------
<binary node of 42 bytes> {0x1F8B0808928DFC57...}

Something like that would be great.
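A replacement text of that kind is cheap to produce on the Clojure side. The following is a minimal sketch that mimics the format of the output shown above; the function name `binary-summary` and the default of eight preview bytes are assumptions for illustration.

```clojure
(defn binary-summary
  "Describe a byte array in the style of the output above:
  its length plus the first `n` bytes in upper-case hex."
  ([^bytes data] (binary-summary data 8))
  ([^bytes data n]
   (let [shown (take n data)
         ;; mask each signed JVM byte to 0..255 before formatting
         hex   (apply str (map #(format "%02X" (bit-and % 0xFF)) shown))
         more  (if (> (alength data) n) "..." "")]
     (str "<binary node of " (alength data) " bytes> {0x" hex more "}"))))
```

When a query returns several documents, a renderer could call this for every byte array and use `str` for everything else, so a single binary result does not disturb the textual ones.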

Cheers, Max

m-g-r commented 7 years ago

No, we don't use images. It was just a general example. But we do store binary data.

m-g-r commented 7 years ago

If it is simple, it'd be great. Otherwise, other things are more important.

m-g-r commented 7 years ago

PPS: I've added the unzip code in https://github.com/xquery-mode/xdbc-selector/blob/master/cider-any-uruk-unzip-binaries.el. It can be activated via a simple `(require 'cider-any-uruk-unzip-binaries)'.

With that, the output for the XQuery doc("/tmp/hello.txt"),doc("/tmp/hello"),doc("/tmp/hello.txt.gz"),doc("/tmp/fox.jpg") changes to: [image: cider-any-uruk-unzip-binaries-after] where before it was: [image: cider-any-uruk-unzip-binaries-before]

(Clearly, this is a hack and should be improved upon. But for now, I can use this as it is.)

proofit404 commented 7 years ago

That's cool! I think we can leave it as is. I will take care of renaming this function as well once we agree on a new package name in the new-api branch of the cider-any repository.

m-g-r commented 7 years ago

Ah, well it is really a hack. Still, it would be really nice to have real binary transfer, and when it is there, we can implement this much nicer and cleaner.

But still, yes, for now it works and is an improvement I think. So thanks.

proofit404 commented 7 years ago

I've done some research in this area. The results are ambiguous.

It is related to the mechanism Emacs uses to handle zipped data.

For example, take the hello.txt.gz file discussed earlier. If we open this file with the find-file command, Emacs shows us a buffer containing the "hello world" string. So it must unpack the file somehow.

This mechanism is implemented in the jka-compr Emacs library, which provides the jka-compr-insert-file-contents function. Basically, it performs the following steps:

  1. Look at the file name and find the appropriate external program to unpack this archive.
  2. Make a local copy of the file.
  3. Apply the program found to the copy.
  4. Repeat from step 1 until all nested archive formats have been processed (as in tar + gzip).
  5. Insert the unpacked file content.

If we rename the file, call the find-file-literally command, or disable this mechanism with the jka-compr-uninstall function, we see the real archive content instead of the "hello world" string. It looks like this:

^_\213^H^H\264\373 X^@^Chello.txt^@S\312H\315\311\311W(\317/\312IQ\342^B^@G^??\330^N^@^@^@

So, with this mechanism, Emacs does not process zipped data in Elisp itself; it relies on external programs. That's sad...

You can see how we can send byte arrays from CIDER to Emacs Lisp.

Now we can retrieve the same document from the MarkLogic server with this code:

(setq yyy (mapcar #'identity (car (oook-eval-sync "doc(\"/tmp/hello.txt.gz\")"))))

Let's go back to the file opened literally and store its bytes in a variable too:

(setq rrr (mapcar #'identity (buffer-string)))

Now we can compare those two lists of bytes.

(print rrr)

(31 139 8 8 180 251 32 88 0 3 104 101 108 108 111 46 116 120 116 0 83 202 72 205 201 201 87 40 207 47 202 73 81 226 2 0 71 127 63 216 14 0 0 0)

(print yyy)

(31 65533 8 8 9 65533 32 88 0 3 104 101 108 108 111 46 116 120 116 0 83 65533 72 65533 65533 65533 87 40 65533 47 65533 73 81 65533 2 0 71 127 63 65533 14 0 0 0)

As you can see, they are mostly the same except for the intermediate (non-ASCII?) characters.
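The value 65533 is U+FFFD, the Unicode replacement character, which suggests that somewhere along the way the bytes were decoded as text (presumably as UTF-8) and every invalid byte was replaced. That is an interpretation of the output above, not something confirmed in this thread. The sketch below reproduces the effect on the JVM with the first eight bytes of rrr and contrasts it with ISO-8859-1, which maps every byte to exactly one character. The helper name `decode-as` is an assumption for illustration.

```clojure
;; First eight bytes of the literally opened file (taken from rrr above).
(def header (byte-array (map unchecked-byte [31 139 8 8 180 251 32 88])))

(defn decode-as
  "Decode `data` using the named charset and return the character
  codes of the resulting string as integers."
  [^bytes data ^String charset]
  (mapv int (String. data charset)))

;; Bytes that are not valid UTF-8 each become U+FFFD (65533).
(decode-as header "UTF-8")
;; => [31 65533 8 8 65533 65533 32 88]

;; ISO-8859-1 is a one-to-one mapping, so the byte values survive.
(decode-as header "ISO-8859-1")
;; => [31 139 8 8 180 251 32 88]
```

Once a byte has been replaced by U+FFFD its original value cannot be recovered, so the decoding has to be avoided rather than undone. The fifth value in yyy is 9 rather than 65533; bytes five to eight of a gzip header hold the modification time, so this may simply mean the copy in the database was gzipped at a different moment.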

If we implement a solution to this problem, we will probably do it this way:

  1. Encode the byte array and send it over the wire (it may be huge).
  2. If it contains non-printable characters, try to gunzip it the same way the jka-compr library does.
  3. If that fails for some reason, show the user only the first few bytes of this possibly huge file.
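Steps 2 and 3 can also be sketched on the Clojure side, using only the JDK and clojure.java.io. Instead of looking for non-printable characters, this sketch checks for the gzip magic bytes 0x1F 0x8B, which is a deliberate substitution. The names `gzipped?` and `render-binary` and the `preview-length` parameter are hypothetical.

```clojure
(require '[clojure.java.io :as io])
(import '(java.io ByteArrayOutputStream IOException)
        '(java.util.zip GZIPInputStream))

(defn gzipped?
  "True when `data` starts with the gzip magic bytes 0x1F 0x8B."
  [^bytes data]
  (and (>= (alength data) 2)
       (= 0x1F (bit-and (aget data 0) 0xFF))
       (= 0x8B (bit-and (aget data 1) 0xFF))))

(defn gunzip
  "Decompress gzipped bytes into a String, using only the JDK."
  [^bytes data ^String encoding]
  (with-open [in  (GZIPInputStream. (io/input-stream data))
              out (ByteArrayOutputStream.)]
    (io/copy in out)
    (.toString out encoding)))

(defn render-binary
  "Try gzip first; if the data is not gzip or decompression fails,
  return only the length and the first `preview-length` bytes."
  [^bytes data preview-length]
  (or (when (gzipped? data)
        (try (gunzip data "UTF-8")
             ;; ZipException and EOFException are both IOExceptions
             (catch IOException _ nil)))
      (str "<" (alength data) " bytes> "
           (vec (take preview-length (map #(bit-and % 0xFF) data))))))
```

Checking the magic bytes keeps non-gzip binaries such as images from ever reaching the decompressor, and the preview keeps a huge result from being printed in full.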

In the end, I think your solution on the Clojure side was the right thing to do. In my opinion, the best thing we can do is just port your solution into oook master.

What do you think?