xnd-project / libndtypes

Subsumed into xnd
https://xnd.io/
BSD 3-Clause "New" or "Revised" License

Datashape scalability #13

Open wesm opened 7 years ago

wesm commented 7 years ago

I have a high-level design question concerning the use of text as a serialized representation of array metadata. In my opinion, it is not the best choice as a primary representation. Let me explain why.

A problem that the big data community has run into concerns "large metadata": storing very wide or very complex tables in file formats like Parquet has resulted in performance problems, because manipulating the metadata itself (serialized as a Thrift message) is unwieldy -- parsing and serializing is expensive, and doing small projections / selecting certain fields out of a large data set incurs the cost of parsing the entire metadata.

In Apache Arrow, we have a more restricted type system than datashape, but we decided to use a "zero copy" or "no parse" technology, Flatbuffers (from Google), to represent metadata in a portable way. We chose Flatbuffers over alternatives (like Cap'n Proto, created by the main author of Google's Protocol Buffers) because of community traction and cross-language support (it has first-class support for C++, C#, C, Go, Java, JS, PHP, and Python, with Rust and other languages in the works).

Using a technology like Flatbuffers for the serialized representation enables O(1) inspection of schemas, without any copying or parsing overhead, regardless of how big the schema is. Have you thought about using something like this instead?
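(A minimal Python sketch of the "no parse" idea -- not Flatbuffers' actual layout, which uses vtables, but the same principle: field values sit at computable offsets inside one contiguous buffer, so reading one field never touches the rest of the metadata.)

import struct

def read_field(buf, index):
    # Hypothetical fixed layout: an 8-byte field count followed by
    # little-endian int64 fields. Access cost is O(1) in the buffer size.
    return struct.unpack_from("<q", buf, 8 + 8 * index)[0]

buf = struct.pack("<4q", 3, 10, 20, 30)   # count = 3, then fields 10, 20, 30
assert read_field(buf, 2) == 30           # fields 0 and 1 are never parsed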

skrah commented 7 years ago

Thanks for the ideas! I would be curious about the performance aspect before addressing the other points. FlatBuffers has a benchmark that uses a very small struct:

https://google.github.io/flatbuffers/flatbuffers_benchmarks.html

libndtypes2 has a benchmark that uses a huge struct (make bench). The file is:

https://github.com/blaze/libndtypes2/blob/master/tools/bench.c

Parsing and deallocating the huge struct 1 million times takes 7.2s on my slowish T400 machine. Using the small struct from the FlatBuffers benchmark in tools/bench.c, 1 million repetitions take 2.5s.

Unfortunately the FlatBuffers benchmarks don't build here, so I'm using their self-reported times (likely measured on a faster machine than the T400).

Encode (also 1 million times): 3.2s. Dealloc (also 1 million times): 0.08s.

So on the surface creating and deallocating the types seems quite fast in libndtypes2.

skrah commented 7 years ago

Correction: the benchmark used the old syntax and was broken (this should be fixed now in master). The real numbers are:

Huge struct, 1 million times: 112s
Small struct, 1 million times: 5.4s

So it is 5.4s vs. 3.2s (self-reported -- which machine? which exact benchmark?).

wesm commented 7 years ago

So these numbers are "time to generate a serialized string from an in-memory schema"? The numbers I'm interested in are times to parse (deserialize) and emit (serialize) large schemas, with 1 million or more elements -- the issues I'm describing would show up in use cases like "read the metadata for a given subset of 100 particular fields".

Memory efficiency is also a concern -- one of the benefits of Flatbuffers is that you can select what you want from e.g. a memory-mapped file without any unnecessary copying.

I can collaborate with you on some benchmarks for these use cases sometime in the next week or so if you'd like.
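(A sketch of the harness this asks for, using the ndtypes Python bindings that appear later in this thread; it assumes ndt(s) parses a datashape string and str(t) emits it again.)

import time
from ndtypes import ndt

# A very wide struct type: {x0: int64, x1: int64, ...}
s = "{%s}" % ", ".join("x%d: int64" % n for n in range(100_000))

start = time.time()
t = ndt(s)                 # parse (deserialize from text)
print("parse:", time.time() - start)

start = time.time()
out = str(t)               # emit (serialize back to text)
print("emit:", time.time() - start)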

wesm commented 7 years ago

Sorry, reading this again:

So it is 5.4 vs 3.2

These numbers aren't comparable because the tasks aren't the same. To do a proper comparison we will have to write Flatbuffers IDL that encodes data identical to what's being parsed in libndtypes2. I can help with this.

teoliphant commented 7 years ago

Thanks for the reference and the connection. I want to make sure we are on the same page: datashape is only about describing data in memory or on disk, not a serialization strategy.

Are you suggesting that we use Flatbuffers to store the datashape metadata rather than the current datashape string? A standard for encoding datashape metadata that avoids the parsing step altogether sounds great.

wesm commented 7 years ago

That's right. When you use zero-copy tools up and down the stack, the on-disk representation and the in-memory one are the same. So it's suboptimal for metadata to take longer to "load" for larger schemas when general zero-copy "serialization" tools like this exist.
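(A sketch of that zero-copy "load" step, using Python's mmap on a hypothetical schema.fb file: opening costs O(1) regardless of schema size, because nothing is copied or parsed until a field is actually read.)

import mmap

with open("schema.fb", "rb") as f:
    buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# buf behaves like bytes backed by the page cache; a Flatbuffers-style
# accessor can read individual fields from it in place.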

teoliphant commented 7 years ago

Great suggestion! I just wanted to make sure that both Stefan and I understood.

skrah commented 7 years ago

I'm still trying to figure this out -- think about this potential datashape use case, e.g. typing a Numba gufunc. This is the existing syntax, without datashape:

import numpy as np
from numba import guvectorize, float64

@guvectorize([(float64[:,:], float64[:,:], float64[:,:])], '(m,n),(n,p)->(m,p)')
def numba_matmul(x, y, res):
    col = np.empty(y.shape[0])            # scratch buffer for one column of y
    for j in range(y.shape[1]):
        for k in range(y.shape[0]):
            col[k] = y[k, j]
        for i in range(x.shape[0]):
            s = 0.0
            for k in range(x.shape[1]):
                s += x[i, k] * col[k]
            res[i, j] = s                 # write the accumulated dot product

With datashape it could look like this:

@guvectorize('(M * N * float64, N * P * float64) -> M * P * float64')
def numba_matmul(x, y, res):
    ...

The reasonable thing to do here would be to parse the datashape string once and keep the datashape tree alive in memory. The tree can be walked directly to find elements, guide gufuncs to the proper location and so forth.
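(A sketch of that parse-once pattern, with a hypothetical cache keyed by the signature string.)

from ndtypes import ndt

_sig_cache = {}

def sig_type(sig):
    # Parse the datashape string at most once; reuse the tree afterwards.
    t = _sig_cache.get(sig)
    if t is None:
        t = _sig_cache[sig] = ndt(sig)
    return t

t = sig_type("(M * N * float64, N * P * float64) -> M * P * float64")
assert sig_type("(M * N * float64, N * P * float64) -> M * P * float64") is t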

Can flatbuffers store ndarrays at all? How would the symbolic dimensions used above be represented?

skrah commented 7 years ago

The easiest way to see how complex the trees get is to print the AST:

make print_ast
echo "(M * N * float64, N * P * float64) -> M * P * float64" | ./print_ast -

wesm commented 7 years ago

The reasonable thing to do here would be to parse the datashape string once and keep the datashape tree alive in memory. The tree can be walked directly to find elements, guide gufuncs to the proper location and so forth.

I think you're arguing based on a misconception. If you want to convert the Flatbuffer to some other data structure (like the C structs you have defined in this project), you can certainly do that. But it is a zero-copy data structure that you can open and examine without any parsing overhead.

Want to work on a prototype together?

wesm commented 7 years ago

Here's an example type that can store ndarray metadata:

table FixedDimension {
  size: long;
}

table SymbolicDimension {
  name: string;
}

union Dimension {
  FixedDimension,
  SymbolicDimension
}

table Tensor {
  value_type: Type;
  shape: [Dimension];
}

(I'm not certain whether you can have an array of union values, you may need to wrap them in a non-union container)
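(If a vector of unions is not supported by a target language, the usual FlatBuffers workaround is exactly that non-union container: wrap the union in a table and declare Tensor's field as shape: [DimensionWrapper].)

table DimensionWrapper {
  dim: Dimension;   // one union value per table entry
}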

skrah commented 7 years ago

Thanks for the example! So the suggestion is to replace all data structures defined in

https://github.com/blaze/libndtypes2/blob/master/ndtypes.h

with the flatbuffer schema and to use flatbuffers for everything, including type tree traversal for gufuncs and pattern matching of types (this needs to be fast -- surely there's some overhead in flatbuffers)?

wesm commented 7 years ago

It's definitely a better primary serialized representation than a string that you have to parse. It also gives you the ability to read the metadata in all of the languages that Flatbuffers supports. You could use another tool like Protocol Buffers or Thrift for the same purpose, but as discussed the parsing overhead will grow burdensome with large schemas.

As far as performance goes, we should write some microbenchmarks to understand how many nanoseconds certain common operations take. Do you have any code examples for what you're describing?

In C++ at least, the generated bindings are header-only, so everything gets inlined.
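(A sketch of such a microbenchmark, timing type equality with Python's timeit; equality on ndt objects is used later in this thread, and the type here is an arbitrary example.)

import timeit
from ndtypes import ndt

t = ndt("10 * 10 * float64")
n = 1_000_000
secs = timeit.timeit(lambda: t == t, number=n)
print("%.1f ns per comparison" % (secs / n * 1e9))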

teoliphant commented 7 years ago

Hey Stefan. If I understand correctly, flatbuffers would not replace any code in ndtypes.h for now, nor would it replace the datashape strings we are using in numba or gumath. It could, however, be used as a representation of datashape that is passed between processes.

This looks like a very useful optimization, but I would not place it on the current critical path for you. A PR would be welcome, however.

datnamer commented 7 years ago

Can't the type pattern matching be inlined by numba?

skrah commented 6 years ago

@wesm "The numbers I'm interested in are times to parse (deserialize) and emit (serialize) large schemas, with 1 million or more elements."

I've implemented no-parse serialization/deserialization; here's a benchmark:

https://github.com/plures/ndtypes/blob/master/python/bench.py

As expected, the largest speedup is for a ragged array with 10_000_000 offsets. The speedup for a large struct is nice, but not spectacular:

Parse 10_000_000 var offsets:
4.600749731063843

Deserialize 10_000_000 var offsets:
0.029728412628173828

Parse large type (100_000 repetitions):
15.476191997528076

Deserialize large type (100_000 repetitions):
3.6464648246765137

I'd be interested in the Arrow numbers for a type with 10_000_000 offsets, but I cannot figure out how to serialize the offsets.
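(The linked bench.py presumably exercises the new path roughly as below; this is a sketch that assumes the serialize/deserialize method names and ndtypes' var(offsets=...) notation -- consult bench.py for the exact API.)

import time
from ndtypes import ndt

t = ndt("var(offsets=[0, 2, 3]) * float64")   # a small ragged type

start = time.time()
b = t.serialize()          # assumed API: emit the binary, no-parse form
u = ndt.deserialize(b)     # assumed API: reattach without reparsing
print(time.time() - start)
assert u == t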

skrah commented 6 years ago

I managed to pickle Arrow types, and here are the numbers (ndtypes master, pyarrow-0.9.0 wheel from pypi). I'm not sure if this is a fair comparison, but ndtypes does quite well.

from ndtypes import ndt
import pyarrow as pa
import pickle
import time

s = "{%s}" % ", ".join("x%s: int64" % n for n in range(100000))
t = ndt(s)

print("Ndtypes: pickle/unpickle huge struct:\n")
start = time.time()
s = pickle.dumps(t)
u = pickle.loads(s)
end = time.time()
print(end-start)
assert u == t

fields = [pa.field('x%s' % n, pa.int64()) for n in range(100000)]
t = pa.struct(fields)

print("\nArrow: pickle/unpickle huge struct:\n")
start = time.time()
s = pickle.dumps(t)
u = pickle.loads(s)
end = time.time()
print(end-start)
assert u == t

Ndtypes: pickle/unpickle huge struct:

0.07116484642028809

Arrow: pickle/unpickle huge struct:

2.2060387134552

wesm commented 6 years ago

I don't think that's a useful comparison. It would be more meaningful to measure against a Flatbuffers representation.

skrah commented 6 years ago

Why? I'm pickling/unpickling an in-memory ndt_t and an in-memory <class 'pyarrow.lib.StructType'>.

At least it looks that way to me.

wesm commented 6 years ago

Pickling is not a primary serialization path; it is merely a convenience for Python users to be able to pickle the pyarrow container types.

skrah commented 6 years ago

I'm trying to answer your question: "The numbers I'm interested in are times to parse (deserialize) and emit (serialize) large schemas, with 1 million or more elements."

I accept that this may not be a useful comparison, but could you please post the fast way to serialize/deserialize, so I can see whether this issue has been addressed or whether we still need to catch up with flatbuffers?

wesm commented 6 years ago

I'm off of open source this month -- I can possibly take a look next month.

skrah commented 6 years ago

Ah sorry, let's postpone this then!

wesm commented 6 years ago

Just to go back to the original point: I was not trying to compare Apache Arrow with libndtypes. I was pointing out that you invented a structured data serialization scheme for the ndtypes metadata, and there are already libraries (Flatbuffers, Protocol Buffers, Apache Thrift) designed by companies like Facebook and Google that solve that problem with certain benefits like forward compatibility / schema evolution, and possibly also better performance (though this latter point will have to be analyzed further).

teoliphant commented 6 years ago

Thanks for clarifying, Wes. How to leverage Flatbuffers remains a valid point that we should consider. I appreciate your pointing out the similarities you see. With so much happening in open source today, it is easy to miss out on projects that would help achieve the ultimate goal.

Flatbuffers certainly has nice features like versioning, which libndtypes doesn't really need yet. I think one question we will always have is whether it is really easier to depend on Flatbuffers or to just use libndtypes, if you are trying to speed up parsing datashape passed between libraries that use it to describe typed containers.

I might be missing something, but I don't yet see the benefit of introducing a dependency on Flatbuffers.

skrah commented 6 years ago

One non-intrusive way of supporting flatbuffers would be to convert libndtypes/serialize/serialize.c and libndtypes/serialize/deserialize.c to fbserialize.cpp and fbdeserialize.cpp and add --with-flatbuffers to ./configure.

So the C++ dependency would be optional.

Depending on the design of flatbuffers, it is hopefully as straightforward as converting write_int64 etc. to fbwrite_int64 and the like.