webmachinelearning / webnn

🧠 Web Neural Network API
https://www.w3.org/TR/webnn/

Packing operations for gemm / matmul #86

Closed: kpu closed this issue 4 years ago

kpu commented 4 years ago

Most GEMM implementations have a packed representation. Some support packing in advance: MKL does for float32, and oneDNN does for int8 if you know where to look, while not officially supporting it. This is particularly useful for inference, where usually one of the parameters is a constant. So a common use case is downloading parameters in some canonical format (row-major or whatever), packing them (ideally in place, throwing away the canonical form to save RAM), then passing the packed format to GEMM. The packed format is opaque and varies by hardware due to SIMD lengths etc. Throwing away the canonical form is key because otherwise it effectively doubles the RAM requirements for the model.

In theory a compiler with whole-program knowledge could pack in advance and throw away the canonical form. But in practice it's always possible some new code will read the canonical values, so packing is usually done explicitly by the user; otherwise there would need to be a slow path that unpacks the values to emulate a canonical read.

Would you consider adding optional packing operators for a and b?
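
To make the request concrete, here is a minimal sketch of what such a hint could look like, assuming a builder-style graph API; `packForGemm`, the option bag, and the descriptor fields are invented here for illustration and are not part of any spec:

```ts
// Hypothetical sketch only: `packForGemm` is an invented hint operator, not
// part of WebNN. `builder` stands in for whatever graph-builder object the
// API exposes, and `a` for the activation operand built elsewhere.
declare const builder: any;
declare const a: unknown;
const k = 1024, n = 1024;                        // example weight shape

const bCanonical = new Float32Array(k * n);      // row-major weights as downloaded
const b = builder.constant({ dataType: 'float32', dimensions: [k, n] }, bCanonical);

// The hint: "b is only ever the second argument of gemm (possibly transposed)".
// The backend may convert b to an opaque, hardware-specific layout and release
// the row-major copy; bPacked cannot be read back in canonical form.
const bPacked = builder.packForGemm(b, { argument: 'b', transpose: false });
const y = builder.gemm(a, bPacked);
```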

(Moved from https://github.com/w3c/machine-learning-workshop/issues/85)

gramalingam commented 4 years ago

I also feel that

> a compiler with whole-program knowledge could pack in advance and throw away the canonical form

I don't understand the point you make about

> it's always possible some new code will read the canonical values.

Can you elaborate on what this means?

kpu commented 4 years ago

Let's say I have a parameter matrix A. From the user's perspective, A is still in row-major format, or whatever we decide the canonical matrix format is. If the compiler is smart, it will realize that A is only ever used as an argument to GEMM, convert it to a packed format, and throw out the original form of A. But JavaScript can load more JavaScript from a new URL if it wants to, and that new code might print / index / slice / add A, where the user expects to see row-major behavior. Then the compiler has to convert back to row-major, reindex on the fly (probably slow), or have kept a row-major copy of A the whole time.
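
A small illustration of that hazard; `graph` and `x` below are placeholders, not real API surface:

```ts
// Illustration of why a compiler cannot silently discard the canonical form.
declare const graph: { matmul(x: unknown, w: unknown): unknown };
declare const x: unknown;

const A = new Float32Array(256 * 256);   // row-major parameter matrix

// At this point A is only ever a matmul argument, so packing it and freeing
// the row-major buffer looks safe...
const y = graph.matmul(x, A);

// ...but script loaded later can still index A and expect row-major layout,
// e.g. something like:
//   import('https://example.invalid/debug.js').then(m => m.dump(A[5 * 256 + 3]));
// forcing the implementation to unpack on the fly or keep the row-major copy.
```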

At least in C++ toolkit development, packing is an explicit step managed by the user.

gramalingam commented 4 years ago

I think there are two optimization opportunities here: (1) Doing the format conversion once, at compile-time, instead of repeatedly for each inference, (2) Memory management, especially the possibility of reusing the same memory for the original format and the transformed format. I think the compiler can always do (1). Doing (2) could be trickier, as it depends on the conventions for memory-management and memory-ownership.

When you refer to packing operators, the idea is that the packed and the original will be distinct memory buffers, right? Or are you suggesting an in-place transformation on the same memory buffer?

kpu commented 4 years ago

In general the packed size can be different from the unpacked size. For example, oneDNN pads (the depth of an 8-bit dot product should be a multiple of 4 for VNNI), and FBGEMM stores the assembly generated for that size as part of the packed matrix.
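
As a rough illustration of why the packed buffer can be bigger than the canonical one, here is the padding arithmetic for the VNNI int8 case mentioned above (real packers add further blocking and metadata, e.g. FBGEMM's generated code):

```ts
// Rough size arithmetic for an int8 B matrix packed for a VNNI-style kernel:
// the depth K is padded to a multiple of 4 because the instruction consumes
// four int8 values per lane.
function packedInt8Bytes(k: number, n: number): number {
  const kPadded = Math.ceil(k / 4) * 4;
  return kPadded * n;
}

packedInt8Bytes(1022, 256); // 262144 bytes, vs 1022 * 256 = 261632 canonical
```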

So packing is usually out-of-place but done one matrix at a time, which means the model size is never doubled.

This really gets to the question of whether the standard is meant to be an implementation of kernels with a framework built on top, or to include the framework itself. If the framework is built on top, then packing should be a separate step exposed to the framework.

gramalingam commented 4 years ago

Yes, it does depend on where the framework sits. I also think there is a more general question here about memory management and ownership. If we consider

 Operand constant(OperandDescriptor desc, ArrayBufferView value);

as long as there is a proper protocol for "value" to be shared etc., I think a compiler can handle this scenario (assuming the packing sits in the implementation of the standard, below the standard API).
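
For what it's worth, a sketch of what that could look like below the API, assuming the implementation is allowed to re-layout constant data at build time and drop its reference to the original buffer (all types here are stand-ins, not WebNN internals):

```ts
// Stand-in types; a sketch of implementation behaviour below the API surface.
interface OperandDescriptor { dataType: string; dimensions: number[]; }

class ConstantOperand {
  private packed: ArrayBuffer | null = null;    // opaque, hardware-specific layout

  constructor(readonly desc: OperandDescriptor,
              private canonical: ArrayBufferView | null) {}

  // Called during the compile step: convert once, then drop the only internal
  // reference to the canonical buffer so it can be garbage collected.
  compile(pack: (src: ArrayBufferView, desc: OperandDescriptor) => ArrayBuffer): void {
    if (this.canonical === null) return;        // already packed
    this.packed = pack(this.canonical, this.desc);
    this.canonical = null;
  }
}
```

Of course, if the caller still holds its own reference to "value", nothing is actually reclaimed, which is exactly the sharing/ownership protocol question above.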

mratsim commented 4 years ago

Usually, packing is machine-specific, and the MKL documentation says that the packed representation is not to be serialized.

The main goal is to prepack for use in batched GEMM.

Here is my prepacking implementation:

Explanation:

  1. The internals follow the BLIS paper and directly call the gebp macrokernel (https://github.com/numforge/laser/blob/d1e6ae61/laser/primitives/matrix_multiplication/gemm.nim#L48-L101)
  2. This skips the packing steps from the full implementation: https://github.com/numforge/laser/blob/d1e6ae61/laser/primitives/matrix_multiplication/gemm.nim#L165-L170
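
For readers unfamiliar with the BLIS scheme the macrokernel expects, here is a minimal, generic sketch of packing B into column panels of width NR (NR = 4 is arbitrary; the linked Nim code blocks both operands and picks tile sizes per microarchitecture):

```ts
// Minimal BLIS-style illustration: copy a row-major K x N matrix B into
// contiguous column panels of width NR, the layout a gebp-style macrokernel
// streams through. Real implementations also pack A into MR-row panels.
function packB(b: Float32Array, K: number, N: number, NR = 4): Float32Array {
  const panels = Math.ceil(N / NR);
  const packed = new Float32Array(panels * K * NR);  // tail panel zero-padded
  let out = 0;
  for (let p = 0; p < panels; p++) {
    for (let k = 0; k < K; k++) {
      for (let j = 0; j < NR; j++) {
        const col = p * NR + j;
        packed[out++] = col < N ? b[k * N + col] : 0;
      }
    }
  }
  return packed;
}
```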

In terms of perf, the implementation is somewhat slower than OpenBLAS single-threaded (160 GFlops vs 200 GFlops on an overclocked i9-9980XE), but more scalable: 17.5x on 18 cores vs 15.5x (https://github.com/mratsim/weave/pull/94#issuecomment-571751545). On my machine OpenBLAS reaches 2.74 TFlops, my own code 2.86 TFlops, and oneDNN 3.05 TFlops.

wchao1115 commented 4 years ago

I'm not sure I follow the whole thing, but it sounds like this is an implementation-specific technique. WebNN defines a specification, not an implementation, of ML constructs. It has a provision for the compilation step (model.createCompilation) that the browser is free to implement however it sees fit for its internal use.

kpu commented 4 years ago

Essentially what I want is a hint that says "I'm only ever going to use this as the second argument to a matrix multiply, or the transposed form of this". Or in functional programming terms, I want to curry matrix multiplication with one of the arguments then retain access to only the curried function.
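
In code, the currying picture might look like this (hypothetical helpers; the point is that after packing, only the curried function needs to stay reachable, not the canonical weights):

```ts
// Currying analogy with hypothetical helpers: `packForBackend` produces an
// opaque layout and `gemmPacked` consumes it. Once `curryGemmWithB` returns,
// the row-major form of B no longer needs to be reachable.
declare function packForBackend(b: Float32Array, k: number, n: number): ArrayBuffer;
declare function gemmPacked(a: Float32Array, packedB: ArrayBuffer): Float32Array;

function curryGemmWithB(b: Float32Array, k: number, n: number):
    (a: Float32Array) => Float32Array {
  const packedB = packForBackend(b, k, n);
  return (a: Float32Array) => gemmPacked(a, packedB);
}
```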

The packing format is very implementation-specific, which is why it's opaque to the user. What implementations have in common is that they do packing: it's baked into tensor cores on NVIDIA; MKL does it, numforge does it, oneDNN does it, gemmlowp does it. The implementation is free to ignore my hint and just define packing to be the identity.

If I'm honest, I don't know much about createCompilation but you're making me afraid it will break shortlisting: https://syncedreview.com/2017/01/03/a-practical-guide-to-neural-machine-translation/ .

wchao1115 commented 4 years ago

At model compilation time, the implementation of WebNN could identify constant tensors in the model and perform the necessary pre-processing steps (e.g. packing) on them depending on the hardware. Since the constant tensor data is known to the model at compile time, there is no need to defer that step to graph execution, although that remains the implementation's choice to make. The GPU regularly pre-processes convolution weights ahead of execution, so this process is not new to me.

From the API standpoint, the nn.constant method already requires the constant data upfront. Is that a good enough hint for this?

wchao1115 commented 4 years ago

This issue was discussed in the WebML WG call on 9/17, and the group concluded that the constant operands, along with the compile step in the API, are sufficient for the implementor of the API to initiate packing operations if they are available in the hardware layer.

anssiko commented 4 years ago

Per the following group resolution, closing this issue:

> the constant operands along with the compile step in the API is sufficient for the implementor of the API to initiate packing operations if available in the hardware layer