wmmae / wmma_extension

An extension library of WMMA API (Tensor Core API)
https://arxiv.org/abs/2308.15152
MIT License

Feature request: copy from accumulator to frag_a / frag_b #1

Closed shamanDevel closed 1 year ago

shamanDevel commented 1 year ago

Hi, I'm currently trying to cast/copy the accumulator fragment directly to an input fragment without going through shared memory. Can I use this library for that purpose, and is there pre-defined functionality for it already? I can't find anything in the docs or sources. I tried working with `foreach`, but it only provides the mapping for a single fragment inside the lambda, not for two. I'm currently stuck on this problem, so any help would be appreciated. Thanks!

enp1s0 commented 1 year ago

Hi, @shamanDevel. Thank you for your question, and sorry for the late reply. This library doesn't have that functionality. I have tried to implement it, but I found it difficult: it requires data exchange within the warp, and the required exchange pattern does not map well onto the CUDA warp shuffle instructions. Therefore, I go through shared memory for this purpose in my own applications. Sorry I can't help you further. Thank you.
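For reference, the shared-memory workaround mentioned above can be sketched with the standard `nvcuda::wmma` API alone (no `wmma_extension` needed). The 16×16×16 half/float fragment shapes, the row-major layout, and the function name `acc_to_frag_a` below are illustrative assumptions, not taken from this thread:

```cuda
#include <mma.h>
using namespace nvcuda;

// Sketch: round-trip an accumulator fragment through shared memory so its
// contents can be re-loaded as a matrix_a input fragment. Assumes one warp
// works on one 16x16 tile; shapes and types are illustrative.
__device__ void acc_to_frag_a(
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major>& frag_a,
    const wmma::fragment<wmma::accumulator, 16, 16, 16, float>& frag_c) {
  __shared__ float smem_f32[16 * 16];
  __shared__ half  smem_f16[16 * 16];

  // 1. Spill the accumulator tile to shared memory (leading dimension 16).
  wmma::store_matrix_sync(smem_f32, frag_c, 16, wmma::mem_row_major);
  __syncwarp();

  // 2. Convert float -> half, since matrix_a fragments hold half precision.
  for (unsigned i = threadIdx.x % 32; i < 16 * 16; i += 32) {
    smem_f16[i] = __float2half(smem_f32[i]);
  }
  __syncwarp();

  // 3. Re-load the same tile as an input fragment.
  wmma::load_matrix_sync(frag_a, smem_f16, 16);
}
```

The two `__syncwarp()` calls are needed because `store_matrix_sync` and `load_matrix_sync` are warp-collective, and every lane must see the fully written tile before the conversion and the reload.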

shamanDevel commented 1 year ago

OK, sounds fair. Let's hope that NVIDIA will provide this feature directly in their public API in the future.