Closed · forhaoliu closed this 11 months ago
Added a blockwise parallel transformer for training on sequences 32x longer than a vanilla transformer and 4x longer than memory-efficient attention / FlashAttention. Includes blockwise attention, blockwise FFN, and blockwise loss.
Tests passed.
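To illustrate the core idea, here is a minimal JAX sketch of what blockwise attention fused with a blockwise FFN can look like. It is not the kernel from this PR: it assumes single-head attention, hypothetical block sizes `q_block`/`kv_block`, and illustrative weight names `w1`/`w2`, and uses the standard running-softmax trick from memory-efficient attention.

```python
import jax
import jax.numpy as jnp

def blockwise_ffn(x, w1, w2):
    # FFN applied to one query block at a time, so only a
    # block-sized activation is ever materialized.
    return jnp.dot(jax.nn.gelu(jnp.dot(x, w1)), w2)

def blockwise_attn_ffn(q, k, v, w1, w2, q_block=256, kv_block=256):
    # q, k, v: (seq_len, dim); w1: (dim, hidden); w2: (hidden, dim).
    seq_len, dim = q.shape
    scale = dim ** -0.5
    outputs = []
    for qs in range(0, seq_len, q_block):
        qb = q[qs:qs + q_block] * scale
        # Running softmax statistics accumulated over key/value blocks.
        acc = jnp.zeros((qb.shape[0], dim))
        row_max = jnp.full((qb.shape[0],), -jnp.inf)
        row_sum = jnp.zeros((qb.shape[0],))
        for ks in range(0, seq_len, kv_block):
            kb, vb = k[ks:ks + kv_block], v[ks:ks + kv_block]
            scores = qb @ kb.T                        # (q_block, kv_block)
            new_max = jnp.maximum(row_max, scores.max(-1))
            correction = jnp.exp(row_max - new_max)   # rescale old partials
            p = jnp.exp(scores - new_max[:, None])
            acc = acc * correction[:, None] + p @ vb
            row_sum = row_sum * correction + p.sum(-1)
            row_max = new_max
        attn_out = acc / row_sum[:, None]
        # Fusing the FFN into the same query-block loop is what lets
        # the FFN avoid a full-sequence activation as well.
        outputs.append(attn_out + blockwise_ffn(attn_out, w1, w2))
    return jnp.concatenate(outputs, axis=0)
```

The key design point is that both the attention and the feedforward pass operate on one query block at a time, so peak activation memory scales with the block size rather than the full sequence length.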