A question: in the source code (quoted below), prompts whose tokenized length exceeds max_length are truncated, but the NTK implementation only triggers when `if seq_len > self.max_seq_len_cached:` holds. Doesn't that mean seq_len can never exceed self.max_seq_len_cached? If so, how is NTK context extrapolation supported?
```python
# Truncate over-long prompts by keeping the first and last max_length/2 tokens.
if len(tokenized_prompt) > max_length:
    half = int(max_length / 2)
    prompt = tokenizer.decode(tokenized_prompt[:half], skip_special_tokens=True) + \
        tokenizer.decode(tokenized_prompt[-half:], skip_special_tokens=True)
```
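For reference, the dynamic-NTK rotary-embedding pattern behind that check looks roughly like this (a simplified, self-contained sketch assuming the generic dynamic-NTK base rescaling; class and method names are illustrative, not this repo's code):

```python
import torch

class DynamicNTKRotaryEmbedding(torch.nn.Module):
    """Sketch of a dynamically scaled rotary embedding.

    `max_seq_len_cached` grows whenever a longer sequence is seen, so the
    `if seq_len > self.max_seq_len_cached:` branch is exactly the point
    where the NTK-rescaled frequencies are (re)computed.
    """

    def __init__(self, dim, max_position_embeddings=2048, base=10000):
        super().__init__()
        self.dim = dim
        self.max_position_embeddings = max_position_embeddings
        self.base = base
        self.max_seq_len_cached = 0
        self._set_cos_sin_cache(max_position_embeddings)

    def _set_cos_sin_cache(self, seq_len):
        self.max_seq_len_cached = seq_len
        base = self.base
        if seq_len > self.max_position_embeddings:
            # NTK scaling: enlarge the RoPE base so positions beyond the
            # trained length are squeezed back into the trained range.
            # (Generic dynamic-NTK formula, assumed for illustration.)
            alpha = seq_len / self.max_position_embeddings
            base = base * alpha ** (self.dim / (self.dim - 2))
        inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float() / self.dim))
        t = torch.arange(seq_len).float()
        freqs = torch.outer(t, inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        self.cos_cached = emb.cos()
        self.sin_cached = emb.sin()

    def forward(self, seq_len):
        # The cache only grows; nothing is recomputed for shorter inputs.
        if seq_len > self.max_seq_len_cached:
            self._set_cos_sin_cache(seq_len)
        return self.cos_cached[:seq_len], self.sin_cached[:seq_len]
```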
Description
What does this PR do?
Adding NTK scaling patch function `apply_ntk_scaling_patch()`

- Refactoring the NTK scaling code (#705) by moving the relevant code into `patches.py` to keep the inference code clean and simplify usage.
- Adding the `alpha` parameter for NTK scaling to the patch function and inference scripts. See the usage below for explanations.

Adding attention patch function `apply_attention_patch()`

- Adding support for inference with memory_efficient_attention. On a single 24G GPU, the maximum context size can be scaled up to about 5K without exceeding GPU memory (model loaded in fp16).
- Adding an option for storing the KV cache before applying RoPE.

Updating `inference_hf.py`, `gradio_demo.py`, and `openai_api_server.py` to showcase NTK scaling and memory_efficient_attention.
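A minimal sketch of how the two patch functions might be invoked in an inference script (the function names, keyword names, and the `patches.py` module are from this PR; the exact import path, call order, and model paths are assumptions):

```python
# Hypothetical usage sketch: apply the patches before loading the model
# (whether they must precede from_pretrained is an assumption).
from patches import apply_ntk_scaling_patch, apply_attention_patch
from transformers import AutoModelForCausalLM, AutoTokenizer

# NTK scaling: pass a fixed float, or 'auto' to use the empirical formula.
apply_ntk_scaling_patch(alpha="auto")

# Memory-efficient attention via xformers, optionally storing the KV cache
# before RoPE is applied.
apply_attention_patch(use_memory_efficient_attention=True,
                      store_kv_before_rope=False)

tokenizer = AutoTokenizer.from_pretrained("model_dir")   # placeholder path
model = AutoModelForCausalLM.from_pretrained("model_dir", torch_dtype="auto")
```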
Usage

Parameters
- `alpha`: If `'auto'`, `alpha` is calculated with the empirical formula `alpha = (seq_len / 1024 - 1) * 1.1` during generation; otherwise `alpha` is set to the given fixed float value.
- `use_memory_efficient_attention`: Whether to use memory_efficient_attention from xformers. Default is `False`.
- `store_kv_before_rope`: Whether to store the KV cache before applying RoPE. Default is `False`.
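To make the `'auto'` rule concrete, a small sketch of the calculation (the formula is the one stated above; the lower bound of 1.0 for short inputs is an assumption):

```python
def auto_alpha(seq_len: int) -> float:
    """Empirical formula from above: alpha = (seq_len / 1024 - 1) * 1.1.

    Clamping to at least 1.0 (i.e. no scaling) is an assumption for
    inputs shorter than the original context window.
    """
    return max(1.0, (seq_len / 1024 - 1) * 1.1)

# e.g. a 4096-token context gives alpha = (4 - 1) * 1.1 = 3.3
print(auto_alpha(4096))
```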
Advice

- Set `use_memory_efficient_attention=True` to save GPU memory when processing long texts.
- Set `alpha` to a float value (>1) to apply NTK scaling for long-context support. Empirically, we find `alpha = (seq_len / 1024 - 1)` may be a good choice, where `seq_len` is the estimated context size (the sum of the lengths of the input and the output); a worked example follows this list.
- Set `alpha` to `'auto'` to let the model determine the value of `alpha` dynamically and adaptively.
- Set `store_kv_before_rope=True` if `alpha='auto'` and you encounter performance degradation. See the discussion here.
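For instance, targeting a total context of about 4096 tokens (input plus output), the advice above gives `alpha = 4096 / 1024 - 1 = 3`, which would then be passed as a fixed value (building on the usage sketch earlier):

```python
seq_len = 4096                        # estimated input + output length
alpha = seq_len / 1024 - 1            # = 3.0, per the advice above
apply_ntk_scaling_patch(alpha=alpha)  # imported from patches.py, as sketched above
```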