A question: in the source code (quoted below), prompts whose tokenized length exceeds max_length are truncated, but the NTK implementation only triggers when `if seq_len > self.max_seq_len_cached:` holds. Doesn't that mean seq_len can never exceed self.max_seq_len_cached? If so, how is NTK context extrapolation supported?
```python
# Truncate over-long prompts by keeping the first and last max_length/2 tokens.
if len(tokenized_prompt) > max_length:
    half = int(max_length / 2)
    prompt = tokenizer.decode(tokenized_prompt[:half], skip_special_tokens=True) + \
        tokenizer.decode(tokenized_prompt[-half:], skip_special_tokens=True)
```
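For reference, the dynamic-NTK rotary-embedding pattern behind that check looks roughly like this (a simplified, self-contained sketch assuming the generic dynamic-NTK base rescaling; class and method names are illustrative, not this repo's code):

```python
import torch

class DynamicNTKRotaryEmbedding(torch.nn.Module):
    """Sketch of a dynamically scaled rotary embedding.

    `max_seq_len_cached` grows whenever a longer sequence is seen, so the
    `if seq_len > self.max_seq_len_cached:` branch is exactly the point
    where the NTK-rescaled frequencies are (re)computed.
    """

    def __init__(self, dim, max_position_embeddings=2048, base=10000):
        super().__init__()
        self.dim = dim
        self.max_position_embeddings = max_position_embeddings
        self.base = base
        self.max_seq_len_cached = 0
        self._set_cos_sin_cache(max_position_embeddings)

    def _set_cos_sin_cache(self, seq_len):
        self.max_seq_len_cached = seq_len
        base = self.base
        if seq_len > self.max_position_embeddings:
            # NTK scaling: enlarge the RoPE base so positions beyond the
            # trained length are squeezed back into the trained range.
            # (Generic dynamic-NTK formula, assumed for illustration.)
            alpha = seq_len / self.max_position_embeddings
            base = base * alpha ** (self.dim / (self.dim - 2))
        inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float() / self.dim))
        t = torch.arange(seq_len).float()
        freqs = torch.outer(t, inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        self.cos_cached = emb.cos()
        self.sin_cached = emb.sin()

    def forward(self, seq_len):
        # The cache only grows; nothing is recomputed for shorter inputs.
        if seq_len > self.max_seq_len_cached:
            self._set_cos_sin_cache(seq_len)
        return self.cos_cached[:seq_len], self.sin_cached[:seq_len]
```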
Description
What does this PR do?
Adding NTK scaling patch function `apply_ntk_scaling_patch()`

- Refactoring the NTK scaling code (#705) by moving the relevant code into `patches.py` to keep the inference code clean and simplify usage.
- Adding the `alpha` parameter for NTK scaling to the patch function and inference scripts. See the usage below for explanations.

Adding attention patch function `apply_attention_patch()`

- Adding support for inference with memory_efficient_attention. On a single 24G GPU, the maximum context size can be scaled up to about 5K without exceeding GPU memory (model loaded in fp16).
- Adding an option for storing the KV cache before applying RoPE.

Updating `inference_hf.py`, `gradio_demo.py`, and `openai_api_server.py` to showcase NTK scaling and memory_efficient_attention.
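A minimal sketch of how the two patch functions might be invoked in an inference script (the function names, keyword names, and the `patches.py` module are from this PR; the exact import path, call order, and model paths are assumptions):

```python
# Hypothetical usage sketch: apply the patches before loading the model
# (whether they must precede from_pretrained is an assumption).
from patches import apply_ntk_scaling_patch, apply_attention_patch
from transformers import AutoModelForCausalLM, AutoTokenizer

# NTK scaling: pass a fixed float, or 'auto' to use the empirical formula.
apply_ntk_scaling_patch(alpha="auto")

# Memory-efficient attention via xformers, optionally storing the KV cache
# before RoPE is applied.
apply_attention_patch(use_memory_efficient_attention=True,
                      store_kv_before_rope=False)

tokenizer = AutoTokenizer.from_pretrained("model_dir")   # placeholder path
model = AutoModelForCausalLM.from_pretrained("model_dir", torch_dtype="auto")
```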
Usage

Parameters
- `alpha`: If `'auto'`, `alpha` is calculated with the empirical formula `alpha = (seq_len / 1024 - 1) * 1.1` during generation; otherwise `alpha` is set to the given fixed float value.
- `use_memory_efficient_attention`: Whether to use memory_efficient_attention from xformers. Default is `False`.
- `store_kv_before_rope`: Whether to store the KV cache before applying RoPE. Default is `False`.
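To make the `'auto'` rule concrete, a small sketch of the calculation (the formula is the one stated above; the lower bound of 1.0 for short inputs is an assumption):

```python
def auto_alpha(seq_len: int) -> float:
    """Empirical formula from above: alpha = (seq_len / 1024 - 1) * 1.1.

    Clamping to at least 1.0 (i.e. no scaling) is an assumption for
    inputs shorter than the original context window.
    """
    return max(1.0, (seq_len / 1024 - 1) * 1.1)

# e.g. a 4096-token context gives alpha = (4 - 1) * 1.1 = 3.3
print(auto_alpha(4096))
```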
Advice

- Set `use_memory_efficient_attention=True` to save GPU memory when processing long texts.
- Set `alpha` to a float value (>1) to apply NTK scaling for long-context support. Empirically, we find `alpha = (seq_len / 1024 - 1)` may be a good choice, where `seq_len` is the estimated context size (the sum of the lengths of the input and the output); a worked example follows this list.
- Set `alpha` to `'auto'` to let the model determine the value of `alpha` dynamically and adaptively.
- Set `store_kv_before_rope=True` if `alpha='auto'` and you encounter performance degradation. See the discussion here.
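For instance, targeting a total context of about 4096 tokens (input plus output), the advice above gives `alpha = 4096 / 1024 - 1 = 3`, which would then be passed as a fixed value (building on the usage sketch earlier):

```python
seq_len = 4096                        # estimated input + output length
alpha = seq_len / 1024 - 1            # = 3.0, per the advice above
apply_ntk_scaling_patch(alpha=alpha)  # imported from patches.py, as sketched above
```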