withcatai / node-llama-cpp

Run AI models locally on your machine with node.js bindings for llama.cpp. Force a JSON schema on the model output on the generation level
https://node-llama-cpp.withcat.ai
MIT License

feat: Apply different LoRA dynamically #103

Closed: snowyu closed this issue 3 months ago

snowyu commented 9 months ago

Feature Description

Allow changing the LoRA adapter dynamically after the LLaMA model has been loaded.

The Solution

See the llama_model_apply_lora_from_file() function in llama.cpp:

https://github.com/ggerganov/llama.cpp/blob/e9c13ff78114af6fc6a4f27cc8dcdda0f3d389fb/llama.h#L353C1-L359C1

Considered Alternatives

None.

Additional Context

No response

Related Features to This Feature Request

Are you willing to resolve this issue by submitting a Pull Request?

No, I don’t have the time and I’m okay to wait for the community / maintainers to resolve this issue.

giladgd commented 9 months ago

@snowyu Can you please provide an example of a proposed usage with llama.cpp showing how you would like to use it? Please provide links to files that you use and what you are generally trying to achieve.

I want to keep this library's API relatively high-level while still offering advanced capabilities, so I wouldn't necessarily want to expose the llama_model_apply_lora_from_file function as-is; I'd rather understand your use case better to figure out the best way to support this.

snowyu commented 9 months ago

The base LLM model is usually more than 4 GB in size, while the corresponding LoRA adapters are relatively small, typically a few hundred megabytes.

With dynamic LoRA loading, several LoRAs fine-tuned from the same base model could be switched quickly in memory, with no need to keep multiple full-size LLM models loaded.

// pseudocode
llama_model * model = llama_load_model_from_file("ggml-base-model-f16.bin", mparams);
...

// switch to the Animal Domain LoRA adapter
int err = llama_model_apply_lora_from_file(model,
   "animal-lora-adapter.bin",
   lora_scale,
   NULL,  // <-- optional lora_base model
   params.n_threads);

// switch to the Astronomy Domain LoRA adapter
err = llama_model_apply_lora_from_file(model,
   "astronomy-lora-adapter.bin",
   lora_scale,
   NULL,  // <-- optional lora_base model
   params.n_threads);

Note about lora_base: when using a quantized model, quality may suffer. To avoid this, specify an f16/f32 model as lora_base to use as the base. The layers modified by the LoRA adapter will be applied to the lora_base model and then quantized to the same format as the base model; layers not modified by the LoRA adapter remain untouched.

vlamanna commented 3 months ago

Just curious if there has been any progress on this?

I think it would be nice to be able to specify a LoRA adapter in the LlamaModelOptions, or to be able to call a method on LlamaModel.

If that makes sense, I'd be willing to start looking into it.

giladgd commented 3 months ago

@vlamanna The beta of version 3 is now mature enough, so I've added support for loading a LoRA adapter as part of loading a model (#217); set the lora option on llama.loadModel({ ... }) to use it.
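
For example, roughly like this (a minimal sketch; the model path and adapter file name are placeholders, and the exact shape of the lora value, shown here as an adapters list with a file path and scale, is an assumption):

// minimal sketch; the shape of the `lora` value is an assumption
import {getLlama} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: "ggml-base-model-f16.gguf",                           // placeholder path
    lora: {
        adapters: [{filePath: "animal-lora-adapter.gguf", scale: 1}] // assumed option shape
    }
});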

@snowyu Changing a LoRA on a model at runtime is not possible at the moment, as there's no way to unload an adapter after it has been applied to a model; every call to llama_model_apply_lora_from_file loads another adapter onto the current model state.

This feature will be available in the next beta version that I'll release soon.

github-actions[bot] commented 3 months ago

:tada: This issue has been resolved in version 3.0.0-beta.20 :tada:

The release is available on:

Your semantic-release bot :package::rocket:

snowyu commented 3 months ago

@snowyu Changing a LoRA on a model at runtime is not possible at the moment, as there's no way to unload an adapter after it has been applied to a model; every call to llama_model_apply_lora_from_file loads another adapter onto the current model state.

@giladgd It cannot be done with the low-level API, but it could be done in the high-level API, like this:

// pseudocode
class LlamaModel {
  loadLoRAs(loraFiles, scale, threads, baseModelPath?) {
    // check whether every currently applied LoRA is also in the requested set
    const alreadyLoaded = this.loraModels.map((loraModel) => loraModel.file)
    const needsReset = alreadyLoaded.some((file) => !loraFiles.includes(file))

    if (needsReset) {
      // a loaded LoRA is no longer wanted: deinit and load the base model again,
      // since a LoRA cannot be unloaded once applied
      this.reloadModel()
      this.loraModels = []
    }

    // apply every requested LoRA that isn't already loaded
    for (const loraFile of loraFiles) {
      if (this.loraModels.some((loraModel) => loraModel.file === loraFile))
        continue
      const loraModel = this._loadLoRA(loraFile, scale, threads, baseModelPath)
      if (loraModel) this.loraModels.push(loraModel)
    }
  }
}

giladgd commented 3 months ago

@snowyu It can be done with the high-level API that I've added. Providing an API that modifies the currently loaded model by changing the LoRAs applied to it at runtime is not preferable, since all dependent states (such as contexts) would also have to be reloaded when switching LoRAs. Given that there's no performance benefit to doing that (as unloading a LoRA is not possible at the low level), exposing such an API isn't worth it; it would only make using this library more complicated.
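
So with the current high-level API, "switching" an adapter roughly means disposing of the model (and its contexts) and loading it again with a different lora value; a sketch, assuming model.dispose() and the same lora option shape as in the example above:

// sketch: switch adapters by reloading the model with a different `lora` value
await model.dispose();  // dependent contexts become unusable and must be recreated
const astronomyModel = await llama.loadModel({
    modelPath: "ggml-base-model-f16.gguf",  // same base model as before
    lora: {adapters: [{filePath: "astronomy-lora-adapter.gguf", scale: 1}]}  // assumed shape
});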

snowyu commented 3 months ago

@giladgd If you only consider API design from a performance perspective, that is indeed the case. But from an ease-of-use perspective, it is worth exploring.

Let me describe my usage scenario: a simple intelligent-agent script engine. Agents can call each other, and each agent may use a different LLM, so LLM reloading is commonplace. My current pain is that I have to manage these LLMs inside the agent script engine:

  1. Determine the LLM caching strategy based on the available memory
  2. Decide whether to switch LLMs or run multiple LLMs simultaneously (based on the maximum VRAM and RAM and the optimal parameters)
  3. Manage and maintain the recommended configuration of each LLM

even though these things should be the responsibility of the LLM engine, not the agent script engine.

giladgd commented 3 months ago

@snowyu We have plans to make the memory management transparent, so you can focus on what you'd like to do with models, and node-llama-cpp will offload and reload things back to memory as needed so you can achieve everything you'd like to do without managing any memory at all, and in the most performant way possible with the current hardware.

Over the past few months, I've laid the infrastructure for building such a mechanism, but there's still work to do. Since it has taken much longer than I initially anticipated, this feature will be released as a non-breaking addition after the version 3 stable release (which is coming very soon).

Perhaps you've noticed, for example, that you don't have to specify gpuLayers when loading a model and contextSize when creating a context anymore, as node-llama-cpp measures the current hardware and estimates how many resources things will consume to find the optimal balance between many parameters, while maximizing the performance of each one up to the limits of the hardware. This is part of the effort to achieve seamless memory management and zero-config defaults.
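
For example, this is already enough (a minimal sketch; the model path is a placeholder):

// no gpuLayers or contextSize specified; node-llama-cpp estimates them
// from the current hardware and the model's metadata
import {getLlama} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({modelPath: "my-model.gguf"});  // placeholder path
const context = await model.createContext();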

Allowing a model's state to be modified at runtime at the library level would make using this library more complicated (due to all the hassle it incurs to keep things working and the performance tradeoffs it embodies), and I think it's a lacking solution to the memory-management hassle that I'm working on solving at its root.

snowyu commented 3 months ago

@giladgd

We have plans to make the memory management transparent, so you can focus on what you'd like to do with models, and node-llama-cpp will offload and reload things back to memory as needed so you can achieve everything you'd like to do without managing any memory at all, and in the most performant way possible with the current hardware.

That's great, looking forward to it.

Perhaps you've noticed, for example, that you don't have to specify gpuLayers when loading a model and contextSize when creating a context anymore.

Yes, I have. Have you thought about adding a memory estimate for the mmprojector model as well?

Allowing a model's state to be modified at runtime at the library level would make using this library more complicated (due to all the hassle it incurs to keep things working and the performance tradeoffs it embodies), and I think it's a lacking solution to the memory-management hassle that I'm working on solving at its root.

Totally agree.

giladgd commented 3 months ago

@snowyu

Yes. I have. Do you think about adding the estimate of memory for mmprojector model?

I don't know what model you are referring to.

I reverse-engineered llama.cpp to figure out how to estimate resource requirements using only the metadata of a model file without actually loading it; it isn't perfect, but the estimation is pretty close to the actual usage with many models I've tested this on.

To find out how accurate the estimation is for a given model, you can run this command:

npx node-llama-cpp@beta inspect measure <model path>

If you notice that the estimation is way off for some model and want to fix it, you can look at the llama.cpp codebase to figure out the differences in how memory is allocated for this model, and open a PR on node-llama-cpp to update the estimation algorithms on GgufInsights.

snowyu commented 3 weeks ago

@giladgd Sorry, I've been busy with my project lately. mmprojector comes from multimodal LLMs; maybe you haven't used the LLaVA part of llama.cpp yet.

You may be interested in the Programmable Prompt Engine project I'm working on.

I hope to add node-llama-cpp as the default provider in the near future, but for now I don't see a good API entry point to start from. I need a simple API:

// come from https://github.com/isdk/ai-tool.js/blob/main/src/utils/chat.ts
export const AITextGenerationFinishReasons = [
  'stop',           // model generated stop sequence
  'length',         // model generated maximum number of tokens
  'content-filter', // content filter violation stopped the model
  'tool-calls',     // model triggered tool calls
  'abort',          // aborted by user or timeout for stream
  'error',          // model stopped because of an error
  'other', null,    // model stopped for other reasons
] as const
export type AITextGenerationFinishReason = typeof AITextGenerationFinishReasons[number]
export interface AIResult<TValue = any, TOptions = any> {
  /**
   * The generated value.
   */
  content?: TValue;

  /**
   * The reason why the generation stopped.
   */
  finishReason?: AITextGenerationFinishReason;
  options?: TOptions
  /**
   * for stream mode
   */
  stop?: boolean
  taskId?: AsyncTaskId; // for stream chunk
}
// https://github.com/isdk/ai-tool-llm.js/blob/main/src/llm-settings.ts
export enum AIModelType {
  chat,  // text to text
  vision,  // image to text
  stt,  // audio to text
  drawing,  // text to image
  tts,  // text to audio
  embedding,
  infill,
}

// fake API
class AIModel {
  llamaLoadModelOptions: LlamaLoadModelOptions
  supports: AIModelType|AIModelType[]
  options: LlamaModelOptions // default options
  static async loadModel(filename: string, options?: {aborter?: AbortController, onLoadProgress, ...} & LlamaLoadModelOptions): Promise<AIModel>;
  async completion(prompt: string, options?: {stream?: boolean, aborter?: AbortController,...} & LlamaModelOptions): Promise<AIResult|ReadStream<AIResult>>
  fillInMiddle...
  tokenize...
  detokenize...
}