ymcui / Chinese-LLaMA-Alpaca

Chinese LLaMA & Alpaca large language models + local CPU/GPU training and deployment (Chinese LLaMA & Alpaca LLMs)
https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki
Apache License 2.0

Resuming pretraining from checkpoint #827

Closed lathashree01 closed 1 year ago

lathashree01 commented 1 year ago

Check before submitting issues

Type of Issue

Model training and fine-tuning

Base Model

LLaMA-7B

Operating System

Linux

Describe your issue in detail

I am pretraining the LLaMA-7B model, but the job stopped because of the time limit on the cluster, so I restarted the run.

I am now getting the error: "Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run."

In my DeepSpeed config I have a min loss scale of 1e-10; my previous training loss stopped at 1.092 (a snapshot is attached). I am training in fp16; due to hardware limitations I can't run in bf16, as the cluster job fails with an error.
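
To illustrate what I understand the error to mean (a rough sketch of dynamic loss scaling for illustration only, not DeepSpeed's actual implementation): the scale is halved every time gradients overflow, and the run aborts once it cannot be lowered any further.

    # Rough sketch of dynamic fp16 loss scaling (illustrative only, not the
    # DeepSpeed source): halve the scale on overflow, grow it again after
    # loss_scale_window clean steps, and abort once it cannot be lowered further.
    def step_loss_scale(scale, overflow, good_steps, min_loss_scale=1.0,
                        loss_scale_window=100):
        """Return (new_scale, new_good_steps) after one optimizer step."""
        if overflow:
            new_scale = scale / 2.0
            if new_scale < min_loss_scale:
                raise Exception("Current loss scale already at minimum - "
                                "cannot decrease scale anymore. Exiting run.")
            return new_scale, 0              # overflow resets the clean-step count
        good_steps += 1
        if good_steps >= loss_scale_window:  # enough clean steps: double the scale
            return scale * 2.0, 0
        return scale, good_steps

    # A burst of consecutive overflows (e.g. right after a restart) collapses the
    # scale from 2**16 down to the minimum and then triggers the exception.
    scale, good = 2.0 ** 16, 0               # initial_scale_power = 16
    try:
        for _ in range(20):
            scale, good = step_loss_scale(scale, overflow=True, good_steps=good)
    except Exception as err:
        print(err)

If the scaler behaves roughly like this, hitting the minimum right after the restart would mean gradients are overflowing on essentially every step.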

What can I do further to solve this issue? Any help would be greatly appreciated.

Thanks.

Dependencies (must be provided for code-related issues)

My DeepSpeed config:

{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 100,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1e-10
    },
    "bf16": {
        "enabled": false
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 1e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 1e8,
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
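
For what it's worth, one workaround I have seen suggested for this DeepSpeed error elsewhere (I have not verified it with this repo's training script) is to switch from dynamic scaling to a fixed loss scale by setting a non-zero "loss_scale". A minimal sketch that patches the config before relaunching; the filenames and the value 1024 are placeholders:

    # Unverified workaround sketch: a non-zero "loss_scale" makes DeepSpeed use a
    # static scale instead of dynamic scaling, so the scaler can no longer collapse.
    # "ds_config.json" and "ds_config_static.json" are placeholder filenames.
    import json

    with open("ds_config.json") as f:        # the config pasted above
        cfg = json.load(f)

    cfg["fp16"]["loss_scale"] = 1024.0       # 0 means dynamic; non-zero means static

    with open("ds_config_static.json", "w") as f:
        json.dump(cfg, f, indent=4)

As far as I understand, a static scale then skips overflowing steps instead of shrinking the scale, so it avoids this particular exception without fixing the underlying overflow.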

Execution logs or screenshots

[screenshot of the training log showing the loss scale exception]
github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.

github-actions[bot] commented 1 year ago

Closing the issue, since no updates were observed. Feel free to re-open if you need any further assistance.