wooozihui / GlitchMiner

Code of the paper "GlitchMiner: Mining Glitch Tokens in Large Language Models via Gradient-based Discrete Optimization"

Glitchminer throws nan entropies when device_map='auto' for big model inference #1

Open · AetherPrior opened this issue 4 days ago

AetherPrior commented 4 days ago

Hi all, I am trying to use GlitchMiner to run inference on a few of the archangel models, and some of them are too big to fit on a single GPU. Consequently, I'm setting device_map='auto' to spread them across all the GPUs in my single node.
However, upon running GlitchMiner, the entropies come back filled with nan values. Source:

    # AutoModelForCausalLM and AutoTokenizer come from transformers;
    # GlitchMiner and strictly_glitch_verification come from this repo.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    print(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map="cuda",  # toggled between "cuda" and "auto" (see the outputs below)
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    if 'archangel' in model_path:         
        template = '''{% for message in messages %}
    {% if message['role'] == 'user' %}
        {{ bos_token + '<|user|>\\n' + message['content'] + '\\n' }}
    {% elif message['role'] == 'system' %}
        {{ '<|system|>\\n' + message['content'] + '\\n' }}
    {% elif message['role'] == 'assistant' %}
        {{ '<|assistant|>\\n'  + message['content'] + ' ' + eos_token + '\\n' }}
    {% endif %}
{% endfor %}
'''
        tokenizer.chat_template = template
    # add stop token
    # tokenizer.add_special_tokens({'eos_token': tokenizer.eos_token})
    # model, tokenizer = accelerator.prepare(model, tokenizer)

    # Run GlitchMiner
    glitch_tokens, glitch_token_ids = GlitchMiner(
        model,
        tokenizer,
        num_iterations=125,
        batch_size=4,  # Adjusted batch size
        k=32,
        if_print=True,
    )

    glitch_count, verified_glitch_ids = strictly_glitch_verification(model, tokenizer, glitch_token_ids)
    glitch_tokens = [tokenizer.decode([g]) for g in verified_glitch_ids]

My output looks like this with device_map='auto':

Iteration: 0
  Current token: ' the', token id: 278, is glitch token: yes, entropy: 10.3734
  Current token: ',', token id: 29892, is glitch token: yes, entropy: 10.3734
  Current token: '▇', token id: 31589, is glitch token: yes, entropy: 10.3734
  Current token: '.', token id: 29889, is glitch token: yes, entropy: 10.3734
Iteration: 1
  Current token: ' Mediabestanden', token id: 28574, is glitch token: yes, entropy: 0.3178
  Current token: 'oreferrer', token id: 3798, is glitch token: yes, entropy: 0.1225
  Current token: ' Расподела', token id: 28354, is glitch token: yes, entropy: 0.3178
  Current token: 'ederbörd', token id: 12731, is glitch token: yes, entropy: 0.1225
Iteration: 2
  Current token: 'nederbörd', token id: 28633, is glitch token: yes, entropy: 0.1225
  Current token: ' Portály', token id: 20609, is glitch token: yes, entropy: nan
  Current token: '߬', token id: 31664, is glitch token: yes, entropy: 0.1225
  Current token: 'Obrázky', token id: 23313, is glitch token: yes, entropy: nan
Iteration: 3

and when I replace device_map='auto' with device_map='cuda', I get this:

Iteration: 0
  Current token: ' Portály', token id: 20609, is glitch token: yes, entropy: 7.1198
  Current token: ' челов', token id: 9831, is glitch token: no, entropy: 3.7044
  Current token: 'ederbörd', token id: 12731, is glitch token: yes, entropy: 6.1594
  Current token: 'nederbörd', token id: 28633, is glitch token: yes, entropy: 7.5174
Iteration: 1
  Current token: ' the', token id: 278, is glitch token: no, entropy: 0.0086
  Current token: '

I assume this is because the gradient computation does not support models sharded across multiple GPUs.
Was the entire setup run on a single GPU?
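
As a side note, here is a minimal diagnostic sketch (not from the original report) for checking how accelerate sharded the model under device_map='auto'; the hf_device_map attribute is only set when a device map is used:

    from transformers import AutoModelForCausalLM

    # Load with automatic sharding across the visible GPUs.
    model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

    # hf_device_map is set by accelerate and maps module names to devices.
    print(getattr(model, "hf_device_map", None))

    # The set of devices that actually hold parameters.
    print({p.device for p in model.parameters()})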

wooozihui commented 4 days ago

Hi, what is your PyTorch version? The auto mode should work with torch==2.4.0 and transformers==4.44.2. Additionally, this issue might arise from the small epsilon (1e-9) used in the entropy calculation, although it should still be functional with those versions.
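
For context, the entropy is computed along these lines (a simplified sketch, not the exact GlitchMiner code; the entropy helper below is illustrative). With half-precision probabilities, an epsilon as small as 1e-9 underflows to zero, so the log can return -inf and the product 0 * (-inf) becomes nan:

    import torch

    def entropy(logits: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
        # Illustrative entropy with an epsilon guard inside the log.
        probs = torch.softmax(logits, dim=-1)
        # In float16, eps=1e-9 rounds to 0, so log(probs + eps) is -inf wherever
        # probs == 0, and 0 * (-inf) gives nan. An eps around 1e-6 is still
        # representable in half precision and avoids the nan.
        return -(probs * torch.log(probs + eps)).sum(dim=-1)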

wooozihui commented 4 days ago

I've adjusted the epsilon to 1e-6; please check if it works 😊.

wooozihui commented 4 days ago

Hi, I suspect the issue might still be related to the versions of PyTorch and transformers. I tested the llama2-7b-chat model on both a four-GPU RTX 4090 setup and a single RTX 4090, and the results were consistent across both configurations, even using an epsilon of 1e-9.

AetherPrior commented 4 days ago

I see, my versions are:

torch==2.5.1
transformers==4.46.3

Let me try downgrading them and see if that helps.