What does "accumulated" mean in your issue? As you can see in the code, we calculate the Euclidean distance for each layer separately, and then select the layer with the maximum Euclidean distance.
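A minimal sketch of what that selection amounts to (simplified for illustration; the `j * 2` / `j * 2 + 1` pairing of the safe and unsafe hidden states follows the snippet quoted below, while the helper name and the loop structure are assumptions, not the repository's exact code):

```python
import torch

# Simplified sketch: score every transformer layer independently and keep the one
# whose paired safe/unsafe hidden states are farthest apart. Nothing is summed
# across layers, i.e., the distances are not accumulated.
def find_toxic_layer(hidden_states, j):
    max_distance, toxic_layer = -1.0, None
    for layer_index in range(1, len(hidden_states)):
        euclidean_distance = torch.dist(
            hidden_states[layer_index][j * 2],
            hidden_states[layer_index][j * 2 + 1],
            p=2,
        )
        if euclidean_distance > max_distance:
            max_distance, toxic_layer = euclidean_distance, layer_index
    return toxic_layer, max_distance
```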
Hi, have you solved your issue yet?
> What does "accumulated" mean in your issue? As you can see in the code, we calculate the Euclidean distance for each layer separately, and then select the layer with the maximum Euclidean distance.
By "accumulated" I mean that the difference between consecutive layers may matter more than the absolute value. For instance:

- Layer 1 Euclidean distance: 0.1
- Layer 2 Euclidean distance: 0.8
- Layer 3 Euclidean distance: 0.81
- Layer 4 Euclidean distance: 0.7

With the method proposed in the paper, we would choose Layer 3 as the toxic layer. However, the difference between Layer 3 and Layer 2 is only 0.01, while the difference between Layer 2 and Layer 1 is 0.7, which is much larger. So I consider the weights between Layer 1 and Layer 2 to be more important than those between Layer 2 and Layer 3. What do you think about this?
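Concretely, the alternative I have in mind would look something like the sketch below (purely hypothetical; `find_toxic_layer_by_delta` and the `layer_distances` dict are illustrations, not code from the repository):

```python
# Hypothetical alternative: pick the layer where the safe/unsafe distance grows the most
# compared with the previous layer, instead of the layer with the largest absolute distance.
def find_toxic_layer_by_delta(layer_distances):
    # layer_distances maps layer index -> per-layer Euclidean distance.
    layers = sorted(layer_distances)
    best_delta, toxic_layer = float("-inf"), None
    for prev, curr in zip(layers, layers[1:]):
        delta = layer_distances[curr] - layer_distances[prev]
        if delta > best_delta:
            best_delta, toxic_layer = delta, curr
    return toxic_layer, best_delta

# Toy numbers from above: selects Layer 2 (delta of roughly 0.7) rather than Layer 3.
print(find_toxic_layer_by_delta({1: 0.1, 2: 0.8, 3: 0.81, 4: 0.7}))
```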
We use the top-1 ranked layer in our paper, i.e., Layer 3 in your toy example.
We acknowledge that the top-1 strategy is quite simple and may not be optimal. Our choice of this strategy was based on the following consideration: although information accumulates from the bottom layers, the greatest differentiation occurs at the top-1 layer. Directly operating on that layer is akin to adding a guardrail.
Your idea is great, and it might yield even better results. Just a friendly reminder: if you're considering multiple layers, please carefully select appropriate hyperparameters and update strategies. The more changes made to the original model, the more likely it is to introduce unintended side effects.
By the way, this paper (ReFT: Representation Finetuning for Language Models) examines the integration of different layers, which might serve as a valuable reference for your work.
Looking forward to your future contributions to this field.
Thank you for your response. It answered my question.
In the paper, I saw that it says: "we consider the toxic layer to be the transformer layer that most effectively separates the distributions of safe and unsafe sequences."
In the code, I saw:

```python
for layer_index in range(1, len(hidden_states)):
    # Per-layer distance between the two paired (safe/unsafe) hidden states.
    euclidean_distance = torch.dist(
        hidden_states[layer_index][j * 2],
        hidden_states[layer_index][j * 2 + 1],
        p=2)
```
I have a question here: is the `euclidean_distance` accumulated across layers or not?