Open ljk419511 opened 4 months ago
Thank you for your interest!
Your GPU utilization is indeed a bit low. That might be because per_device_train_batch_size is set to 8 in the provided config. If you double it to 16, the utilization should be around 100%.
On the other hand, for our downstream experiments we used the checkpoint after 10k steps, which means you can save further time by stopping early (terminate the process once it reaches that step count and the checkpoint has been evaluated and saved). Also, it seems you have more than 4 GPUs, so you could try allocating more GPUs for parallel training.
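As a rough sketch only (the actual settings live in scripts/configs/train_casp_moco.yaml), if the training script is driven by Hugging Face-style TrainingArguments, the two suggestions above map to something like:

from transformers import TrainingArguments

# Illustrative values only; adjust the corresponding fields in the yaml config instead.
args = TrainingArguments(
    output_dir="outputs/casp_moco",      # placeholder output path
    per_device_train_batch_size=16,      # doubled from 8 to raise GPU utilization
    max_steps=10_000,                    # stop at the checkpoint used for downstream experiments
    save_steps=10_000,                   # make sure that checkpoint is saved
)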
We will consider providing a docker image later. However, we are busy at the moment so you might need to wait for a while. Since you are already able to train the prober, I think the environment is not a big problem.
Thank you for your reply! But I still have some questions.
1. I wonder if these four commands are training the model step by step?
torchrun --nproc_per_node=4 run_casp.py scripts/configs/train_casp_moco.yaml
torchrun --nproc_per_node=4 run_prober.py scripts/configs/train_prober.yaml
accelerate launch --num_processes=4 big_model_quantized_probing.py scripts/configs/probe_quantized_codellama-34b-4bit-unfreeze.yaml
python score_and_filter_signature.py scripts/configs/filter_sig.yaml
Does it mean I need to execute the above four commands first to get a pretrained model if I want to do some inference?
2. I noticed you mentioned inference scripts.
We provide representative scripts for training and inference.
How should I start for inference?
3. If I want to do some inference (e.g., for Binary Summarization), what should be my inputs and the outputs of the model?
Sorry for the late response. We are quite busy at the moment.
run_casp.py is for training the cross-modal retriever; run_prober.py is for training the prober; big_model_quantized_probing.py is for inferring the signatures; score_and_filter_signature.py is for filtering the signatures; and big_model_quantized_probing_continue.py is for inferring the rest of the function body given the filtered signatures. If you want to reproduce the whole pipeline, you will need to run these scripts step by step and possibly modify some configurations.
Note that for the provided probing config (probe_quantized_codellama-34b-4bit-unfreeze.yaml) we are actually using the pre-trained checkpoint hosted on HF. If you have your own paired binary-source function corpus, we recommend using that corpus for prober training in order to adapt to its distribution, and then running inference with the new prober.
As for the input to big_model_quantized_probing.py, it is essentially a Hugging Face dataset containing a codeart field. The codeart field is a dictionary containing two subfields: code and dep. For how to preprocess and obtain such fields, please check our codeart repo: https://github.com/ziansu/codeart.
Hi, I want to ask about the difference between the retriever and the prober. My initial understanding of the whole work is that:
But the note below from the paper really confuses me.
We leverage nucleus sampling to first let the prober generate a relatively large set of source function signatures with high randomness (top-p = 0.75). We use idea similar to retrieval by training a binary-signature dual-encoder to rank generated signatures and filter out the noisy ones. Ultimately, we use the prober to further complete the remaining signatures with smaller randomness (top-p = 0.5).
It feels as if the retriever and the prober are two modules that work independently.
This has confused me for a long time. If there is anything wrong with my understanding, please correct me.
Hoping for your guidance!
The retriever and prober indeed work independently during inference. However, there is some overlap in the training process.
As you already understand how the similarity score is computed: our retrieval-augmented baseline only uses the retrieved top-k source functions, i.e., the source functions with the highest similarity scores, as additional context (which is quite straightforward). Note that there is no need to decode h_src, since we already have the source functions in the datastore (which consists of all the source functions in the training set); otherwise, we could not obtain h_src, the encoded source functions, in the first place.
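For concreteness, a minimal sketch of the dual-encoder similarity/top-k step, with made-up tensor names and sizes (not the repo's actual API):

import torch
import torch.nn.functional as F

h_bin = F.normalize(torch.randn(1, 256), dim=-1)        # encoded binary function (stand-in)
h_src = F.normalize(torch.randn(10_000, 256), dim=-1)   # datastore of encoded source functions (stand-in)

scores = h_bin @ h_src.T          # cosine similarity after L2 normalization
topk = scores.topk(k=5, dim=-1)   # the top-k source functions become additional context
print(topk.indices)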
There is no retrieval process for the prober; it is actually an encoder-decoder. The encoder part of the prober encodes x_asm once and obtains the intermediate hidden states (the Zs as in Figure 2). Then the intermediate hidden states are fed into the SCFM for decoding. The decoding has some randomness, so sampling produces diverse source code snippets.
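As a self-contained illustration of how nucleus sampling yields diverse candidates (t5-small is used purely as a stand-in here; the actual prober feeds its hidden states into the SCFM rather than text):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tok("summarize: the quick brown fox jumps over the lazy dog", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.75,              # high randomness: diverse candidates
    num_return_sequences=5,  # sample several, filter the noisy ones afterwards
    max_new_tokens=16,
)
print(tok.batch_decode(outputs, skip_special_tokens=True))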
The overlap in training is because both the retriever and the prober contain the codeart encoder (same architecture, but the prober's further alignment training makes the parameters different).
Note that we only use disassembled code for retrieval/probing; you can probably also use decompiled code and more information from the binary (e.g., LMPA uses some information from the dependent context for variable name recovery) in the future.
Hi,
Thank you for your patient guidance; I still have some questions.
What is the signature in this context? The Zs as in Figure 2? If so, why is it called a source code signature rather than an asm signature? And how is the signature obtained here, from the encoder part of the prober?
We leverage nucleus sampling to first let the prober generate a relatively large set of source function signatures with high randomness (top-p = 0.75).
What does a binary-signature dual-encoder mean? Does it mean using a binary-signature corpus to retrain a binary-signature dual-encoder?
We use idea similar to retrieval by training a binary-signature dual-encoder to rank generated signatures and filter out the noisy ones.
Does it mean the decoding stuff?
Ultimately, we use the prober to further complete the remaining signatures with smaller randomness (top-p = 0.5).
My goal is to use another dataset as input to ProRec to get function summaries. Right now I'm stuck: I don't know how to use your pretrained prober to take disassembled code as input and produce discomC with relevant context as output. And your code structure is so refined and precise that I often get lost. So do you have any suggestions for direct inference to get function summaries using the ProRec structure?
Sincere appreciation, and I look forward to your reply!
The signature here refers to the source-level function signature, e.g., int main(int argc, char *argv[]). We simply want to sample better source code generated by the prober, but generating the full function can be time- and GPU-memory-consuming for a large number of candidates, so in this work we first generate a short signature of the function, do some filtering, and then generate the following tokens. In practice, if you have enough computational resources or acceleration (there should be some inference acceleration frameworks for multimodal models by now, but how to support the prober within these frameworks will be a problem), it is possible to directly sample full source functions and filter them with the retriever.
run_casp_signature.py is similar to run_casp.py, with only some differences in collate_fn, so we can use the original binary-source dataset for binary-signature training.
For inference with the prober, the input should comply with the format of the codeart field in our dataset, which is a JSON-dumped dictionary of code and dep. For preprocessing into the codeart format, you can check https://github.com/ziansu/codeart/tree/artifacts/codeart/preprocess. And maybe @XZ-X can help answer some questions related to preprocessing.
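For reference, a hypothetical sketch of the shape of one such input record; only the overall structure is meant to be accurate, and the actual contents of code and dep are produced by the codeart preprocessing:

# Hypothetical record shape; the real contents come from the codeart preprocessing.
example = {
    "codeart": repr({
        "code": [...],   # disassembled-code field (placeholder)
        "dep": [...],    # dependence field (placeholder)
    })
}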
I am happy to help with preprocessing.
The preprocessing code for CodeArt is here, as Zian suggested before. Let me know if you have further questions.
Alternatively, we can also run the preprocessing part for you if that's a better solution. In that case, would you share with us the unstripped binaries? And we will share the preprocessed file. You can start your experiments from the training/inference part then.
So sorry for the late reply.
@XZ-X With the permission of the original author, I've put the dataset here.
I've finished formatting the data and got the jsonl file in the following format.
{
  "metadata": {
    "project_name": ...,
    "function_name": ...,
    "function_addr": ...,
    "binary_name": ...
  },
  "code": [...],
  "data_dep": [...]
}
{
  ...
}
It seems that I cannot access the google drive link. Please double check the access control.
Did you mean you've successfully run the preprocessing pipeline and got the jsonl file?
I'm sorry. It should be okay now.
Yes, I think I've finished the preprocessing and got the jsonl file in the format of the codeart field according to here. But the following error seems to indicate that further processing of the jsonl data is still required.
Traceback (most recent call last):
  File "/home/featurize/linjk/prorec/src/big_model_quantized_probing.py", line 461, in <module>
    main()
  File "/home/featurize/linjk/prorec/src/big_model_quantized_probing.py", line 444, in main
    results = batch_inference(
  File "/home/featurize/linjk/prorec/src/big_model_quantized_probing.py", line 203, in batch_inference
    [eval(example['codeart']) for example in examples]
  File "/home/featurize/linjk/prorec/src/big_model_quantized_probing.py", line 203, in <listcomp>
    [eval(example['codeart']) for example in examples]
KeyError: 'codeart'
Besides, I'd also like to ask what I should do as a follow-up for inferencing with the prober.
Looking forward to your reply.
You should convert the jsonl into a Hugging Face dataset.
from datasets import Dataset, DatasetDict

dataset = Dataset.from_dict({'codeart': [repr({'code': ..., 'dep': ...}) for ... in your_jsonl_list]})
dataset = DatasetDict({'test': dataset})
# you can either push it to the hub
dataset.push_to_hub('path_to_your_hub')
# or you can save it to disk
dataset.save_to_disk('path_to_dir')
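For concreteness, here is one possible way (not from the repo) to build your_jsonl_list from the jsonl file posted above and fill in the elided parts of the comprehension; it assumes your data_dep field is what the dep subfield expects, and the file path is a placeholder:

import json
from datasets import Dataset, DatasetDict

# Read the jsonl produced by preprocessing; 'functions.jsonl' is a placeholder path.
your_jsonl_list = []
with open('functions.jsonl') as f:
    for line in f:
        your_jsonl_list.append(json.loads(line))

# Map each record's code/data_dep into the repr-dumped dict the prober expects.
dataset = Dataset.from_dict({
    'codeart': [repr({'code': ex['code'], 'dep': ex['data_dep']}) for ex in your_jsonl_list]
})
dataset = DatasetDict({'test': dataset})
dataset.save_to_disk('path_to_dir')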
Then, you can run accelerate launch --num_processes=4 big_model_quantized_probing.py scripts/configs/probe_quantized_codellama-34b-4bit-unfreeze.yaml
with some modifications within the config .yaml file and potentially some lines in the .py file.
If you pushed the dataset to the hub, change the dataset_name field in the .yaml file to your dataset name on the hub. If you saved it to disk, change the dataset loading line in big_model_quantized_probing.py to dataset = load_from_disk('path_to_dir').
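Illustratively, the load-from-disk variant inside big_model_quantized_probing.py could look like this (the exact variable names in the script may differ):

from datasets import load_from_disk

dataset = load_from_disk('path_to_dir')   # instead of loading the dataset by name from the hub
examples = dataset['test']                # the split created in the DatasetDict above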
Whether I use 1 RTX 4090 or 4 RTX 4090s, it throws a CUDA out-of-memory error unless batch_size=1, and the training progress is unbearably slow; I can't even tell the difference between 1 GPU and 4 GPUs from the training speeds. I changed num_return_sequences here to 10, and I'm not sure if this modification is appropriate at the moment. My plan is to run big_model_quantized_probing.py to generate the function signatures, run score_and_filter_signature.py to filter the signatures, and run big_model_quantized_probing_continue.py to decode the remainder of the generated functions. I'm not quite sure if I'm on the right track. Looking forward to your guidance!
When I typed in the following command,
torchrun --nproc_per_node=4 run_casp.py scripts/configs/train_casp_moco.yaml
I found that the training was progressing too slowly, taking over 20 hours.
Here's the GPU usage after I connected wandb.
I want to know why it's so slow. Is it supposed to take this long, or is there some other problem?
By the way, could you please consider providing the corresponding docker image? I would be very grateful.