Open ljk419511 opened 4 months ago
Thank you for your interest!
Your GPU utilization is indeed a bit low. That might be because per_device_train_batch_size is set to 8 in the provided config. If you double it to 16, the utilization should be around 100%.
On the other hand, for our downstream experiments we used the checkpoint after 10k steps, which means you can save further time by stopping early (terminate the process once it reaches that step count and the checkpoint has been evaluated and saved). Also, it seems you have more than 4 GPUs, so you could try allocating more GPUs for parallel training.
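As a rough sketch only (the actual settings live in scripts/configs/train_casp_moco.yaml), if the training script is driven by Hugging Face-style TrainingArguments, the two suggestions above map to something like:

from transformers import TrainingArguments

# Illustrative values only; adjust the corresponding fields in the yaml config instead.
args = TrainingArguments(
    output_dir="outputs/casp_moco",      # placeholder output path
    per_device_train_batch_size=16,      # doubled from 8 to raise GPU utilization
    max_steps=10_000,                    # stop at the checkpoint used for downstream experiments
    save_steps=10_000,                   # make sure that checkpoint is saved
)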
We will consider providing a docker image later. However, we are busy at the moment so you might need to wait for a while. Since you are already able to train the prober, I think the environment is not a big problem.
Thank you for your reply! But I still have some questions.
1. I wonder if these four commands are training the model step by step?
torchrun --nproc_per_node=4 run_casp.py scripts/configs/train_casp_moco.yaml
torchrun --nproc_per_node=4 run_prober.py scripts/configs/train_prober.yaml
accelerate launch --num_processes=4 big_model_quantized_probing.py scripts/configs/probe_quantized_codellama-34b-4bit-unfreeze.yaml
python score_and_filter_signature.py scripts/configs/filter_sig.yaml
Does it mean I need to execute the above four commands first to get a pretrained model if I want to do some inference?
2. I noticed you mentioned inference scripts.
We provide representative scripts for training and inference.
How should I start for inference?
3. If I want to do some inference (e.g., for Binary Summarization), what should be my inputs and the outputs of the model?
Sorry for the late response. We are quite busy at the moment.
run_casp.py is for training the cross-modal retriever; run_prober.py is for training the prober; big_model_quantized_probing.py is for inferring the signatures; score_and_filter_signature.py is for filtering the signatures; and big_model_quantized_probing_continue.py is for inferring the rest of the function body given the filtered signatures. If you want to reproduce the whole pipeline, you will need to run these scripts step by step and possibly modify some configurations.
Note that for the provided probing config (probe_quantized_codellama-34b-4bit-unfreeze.yaml) we are actually using the pre-trained checkpoint hosted on HF. If you have your own paired binary-source function corpus, we recommend using that corpus for prober training in order to adapt to its distribution, and then running inference with the new prober.
As for the input to big_model_quantized_probing.py, it is essentially a Hugging Face dataset containing a codeart field. The codeart field is a dictionary containing two subfields: code and dep. For how to preprocess and obtain such fields, please check our codeart repo: https://github.com/ziansu/codeart.
Hi, I want to ask about the difference between the retriever and the prober. My initial understanding of the whole work is that:
But the note below from the paper really confuses me.
We leverage nucleus sampling to first let the prober generate a relatively large set of source function signatures with high randomness (top-p = 0.75). We use idea similar to retrieval by training a binary-signature dual-encoder to rank generated signatures and filter out the noisy ones. Ultimately, we use the prober to further complete the remaining signatures with smaller randomness (top-p = 0.5).
It feels as if the retriever and the prober are two modules that work independently.
This has confused me for a long time. If there is anything wrong with my understanding, please correct me.
Hoping for your guidance!
The retriever and prober indeed work independently during inference. However, there is some overlap in the training process.
As you already understand how the similarity score is computed: our retrieval-augmented baseline only uses the retrieved top-k source functions, i.e., the source functions with the highest similarity scores, as additional context (which is quite straightforward). Note that there is no need to decode h_src, since we already have the source functions in the datastore (which consists of all the source functions in the training set); otherwise, we could not obtain h_src, the encoded source functions, in the first place.
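For concreteness, a minimal sketch of the dual-encoder similarity/top-k step, with made-up tensor names and sizes (not the repo's actual API):

import torch
import torch.nn.functional as F

h_bin = F.normalize(torch.randn(1, 256), dim=-1)        # encoded binary function (stand-in)
h_src = F.normalize(torch.randn(10_000, 256), dim=-1)   # datastore of encoded source functions (stand-in)

scores = h_bin @ h_src.T          # cosine similarity after L2 normalization
topk = scores.topk(k=5, dim=-1)   # the top-k source functions become additional context
print(topk.indices)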
There is no retrieval process for the prober; it is actually an encoder-decoder. The encoder part of the prober encodes x_asm once and obtains the intermediate hidden states (the Zs as in Figure 2). Then the intermediate hidden states are fed into the SCFM for decoding. The decoding has some randomness, so sampling produces diverse source code snippets.
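As a self-contained illustration of how nucleus sampling yields diverse candidates (t5-small is used purely as a stand-in here; the actual prober feeds its hidden states into the SCFM rather than text):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tok("summarize: the quick brown fox jumps over the lazy dog", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.75,              # high randomness: diverse candidates
    num_return_sequences=5,  # sample several, filter the noisy ones afterwards
    max_new_tokens=16,
)
print(tok.batch_decode(outputs, skip_special_tokens=True))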
The overlap in training is because both the retriever and the prober contain the codeart encoder (same architecture, but the prober's further alignment training makes the parameters different).
Note that we only use disassembled code for retrieval/probing; you can probably also use decompiled code and more information from the binary (e.g., LMPA uses some information from the dependent context for variable name recovery) in the future.
Hi,
Thank you for your patient guidance; I still have some questions.
What is the signature in this context? The Zs as in Figure 2? If so, why is it called a source code signature rather than an asm signature? And how is the signature obtained here, from the encoder part of the prober?
We leverage nucleus sampling to first let the prober generate a relatively large set of source function signatures with high randomness (top-p = 0.75).
What does a binary-signature dual-encoder mean? Does it mean using a binary-signature corpus to retrain a binary-signature dual-encoder?
We use idea similar to retrieval by training a binary-signature dual-encoder to rank generated signatures and filter out the noisy ones.
Does it mean the decoding stuff?
Ultimately, we use the prober to further complete the remaining signatures with smaller randomness (top-p = 0.5).
My goal is to use another dataset as input to ProRec to get function summaries. Right now I'm stuck: I don't know how to use your pretrained prober to take disassembled code as input and produce discomC with relevant context as output. And your code structure is so refined and precise that I often get lost. So do you have any suggestions for direct inference to get function summaries using the ProRec structure?
Sincere appreciation, and I look forward to your reply!
The signature here refers to the source-level function signature, e.g., int main(int argc, char *argv[]). We simply want to sample better source code generated by the prober, but generating the full function can be time- and GPU-memory-consuming for a large number of candidates, so in this work we first generate a short signature of the function, do some filtering, and then generate the following tokens. In practice, if you have enough computational resources or acceleration (there should be some inference acceleration frameworks for multimodal models by now, but how to support the prober within these frameworks will be a problem), it is possible to directly sample full source functions and filter them with the retriever.
run_casp_signature.py is similar to run_casp.py, with only some differences in collate_fn, so we can use the original binary-source dataset for binary-signature training.
For inference with the prober, the input should comply with the format of the codeart field in our dataset, which is a JSON-dumped dictionary of code and dep. For preprocessing into the codeart format, you can check https://github.com/ziansu/codeart/tree/artifacts/codeart/preprocess. And maybe @XZ-X can help answer some questions related to preprocessing.
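For reference, a hypothetical sketch of the shape of one such input record; only the overall structure is meant to be accurate, and the actual contents of code and dep are produced by the codeart preprocessing:

# Hypothetical record shape; the real contents come from the codeart preprocessing.
example = {
    "codeart": repr({
        "code": [...],   # disassembled-code field (placeholder)
        "dep": [...],    # dependence field (placeholder)
    })
}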
I am happy to help with preprocessing.
The preprocessing code for CodeArt is here, as Zian suggested before. Let me know if you have further questions.
Alternatively, we can also run the preprocessing part for you if that's a better solution. In that case, would you share with us the unstripped binaries? And we will share the preprocessed file. You can start your experiments from the training/inference part then.
So sorry for the late reply.
@XZ-X With the permission of the original author, I've put the dataset here.
I've finished formatting the data and got the jsonl file in the following format.
{
  "metadata": {
    "project_name": ...,
    "function_name": ...,
    "function_addr": ...,
    "binary_name": ...
  },
  "code": [...],
  "data_dep": [...]
}
{
  ...
}
It seems that I cannot access the google drive link. Please double check the access control.
Did you mean you've successfully run the preprocessing pipeline and got the jsonl file?
I'm sorry. It should be okay now.
Yes, I think I've finished the preprocessing and got the jsonl file in the format of the codeart field according to here. But the following error seems to indicate that further processing of the jsonl data is still required.
Traceback (most recent call last):
  File "/home/featurize/linjk/prorec/src/big_model_quantized_probing.py", line 461, in <module>
    main()
  File "/home/featurize/linjk/prorec/src/big_model_quantized_probing.py", line 444, in main
    results = batch_inference(
  File "/home/featurize/linjk/prorec/src/big_model_quantized_probing.py", line 203, in batch_inference
    [eval(example['codeart']) for example in examples]
  File "/home/featurize/linjk/prorec/src/big_model_quantized_probing.py", line 203, in <listcomp>
    [eval(example['codeart']) for example in examples]
KeyError: 'codeart'
Besides, I'd also like to ask what I should do as a follow-up for inferencing with the prober.
Looking forward to your reply.
You should convert the jsonl into a Hugging Face dataset.
from datasets import Dataset, DatasetDict

dataset = Dataset.from_dict({'codeart': [repr({'code': ..., 'dep': ...}) for ... in your_jsonl_list]})
dataset = DatasetDict({'test': dataset})
# you can either push it to the hub
dataset.push_to_hub('path_to_your_hub')
# or you can save it to disk
dataset.save_to_disk('path_to_dir')
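For concreteness, here is one possible way (not from the repo) to build your_jsonl_list from the jsonl file posted above and fill in the elided parts of the comprehension; it assumes your data_dep field is what the dep subfield expects, and the file path is a placeholder:

import json
from datasets import Dataset, DatasetDict

# Read the jsonl produced by preprocessing; 'functions.jsonl' is a placeholder path.
your_jsonl_list = []
with open('functions.jsonl') as f:
    for line in f:
        your_jsonl_list.append(json.loads(line))

# Map each record's code/data_dep into the repr-dumped dict the prober expects.
dataset = Dataset.from_dict({
    'codeart': [repr({'code': ex['code'], 'dep': ex['data_dep']}) for ex in your_jsonl_list]
})
dataset = DatasetDict({'test': dataset})
dataset.save_to_disk('path_to_dir')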
Then, you can run accelerate launch --num_processes=4 big_model_quantized_probing.py scripts/configs/probe_quantized_codellama-34b-4bit-unfreeze.yaml
with some modifications within the config .yaml file and potentially some lines in the .py file.
If you pushed the dataset to the hub, change the dataset_name field in the .yaml file to your dataset name on the hub. If you saved it to disk, change the dataset loading line in big_model_quantized_probing.py to dataset = load_from_disk('path_to_dir').
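Illustratively, the load-from-disk variant inside big_model_quantized_probing.py could look like this (the exact variable names in the script may differ):

from datasets import load_from_disk

dataset = load_from_disk('path_to_dir')   # instead of loading the dataset by name from the hub
examples = dataset['test']                # the split created in the DatasetDict above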
Whether I use 1 RTX 4090 or 4 RTX 4090s, it throws a CUDA out-of-memory error unless batch_size=1, and the training progress is unbearably slow; I can't even tell the difference between 1 GPU and 4 GPUs from the training speeds. I changed num_return_sequences here to 10, and I'm not sure if this modification is appropriate at the moment. My plan is to run big_model_quantized_probing.py to generate the function signatures, run score_and_filter_signature.py to filter the signatures, and run big_model_quantized_probing_continue.py to decode the remainder of the generated functions. I'm not quite sure if I'm on the right track. Looking forward to your guidance!
When I typed in the following command,
torchrun --nproc_per_node=4 run_casp.py scripts/configs/train_casp_moco.yaml
I found that the training was progressing too slowly, taking over 20 hours.
Here's the GPU usage after I connected wandb.
I want to know why it's so slow. Is it supposed to take this long, or is there some other problem?
By the way, could you please consider providing the corresponding docker image? I would be very grateful.