ymirsky / VulChecker

A deep learning model for localizing bugs in C/C++ source code (USENIX'23)
GNU General Public License v3.0

Questions on sub-ePDG depth and normalization #9


yunhao-qian commented 6 months ago

Hi, thank you for the excellent paper and for open-sourcing the project. I am working on a new project that modifies and builds on your code, and I have two questions.

  1. I noticed a depth_limit option for controlling the depth of sub-ePDGs at runtime, but this option appears unused in the hector train command line, which would mean that the sub-ePDG of a manifestation point always includes all of its predecessor nodes. In that case, a sub-ePDG labelled malicious could be a subgraph of another sub-ePDG labelled benign. Are such situations expected, or should we avoid them early in the data preprocessing step?
  2. The VulChecker paper mentions a batch normalization layer between the GNN and the MLP classifier, but model.py contains an orphan BatchNorm module that is never used. Was there a reason for that change? Or was it replaced because the normalization performed by hector feature_stats achieves a similar goal?
gmacon commented 6 months ago
  1. The depth_limit is set in hector train, but by a somewhat convoluted path: a Predictor is constructed with its embedding_steps parameter set from the command line option --embedding-steps (default value: 4), and then predictor.depth_limit is passed to TrainingData.from_parameters. In the Predictor, the depth_limit attribute simply exposes the embedding_steps parameter.
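The wiring described above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual VulChecker code; the class and attribute names (Predictor, embedding_steps, depth_limit) come from the comment, but the method bodies are assumptions.

```python
class Predictor:
    """Illustrative sketch: depth_limit is just an alias for embedding_steps."""

    def __init__(self, embedding_steps: int = 4):
        # --embedding-steps from the hector train command line ends up here.
        self.embedding_steps = embedding_steps

    @property
    def depth_limit(self) -> int:
        # The sub-ePDG depth used for data loading is tied to the number of
        # GNN message-passing (embedding) steps.
        return self.embedding_steps


# predictor.depth_limit is then passed on to TrainingData.from_parameters.
predictor = Predictor(embedding_steps=4)
print(predictor.depth_limit)  # → 4
```

So even though hector train never mentions depth_limit by name, it is still constrained by the --embedding-steps option.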
  2. My memory on this point is a little vague, but I believe the intent was to replace the batch normalization in the model with ahead-of-time normalization performed during data loading. Given that, I think the unused BatchNorm module was left in by mistake.
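For illustration, the ahead-of-time normalization described above could look like the sketch below: statistics are computed once over the whole dataset (similar in spirit to what hector feature_stats produces) and applied while loading each sample, so no BatchNorm layer is needed in the model. The function names and shapes here are assumptions, not the project's actual API.

```python
import numpy as np

def compute_feature_stats(features: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """One ahead-of-time pass over the node-feature matrix (samples x features)."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0  # guard against constant features (avoid divide-by-zero)
    return mean, std

def normalize(features: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Applied during data loading, replacing a BatchNorm layer in the model."""
    return (features - mean) / std

# Usage: stats are fixed at training time and reused unchanged at inference.
X = np.array([[1.0, 2.0], [3.0, 2.0]])
m, s = compute_feature_stats(X)
X_norm = normalize(X, m, s)
```

One practical difference from BatchNorm: these statistics are frozen after the preprocessing pass, so normalization is identical across batches and between training and inference.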