Open YuningJ opened 5 years ago
Was this issue ever resolved?
Sorry about the late response.
1. cpe_name_dic.py
cpe_name_dic.py contains a dictionary with software names as keys and the corresponding software versions as values, parsed from the NVD CPE list.
Download the up-to-date official-cpe-dictionary_v2.3.xml from https://nvd.nist.gov/products/cpe and run the following command to generate cpe_name_dic.py:
python data_collection/cpe_dic_parser.py
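The parsing step can be sketched roughly as below. This is a hypothetical simplification, not the actual cpe_dic_parser.py: it assumes the dictionary's cpe23-item entries carry names of the form cpe:2.3:part:vendor:product:version:... and keys the result by product name only.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def parse_cpe_dictionary(xml_text):
    """Build {software_name: [versions]} from a CPE 2.3 dictionary (sketch)."""
    root = ET.fromstring(xml_text)
    name_to_versions = defaultdict(list)
    for elem in root.iter():
        # cpe23-item elements carry names like cpe:2.3:a:vendor:product:version:...
        if elem.tag.endswith('cpe23-item'):
            fields = elem.get('name', '').split(':')
            if len(fields) > 5 and fields[2] == 'a':  # applications only
                product, version = fields[4], fields[5]
                if version not in ('*', '-') and version not in name_to_versions[product]:
                    name_to_versions[product].append(version)
    return dict(name_to_versions)
```

The resulting dict can then be written out as a Python literal, e.g. `f.write('cpe_name_dic = ' + repr(d))`, which matches the "a .py file containing one dictionary" pattern this project uses.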
2. clean_version_and_measure.py
clean_version_and_measure.py contains the function clean_version_dict(). Please refer to 6.
3. raw_cvedetails_software_list
raw_cvedetails_software_list contains all the product names archived by cvedetails.com. Run the following command to generate raw_cvedetails_software_list and clean_cvedetails_software_list:
python data_collection/obtain_cvedetails_products.py
An alternative for software list is the collection of the keys of cpe_name_dic.py.
4. corpus_and_embedding.clean_cvedetails_software_list
Please refer to 3.
5. config.clean_cvedetails_software_file_name
config.clean_cvedetails_software_file_name is the name of the Python file which contains only raw_cvedetails_software_list in 3.
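A config entry naming a list-only Python file is typically consumed via a dynamic import. A minimal sketch, assuming the file stores a single list named raw_cvedetails_software_list (the config value shown here is a hypothetical example):

```python
import importlib

# Hypothetical value -- in the project it comes from config.py.
clean_cvedetails_software_file_name = 'clean_cvedetails_software_list'

def load_software_list(module_name=clean_cvedetails_software_file_name):
    """Import the generated file and return the software list it defines."""
    module = importlib.import_module(module_name)
    return module.raw_cvedetails_software_list
```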
6. clean_version_dict
The function clean_version_dict() computes the match rate of the software name-version pairs extracted from two reports. All the key functions called by clean_version_dict() are located in ./measurement.
The code for measurement, including clean_version_dict(), is still being optimised to make it more modular and easier to understand.
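The match-rate idea can be illustrated with a small stand-in. This is not the project's clean_version_dict() (whose exact normalisation is not documented here), just the core computation under the assumption that a "match" means the same (name, version) pair appears in both reports:

```python
def pair_match_rate(pairs_a, pairs_b):
    """Fraction of (name, version) pairs extracted from report A that also
    appear in report B. Illustrative stand-in for clean_version_dict()."""
    if not pairs_a:
        return 0.0
    normalized_b = {(name.lower(), version) for name, version in pairs_b}
    matched = sum(1 for name, version in pairs_a
                  if (name.lower(), version) in normalized_b)
    return matched / len(pairs_a)
```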
7. dir1 = '/Users/yingdong/Desktop/vulnerability_data/project_data/ner_re_dataset/ner_data_input/memc_full_duplicate' in data_preparation.py
I used that file to test the function generate_re_data_for_ner_output() while developing the code; please ignore it. The code snippet is now commented out.
generate_re_data_for_ner_output() generates all the available software name-version pairs for the RE model, given the software names and versions extracted by the NER model, and is only called in the function init_test_data() in initial_RE.py.
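"All the available software name-version pairs" suggests a cartesian product of the NER-extracted names and versions, which the RE model then classifies. A hypothetical simplification of generate_re_data_for_ner_output() (the real function and its data format are in the repository):

```python
from itertools import product

def generate_re_candidates(names, versions):
    """Pair every NER-extracted software name with every extracted version;
    the RE model then decides which pairs are genuine relations."""
    return list(product(names, versions))
```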
8. words300.lst
In config.py,
hash_file = word_emb_path + 'words' + str(word_emb_dim_ner) + '.lst'
emb_file = word_emb_path + 'embeddings' + str(word_emb_dim_ner) + '.txt'
word_emb_path_and_name = word_emb_path + 'word_emb' + str(word_emb_dim_re) + '.txt'
hash_file (words300.lst) is the word list file -- each line is a word in the corpus. emb_file is the embedding file -- each line is the vector/embedding of the word on the corresponding line of the hash file. In the word_emb_path_and_name file, each line is a word concatenated with its embedding.
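Given those formats, loading the two aligned files into a lookup table is straightforward. A minimal sketch, assuming one word per line in the hash file and one whitespace-separated vector per line in the embedding file:

```python
def load_embeddings(hash_path, emb_path):
    """Zip the word list (one word per line) with the embedding file
    (one whitespace-separated vector per line) into a {word: vector} dict."""
    with open(hash_path) as f:
        words = [line.strip() for line in f if line.strip()]
    with open(emb_path) as f:
        vectors = [[float(x) for x in line.split()] for line in f if line.strip()]
    assert len(words) == len(vectors), 'hash_file and emb_file must align line by line'
    return dict(zip(words, vectors))
```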
You can train the word embeddings easily on a CPU with the following steps.
- Crawl the vulnerability reports -- Run
python data_collection/cvedetails_crawler.py
- Build the corpus of vulnerability reports -- Run
python data_collection/corpus.py
- Try state-of-the-art word embedding approaches -- Try FastText (https://github.com/facebookresearch/fastText) or Word2Vec (https://github.com/danielfrg/word2vec), and many more...
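Whatever tool trains the vectors, the last step is dumping them into the three file shapes config.py expects. A sketch, assuming the trained model yields a plain {word: vector} mapping (e.g. exported from FastText or Word2Vec):

```python
def write_embedding_files(word_vectors, dim, out_dir='.'):
    """Write wordsN.lst (one word per line), embeddingsN.txt (aligned vectors),
    and word_embN.txt (word concatenated with its vector) from trained vectors."""
    words = sorted(word_vectors)
    with open(f'{out_dir}/words{dim}.lst', 'w') as f:
        for w in words:
            f.write(w + '\n')
    with open(f'{out_dir}/embeddings{dim}.txt', 'w') as f:
        for w in words:
            f.write(' '.join(str(x) for x in word_vectors[w]) + '\n')
    with open(f'{out_dir}/word_emb{dim}.txt', 'w') as f:
        for w in words:
            f.write(w + ' ' + ' '.join(str(x) for x in word_vectors[w]) + '\n')
```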
In #8, it seems there is some missing code:
Running python data_collection/cvedetails_crawler.py fails to write anything in the specified DATASET directory.
In the code below:
def crawl_reports_by_refs(self):
    dict_to_write = dict()
    ref_files = os.listdir(self.cve_ref_dir)
    args = []
    with utils.add_path(self.cve_ref_dir):
        for each_file in ref_files:
            category_module = __import__(each_file.replace('.py', ''))
            cve_ref_dict = category_module.cve_ref_dict
            for cve_id in cve_ref_dict:
                args.append((cve_id, cve_ref_dict[cve_id], dict_to_write))
    self.pool.map_async(crawl_report, args)
    with open(self.data_dir + 'dataset.py', 'w') as f_write:
        f_write.write('version_dict = ' + str(dict_to_write))
dict_to_write is never written to, and the function eventually writes out an empty dictionary.
Relatedly, in self.pool.map_async(crawl_report, args):
def crawl_report(args):
    cve_id, ref_link_list, dict_to_write = args
    for ref_link in ref_link_list:
        # if ref_link != 'http://www.securitytracker.com/id/1041303':
        #     continue
        report_category = get_report_category(ref_link)
        if report_category == -1:
            continue
        cve_link = 'https://cve.mitre.org/cgi-bin/cvename.cgi?name=' + cve_id
        target_content_dic = dict()
        if report_category == 1:
            target_content_dic = {'cve_id': [cve_id], 'title': '',
                                  'content': dict_to_write[cve_id]['cve'][cve_link]['content']}
        else:
            target_content_dic = obtain_report_type_and_crawl(ref_link, report_category)
        if target_content_dic in [dict(), None]:
            continue
        # update_version_dict_and_report_dict_by_ref(cve_id, ref_link, report_category, target_content_dic,
        #                                            version_dict, dict_to_write)
An update function is commented out, and its definition does not exist in the codebase.
Any advice? The DATASET is required for building the subsequent corpus.
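For what it's worth, two things in the snippet would produce exactly this symptom: map_async() returns immediately (nothing waits for the workers to finish before dataset.py is written), and each worker process receives its own copy of dict_to_write, so its writes never reach the parent. A sketch of one possible fix using a Manager dict and a blocking map -- an assumption about the intended design, not the project's actual fix, with the crawling logic stubbed out:

```python
import multiprocessing as mp

def crawl_report(args):
    cve_id, ref_link_list, dict_to_write = args
    # ... real crawling of each ref_link would happen here ...
    dict_to_write[cve_id] = {'refs': ref_link_list}  # visible to the parent via Manager

def crawl_all(cve_ref_dict, out_path='dataset.py'):
    with mp.Manager() as manager:
        dict_to_write = manager.dict()  # proxy dict shared across processes
        args = [(cve_id, refs, dict_to_write) for cve_id, refs in cve_ref_dict.items()]
        with mp.Pool(2) as pool:
            pool.map(crawl_report, args)  # blocks until done, unlike map_async()
        with open(out_path, 'w') as f:
            f.write('version_dict = ' + str(dict(dict_to_write)))
```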
The function used for writing to dict_to_write has now been added in data_collection/cvedetails_crawler.py.
Two corpus files are provided in the directory corpus/:
raw_corpus.txt is the concatenation of the content of the crawled unstructured reports. clean_corpus.txt additionally applies some extra data cleaning; clean_corpus.txt is the one used in this work. The performance using raw_corpus.txt was not tested due to hardware limitations.
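The "extra data cleaning" step is not specified in the thread; as a purely illustrative guess, per-line cleaning of a crawled report corpus often looks like:

```python
import re

def clean_line(line):
    """Illustrative cleaning only -- the real clean_corpus.txt pipeline is not
    documented here. Lowercase, drop non-printable/non-ASCII characters,
    collapse runs of whitespace."""
    line = line.lower()
    line = re.sub(r'[^\x20-\x7e]', ' ', line)  # strip control/non-ASCII chars
    return re.sub(r'\s+', ' ', line).strip()
```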
emb_file is the embedding file -- each line is the vector/embedding of the word on the corresponding line of the hash file. The embedding dimension is 300, so the vectors are very big; each line can't represent each vector.
For #8 (generating hash_file), when running python data_collection.py, I am getting the following:
category_module = __import__(category)
ModuleNotFoundError: No module named 'memc'
Where is this module expected to be defined?
Hello, can you write up the complete training process? I haven't succeeded even after reading this for a long time.
Hello, the file for the word embedding seems to be missing. I received the following error: FileNotFoundError: [Errno 2] No such file or directory: '/Users/yingdong/Desktop/ying/data/corpus_and_embeddings/embeddings/words300.lst'
To be more detailed, the following files/packages are missing:
Furthermore, you hardcoded the path dir1 = '/Users/yingdong/Desktop/vulnerability_data/project_data/ner_re_dataset/ner_data_input/memc_full_duplicate' in data_preparation.py. Was it another missing file, or something else?
Thanks. Your sharing and help are appreciated. Best, Yuni