Open YuningJ opened 5 years ago
Was this issue ever resolved?
Sorry about the late response.
1. cpe_name_dic.py
cpe_name_dic.py contains a dictionary with software names as keys and the corresponding software versions as values, parsed from the NVD CPE list.
Download the up-to-date official-cpe-dictionary_v2.3.xml from https://nvd.nist.gov/products/cpe and run the following command to generate cpe_name_dic.py:
python data_collection/cpe_dic_parser.py
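The parsing step can be sketched roughly as below. This is a hypothetical simplification, not the actual cpe_dic_parser.py: it assumes the dictionary's cpe23-item entries carry names of the form cpe:2.3:part:vendor:product:version:... and keys the result by product name only.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def parse_cpe_dictionary(xml_text):
    """Build {software_name: [versions]} from a CPE 2.3 dictionary (sketch)."""
    root = ET.fromstring(xml_text)
    name_to_versions = defaultdict(list)
    for elem in root.iter():
        # cpe23-item elements carry names like cpe:2.3:a:vendor:product:version:...
        if elem.tag.endswith('cpe23-item'):
            fields = elem.get('name', '').split(':')
            if len(fields) > 5 and fields[2] == 'a':  # applications only
                product, version = fields[4], fields[5]
                if version not in ('*', '-') and version not in name_to_versions[product]:
                    name_to_versions[product].append(version)
    return dict(name_to_versions)
```

The resulting dict can then be written out as a Python literal, e.g. `f.write('cpe_name_dic = ' + repr(d))`, which matches the "a .py file containing one dictionary" pattern this project uses.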
2. clean_version_and_measure.py
clean_version_and_measure.py contains the function clean_version_dict(). Please refer to 6.
3. raw_cvedetails_software_list
raw_cvedetails_software_list contains all the product names archived by cvedetails.com. Run the following command to generate raw_cvedetails_software_list and clean_cvedetails_software_list:
python data_collection/obtain_cvedetails_products.py
An alternative for software list is the collection of the keys of cpe_name_dic.py.
4. corpus_and_embedding.clean_cvedetails_software_list
Please refer to 3.
5. config.clean_cvedetails_software_file_name
config.clean_cvedetails_software_file_name is the name of the Python file which contains only raw_cvedetails_software_list in 3.
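A config entry naming a list-only Python file is typically consumed via a dynamic import. A minimal sketch, assuming the file stores a single list named raw_cvedetails_software_list (the config value shown here is a hypothetical example):

```python
import importlib

# Hypothetical value -- in the project it comes from config.py.
clean_cvedetails_software_file_name = 'clean_cvedetails_software_list'

def load_software_list(module_name=clean_cvedetails_software_file_name):
    """Import the generated file and return the software list it defines."""
    module = importlib.import_module(module_name)
    return module.raw_cvedetails_software_list
```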
6. clean_version_dict
The function clean_version_dict() computes the match rate of the software name-version pairs extracted from two reports. All the key functions called by clean_version_dict() are located in ./measurement.
The code for measurement, including clean_version_dict(), is still being optimised to make it more modular and easier to understand.
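The match-rate idea can be illustrated with a small stand-in. This is not the project's clean_version_dict() (whose exact normalisation is not documented here), just the core computation under the assumption that a "match" means the same (name, version) pair appears in both reports:

```python
def pair_match_rate(pairs_a, pairs_b):
    """Fraction of (name, version) pairs extracted from report A that also
    appear in report B. Illustrative stand-in for clean_version_dict()."""
    if not pairs_a:
        return 0.0
    normalized_b = {(name.lower(), version) for name, version in pairs_b}
    matched = sum(1 for name, version in pairs_a
                  if (name.lower(), version) in normalized_b)
    return matched / len(pairs_a)
```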
7. dir1 = '/Users/yingdong/Desktop/vulnerability_data/project_data/ner_re_dataset/ner_data_input/memc_full_duplicate' in data_preparation.py
I used that file to test the function generate_re_data_for_ner_output() while developing the code; please ignore it. The code snippet is now commented out.
generate_re_data_for_ner_output() generates all the available software name-version pairs for the RE model, given the software names and versions extracted by the NER model, and is only called in the function init_test_data() in initial_RE.py.
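"All the available software name-version pairs" suggests a cartesian product of the NER-extracted names and versions, which the RE model then classifies. A hypothetical simplification of generate_re_data_for_ner_output() (the real function and its data format are in the repository):

```python
from itertools import product

def generate_re_candidates(names, versions):
    """Pair every NER-extracted software name with every extracted version;
    the RE model then decides which pairs are genuine relations."""
    return list(product(names, versions))
```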
8. words300.lst
In config.py,
hash_file = word_emb_path + 'words' + str(word_emb_dim_ner) + '.lst'
emb_file = word_emb_path + 'embeddings' + str(word_emb_dim_ner) + '.txt'
word_emb_path_and_name = word_emb_path + 'word_emb' + str(word_emb_dim_re) + '.txt'
hash_file (words300.lst) is the word list file -- each line is a word in the corpus. emb_file is the embedding file -- each line is the vector/embedding of the word on the corresponding line of the hash file. In the word_emb_path_and_name file, each line is a word concatenated with its embedding.
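Given those formats, loading the two aligned files into a lookup table is straightforward. A minimal sketch, assuming one word per line in the hash file and one whitespace-separated vector per line in the embedding file:

```python
def load_embeddings(hash_path, emb_path):
    """Zip the word list (one word per line) with the embedding file
    (one whitespace-separated vector per line) into a {word: vector} dict."""
    with open(hash_path) as f:
        words = [line.strip() for line in f if line.strip()]
    with open(emb_path) as f:
        vectors = [[float(x) for x in line.split()] for line in f if line.strip()]
    assert len(words) == len(vectors), 'hash_file and emb_file must align line by line'
    return dict(zip(words, vectors))
```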
You can train the word embeddings easily on a CPU with the following steps.
- Crawl the vulnerability reports -- Run
python data_collection/cvedetails_crawler.py
- Build the corpus of vulnerability reports -- Run
python data_collection/corpus.py
- Try state-of-the-art word embedding approaches -- Try FastText (https://github.com/facebookresearch/fastText) or Word2Vec (https://github.com/danielfrg/word2vec), and many more...
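Whatever tool trains the vectors, the last step is dumping them into the three file shapes config.py expects. A sketch, assuming the trained model yields a plain {word: vector} mapping (e.g. exported from FastText or Word2Vec):

```python
def write_embedding_files(word_vectors, dim, out_dir='.'):
    """Write wordsN.lst (one word per line), embeddingsN.txt (aligned vectors),
    and word_embN.txt (word concatenated with its vector) from trained vectors."""
    words = sorted(word_vectors)
    with open(f'{out_dir}/words{dim}.lst', 'w') as f:
        for w in words:
            f.write(w + '\n')
    with open(f'{out_dir}/embeddings{dim}.txt', 'w') as f:
        for w in words:
            f.write(' '.join(str(x) for x in word_vectors[w]) + '\n')
    with open(f'{out_dir}/word_emb{dim}.txt', 'w') as f:
        for w in words:
            f.write(w + ' ' + ' '.join(str(x) for x in word_vectors[w]) + '\n')
```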
In #8, it seems there is some missing code:
Running python data_collection/cvedetails_crawler.py fails to write anything in the specified DATASET directory.
In the code below:
def crawl_reports_by_refs(self):
    dict_to_write = dict()
    ref_files = os.listdir(self.cve_ref_dir)
    args = []
    with utils.add_path(self.cve_ref_dir):
        for each_file in ref_files:
            category_module = __import__(each_file.replace('.py', ''))
            cve_ref_dict = category_module.cve_ref_dict
            for cve_id in cve_ref_dict:
                args.append((cve_id, cve_ref_dict[cve_id], dict_to_write))
    self.pool.map_async(crawl_report, args)
    with open(self.data_dir + 'dataset.py', 'w') as f_write:
        f_write.write('version_dict = ' + str(dict_to_write))
dict_to_write is never written to, and the function eventually writes out an empty dictionary.
Relatedly, in self.pool.map_async(crawl_report, args):
def crawl_report(args):
    cve_id, ref_link_list, dict_to_write = args
    for ref_link in ref_link_list:
        # if ref_link != 'http://www.securitytracker.com/id/1041303':
        #     continue
        report_category = get_report_category(ref_link)
        if report_category == -1:
            continue
        cve_link = 'https://cve.mitre.org/cgi-bin/cvename.cgi?name=' + cve_id
        target_content_dic = dict()
        if report_category == 1:
            target_content_dic = {'cve_id': [cve_id], 'title': '',
                                  'content': dict_to_write[cve_id]['cve'][cve_link]['content']}
        else:
            target_content_dic = obtain_report_type_and_crawl(ref_link, report_category)
        if target_content_dic in [dict(), None]:
            continue
        # update_version_dict_and_report_dict_by_ref(cve_id, ref_link, report_category, target_content_dic,
        #                                            version_dict, dict_to_write)
An update function is commented out, and its definition does not exist in the codebase.
Any advice? The DATASET is required for building the subsequent corpus.
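For what it's worth, two things in the snippet would produce exactly this symptom: map_async() returns immediately (nothing waits for the workers to finish before dataset.py is written), and each worker process receives its own copy of dict_to_write, so its writes never reach the parent. A sketch of one possible fix using a Manager dict and a blocking map -- an assumption about the intended design, not the project's actual fix, with the crawling logic stubbed out:

```python
import multiprocessing as mp

def crawl_report(args):
    cve_id, ref_link_list, dict_to_write = args
    # ... real crawling of each ref_link would happen here ...
    dict_to_write[cve_id] = {'refs': ref_link_list}  # visible to the parent via Manager

def crawl_all(cve_ref_dict, out_path='dataset.py'):
    with mp.Manager() as manager:
        dict_to_write = manager.dict()  # proxy dict shared across processes
        args = [(cve_id, refs, dict_to_write) for cve_id, refs in cve_ref_dict.items()]
        with mp.Pool(2) as pool:
            pool.map(crawl_report, args)  # blocks until done, unlike map_async()
        with open(out_path, 'w') as f:
            f.write('version_dict = ' + str(dict(dict_to_write)))
```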
The function used for writing to dict_to_write has now been added in data_collection/cvedetails_crawler.py.
Two corpus files are provided in the directory corpus/:
raw_corpus.txt is the concatenation of the content of the crawled unstructured reports. clean_corpus.txt additionally applies some extra data cleaning; clean_corpus.txt is the one used in this work. The performance using raw_corpus.txt was not tested due to hardware limitations.
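The "extra data cleaning" step is not specified in the thread; as a purely illustrative guess, per-line cleaning of a crawled report corpus often looks like:

```python
import re

def clean_line(line):
    """Illustrative cleaning only -- the real clean_corpus.txt pipeline is not
    documented here. Lowercase, drop non-printable/non-ASCII characters,
    collapse runs of whitespace."""
    line = line.lower()
    line = re.sub(r'[^\x20-\x7e]', ' ', line)  # strip control/non-ASCII chars
    return re.sub(r'\s+', ' ', line).strip()
```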
emb_file is the embedding file -- each line is the vector/embedding of the word on the corresponding line of the hash file. The embedding dimension is 300, so the vectors are very big; each line can't represent each vector.
For #8 (generating hash_file), when running python data_collection.py, I am getting the following:
category_module = __import__(category)
ModuleNotFoundError: No module named 'memc'
Where is this module expected to be defined?
Hello, can you write up the complete training process? I haven't succeeded even after reading this for a long time.
Hello, the file for the word embedding seems to be missing. I received the following error: FileNotFoundError: [Errno 2] No such file or directory: '/Users/yingdong/Desktop/ying/data/corpus_and_embeddings/embeddings/words300.lst'
To be more detailed, the following files/packages are missing:
Furthermore, you hardcoded the path dir1 = '/Users/yingdong/Desktop/vulnerability_data/project_data/ner_re_dataset/ner_data_input/memc_full_duplicate' in data_preparation.py. Was it another missing file, or something else?
Thanks. Your sharing and help are appreciated. Best, Yuni