znhy1024 / HEARD


Data processing Error #2

Closed AlphaLaser closed 1 year ago

AlphaLaser commented 2 years ago

Hello

I downloaded the PHEME dataset and arranged it in this format:

{
    'eid': '0',  # Rumours
    'info': {
        '552783238415265792': {  # TID
            'text': 'Breaking....cartoons',
            'time': '01-07-15 11:06:08'},
        '552783667052167168': {  # TID
            'text': 'France....witnesses',
            'time': '01-07-15 11:07:51'},
        ...
    }
}

This is the format required by the function get_timeline in data_process.py

data = {
     eid:label,info:{tid:{text,time}}
}

However, this format only accounts for one label (Rumours).

How do I add another eid for non_rumours?

znhy1024 commented 2 years ago

Hi,

Rumour and Non-Rumour have the same format.

AlphaLaser commented 2 years ago

So a different dictionary for each?

znhy1024 commented 2 years ago

You may put them in one dictionary.

AlphaLaser commented 2 years ago

But isn't there only one eid label per dictionary?

Also, when I pass the dataset to the get_timeline function, I get the expected output.

However, when I pass the input to the second function, I get an error. What is the problem here?

My input format:

{
    'eid': '0',  # Rumours
    'info': {
        '552783238415265792': {  # TID
            'text': 'Breaking....cartoons',
            'time': '01-07-15 11:06:08'},
        '552783667052167168': {  # TID
            'text': 'France....witnesses',
            'time': '01-07-15 11:07:51'},
        ...
    },
    'timeline': ['01-07-15 11:06:08', .... , '01-07-15 11:07:51']
}

Here is the function from data_process.py and the format you requested in your code.

# data = {
#     eid:label,info:{tid:{text,time}},timeline:[time]
# }

def timeline_convert_merge_post(data, interval=10):

    for eid, _ in data.items():
        timeLine = data[eid]['timeline']
        texts = data[eid]['texts']

        # start index of every merge window: 0, interval, 2*interval, ...
        merge_index = list(range(len(timeLine)))[0::interval]
        merge_texts, merge_times = [], []
        for i, index in enumerate(merge_index):
            try:
                next_index = merge_index[i + 1]
            except IndexError:
                # last window: any index past the end slices to the end
                next_index = index + len(timeLine) + 2
            assert next_index != index
            merge_text = [x for x in texts[index:next_index]]
            merge_time = [x for x in timeLine[index:next_index]]

            merge_texts.append(merge_text)
            merge_times.append(merge_time)

        data[eid]['merge_seqs'] = {'merge_times': merge_times, 'merge_texts': merge_texts}
    return data

I'm getting an error when I pass the input to this function. Am I doing something wrong?

znhy1024 commented 2 years ago

The dictionary contains all instances for a dataset:

data = {
    eid0: {label: label, info: {tid: {text, time}}},
    eid1: {label: label, info: {tid: {text, time}}},
    ......
}

For the error, you may need to convert the time to a timestamp. If that doesn't work, could you please post the error information?

AlphaLaser commented 2 years ago

Hi

This is the format that I have updated to:

{
    'eid0': {
        'label': '0',
        'info': {
            '552783238415265792': {
                'texts': 'Breaking: At least 10 dead, 5 injured after tO gunman open fire in offices of Charlie  Hebdo,satirical mag that published Mohammed cartoons',
                'timeline': '01-07-15 11:06:08'}}},

    'eid1': {
        'label': '0',
        'info': {
            '552783667052167168': {
                'texts': 'France: 10 people dead after shooting at HQ of satirical weekly newspaper #CharlieHebdo, according to witnesses http://t.co/FkYxGmuS58',
                'timeline': '01-07-15 11:07:51'}}},

    'eid2': {
        'label': '0',
        'info': {
            '552783745565347840': {
                'texts': 'Ten killed in shooting at headquarters of French satirical weekly Charlie Hebdo, says French media citing witnesses #c4news',
                'timeline': '01-07-15 11:08:09'}}},

    ...
}

However, I'm still getting an error from the timeline_convert_merge_post function.

Here's the error


KeyError                                  Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_11008/3508134719.py in <module>
----> 1 timeline_convert_merge_post(n_data)

~\AppData\Local\Temp/ipykernel_11008/1868279692.py in timeline_convert_merge_post(data, interval)
      2
      3     for eid, _ in data.items():
----> 4         timeLine = data[eid]['timeline']
      5         texts = data[eid]['texts']
      6

KeyError: 'timeline'

AlphaLaser commented 2 years ago

However, I tried another format and the function ran on it without errors. Is this the output I'm supposed to get?

{'eid0': {
    'label': '0',
    'texts': 'Breaking: At least 10 dead, 5 injured after tO gunman open fire in offices of Charlie  Hebdo,satirical mag that published Mohammed cartoons',
    'timeline': '01-07-15 11:06:08',
    'merge_seqs': {
        'merge_times': [
            ['0', '1', '-', '0', '7', '-', '1', '5', ' ', '1'],
            ['1', ':', '0', '6', ':', '0', '8']],
        'merge_texts': [
            ['B', 'r', 'e', 'a', 'k', 'i', 'n', 'g', ':', ' '],
            ['A', 't', ' ', 'l', 'e', 'a', 's', 't', ' ', '1', '0', ' ', 'd', 'e', 'a', 'd', ',', ' ', '5']]},

...

}
znhy1024 commented 2 years ago

Before passing the input to function timeline_convert_merge_post, the data format should be:

data = {eid0:{label,info:{tid:{text,time}},timeline:[time...]},...}

where each key-value pair is an instance that contains multiple posts relevant to a claim.

In your input format, each key-value pair is only a single post. To fix this, first put all instances into the data dictionary:

  1. Convert the dataset to a data dict: data = { eid0:{label,info:{tid0:{text0,time0},tid1:{text1,time1},...}},...}
  2. For each key-value pair in the above dict, call function get_timeline, then you could update the data dict as: data = { eid0:{label,info:{tid0:{text0,time0},tid1:{text1,time1},...},timeline:[time0,time1,...]},...}
  3. Finally, pass the updated data dict to function timeline_convert_merge_post

Note that the time for a post needs to be converted to a Unix timestamp, e.g., not 01-07-15 11:06:08 but 1420599968.
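
For illustration, a minimal sketch of steps 1 and 2 (load_pheme_instances is a hypothetical loader, sketched further down this thread; in the repo the timeline comes from get_timeline in data_process.py, so the inline sort here is only a stand-in):

# Hedged sketch of steps 1 and 2 above. load_pheme_instances() is a
# hypothetical helper, and post times are assumed to already be Unix
# timestamps (see the datetime sketch later in this thread).
data = {}
for i, (label, posts) in enumerate(load_pheme_instances('pheme_root')):
    data[f'eid{i}'] = {'label': label, 'info': posts}

for eid in data:
    # step 2: one sorted timeline per instance; in the repo this is
    # produced by get_timeline in data_process.py
    data[eid]['timeline'] = sorted(p['time'] for p in data[eid]['info'].values())

# step 3: pass the updated dict to timeline_convert_merge_post(data)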

AlphaLaser commented 2 years ago

Thank you, that's very helpful.

So there need to be only 2 eids, right? One for rumours and one for non-rumours.

This should be the format before passing it to get_timeline, right?

data = {
    "eid0": { # Rumours
        "label": "0",
        "info": {
            ... # All rumour texts and dates
        }
    },

    "eid1": { # Non Rumours
        "label": "1",
        "info": {
            ... # All non-rumour texts and dates

        }
    }
}
znhy1024 commented 2 years ago

Not only 2 eids. eid is the id for one instance, i.e., one claim (rumour or non-rumour) and its relevant posts. In the PHEME dataset, there are 1972 rumours and 3830 non-rumours, so the number of eids should be 5802.

AlphaLaser commented 2 years ago

Ohh, I think I have a better idea of it now.

Thank you. I'll try it and send an update :)

AlphaLaser commented 2 years ago

[screenshot: PHEME dataset folder structure]

Since this is the structure of PHEME, shouldn't there be 5802 TIDs instead of 5802 EIDs, as each instance/claim has only one source tweet attached to it?

So shouldn't there be 5 EIDs for 5 classes and 5802 TIDs for 5802 tweets?

znhy1024 commented 2 years ago

One claim in the PHEME dataset has a source tweet and multiple reaction tweets, so all information from these tweets should be put in the info dict of one eid's value.

One event (e.g., charliehebdo) in the PHEME dataset has multiple rumour claims and non-rumour claims; we merge all of these claims into the data dict.

Therefore, we consider all information in each folder as one instance.

To better understand the PHEME dataset, refer to link.
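
For illustration, here is one way the hypothetical load_pheme_instances() helper from the earlier sketch could walk that layout (directory names are assumed from the standard PHEME release, and the label convention, 1 for rumour, follows the HEARD README):

import glob
import json
import os

# Hedged sketch: one instance per claim folder, assuming the usual PHEME
# layout <root>/<event>/{rumours,non-rumours}/<thread_id>/ with
# source-tweets/ and reactions/ subfolders of tweet JSON files.
def load_pheme_instances(root):
    for event in os.listdir(root):
        for label_dir, label in (('rumours', '1'), ('non-rumours', '0')):
            for thread in glob.glob(os.path.join(root, event, label_dir, '*')):
                posts = {}
                for path in glob.glob(os.path.join(thread, '*', '*.json')):
                    with open(path) as f:
                        tweet = json.load(f)
                    # id_str/text/created_at are standard Twitter JSON fields;
                    # created_at still needs converting to a Unix timestamp
                    posts[tweet['id_str']] = {'text': tweet['text'],
                                              'time': tweet['created_at']}
                yield label, posts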

AlphaLaser commented 2 years ago

Hello!

Thank you so much for your help so far.

I think I finally got the format right because all 3 functions are working now.

This is my final output after passing it to all 3 functions:


data = {

    "eid0": {
        'label': '0',
        'merge_seqs': {
            'merge_times': [
                ['01-07-15 11:06:08',
                 '01-07-15 11:24:15',
                 '01-07-15 11:31:37',
                 '01-07-15 11:38:37',
                 '01-07-15 11:45:32',
                 '01-07-15 12:32:16',
                 '01-07-15 12:32:50',
                 '01-07-15 12:32:54',
                 '01-07-15 12:43:31',
                 '01-07-15 13:00:29']],
            'merge_vecs': [[0.0, 0.0, 0.0, 0.12591666134496624, 0.0 ....... 0.0, 0.0, 0.0]]}},

    "eid1": {

    ...

However, you mentioned converting the times to timestamps. How do I do that, since the functions don't do it automatically? (Also, is it fine to have several 0.0 values in merge_vecs?)

znhy1024 commented 2 years ago

You could check the datetime library, which has a function for converting the time to a timestamp. It's OK to have 0.0 values since the tf-idf vector is sparse.
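
For example, something along these lines (a sketch; the '%m-%d-%y %H:%M:%S' format string is an assumption based on the dates shown above, so adjust it to your data):

from datetime import datetime

# Sketch: parse the post time and convert it to a Unix timestamp.
# The format string assumes month-day-year, as in '01-07-15 11:06:08'
# (7 January 2015); adjust it if your dates are ordered differently.
def to_timestamp(time_str):
    return datetime.strptime(time_str, '%m-%d-%y %H:%M:%S').timestamp()

print(to_timestamp('01-07-15 11:06:08'))  # e.g. 1420628768.0 (timezone-dependent)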

AlphaLaser commented 2 years ago

It worked. Thank you!

I serialized it to JSON, saved it as a pickle, and updated config.json:

"evaluate_only": false,
"data": "data/PHEME.pkl",
"data_ids":"data/BEARD_ids.pkl", // What file do I put here
"device": "cuda",
"dataset":"PHEME",
"model_dir": "saved_models/"

However, data_ids.pkl is required for Main.py to run. What file will go there? Such a file doesn't exist anywhere in the repository.

znhy1024 commented 2 years ago

The format of data id file should be:

{
    'val': [eid, eid, eid, ....],
    'fold0': {'test': [eid, eid, eid, ....], 'train': [eid, eid, eid, ....]},
    'fold1': {'test': [eid, eid, eid, ....], 'train': [eid, eid, eid, ....]},
    'fold2': {'test': [eid, eid, eid, ....], 'train': [eid, eid, eid, ....]},
    'fold3': {'test': [eid, eid, eid, ....], 'train': [eid, eid, eid, ....]},
    'fold4': {'test': [eid, eid, eid, ....], 'train': [eid, eid, eid, ....]}
}

You may need to split the dataset and obtain the ids.pkl file following the above format.
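
A minimal sketch of one such split, assuming eids is the list of all instance ids and the setup described later in the thread (a 20% validation hold-out, then five random 3:1 train/test splits of the rest):

import pickle
import random

# Hedged sketch: hold out 20% of eids for validation, then make five
# random 3:1 train/test splits of the remainder. eids is assumed to be
# the list of all instance ids; the output path is illustrative.
random.shuffle(eids)
n_val = len(eids) // 5
data_ids = {'val': eids[:n_val]}

rest = eids[n_val:]
for k in range(5):
    random.shuffle(rest)
    n_test = len(rest) // 4  # 1 part test to 3 parts train
    data_ids[f'fold{k}'] = {'test': rest[:n_test], 'train': rest[n_test:]}

with open('data/PHEME_ids.pkl', 'wb') as f:
    pickle.dump(data_ids, f)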

AlphaLaser commented 2 years ago

Does each fold represent a different category?


How is val different?

znhy1024 commented 2 years ago

We hold out 20% of instances for validation, and the rest are randomly split 5 times (the folds here) with a 3:1 training/test ratio. Refer to our paper for the experiment setup.

AlphaLaser commented 2 years ago

Got it!

I have formatted the ids like this:

data_ids = {

    "val": ["eid736", "eid876" ... ],  # 1160 elements
    "fold0": {
        'test': ["eid8", "eid837", "eid3445" ... ], # 232 elements
        'train': ["eid947", "eid2736", "eid745" ... ] # 696 elements
    },

    "fold1": {
        'test': [ ... ], # 232 elements
        'train': [ ... ] # 696 elements
    },

    "fold2": {
        'test': [ ... ], # 232 elements
        'train': [ ... ] # 696 elements
    },

    "fold3": {
        'test': [ ... ], # 232 elements
        'train': [ ... ] # 696 elements
    },

    "fold4": {
        'test': [ ... ], # 234 elements (Because 2 extra elements were left)
        'train': [ ... ] # 696 elements
    }
}

Then I converted it to JSON and saved it as a .pkl file.

import json
import pickle

json_output = json.dumps(data_ids)

# note: this pickles the JSON string, not the dict itself
with open('data_ids.pkl', 'wb') as outfile:
    pickle.dump(json_output, outfile)

Is this fine ?

znhy1024 commented 2 years ago

I think yes. You could try it and check the printed output.

AlphaLaser commented 2 years ago

I'm getting the following error

pid: 384 2022-11-14 18:16:43
{'active_model': 'HEARD', 'models': {'HEARD': {'early_stop_lr': 1e-05, 'early_stop_patience': 6,
'hyperparameters': {'learning_rate': {'RD': 0.0002, 'HC': 0.0002}, 'max_seq_len': 100, 'max_post_len': 300,
'batch_size': 16, 'epochs': 12, 'lstm_dropout': 0.1, 'fc_dropout': 0.3, 'beta': {'HC': 1.0, 'T': 1.0, 'N': 1.0},
'hidden_size_HC': 64, 'hidden_size_RD': 128, 'in_feats_HC': 1, 'in_feats_RD': 1000, 'sample_integral': 100,
'sample_pred': 100, 'weight_decay': 0.0001, 'interval': 3600.0, 'decay_patience': 3, 'lstm_layers': 1},
'evaluate_only': False, 'data': 'Data/data.pkl', 'data_ids': 'Data/pheme_ids.pkl', 'device': 'cpu',
'dataset': 'PHEME', 'model_dir': 'saved_models/'}}}

[+]Start training fold0: 2022-11-14 18:16:46
start training 0 epoch: 2022-11-14 18:16:46

Traceback (most recent call last):
  File "/content/drive/MyDrive/HEARD/Main.py", line 42, in <module>
    main()
  File "/content/drive/MyDrive/HEARD/Main.py", line 27, in main
    model,best_params = handle.train_HEARD(fold,train_loader,val_loader)
  File "/content/drive/MyDrive/HEARD/Train.py", line 107, in train_HEARD
    for Batch_data in train_loader:
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1376, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1402, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.7/dist-packages/torch/_utils.py", line 461, in reraise
    raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/content/drive/MyDrive/HEARD/Dataset.py", line 61, in __getitem__
    merge_tids = seqs['merge_tids'][:self.max_seq_len]
KeyError: 'merge_tids'

According to README.md, the data format doesn't contain any key called merge_tids?

# From README.md

{
    "eid": {
        "label": "1",  # 1 for rumor, 0 for non-rumor
        "merge_seqs": {
            "merge_times": [[timestamp, timestamp, ...], [timestamp, timestamp, ...], ...],
            "merge_vecs": [[...], [...], ...],  # tf-idf vecs[1000] for each interval, so the shape of merge_vecs should be [num of intervals, 1000]
        }},
    ...
}
znhy1024 commented 2 years ago

Hi,

Thanks! The corresponding change has been committed. You could add the merge_tids key to the dict following the same merging strategy as merge_times and merge_vecs (they come from the same tweet).
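
A minimal sketch of that merging strategy, assuming each instance also keeps a tids list ordered the same way as its timeline (the tids key and the helper name are illustrative, not from the repo):

# Hedged sketch: group tweet ids with the same interval-based windows
# used for merge_times/merge_texts. Assumes data[eid]['tids'] is a list
# of tweet ids ordered identically to data[eid]['timeline'], and should
# run after timeline_convert_merge_post so merge_seqs already exists.
def add_merge_tids(data, interval=10):
    for eid in data:
        tids = data[eid]['tids']  # hypothetical parallel list of tweet ids
        merge_index = list(range(len(tids)))[0::interval]
        merge_tids = []
        for i, index in enumerate(merge_index):
            # last window runs to the end of the list
            next_index = merge_index[i + 1] if i + 1 < len(merge_index) else len(tids)
            merge_tids.append(tids[index:next_index])
        data[eid]['merge_seqs']['merge_tids'] = merge_tids
    return data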