Closed AlphaLaser closed 1 year ago
Hi,
Rumour and Non-Rumour have the same format.
So, do I need a different dictionary for each?
You may put them in one dictionary.
But isn't there only one eid label per dictionary?
Also, when I pass the dataset to the get_timeline
function, I get the expected output.
However, when I pass the input for the second function, I get an error. What is the problem here?
My input format:
{
    'eid': '0',  # Rumours
    'info': {
        '552783238415265792': {  # TID
            'text': 'Breaking....cartoons',
            'time': '01-07-15 11:06:08'},
        '552783667052167168': {  # TID
            'text': 'France....witnesses',
            'time': '01-07-15 11:07:51'},
        ...
    },
    'timeline': ['01-07-15 11:06:08', .... , '01-07-15 11:07:51']
}
Here is the function from data_process.py, along with the format it requests in its comment.
# data = {
#   eid:label, info:{tid:{text,time}}, timeline:[time]
# }
def timeline_convert_merge_post(data, interval=10):
    for eid, _ in data.items():
        timeLine = data[eid]['timeline']
        texts = data[eid]['texts']
        # every `interval`-th position starts a new merged chunk
        merge_index = list(range(len(timeLine)))[0::interval]
        merge_texts, merge_times = [], []
        for i, index in enumerate(merge_index):
            try:
                next_index = merge_index[i + 1]
            except IndexError:
                # last chunk: pick an index past the end so the slice runs to the end
                next_index = index + len(timeLine) + 2
            assert next_index != index
            merge_text = [x for x in texts[index:next_index]]
            merge_time = [x for x in timeLine[index:next_index]]
            merge_texts.append(merge_text)
            merge_times.append(merge_time)
        data[eid]['merge_seqs'] = {'merge_times': merge_times, 'merge_texts': merge_texts}
    return data
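To see what the merging does, here is a self-contained toy run of the same slicing logic (a sketch; the toy timeline values are made up):

```python
# replicate the chunking: every `interval`-th position starts a new chunk
interval = 2
timeline = ['t0', 't1', 't2', 't3', 't4']
merge_index = list(range(len(timeline)))[0::interval]  # [0, 2, 4]

merge_times = []
for i, index in enumerate(merge_index):
    try:
        next_index = merge_index[i + 1]
    except IndexError:
        # past the end, so the final slice runs to the end of the list
        next_index = index + len(timeline) + 2
    merge_times.append(timeline[index:next_index])

print(merge_times)  # [['t0', 't1'], ['t2', 't3'], ['t4']]
```

With interval=10 (the default) an instance's posts are grouped ten at a time, which is where the per-interval sequences come from.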
I'm getting an error when I pass the input to this function. Am I doing something wrong ?
The dictionary contains all instances for a dataset:
data = {
    eid0: {label: label, info: {tid: {text, time}}},
    eid1: {label: label, info: {tid: {text, time}}},
    ......
}
For the error, you may need to convert the time to a timestamp. If that doesn't work, could you please post the error information?
Hi
This is the format that I have updated to:
{
'eid0': {
'label': '0',
'info': {
'552783238415265792': {
'texts': 'Breaking: At least 10 dead, 5 injured after tO gunman open fire in offices of Charlie Hebdo,satirical mag that published Mohammed cartoons',
'timeline': '01-07-15 11:06:08'
}
}
},
'eid1': {
    'label': '0',
    'info': {
        '552783667052167168': {
            'texts': 'France: 10 people dead after shooting at HQ of satirical weekly newspaper #CharlieHebdo, according to witnesses http://t.co/FkYxGmuS58',
            'timeline': '01-07-15 11:07:51'}}},
'eid2': {
    'label': '0',
    'info': {
        '552783745565347840': {
            'texts': 'Ten killed in shooting at headquarters of French satirical weekly Charlie Hebdo, says French media citing witnesses #c4news',
            'timeline': '01-07-15 11:08:09'}}}
...
}
However, I'm still getting an error from the timeline_convert_merge_post function.
Here's the error:
KeyError                                  Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_11008/3508134719.py in <module>
----> 1 timeline_convert_merge_post(n_data)

~\AppData\Local\Temp/ipykernel_11008/1868279692.py in timeline_convert_merge_post(data, interval)
      2
      3     for eid, _ in data.items():
----> 4         timeLine = data[eid]['timeline']
      5         texts = data[eid]['texts']
      6

KeyError: 'timeline'
However, I tried another format and it worked with the function. Is this the output I'm supposed to get?
{'eid0': {
'label': '0',
'texts': 'Breaking: At least 10 dead, 5 injured after tO gunman open fire in offices of Charlie Hebdo,satirical mag that published Mohammed cartoons',
'timeline': '01-07-15 11:06:08',
'merge_seqs': {
'merge_times': [
['0', '1', '-', '0', '7', '-', '1', '5', ' ', '1'],
['1', ':', '0', '6', ':', '0', '8']
],
'merge_texts': [
    ['B', 'r', 'e', 'a', 'k', 'i', 'n', 'g', ':', ' '],
    ['A', 't', ' ', 'l', 'e', 'a', 's', 't', ' ', '1', '0', ' ', 'd', 'e', 'a', 'd', ',', ' ', '5']]},
...
}
Before passing the input to function timeline_convert_merge_post, the data format should be:
data = {eid0:{label,info:{tid:{text,time}},timeline:[time...]},...}
where each key-value pair is an instance that contains multiple posts relevant to a claim.
In your input format, each key-value pair is only one post. To fix it, first put all the posts of an instance into one entry of the data dictionary:
data = { eid0:{label,info:{tid0:{text0,time0},tid1:{text1,time1},...}},...}
and then add the timeline for each instance:
data = { eid0:{label,info:{tid0:{text0,time0},tid1:{text1,time1},...},timeline:[time0,time1,...]},...}
Note that the time of a post needs to be converted to a timestamp, e.g., 1420599968 rather than 01-07-15 11:06:08.
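A minimal sketch of that restructuring (the helper name and toy values are illustrative, and the times here are assumed to already be Unix timestamps):

```python
def build_instance(label, posts):
    """Assemble one eid's value. `posts` is {tid: {'text': ..., 'time': timestamp}}."""
    # derive the timeline by sorting the post times chronologically
    timeline = sorted(p['time'] for p in posts.values())
    return {'label': label, 'info': posts, 'timeline': timeline}

data = {
    'eid0': build_instance('0', {
        'tid0': {'text': 'source tweet', 'time': 1420599968},
        'tid1': {'text': 'reaction tweet', 'time': 1420600071},
    }),
    # ... one entry per claim
}
```

Each eid entry then carries every post of one claim plus its sorted timeline, which is the shape timeline_convert_merge_post expects.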
Thank you, that's very helpful.
So there need to be only 2 eids, right? One for Rumours and one for Non-Rumours.
This should be the format before passing it to get_timeline, right?
data = {
"eid0": { # Rumours
"label": "0",
"info": {
... # All rumour texts and dates
}
},
"eid1": { # Non Rumours
"label": "1",
"info": {
... # All non-rumour texts and dates
}
}
}
Not only 2 eids. An eid is the id of one instance, i.e., one claim (Rumour or Non-Rumour) and its relevant posts. In the PHEME dataset, there are 1972 rumours and 3830 non-rumours, so the number of eids should be 5802.
Ohh, I think I have a better idea of it now.
Thank you. I'll try it and send an update :)
Since this is the structure of PHEME, shouldn't there be 5802 TIDs instead of 5802 EIDs, as each instance/claim has only one source tweet attached to it?
So shouldn't there be 5 EIDs for 5 classes and 5802 TIDs for 5802 tweets?
One claim in the PHEME dataset has a source tweet and multiple reaction tweets, so all information from these tweets should be put in the info dict of one eid's value.
One event (e.g., charliehebdo) in the PHEME dataset has multiple rumour claims and non-rumour claims; we merge all these claims into the data dict.
Therefore, we consider all information in each claim as one instance.
To better understand the PHEME dataset, refer to link.
Hello!
Thank you so much for your help till now
I think I finally got the format right because all 3 functions are working now.
This is my final output after passing it to all 3 functions:
data = {
"eid0" : {
'label': '0',
'merge_seqs': {
'merge_times': [
['01-07-15 11:06:08',
'01-07-15 11:24:15',
'01-07-15 11:31:37',
'01-07-15 11:38:37',
'01-07-15 11:45:32',
'01-07-15 12:32:16',
'01-07-15 12:32:50',
'01-07-15 12:32:54',
'01-07-15 12:43:31',
'01-07-15 13:00:29']
],
'merge_vecs': [[ 0.0, 0.0, 0.0, 0.12591666134496624, 0.0 ....... 0.0, 0.0, 0.0 ]]}}
}},
"eid1" : {
...
However, you mentioned converting the time to timestamps. How do I do that, since the functions do not accomplish this task automatically? (Also, is it fine to have several 0.0 values in merge_vecs?)
You could check the datetime library, which has a function for converting the time to a timestamp. It's OK to have 0.0 values since the tf-idf vector is sparse.
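For example, with the standard datetime module (a sketch assuming the times are in MM-DD-YY order and UTC; adjust the format string and timezone to match your data, since the resulting number depends on the timezone you assume):

```python
from datetime import datetime, timezone

def to_timestamp(s, fmt='%m-%d-%y %H:%M:%S'):
    """Parse a time string and return a Unix timestamp in seconds."""
    # interpret the parsed time as UTC so the result is machine-independent
    return int(datetime.strptime(s, fmt).replace(tzinfo=timezone.utc).timestamp())

print(to_timestamp('01-07-15 11:06:08'))  # 1420628768 under UTC
```

Apply this to every post's time before building the timeline lists.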
It worked. Thank you!
I saved it as a JSON file, then as a pickle, and updated config.json:
"evaluate_only": false,
"data": "data/PHEME.pkl",
"data_ids":"data/BEARD_ids.pkl", // What file do I put here
"device": "cuda",
"dataset":"PHEME",
"model_dir": "saved_models/"
However, data_ids.pkl is required for Main.py to run. What file goes there? Such a file doesn't exist anywhere in the repository.
The format of the data id file should be:
{
    'val': [eid, eid, eid, ....],
    'fold0': {'test': [eid, eid, eid, ....], 'train': [eid, eid, eid, ....]},
    'fold1': {'test': [eid, eid, eid, ....], 'train': [eid, eid, eid, ....]},
    'fold2': {'test': [eid, eid, eid, ....], 'train': [eid, eid, eid, ....]},
    'fold3': {'test': [eid, eid, eid, ....], 'train': [eid, eid, eid, ....]},
    'fold4': {'test': [eid, eid, eid, ....], 'train': [eid, eid, eid, ....]}
}
You may need to split the dataset and obtain the ids.pkl file following the above format.
Does each fold represent a different category? How is val different?
We hold out 20% of instances as validation. The rest are randomly split 5 times (the folds here) with a ratio of 3:1 for training/test. Refer to our paper for the experiment setup.
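One plausible reading of that split, sketched in Python (the helper name and seed are made up; the 20% hold-out and five independent 3:1 splits follow the description above):

```python
import random

def split_ids(eids, seed=0):
    """Hold out 20% for validation; split the rest 5 times with a 3:1 train/test ratio."""
    rng = random.Random(seed)
    eids = list(eids)
    rng.shuffle(eids)
    n_val = len(eids) // 5          # 20% validation
    ids = {'val': eids[:n_val]}
    rest = eids[n_val:]
    for k in range(5):
        rng.shuffle(rest)           # each fold is an independent random split
        n_test = len(rest) // 4     # 3:1 train/test within the remainder
        ids[f'fold{k}'] = {'test': rest[:n_test], 'train': rest[n_test:]}
    return ids
```

The exact fold sizes will differ if you instead partition the remainder into five disjoint test chunks, so check against the paper's setup.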
Got it!
I have formatted the ids like this:
data_ids = {
"val": ["eid736", "eid876" ... ], # 1160 elements
"fold0": {
'test': ["eid8", "eid837", "eid3445" ... ], # 232 elements
'train': ["eid947", "eid2736", "eid745" ... ] # 696 elements
},
"fold1": {
'test': [ ... ], # 232 elements
'train': [ ... ] # 696 elements
},
"fold2": {
'test': [ ... ], # 232 elements
'train': [ ... ] # 696 elements
},
"fold3": {
'test': [ ... ], # 232 elements
'train': [ ... ] # 696 elements
},
"fold4": {
'test': [ ... ], # 234 elements (Because 2 extra elements were left)
'train': [ ... ] # 696 elements
}
}
Then I converted it to json and saved it as a .pkl
file.
json_output = json.dumps(data_ids)
with open('data_ids.pkl', 'wb') as outfile:
pickle.dump(json_output, outfile)
Is this fine ?
I think yes. You could try it and check the printed output.
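For reference, a common pattern is to pickle the dict itself rather than a json.dumps string, so that pickle.load returns a dict instead of a str (a sketch with toy stand-in data, not necessarily what the repo's loader expects):

```python
import pickle

# toy stand-in for the real id splits
data_ids = {'val': ['eid0'], 'fold0': {'test': ['eid1'], 'train': ['eid2']}}

# pickle the dict directly (no intermediate JSON string)
with open('data_ids.pkl', 'wb') as f:
    pickle.dump(data_ids, f)

with open('data_ids.pkl', 'rb') as f:
    loaded = pickle.load(f)  # a dict, not a JSON str
```

If the loading code indexes the object with keys like 'fold0', a pickled JSON string would raise a TypeError or KeyError at that point.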
I'm getting the following error
pid: 384 2022-11-14 18:16:43
{'active_model': 'HEARD', 'models': {'HEARD': {'early_stop_lr': 1e-05, 'early_stop_patience': 6, 'hyperparameters': {'learning_rate': {'RD': 0.0002, 'HC': 0.0002}, 'max_seq_len': 100, 'max_post_len': 300, 'batch_size': 16, 'epochs': 12, 'lstm_dropout': 0.1, 'fc_dropout': 0.3, 'beta': {'HC': 1.0, 'T': 1.0, 'N': 1.0}, 'hidden_size_HC': 64, 'hidden_size_RD': 128, 'in_feats_HC': 1, 'in_feats_RD': 1000, 'sample_integral': 100, 'sample_pred': 100, 'weight_decay': 0.0001, 'interval': 3600.0, 'decay_patience': 3, 'lstm_layers': 1}, 'evaluate_only': False, 'data': 'Data/data.pkl', 'data_ids': 'Data/pheme_ids.pkl', 'device': 'cpu', 'dataset': 'PHEME', 'model_dir': 'saved_models/'}}}
[+]Start training fold0: 2022-11-14 18:16:46
start training 0 epoch: 2022-11-14 18:16:46
Traceback (most recent call last):
  File "/content/drive/MyDrive/HEARD/Main.py", line 42, in <module>
    main()
  File "/content/drive/MyDrive/HEARD/Main.py", line 27, in main
    model, best_params = handle.train_HEARD(fold, train_loader, val_loader)
  File "/content/drive/MyDrive/HEARD/Train.py", line 107, in train_HEARD
    for Batch_data in train_loader:
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1376, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1402, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.7/dist-packages/torch/_utils.py", line 461, in reraise
    raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/content/drive/MyDrive/HEARD/Dataset.py", line 61, in __getitem__
    merge_tids = seqs['merge_tids'][:self.max_seq_len]
KeyError: 'merge_tids'
According to README.md, the data format doesn't contain any key called merge_tids?
# From README.md
{
"eid": {
"label": "1", # 1 for rumor, 0 for non-rumor
"merge_seqs": {
"merge_times": [[timestamp,timestamp,...], [timestamp,timestamp,...], ...],
'merge_vecs': [[...], [...], ...], # tf-idf vecs[1000] for each interval, so the shape of merge_vecs should be [num of intervals,1000]
}}
...
}
Hi,
Thanks! The corresponding change has been committed. You could add the merge_tids key in the dict following the same merging strategy as merge_times and merge_vecs (they come from the same tweets).
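A sketch of one way to build that key, assuming `tids` is the list of post ids in timeline order and `interval` matches the one used when merging times and vecs (the names here are illustrative):

```python
def merge_tids(tids, interval=10):
    """Group post ids into chunks of `interval`, mirroring merge_times/merge_vecs."""
    starts = list(range(len(tids)))[0::interval]
    merged = []
    for i, start in enumerate(starts):
        # next chunk's start, or the end of the list for the final chunk
        end = starts[i + 1] if i + 1 < len(starts) else len(tids)
        merged.append(tids[start:end])
    return merged
```

Each sub-list then lines up index-for-index with the corresponding entries of merge_times, since both are sliced with the same interval.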
Hello
I downloaded the PHEME dataset and arranged it in this format:
This is the format required by the get_timeline function in data_process.py.
However, this only accounts for one label (Rumours).
How do I add another eid for non-rumours?