yangheng95 / ABSADatasets

Public & Community-shared datasets for Aspect-based sentiment analysis and Text Classification
MIT License

instructions on how to train using my own dataset #34

Closed harrywang closed 1 year ago

harrywang commented 1 year ago

Hi,

I have prepared our data into train and test as two txt files using the annotation tools:

[Screenshot: Screen Shot 2022-08-17 at 11 19 23 AM]

Then I tried to use https://github.com/yangheng95/PyABSA/blob/release/demos/aspect_polarity_classification/train_apc_chinese.py to train on our own data, but I could not figure out how to structure my files.

I created the structure below (train.py is the training script):

datasets
└── 666.hotel
    ├── hotel.test.txt
    └── hotel.train.txt
train.py

I changed the dataset path as follows:

config = APCConfigManager.get_apc_config_chinese()
config.evaluate_begin = 4
config.dropout = 0.5
config.l2reg = 1e-8
config.model = APCModelList.FAST_LCF_BERT
# config.spacy_model = 'zh_core_web_sm'
# chinese_sets = ABSADatasetList.Chinese
# chinese_sets = ABSADatasetList.Chinese
chinese_sets = './datasets/666.hotel/hotel.train.txt'
sent_classifier = Trainer(config=config,  # set config=None to use default model
                          dataset=chinese_sets,  # train set and test set will be automatically detected
                          checkpoint_save_mode=1,
                          auto_device=True  # automatic choose CUDA or CPU
                          ).load_trained_model()

but the error says:

RuntimeError: Fail to locate dataset: ['./datasets/666.hotel/hotel.train.txt']. If you are using your own dataset, you may need rename your dataset according to https://github.com/yangheng95/ABSADatasets#important-rename-your-dataset-filename-before-use-it-in-pyabsa

Could you please help give some guidance?

Thanks!

yangheng95 commented 1 year ago

You need to pass the dataset's folder path, not the path of a file inside it.
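
A minimal sketch of that advice, using only the Python standard library. The helper `dataset_folder` is hypothetical (it is not part of PyABSA): it returns the containing folder when it is handed a file path, so the value passed as `dataset` is always a directory. The temporary directory only mimics the layout from the question.

```python
import tempfile
from pathlib import Path


def dataset_folder(path_str):
    """Hypothetical helper (not a PyABSA function): return the folder to pass as `dataset`."""
    path = Path(path_str)
    # If the caller pointed at a file such as hotel.train.txt, use its parent folder.
    return str(path.parent) if path.is_file() else str(path)


# Usage on a throwaway copy of the layout from the question.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "datasets" / "666.hotel"
    folder.mkdir(parents=True)
    train_file = folder / "hotel.train.txt"
    train_file.write_text("placeholder\n", encoding="utf-8")

    from_file = dataset_folder(str(train_file))  # file path in, folder out
    from_folder = dataset_folder(str(folder))    # folder path is returned unchanged
    expected = str(folder)
    print(from_file == from_folder)
```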

harrywang commented 1 year ago

I tried

chinese_sets = './datasets/666.hotel'

and the error is:

Try to load ['./datasets/666.hotel'] dataset from local
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
/Users/harrywang/sandbox/hotel-service-bot/apc-train-chinese/train.ipynb Cell 2 in <cell line: 10>()
      6 # config.spacy_model = 'zh_core_web_sm'
      7 # chinese_sets = ABSADatasetList.Chinese
      8 # chinese_sets = ABSADatasetList.Chinese
      9 chinese_sets = './datasets/666.hotel'
---> 10 sent_classifier = Trainer(config=config,  # set config=None to use default model
     11                           dataset=chinese_sets,  # train set and test set will be automatically detected
     12                           checkpoint_save_mode=1,
     13                           auto_device=True  # automatic choose CUDA or CPU
     14                           ).load_trained_model()

File ~/sandbox/hotel-service-bot/venv/lib/python3.8/site-packages/pyabsa/functional/trainer/trainer.py:118, in Trainer.__init__(self, config, dataset, from_checkpoint, checkpoint_save_mode, auto_device, path_to_save, load_aug)
    116     dataset = DatasetItem('custom_dataset', dataset)
    117     self.config.dataset_name = dataset.dataset_name
--> 118 self.dataset_file = detect_dataset(dataset, task=self.task, load_aug=load_aug)
    119 self.config.dataset_file = self.dataset_file
    121 self.config = init_config(self.config, auto_device)

File ~/sandbox/hotel-service-bot/venv/lib/python3.8/site-packages/pyabsa/functional/dataset/dataset_manager.py:203, in detect_dataset(dataset_path, task, load_aug)
    201 if len(dataset_file['train']) == 0:
    202     if os.path.isdir(d) or os.path.isdir(search_path):
--> 203         print('No train set found from: {}, detected files: {}'.format(dataset_path, ', '.join(os.listdir(d) + os.listdir(search_path))))
    204     raise RuntimeError(
...
    207             'https://github.com/yangheng95/ABSADatasets#important-rename-your-dataset-filename-before-use-it-in-pyabsa')
    208     )
    209 if len(dataset_file['test']) == 0:

FileNotFoundError: [Errno 2] No such file or directory: ''

yangheng95 commented 1 year ago

Your dataset should be located under the apc_dataset or atepc_dataset folder, depending on which task you are working on.
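
The traceback above is consistent with this. The failing frame at line 203 calls `os.listdir(search_path)`, and the error reports an empty filename, which suggests (an inference from the traceback, not something confirmed in the thread) that no matching folder was found and `search_path` was an empty string. The standard-library behavior can be reproduced on its own:

```python
import os

# Assumption: this mirrors the value of `search_path` when no dataset folder matched.
search_path = ""

raised = False
try:
    os.listdir(search_path)
except FileNotFoundError as err:
    # An empty path is not a valid directory, so listdir fails before
    # the library's own RuntimeError could be raised.
    raised = True
    print(type(err).__name__, err)
```

This would explain why the more helpful `RuntimeError` about renaming the dataset never appeared in the second attempt.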

harrywang commented 1 year ago

This works. I did not know this: I cloned ABSADatasets, put my data there, and it loads OK. Thanks!
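
The fix described above can be sketched as a small staging step. Everything here is an assumption for illustration: `stage_dataset` is a hypothetical helper, `ABSADatasets` stands in for the cloned repository, and `apc_datasets` is an assumed name for the task folder (check the cloned repository for the actual name). The sketch only copies files; it does not call PyABSA.

```python
import shutil
import tempfile
from pathlib import Path


def stage_dataset(src_dir, root, task_folder, name):
    """Hypothetical helper: copy every file in src_dir to root/task_folder/name."""
    dest = Path(root) / task_folder / name
    dest.mkdir(parents=True, exist_ok=True)
    for src in Path(src_dir).iterdir():
        if src.is_file():
            shutil.copy2(src, dest / src.name)
    return dest


with tempfile.TemporaryDirectory() as tmp:
    # Recreate the original layout from the question with placeholder files.
    src = Path(tmp) / "datasets" / "666.hotel"
    src.mkdir(parents=True)
    for fname in ("hotel.train.txt", "hotel.test.txt"):
        (src / fname).write_text("placeholder\n", encoding="utf-8")

    # "apc_datasets" is an assumed folder name, not confirmed in this thread.
    dest = stage_dataset(src, Path(tmp) / "ABSADatasets", "apc_datasets", "666.hotel")
    staged = sorted(p.name for p in dest.iterdir())
    staged_path = dest.relative_to(tmp).as_posix()
    print(staged_path, staged)
```

After staging, the training script would still pass the dataset folder (not a file inside it) as `dataset`, as advised earlier in the thread.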