Data processing: the logic is flawed

jkkl commented 2 years ago

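The code that is meant to split a single sample containing several sentiment polarities into multiple samples has a bug.

Here is a bad case: a sample whose polarity labels are [-999, -999, -999, Positive], i.e. the aspect sits at the very end, gets filtered out.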
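To make the expected behaviour concrete, here is a minimal sketch (not the PyABSA code) that lists the polarity spans a splitter should find. It assumes the padding label is the string `"-999"`; the helper name `polarity_spans` is hypothetical.

```python
from itertools import groupby

SENTIMENT_PADDING = "-999"  # assumed padding label, written as a string


def polarity_spans(polarity):
    """Return (start, end, label) for each run of identical non-padding labels."""
    spans = []
    idx = 0
    for label, run in groupby(polarity):
        length = len(list(run))
        if label != SENTIMENT_PADDING:
            # A run that reaches the last token is kept like any other run.
            spans.append((idx, idx + length, label))
        idx += length
    return spans


bad_case = ["-999", "-999", "-999", "Positive"]
print(polarity_spans(bad_case))  # [(3, 4, 'Positive')]
```

Because the spans come from runs of equal labels, a span that ends at the last token needs no special handling.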

yangheng95 commented 2 years ago

@jkkl OK, do you have an idea for a fix?

jkkl commented 2 years ago

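The code below leaves one potential case unhandled: if two aspects are directly adjacent and share the same sentiment, it adds only a single sample instead of splitting them into two. For example: 【屏幕尺寸摄像头都不错】 ("The screen size and the camera are both good").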
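One possible way to separate adjacent aspects, sketched here under assumptions rather than taken from PyABSA, is to cut on the IOB tags instead of on polarity changes, so that every `B-ASP` starts a new sample. The function name `split_by_aspect` and the annotation of the example sentence (屏幕尺寸 and 摄像头 as two aspects) are assumptions for illustration.

```python
SENTIMENT_PADDING = "-999"  # assumed padding label, written as a string


def split_by_aspect(tokens, iob_tags, polarity):
    """Build one (tokens, iob_tags, labels) sample per B-ASP/I-ASP span."""
    samples = []
    start = None
    # A sentinel "O" closes a span that runs to the end of the sentence.
    for idx, tag in enumerate(list(iob_tags) + ["O"]):
        if start is not None and tag != "I-ASP":
            labels = [SENTIMENT_PADDING] * len(polarity)
            labels[start:idx] = polarity[start:idx]
            samples.append((tokens, iob_tags, labels))
            start = None
        if tag == "B-ASP":
            start = idx
    return samples


tokens = list("屏幕尺寸摄像头都不错")
iob_tags = ["B-ASP", "I-ASP", "I-ASP", "I-ASP", "B-ASP", "I-ASP", "I-ASP", "O", "O", "O"]
polarity = ["Positive"] * 7 + ["-999"] * 3

for _, _, labels in split_by_aspect(tokens, iob_tags, polarity):
    print(" ".join(labels))
```

With this annotation the sentence yields two samples, one per aspect, even though both carry the same sentiment.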

jkkl commented 2 years ago

Let's leave it like this for now; it looks good enough for the moment.

jkkl commented 2 years ago

> @jkkl OK, do you have an idea for a fix?


    print_head = 10
    for s, t, p in data:
        if len(s) > 0:
            # prepare the atepc dataset, refer to https://github.com/yangheng95/PyABSA/issues/78
            polarity_padding = [str(SENTIMENT_PADDING)] * len(t)
            start_sentiment_index = -1
            end_sentiment_index = -1
            for p_idx in range(len(p)):
                if p[p_idx] != str(SENTIMENT_PADDING) and (p_idx == 0 or p[p_idx] != p[p_idx-1]):
                    start_sentiment_index = p_idx
                elif p[p_idx] == str(SENTIMENT_PADDING) and p_idx != 0 and p[p_idx-1] != str(SENTIMENT_PADDING):
                    end_sentiment_index = p_idx
                    one_sentiment_labels = polarity_padding[:start_sentiment_index] + p[start_sentiment_index: end_sentiment_index] + polarity_padding[end_sentiment_index:]
                    prepared_data.append((s, t, one_sentiment_labels))
                    if print_head > 0:
                        print('\n'.join([' '.join(s), ' '.join(t), ' '.join(one_sentiment_labels)]))
                        print_head -= 1
            if start_sentiment_index > end_sentiment_index:
                # Handle the trailing case: the last aspect runs to the end of the sentence
                one_sentiment_labels = polarity_padding[:start_sentiment_index] + p[start_sentiment_index:]
                prepared_data.append((s, t, one_sentiment_labels))
                if print_head > 0:
                    print('\n'.join([' '.join(s), ' '.join(t), ' '.join(one_sentiment_labels)]))
                    print_head -= 1
    print('Prepared data len from file :{} nums'.format(len(prepared_data)))
    return prepared_data

jkkl commented 2 years ago

@yangheng95 This similar block has the same bug. A suggested fix, for reference:


                POLARITY_PADDING = [SENTIMENT_PADDING] * len(polarity)
                start_sentiment_index = -1
                end_sentiment_index = -1
                example_id = i_batch * self.opt.infer_batch_size + i
                for idx in range(len(polarity)):
                    if polarity[idx] != SENTIMENT_PADDING and (idx == 0 or polarity[idx] != polarity[idx - 1]):
                        start_sentiment_index = idx
                    elif polarity[idx] == SENTIMENT_PADDING and idx != 0 and polarity[idx -1] != SENTIMENT_PADDING:
                        end_sentiment_index = idx
                        one_sentiment_labels = POLARITY_PADDING[:start_sentiment_index] + polarity[start_sentiment_index: end_sentiment_index] + POLARITY_PADDING[end_sentiment_index:]
                        extraction_res.append((all_tokens[i + (self.opt.infer_batch_size * i_batch)], pred_iobs, one_sentiment_labels, example_id))
                if start_sentiment_index > end_sentiment_index:
                    # Handle the trailing case: the last aspect runs to the end of the sentence
                    one_sentiment_labels = POLARITY_PADDING[:start_sentiment_index] + polarity[start_sentiment_index:]
                    extraction_res.append((all_tokens[i + (self.opt.infer_batch_size * i_batch)], pred_iobs, one_sentiment_labels, example_id))

        return extraction_res, sentence_res

yangheng95 commented 2 years ago

If possible, please provide one or two bad cases for each of these two bugs so that I can debug them.

jkkl commented 2 years ago

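> If possible, please provide one or two bad cases for each of these two bugs so that I can debug them.

My own case is 「我喜欢小狗」 ("I like puppies").

Just construct a case where the aspect is at the very end of the sentence. For example: 【秉 O -999 承 O -999 了 O -999 时 O -999 尚 O -999 高 O -999 贵 O -999 的 O -999 外 B-ASP Positive 形 I-ASP Positive 设 I-ASP Positive 计 I-ASP Positive 】. When the samples are built, this case gets filtered out. At prediction time, only 【计】 is passed to the sentiment classification stage, even though the aspect itself is extracted correctly.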
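As a quick regression check, the splitting loop posted earlier in this thread can be wrapped in a function and run on this trailing-aspect case. This is only a sketch: the wrapper name `split_sample` is hypothetical, and `SENTIMENT_PADDING = -999` is assumed from the labels in the sample.

```python
SENTIMENT_PADDING = -999  # assumed value, matching the "-999" labels above


def split_sample(s, t, p):
    """The loop from this thread, wrapped so it can be called on one sample."""
    pad = str(SENTIMENT_PADDING)
    prepared = []
    polarity_padding = [pad] * len(t)
    start, end = -1, -1
    for i in range(len(p)):
        if p[i] != pad and (i == 0 or p[i] != p[i - 1]):
            start = i
        elif p[i] == pad and i != 0 and p[i - 1] != pad:
            end = i
            labels = polarity_padding[:start] + p[start:end] + polarity_padding[end:]
            prepared.append((s, t, labels))
    if start > end:
        # The last aspect runs to the end of the sentence.
        prepared.append((s, t, polarity_padding[:start] + p[start:]))
    return prepared


tokens = list("秉承了时尚高贵的外形设计")
iob_tags = ["O"] * 8 + ["B-ASP"] + ["I-ASP"] * 3
polarity = ["-999"] * 8 + ["Positive"] * 4

result = split_sample(tokens, iob_tags, polarity)
print(len(result))              # 1
print(" ".join(result[0][2]))
```

The sample is kept, and all four tokens of 外形设计 retain their `Positive` label.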


Data processing: the logic is flawed #156