zhaoyingjun / chatbot

ChatGPT has made chatbots popular, and the mainstream trend has shifted to GPT-style models. This project keeps up with the times and will release a GPT-style version in the near future. With this project and your own corpus you can train the chatbot you want, for intelligent customer service, online Q&A, casual chat, and other scenarios.

TypeError: cannot use a bytes pattern on a string-like object #59

Open AlucardNosferatu opened 4 years ago

AlucardNosferatu commented 4 years ago

/usr/bin/python3.5 /home/scrooge/chatbot/seqGanChatbot/execute.py
/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/dtypes.py:493: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/dtypes.py:494: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/dtypes.py:495: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/dtypes.py:496: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/dtypes.py:497: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/dtypes.py:502: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
2019-11-29 03:38:04.465325: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
Preparing Chitchat gen_data in ./gen_data/
Tokenizing disc_data in ./gen_data/train.answer
Traceback (most recent call last):
  File "/home/scrooge/chatbot/seqGanChatbot/execute.py", line 355, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 124, in run
    _sys.exit(main(argv))
  File "/home/scrooge/chatbot/seqGanChatbot/execute.py", line 349, in main
    al_train()
  File "/home/scrooge/chatbot/seqGanChatbot/execute.py", line 156, in al_train
    vocab, rev_vocab, dev_set, train_set = gens.prepare_data(gen_config)
  File "/home/scrooge/chatbot/seqGanChatbot/gen/generator.py", line 72, in prepare_data
    gen_config.train_dir, vocab, gen_config.vocab_size)
  File "/home/scrooge/chatbot/seqGanChatbot/utils/data_utils.py", line 202, in prepare_chitchat_data
    data_to_token_ids(train_path + ".answer", answer_train_ids_path, vocabulary, tokenizer)
  File "/home/scrooge/chatbot/seqGanChatbot/utils/data_utils.py", line 188, in data_to_token_ids
    normalize_digits)
  File "/home/scrooge/chatbot/seqGanChatbot/utils/data_utils.py", line 153, in sentence_to_token_ids
    words = basic_tokenizer(sentence)
  File "/home/scrooge/chatbot/seqGanChatbot/utils/data_utils.py", line 52, in basic_tokenizer
    words.extend(re.split(_WORD_SPLIT, space_separated_fragment))
  File "/usr/lib/python3.5/re.py", line 203, in split
    return _compile(pattern, flags).split(string, maxsplit)
TypeError: cannot use a bytes pattern on a string-like object

Process finished with exit code 1
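
For context, this TypeError means a regex pattern compiled from bytes was applied to a str input. A minimal sketch that reproduces the same mismatch; the pattern below is only an illustration, the actual _WORD_SPLIT in data_utils.py may differ:

import re

# Illustrative bytes pattern, similar in spirit to the tokenizer's _WORD_SPLIT.
_WORD_SPLIT = re.compile(b"([.,!?\"':;)(])")

sentence = "hello, world"  # a str, e.g. a line read from train.answer in text mode

try:
    re.split(_WORD_SPLIT, sentence)  # bytes pattern + str input
except TypeError as e:
    print(e)  # cannot use a bytes pattern on a string-like object

# The call succeeds once both sides have the same type:
print(re.split(_WORD_SPLIT, sentence.encode("utf-8")))  # bytes pattern + bytes input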

AlucardNosferatu commented 4 years ago

The spot where the issue occurs:

def basic_tokenizer(sentence):
    """Very basic tokenizer: split the sentence into a list of tokens."""
    words = []
    sentence = tf.compat.as_bytes(sentence)
    for space_separated_fragment in sentence.strip().split():
        words.extend(re.split(_WORD_SPLIT, space_separated_fragment))
    return [w for w in words if w]

in data_utils.py
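
To confirm which side of the failing call has the wrong type, a quick debugging sketch (space_separated_fragment is the loop variable from the function above):

# inside basic_tokenizer, immediately before the re.split call:
print("pattern type:", type(_WORD_SPLIT.pattern))        # bytes for a bytes-compiled pattern
print("fragment type:", type(space_separated_fragment))  # the TypeError implies this is str at the crash

If the fragment still prints as str after the tf.compat.as_bytes conversion, the crash is probably going through a different code path (or a different copy of basic_tokenizer) than the one pasted above.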

AlucardNosferatu commented 4 years ago

Runtime environment: Ubuntu 16.04, Python 3.5, TensorFlow 1.5.0, PyCharm Community. The data were downloaded from the Baidu Netdisk link given in README.MD.

AlucardNosferatu commented 4 years ago

I found a very similar repo about SeqGAN: https://github.com/vpegasus/seqGan_chatbot/blob/master/utils/data_utils.py

and in his code I found that the regular-expression part of this function is missing:

def basic_tokenizer(sentence):
    """Very basic tokenizer: split the sentence into a list of tokens."""
    words = []
    for space_separated_fragment in sentence.strip().split():
        words.extend(_WORD_SPLIT.split(space_separated_fragment))
    return [w for w in words if w]

I haven't tested his code yet; I am wondering how this removal makes sense...
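
For what it's worth, the call style by itself should not matter: re.split(_WORD_SPLIT, x) and _WORD_SPLIT.split(x) behave identically with respect to the bytes/str check. A small sketch (the pattern is illustrative only):

import re

_WORD_SPLIT = re.compile(b"([.,!?\"':;)(])")  # illustrative bytes pattern
frag = b"hello,world"

# The two call styles are interchangeable; neither relaxes the type requirement:
assert re.split(_WORD_SPLIT, frag) == _WORD_SPLIT.split(frag)

So if that repo avoids the error, the reason is more likely that its _WORD_SPLIT pattern type matches the type of the data it reads (str pattern with str lines, or bytes pattern with bytes lines), not the change in call style.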