wkgcass / public-chat

My public repository for chat-in-issue (public repos are not billed for GitHub Actions usage)
Creative Commons Zero v1.0 Universal

Will LLMs do word segmentation for Chinese? #10

Open wkgcass opened 1 year ago

wkgcass commented 1 year ago

/chat: Will LLMs do word segmentation for Chinese? Or do they simply read the text character by character and process it that way?

github-actions[bot] commented 1 year ago

/ai-says:

An LLM can perform Chinese word segmentation by recognizing the spaces between words in Chinese text. It uses natural language processing techniques to identify the boundaries between words and then processes each word separately. However, the exact method used for Chinese word segmentation may vary depending on the specific application and language model.

github-actions[bot] commented 1 year ago

/err: token usage: {"prompt_tokens":33,"completion_tokens":62,"total_tokens":95}

wkgcass commented 1 year ago

/chat: But Chinese text usually does not have white space between words

github-actions[bot] commented 1 year ago

/ai-says:

You are correct that Chinese text typically does not include whitespace between words. To perform Chinese word segmentation, an LLM would need to use specialized algorithms designed to analyze the structure of Chinese text and identify each individual word. This is a complex task that requires linguistic rules and statistical models, and different LLMs may employ different methods to accomplish it. Some models may rely on machine learning techniques, while others may use traditional rule-based approaches. The performance of a given LLM at Chinese word segmentation will depend on the specific model and algorithm it uses, as well as the quality and complexity of the text it is analyzing.
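As a concrete illustration of the "specialized algorithms" mentioned above, here is a minimal sketch using the open-source `jieba` library, a widely used statistics-based Chinese segmenter (jieba is an example choice for this sketch, not something named in the thread):

```python
# pip install jieba -- a popular statistics-based Chinese word segmenter
import jieba

text = "我来到北京清华大学"    # "I came to Tsinghua University in Beijing"
words = jieba.cut(text)        # lazy generator over segmented words
print(" / ".join(words))       # -> 我 / 来到 / 北京 / 清华大学
```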

github-actions[bot] commented 1 year ago

/err: token usage: {"prompt_tokens":118,"completion_tokens":126,"total_tokens":244}

wkgcass commented 1 year ago

/chat: But from this paper we see that word segmentation is not necessary for Chinese deep learning

github-actions[bot] commented 1 year ago

/ai-says:

You're right that the paper you referenced suggests that word segmentation may not always be necessary for certain Chinese deep learning tasks, such as language modeling or sentiment analysis. This is because deep learning models are able to extract patterns and features from raw text input without relying on explicit segmentation.

However, it's important to note that not all Chinese NLP tasks can be performed without word segmentation. For example, tasks such as named entity recognition or part-of-speech tagging typically require segmented Chinese text. Even in tasks where word segmentation isn't strictly necessary, it can still be beneficial for improving the accuracy and interpretability of the model.

Ultimately, the choice to perform word segmentation in Chinese text processing will depend on the specific task, the quality and complexity of the input data, and the available algorithmic and computational resources.
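For contrast, a character-level pipeline of the kind such papers argue for needs no segmenter at all. A minimal sketch (the toy vocabulary construction here is illustrative, not taken from any specific model):

```python
# Character-level processing: each Chinese character is its own input unit,
# so no segmenter is required before feeding text to a model.
text = "深度学习不需要分词"   # "Deep learning does not need word segmentation"
chars = list(text)
print(chars)                   # ['深', '度', '学', '习', '不', '需', '要', '分', '词']

# Toy character vocabulary mapping each character to an embedding index:
vocab = {ch: i for i, ch in enumerate(sorted(set(chars)))}
ids = [vocab[ch] for ch in chars]
print(ids)
```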

github-actions[bot] commented 1 year ago

/err: token usage: {"prompt_tokens":292,"completion_tokens":162,"total_tokens":454}

wkgcass commented 1 year ago

/chat: Does GPT-3 do word segmentation for Chinese before training?

github-actions[bot] commented 1 year ago

/ai-says:

It's not clear precisely how GPT-3 (Generative Pre-trained Transformer 3) handles Chinese text processing during training, but it's likely that the model uses some form of word segmentation.

One reason for this is that the pre-training data used to train GPT-3 likely includes both raw text and pre-segmented text. For example, GPT-3 was trained on a large corpus of web text that was sourced from multiple languages and domains, including Chinese social media and news articles. It's possible that some of this Chinese text had already been pre-segmented into words or phrases.

Additionally, the GPT-3 model itself is based on the transformer architecture, which has the ability to attend to different positions in the input sequence. This means that the model can learn to recognize and process various sub-word units, including individual characters or character sequences that correspond to words in Chinese.

Overall, while the specifics of how GPT-3 processes Chinese text during training are not public knowledge, it's likely that the model uses some combination of pre-segmented text and sub-word units to learn representations of the Chinese language.
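For reference, GPT-3's publicly documented tokenizer is a byte-level BPE rather than a word segmenter, and it can be inspected with OpenAI's `tiktoken` library (per tiktoken's model table, the `r50k_base` encoding corresponds to the original GPT-3 models). A minimal sketch:

```python
# pip install tiktoken -- OpenAI's BPE tokenizer library
import tiktoken

enc = tiktoken.get_encoding("r50k_base")  # encoding used by the original GPT-3 models

text = "今天天气很好"          # "The weather is nice today"
ids = enc.encode(text)
print(len(text), "characters ->", len(ids), "tokens")

# Tokens are byte sequences, not words: a single Chinese character is
# often split across two or three byte-level tokens.
for tid in ids:
    print(tid, enc.decode_single_token_bytes(tid))
```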

github-actions[bot] commented 1 year ago

/err: token usage: {"prompt_tokens":480,"completion_tokens":229,"total_tokens":709}