[x] 1. Finetuning LLM: https://learn.deeplearning.ai/courses/finetuning-large-language-models/lesson/5/data-preparation
[ ] 2. Generative AI with Large Language Models https://coursera.org/share/ce9b14669661dabbb26a990b80e81a13
[ ] 3. Hugging Face NLP courses https://huggingface.co/learn/nlp-course/chapter7/6
[ ] 4. IBM Machine Learning Professional Certificate https://www.coursera.org/professional-certificates/ibm-machine-learning
[ ] 5. Hugging Face: Building Generative AI Applications with Gradio https://learn.deeplearning.ai/courses/huggingface-gradio/lesson/1/introduction
[ ] 6. Langchain: LangChain for LLM Application Development https://learn.deeplearning.ai/courses/langchain/lesson/1/introduction
Course Link: https://learn.deeplearning.ai/courses/finetuning-large-language-models/lesson/5/data-preparation
Library for Lamini: https://lamini-ai.github.io/tuning/quick_start/#basic-tuning
Course Notes:
Stage 1: Pretraining to get a base model
Stage 2: Finetuning to train the model further
- update the entire model, not just part of it
- Behavior change: model will learn to focus and respond more consistently
- Gain Knowledge: Increase knowledge of new specific concepts
- Extraction of text ---> get keywords
- or Expansion of information ---> longer writing such as emails, code
2.6.1 Two types of instruction prompt templates
Better data means: higher quality, more diversity, real (not generated), and more of it
Step 1: collect instruction-response pairs
Step 2: concatenate pairs, [add prompt template]
Step 3: tokenize the data, which turns text into numbers; the encoding depends on how frequently tokens appear in the training corpus
Step 4: split into train/test
AutoTokenizer
Code :
import pandas as pd
import datasets
from pprint import pprint
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
text = "Hi, how are you?"
encoded_text = tokenizer(text)["input_ids"]
# encoded_text: [12764, 13, 849, 403, 368, 32]
decoded_text = tokenizer.decode(encoded_text)
print("Decoded tokens back into text: ", decoded_text)
# Decoded tokens back into text: Hi, how are you?
list_texts = ["Hi, how are you?", "I'm good", "Yes"]
encoded_texts = tokenizer(list_texts)
print("Encoded several texts: ", encoded_texts["input_ids"])
# Encoded several texts: [[12764, 13, 849, 403, 368, 32], [42, 1353, 1175], [4374]]
tokenizer.pad_token = tokenizer.eos_token  # pythia has no pad token by default; reuse EOS (id 0)
encoded_texts_both = tokenizer(list_texts, max_length=3, truncation=True, padding=True)
print("Using both padding and truncation: ", encoded_texts_both["input_ids"])
# Using both padding and truncation: [[403, 368, 32], [42, 1353, 1175], [4374, 0, 0]]
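To tie Steps 1-4 listed above together, here is a minimal sketch of applying a prompt template to instruction-response pairs, tokenizing, and splitting into train/test. The template wording, the example pairs, and the 50/50 split are my own illustration, not the course's exact dataset.

```python
# Sketch of Steps 1-4: instruction-response pairs -> prompt template -> tokenize -> train/test split.
import datasets
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")

prompt_template = """### Instruction:
{instruction}

### Response:
{response}"""

pairs = [
    {"instruction": "What does finetuning do?", "response": "It trains a base model further on task data."},
    {"instruction": "Name a tokenizer class.", "response": "AutoTokenizer."},
]
texts = [prompt_template.format(**pair) for pair in pairs]      # Step 2: concatenate with template

tokenized = tokenizer(texts, truncation=True, max_length=512)   # Step 3: text -> token IDs
dataset = datasets.Dataset.from_dict({"text": texts, "input_ids": tokenized["input_ids"]})
split = dataset.train_test_split(test_size=0.5, seed=42)        # Step 4: train/test split
print(split)
```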
Same as for other neural networks: calculate the loss and update the weights of the LLM parameters
Step 1: Load json dataset
Step 2: Set up model, training config, tokenizer
Step 3: Inference of the model; inference just lets the current model generate a result based on the input prompt:
def inference(text, model, tokenizer, max_input_tokens=1000, max_output_tokens=100):
    # Tokenize
    input_ids = tokenizer.encode(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=max_input_tokens
    )

    # Generate
    device = model.device
    generated_tokens_with_prompt = model.generate(
        input_ids=input_ids.to(device),
        max_length=max_output_tokens
    )

    # Decode
    generated_text_with_prompt = tokenizer.batch_decode(generated_tokens_with_prompt, skip_special_tokens=True)

    # Strip the prompt
    generated_text_answer = generated_text_with_prompt[0][len(text):]

    return generated_text_answer
trainer = Trainer(
    model=base_model,
    model_flops=model_flops,
    total_steps=max_steps,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
- after 3 training steps, the model is not fully trained
- this slightly fine-tuned model doesn't give great results
- thus we train further on the entire dataset for two epochs and get a better result
Moderation: include examples in the training dataset to handle off-topic questions, e.g. Q: "Can you laugh?" A: "Let's keep this conversation relevant to coding."
How to evaluate generative models? Common benchmarks:
- ARC for grade-school questions
- HellaSwag for common sense
- MMLU for computer science and more
- TruthfulQA for falsehoods
Common error patterns to check in the results: misspellings, outputs that are too long, repetition
I think the benchmark evaluation suite is a good start
PEFT: https://github.com/huggingface/peft
LoRA trains fewer parameters: https://huggingface.co/docs/peft/main/en/conceptual_guides/lora
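A minimal sketch of LoRA with the PEFT library, reusing the same Pythia model from the notes above; the rank, alpha, and target module below are values I picked for illustration, not values from the course.

```python
# Sketch: wrap a base model with LoRA adapters via PEFT so only a small set of weights is trained.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")
lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices (assumption)
    lora_alpha=16,                        # scaling factor (assumption)
    target_modules=["query_key_value"],   # attention projection in GPT-NeoX/Pythia
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()   # only a tiny fraction of parameters are trainable
```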
Course Link: https://www.coursera.org/learn/generative-ai-with-llms/lecture/9uWab/course-introduction
Week1 - 1. Generative AI and LLMs
Foundation models / base models:
We'll use FLAN-T5 in this course
Prompt as input, the model does inference, and the output is a completion
LLM use cases:
chatbot, next-word prediction, writing essays, summarizing text
natural language --> code
entity recognition
augmenting LLMs by connecting them to external data sources/APIs
How the Transformer works: https://www.coursera.org/learn/generative-ai-with-llms/lecture/3AqWI/transformers-architecture
Transformer architecture - Attention is All you need
Transformers let the network learn the meaning of each word in context and its relevance (attention) to every other word
Attention map for different words with different weights in the sentence:
The word "book" has strong attention with the words "student" and "teacher"
This self-attention improves the model's ability to encode the language
How does the transformer model work?
A simple diagram of the transformer model's architecture
Step 1: The tokenizer converts the words/phrases into numbers, since the model only deals with numbers; importantly, we must use the same tokenizer later when generating text
Step 2: We pass the encoded token IDs to the embedding layer. This layer is a trainable vector embedding space where each token ID is represented as a high-dimensional vector occupying a unique position in the space.
These vectors learn to encode the meaning and context of individual tokens in the input sequence
Word2Vec also uses this concept
Related words are located close to each other in the embedding space:
We can calculate how related two words are by measuring the distance between their vectors:
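As a toy illustration of "distance between vectors", here is a small cosine-similarity sketch; the 3-dimensional vectors are made-up values, real embedding spaces have hundreds or thousands of dimensions.

```python
# Toy sketch: relatedness of words as similarity between their embedding vectors.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

teacher = [0.9, 0.1, 0.3]   # made-up toy embeddings
student = [0.8, 0.2, 0.4]
banana  = [0.1, 0.9, 0.0]
print(cosine_similarity(teacher, student))  # high: related words sit close together
print(cosine_similarity(teacher, banana))   # low: unrelated words sit far apart
```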
Step 3: Add positional encoding to preserve word-order information, i.e. the relevance of each word's position in the sentence.
Step 4: Pass to the self-attention layer
Here the self-attention weights between the different words in the sentence are learned; the contextual dependencies between the words are detected and stored during training
This self-attention doesn't happen only once: multi-headed (normally 12-100 heads) self-attention weights are learned to capture different aspects of the language; for example, one head learns the relevance between human entities, another the relevance among verbs, etc. The weights of each head are randomly initialized and vary from model to model.
- Step 5: Feed all the attention-weighted outputs to a fully connected feed-forward network, then to a softmax layer, and get a probability for each word
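A toy sketch of single-head scaled dot-product self-attention (Steps 4-5), with random weight matrices standing in for learned parameters; this is my own illustration, not the course notebook.

```python
# Toy single-head self-attention: each token attends to every other token in the sequence.
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model) token embeddings (with positional encoding already added)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)   # pairwise attention scores
    weights = F.softmax(scores, dim=-1)       # attention map: each row sums to 1
    return weights @ v                        # contextualized token representations

d_model = 8
x = torch.randn(5, d_model)                               # 5 tokens, e.g. "the teacher taught the student"
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)             # torch.Size([5, 8])
```

A multi-headed transformer runs many such heads in parallel, each with its own learned weight matrices.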
We'll see how the Transformer model works for a sequence-to-sequence translation task
The steps in this chapter are slightly different from the steps above that illustrate the mechanism of the Transformer, because there are more details in the real use case
Step 1: Tokenize the words in the sequence:
- Step 2: Encoded token IDs are passed to the embedding layer, then positional information is added
Step 3: Encoder: the data that leaves the encoder is a deep representation of the structure and meaning of the input sequence
Step 4: The decoder gets the information passed by the encoder
Step 5: Then a start-of-sequence token is added to the input of the decoder, which triggers the decoder to predict the next token based on the contextual understanding it gets from the encoder
Step 6: After going through a fully connected feed-forward network, the softmax output layer gives a probability for every possible next token; we pick from it and get the first output token:
- Step 7: We continue this loop by feeding the output token back into the decoder input to trigger the next token:
- Step 8: The final output tokens are detokenized and we get the output sequence:
Encoder: takes input sequences/prompts and produces a deep representation of their structure and meaning
Decoder: triggered by an input token, uses the contextual understanding from the encoder and generates new tokens
Encoder Only Models such as BERT: without additional layers, the input sequence and output sequence have the same length; with an additional layer, this kind of model can do classification tasks such as sentiment analysis
Encoder Decoder Models such as BART, CodeT5: sequence-to-sequence tasks such as translation, where the input sequence and output sequence can be different lengths
Decoder Only Models: GPT family of models, BLOOM, Jurassic; the most commonly used today, and as these models scale, they can now generalize to most tasks
pdf: https://arxiv.org/pdf/1706.03762
"Attention is All You Need" is a research paper published in 2017 by Google researchers, which introduced the Transformer model, a novel architecture that revolutionized the field of natural language processing (NLP) and became the basis for the LLMs we now know - such as GPT, PaLM and others. The paper proposes a neural network architecture that replaces traditional recurrent neural networks (RNNs) and convolutional neural networks (CNNs) with an entirely attention-based mechanism.
The Transformer model uses self-attention to compute representations of input sequences, which allows it to capture long-term dependencies and parallelize computation effectively. The authors demonstrate that their model achieves state-of-the-art performance on several machine translation tasks and outperforms previous models that rely on RNNs or CNNs.
The Transformer architecture consists of an encoder and a decoder, each of which is composed of several layers. Each layer consists of two sub-layers: a multi-head self-attention mechanism and a feed-forward neural network. The multi-head self-attention mechanism allows the model to attend to different parts of the input sequence, while the feed-forward network applies a point-wise fully connected layer to each position separately and identically.
The Transformer model also uses residual connections and layer normalization to facilitate training and prevent overfitting. In addition, the authors introduce a positional encoding scheme that encodes the position of each token in the input sequence, enabling the model to capture the order of the sequence without the need for recurrent or convolutional operations.
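For reference, the core scaled dot-product attention from the paper, where Q, K, V are the query, key, and value matrices and d_k is the key dimension:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$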
I've understood the structure of this model.
I'll leave out the details of positional encoding and how the self-attention weights are calculated for now, and only check them when needed in the future
Text as prompt ---> inference by the LLM ---> completion
The context window for prompts is typically a few thousand words
Prompt engineering: revise and improve the prompt several times so the model behaves as you expect
Including examples of the task you want the model to carry out in the prompt is a powerful strategy for better results
Zero-shot inference: the largest current LLMs are good at it; smaller models such as earlier versions of GPT-2 can generate some plausible following words but don't understand the requirement
One-shot inference: give one sample review in the prompt for the model to refer to:
Few-shot inference: for even smaller models, we can give multiple samples of the same task:
We can use samples to help the model give the expected result,
but if the model is still not working well even with 5 or 6 samples, we should fine-tune the model instead
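A minimal sketch of what a few-shot prompt can look like; the review texts are made up for illustration.

```python
# Few-shot prompt: the in-context examples show the model the pattern to complete.
few_shot_prompt = """Classify the sentiment of the review as positive or negative.

Review: I loved this movie, the acting was superb.
Sentiment: positive

Review: The plot was predictable and the pacing was slow.
Sentiment: negative

Review: The soundtrack alone makes it worth watching.
Sentiment:"""
# Drop the two solved examples for zero-shot, keep one for one-shot; pass the string to the
# model and it should complete the pattern ("positive").
```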
We can modify the configuration parameters of the model to influence the prediction of the next token
This is the GUI for modifying model parameters in Hugging Face and adjusting how the LLM behaves:
These parameters are different from the training parameters, which are learned during training time
Instead, these configuration parameters are invoked at inference time
Output of the Transformer: a probability distribution over the model's entire vocabulary
Picking strategy - greedy:
Select the word/token with the highest probability
Works well for short generation but is susceptible to repeated words
Picking strategy - random sampling:
To generate more natural, creative output and avoid repeating words, random sampling is the easiest way to introduce variability
We use the parameters k and p to control the creativity of the generated result while still keeping it sensible
top-k: the number of tokens to choose from; after applying the random-weighted strategy, select an output from the top-k results. This parameter keeps the generated result sensible
top-p: a cumulative probability cutoff for the chosen results; in this example, when p = 0.30, "cake" and "donut" are chosen because their probabilities add up to 0.30
temperature: controls the randomness of the model output by reshaping the probability distribution the model samples from; a higher temperature flattens the distribution and alters the predictions the model will make (like turning up the heat under a pot of water), while a lower temperature makes the output more deterministic
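A minimal sketch of setting these inference-time configuration parameters with the transformers generate() API, reusing the small Pythia model from the earlier notes; the specific prompt and values are arbitrary examples.

```python
# Inference-time configuration: greedy vs. sampling with top-k, top-p, and temperature.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")
input_ids = tokenizer("The best thing about fine-tuning is", return_tensors="pt").input_ids

greedy = model.generate(input_ids, max_new_tokens=30)   # always pick the highest-probability token
sampled = model.generate(
    input_ids,
    do_sample=True,        # random-weighted sampling instead of greedy decoding
    top_k=50,              # only consider the 50 most likely tokens
    top_p=0.9,             # ...whose cumulative probability is at most 0.9
    temperature=0.7,       # <1 sharpens the distribution, >1 flattens it
    max_new_tokens=30,
)
print(tokenizer.decode(sampled[0], skip_special_tokens=True))
```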
1. Define the scope as narrowly/accurately as we can
The capability of different models highly depends on the model's size and architecture, so we should choose the model carefully based on the task we want the LLM to accomplish
- Good at many tasks, or at one specific type of task?
2. Choose the model
In most cases we'll choose an existing base model instead of building one from scratch
Known issues: hallucination and the inability to deal with complex math, and their solutions
(Lab note: do not use a personal Amazon account in this course)
There are preset prompt templates for different models, e.g. for FLAN-T5: https://github.com/google-research/FLAN/tree/main/flan/v2
We should try zero-shot, one-shot, and few-shot prompting on base models to find suitable ones worth fine-tuning; also, more than four shots will not help much
Considerations for choosing a model:
For most cases, I'll choose an existing pre-trained foundation model
it depends on the details of the specific task
In Huggingface (https://huggingface.co/tasks), models are categorized by different tasks and described with model card:
The reason different models are suitable for different tasks is that different models are trained in different ways
Pre-training is self-supervised learning on TBs or more of unstructured textual data from many sources (the internet or a specific text corpus)
For each token, the encoder generates a respective token ID and vector representation
Pretraining requires a large amount of GPU and compute
Typically only 1-3% of the originally collected tokens survive data-quality filtering and are used in pretraining; we should consider this when deciding the size of the dataset to collect
Autoencoding models are pre-trained with Masked Language Modeling (MLM):
Tokens in the sequence are randomly masked, and the training objective is to reconstruct the original sentence (also called denoising)
- BERT (Bidirectional Encoder Representations from Transformers)
Autoencoding models build a bidirectional context of the input sequence, meaning the model has the full context of a token, not just the words that come before it
This allows a model like BERT to capture nuances of the language, such as sarcasm in text, which is crucial for accurately determining sentiment
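A minimal sketch of masked language modeling in practice, using the transformers fill-mask pipeline with BERT; the example sentence is my own.

```python
# MLM in action: BERT reconstructs the masked token from bidirectional context.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The teacher handed the [MASK] a book."):
    print(f"{pred['token_str']!r}  score={pred['score']:.3f}")
```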
Autoregressive models are pre-trained with Causal Language Modeling, which is unidirectional training, in contrast to the bidirectional training of BERT
Encoder-decoder models, also known as sequence-to-sequence models: the details of how different models are trained vary
A typical model, T5, was trained with span corruption: in the encoder input, a random sequence of tokens is masked and replaced by a sentinel token, which keeps the order/location information as a placeholder
The decoder is then tasked with reconstructing the masked token sequence auto-regressively; the output is the sentinel token (which keeps its location information) followed by the predicted token sequence
Sequence-to-sequence models are generally useful in cases where we have a body of text as input and a body of text as output
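A minimal sketch of a sequence-to-sequence model doing text-in/text-out with different input and output lengths, using the small FLAN-T5 checkpoint (the model family used in this course); the prompt is an arbitrary example.

```python
# Encoder-decoder (sequence-to-sequence) generation with FLAN-T5.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

inputs = tokenizer("Translate English to German: How old are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)   # the decoder generates token by token
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```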
The approximate GPU RAM to store and train 1B parameters:
We require about 6 times the amount of GPU RAM that the model weights alone take up.
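A rough back-of-the-envelope behind the "about 6 times" figure; the per-parameter byte counts below (Adam optimizer states, gradients, activations/temporary memory) are the approximate values such estimates typically assume.

```python
# Approximate GPU RAM to train 1B parameters at 32-bit precision.
params = 1_000_000_000
weights     = params * 4    # FP32 weights: 4 bytes/param   -> ~4 GB just to store the model
adam_states = params * 8    # two Adam optimizer states     -> ~8 GB
gradients   = params * 4    #                               -> ~4 GB
activations = params * 8    # activations + temp memory     -> ~8 GB (rough upper estimate)
total = weights + adam_states + gradients + activations
print(total / 1e9, "GB ~", total / weights, "x the weights alone")   # ~24 GB, ~6x
```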
Quantization: store the data using FP16, BFLOAT16, or INT8 instead of 32-bit floating point (FP32)
INT8 saves a lot of memory but also loses a lot of precision
Quantization reduces the precision of the model weights and saves memory by projecting the original 32-bit floating-point values into lower-precision spaces
BFLOAT16 is chosen for many models such as FLAN-T5
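A minimal sketch of loading a model in reduced precision to cut memory, reusing the small Pythia checkpoint; 8-bit loading would additionally require the bitsandbytes integration, which I leave out here.

```python
# Load weights in bfloat16 instead of the default float32 to halve the weight memory.
import torch
from transformers import AutoModelForCausalLM

model_fp32 = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")
model_bf16 = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m", torch_dtype=torch.bfloat16)
print(model_fp32.get_memory_footprint() / 1e6, "MB")   # roughly twice as large...
print(model_bf16.get_memory_footprint() / 1e6, "MB")   # ...as the bfloat16 version
```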
We can distribute the dataset across different GPUs that each hold a full copy of the model; however, in this case the model parameters are redundantly replicated
Or we can divide the model parameters into different shards to avoid that redundancy:
We can see the redundant copies of the model parameters are eliminated by FSDP, but keep in mind the GPUs then have to communicate with each other
Sharding factor in FSDP
To improve the performance of the model, we can either increase the size of the dataset or the number of parameters of the model, but we also need to consider the compute budget
More powerful processors compute faster:
In reality, we cannot just increase the budget to get better performance:
There are also power laws when we hold the other two elements fixed:
The Chinchilla paper finds the sweet spot for optimal performance when the training dataset size and model size are balanced
Chinchilla hints that a lot of very large models may be over-parameterized and under-trained
Chinchilla rule of thumb: compute-optimal dataset size ≈ 20 × number of model parameters (in tokens); for example, a 70B-parameter model would call for roughly 1.4T training tokens
If our target domain uses vocabulary and language structures that are not commonly used in daily life, we may need to pre-train the model from scratch
For example, in legal language and medical language the terms are unlikely to appear in the training text of existing LLMs, so the model will have problems understanding the terms and using them correctly
Example: a finance-domain-adapted LLM, trained on 51% financial data and 49% public data; it used the Chinchilla scaling law for guidance and made tradeoffs
Pre-training and fine-tuning:
In pre-training, we use a vast amount of unstructured textual data and do self-supervised learning
Fine-tuning is a supervised learning process where we use a dataset of labeled, task-specific examples to update the weights of the LLM
Full fine-tuning updates all parameters; we can also benefit from memory-optimization and parallel-computing strategies
How do we fine-tune? There are existing prompt templates for different tasks; the result of applying these templates is instructions plus examples from the dataset
First we get our prepared instruction dataset:
We divide the dataset into training, validation, and test splits
During fine-tuning, prompts from the training dataset are passed to the LLM, the LLM generates completions, and we compare the predicted result with the label specified in the training data
We compute the loss from the difference between the predicted token probability distribution and the label's distribution (cross-entropy)
We use the calculated loss to update the weights of the LLM with standard backpropagation
We do this for many batches of prompt-completion pairs over several epochs to update the weights, so the model's performance on the task improves
Measure LLM performance with the holdout validation dataset to get the validation accuracy
After completing fine-tuning, perform a final performance evaluation using the test dataset to get the test accuracy
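A minimal sketch of the supervised fine-tuning loop described above: the labeled completion supplies the target tokens, the loss is the standard cross-entropy over the predicted token distribution, and the weights are updated by backpropagation. The model, learning rate, and single hard-coded example are placeholders for illustration.

```python
# Supervised fine-tuning loop: forward pass -> cross-entropy loss -> backpropagation -> weight update.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One placeholder prompt-completion pair; a real run iterates over batches from the training split.
text = "### Instruction:\nWhat does fine-tuning do?\n\n### Response:\nIt adapts a base model to a task."
batch = tokenizer(text, return_tensors="pt")

model.train()
for epoch in range(3):                                   # several epochs over the training data
    outputs = model(**batch, labels=batch["input_ids"])  # loss = cross-entropy vs. the labels
    outputs.loss.backward()                              # backpropagation
    optimizer.step()                                     # update the LLM weights
    optimizer.zero_grad()
    print(f"epoch {epoch}: loss {outputs.loss.item():.3f}")
```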
Result:
The fine-tuning process results in a new version of the base model, called an instruct model, which is better at the tasks we are interested in
Catastrophic forgetting issue:
Fine-tuning the base model on a single task may lead to catastrophic forgetting
Cause of this phenomenon: full fine-tuning modifies the weights of the original LLM, letting the model perform great on the single fine-tuning task but degrading performance on other tasks, so the model can no longer carry them out
Ways to avoid it:
- Fine-tune on multiple tasks at the same time, which may require 50-100,000 examples across many tasks
- Instead of full fine-tuning, perform parameter-efficient fine-tuning (PEFT), which preserves the weights of the original LLM and trains only a small number of task-specific adapter layers and parameters
When we train the model on a mixed dataset that contains various tasks, the performance of the model improves for all tasks simultaneously, thus avoiding the issue of catastrophic forgetting
Drawback: we may need many examples (50-100,000) of each task for training
FLAN (Fine-tuned LAnguage Net) is an instruction fine-tuning model; "flan" also names a dessert served after the main course, and likewise this instruction fine-tuning is the last step after the "main course" of pre-training
For example, FLAN-T5 is the fine-tuned version of the foundation model T5
FLAN-T5 is trained across 473 datasets and 146 task categories
A prompt template for samsum:
Different ways of saying the same instruction are included to help the model generalize and perform better
- The dialogue is inserted into the dialogue field:
- The summary is used as the label:
The samsum dataset, which contains day-to-day conversations, is not sufficient, so we use the dialogsum dataset to fine-tune the model for this summarization task on chatbot conversations
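A minimal sketch of turning a dialogsum-style record into a FLAN-style summarization prompt; the record and template wording are made up to illustrate the field mapping.

```python
# Build a summarization prompt (input) and label (target) from one dialogue record.
record = {
    "dialogue": "A: The printer on floor 3 is jammed again.\nB: I'll log a ticket with IT.",
    "summary": "B will report the jammed printer to IT.",
}
prompt = f"Summarize this dialogue:\n\n{record['dialogue']}\n\nSummary:"   # goes into the dialogue field
label = record["summary"]                                                  # used as the training label
print(prompt)
print("label:", label)
```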
This issue focuses on the technical courses we take about LLM, we'll put the paper part in https://github.com/xp1632/DFKI_working_log/issues/70
ChainForge https://chainforge.ai/ , https://github.com/ianarawjo/ChainForge ChainForge is an open-source visual programming environment for prompt engineering. With ChainForge, you can evaluate the robustness of prompts and text generation models in a way that goes beyond anecdotal evidence.
Low-Code LLM Low-code LLM: Graphical User Interface over Large Language Models https://www.semanticscholar.org/paper/Low-code-LLM%3A-Graphical-User-Interface-over-Large-Cai-Mao/490776b4c01b5950275a3541183f6b9e3818c207