[x] 1. Finetuning LLM: https://learn.deeplearning.ai/courses/finetuning-large-language-models/lesson/5/data-preparation
[ ] 2. Generative AI with Large Language Models https://coursera.org/share/ce9b14669661dabbb26a990b80e81a13
[ ] 3. Hugging Face NLP courses https://huggingface.co/learn/nlp-course/chapter7/6
[ ] 4. IBM Machine Learning Professional Certificate https://www.coursera.org/professional-certificates/ibm-machine-learning
[ ] 5. Hugging Face: Building Generative AI Applications with Gradio https://learn.deeplearning.ai/courses/huggingface-gradio/lesson/1/introduction
[ ] 6. Langchain: LangChain for LLM Application Development https://learn.deeplearning.ai/courses/langchain/lesson/1/introduction
Course Link: https://learn.deeplearning.ai/courses/finetuning-large-language-models/lesson/5/data-preparation
Library for Lamini: https://lamini-ai.github.io/tuning/quick_start/#basic-tuning
Course Notes:
Stage 1: Pretraining to get a base model
Stage 2: Finetuning to train the model further
- update the entire model, not just part of it
- Behavior change: model will learn to focus and respond more consistently
- Gain Knowledge: Increase knowledge of new specific concepts
- Extraction of text ---> get keywords
- or Expansion of information ---> longer writing such as emails, code
2.6.1 Two types of instruction prompt templates
Better data means: higher quality, more diversity, real (not generated), and more of it
Step 1: collect instruction-response pairs
Step 2: concatenate pairs, [add prompt template]
Step 3: tokenize the data, which turns text into numbers; the encoding depends on how frequently tokens appear in the training corpus
Step 4: split into train/test
AutoTokenizer
Code :
import pandas as pd
import datasets
from pprint import pprint
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
text = "Hi, how are you?"
encoded_text = tokenizer(text)["input_ids"]
# encoded_text: [12764, 13, 849, 403, 368, 32]
decoded_text = tokenizer.decode(encoded_text)
print("Decoded tokens back into text: ", decoded_text)
# Decoded tokens back into text: Hi, how are you?
list_texts = ["Hi, how are you?", "I'm good", "Yes"]
encoded_texts = tokenizer(list_texts)
print("Encoded several texts: ", encoded_texts["input_ids"])
# Encoded several texts: [[12764, 13, 849, 403, 368, 32], [42, 1353, 1175], [4374]]
tokenizer.pad_token = tokenizer.eos_token  # pythia has no pad token by default; reuse EOS (id 0)
encoded_texts_both = tokenizer(list_texts, max_length=3, truncation=True, padding=True)
print("Using both padding and truncation: ", encoded_texts_both["input_ids"])
# Using both padding and truncation: [[403, 368, 32], [42, 1353, 1175], [4374, 0, 0]]
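To tie Steps 1-4 listed above together, here is a minimal sketch of applying a prompt template to instruction-response pairs, tokenizing, and splitting into train/test. The template wording, the example pairs, and the 50/50 split are my own illustration, not the course's exact dataset.

```python
# Sketch of Steps 1-4: instruction-response pairs -> prompt template -> tokenize -> train/test split.
import datasets
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")

prompt_template = """### Instruction:
{instruction}

### Response:
{response}"""

pairs = [
    {"instruction": "What does finetuning do?", "response": "It trains a base model further on task data."},
    {"instruction": "Name a tokenizer class.", "response": "AutoTokenizer."},
]
texts = [prompt_template.format(**pair) for pair in pairs]      # Step 2: concatenate with template

tokenized = tokenizer(texts, truncation=True, max_length=512)   # Step 3: text -> token IDs
dataset = datasets.Dataset.from_dict({"text": texts, "input_ids": tokenized["input_ids"]})
split = dataset.train_test_split(test_size=0.5, seed=42)        # Step 4: train/test split
print(split)
```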
Same as for other neural networks: calculate the loss and update the weights of the LLM parameters
Step 1: Load json dataset
Step 2: Set up model, training config, tokenizer
Step 3: Inference of the model; inference just lets the current model generate a result based on the input prompt:
def inference(text, model, tokenizer, max_input_tokens=1000, max_output_tokens=100):
    # Tokenize
    input_ids = tokenizer.encode(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=max_input_tokens
    )

    # Generate
    device = model.device
    generated_tokens_with_prompt = model.generate(
        input_ids=input_ids.to(device),
        max_length=max_output_tokens
    )

    # Decode
    generated_text_with_prompt = tokenizer.batch_decode(generated_tokens_with_prompt, skip_special_tokens=True)

    # Strip the prompt
    generated_text_answer = generated_text_with_prompt[0][len(text):]

    return generated_text_answer
trainer = Trainer(
    model=base_model,
    model_flops=model_flops,
    total_steps=max_steps,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
- after 3 training steps, the model is not fully trained
- this slightly fine-tuned model doesn't give great results
- thus we train further on the entire dataset for two epochs and get a better result
Moderation: include examples in the training dataset to handle off-topic questions, e.g. Q: "Can you laugh?" A: "Let's keep this conversation relevant to coding."
How to evaluate generative models? Common benchmarks:
- ARC for grade-school questions
- HellaSwag for common sense
- MMLU for computer science and more
- TruthfulQA for falsehoods
Common error patterns to check in the results: misspellings, outputs that are too long, repetition
I think the benchmark evaluation suite is a good start
PEFT: https://github.com/huggingface/peft
LoRA trains fewer parameters: https://huggingface.co/docs/peft/main/en/conceptual_guides/lora
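A minimal sketch of LoRA with the PEFT library, reusing the same Pythia model from the notes above; the rank, alpha, and target module below are values I picked for illustration, not values from the course.

```python
# Sketch: wrap a base model with LoRA adapters via PEFT so only a small set of weights is trained.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")
lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices (assumption)
    lora_alpha=16,                        # scaling factor (assumption)
    target_modules=["query_key_value"],   # attention projection in GPT-NeoX/Pythia
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()   # only a tiny fraction of parameters are trainable
```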
Course Link: https://www.coursera.org/learn/generative-ai-with-llms/lecture/9uWab/course-introduction
Week1 - 1. Generative AI and LLMs
Foundation models / base models:
We'll use FLAN-T5 in this course
Prompt as input, the model does inference, and the output is a completion
LLM use cases:
chatbot, next-word prediction, writing essays, summarizing text
natural language --> code
entity recognition
augmenting LLMs by connecting them to external data sources/APIs
How the Transformer works: https://www.coursera.org/learn/generative-ai-with-llms/lecture/3AqWI/transformers-architecture
Transformer architecture - Attention is All you need
Transformers let the network learn the meaning of each word in context and its relevance (attention) to every other word
Attention map for different words with different weights in the sentence:
The word "book" has strong attention with the words "student" and "teacher"
This self-attention improves the model's ability to encode the language
How does the transformer model work?
A simple diagram of the transformer model's architecture
Step 1: The tokenizer converts the words/phrases into numbers, since the model only deals with numbers; importantly, we must use the same tokenizer later when generating text
Step 2: We pass the encoded token IDs to the embedding layer. This layer is a trainable vector embedding space where each token ID is represented as a high-dimensional vector occupying a unique position in the space.
These vectors learn to encode the meaning and context of individual tokens in the input sequence
Word2Vec also uses this concept
Related words are located close to each other in the embedding space:
We can calculate how related two words are by measuring the distance between their vectors:
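As a toy illustration of "distance between vectors", here is a small cosine-similarity sketch; the 3-dimensional vectors are made-up values, real embedding spaces have hundreds or thousands of dimensions.

```python
# Toy sketch: relatedness of words as similarity between their embedding vectors.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

teacher = [0.9, 0.1, 0.3]   # made-up toy embeddings
student = [0.8, 0.2, 0.4]
banana  = [0.1, 0.9, 0.0]
print(cosine_similarity(teacher, student))  # high: related words sit close together
print(cosine_similarity(teacher, banana))   # low: unrelated words sit far apart
```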
Step 3: Add positional encoding to preserve word-order information, i.e. the relevance of each word's position in the sentence.
Step 4: Pass to the self-attention layer
Here the self-attention weights between the different words in the sentence are learned; the contextual dependencies between the words are detected and stored during training
This self-attention doesn't happen only once: multi-headed (normally 12-100 heads) self-attention weights are learned to capture different aspects of the language; for example, one head learns the relevance between human entities, another the relevance among verbs, etc. The weights of each head are randomly initialized and vary from model to model.
- Step 5: Feed all the attention-weighted outputs to a fully connected feed-forward network, then to a softmax layer, and get a probability for each word
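A toy sketch of single-head scaled dot-product self-attention (Steps 4-5), with random weight matrices standing in for learned parameters; this is my own illustration, not the course notebook.

```python
# Toy single-head self-attention: each token attends to every other token in the sequence.
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model) token embeddings (with positional encoding already added)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)   # pairwise attention scores
    weights = F.softmax(scores, dim=-1)       # attention map: each row sums to 1
    return weights @ v                        # contextualized token representations

d_model = 8
x = torch.randn(5, d_model)                               # 5 tokens, e.g. "the teacher taught the student"
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)             # torch.Size([5, 8])
```

A multi-headed transformer runs many such heads in parallel, each with its own learned weight matrices.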
We'll see how the Transformer model works for a sequence-to-sequence translation task
The steps in this chapter are slightly different from the steps above that illustrate the mechanism of the Transformer, because there are more details in the real use case
Step 1: Tokenize the words in the sequence:
- Step 2: Encoded token IDs are passed to the embedding layer, then positional information is added
Step 3: Encoder: the data that leaves the encoder is a deep representation of the structure and meaning of the input sequence
Step 4: The decoder gets the information passed by the encoder
Step 5: Then a start-of-sequence token is added to the input of the decoder, which triggers the decoder to predict the next token based on the contextual understanding it gets from the encoder
Step 6: After going through a fully connected feed-forward network, the softmax output layer gives a probability for every possible next token; we pick from it and get the first output token:
- Step 7: We continue this loop by feeding the output token back into the decoder input to trigger the next token:
- Step 8: The final output tokens are detokenized and we get the output sequence:
Encoder: takes input sequences/prompts and produces a deep representation of their structure and meaning
Decoder: triggered by an input token, uses the contextual understanding from the encoder and generates new tokens
Encoder Only Models such as BERT: without additional layers, the input sequence and output sequence have the same length; with an additional layer, this kind of model can do classification tasks such as sentiment analysis
Encoder Decoder Models such as BART, CodeT5: sequence-to-sequence tasks such as translation, where the input sequence and output sequence can be different lengths
Decoder Only Models: GPT family of models, BLOOM, Jurassic; the most commonly used today, and as these models scale, they can now generalize to most tasks
pdf: https://arxiv.org/pdf/1706.03762
"Attention is All You Need" is a research paper published in 2017 by Google researchers, which introduced the Transformer model, a novel architecture that revolutionized the field of natural language processing (NLP) and became the basis for the LLMs we now know - such as GPT, PaLM and others. The paper proposes a neural network architecture that replaces traditional recurrent neural networks (RNNs) and convolutional neural networks (CNNs) with an entirely attention-based mechanism.
The Transformer model uses self-attention to compute representations of input sequences, which allows it to capture long-term dependencies and parallelize computation effectively. The authors demonstrate that their model achieves state-of-the-art performance on several machine translation tasks and outperforms previous models that rely on RNNs or CNNs.
The Transformer architecture consists of an encoder and a decoder, each of which is composed of several layers. Each layer consists of two sub-layers: a multi-head self-attention mechanism and a feed-forward neural network. The multi-head self-attention mechanism allows the model to attend to different parts of the input sequence, while the feed-forward network applies a point-wise fully connected layer to each position separately and identically.
The Transformer model also uses residual connections and layer normalization to facilitate training and prevent overfitting. In addition, the authors introduce a positional encoding scheme that encodes the position of each token in the input sequence, enabling the model to capture the order of the sequence without the need for recurrent or convolutional operations.
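For reference, the core scaled dot-product attention from the paper, where Q, K, V are the query, key, and value matrices and d_k is the key dimension:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$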
I've understood the structure of this model.
I'll leave out the details of positional encoding and how the self-attention weights are calculated for now, and only check them when needed in the future
Text as prompt ---> inference by the LLM ---> completion
The context window for prompts is typically a few thousand words
Prompt engineering: revise and improve the prompt several times so the model behaves as you expect
Including examples of the task you want the model to carry out in the prompt is a powerful strategy for better results
Zero-shot inference: the largest current LLMs are good at it; smaller models such as earlier versions of GPT-2 can generate some plausible following words but don't understand the requirement
One-shot inference: give one sample review in the prompt for the model to refer to:
Few-shot inference: for even smaller models, we can give multiple samples of the same task:
We can use samples to help the model give the expected result,
but if the model is still not working well even with 5 or 6 samples, we should fine-tune the model instead
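A minimal sketch of what a few-shot prompt can look like; the review texts are made up for illustration.

```python
# Few-shot prompt: the in-context examples show the model the pattern to complete.
few_shot_prompt = """Classify the sentiment of the review as positive or negative.

Review: I loved this movie, the acting was superb.
Sentiment: positive

Review: The plot was predictable and the pacing was slow.
Sentiment: negative

Review: The soundtrack alone makes it worth watching.
Sentiment:"""
# Drop the two solved examples for zero-shot, keep one for one-shot; pass the string to the
# model and it should complete the pattern ("positive").
```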
We can modify the configuration parameters of the model to influence the prediction of the next token
This is the GUI for modifying model parameters in Hugging Face and adjusting how the LLM behaves:
These parameters are different from the training parameters, which are learned during training time
Instead, these configuration parameters are invoked at inference time
Output of the Transformer: a probability distribution over the model's entire vocabulary
Picking strategy - greedy:
Select the word/token with the highest probability
Works well for short generation but is susceptible to repeated words
Picking strategy - random sampling:
To generate more natural, creative output and avoid repeating words, random sampling is the easiest way to introduce variability
We use the parameters k and p to control the creativity of the generated result while still keeping it sensible
top-k: the number of tokens to choose from; after applying the random-weighted strategy, select an output from the top-k results. This parameter keeps the generated result sensible
top-p: a cumulative probability cutoff for the chosen results; in this example, when p = 0.30, "cake" and "donut" are chosen because their probabilities add up to 0.30
temperature: controls the randomness of the model output by reshaping the probability distribution the model samples from; a higher temperature flattens the distribution and alters the predictions the model will make (like turning up the heat under a pot of water), while a lower temperature makes the output more deterministic
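A minimal sketch of setting these inference-time configuration parameters with the transformers generate() API, reusing the small Pythia model from the earlier notes; the specific prompt and values are arbitrary examples.

```python
# Inference-time configuration: greedy vs. sampling with top-k, top-p, and temperature.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")
input_ids = tokenizer("The best thing about fine-tuning is", return_tensors="pt").input_ids

greedy = model.generate(input_ids, max_new_tokens=30)   # always pick the highest-probability token
sampled = model.generate(
    input_ids,
    do_sample=True,        # random-weighted sampling instead of greedy decoding
    top_k=50,              # only consider the 50 most likely tokens
    top_p=0.9,             # ...whose cumulative probability is at most 0.9
    temperature=0.7,       # <1 sharpens the distribution, >1 flattens it
    max_new_tokens=30,
)
print(tokenizer.decode(sampled[0], skip_special_tokens=True))
```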
1. Define the scope as narrowly/accurately as we can
The capability of different models highly depends on the model's size and architecture, so we should choose the model carefully based on the task we want the LLM to accomplish
- Good at many tasks, or at one specific type of task?
2. Choose the model
In most cases we'll choose an existing base model instead of building one from scratch
Known issues: hallucination and the inability to deal with complex math, and their solutions
(Lab note: do not use a personal Amazon account in this course)
There are preset prompt templates for different models, e.g. for FLAN-T5: https://github.com/google-research/FLAN/tree/main/flan/v2
We should try zero-shot, one-shot, and few-shot prompting on base models to find suitable ones worth fine-tuning; also, more than four shots will not help much
Considerations for choosing a model:
For most cases, I'll choose an existing pre-trained foundation model
it depends on the details of the specific task
In Huggingface (https://huggingface.co/tasks), models are categorized by different tasks and described with model card:
The reason different models are suitable for different tasks is that different models are trained in different ways
Pre-training is self-supervised learning on TBs or more of unstructured textual data from many sources (the internet or a specific text corpus)
For each token, the encoder generates a respective token ID and vector representation
Pretraining requires a large amount of GPU and compute
Typically only 1-3% of the originally collected tokens survive data-quality filtering and are used in pretraining; we should consider this when deciding the size of the dataset to collect
Autoencoding models are pre-trained with Masked Language Modeling (MLM):
Tokens in the sequence are randomly masked, and the training objective is to reconstruct the original sentence (also called denoising)
- BERT (Bidirectional Encoder Representations from Transformers)
Autoencoding models build a bidirectional context of the input sequence, meaning the model has the full context of a token, not just the words that come before it
This allows a model like BERT to capture nuances of the language, such as sarcasm in text, which is crucial for accurately determining sentiment
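A minimal sketch of masked language modeling in practice, using the transformers fill-mask pipeline with BERT; the example sentence is my own.

```python
# MLM in action: BERT reconstructs the masked token from bidirectional context.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The teacher handed the [MASK] a book."):
    print(f"{pred['token_str']!r}  score={pred['score']:.3f}")
```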
Autoregressive models are pre-trained with Causal Language Modeling, which is unidirectional training, in contrast to the bidirectional training of BERT
Encoder-decoder models, also known as sequence-to-sequence models: the details of how different models are trained vary
A typical model, T5, was trained with span corruption: in the encoder input, a random sequence of tokens is masked and replaced by a sentinel token, which keeps the order/location information as a placeholder
The decoder is then tasked with reconstructing the masked token sequence auto-regressively; the output is the sentinel token (which keeps its location information) followed by the predicted token sequence
Sequence-to-sequence models are generally useful in cases where we have a body of text as input and a body of text as output
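A minimal sketch of a sequence-to-sequence model doing text-in/text-out with different input and output lengths, using the small FLAN-T5 checkpoint (the model family used in this course); the prompt is an arbitrary example.

```python
# Encoder-decoder (sequence-to-sequence) generation with FLAN-T5.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

inputs = tokenizer("Translate English to German: How old are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)   # the decoder generates token by token
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```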
The approximate GPU RAM to store and train 1B parameters:
We require about 6 times the amount of GPU RAM that the model weights alone take up.
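A rough back-of-the-envelope behind the "about 6 times" figure; the per-parameter byte counts below (Adam optimizer states, gradients, activations/temporary memory) are the approximate values such estimates typically assume.

```python
# Approximate GPU RAM to train 1B parameters at 32-bit precision.
params = 1_000_000_000
weights     = params * 4    # FP32 weights: 4 bytes/param   -> ~4 GB just to store the model
adam_states = params * 8    # two Adam optimizer states     -> ~8 GB
gradients   = params * 4    #                               -> ~4 GB
activations = params * 8    # activations + temp memory     -> ~8 GB (rough upper estimate)
total = weights + adam_states + gradients + activations
print(total / 1e9, "GB ~", total / weights, "x the weights alone")   # ~24 GB, ~6x
```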
Quantization: store the data using FP16, BFLOAT16, or INT8 instead of 32-bit floating point (FP32)
INT8 saves a lot of memory but also loses a lot of precision
Quantization reduces the precision of the model weights and saves memory by projecting the original 32-bit floating-point values into lower-precision spaces
BFLOAT16 is chosen for many models such as FLAN-T5
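A minimal sketch of loading a model in reduced precision to cut memory, reusing the small Pythia checkpoint; 8-bit loading would additionally require the bitsandbytes integration, which I leave out here.

```python
# Load weights in bfloat16 instead of the default float32 to halve the weight memory.
import torch
from transformers import AutoModelForCausalLM

model_fp32 = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")
model_bf16 = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m", torch_dtype=torch.bfloat16)
print(model_fp32.get_memory_footprint() / 1e6, "MB")   # roughly twice as large...
print(model_bf16.get_memory_footprint() / 1e6, "MB")   # ...as the bfloat16 version
```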
We can distribute the dataset across different GPUs that each hold a full copy of the model; however, in this case the model parameters are redundantly replicated
Or we can divide the model parameters into different shards to avoid that redundancy:
We can see the redundant copies of the model parameters are eliminated by FSDP, but keep in mind the GPUs then have to communicate with each other
Sharding factor in FSDP
To improve the performance of the model, we can either increase the size of the dataset or the number of parameters of the model, but we also need to consider the compute budget
More powerful processors compute faster:
In reality, we cannot just increase the budget to get better performance:
There are also power laws when we hold the other two elements fixed:
The Chinchilla paper finds the sweet spot for optimal performance when the training dataset size and model size are balanced
Chinchilla hints that a lot of very large models may be over-parameterized and under-trained
Chinchilla rule of thumb: compute-optimal dataset size ≈ 20 × number of model parameters (in tokens); for example, a 70B-parameter model would call for roughly 1.4T training tokens
If our target domain uses vocabulary and language structures that are not commonly used in daily life, we may need to pre-train the model from scratch
For example, in legal language and medical language the terms are unlikely to appear in the training text of existing LLMs, so the model will have problems understanding the terms and using them correctly
Example: a finance-domain-adapted LLM, trained on 51% financial data and 49% public data; it used the Chinchilla scaling law for guidance and made tradeoffs
Pre-training and fine-tuning:
In pre-training, we use a vast amount of unstructured textual data and do self-supervised learning
Fine-tuning is a supervised learning process where we use a dataset of labeled, task-specific examples to update the weights of the LLM
Full fine-tuning updates all parameters; we can also benefit from memory-optimization and parallel-computing strategies
How do we fine-tune? There are existing prompt templates for different tasks; the result of applying these templates is instructions plus examples from the dataset
First we get our prepared instruction dataset:
We divide the dataset into training, validation, and test splits
During fine-tuning, prompts from the training dataset are passed to the LLM, the LLM generates completions, and we compare the predicted result with the label specified in the training data
We compute the loss from the difference between the predicted token probability distribution and the label's distribution (cross-entropy)
We use the calculated loss to update the weights of the LLM with standard backpropagation
We do this for many batches of prompt-completion pairs over several epochs to update the weights, so the model's performance on the task improves
Measure LLM performance with the holdout validation dataset to get the validation accuracy
After completing fine-tuning, perform a final performance evaluation using the test dataset to get the test accuracy
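A minimal sketch of the supervised fine-tuning loop described above: the labeled completion supplies the target tokens, the loss is the standard cross-entropy over the predicted token distribution, and the weights are updated by backpropagation. The model, learning rate, and single hard-coded example are placeholders for illustration.

```python
# Supervised fine-tuning loop: forward pass -> cross-entropy loss -> backpropagation -> weight update.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One placeholder prompt-completion pair; a real run iterates over batches from the training split.
text = "### Instruction:\nWhat does fine-tuning do?\n\n### Response:\nIt adapts a base model to a task."
batch = tokenizer(text, return_tensors="pt")

model.train()
for epoch in range(3):                                   # several epochs over the training data
    outputs = model(**batch, labels=batch["input_ids"])  # loss = cross-entropy vs. the labels
    outputs.loss.backward()                              # backpropagation
    optimizer.step()                                     # update the LLM weights
    optimizer.zero_grad()
    print(f"epoch {epoch}: loss {outputs.loss.item():.3f}")
```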
Result:
The fine-tuning process results in a new version of the base model, called an instruct model, which is better at the tasks we are interested in
Catastrophic forgetting issue:
Fine-tuning the base model on a single task may lead to catastrophic forgetting
Cause of this phenomenon: full fine-tuning modifies the weights of the original LLM, letting the model perform great on the single fine-tuning task but degrading performance on other tasks, so the model can no longer carry them out
Ways to avoid it:
- Fine-tune on multiple tasks at the same time, which may require 50-100,000 examples across many tasks
- Instead of full fine-tuning, perform parameter-efficient fine-tuning (PEFT), which preserves the weights of the original LLM and trains only a small number of task-specific adapter layers and parameters
When we train the model on a mixed dataset that contains various tasks, the performance of the model improves for all tasks simultaneously, thus avoiding the issue of catastrophic forgetting
Drawback: we may need many examples (50-100,000) of each task for training
FLAN (Fine-tuned LAnguage Net) is an instruction fine-tuning model; "flan" also names a dessert served after the main course, and likewise this instruction fine-tuning is the last step after the "main course" of pre-training
For example, FLAN-T5 is the fine-tuned version of the foundation model T5
FLAN-T5 is trained across 473 datasets and 146 task categories
A prompt template for samsum:
Different ways of saying the same instruction are included to help the model generalize and perform better
- The dialogue is inserted into the dialogue field:
- The summary is used as the label:
The samsum dataset, which contains day-to-day conversations, is not sufficient, so we use the dialogsum dataset to fine-tune the model for this summarization task on chatbot conversations
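A minimal sketch of turning a dialogsum-style record into a FLAN-style summarization prompt; the record and template wording are made up to illustrate the field mapping.

```python
# Build a summarization prompt (input) and label (target) from one dialogue record.
record = {
    "dialogue": "A: The printer on floor 3 is jammed again.\nB: I'll log a ticket with IT.",
    "summary": "B will report the jammed printer to IT.",
}
prompt = f"Summarize this dialogue:\n\n{record['dialogue']}\n\nSummary:"   # goes into the dialogue field
label = record["summary"]                                                  # used as the training label
print(prompt)
print("label:", label)
```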
This issue focuses on the technical courses we take about LLM, we'll put the paper part in https://github.com/xp1632/DFKI_working_log/issues/70
ChainForge https://chainforge.ai/ , https://github.com/ianarawjo/ChainForge ChainForge is an open-source visual programming environment for prompt engineering. With ChainForge, you can evaluate the robustness of prompts and text generation models in a way that goes beyond anecdotal evidence.
Low-Code LLM Low-code LLM: Graphical User Interface over Large Language Models https://www.semanticscholar.org/paper/Low-code-LLM%3A-Graphical-User-Interface-over-Large-Cai-Mao/490776b4c01b5950275a3541183f6b9e3818c207