Preface

As usual, please read the earlier chapters first. That said, skipping them is not a big problem either. 😄

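Retrieval Augmented Generation (RAG) and Vector Databases

RAG mainly addresses two limitations of an LLM's training data: it is out of date, and it excludes most private data and company product manuals. The core idea is to put up-to-date material into the prompt alongside the user's input, so that the LLM receives it as context and can give a well-grounded answer. The benefits of using RAG are: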

Richer information, including up-to-date data

Fewer fabricated answers, because the model draws on verified data

Cheaper than fine-tuning

15.1 How RAG works

1. Knowledge base: the latest database or documents are split into chunks, which are then fed into an embedding model and converted into vectors.
2. User query: the user asks a question.
3. Retrieval: the question is used to search the knowledge base for relevant information. The retrieved information is combined with the question as context and sent to the LLM.
4. Augmented generation: the LLM analyzes the retrieved data together with the user's query, so its response is based not only on its pre-trained data but also on the relevant information in the added context. The LLM then returns the answer to the user.
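The retrieval and generation steps above largely come down to assembling text: the retrieved chunks are placed in front of the user's question before the LLM is called. Here is a minimal sketch of that assembly, assuming the chunks have already been retrieved. `build_augmented_prompt` and the prompt wording are hypothetical and not taken from any library, and the actual LLM call is left out.

```python
def build_augmented_prompt(question, retrieved_chunks):
    """Combine retrieved chunks and the user's question into one prompt.

    Hypothetical helper for illustration; the prompt wording is an
    assumption, not a format required by any particular LLM.
    """
    # Number the chunks so they can be told apart in the prompt.
    context = "\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


if __name__ == "__main__":
    chunks = [
        "Product X supports firmware version 2.1 and later.",
        "Firmware updates are installed from the Settings menu.",
    ]
    print(build_augmented_prompt("How do I update Product X?", chunks))
```

The resulting string is what would be sent to the LLM in place of the bare question.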

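Depending on how the retrieved data is used, RAG comes in two variants.

RAG-Sequence generates an answer as follows:

Retrieval: several documents relevant to the input query are first retrieved from the knowledge base.
Generation: a complete candidate answer is generated independently from each retrieved document.
Final answer: one of the candidates is selected as the final answer, or the candidates are combined by some ensemble method (such as voting).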

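RAG-Token generates an answer as follows:

Retrieval: as before, several documents relevant to the input query are retrieved from the knowledge base.
Generation: unlike RAG-Sequence, RAG-Token generates the answer token by token, and every token is generated with all retrieved documents taken into account.
Final answer: the whole answer is produced token by token, so each token draws on information from all of the retrieved documents.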

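In short, RAG-Sequence generates a complete answer from each retrieved document and then selects or combines those answers, while RAG-Token generates the answer token by token and uses information from all retrieved documents for every token.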
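A toy calculation may make the difference more concrete. It follows my understanding of the original RAG paper (Lewis et al., 2020), where both variants weight documents by their retrieval probability: RAG-Sequence scores the whole answer under each document and then sums over documents, while RAG-Token mixes the documents at every token. The function names and all numbers below are made up for illustration.

```python
import math


def rag_sequence_score(doc_weights, token_probs):
    """Sum over documents of p(doc) * product of that doc's per-token probabilities."""
    return sum(
        weight * math.prod(probs)
        for weight, probs in zip(doc_weights, token_probs)
    )


def rag_token_score(doc_weights, token_probs):
    """Product over tokens of the document-weighted per-token probability."""
    n_tokens = len(token_probs[0])
    return math.prod(
        sum(weight * probs[i] for weight, probs in zip(doc_weights, token_probs))
        for i in range(n_tokens)
    )


if __name__ == "__main__":
    # Two retrieved documents and a candidate answer of two tokens.
    doc_weights = [0.5, 0.5]      # retrieval probability of each document
    token_probs = [
        [0.9, 0.1],               # document 0 supports the first token only
        [0.1, 0.9],               # document 1 supports the second token only
    ]
    print("RAG-Sequence:", rag_sequence_score(doc_weights, token_probs))
    print("RAG-Token:   ", rag_token_score(doc_weights, token_probs))
```

With these numbers RAG-Sequence gives roughly 0.09 and RAG-Token roughly 0.25: no single document supports the whole answer, but RAG-Token can draw each token from a different document.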

The following sections show how to implement steps 1 and 3 (building the knowledge base and retrieval).

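15.2 Creating the knowledge base (vector store)

Unlike a traditional database, a vector database is built specifically to store, manage, and search embedding vectors. It stores numerical representations of documents. Breaking data down into embeddings makes it easier for our AI system to understand and process the data. We store the embeddings in a vector database because an LLM accepts only a limited number of tokens. Since we cannot pass all of the data to the LLM, we need to break it into chunks; when the user asks a question, the chunks whose embeddings are most similar to the question are returned together with the prompt. Chunking also lowers cost by reducing the number of tokens passed through the LLM. Some popular vector databases are Azure Cosmos DB, Clarifai, Pinecone, Chromadb, ScaNN, Qdrant, and DeepLake. The chunking code is as follows:

def split_text(text, max_length, min_length):
    """Split text into chunks of roughly min_length to max_length characters."""
    words = text.split()
    chunks = []
    current_chunk = []

    for word in words:
        current_len = len(' '.join(current_chunk))
        # Close the chunk when the next word would push it past max_length,
        # but only once it has already reached min_length
        if current_chunk and current_len >= min_length and current_len + 1 + len(word) > max_length:
            chunks.append(' '.join(current_chunk))
            current_chunk = []
        current_chunk.append(word)

    # If the last chunk didn't reach the minimum length, add it anyway
    if current_chunk:
        chunks.append(' '.join(current_chunk))

    return chunks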
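To connect chunking with the vector store: each chunk is embedded and stored next to its text and its source path. The sketch below keeps the records in a plain Python list instead of a real vector database. `toy_embed` is a hypothetical stand-in for a real embedding model (it only hashes words into a small count vector, so it carries no semantic meaning), and the field names `path`, `chunks`, and `embeddings` are assumptions made for illustration.

```python
import zlib


def toy_embed(text, dim=8):
    """Stand-in for an embedding model: count words hashed into `dim` buckets."""
    vector = [0.0] * dim
    for word in text.lower().split():
        # crc32 is deterministic across runs, unlike the built-in hash() on str.
        bucket = zlib.crc32(word.encode("utf-8")) % dim
        vector[bucket] += 1.0
    return vector


def build_records(documents, chunk_size=5):
    """Split each document into fixed-size word chunks and embed every chunk.

    `documents` maps a source path to its text. Fixed-size chunking is used
    only to keep this sketch short.
    """
    records = []
    for path, text in documents.items():
        words = text.split()
        for start in range(0, len(words), chunk_size):
            chunk = " ".join(words[start:start + chunk_size])
            records.append(
                {"path": path, "chunks": chunk, "embeddings": toy_embed(chunk)}
            )
    return records


if __name__ == "__main__":
    docs = {"manual.md": "Press the power button for three seconds to reset the device"}
    for record in build_records(docs):
        print(record["path"], "|", record["chunks"], "|", record["embeddings"])
```

In a real setup the list would be replaced by one of the vector databases named above and `toy_embed` by a call to an embedding model.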

15.3 Retrieval

There are four core approaches to retrieval:

Keyword search - used for text searches
Semantic search - uses the semantic meaning of words
Vector search - converts documents from text to vector representations using embedding models. Retrieval will be done by querying the documents whose vector representations are closest to the user question.
Hybrid - a combination of both keyword and vector search.
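As an illustration of the hybrid idea, one simple option is to compute a keyword score and a vector score separately and blend them with a weight. This is only one possible scheme: the function names, the word-overlap keyword score, and the weight `alpha` are assumptions made for this sketch, and the vector scores are passed in as precomputed numbers.

```python
def keyword_score(query, document):
    """Fraction of distinct query words that also appear in the document."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & doc_words) / len(query_words)


def hybrid_rank(query, documents, vector_scores, alpha=0.5):
    """Rank documents by alpha * keyword score + (1 - alpha) * vector score.

    `vector_scores` holds one similarity value per document, assumed to be
    computed elsewhere and scaled to the range 0..1.
    """
    scored = []
    for doc, vec_score in zip(documents, vector_scores):
        score = alpha * keyword_score(query, doc) + (1 - alpha) * vec_score
        scored.append((score, doc))
    # Highest combined score first.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    docs = ["reset the device", "battery charging guide"]
    for score, doc in hybrid_rank("reset device", docs, vector_scores=[0.8, 0.3]):
        print(f"{score:.2f}  {doc}")
```

Setting `alpha` to 1 reduces this to pure keyword search and setting it to 0 to pure vector search.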

The key question here is how to measure vector similarity. Commonly used measures are cosine similarity, Euclidean distance, and dot product.
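The three measures are short enough to write out by hand. The sketch below uses plain Python lists. The function names are my own, and returning 0.0 for a zero vector in `cosine_similarity` is a convention chosen for this sketch.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line distance between the two points; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between the vectors; ignores their lengths."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot_product(a, b) / (norm_a * norm_b)


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0]
    b = [2.0, 4.0, 6.0]  # same direction as a, twice as long
    print("dot product:       ", dot_product(a, b))
    print("euclidean distance:", euclidean_distance(a, b))
    print("cosine similarity: ", cosine_similarity(a, b))
```

Here `b` points in the same direction as `a`, so the cosine similarity is essentially 1 even though the Euclidean distance is clearly non-zero. This is why cosine similarity is a common choice when only the direction of an embedding matters.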

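An index over the vectors in the database can be built as follows:

from sklearn.neighbors import NearestNeighbors

# flattened_df is assumed to be a pandas DataFrame with one row per chunk
# and an 'embeddings' column holding each chunk's vector
embeddings = flattened_df['embeddings'].to_list()

# Create the search index
nbrs = NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(embeddings)

# To query the index, you can use the kneighbors method
distances, indices = nbrs.kneighbors(embeddings)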

After querying the database, you may need to sort the results from most to least relevant.

# Find the most similar documents
# (query_vector is assumed to be the embedding of the user's question,
# produced by the same embedding model as the chunks)
distances, indices = nbrs.kneighbors([query_vector])

# Print the three most similar documents, closest first
for distance, index in zip(distances[0][:3], indices[0][:3]):
    print(flattened_df['chunks'].iloc[index])
    print(flattened_df['path'].iloc[index])
    print(distance)
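If the candidates come back unordered, or should be re-ranked by a different measure, sorting them by similarity to the query is straightforward. Here is a minimal sketch, assuming each record is a dict with hypothetical `chunks` and `embeddings` fields and that cosine similarity is the chosen measure:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_similarity(query_vector, records, top_k=3):
    """Return the top_k (similarity, chunk) pairs, most similar first."""
    scored = [
        (cosine_similarity(query_vector, record["embeddings"]), record["chunks"])
        for record in records
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    records = [
        {"chunks": "about cats", "embeddings": [1.0, 0.0]},
        {"chunks": "about dogs", "embeddings": [0.0, 1.0]},
        {"chunks": "cats and dogs", "embeddings": [1.0, 1.0]},
    ]
    for score, chunk in rank_by_similarity([1.0, 0.1], records, top_k=2):
        print(f"{score:.3f}  {chunk}")
```

This is a brute-force scan over all records, which is fine for small collections; an index such as the `NearestNeighbors` one above avoids scanning everything for larger ones.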

