
Starting a new article today. For a while I was buried in the day-to-day chores of a project cycle and never sat down to properly summarize and think through the core technical issues. I have been exploring RAG applications for some time now and have looked at plenty of products on the market. Many of them, including langchain-chatchat and Dify, assemble a most basic version of the RAG pipeline, but for many of the finer points they do not provide a more refined implementation. The part I want to write about today, how a PDF is parsed and chunked after it is uploaded to the knowledge base, directly affects downstream retrieval performance, and there is still no comprehensive review of this topic. Here I organize the techniques I have come across.

The difficulty of parsing PDF documents lies mainly in accurately capturing the overall layout of each page and transcribing the content, including table captions, paragraphs and images, into a textual representation of the document. This involves several technical pieces: layout detection, extracting text from images, and recognizing the rows and columns of tables (i.e., correctly converting a table in a PDF into a structured representation from which the original table can be reconstructed).

There are currently three mainstream approaches to parsing PDF documents:

  1. Rule-based methods: these determine the style and content of each part from the document's organizational characteristics. Representative library: pypdf. Their applicability is limited, since it is hard to cover every kind of PDF with a predefined set of rules.
  2. Deep-learning-model-based methods: a popular solution combines object detection with OCR models. Representative product: ChatDoc.
  3. Multimodal-large-model-based methods: these can parse complex structures in a PDF or extract key information from it.

Rule-based methods

There are far too many rule-based PDF processing libraries; every time I see an application using a new PDF parser I have to go look at how it handles things. For example, langchain-chatchat uses the fitz module from PyMuPDF (see the langchain-chatchat source code). This part of the handling is quite crude; I paste it here:

import cv2
import fitz  # the fitz module inside PyMuPDF -- not the package installed via `pip install fitz`
import numpy as np
import tqdm
from PIL import Image

# get_ocr, rotate_img and PDF_OCR_THRESHOLD come from elsewhere in langchain-chatchat
ocr = get_ocr()
doc = fitz.open(filepath)
resp = ""

b_unit = tqdm.tqdm(
    total=doc.page_count, desc="RapidOCRPDFLoader context page index: 0"
)
for i, page in enumerate(doc):
    b_unit.set_description(
        "RapidOCRPDFLoader context page index: {}".format(i)
    )
    b_unit.refresh()
    text = page.get_text("")
    resp += text + "\n"

    img_list = page.get_image_info(xrefs=True)
    for img in img_list:
        if xref := img.get("xref"):
            bbox = img["bbox"]
            # skip images whose size falls below the configured threshold
            if (bbox[2] - bbox[0]) / (page.rect.width) < PDF_OCR_THRESHOLD[
                0
            ] or (bbox[3] - bbox[1]) / (
                page.rect.height
            ) < PDF_OCR_THRESHOLD[1]:
                continue
            pix = fitz.Pixmap(doc, xref)
            samples = pix.samples
            if int(page.rotation) != 0:  # if the page is rotated, rotate the image back
                img_array = np.frombuffer(
                    pix.samples, dtype=np.uint8
                ).reshape(pix.height, pix.width, -1)
                tmp_img = Image.fromarray(img_array)
                ori_img = cv2.cvtColor(np.array(tmp_img), cv2.COLOR_RGB2BGR)
                rot_img = rotate_img(img=ori_img, angle=360 - page.rotation)
                img_array = cv2.cvtColor(rot_img, cv2.COLOR_RGB2BGR)
            else:
                img_array = np.frombuffer(
                    pix.samples, dtype=np.uint8
                ).reshape(pix.height, pix.width, -1)

            result, _ = ocr(img_array)
            if result:
                ocr_result = [line[1] for line in result]
                resp += "\n".join(ocr_result)

Note how it handles the images inside a PDF: it collects all images on a page into an img_list, iterates over that list, uses OCR to pull out whatever text is in them, and appends that text to the end of the page's text. This largely destroys the semantics those images were meant to convey in the PDF, and in my view it also hurts retrieval performance by injecting noise.

Now let's see how Dify handles it:

import pypdfium2

# `blob` and `Document` come from the surrounding Dify extractor code
with blob.as_bytes_io() as file_path:
    pdf_reader = pypdfium2.PdfDocument(file_path, autoclose=True)
    try:
        for page_number, page in enumerate(pdf_reader):
            text_page = page.get_textpage()
            content = text_page.get_text_range()
            text_page.close()
            page.close()
            metadata = {"source": blob.source, "page": page_number}
            yield Document(page_content=content, metadata=metadata)
    finally:
        pdf_reader.close()

It switches to the pypdfium2 library and discards images entirely. I also searched for comparisons of the two, see "pypdf or pymupdf?", and one author even built a repo to compare the output of various libraries (link here). The conclusion is that, judged purely on text-extraction quality, pypdfium2 comes out on top.

For PDF handling in RAG, I think the ideal goal is for a query to be able to link to the images inside the PDF, and for those images to serve as references as well; the references in today's products only link to the plain text of the corresponding file. That is far from fine-grained enough, although admittedly it is somewhat hard to do and requires some patience.


For comparisons of more tools, see A Benchmark and Evaluation for Text Extraction from PDF.

[Figure: comparison results from the benchmark above]

Rule-based methods share one big drawback: they treat each visual line as a sequence terminated by the newline character "\n". If a line genuinely ends with a period the damage is small, but if the next line continues the same sentence, the semantics are completely cut off. Keep in mind that at the chunking stage, most implementations split on "\n".
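To make this concrete, here is a tiny illustration (my own toy example, not taken from any particular library) of what happens when a sentence that the extractor split across two visual lines meets a chunker that splits on "\n":

# text as a rule-based extractor typically returns it: one "\n" per visual line,
# even though the second line continues the same sentence
page_text = (
    "The Transformer dispenses with recurrence and\n"
    "relies entirely on attention mechanisms.\n"
    "It achieves strong results on machine translation.\n"
)

# naive chunking that splits on newlines, as many RAG pipelines do by default
chunks = [line for line in page_text.split("\n") if line]
print(chunks[0])  # "The Transformer dispenses with recurrence and" -- the sentence is cut in half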

Deep-learning-model-based methods

The strength of this approach is that it can accurately recognize the layout of the entire document, including tables and paragraphs. It can even understand the structure inside a table, which means a parsed table can be faithfully reconstructed into its original form. The limitation is that the two stages, object detection and OCR, can be slow. In that case it is worth considering GPU acceleration, or processing with multiple processes and threads.
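As a rough sketch of the multi-process route (the names parse_page_range and parse_pdf_parallel are my own, and the per-page work here is plain pypdfium2 text extraction standing in for a real layout-detection + OCR pipeline):

from concurrent.futures import ProcessPoolExecutor

import pypdfium2

def parse_page_range(pdf_path, start, end):
    # stand-in for the heavy layout-detection + OCR work; swap in your own pipeline here
    pdf = pypdfium2.PdfDocument(pdf_path)
    texts = [pdf[i].get_textpage().get_text_range() for i in range(start, end)]
    pdf.close()
    return texts

def parse_pdf_parallel(pdf_path, workers=4):
    num_pages = len(pypdfium2.PdfDocument(pdf_path))
    step = max(1, -(-num_pages // workers))  # ceiling division
    ranges = [(s, min(s + step, num_pages)) for s in range(0, num_pages, step)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(parse_page_range, pdf_path, s, e) for s, e in ranges]
        return [text for f in futures for text in f.result()]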

Here are a few representative open-source frameworks:

  • Unstructured: the approach officially recommended by langchain. With the hi_res strategy and infer_table_structure=True enabled, table recognition works well; with the fast strategy it performs poorly.
  • Layout-parser: if you need to recognize PDFs with complex structure, use the largest model, even though it may be a bit slower.
  • PP-StructureV2: combines multiple models for document analysis, with above-average performance. This is the document-intelligence model from Baidu's PaddlePaddle.

There are also closed-source paid tools such as ChatDoc and LlamaParse. These face certain obstacles in real deployments, since calling an API always carries the risk of uploading data to someone else's server; if you are handling non-sensitive data, a paid API is worth considering.

Here I explore how Unstructured handles PDFs.

Unstructured first detects the layout (a detectron2 model), then uses tesseract for OCR, and handles tables with the table transformer.

from unstructured.partition.pdf import partition_pdf

# Get elements
raw_pdf_elements = partition_pdf(
    filename=test_file,
    # Using pdf format to find embedded image blocks
    extract_images_in_pdf=True,
    # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
    # Titles are any sub-section of the document
    infer_table_structure=True,
    # Post processing to aggregate text once we have the title
    chunking_strategy="by_title",
    # Chunking params to aggregate text blocks
    # Attempt to create a new chunk 3800 chars
    # Attempt to keep chunks > 2000 chars
    # Hard max on chunks
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
)

tables = [el for el in raw_pdf_elements if el.category == 'Table']

print(tables[0].text)

print(tables[0].metadata.text_as_html)

About the second output above: it converts the table into HTML. If you save it as an .html file it can be opened in a browser, and what you see is still a complete table. The advantage of unstructured is that it preserves the integrity of a table. The official example code simply embeds the text of each element, which for a table means embedding the table's text. But in the two examples I tested, it put the table's caption and its footnotes into other elements, even though that text should stay with the table. Take the table below, for example:

[Figure: Table 1 from "Attention Is All You Need", together with its caption and the surrounding text]

The Table element it extracts contains only the bare table; the description of Table 1 above it ends up in the preceding element:

3.5 Positional Encoding

Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the

5

Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations for different layer types. n is the sequence length, d is the representation dimension, k is the kernel size of convolutions and r the size of the neighborhood in restricted self-attention.
--------------------------------- the content above belongs to a different element
Layer Type Self-Attention Recurrent Convolutional Self-Attention (restricted) Complexity per Layer O(n2 · d) O(n · d2) O(k · n · d2) O(r · n · d) Sequential Maximum Path Length Operations O(1) O(n) O(1) O(1) O(1) O(n) O(logk(n)) O(n/r)
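A workaround I have been experimenting with is to stitch the caption back onto the table before embedding: walk the elements in order and, whenever a Table element appears, pull a trailing "Table N: ..." caption out of the element right before it. The helper below is only a sketch over the raw_pdf_elements list produced above; merge_table_captions and its regex are my own, not part of unstructured.

import re

def merge_table_captions(elements):
    """Prepend a 'Table N: ...' caption found at the tail of the previous
    element to the Table element that follows it."""
    texts = []
    prev = None
    for el in elements:
        text = el.text
        if el.category == "Table" and prev is not None:
            m = re.search(r"(Table\s+\d+:.*)$", prev.text, flags=re.S)
            if m:
                text = m.group(1).strip() + "\n" + text
        texts.append(text)
        prev = el
    return texts

# these merged strings are what I would embed instead of el.text alone
texts_for_embedding = merge_table_captions(raw_pdf_elements)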

Multimodal-large-model-based methods

This post mainly documents a code implementation of the Transformer architecture. The references are:

  • Attention is All You Need
  • Attention? Attention!
  • Lilian Weng's TensorFlow implementation
  • illustrated-transformer: I strongly recommend this post; it is fully aligned with the architecture described in the paper
  • the official PyTorch transformer implementation
  • the Stanford tutorial implementation of the Transformer architecture

The reasons I wanted to implement it once myself:

  1. I have read the Transformer paper many times, but there are still many details I never dug into.
  2. The Stanford implementation follows the paper's architecture exactly, but I find it overly complicated; I want to follow Lilian's TensorFlow implementation and build the vanilla Transformer architecture myself.
  3. My grasp of PyTorch is weaker than my TensorFlow. PyTorch has basically become the mainstream for deep learning, especially since large models arrived; Hugging Face's transformers library also supports PyTorch better and has the larger community. (I now somewhat regret having systematically learned TensorFlow rather than PyTorch.)

The overall Transformer architecture consists of two big modules, the encoder and the decoder; the encoder contains a stack of 6 identical sub-modules, and so does the decoder.

[Figure: the Transformer model architecture]

We will look at these two modules top-down.

Overall Transformer architecture

class Transformer(nn.Module):
    '''
    define the whole architecture of Transformer in:
    Vaswani et al. Attention is All You Need. NIPS 2017.
    '''
    def __init__(self, num_heads=8, d_model=512, d_ff=2048, num_enc_layers=6, num_dec_layers=6,
                 drop_rate=0.1, warmup_steps=400, pos_encoding_type='sinusoid',
                 ls_epsilon=0.1, use_label_smoothing=True,
                 model_name='transformer', tf_sess_config=None, **kwargs):
        super().__init__()
        self.h = num_heads
        self.d_model = d_model
        self.d_ff = d_ff

        self.num_enc_layers = num_enc_layers
        self.num_dec_layers = num_dec_layers

        # Dropout regularization: added in every sublayer before layer_norm(...) and
        # applied to embedding + positional encoding.
        self.drop_rate = drop_rate

        # Label smoothing epsilon
        self.ls_epsilon = ls_epsilon
        self.use_label_smoothing = use_label_smoothing
        self.pos_encoding_type = pos_encoding_type

        # For computing the learning rate
        self.warmup_steps = warmup_steps

        # kept so the config below matches the TensorFlow version's config
        self.model_name = model_name
        self.tf_sess_config = tf_sess_config

        self.config = dict(
            num_heads=self.h,
            d_model=self.d_model,
            d_ff=self.d_ff,
            num_enc_layers=self.num_enc_layers,
            num_dec_layers=self.num_dec_layers,
            drop_rate=self.drop_rate,
            warmup_steps=self.warmup_steps,
            ls_epsilon=self.ls_epsilon,
            use_label_smoothing=self.use_label_smoothing,
            pos_encoding_type=self.pos_encoding_type,
            model_name=self.model_name,
            tf_sess_config=self.tf_sess_config,
        )

    def forward(self, src, tgt, src_mask, tgt_mask):
        # this is where the two big modules, encoder and decoder, are wired together
        # (self.encoder / self.decoder are assumed to be assigned from the modules defined below)
        enc_out = self.encoder(src, src_mask)
        dec_out = self.decoder(enc_out, src_mask, tgt, tgt_mask)
        return dec_out

Transformer Encoder

[Figure: the Transformer encoder block]
import copy
import torch.nn as nn

def clones(module, N):
    "Produce N identical layers."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

class TransformerEncoder(nn.Module):
    def __init__(self, encoder_layer, num_enc_layers) -> None:
        super().__init__()
        self.num_enc_layers = num_enc_layers
        self.encoder_layers = clones(encoder_layer, num_enc_layers)  # deep-copy encoder_layer 6 times

    def forward(self, src, src_mask):
        out = src
        for layer in self.encoder_layers:
            out = layer(out, src_mask)
        return out

Here I implemented a clones helper function. I did consider using a for loop instead, which is what Lilian does:

out = inp  # now, (batch, seq_len, embed_size)
with tf.variable_scope(scope):
    for i in range(self.num_enc_layers):
        out = self.encoder_layer(out, input_mask, f'enc_{i}')
return out

Note that every encoder_layer here has its own parameters, i.e., there are 6 sets of encoder_layer parameters to train. Why does the for loop work in TensorFlow? Because it uses the variable_scope concept: in the TensorFlow implementation above, each time out and input_mask come in they are combined with a different set of variables. To achieve the same thing in PyTorch, you first copy the encoder_layer six times, and each incoming input is run through a different copy.
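A quick sanity check that the deep copies really give six independent sets of weights (using a toy nn.Linear in place of a full encoder layer):

import copy
import torch.nn as nn

layer = nn.Linear(4, 4)
stack = nn.ModuleList([copy.deepcopy(layer) for _ in range(6)])   # what clones() does
shared = nn.ModuleList([layer] * 6)                               # reusing the same module

print(stack[0].weight is stack[1].weight)    # False: each copy has its own weights
print(len(list(stack.parameters())))         # 12 tensors (6 weights + 6 biases)
print(len(list(shared.parameters())))        # 2 tensors: all six entries share one layer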

encoder layer

Next we implement the details of the encoder layer. It contains two sub-layers: 1) self-attention + Add & LayerNorm, 2) position-wise feed-forward + Add & LayerNorm.

[Figure: the Transformer encoder block]
class TransformerEncoderLayer(nn.Module):
    """
    Args:
        d_model: the number of expected features in the input (required).
        n_head: the number of heads in the multiheadattention models (required).
        dim_feedforward: the dimension of the feedforward network model (default=2048).
    """
    # One multi-head attention + one feed-forward
    def __init__(self, d_model, n_head, dim_feedforward, dropout=0.1) -> None:
        super().__init__()
        self.self_attn = MultiheadAttention(d_model, n_head)
        self.norm_1 = nn.LayerNorm(d_model)
        # Implementation of the feed-forward model (two linear transformations plus one dropout)
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        self.norm_2 = nn.LayerNorm(d_model)

    def __ff_block(self, x):
        # the feed-forward block contains two linear transformations
        out = F.relu(self.linear1(x))
        out = self.dropout(out)
        out = self.linear2(out)
        return out

    def forward(self, src, src_mask):
        out = src
        # the official PyTorch implementation also applies a dropout after self-attention here
        out = self.norm_1(out + self.self_attn(out, src_mask))
        out = self.norm_2(out + self.__ff_block(out))
        return out

self attention

The material I first consulted was illustrated-transformer, but that post contains no concrete implementation, so I then turned to Lilian Weng's TensorFlow implementation. In Lilian's code, multi-head attention is written like this:

def multihead_attention(self, query, memory=None, mask=None, scope='attn'):
    """
    Args:
        query (tf.tensor): of shape (batch, q_size, d_model)
        memory (tf.tensor): of shape (batch, m_size, d_model)
        mask (tf.tensor): shape (batch, q_size, k_size)

    Returns:
        a tensor of shape (bs, q_size, d_model)
    """
    if memory is None:
        memory = query

    with tf.variable_scope(scope):
        # Linear project to d_model dimension: [batch, q_size/k_size, d_model]
        Q = tf.layers.dense(query, self.d_model, activation=tf.nn.relu)
        K = tf.layers.dense(memory, self.d_model, activation=tf.nn.relu)
        V = tf.layers.dense(memory, self.d_model, activation=tf.nn.relu)

        # Split the matrix to multiple heads and then concatenate to have a larger
        # batch size: [h*batch, q_size/k_size, d_model/num_heads]
        Q_split = tf.concat(tf.split(Q, self.h, axis=2), axis=0)
        K_split = tf.concat(tf.split(K, self.h, axis=2), axis=0)
        V_split = tf.concat(tf.split(V, self.h, axis=2), axis=0)
        mask_split = tf.tile(mask, [self.h, 1, 1])

        # Apply scaled dot product attention
        out = self.scaled_dot_product_attention(Q_split, K_split, V_split, mask=mask_split)

        # Merge the multi-head back to the original shape
        out = tf.concat(tf.split(out, self.h, axis=0), axis=2)  # [bs, q_size, d_model]

        # The final linear layer and dropout.
        # out = tf.layers.dense(out, self.d_model)
        # out = tf.layers.dropout(out, rate=self.drop_rate, training=self._is_training)

    return out

The implementation above is actually somewhat at odds with the blog post, which says:

As we’ll see next, with multi-headed attention we have not only one, but multiple sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder). Each of these sets is randomly initialized. Then, after training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a different representation subspace.

Together with the figure the author provides:

[Figure: separate WQ/WK/WV weight matrices per attention head, from illustrated-transformer]

my initial understanding was that each head has its own separate set of weight matrices (WQ, WK, WV), and that each head goes through the scaled-attention computation

[Figure: scaled dot-product attention computed for each head]

so that every head's resulting Z has shape (batch, seq_len, embed_size), which is why the WO linear transformation exists (as the blog puts it):

[Figure: the per-head outputs concatenated and projected by WO]

But after reading the code I realized it is not what I thought; I think that part of the blog post is a bit misleading. Later I found another post that resolved my confusion. Its key passage is:

However, the important thing to understand is that this is a logical split only. The Query, Key, and Value are not physically split into separate matrices, one for each Attention head. A single data matrix is used for the Query, Key, and Value, respectively, with logically separate sections of the matrix for each Attention head. Similarly, there are not separate Linear layers, one for each Attention head. All the Attention heads share the same Linear layer but simply operate on their ‘own’ logical section of the data matrix.
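Translated into PyTorch, that "logical split" looks roughly like the sketch below: a single nn.Linear each for Q, K and V over the full d_model, then a reshape into num_heads slices of size d_model // num_heads before the scaled dot-product, and a final WO projection. This is my own minimal version for illustration, not the exact MultiheadAttention used earlier in this post.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiheadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        # one Linear per projection, shared by all heads -- the "logical split"
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, query, mask=None, memory=None):
        # memory defaults to the query itself, i.e. self-attention,
        # matching the self_attn(out, src_mask) call in TransformerEncoderLayer above
        if memory is None:
            memory = query
        bs = query.size(0)
        # (batch, seq, d_model) -> (batch, h, seq, d_k): a reshape, not separate matrices per head
        q = self.w_q(query).view(bs, -1, self.h, self.d_k).transpose(1, 2)
        k = self.w_k(memory).view(bs, -1, self.h, self.d_k).transpose(1, 2)
        v = self.w_v(memory).view(bs, -1, self.h, self.d_k).transpose(1, 2)

        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5       # (batch, h, q_len, k_len)
        if mask is not None:
            scores = scores.masked_fill(mask.unsqueeze(1) == 0, float("-inf"))
        out = F.softmax(scores, dim=-1) @ v                      # (batch, h, q_len, d_k)
        out = out.transpose(1, 2).contiguous().view(bs, -1, self.h * self.d_k)
        return self.w_o(out)                                     # WO merges the heads back to d_model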