from unstructured.partition.pdf import partition_pdf # Get elements raw_pdf_elements = partition_pdf( filename=test_file, # Using pdf format to find embedded image blocks extract_images_in_pdf=True, # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles # Titles are any sub-section of the document infer_table_structure=True, # Post processing to aggregate text once we have the title chunking_strategy="by_title", # Chunking params to aggregate text blocks # Attempt to create a new chunk 3800 chars # Attempt to keep chunks > 2000 chars # Hard max on chunks max_characters=4000, new_after_n_chars=3800, combine_text_under_n_chars=2000, )
tables = [el for el in raw_pdf_elements if el.category == 'Table']
Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the
5
Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations for different layer types. n is the sequence length, d is the representation dimension, k is the kernel size of convolutions and r the size of the neighborhood in restricted self-attention. --------------------------------- 上面是另一个element的内容 Layer Type Self-Attention Recurrent Convolutional Self-Attention (restricted) Complexity per Layer O(n2 · d) O(n · d2) O(k · n · d2) O(r · n · d) Sequential Maximum Path Length Operations O(1) O(n) O(1) O(1) O(1) O(n) O(logk(n)) O(n/r)