{"id":5006,"date":"2026-06-08T15:29:47","date_gmt":"2026-06-08T06:29:47","guid":{"rendered":"https:\/\/cinnamon.ai\/?post_type=ideas&#038;p=5006"},"modified":"2026-06-08T15:30:37","modified_gmt":"2026-06-08T06:30:37","slug":"super-rag-tech-blog-04","status":"publish","type":"ideas","link":"https:\/\/cinnamon.ai\/en\/ideas\/super-rag-tech-blog-04\/","title":{"rendered":"[Part 4] The Contents of Document Extraction \u2014 How is the Engine Selected?"},"content":{"rendered":"<ul><li><a class=\"aioseo-toc-item\" href=\"#aioseo-\u306f\u3058\u3081\u306b-1\">Introduction<\/a><\/li><li><a class=\"aioseo-toc-item\" href=\"#aioseo-\u672c\u9023\u8f09\u306b\u3064\u3044\u3066-6\">About this series<\/a><\/li><li><a class=\"aioseo-toc-item\" href=\"#aioseo-\u30d5\u30a1\u30a4\u30eb\u5f62\u5f0f\u304b\u3089\u898b\u308b\u30a8\u30f3\u30b8\u30f3\u9078\u629e-docreader\u304c\u4e2d\u6838\u3092\u62c5\u3046-9\">Engine Selection Based on File Format \u2014 DocReader Takes Center Stage<\/a><\/li><li><a class=\"aioseo-toc-item\" href=\"#aioseo-pdf\u306f\u3069\u3046\u62bd\u51fa\u3055\u308c\u308b\u304b-docreader-\u3068-azure-document-intelligence-19\">How are PDFs extracted? 
\u2014 DocReader and Azure Document Intelligence<\/a><\/li><li><a class=\"aioseo-toc-item\" href=\"#aioseo-\u8868\u56f3\u30ad\u30e3\u30d7\u30b7\u30e7\u30f3\u3092\u3069\u3046\u62fe\u3046\u304b-\u901a\u5e38\u306e\u30c6\u30ad\u30b9\u30c8\u62bd\u51fa\u3068\u306e\u9055\u3044-27\">How to extract tables, figures, and captions \u2014 Differences from normal text extraction<\/a><ul><li><a class=\"aioseo-toc-item\" href=\"#aioseo-\u8868\u306e\u30bb\u30eb\u7d50\u5408-\u5217\u306e\u5bfe\u5fdc\u95a2\u4fc2\u304c\u5d29\u308c\u308b-31\">Merging cells in a table \u2014 the column correspondence is broken.<\/a><\/li><li><a class=\"aioseo-toc-item\" href=\"#aioseo-\u56f3\u30b0\u30e9\u30d5\u30d5\u30ed\u30fc\u30c1\u30e3\u30fc\u30c8-\u691c\u51fa\u3059\u3089\u3055\u308c\u306a\u3044-38\">Figures, graphs, flowcharts \u2014 not even detected<\/a><\/li><li><a class=\"aioseo-toc-item\" href=\"#aioseo-\u7d20\u306e\u30c6\u30ad\u30b9\u30c8\u5316\u3068\u69cb\u9020\u3092\u4fdd\u3063\u305f\u62bd\u51fa\u306e\u9055\u3044-41\">The difference between &quot;raw text conversion&quot; and &quot;extraction while preserving structure&quot;<\/a><\/li><\/ul><\/li><li><a class=\"aioseo-toc-item\" href=\"#aioseo-\u30b9\u30ad\u30e3\u30f3pdf\u753b\u50cf\u306eocr-44\">OCR for scanned PDFs and images<\/a><\/li><li><a class=\"aioseo-toc-item\" href=\"#aioseo-\u30c1\u30e3\u30f3\u30af\u5316\u304c\u62bd\u51fa\u306e\u6700\u7d42\u4ed5\u4e0a\u3052-48\">Chunking is the final step in extraction.<\/a><\/li><li><a class=\"aioseo-toc-item\" href=\"#aioseo-\u6301\u3061\u5e30\u308a\u691c\u8a0e\u6750\u6599\u81ea\u793e\u30c7\u30fc\u30bf\u3067\u306e\u6bd4\u8f03\u8a66\u884c\u30c1\u30a7\u30c3\u30af\u30ea\u30b9\u30c8-53\">[Things to take home and consider] Checklist for comparative trials using your own data<\/a><\/li><li><a class=\"aioseo-toc-item\" href=\"#aioseo-\u6b21\u56de\u4e88\u544a-70\">Next episode preview<\/a><\/li><li><a class=\"aioseo-toc-item\" href=\"#aioseo-\u307e\u3068\u3081-74\">summary<\/a><\/li><\/ul>\n\n\n<div 
style=\"height:85px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 id=\"aioseo-\u306f\u3058\u3081\u306b-1\" class=\"wp-block-heading\"><a>Introduction<\/a><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The first installment introduced the overall picture of Super RAG and what sets it apart; the second and third covered API integration patterns and a function catalog. So far, the focus has been on providing a map of &quot;how to use it.&quot; From this installment onward, we change perspective and ask <strong>what is happening inside Super RAG<\/strong>, looking at the internals that produce its accuracy.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The first topic is <strong>document extraction<\/strong>. A RAG system works in the following steps: extraction \u2192 indexing \u2192 search \u2192 answer generation. <strong>The quality of extraction at the entry point sets the ceiling for every subsequent step.<\/strong> This is a particularly strong premise in the design philosophy of Super RAG. If table cells are missed or figure captions are separated from the body text, then no matter how sophisticated the search algorithm or how capable the large language model, the correct answer cannot be reached.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This article explains how Super RAG designs this entry point, and in particular <strong>how extraction is structured around our in-house document comprehension engine, DocReader, according to the characteristics of the file format and content<\/strong>. 
Engine names will be mentioned, but since we assume readers are not engineers, we focus on the decision logic behind &quot;why this design works.&quot;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">At the end of the article, we have prepared this installment&#039;s take-home material: a <strong>checklist for comparative trials using your own data<\/strong>. If, after reading, you think, &quot;I&#039;d like to try this with my own documents,&quot; you can run your actual data through our free or paid trial and compare the results. Please feel free to reach out via the <strong>contact form at the end of the article<\/strong>.<\/p>\n\n\n\n<div style=\"height:50px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 id=\"aioseo-\u672c\u9023\u8f09\u306b\u3064\u3044\u3066-6\" class=\"wp-block-heading\"><a>About this series<\/a><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">This is the fourth article in the series. The series is structured into five chapters: (I) Why Super RAG, (II) How to implement it, (III) Visualizing the contents, (IV) What is happening in the field, and (V) Decision-making regarding implementation. With this article we begin Chapter III, &quot;Visualizing the contents,&quot; and cover the three pillars that support the accuracy of Super RAG in order: extraction (Part 4), search (Part 5), and response strategy (Part 6).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">By the end of Chapter II, you should have grasped the outline of &quot;how to incorporate Super RAG,&quot; so from here on we move to <strong>understanding &quot;why it is accurate once integrated.&quot;<\/strong> 
The purpose is not to replace the specifications, but to give you a map for making design decisions.<\/p>\n\n\n\n<div style=\"height:50px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 id=\"aioseo-\u30d5\u30a1\u30a4\u30eb\u5f62\u5f0f\u304b\u3089\u898b\u308b\u30a8\u30f3\u30b8\u30f3\u9078\u629e-docreader\u304c\u4e2d\u6838\u3092\u62c5\u3046-9\" class=\"wp-block-heading\"><a>Engine Selection Based on File Format \u2014 DocReader Takes Center Stage<\/a><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Real business documents vary greatly in both format and content. Some are PDFs with clean text, others are scanned image PDFs of paper documents, some contain complex tables, others are elaborately laid out PowerPoint presentations, and there are even FAQs and glossaries written as CSV files. A wide variety of materials sit side by side within the same company knowledge base.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Faced with this diversity, the Super RAG extraction layer <strong>does not try to handle everything with a single general-purpose engine<\/strong>. Instead, <strong>it routes files to engines according to the file format and content characteristics, with the in-house document comprehension engine DocReader at its core<\/strong>. This is the design we have adopted.<\/p>\n\n\n\n<div style=\"height:30px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"521\" src=\"https:\/\/cinnamon.ai\/wp-content\/uploads\/2026\/04\/01_engine_routing_map-1024x521.png\" alt=\"\" class=\"wp-image-5007\" srcset=\"https:\/\/cinnamon.ai\/wp-content\/uploads\/2026\/04\/01_engine_routing_map-1024x521.png 1024w, https:\/\/cinnamon.ai\/wp-content\/uploads\/2026\/04\/01_engine_routing_map-300x153.png 300w, https:\/\/cinnamon.ai\/wp-content\/uploads\/2026\/04\/01_engine_routing_map-768x390.png 
768w, https:\/\/cinnamon.ai\/wp-content\/uploads\/2026\/04\/01_engine_routing_map-18x9.png 18w, https:\/\/cinnamon.ai\/wp-content\/uploads\/2026\/04\/01_engine_routing_map.png 1194w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n<\/div>\n\n\n<p class=\"wp-block-paragraph\"><em>File format and engine routing map<\/em><\/p>\n\n\n\n<div style=\"height:30px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">DocReader is not a standalone OCR library or PDF parser. It is a <strong>document comprehension engine<\/strong> that combines layout analysis, OCR, table structure reconstruction, figure region extraction, and aggregation of results from multiple models into a single pipeline. The central point of the design is that DocReader <strong>handles in one place<\/strong> the materials with a high proportion of visual information or complex layouts, namely <strong>Office files such as Word, PowerPoint, and Excel<\/strong> and <strong>image files (PNG\/JPEG)<\/strong> (some of these are converted to PDF internally by Super RAG before being processed by DocReader).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Other file formats follow their own paths. <strong>PDFs with a standard text layer<\/strong> are sent to <strong>Azure Document Intelligence<\/strong> (hereafter, Azure DI), Microsoft&#039;s cloud-based document analysis service. 
CSV files with clearly defined rows and columns, and tabular data with a fixed structure such as FAQs\/glossaries, are handled by <strong>a pandas-based extractor that is robust with structured data<\/strong>. For file formats that are not recognized, or when the engines above do not respond, there is also a <strong>fallback path<\/strong> using PyMuPDF or Unstructured.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">What matters here is that, from the user&#039;s perspective, it is this simple: <strong>just drop in the files and the appropriate processing begins.<\/strong> Users don&#039;t need to choose between APIs along the lines of &quot;PDFs go to this API, Excel files to another API, and images to the image API.&quot; <strong>Routing is handled on the Super RAG side.<\/strong> That is the key point of the design. The document processing API (POST \/api\/v3.3\/actions\/document-extract\/) introduced in the third installment presents the same picture to users: once a file is passed in, the extraction layer makes the decisions internally.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Here, we&#039;d like to mention one distinctive design decision. Engine routing is not decided <strong>after actually analyzing the contents (layout) of the file<\/strong>. Instead, it is deterministic, decided solely by <strong>the file extension, the column structure (FAQ, glossary, or regular table), the environment settings, and the upstream processing path<\/strong>. You might ask, &quot;But how can you choose the optimal engine without looking at the contents?&quot; This is a trade-off. 
If we analyzed the layout every time before routing, layout analysis would run once just for routing and then again in the main process, incurring a <strong>double cost<\/strong>. By adopting a deterministic design in which <strong>the same conditions always lead to the same engine<\/strong>, Super RAG prioritizes operational predictability, processing speed, and cost efficiency. <strong>The actual layout analysis is performed after routing is complete, inside each engine<\/strong>: Azure DI handles it on the cloud side, while DocReader handles it internally.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In this way, the hybrid configuration of <strong>using our in-house DocReader as the core while drawing on powerful cloud services as needed<\/strong> is a key feature of the Super RAG extraction layer. By letting each engine focus on the material it extracts best, it balances extraction quality, operational flexibility, and efficiency.<\/p>\n\n\n\n<div style=\"height:50px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 id=\"aioseo-pdf\u306f\u3069\u3046\u62bd\u51fa\u3055\u308c\u308b\u304b-docreader-\u3068-azure-document-intelligence-19\" class=\"wp-block-heading\"><a>How are PDFs extracted? \u2014 DocReader and Azure Document Intelligence<\/a><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The core of the documents handled in day-to-day work is undoubtedly PDF. Within Super RAG, the materials treated as &quot;PDF&quot; include not only <strong>files that users upload directly as .pdf<\/strong> but also <strong>PDFs that Super RAG creates internally when Office files (Word\/PowerPoint\/Excel) or images are uploaded<\/strong>. 
Super RAG is designed to handle these &quot;PDFs,&quot; which vary greatly in content and structure, <strong>through two systems<\/strong>.<\/p>\n\n\n\n<div style=\"height:30px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"764\" src=\"https:\/\/cinnamon.ai\/wp-content\/uploads\/2026\/04\/02_pdf_engine_comparison-1024x764.png\" alt=\"\" class=\"wp-image-5008\" srcset=\"https:\/\/cinnamon.ai\/wp-content\/uploads\/2026\/04\/02_pdf_engine_comparison-1024x764.png 1024w, https:\/\/cinnamon.ai\/wp-content\/uploads\/2026\/04\/02_pdf_engine_comparison-300x224.png 300w, https:\/\/cinnamon.ai\/wp-content\/uploads\/2026\/04\/02_pdf_engine_comparison-768x573.png 768w, https:\/\/cinnamon.ai\/wp-content\/uploads\/2026\/04\/02_pdf_engine_comparison-16x12.png 16w, https:\/\/cinnamon.ai\/wp-content\/uploads\/2026\/04\/02_pdf_engine_comparison.png 1180w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n<\/div>\n\n\n<p class=\"wp-block-paragraph\"><em>Comparison of processing steps between DocReader and Azure Document Intelligence<\/em><\/p>\n\n\n\n<div style=\"height:30px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">One is <strong>DocReader&#039;s raster page processing pipeline<\/strong>, which treats each page as an image and runs a multi-stage process (layout analysis \u2192 OCR \u2192 table model \u2192 diagram model \u2192 aggregation) in-house. 
This is the system that handles the PDFs <strong>Super RAG creates internally<\/strong> from <strong>Office files<\/strong> and <strong>images (PNG\/JPEG)<\/strong>, and it performs well on <strong>materials with a high proportion of visual information or complex layouts<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The other is <strong>Azure Document Intelligence<\/strong> (Azure DI), a cloud-based document analysis service provided by Microsoft. It integrates layout analysis, OCR, and table\/figure models, and is particularly strong at extracting chapter structure (the equivalent of a table of contents). In Super RAG, <strong>files that users upload directly with the .pdf extension<\/strong> are routed to Azure DI by default. Both standard PDFs with a text layer and scanned PDFs without one go to Azure DI as long as they are passed in as .pdf files.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&quot;Even though they are both &#039;PDFs,&#039; they are routed to different engines.&quot; This decision is not made by analyzing the contents (layout) of the PDF; it is determined by <strong>the path through which the file reached Super RAG, its extension, and the environment settings<\/strong>. For example, if a user uploads a PowerPoint file (.pptx), the upstream process converts it to PDF internally while retaining the context that &quot;this originates from PPTX,&quot; and passes it to DocReader&#039;s raster page processing pipeline. On the other hand, if the user <strong>exports the PowerPoint presentation to PDF beforehand and uploads it as .pdf<\/strong>, 
then for Super RAG this is no different from receiving a regular .pdf file, and <strong>by default it is routed to Azure DI<\/strong>. The intention is <strong>to avoid mixing up the materials each engine excels at<\/strong>. Standard PDFs and documents with a clear chapter structure go to Azure DI, while complex-layout PDFs that Super RAG creates from Office files and images go to <strong>the raster page processing pipeline<\/strong>; the destinations are fixed in advance so that each engine&#039;s strengths are put to use. Because <strong>the Super RAG extraction layer takes on this decision at the entry point<\/strong>, users receive consistent extraction results without having to think about the internals.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Here, let us elaborate on the <strong>strength with Japanese documents<\/strong> mentioned in the first installment. Japanese business documents often contain pages that mix vertical and horizontal typesetting, text that blends Japanese characters with alphanumerics and symbols, distinctive punctuation and bracket notation, and compound numbering for figures and tables such as &quot;Figure 1-2.&quot; These are <strong>elements that general-purpose OCR systems built around English tend to struggle with<\/strong>. 
DocReader&#039;s <strong>layout analysis and OCR<\/strong> are tuned to handle these Japanese-specific features at a practical level, and it is designed to be particularly effective on <strong>materials that combine &quot;Japanese language + complex layout&quot;<\/strong>, such as Japanese manuals, regulations, and meeting minutes.<\/p>\n\n\n\n<div style=\"height:50px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 id=\"aioseo-\u8868\u56f3\u30ad\u30e3\u30d7\u30b7\u30e7\u30f3\u3092\u3069\u3046\u62fe\u3046\u304b-\u901a\u5e38\u306e\u30c6\u30ad\u30b9\u30c8\u62bd\u51fa\u3068\u306e\u9055\u3044-27\" class=\"wp-block-heading\"><a>How to extract tables, figures, and captions \u2014 Differences from normal text extraction<\/a><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Up to this point, we&#039;ve seen that &quot;the Super RAG extraction layer uses different engines with DocReader at its core&quot; and that it &quot;handles PDFs in two separate systems.&quot; You might ask: <strong>why is such an elaborate system necessary in the first place?<\/strong> The answer lies in the <strong>tables and figures<\/strong> contained in business documents. 
With regular text extraction, a lot of information gets lost at this point.<\/p>\n\n\n\n<div style=\"height:30px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"619\" src=\"https:\/\/cinnamon.ai\/wp-content\/uploads\/2026\/04\/03_table_figure_pitfalls-1024x619.png\" alt=\"\" class=\"wp-image-5009\" srcset=\"https:\/\/cinnamon.ai\/wp-content\/uploads\/2026\/04\/03_table_figure_pitfalls-1024x619.png 1024w, https:\/\/cinnamon.ai\/wp-content\/uploads\/2026\/04\/03_table_figure_pitfalls-300x181.png 300w, https:\/\/cinnamon.ai\/wp-content\/uploads\/2026\/04\/03_table_figure_pitfalls-768x464.png 768w, https:\/\/cinnamon.ai\/wp-content\/uploads\/2026\/04\/03_table_figure_pitfalls-18x12.png 18w, https:\/\/cinnamon.ai\/wp-content\/uploads\/2026\/04\/03_table_figure_pitfalls.png 1121w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n<\/div>\n\n\n<p class=\"wp-block-paragraph\"><em>Pitfalls of standard text extraction and how DocReader can compensate for them.<\/em><\/p>\n\n\n\n<div style=\"height:30px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h3 id=\"aioseo-\u8868\u306e\u30bb\u30eb\u7d50\u5408-\u5217\u306e\u5bfe\u5fdc\u95a2\u4fc2\u304c\u5d29\u308c\u308b-31\" class=\"wp-block-heading\"><a>Merging cells in a table \u2014 the column correspondence is broken.<\/a><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">For example, suppose there is a table like this in part of an employee roster used in a certain business system.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><td><strong>Department<\/strong><\/td><td><strong>Employee<\/strong><\/td><\/tr><\/thead><tbody><tr><td rowspan=\"2\">General Affairs<\/td><td>Yamada<\/td><\/tr><tr><td>Suzuki<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Anyone looking at it can tell at a glance that &quot;Ms. 
Suzuki is also in the General Affairs Department.&quot; The cell in the &quot;Department&quot; column is <strong>merged vertically<\/strong>, so the &quot;General Affairs&quot; above also applies to &quot;Suzuki&quot; below; this is a common way of writing tables. However, if you simply extract this as text, you get:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">1. General Affairs: Yamada<br>2. Suzuki<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">As a result, <strong>Suzuki&#039;s department is left blank.<\/strong> When the downstream AI is asked, &quot;What department does Ms. Suzuki work in?&quot;, it can only answer, &quot;I don&#039;t know.&quot;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">DocReader treats this type of page as an image and uses its <strong>table model<\/strong> to restore the cell structure itself before extraction. Merged cells are recognized as merged and the column correspondences are preserved when passed downstream, so the relationship &quot;Suzuki&#039;s department = General Affairs&quot; stays intact when it reaches the AI.<\/p>\n\n\n\n<div style=\"height:50px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h3 id=\"aioseo-\u56f3\u30b0\u30e9\u30d5\u30d5\u30ed\u30fc\u30c1\u30e3\u30fc\u30c8-\u691c\u51fa\u3059\u3089\u3055\u308c\u306a\u3044-38\" class=\"wp-block-heading\"><a>Figures, graphs, flowcharts \u2014 not even detected<\/a><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Another typical omission is <strong>diagrams, graphs, flowcharts, system configuration diagrams, and organizational charts<\/strong>. These are easy to understand visually, but because they contain almost no text (or only fragments), it is quite common that <strong>standard text extraction does not even detect them<\/strong>. 
Even if the text says, &quot;Please refer to the following diagram,&quot; the diagram itself is often missing.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">DocReader first runs <strong>layout analysis<\/strong> on the page image and labels each region as &quot;this is text,&quot; &quot;this is a table,&quot; or &quot;this is a figure.&quot; Regions identified as figures are extracted by the <strong>diagram model<\/strong> and, where necessary, summarized by an LLM and kept as part of a chunk. This ensures that the figures themselves and their adjacent captions are not left out of search.<\/p>\n\n\n\n<div style=\"height:50px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h3 id=\"aioseo-\u7d20\u306e\u30c6\u30ad\u30b9\u30c8\u5316\u3068\u69cb\u9020\u3092\u4fdd\u3063\u305f\u62bd\u51fa\u306e\u9055\u3044-41\" class=\"wp-block-heading\"><a>The difference between &quot;raw text conversion&quot; and &quot;extraction while preserving structure&quot;<\/a><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">In short, within business documents, <strong>the parts most prone to being missed are tables and figures<\/strong>, that is, the parts whose structure is not plain text. Raw text conversion is weak here, and that weakness inevitably surfaces as a bottleneck somewhere in downstream search and answer generation.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">DocReader&#039;s design aims to address this weakness from the outset. 
It divides the page into regions using layout analysis, processes each region with a specialized model (table model, diagram model, body text), and then regroups them into sections through aggregation. This approach of <strong>&quot;understanding the page&quot; rather than &quot;reading the words&quot;<\/strong> is what raises the accuracy of the subsequent steps.<\/p>\n\n\n\n<div style=\"height:50px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 id=\"aioseo-\u30b9\u30ad\u30e3\u30f3pdf\u753b\u50cf\u306eocr-44\" class=\"wp-block-heading\"><a>OCR for scanned PDFs and images<\/a><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Many documents used in business are <strong>paper-derived materials<\/strong>: scanned meeting minutes (including handwritten ones), order forms, invoices, and contracts, as well as faxed copies of administrative documents. These are in PDF or image format, yet their content cannot be extracted directly as text.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Super RAG is designed to accept these materials as well: <strong>just add them to a folder.<\/strong> <strong>Scanned PDFs (.pdf extension)<\/strong> are routed to Azure DI like standard PDFs, and Azure DI&#039;s OCR extracts the text. <strong>Image files such as PNG and JPEG<\/strong> are first converted to PDF internally and then join <strong>the same raster page processing pipeline used for PDFs<\/strong>, where they are processed with layout analysis and built-in OCR. 
Users don&#039;t need to route files by hand along the lines of &quot;it&#039;s a scan, so it goes through a different route.&quot; <strong>Register PDFs and images in a folder and they are processed through the appropriate route.<\/strong> This simplicity is a major operational advantage.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The OCR engine inside DocReader is designed to be <strong>switchable<\/strong> so that it can adapt flexibly to future changes. The main options are the OCR of our in-house Flax Scanner, which is currently in use, with Azure DI&#039;s OCR and EasyOCR as alternative candidates. <strong>At each version upgrade, we select the most suitable OCR<\/strong> according to the type of document (good print quality, poor scan quality, containing handwriting, and so on), the requirements for accuracy and processing cost, and the performance improvements of each model. Having this range of tuning options within the same DocReader pipeline is one of the things that supports its accuracy.<\/p>\n\n\n\n<div style=\"height:50px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 id=\"aioseo-\u30c1\u30e3\u30f3\u30af\u5316\u304c\u62bd\u51fa\u306e\u6700\u7d42\u4ed5\u4e0a\u3052-48\" class=\"wp-block-heading\"><a>Chunking is the final step in extraction.<\/a><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The extracted content (sections, tables, figures, FAQs, terminology, etc.) goes through a final step, <strong>chunking<\/strong>, which organizes it into units for search. 
Here again, rather than &quot;simply breaking it down into small pieces,&quot; the design <strong>selects a dedicated chunker according to the nature of the material.<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The main chunkers include the general-purpose ReferencesChunker, the FAQChunker, which treats each Q&amp;A pair as a single unit, the DictionaryChunker, which keeps term-definition pairs together, the FreeTextChunker, which handles text-centric documents, the PowerPointChunker, which keeps PowerPoint slides intact, and the PDFAttachmentsChunker for PDF attachments. <strong>Even with the same extraction results, search accuracy can be raised further by changing how the data is chunked.<\/strong> This is one of the features of Super RAG.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In addition, there is a mechanism for giving chunks <strong>overlap and windows<\/strong>. For example, where meaning continues across a chapter boundary, including that passage with a slight overlap into the adjacent chunk makes it easier for subsequent searches to answer &quot;cross-boundary questions.&quot; 
Chunks that are too long become search noise, and chunks that are too short lose context, so the design lets the granularity be tuned for each type of material.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We said at the beginning that &quot;extraction determines the ceiling,&quot; but to be more precise, <strong>extraction plus structure-preserving chunking determines the ceiling of subsequent search.<\/strong> Next time, we&#039;ll look inside that &quot;search&quot; step and how it balances meaning and wording.<\/p>\n\n\n\n<div style=\"height:50px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 id=\"aioseo-\u6301\u3061\u5e30\u308a\u691c\u8a0e\u6750\u6599\u81ea\u793e\u30c7\u30fc\u30bf\u3067\u306e\u6bd4\u8f03\u8a66\u884c\u30c1\u30a7\u30c3\u30af\u30ea\u30b9\u30c8-53\" class=\"wp-block-heading\"><a>[Things to take home and consider] Checklist for comparative trials using your own data<\/a><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Up to this point, we&#039;ve seen that &quot;the Super RAG extraction layer uses different engines with DocReader at its core,&quot; &quot;PDFs are handled in two separate systems,&quot; &quot;tables and figures have their structure restored using a model,&quot; and &quot;scanned PDFs go to Azure DI, and image files go to DocReader.&quot; At this point you may be wondering, <strong>&quot;What would happen with our own company&#039;s business documents?&quot;<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To be honest, <strong>the fastest way is to try it with your own data.<\/strong> 
Super RAG offers free and paid trials in which you can load your actual data, run searches and questions, and see how it differs from your current operations. <strong>Please contact us using the inquiry form at the end of the article.<\/strong> Our sales team will prepare an individual proposal.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To help with that process, we&#039;ve prepared a checklist for comparative trials that you can use in your internal review. Organizing &quot;what to test&quot; and &quot;how to evaluate&quot; in advance will significantly improve the quality of your trial.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>[Things to take home and consider] Checklist for comparative trials using your own data<\/strong> Organize &quot;what to try&quot; and &quot;how to evaluate&quot; from five perspectives.<\/p>\n\n\n\n<div style=\"height:30px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><td><strong>Perspective<\/strong><\/td><td><strong>Things to do<\/strong><\/td><td><strong>Deliverables<\/strong><\/td><td><strong>Tips<\/strong><\/td><\/tr><\/thead><tbody><tr><td><strong>1. <\/strong><strong>Files to be verified<\/strong><br>Comprehensive coverage of format \u00d7 complexity<\/td><td>Format: PDF, Scan, DOCX, XLSX, PPTX<br>Complexity: Chapter structure \/ Table cells \/ Figures \/ Multi-column layout<br>Guideline: 3-5 files \u00d7 2-3 levels<\/td><td>File list<br>(Format \u00d7 complexity matrix)<\/td><td>Including &quot;documents that caused difficulties with existing operations and systems&quot; makes the effects easier to see.<\/td><\/tr><tr><td><strong>2. 
<\/strong><strong>Designing search queries<\/strong><br>4 types \u00d7 3-5 questions<\/td><td>Single-hop (fact lookup)<br>Multi-hop (spanning multiple documents)<br>Dictionary (definitions of terms) \/ exact terms (model numbers, proper nouns)<\/td><td>Question list<br>(tagged by type)<\/td><td>Collecting questions that the people doing the work actually wanted to ask makes the evaluation more persuasive.<\/td><\/tr><tr><td><strong>3. <\/strong><strong>Comparison axes<\/strong><br>Check on 4 axes<\/td><td>Extraction accuracy \/ search relevance<br>Basis for the answer \/ missed information<br>(Look at the difference between the existing RAG and Super RAG)<\/td><td>Result sheet: question \u00d7 RAG \u00d7 4 axes<\/td><td>Answers can be quantified by scoring them on an N-point scale against the model answer.<\/td><\/tr><tr><td><strong>4. <\/strong><strong>Evaluation metrics<\/strong><br>Quantitative + qualitative<\/td><td>Quantitative: top-N hit rate for the evidence behind the model answer<br>Qualitative: similarity to the model answer, rated on an N-point scale by the verification team, plus whether users feel the system is &quot;usable&quot; (looking at both makes the judgment easier)<\/td><td>Summary table + comments\/notes<\/td><td>Having the people who do the work actually use it leads to faster consensus building.<\/td><\/tr><tr><td><strong>5. 
<\/strong><strong>Trial consultation<\/strong><br>Individual proposals<\/td><td>The scope, duration, and number of target files are designed individually to fit your business requirements.<br>(Contact form at the end of the article)<\/td><td>Inquiry<br>(with a summary of perspectives 1-4 attached)<\/td><td>Clearly defining and communicating &quot;what you want to try&quot; greatly speeds up the initial design.<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<div style=\"height:30px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">For each of the five perspectives, we cover <strong>what to do, the deliverables, and tips<\/strong>.<\/p>\n\n\n\n<div style=\"height:50px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p class=\"wp-block-paragraph\"><strong><span style=\"text-decoration: underline;\">Perspective 1: How to select files to verify<\/span><\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We recommend starting with 3-5 files at 2-3 levels of complexity, covering both the range of formats (text PDF \/ scanned PDF \/ DOCX \/ XLSX \/ PPTX) and the range of content complexity (deep chapter structure \/ many merged table cells \/ many figure captions \/ multi-column layout).<br><strong>Tips:<\/strong> To avoid verification for its own sake, don&#039;t just collect or create complex documents at random. Including &quot;documents that caused difficulties with your existing Q&amp;A system&quot; makes the effect much clearer.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong><span style=\"text-decoration: underline;\">Perspective 2: Designing Search Queries<\/span><\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Divide the questions into four types (single-hop questions that look up a fact, multi-hop questions that span multiple documents, dictionary questions that ask for the definition of a term, and exact-term questions about model numbers and proper nouns), and prepare 3 to 5 questions for 
each type.<br><strong>Tips:<\/strong> Collecting questions that actual business users wanted to ask the AI makes the evaluation more persuasive.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong><span style=\"text-decoration: underline;\">Perspective 3: Comparative Perspective<\/span><\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Evaluate the results on four axes: extraction accuracy (chunk boundaries, table cells, figure captions), search relevance (are the top hits relevant?), basis for the answer (are the chunks actually cited from the search hits valid?), and missed information (does Super RAG surface information that the existing RAG misses?).<br><strong>Tips:<\/strong> Create an Excel sheet that lists the anticipated questions, the model answers, and the evidence behind each answer, then score whether the existing RAG or Super RAG comes closer to the model answer. The difference between the two becomes clear.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong><span style=\"text-decoration: underline;\">Perspective 4: Evaluation Metrics<\/span><\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Use both quantitative metrics (the rate at which the evidence for the answer is cited) and qualitative ones (similarity to the model answer on an N-point scale, and whether the person in charge feels the system is &quot;usable&quot; or &quot;unusable&quot;).<br><strong>Tips:<\/strong> For the qualitative evaluation, if possible have <strong>the people who actually do the work<\/strong> try the system hands-on. This speeds up internal consensus building.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong><span style=\"text-decoration: underline;\">Perspective 5: Trial consultation<\/span><\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The specific settings, such as scope, duration, and number of target files, vary with your business requirements. <strong>Please use the contact form at the end of the 
article<\/strong> to get in touch. Our sales team will make suggestions tailored to your situation.<br><strong>Tips:<\/strong> If you organize what you would like to try from perspectives 1-4 in advance and share it during the consultation, the initial trial design goes much more smoothly.<\/p>\n\n\n\n<div style=\"height:50px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 id=\"aioseo-\u6b21\u56de\u4e88\u544a-70\" class=\"wp-block-heading\"><a>Next episode preview<\/a><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The next installment (Part 5) covers <strong>hybrid search and reranking: balancing &quot;meaning&quot; and &quot;words&quot;<\/strong>. This time the theme was &quot;extraction determines the ceiling of the later stages,&quot; and <strong>search determines how much of that ceiling you can actually use<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Specifically, how does Super RAG combine <strong>vector search<\/strong>, which is strong on conceptual questions such as &quot;What is this concept in essence?&quot;, with <strong>full-text search<\/strong>, which is strong on exact terms such as &quot;model number XYZ-123&quot;, to handle both types of question? And why does <strong>reranking<\/strong>, the mechanism that reorders the top search results, improve accuracy? 
We&#039;ll look at how the material organized through &quot;extraction + chunking&quot; is put to use in the search stage.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Reading Part 4 (extraction) and Part 5 (search) together should bring into view the outline of Super RAG&#039;s <strong>consistent quality design from preprocessing through search<\/strong>.<\/p>\n\n\n\n<div style=\"height:50px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 id=\"aioseo-\u307e\u3068\u3081-74\" class=\"wp-block-heading\"><a>Summary<\/a><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Document extraction may look like a mundane step, but it is the crucial entry point that <strong>sets the ceiling for the overall accuracy of RAG<\/strong>. The Super RAG extraction layer builds this entry point around <strong>DocReader<\/strong>, its in-house document understanding engine. <strong>Super RAG assigns an engine according to file format<\/strong>: <strong>Office files (Word\/PowerPoint\/Excel) and images (PNG\/JPEG)<\/strong> are converted to PDF internally and go through <strong>DocReader&#039;s raster page processing pipeline<\/strong>, while for <strong>files uploaded directly as .pdf<\/strong>, Azure Document Intelligence handles cases such as standard text and scans. Layout analysis, a table model, and a figure model preserve the structure that ordinary text extraction often misses, such as merged table cells and figure captions, before the content is passed downstream. 
The OCR backend can be switched flexibly to keep up with future market changes and technological advances.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">If you&#039;re wondering &quot;how would this work with our company&#039;s documents?&quot;, <strong>please feel free to contact us using the inquiry form at the end of the article.<\/strong> You can run your own data through a free or paid trial and compare the results. Please also use the checklist in the &quot;[Things to take home and consider]&quot; section of this article for your internal review before the trial.<\/p>\n\n\n\n<div style=\"height:50px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">&lt;Articles in this series&gt;<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/cinnamon.ai\/en\/ideas\/super-rag-tech-blog-01\/\" target=\"_blank\" rel=\"noopener\" title=\"[Part 1] Achieving a Quality Beyond Standard RAG \u2014 Technical Innovations Supporting Super RAG\">[Part 1] Achieving a Quality Beyond Standard RAG \u2014 Technical Innovations Supporting Super RAG<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/cinnamon.ai\/en\/ideas\/super-rag-tech-blog-02\/\" title=\"[Part 2] A Thorough Explanation of Three Embedded Patterns \u2014 Analyzing API Call Flows and Dify Expense Review\">[Part 2] A Thorough Explanation of Three Embedded Patterns \u2014 Analyzing API Call Flows and Dify Expense Review<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/cinnamon.ai\/en\/ideas\/super-rag-tech-blog-03\/\" title=\"[Part 3] API Function Catalog \u2014 What can be done, and what should be handled in-house?\">[Part 3] API Function Catalog \u2014 What can be done, and what should be handled in-house?<\/a><\/li>\n\n\n\n<li>[Part 4] The Contents of Document Extraction \u2014 How is the Engine Selected? 
(Scheduled for release in June 2026)<\/li>\n\n\n\n<li>[Part 5] Hybrid Search and Reranking \u2014 Making the Most of Extracted Content by Balancing &quot;Semantic&quot; and &quot;Word&quot; (Scheduled for release in June 2026)<\/li>\n\n\n\n<li>[Part 6] Super RAG Answer Strategy \u2014 Design Decisions for Mastering Single-hop and Three Techniques to Support Answer Quality (Scheduled for release in June 2026)<\/li>\n<\/ul>\n\n\n\n<div style=\"height:80px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<div class=\"wp-block-buttons is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-fe48e5de wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button\"><a class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/contents.cinnamon.ai\/contact\/inquiry_blog\" target=\"_blank\" rel=\"noreferrer noopener\">Contact<\/a><\/div>\n\n\n\n<div class=\"wp-block-button\"><a class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/contents.cinnamon.ai\/download\/wp_superrag_dl_blog\" target=\"_blank\" rel=\"noreferrer noopener\">Super RAG Document Download<\/a><\/div>\n<\/div>\n\n\n\n<div style=\"height:140px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>","protected":false},"excerpt":{"rendered":"<p>Introduction: Part 1 introduced the overall picture of Super RAG and what sets it apart, and Parts 2 and 3 covered the API integration patterns and the function catalog. Up to this point, the installments handed you a map of &quot;how to use it.&quot; Starting with this installment, we shift perspective: Super 
[&hellip;]<\/p>","protected":false},"featured_media":5004,"template":"","ideas-cat":[59],"class_list":["post-5006","ideas","type-ideas","status-publish","has-post-thumbnail","hentry","ideas-cat-tech"],"acf":[],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/cinnamon.ai\/en\/wp-json\/wp\/v2\/ideas\/5006","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cinnamon.ai\/en\/wp-json\/wp\/v2\/ideas"}],"about":[{"href":"https:\/\/cinnamon.ai\/en\/wp-json\/wp\/v2\/types\/ideas"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cinnamon.ai\/en\/wp-json\/wp\/v2\/media\/5004"}],"wp:attachment":[{"href":"https:\/\/cinnamon.ai\/en\/wp-json\/wp\/v2\/media?parent=5006"}],"wp:term":[{"taxonomy":"ideas-cat","embeddable":true,"href":"https:\/\/cinnamon.ai\/en\/wp-json\/wp\/v2\/ideas-cat?post=5006"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}