zaeema-n / pdf_extract

1 stars 0 forks source link

Steps to Improve the Accuracy of the workflow #1

Open vibhatha opened 1 day ago

vibhatha commented 1 day ago

Steps to Improve the Accuracy of the workflow

  1. Understanding the Structure: From the images recognize the repeated format for each minister: three columns labeled "Subjects and Functions," "Departments, Statutory Institutions and Public Corporations," and "Laws, Acts, and Ordinances to be Implemented." Each section had numbered lists or bulleted points.

  2. Manual Extraction and Formatting: Manually extract this structured information by interpreting the visible text in each column across images, relying on OpenAI models' trained ability to analyze and organize the content as text.

  3. Organizing Data by Minister: Based on each minister's distinct titles (e.g., "Minister of Defence," "Minister of Finance..."), Group content accordingly and then organize each part into a three-column format, maintaining the original list structure from the images.

  4. Reformatting for Readability: Present the structured information in a clear, numbered, and bulleted format to make it easier to read and follow.

If these were to be performed programmatically, it would involve more steps than the previous code provided, and it would likely require some manual post-processing to ensure accuracy, given that OCR might introduce errors or inconsistencies.

What Would Be Needed to Automate This Process

To automate a task like this as closely as possible to how I initially presented it, a more advanced approach would be necessary. Here's what that would entail:

vibhatha commented 1 day ago

We should move step by step and make this a workflow. Let's make sure each step is as accurate as possible.