Open yuxuanbrandeis opened 1 year ago
bold_text - Jupyter Notebook.pdf Please check: I have modified the code and hope it fixes the bold-text problem. From manually checking the files where extraction failed, I think the most frequent causes are that the titles appear in different formats and that the contents contain broken pages. I will keep working on these problems.
Hi Yeabin,
I've come to realize that much of our extraction process has been inaccurate due to our algorithm's inability to effectively identify headings written in bold. It's become evident that while certain reports are successfully extracted starting from the "Management Discussion and Analysis" (MD&A) headings when they're in bold, others are not. Furthermore, we've encountered situations where the extraction begins from footnotes instead of the actual headings.
This has become a significant concern as I've manually reviewed 30 10-K/10-Q reports, cross-referencing them with the SEC website's HTML-format reports. I found that some reports were accurately extracted with bold headings, while others were not.
I've attempted using the re.compile(r'\sDiscussion\s+and\s+Analysis\s+of\s+Financial\s+Condition[s]?\s', re.IGNORECASE | re.DOTALL) pattern to test on individual files, but unfortunately, it's still not yielding the desired results. Addressing this issue is of utmost importance to improve the accuracy and reliability of our extraction process.
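One possible reason the pattern misses bold headings, offered as a guess rather than a diagnosis: if the raw text still contains HTML, tags such as `</b>` or entities such as `&nbsp;` can sit between the words, so `\s+` never matches across them. The sketch below strips tags and decodes entities before matching. The names `normalize`, `find_mda_heading`, `TAG_RE`, and `MDA_RE` are hypothetical and not taken from the notebook, and the sample string is made up for illustration. Two small notes on the original pattern: `re.DOTALL` only changes how `.` behaves, so it has no effect here, and the leading and trailing `\s` require whitespace on both sides of the phrase.

```python
import html
import re

# Hypothetical helpers for illustration; not the notebook's actual code.
TAG_RE = re.compile(r"<[^>]+>")
MDA_RE = re.compile(
    r"Discussion\s+and\s+Analysis\s+of\s+Financial\s+Conditions?",
    re.IGNORECASE,
)


def normalize(raw: str) -> str:
    """Replace tags with spaces, decode entities, collapse whitespace."""
    text = TAG_RE.sub(" ", raw)          # <b>, </b>, <p>, ... become spaces
    text = html.unescape(text)           # &nbsp; -> non-breaking space, etc.
    return re.sub(r"\s+", " ", text).strip()


def find_mda_heading(raw: str):
    """Return a match object for the MD&A heading, or None if absent."""
    return MDA_RE.search(normalize(raw))


# Made-up sample: the heading is split across bold tags and an entity.
sample = (
    "<p><b>Management&#8217;s Discussion</b>&nbsp;and "
    "<b>Analysis of Financial\nCondition</b> and Results of Operations</p>"
)

original = re.compile(
    r"\sDiscussion\s+and\s+Analysis\s+of\s+Financial\s+Condition[s]?\s",
    re.IGNORECASE | re.DOTALL,
)

print(original.search(sample))            # no match on the raw markup
print(find_mda_heading(sample).group(0))  # heading found after normalizing
```

Replacing tags with a space is a deliberate choice so that adjacent block elements do not fuse into one word, at the cost of splitting a word whose letters are styled separately. This does not address extraction starting from footnotes; that likely needs a separate check on where in the document the match occurs.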