yuxuanbrandeis / Julex

This is a shared space for Julex project involving code

Bold heading issues #8

Open yuxuanbrandeis opened 1 year ago

yuxuanbrandeis commented 1 year ago

Hi Yeabin,

I've realized that much of our extraction has been inaccurate because our algorithm does not reliably identify headings written in bold. Some reports are extracted correctly starting from the "Management Discussion and Analysis" (MD&A) heading when it is in bold, while others are not. In some cases the extraction starts from a footnote instead of the actual heading.

I found this by manually reviewing 30 10-K/10-Q reports and cross-referencing them with the HTML-format reports on the SEC website. Some reports with bold headings were extracted accurately, while others were not.

I tested the pattern re.compile(r'\sDiscussion\s+and\s+Analysis\s+of\s+Financial\s+Condition[s]?\s', re.IGNORECASE | re.DOTALL) on individual files, but it still does not give the desired results. Fixing this is a high priority for improving the accuracy and reliability of our extraction process.
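One thing that might be worth checking: if the pattern runs over raw HTML, then \s will not match an entity such as &nbsp; or a tag sitting between two words, so a bold heading can be missed even though it looks fine in the browser. Below is a minimal sketch of one way to look only at bold text, using Python's standard-library HTMLParser. It is not our current code. The names BoldTextCollector, find_bold_mda_headings, BOLD_STYLE and VOID_TAGS are hypothetical, and treating b, strong, and inline font-weight styles as "bold" is an assumption about how the filings are marked up.

```python
import re
from html.parser import HTMLParser

# Same heading wording as the pattern above, without the surrounding \s.
MDA_HEADING = re.compile(
    r"Discussion\s+and\s+Analysis\s+of\s+Financial\s+Conditions?",
    re.IGNORECASE,
)
# Assumption: bold may also be expressed through an inline style attribute.
BOLD_STYLE = re.compile(r"font-weight\s*:\s*(bold|[6-9]00)", re.IGNORECASE)
# Tags that never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class BoldTextCollector(HTMLParser):
    """Collects each contiguous run of text that sits inside bold markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decodes &nbsp;, &#8217;, ...
        self._stack = []      # (tag, is_bold) for every open element
        self._buffer = []     # text pieces of the current bold run
        self.bold_runs = []   # finished runs, whitespace-normalized

    def _in_bold(self):
        return any(is_bold for _, is_bold in self._stack)

    def _flush(self):
        # str.split() also splits on non-breaking spaces.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.bold_runs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        is_bold = tag in ("b", "strong") or bool(BOLD_STYLE.search(style))
        self._stack.append((tag, is_bold))

    def handle_endtag(self, tag):
        if not any(open_tag == tag for open_tag, _ in self._stack):
            return  # stray closing tag, ignore it
        was_bold = self._in_bold()
        while self._stack:
            open_tag, _ = self._stack.pop()
            if open_tag == tag:
                break
        if was_bold and not self._in_bold():
            self._flush()  # we just left the outermost bold element

    def handle_data(self, data):
        if self._in_bold():
            self._buffer.append(data)


def find_bold_mda_headings(html):
    """Return the bold text runs that contain the MD&A heading wording."""
    parser = BoldTextCollector()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep text from a bold element that was never closed
    return [run for run in parser.bold_runs if MDA_HEADING.search(run)]
```

A known limitation of this sketch: if a filing splits one heading across several separate bold elements, each piece becomes its own run and the heading will not match, so adjacent runs would need to be merged first.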

Meredithfan29 commented 1 year ago

bold_text - Jupyter Notebook.pdf

Please check: I have modified the code and hope it fixes the bold text problem. From manually checking the files where extraction failed, I think the most frequent causes are that the titles appear in different formats and that the contents contain broken pages. I will keep working on these problems.
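For the title-format part only (not the broken pages), here is a rough sketch of normalizing a candidate heading before testing it, so that curly apostrophes, non-breaking spaces, "&" versus "and", and an optional "Item N" prefix do not matter. This is not the code in the attached notebook. The names normalize_heading, looks_like_mda_heading and HEADING_VARIANT are hypothetical, and the set of variants handled is an assumption that should be checked against the reviewed reports.

```python
import re
import unicodedata

# Optional "Item 7." / "Item 2 -" style prefix, then the MD&A wording.
# Anchored at the start so sentences that merely mention MD&A do not match.
HEADING_VARIANT = re.compile(
    r"^(?:item\s+\d+[a-z]?\s*[.:\u2013\u2014-]?\s*)?"
    r"management'?s\s+discussion\s+and\s+analysis\b",
    re.IGNORECASE,
)


def normalize_heading(text):
    """Reduce common typographic variation in a candidate heading."""
    text = unicodedata.normalize("NFKC", text)       # e.g. NBSP -> space
    text = text.replace("\u2019", "'").replace("\u2018", "'")  # curly quotes
    text = text.replace("&", " and ")                # "Discussion & Analysis"
    return " ".join(text.split())                    # collapse whitespace


def looks_like_mda_heading(text):
    """True if the normalized text starts like an MD&A section heading."""
    return bool(HEADING_VARIANT.match(normalize_heading(text)))
```

This could be applied to each candidate heading string before deciding where extraction should start. Cross-references inside footnotes usually do not begin with the heading wording, so anchoring at the start may help with the footnote problem too, though that has not been verified on our files.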