yuxuanbrandeis / Julex

This is a shared space for the Julex project's code

sentence slicing. #5

Open YeabinMoonEcon opened 1 year ago

YeabinMoonEcon commented 1 year ago

I am wondering about your approach to sentences that are not correctly separated. From my previous observation, when the text contains subheadings, symbols, or (un)ordered lists, the resulting sentences are often concatenated. Could you let me know where I should look to find such instances? Additionally, I am interested to know how you handled these cases when inputting them into Finbert.

yuxuanbrandeis commented 1 year ago

Can you please clarify which document you are referring to? I have made some updates to the HTML-formatting code in Hasan's main code. Additionally, we discussed that for flagging, the minimum word count should be around 100, and we checked reports with word counts between 50 and 100 to see if they provide useful information. However, we haven't come up with a specific criterion for how many words is too many, especially for counts exceeding 10,000. We believe that a higher word count can provide useful insights, as we found in some papers. For example, "Form 10-Ks with higher annual report word counts appear to reduce the ability of investors to quickly incorporate information into current stock prices." Therefore, we want to include reports with large word counts to see whether they have a regression relationship with stock prices. What are your thoughts on this?

yuxuanbrandeis commented 1 year ago

Actually, Hasan was able to find an article that provides insights on average word count. Will keep you updated.

hasanallahyarov1 commented 1 year ago

So what we did is replace \n with a dot, so whenever a subheading has \n after it, it is treated as a separate sentence. To ignore those fragments, we will use the method you suggested: setting a minimum number of characters a sentence must contain before it goes through FinBERT. Some subheadings still do not have \n after them, so they get concatenated with the next sentence, and we have not handled those yet.
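
For concreteness, a minimal sketch of that pipeline (the `MIN_CHARS` value and the `finbert()` call are placeholders, not our actual code):

```python
import re

MIN_CHARS = 30  # placeholder threshold; the real minimum is still being tuned

def slice_sentences(text):
    # Replace newlines with dots so a subheading followed by \n becomes
    # a standalone "sentence" instead of merging into the next one.
    text = text.replace("\n", ". ")
    # Split on sentence-ending punctuation followed by whitespace.
    candidates = re.split(r"(?<=[.!?])\s+", text)
    # Drop short fragments (subheadings, list markers, stray symbols).
    return [s.strip() for s in candidates if len(s.strip()) >= MIN_CHARS]

# Only sentences that survive the length filter go to the model;
# finbert() stands in for however the model is actually invoked.
# for sentence in slice_sentences(raw_text):
#     label = finbert(sentence)
```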

YeabinMoonEcon commented 1 year ago

@yuxuanbrandeis

  1. Do you have any document free from sentence-slicing issues? Transfer-learning models are quite sensitive to the sentence delimiter.
  2. Whether the extracted content contains 30 words or 1 million words is fine as long as it remains valid. We may discover more meaningful patterns in 10-Ks since they undergo auditing. One potential approach could be to select 5-10 firms with significantly small and large word-count reports and track their word counts over time (see the sketch after this list). My assumption is that the word counts would display high variance, but we should verify this.
  3. You will see that the time of release has no impact on stock price. However, it is crucial to construct the regression equation carefully, as the timing of release varies among firms. It is not easy, but it is worth trying. @hasanallahyarov1 Could you update the demo? Also, did you check whether my methods produce different results?
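
A rough pandas sketch of item 2, under an assumed schema (`cik`, `fiscal_year`, `word_count` columns and a hypothetical CSV name, none of which are real project files):

```python
import pandas as pd

# Hypothetical input: one row per (firm, year) with the 10-K word count.
df = pd.read_csv("10k_word_counts.csv")  # columns: cik, fiscal_year, word_count

# Pick the 5 firms with the smallest and 5 with the largest median counts.
medians = df.groupby("cik")["word_count"].median()
sample = pd.concat([medians.nsmallest(5), medians.nlargest(5)]).index

# Track each sampled firm's word count over time.
tracked = (df[df["cik"].isin(sample)]
           .pivot(index="fiscal_year", columns="cik", values="word_count"))

# High variance over time would support the assumption in item 2.
print(tracked.var())
```
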
hasanallahyarov1 commented 1 year ago

I will update it with our updated functions. About your methods, to be honest, we have thought about it and could not come up with a way to identify the sections and then choose the one with the maximum number of words. If we knew a way to identify the sections, then we could just identify the MD&A section immediately.

Also, I think there are two problems with using the maximum number of words to find the MD&A. It is true that in a 10-K only the footnotes tend to have more words, as I found in one of the research papers. However, it is certainly not true for earlier years: when we processed the year 2001, I saw many examples, if not all, where the MD&A is very short. The second problem is that to count the number of words we need to clean all HTML from the whole document first and then count. Otherwise we would have to count characters, and the HTML markup can distort that count differently across sections. Cleaning the whole document of HTML takes several extra seconds per filing, which is very noticeable when there are more than 1k documents.
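
As an illustration of that counting step (assuming BeautifulSoup with the lxml parser; both are assumptions, not necessarily what we use):

```python
from bs4 import BeautifulSoup

def word_count(html):
    # Strip all markup from the full filing first; counting raw
    # characters inside HTML would be distorted by the tags themselves.
    # This parse is the slow step: seconds per large 10-K adds up
    # quickly over 1k+ documents.
    text = BeautifulSoup(html, "lxml").get_text(separator=" ")
    return len(text.split())
```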

But still, the main thing is that we could not figure out how to identify the sections. One way is to split on the keyword "ITEM", but that will have the same problems as our current algorithm.
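
For reference, a naive version of that "ITEM"-based split might look like this (a sketch only; the regex is illustrative, and it would also match "ITEM" in the table of contents and in cross-references, which is exactly the same problem):

```python
import re

# Match headings like "ITEM 7." or "Item 7A." (illustrative pattern).
ITEM_RE = re.compile(r"(ITEM\s+\d+[A-Z]?\.)", re.IGNORECASE)

def split_items(text):
    parts = ITEM_RE.split(text)
    # parts alternates [preamble, heading, body, heading, body, ...]
    sections = {}
    for heading, body in zip(parts[1::2], parts[2::2]):
        key = re.sub(r"\s+", " ", heading.upper())
        sections[key] = sections.get(key, "") + body
    return sections

# The "pick the longest section" heuristic discussed above:
# longest = max(split_items(text).items(),
#               key=lambda kv: len(kv[1].split()))
```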

hasanallahyarov1 commented 1 year ago

Here is what we have thought of to handle the problem of our search pattern occurring outside the headings as well. At least it will work for recent years for sure, but for earlier years this method won't work.

So we would take the occurrences of "Management Discussion and Analysis ... ", but only the ones which are bold or which have font-weight:700 (which renders as bold). For recent years' reports this method will work, and we wanted to ask you if it is worth a try. We are not sure we can find the ones with those HTML tags, but this is what we have in mind.
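
A rough sketch of what we mean, using BeautifulSoup (the regex and the style check are illustrative assumptions, not tested project code):

```python
import re
from bs4 import BeautifulSoup

# Tolerate "Management's Discussion and Analysis" and spacing variants.
MDA_RE = re.compile(r"management.{0,3}\s*discussion\s+and\s+analysis",
                    re.IGNORECASE)

def find_bold_mda_headings(html):
    soup = BeautifulSoup(html, "lxml")
    hits = []
    for string in soup.find_all(string=MDA_RE):
        # Walk up from the matched text looking for a bold ancestor:
        # either a <b>/<strong> tag or an inline font-weight style.
        node = string.parent
        while node is not None and node.name != "[document]":
            style = (node.get("style") or "").replace(" ", "").lower()
            if node.name in ("b", "strong") or \
               "font-weight:700" in style or "font-weight:bold" in style:
                hits.append(node)
                break
            node = node.parent
    return hits
```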

Hope it was not too confusing :)