yuxuanbrandeis / Julex

This is a shared space for the Julex project code

Deliverable1 update? #2

Open YeabinMoonEcon opened 1 year ago

YeabinMoonEcon commented 1 year ago

Could you process the following report? 0001041588-20-000001.txt

yuxuanbrandeis commented 1 year ago

Yes, we will process it, with the FinBERT model score as well.

YeabinMoonEcon commented 1 year ago

For this report, you don't have to get a sentiment score. I am seeing some updates in the code, but I don't think it will work well. Just show me how you import the extract method from the deliverable module and produce the output.

YeabinMoonEcon commented 1 year ago

Well, I just found the new folder hosting the new code, and I must say it makes the workflow quite hard to follow. I guess you are working on it now. Please clean up the repository once the task is completed.

yuxuanbrandeis commented 1 year ago
[Screenshot: Screen Shot 2023-07-18 at 11 19 09 PM]

I realized the module extracts every text file in the directory, so I can't pass a single file after the file directory, unless I missed something. Otherwise I will revise the code so it reads only this specific text file.
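A minimal sketch of one way to accept either a single report or a whole folder. The helper name `iter_report_paths` is hypothetical and is not part of the deliverable module; it only illustrates the idea of resolving the input before extraction.

```python
from pathlib import Path
from typing import Iterator, Union


def iter_report_paths(target: Union[str, Path]) -> Iterator[Path]:
    """Yield report paths from either a single .txt file or a directory of them."""
    path = Path(target)
    if path.is_file():
        # A single report was passed in: process only that file.
        yield path
    elif path.is_dir():
        # A folder was passed in: process every .txt file, in a stable order.
        yield from sorted(path.glob("*.txt"))
    else:
        raise FileNotFoundError(f"No such file or directory: {path}")
```

The extraction loop could then iterate over `iter_report_paths(...)` regardless of whether the caller supplied a file or a folder.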

YeabinMoonEcon commented 1 year ago

Even if you put the folder in the input, it won't work well. However, the system directory is not the only problem.

yuxuanbrandeis commented 1 year ago
[Screenshot: Screen Shot 2023-07-18 at 11 27 58 PM]

It works for me in Jupyter Notebook, and it saved the output as an Excel file on the desktop, but we will update it.

YeabinMoonEcon commented 1 year ago

Interesting. For rows 16 and 17, are they 10-Ks?

hasanallahyarov1 commented 1 year ago

If you check main_folder on GitHub, we put the latest version of the code there, and we updated it so that it now works for the report you provided as well.

yuxuanbrandeis commented 1 year ago

It is a 10-Q, actually. We are modifying the code so it also prints another column labeling each report as either 10-Q or 10-K.
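A minimal sketch of how such a label could be read, assuming the inputs are EDGAR full-submission text files, whose header carries a `CONFORMED SUBMISSION TYPE:` field. The function name `get_form_type` is hypothetical and not taken from the repository.

```python
import re
from typing import Optional

# Header field as it appears in EDGAR full-submission .txt files.
_FORM_TYPE_RE = re.compile(
    r"^\s*CONFORMED SUBMISSION TYPE:\s*(\S+)", re.MULTILINE
)


def get_form_type(raw_text: str) -> Optional[str]:
    """Return the form type (e.g. '10-Q' or '10-K') from the SEC header, or None."""
    match = _FORM_TYPE_RE.search(raw_text)
    return match.group(1).upper() if match else None
```

The returned value could be written to the extra column directly, which avoids inferring the form type from the body text.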

yuxuanbrandeis commented 1 year ago

[Screenshot: Screen Shot 2023-07-18 at 11 34 44 PM]

This company had two 10-Q files under 2022 02/01; they filed twice, in March and June.

YeabinMoonEcon commented 1 year ago

Alright, I've reviewed Hassan's code. I would rather use his code because it seems strange to include the system path in the deliverable. However, we still need to address the problem with the end-point string.

YeabinMoonEcon commented 1 year ago

Professor Becker is still facing issues with accessing GitHub. Please share the deliverable on Google Drive as well.

YeabinMoonEcon commented 1 year ago

Some reports contain a nesting structure within the MD&A section, where you may come across bulleted points (unordered lists) or numerical subheadings. The current algorithm may not correctly handle the sentence delimitation in such cases. If you come across any reports with this structure, please inform me, and I will review them to identify and fix any issues.

yuxuanbrandeis commented 1 year ago

Thank you, sounds good

YeabinMoonEcon commented 1 year ago

Could you add demo code that follows the processing pipeline? Pick one 10-Q and one 10-K report and show how each is processed.

hasanallahyarov1 commented 1 year ago

Do you mean for sentiment analysis?

YeabinMoonEcon commented 1 year ago

From the raw input to the sentiment.

hasanallahyarov1 commented 1 year ago

Yes, I will add it now. Also, while doing the sentiment analysis I came across this report, which has bullet points: 0000940944-22-000007.txt. The sent_tokenizer was not working on it and the text was exceeding the limit, so this is how we are working around it for now, though we are trying to find a better solution: clean_string = re.sub(r"[•;]", ".", clean_string)
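A minimal sketch of that workaround with a length cap added. Several things here are assumptions: the "limit" is taken to be the model's maximum input length, word count is used as a rough proxy for token count, and a naive regex splitter stands in for the sentence tokenizer so the example is self-contained. The name `split_sentences` and the `max_words` default are hypothetical.

```python
import re
from typing import List


def split_sentences(text: str, max_words: int = 200) -> List[str]:
    """Split text into sentence-like chunks of at most max_words words."""
    # Treat bullets and semicolons as sentence boundaries (the workaround above).
    text = re.sub(r"[•;]", ".", text)
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace.
    # It will also split after abbreviations such as "Inc.".
    pieces = re.split(r"(?<=[.!?])\s+", text)
    chunks: List[str] = []
    for piece in pieces:
        words = piece.strip(" .").split()
        # Hard-wrap any piece that is still longer than the cap.
        for i in range(0, len(words), max_words):
            chunks.append(" ".join(words[i:i + max_words]))
    return chunks
```

The hard wrap guarantees that no chunk exceeds the cap even when a bulleted block contains no sentence punctuation at all, at the cost of occasionally cutting a sentence in the middle.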

hasanallahyarov1 commented 1 year ago

demo.py added

YeabinMoonEcon commented 1 year ago

0001628280-23-004026.txt Try this.

hasanallahyarov1 commented 1 year ago

To be honest, we know about this problem, but we have not found a solution. Our idea was to take only the last occurrence of the section header. However, in some reports these items are mentioned again after the Management Discussion part, for example when the headers are repeated at the end of the document. We think taking the last occurrence will increase overall accuracy, although some cases that are extracted correctly now will break. We do not have a final solution and have run out of ideas for such cases, so we would be happy to hear any suggestions.
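A minimal sketch comparing the last-occurrence idea with one alternative, picking the longest candidate span. The longest-span rule is an assumption offered for comparison, not the method used in the repository, and the header patterns below are hypothetical examples for a 10-K (a 10-Q would use different item numbers).

```python
import re
from typing import List, Optional, Tuple

# Hypothetical header patterns for a 10-K MD&A section.
START_RE = re.compile(r"item\s+7\.?\s+management.s\s+discussion", re.IGNORECASE)
END_RE = re.compile(r"item\s+(?:7a|8)\b", re.IGNORECASE)


def candidate_spans(text: str) -> List[Tuple[int, int]]:
    """Pair every start header with the next end header that follows it."""
    spans = []
    for start in START_RE.finditer(text):
        end = END_RE.search(text, start.end())
        if end:
            spans.append((start.start(), end.start()))
    return spans


def pick_section(text: str, strategy: str = "longest") -> Optional[str]:
    """Return one candidate section using the 'last' or 'longest' rule."""
    spans = candidate_spans(text)
    if not spans:
        return None
    if strategy == "last":
        begin, stop = spans[-1]
    else:
        # Table-of-contents entries and cross-references give short spans,
        # so the longest span is more likely to be the real section body.
        begin, stop = max(spans, key=lambda span: span[1] - span[0])
    return text[begin:stop]
```

Running both strategies on the same report and flagging the files where they disagree could give a short list of cases to review by hand.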

hasanallahyarov1 commented 1 year ago

We also have a question about the results. We added the Filed Date and the number of words, but for now we don't know how to improve the extraction; we are still working on it, and on the sentiment as well. Our plan is to run the time series for the last 5 years with the current extraction and sentiment and then analyze the results. While that runs, we will keep trying to improve the code, but we won't be able to include the improvements in the output for now because, as you know, the run takes time, and whenever we change the code we need to rerun all 5 years from the beginning. Do you think this will work, or is it better to improve both the extraction and the sentiment further before running?
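One possible way to reduce the rerun cost is to cache each stage's output per file, keyed by a version tag, so that changing only the sentiment step does not force a fresh extraction. This is a minimal sketch under that assumption; `run_with_cache`, the JSON cache file, and the version strings are all hypothetical.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterable


def run_with_cache(
    paths: Iterable[Path],
    stage: Callable[[str], dict],
    cache_file: Path,
    version: str,
) -> Dict[str, dict]:
    """Apply stage to each file, reusing cached results for the same version."""
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    results: Dict[str, dict] = {}
    try:
        for path in paths:
            name = Path(path).name
            key = f"{version}:{name}"
            if key not in cache:
                # Only files not yet processed under this version are recomputed.
                cache[key] = stage(Path(path).read_text(errors="ignore"))
            results[name] = cache[key]
    finally:
        # Persist progress even if the run is interrupted part-way.
        cache_file.write_text(json.dumps(cache))
    return results
```

Bumping `version` for one stage invalidates only that stage's cached entries, so an interrupted or partially changed run can resume without starting all 5 years over.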