russelljjarvis closed this issue 3 years ago.
> I am creating a kind of live, author-searchable version of your Python app here. There is a live/deployed version of the app here, which supports word clouds as well as readability.
Interesting! I am sure you have seen that there are a lot of things in the code which could be made better. I was learning Python when we did this project, so a lot of the code could be improved (and is quite embarrassing to look at today). For example, there are functions in the nltk package that do a lot of the things abstract_cleanup.py does (and probably quicker and better). Also, nltk can be used instead of TreeTagger. So it is probably good to keep this in mind! If you need any help, let me know.
> I wonder what the process of publishing in eLife was like? Were the journal reviewers good?
Absolutely. One of the cool things about eLife is that they published the reviews and our response to them (so they can be seen at the bottom of the article). We thought they were really helpful. I have always had an enjoyable experience publishing and reviewing for eLife (plus their commitment to open access and transparent methods is really good!)
Yes, you are right. I couldn't wait until the readability-of-science code was pip-installable, so I put an acknowledgement of the code's origin at the top of the file and pasted it into my repository as a temporary measure.
As you guessed, I also decided to replace the obsolete TreeTagger with the NLTK tagger. This version of your code does seem to execute, so I merged this change back into your code here.
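For anyone reading along, a minimal sketch of that kind of substitution, using the standard NLTK API (this is not the exact code in the merge; `pos_tag_text` is an illustrative helper name, and note that NLTK's default tagset differs from TreeTagger's):

```python
import nltk

# One-time downloads of the tokenizer and tagger models.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

def pos_tag_text(text):
    """Tokenize and part-of-speech tag a string with NLTK,
    as a drop-in replacement for a TreeTagger call."""
    tokens = nltk.word_tokenize(text)
    return nltk.pos_tag(tokens)

print(pos_tag_text("Readable abstracts help science travel further."))
```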
One thing that you could do to help is to test the live version of the application on yourself as an author to see if the readability scores match your intuition and knowledge of your own written work.
In exchange for any advice and testing, we can offer you co-authorship on our Prepub Open Science Portal. Also, if this doesn't suit you but you know of other clear-language coding enthusiasts, please tell them about the project too.
These are the lines of our code where we compute readability on the scraped corpus. At the moment, I am somewhere halfway between using your FRE metric, the textstat "standard" metric (a consensus measure over about 10 different readability metrics), and instantaneous versions of them both, which are less affected by word count.
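As a minimal sketch, the two textstat calls in question look like this (this is the real textstat API; the sample text is just an illustration, not from our corpus):

```python
import textstat

abstract = ("We characterise the distributional properties of "
            "readability scores across a scraped corpus of abstracts.")

# Flesch Reading Ease: higher = easier; roughly 0-100 for typical prose.
print(textstat.flesch_reading_ease(abstract))

# Consensus grade level pooled over several underlying readability formulas.
print(textstat.text_standard(abstract, float_output=True))
```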
The reason for all of these different approaches to measuring is to get rid of intuition-violating scores, like -1 or >60. The purpose of our application is to give readers insight into their own or their colleagues' readability on an individual basis, and intuition-violating scores won't help them do that.
The benefit of using FRE over textstat's standard metric is that it gives us some continuity with your pre-established work. The downside is that these metrics seem to be better suited to population-level analysis.
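For context, this is the standard Flesch Reading Ease formula (textbook coefficients, not code from either repository), which shows why the score is unbounded below and can go negative on dense prose:

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Standard FRE: 206.835 - 1.015*(words/sentence) - 84.6*(syllables/word).
    Long sentences and many syllables per word can push this negative."""
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)
```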
> For example, there are functions in the nltk package that do a lot of the things abstract_cleanup.py does (and probably quicker and better). Also, nltk can be used instead of TreeTagger. So it is probably good to keep this in mind! If you need any help, let me know.
Actually, I am using the abstract cleanup code. Can you let me know what these other NLTK replacement functions are?
> Yes, you are right. I couldn't wait until the readability-of-science code was pip-installable, so I put an acknowledgement of the code's origin at the top of the file and pasted it into my repository as a temporary measure.
As I was saying in the PR that you made, I think this is a great thing to do! Alternatively, we could create a new repo for this that is pip-installable (and preferably where the code gets cleaned up a little bit more and made more Pythonic). But I feel this repo should remain a more static snapshot of the analyses we did in the eLife article. Hopefully that is clear.
> The benefit of using FRE over textstat's standard metric is that it gives us some continuity with your pre-established work. The downside is that these metrics seem to be better suited to population-level analysis.
We used FRE because it was more standard, but there are critiques of it. I am sure that they correlate quite highly with each other. So using the textstat package makes a lot of sense and using their cleanup functions would also help bring the analyses more in line with other readability analyses.
> Actually, I am using the abstract cleanup code. Can you let me know what these other NLTK replacement functions are?
The abstract cleanup was based on us creating a lot of rules to ensure the preprocessing of articles worked well, and we did several rounds of quality control to ensure they worked on our data. This is something we made up with regular expressions. While it worked, it is the least elegant way to solve the problem. If I were to do a readability analysis today, I would look into functions in NLTK (or textstat) that could replace some of the heuristics that I created then. For one, they will probably be quicker (since the code we wrote was definitely not written for speed), but their code is also maintained. That is why I would suggest substituting out our functions with textstat/nltk functions wherever possible.
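To make that concrete, here is a minimal sketch of the sort of NLTK tokenizer calls that could replace hand-rolled regex rules (standard NLTK API; the sample string and the regex shown in the comment are illustrations, not the repository's actual rules):

```python
import nltk

nltk.download("punkt")  # sentence/word tokenizer models

raw = "Dr. Smith et al. reported results. The effect (p < 0.05) was large."

# Instead of hand-rolled splitting such as re.split(r"\.\s+", raw),
# NLTK's tokenizers handle abbreviations and punctuation edge cases.
sentences = nltk.sent_tokenize(raw)
words = [nltk.word_tokenize(s) for s in sentences]
print(sentences)
```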
> One thing that you could do to help is to test the live version of the [application](https://agile-reaches-20338.herokuapp.com/)
I will give it a test and get back to you.
Also, @bcschiffler @mathesong @pontusps, I'm tagging you in here in case any of you want to provide comments on this.
Thanks for reaching out to the other coders.
@mcgurrgurr
Hi @wiheto, I wonder if you had considered remaking this as an eLife executable document?
There is a danger that the research methods won't reproduce from the source code alone, as the modules it depends on are dated.
In one sense, the code/methods are well preserved, but a Docker container version would preserve the executable environment as well as the original code. In my opinion, it is the executable environment that facilitates reproducibility.
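For example, a container recipe could be as small as the sketch below (the base image, pinned requirements file, and entry script are placeholders, since I don't know the repository's actual dependencies):

```dockerfile
FROM python:3.8-slim

WORKDIR /app

# Pin the dated module versions so the analysis environment is frozen.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Placeholder entry point; substitute the repository's actual analysis script.
CMD ["python", "run_analysis.py"]
```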
Hi @wiheto,
I am creating a kind of live, author-searchable version of your Python app here. There is a live/deployed version of the app here, which supports word clouds as well as readability.
I have started using some of your readability analysis code directly, and thus I wanted to call it from inside a Python package. Using the code in this manner would properly preserve the origin of the code.
I wonder what the process of publishing in eLife was like? Were the journal reviewers good?