Stemming, Multiples, and Numbers

MichaelChambers commented 4 years ago

According to the Wikipedia page, the formula was supposed to also include words with other suffixes. I found details on the allowed suffixes here: http://www.lefthandlogic.com/htmdocs/tools/okapi/okapimanual/dale_challWorksheet.PDF

That worksheet also says that numbers are allowed, and that the formula counts repeated occurrences of difficult words, so it appears that the uniqueness of easy words need not be considered, just the total.

I've switched my readability to have the following, but I think these details would likely be better as an isDifficult method in dale-chall-formula. Doing so would mean dale-chall-formula would have to require the dale-chall word list, which you perhaps wanted to avoid.

```js
let reDaleChallEndings = /(s|ing|n|ed|ly|er|est)$/
let reDaleChallNumber = /^[1-9]\d{0,2}(,?\d{3})*$/

if (
  daleChall.includes(normalized) ||
  normalized.match(reDaleChallNumber) ||
  (normalized.match(reDaleChallEndings) &&
    daleChall.includes(stemmer(normalized)))
) {
  easyWordCount++
}
```
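To make the number rule easier to check, here is a small sketch that runs the number pattern from the snippet above against a few sample strings. The sample strings and the `numberResults` name are illustrative assumptions, not part of readability or dale-chall-formula.

```js
// Number pattern copied from the snippet above.
const reDaleChallNumber = /^[1-9]\d{0,2}(,?\d{3})*$/

// Hypothetical sample inputs, chosen only to probe the pattern.
const samples = ['7', '1863', '1,863', '1,000,000', '0', '007', '1,86', '3.5']

// Pair each sample with whether the pattern accepts it.
const numberResults = samples.map((sample) => [
  sample,
  reDaleChallNumber.test(sample)
])

console.log(numberResults)
// The first four samples match; '0', '007', '1,86', and '3.5' do not.
```

As written, the pattern only accepts positive whole numbers with optional thousands separators, so zero, decimals, and numbers with leading zeros would still be counted as difficult.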

wooorm commented 4 years ago

Hey these are great finds! By the way, did you know that you can open pull requests here on GitHub as well? If you make a change in your fork of readability, you can then request those changes to be added here.

> I've switched my readability to have the following, but I think these details would likely be better as a isDifficult method in dale-chall-formula. Doing so would require dale-chall to require dale-chall word list, which you perhaps wanted to avoid.

I think it’s the responsibility of other projects, not the list of words, nor the mathematical formula, to try and match those things.


normalized.match(reDaleChallEndings) &&
   daleChall.includes(stemmer(normalized))

Stemming the word probably doesn’t make it match the word list, because stemming does not always produce valid words. For examples of non-words as results, see the output fixtures. You could fix this by stemming the input list too.
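A minimal sketch of that mismatch and the suggested fix follows. It assumes a toy stemmer, `toyStem`, which is a hypothetical stand-in and not the real `stemmer` package, and a two-word sample list rather than the full dale-chall list.

```js
// Hypothetical stand-in for a real stemmer. It only handles a few endings
// and, like real stemmers, can return stems that are not words.
function toyStem(word) {
  const lower = word.toLowerCase()
  if (lower.endsWith('ies')) return lower.slice(0, -3) + 'i'
  if (lower.endsWith('y')) return lower.slice(0, -1) + 'i'
  if (lower.endsWith('s')) return lower.slice(0, -1)
  return lower
}

// Tiny sample list (assumption); the real list is far longer.
const sampleList = ['liberty', 'right']

// Stem the list once so both sides of the comparison are stems.
const stemmedSample = new Set(sampleList.map((word) => toyStem(word)))

const input = 'liberties'
const againstRaw = sampleList.includes(toyStem(input)) // stem vs. raw words
const againstStemmed = stemmedSample.has(toyStem(input)) // stem vs. stems

console.log(toyStem(input), againstRaw, againstStemmed)
// prints: liberti false true
```

The comparison only works when both the input word and the list entries go through the same stemmer.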

MichaelChambers commented 4 years ago

Thanks Titus. For some of these (this one in particular), I wasn't sure where or how you would prefer to fix it, so I figured I'd just raise a flag. I would have eventually created one or more pull requests for some of the issues, but I've made other changes that strip out things you'll want to keep, so I didn't have the precise changes ready yet.

MichaelChambers commented 4 years ago

@wooorm Regarding project responsibilities, I totally agree with how you have the list of words broken out. But I would assume that "the Dale-Chall formula" means all of the parts of that worksheet which are necessarily part of their rules/process, not just the final mathematical equation. Thus the stemming part would seem like a necessary part of following their rules to produce the final score.

Thanks for letting me know that the stemming as I had it wasn't complete. Nice solution about stemming the source list to match!

dale-chall-formula could take the list in dale-chall, and create a second list in memory of stemmed versions of all of them (perhaps lazy loaded on first test). Then it would hold all the dale-chall-formula specifics.
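One way the lazy-loaded stemmed list could look is sketched below. The names `createLazyStemmedList`, `placeholderStem`, and `builds` are hypothetical, and the placeholder stemmer only strips a trailing "s"; a real implementation would presumably pass in a proper stemmer and the full dale-chall list.

```js
// Build the stemmed copy of the list only when the first lookup happens.
function createLazyStemmedList(words, stem) {
  let cache = null // stays null until the first call to has()
  let builds = 0 // counts how often the stemmed list was built

  return {
    has(word) {
      if (cache === null) {
        cache = new Set(words.map((entry) => stem(entry)))
        builds++
      }
      return cache.has(stem(word))
    },
    get builds() {
      return builds
    }
  }
}

// Placeholder stemmer (assumption): lowercases and strips one trailing "s".
const placeholderStem = (word) => word.toLowerCase().replace(/s$/, '')

const lazyList = createLazyStemmedList(['right', 'prick'], placeholderStem)
const buildsBefore = lazyList.builds // 0: nothing has been stemmed yet
const hasRights = lazyList.has('Rights') // true, and triggers the build
const hasWrongs = lazyList.has('wrongs') // false, reuses the cached set

console.log(buildsBefore, hasRights, hasWrongs, lazyList.builds)
// prints: 0 true false 1
```

Callers that never test a word would not pay the cost of stemming the whole list.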

Or there could be a third dale-chall project that was the full worksheet that put the list together with the equation, as well as the other specific parts of the dale-chall process (numbers and stemming). It still makes sense to me to keep that in dale-chall-formula, but whatever works for you.

wooorm commented 4 years ago

While I agree that it would make dale-chall-formula more useful to add functionality like that, well, as we’ve seen from this issue and my experience parsing text, it’s not easy to properly match the rules from your PDF. In fact, it would add lots of logic and thus weight to the project. And be buggy. And I would do it using retext and my other utilities, whereas others might want to swap that out for something else.

I’d be in favor of another project. It could be a new project, but while you can get to 90% perfect text parsing in very few lines of code, properly doing so and getting to 92% adds tons of complexity. So maybe in this project? Because, maybe, ignoring numbers will also help with the other readability formulas?

MichaelChambers commented 4 years ago

True! We're still not getting to 100%. And I had decided for my purposes I didn't care about getting proper nouns too, as that would be too hard, but I hadn't mentioned that. Stepping back a bit, my use case was just that I want to call a function to test any text and have it return the scores. The highlighting was a nice addition. I'm very grateful to you for your projects, as between them they get me 99% of the way to what I wanted! Thanks! As I have only rarely used NPM, I don't have a good sense for splitting out projects there or what is too heavy. I might split readability into two projects - an inner, functional one (readability-scores?) focused on just generating all scores, and the calling (outer) readability project focused on the HTML and user choices. I'm not sure how I would contribute this. I guess I'll try to make readability-scores, and then let you know.

MichaelChambers commented 4 years ago

@wooorm, I've split out the readability-scores code to make a project at https://github.com/MichaelChambers/readability-scores What do you think? I'd suggest adding this as line 73.5 to test.js in order to see all the results, plus difficult words, for the Gettysburg Address: `console.log(results)`

MichaelChambers commented 4 years ago

I used the above to create my first npm project, and then refactored my https://github.com/MichaelChambers/readability to use the new readability-scores.

In my copy, I'd also made some changes to remove some of the user's options.

You're still listed as the author and contributor to both, as well as the code having your prior commits. I wasn't sure how best to handle that, so tried to err on the side of pointing to you instead.

wooorm commented 4 years ago

Congrats on your first npm project! ✨ Welcome!

If you feel your code is a derivative work, you should indeed include me and the license file from the original work. If you believe this is a new creative work, you may omit those. Legalities are confusing.

And splitting the two up, one for the counts, one for the display, seems like a great idea to me!

MichaelChambers commented 4 years ago

I've created a pull request to use the new project with readability. https://github.com/wooorm/readability/pull/8

MichaelChambers commented 4 years ago

From the Dale-Chall test in readability-scores:

Dale-Chall suffixes per http://www.lefthandlogic.com/htmdocs/tools/okapi/okapimanual/dale_challWorksheet.PDF:

['s', 'ies', 'ing', 'n', 'ed', 'ied', 'ly', 'er', 'ier', 'est', 'iest']

None of the words below are in the original Dale-Chall list. All must use stemming. Note that the stem may not be a word: the stem of "Liberties" is "liberti", which is why we stem the original list too.

As "lively" is in the list, "livelier" and "liveliest" should pass due to "ier" and "iest" being valid suffixes. But although "prick" is in the list, "prickly" is not. "Prickly" would be a valid base+suffix, but "pricklier" and "prickliest" should not be valid.

s = 'Rights, Liberties rolling proven rested praised hurried properly higher longest lively livelier liveliest prick prickly pricklier prickliest'
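The "one base plus one allowed suffix" rule described above can be sketched as follows. This is an illustration under stated assumptions, not how readability-scores implements it (which, per the note above, stems the list instead). The `isEasy` helper and the two-word `baseWords` set are hypothetical, and the sketch ignores spelling changes such as dropped "e" or doubled consonants.

```js
// Suffixes quoted from the worksheet note above.
const suffixes = ['s', 'ies', 'ing', 'n', 'ed', 'ied', 'ly', 'er', 'ier', 'est', 'iest']

// Tiny sample (assumption): the note says these two are in the list.
const baseWords = new Set(['lively', 'prick'])

// A word is easy if it is listed, or if removing exactly one allowed
// suffix leaves a listed base. Suffixes starting with "i" (except "ing")
// are treated as replacing a final "y".
function isEasy(word, list) {
  const lower = word.toLowerCase()
  if (list.has(lower)) return true
  return suffixes.some((suffix) => {
    if (!lower.endsWith(suffix)) return false
    const base = lower.slice(0, -suffix.length)
    const replacesY = suffix.startsWith('i') && suffix !== 'ing'
    return list.has(replacesY ? base + 'y' : base)
  })
}

const checked = ['livelier', 'liveliest', 'prickly', 'pricklier', 'prickliest']
const easyFlags = checked.map((word) => isEasy(word, baseWords))

console.log(easyFlags)
// prints: [ true, true, true, false, false ]
```

"Pricklier" fails here because reaching "prick" would need two suffixes, and "prickly" itself is not in the list.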

words / dale-chall-formula

Stemming, Multiples, and Numbers #15