Open ykdojo opened 5 years ago
@Jonathantsho FYI, I'm working on this one now.
I'm starting to think, this might be a good structure for the database:
We already have: Post.
Each Post has text.
In addition to that, we should have a Sentence model.
Each Sentence will belong to a Post.
And each Sentence will have a sentence_index, which will be the index that shows where in the Post it appears.
So, the first sentence will have sentence_index = 0, and the second sentence will have sentence_index = 1, and so on.
I'm going to look into the best way to do this right now.
This actually seems like a non-trivial problem.
Some StackOverflow discussions about this: https://stackoverflow.com/questions/9474395/how-to-break-up-a-paragraph-by-sentences-in-python https://stackoverflow.com/questions/4576077/python-split-text-on-sentences
Looks like nltk.tokenize is a preferred solution, as described here: https://stackoverflow.com/questions/9474395/how-to-break-up-a-paragraph-by-sentences-in-python
I'm going to try and see if it works with Japanese, too.
I tried using nltk.tokenize, but I got this error:
Looks like we'll need to load some data somewhere first?
Anyway, for now, I'm just going to make a simplified version of this algorithm and move on (maybe like break paragraphs by line breaks for now).
Anyway, for now, I'm just going to make a simplified version of this algorithm and move on (maybe like break paragraphs by line breaks for now).
I'm going to work on this now.
I'm planning to break the sentences by line breaks and ignore empty lines.
Note: I'm planning to work on this branch for this: https://github.com/ykdojo/editdojoprivate/tree/split-post-into-paragraphs
This is so that users will be able to edit a post sentence-by-sentence.