ykdojo / editdojo2

This used to be Edit Dojo's private repo - now it's public.
https://www.csdojo.io/edit

Make a Python script for separating a post into sentences and store that info in the database #48

Open ykdojo opened 5 years ago

ykdojo commented 5 years ago

This is so that users will be able to edit a post sentence-by-sentence.

ykdojo commented 5 years ago

@Jonathantsho FYI, I'm working on this one now.

ykdojo commented 5 years ago

I'm starting to think this might be a good structure for the database:

We already have: Post.

Each Post has text.

In addition to that, we should have a Sentence model.

Each Sentence will belong to a Post.

And each Sentence will have a sentence_index, which will be the index that shows where in the Post it appears.

So, the first sentence will have sentence_index = 0, and the second sentence will have sentence_index = 1, and so on.
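A minimal sketch of that structure, written with plain Python dataclasses rather than the project's actual database models (which aren't shown in this thread). Only `sentence_index` comes from the description above; `post_id`, `text` on `Sentence`, and the `add_sentences` helper are hypothetical names used for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sentence:
    post_id: int          # hypothetical reference back to the owning Post
    sentence_index: int   # 0-based position of the sentence within the Post
    text: str             # hypothetical field holding the sentence itself


@dataclass
class Post:
    id: int
    text: str
    sentences: List[Sentence] = field(default_factory=list)

    def add_sentences(self, pieces: List[str]) -> None:
        """Replace this post's sentences, numbering them from 0 in order."""
        self.sentences = [
            Sentence(post_id=self.id, sentence_index=i, text=piece)
            for i, piece in enumerate(pieces)
        ]
```

Storing the index explicitly means the sentences can be fetched in any order and still be put back in their original sequence.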

I'm going to look into the best way to do this right now.

ykdojo commented 5 years ago

This actually seems like a non-trivial problem.

Some Stack Overflow discussions about this:
https://stackoverflow.com/questions/9474395/how-to-break-up-a-paragraph-by-sentences-in-python
https://stackoverflow.com/questions/4576077/python-split-text-on-sentences

Looks like nltk.tokenize is a preferred solution, as described here: https://stackoverflow.com/questions/9474395/how-to-break-up-a-paragraph-by-sentences-in-python

I'm going to try and see if it works with Japanese, too.

ykdojo commented 5 years ago

I tried using nltk.tokenize, but I got an error (the screenshot isn't preserved here).

Looks like we'll need to load some data somewhere first?
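The exact error isn't recorded in this thread, so the following is a guess at the cause: `sent_tokenize` depends on the Punkt models, which NLTK does not bundle, and it raises `LookupError` when they are missing. A sketch of loading the data and of handling its absence, where `download_punkt` and `split_with_nltk` are hypothetical helper names:

```python
from typing import List, Optional


def download_punkt() -> None:
    """One-time setup: fetch the Punkt models that sent_tokenize relies on."""
    import nltk

    # Assumption: older NLTK releases look for "punkt", newer ones for
    # "punkt_tab", so both are requested here.
    for resource in ("punkt", "punkt_tab"):
        nltk.download(resource, quiet=True)


def split_with_nltk(text: str) -> Optional[List[str]]:
    """Split text with NLTK, or return None if NLTK or its data is unavailable."""
    try:
        from nltk.tokenize import sent_tokenize
        return sent_tokenize(text)
    except (ImportError, LookupError):
        # LookupError is what NLTK raises when a data resource is missing.
        return None
```

Running `download_punkt()` once (it needs network access) and then calling `split_with_nltk(post_text)` would be the intended order. Whether the default English model handles Japanese punctuation well is a separate question that this sketch does not answer.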

Anyway, for now I'm just going to make a simplified version of this algorithm and move on (maybe just splitting posts at line breaks).

ykdojo commented 5 years ago

I'm going to work on this now.

I'm planning to split each post at line breaks and ignore empty lines.
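A rough sketch of that simplified approach, not the code that ended up on the branch. The function name `split_by_line_breaks` is hypothetical, and trimming surrounding whitespace from each line is an added assumption.

```python
from typing import List


def split_by_line_breaks(text: str) -> List[str]:
    """Treat each non-empty line of a post as one 'sentence'."""
    # splitlines() handles both \n and \r\n; whitespace-only lines are dropped.
    return [line.strip() for line in text.splitlines() if line.strip()]


post_text = "First line.\n\n  \nSecond line.\r\nThird."
pieces = split_by_line_breaks(post_text)

# Pair each piece with its position, matching the 0-based index idea above.
indexed = list(enumerate(pieces))
```

Because it only looks at line breaks, this works the same way for Japanese and English text, at the cost of treating a multi-sentence paragraph as a single unit.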

ykdojo commented 5 years ago

Note: I'm planning to do this work on the following branch: https://github.com/ykdojo/editdojoprivate/tree/split-post-into-paragraphs