ykdojo / editdojo2

This used to be Edit Dojo's private repo - now it's public.
https://www.csdojo.io/edit

Make a worker script that periodically retrieves users' tweets #6

Open ykdojo opened 5 years ago

ykdojo commented 5 years ago

Probably use lists for this: https://help.twitter.com/en/using-twitter/twitter-lists-not-working https://developer.twitter.com/en/docs/accounts-and-users/create-manage-lists/api-reference/get-lists-statuses

ykdojo commented 5 years ago

Probably use count = 100 or something like that.
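
For concreteness, a single call to that endpoint with count = 100 might look roughly like this (just a sketch; requests + requests_oauthlib and the env var names are assumptions, since we haven't picked a library yet):

```python
import os

import requests
from requests_oauthlib import OAuth1

# OAuth1 credentials pulled from env vars (names are placeholders).
auth = OAuth1(
    os.environ["TWITTER_CONSUMER_KEY"],
    os.environ["TWITTER_CONSUMER_SECRET"],
    os.environ["TWITTER_ACCESS_TOKEN"],
    os.environ["TWITTER_ACCESS_TOKEN_SECRET"],
)

# GET lists/statuses for our list, asking for 100 tweets per call.
resp = requests.get(
    "https://api.twitter.com/1.1/lists/statuses.json",
    auth=auth,
    params={"list_id": os.environ["TWITTER_LIST_ID"], "count": 100},
)
resp.raise_for_status()
tweets = resp.json()  # a list of tweet objects (JSON)
```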

ykdojo commented 5 years ago

I'm planning to work on this one next. I'm not sure what the right approach is for scheduling jobs with AWS yet. @Jonathantsho

ykdojo commented 5 years ago

On Heroku, you can add background tasks with Python with Redis Queue: https://devcenter.heroku.com/articles/python-rq

I'm guessing there's something similar on AWS, too.

ykdojo commented 5 years ago

Just read this: https://aws.amazon.com/elasticbeanstalk/

I'm thinking of starting with the AWS tutorial mentioned on that page.

Jonathantsho commented 5 years ago

Just read this: https://aws.amazon.com/elasticbeanstalk/

I'm thinking of starting with the AWS tutorial mentioned on that page.

Great choice! I actually used Elastic Beanstalk (EB) as the AWS service for the migration.

There are several ways to go about this. The one I used was a cron job: a process that is triggered on a schedule and can run things like retrieving users' tweets. http://blog.rotaready.com/scheduled-tasks-elastic-beanstalk-cron/

Another way is to use Celery, a popular Python library for running jobs in the background. https://realpython.com/asynchronous-tasks-with-django-and-celery/

YK - you're planning to run this script periodically to dump the tweet data into a database, right? If so, I recommend PostgreSQL or MongoDB. Both are very popular databases that handle JSON data well (I assume tweet data is in JSON format).

Either of these solutions should handle very simple workloads in the meantime. I assume in the future we'd like to have realtime updates of tweets (where users' new tweets get displayed to everyone instantaneously).

ykdojo commented 5 years ago

Yeah I was thinking of using Postgres for this since it looks like a popular choice for Django: https://github.com/ykdojo/editdojo/issues/27

Using either cron or Celery sounds good to me. I think both are popular choices. Perhaps Celery is easier to integrate with the main app because it's Python-based, but I'm not sure.

I assume in the future we'd like to have realtime updates of tweets (where users' new tweets get displayed to everyone instantaneously).

Actually, this might be hard. Twitter's API for retrieving tweets in realtime only works for up to 5000 users, if I remember correctly. Retrieving tweets every minute is probably fast enough for most situations, though.

ykdojo commented 5 years ago

Just found this tutorial for using Django on AWS Beanstalk: https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/create-deploy-python-django.html

I'm thinking of following it now.

Which tutorial did you follow, @Jonathantsho?

ykdojo commented 5 years ago

I found another tutorial: https://realpython.com/deploying-a-django-app-and-postgresql-to-aws-elastic-beanstalk/

ykdojo commented 5 years ago

I'm going to work on #16 before this one.

ykdojo commented 5 years ago

NOTE: We decided to go with Heroku at the beginning instead.

ykdojo commented 5 years ago

@Jonathantsho You can find the rate limits here: https://developer.twitter.com/en/docs/accounts-and-users/create-manage-lists/api-reference/get-lists-statuses

(screenshot of the rate limit table)

Jonathantsho commented 5 years ago

@Jonathantsho You can find the rate limits here: https://developer.twitter.com/en/docs/accounts-and-users/create-manage-lists/api-reference/get-lists-statuses

(screenshot of the rate limit table)

Thanks man! Assigning this task to myself.

Jonathantsho commented 5 years ago

DONE: add users to the DB, with already_in_twitter defaulted to False. TODOs: a background task to periodically check for users with already_in_twitter = False, group them, and add them to the Twitter list; set already_in_twitter to True for the users successfully added; retrieve tweets periodically in the same background task.

Edge case: if a user is already in the DB and already_in_twitter is True, do not add them to the DB again. This takes care of the case where a user authenticates across multiple devices.

I found a lightweight background task library that seems to be really quick to set up. I will try it out. https://github.com/arteria/django-background-tasks

ykdojo commented 5 years ago

DONE: add users to the DB, with already_in_twitter defaulted to False.

Nice! Do you have a PR for this?

TODOs: a background task to periodically check for users with already_in_twitter = False, group them, and add them to the Twitter list.

Sounds good. There's a way to filter for users with already_in_twitter = False in Django. I think it's the filter() function on the CustomUser model.
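
Something like this, I think (just a sketch, assuming already_in_twitter is a BooleanField on CustomUser and that the model lives in a users app):

```python
# Sketch: fetch users who haven't been added to the Twitter list yet.
from users.models import CustomUser  # app path is an assumption

pending_users = CustomUser.objects.filter(already_in_twitter=False)
```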

Set already_in_twitter to True for the users successfully added; retrieve tweets periodically in the same background task.

NOTE: We should probably retrieve tweets every minute or so, assuming the Twitter API allows that many requests.

Edge case: if a user is already in the DB and already_in_twitter is True, do not add them to the DB again. This takes care of the case where a user authenticates across multiple devices.

Right. Also the case where a user logs out and logs in again.

I think the entire logic of the every-minute worker script should be like this (it's just a draft; there's a rough code sketch after the list):

  1. Find all users with already_in_twitter_LIST = False. (Most of the time, this will return an empty list because user signups won't be that frequent.)
  2. Add those users to our Twitter list and flip already_in_twitter_LIST = True. A single list will work until we have 5000 users.
  3. Retrieve new tweets from the list and store them in our database.
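
Here's a rough sketch of that logic in Python (CustomUser / already_in_twitter are our assumed model names, and add_users_to_twitter_list / fetch_new_list_tweets / save_tweets are hypothetical helpers we'd still need to write):

```python
from users.models import CustomUser  # assumed app path


def run_every_minute():
    # 1. Find users not yet on our Twitter list (usually an empty queryset).
    pending = CustomUser.objects.filter(already_in_twitter=False)

    # 2. Add them to the single list and flip the flag.
    if pending.exists():
        add_users_to_twitter_list(pending)  # hypothetical helper
        pending.update(already_in_twitter=True)

    # 3. Retrieve new tweets from the list and store them in our database.
    save_tweets(fetch_new_list_tweets())  # hypothetical helpers
```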

Perhaps we should also have an hourly worker script to fix any data integrity issues. For example:

  1. Make sure the Twitter accounts users signed up with are still public (users haven't turned them private).
  2. Make sure that the users we think are in our list actually are.
  3. (There might be other things we want to check hourly in the future.)

I found a lightweight background task library that seems to be really quick to set up. I will try it out. https://github.com/arteria/django-background-tasks

Oh no! I actually went down the same route, and it was kind of confusing/annoying to work with. It might work, but I wouldn't recommend it. So we should probably go with Heroku's recommended method: https://devcenter.heroku.com/articles/python-rq
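
The core pattern from that article is pretty small. A minimal sketch (the job function is a placeholder, and REDIS_URL is the config var Heroku's Redis add-ons set):

```python
# jobs.py -- job functions need to live in a module the rq worker can import.
def retrieve_tweets_job():
    # placeholder for the actual "fetch tweets and store them" logic
    pass
```

```python
# Somewhere in the web process (e.g. a view or management command):
import os

from redis import Redis
from rq import Queue

from jobs import retrieve_tweets_job

q = Queue(connection=Redis.from_url(os.environ["REDIS_URL"]))
q.enqueue(retrieve_tweets_job)  # the worker dyno picks this up and runs it
```

Note that rq by itself only runs whatever gets enqueued, so the periodic "every minute" trigger would need a scheduler on top (rq-scheduler is one option).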

Jonathantsho commented 5 years ago

Looking into Heroku's recommended method for background tasks.

Installing and setting up Heroku tonight.

ykdojo commented 5 years ago

Sounds good! Let me know if you have any trouble with it, too.

Jonathantsho commented 5 years ago

Sounds good! Let me know if you have any trouble with it, too.

I keep running into the error: django.db.utils.OperationalError: no such table: django_site when I run setup_twitter.py or exec(open("./setup_twitter.py").read()).

Note: I have the PostgreSQL add-on set up, and the Heroku config vars have DATABASE_URL and SECRET_KEY.

This has been a looming issue for the past 4 hours of development, and a major roadblock.

Jonathantsho commented 5 years ago

Sounds good! Let me know if you have any trouble with it, too.

I keep running into the error: django.db.utils.OperationalError: no such table: django_site when I run setup_twitter.py or exec(open("./setup_twitter.py").read()).

Note: I have the PostgreSQL add-on set up, and the Heroku config vars have DATABASE_URL and SECRET_KEY.

This has been a looming issue for the past 4 hours of development, and a major roadblock.

This issue has been fixed. I've now encountered a new error: "Social Network Login Failure". I think this is because the callback URL doesn't match the callback URL set in the Twitter dev console.

Once I figure out what the callback URL is (and whether I can add more callback URLs to the Twitter dev portal), I should be able to log in.

Jonathantsho commented 5 years ago

Sounds good! Let me know if you have any trouble with it, too.

I keep running into the error: django.db.utils.OperationalError: no such table: django_site when I run setup_twitter.py or exec(open("./setup_twitter.py").read()). Note: I have the PostgreSQL add-on set up, and the Heroku config vars have DATABASE_URL and SECRET_KEY. This has been a looming issue for the past 4 hours of development, and a major roadblock.

This issue has been fixed. I've now encountered a new error: "Social Network Login Failure". I think this is because the callback URL doesn't match the callback URL set in the Twitter dev console.

Once I figure out what the callback URL is (and whether I can add more callback URLs to the Twitter dev portal), I should be able to log in.

This issue has also been fixed.

I'm going to begin working on the Redis queue within my dev environment. I found a repo that has a good integration between rq + Django + Heroku; it's called django-rq.

Looks intuitive based on the readme, going to give it a try.

https://github.com/rq/django-rq

ykdojo commented 5 years ago

Sounds good!

Jonathantsho commented 5 years ago

Sounds good! Let me know if you have any trouble with it, too.

I keep running into the error: django.db.utils.OperationalError: no such table: django_site when I run setup_twitter.py or exec(open("./setup_twitter.py").read()). Note: I have the PostgreSQL add-on set up, and the Heroku config vars have DATABASE_URL and SECRET_KEY. This has been a looming issue for the past 4 hours of development, and a major roadblock.

This issue has been fixed. I've now encountered a new error: "Social Network Login Failure". I think this is because the callback URL doesn't match the callback URL set in the Twitter dev console. Once I figure out what the callback URL is (and whether I can add more callback URLs to the Twitter dev portal), I should be able to log in.

This issue has also been fixed.

I'm going to begin working on the Redis queue within my dev environment. I found a repo that has a good integration between rq + Django + Heroku; it's called django-rq.

Looks intuitive based on the readme, going to give it a try.

https://github.com/rq/django-rq

I didn't end up using django-rq, just the base rq library. The Redis queue is set up in my dev environment, but it only runs sample code for now. Later today I'll start writing the real code (to check the DB, do other functions, etc.).

Jonathantsho commented 5 years ago

Added new functionality to the background task:

  • Every minute, the background task retrieves all records in the Postgres DB (users) where already_in_twitter is False
  • Adds them to the Twitter list
  • Updates their records (already_in_twitter) to True

Ideal things to do would be exception handling (storing logging data somewhere, or running another nightly concurrent task to check that the Twitter list matches the database list).

It'd be ideal to retrieve the user's Twitter ID when they sign up to our app, in the POST request... I can't quite figure out how to do this. I can only retrieve their username, so I have to do another API request per user to retrieve the user ID from the name. This isn't an immediate problem for now, though.

ykdojo commented 5 years ago

It'd be ideal to retrieve the user's Twitter ID when they sign up to our app, in the POST request... I can't quite figure out how to do this. I can only retrieve their username, so I have to do another API request per user to retrieve the user ID from the name. This isn't an immediate problem for now, though.

I think you can do this pretty easily through django-allauth.

If I remember correctly, you just need to find the Twitter model django-allauth automatically creates when a user signs up and find the Twitter ID there.

ykdojo commented 5 years ago

I was just looking into this a bit.

So, after running python manage.py shell (inside pipenv shell), you can do this:

from allauth.socialaccount.models import SocialAccount

sa = SocialAccount.objects.all()[0]  # retrieve the first social account in the database
sa.__dict__.keys()  # show all the attributes in this object

And I think sa.uid is the Twitter ID.
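
So for any given user, grabbing their Twitter ID might be as simple as this (a sketch; it assumes the provider string allauth uses is "twitter"):

```python
from allauth.socialaccount.models import SocialAccount


def twitter_uid_for(user):
    # allauth stores one SocialAccount per social login; uid holds the Twitter ID.
    return SocialAccount.objects.get(user=user, provider="twitter").uid
```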

ykdojo commented 5 years ago

Added a Post model here. What do you think? https://github.com/ykdojo/editdojoprivate/commit/ae3fe0a4bb820ae03befc2fa865e1424fb491e94

ykdojo commented 5 years ago

Also take a look at this file for more examples: https://github.com/ykdojo/editdojoprivate/blob/master/create_post_samples.py

Jonathantsho commented 5 years ago

sa.__dict__.keys()

@ykdojo Wow, this is great!! Thanks! I never knew you could reference SocialAccount. Which models.py is it stored in? I can't quite find the class.

In the meantime, I'll update my code and write another script that spins up a concurrent worker to identify data integrity issues.

ykdojo commented 5 years ago

I never knew you could reference SocialAccount. Which models.py is it stored in? I can't quite find the class.

I think it's this one: https://github.com/pennersr/django-allauth/blob/master/allauth/socialaccount/models.py

Jonathantsho commented 5 years ago

Still WIP - need to bug-test a little more before I open a merge request.

Jonathantsho commented 5 years ago

Updates: learned how to log and monitor the background worker process. The worker process runs and schedules as intended. However, the script I'm running in the worker process is failing for some reason; I'll be working on this later today.

Jonathantsho commented 5 years ago

This issue is 75% complete.

The only to-do in this task now is to retrieve the tweets and store them in a tweet data table.

This is how I'm planning to go about doing it, attaching image below. @ykdojo

Basically, from our already_in_twitter table, we're going to create a one-to-many relationship (one-to-one would also work) to a table that will store our tweet data. This tweet_data table will have three columns: the unique_id (pk), user (str), and tweets (json).

The tweet_data table will only be populated with tweets from users whose already_in_twitter column is True in the already_in_twitter table. That way, we won't run into issues where we try to pull tweets from users who are not on the Twitter list yet.

(diagram of the proposed table relationship)
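
To make the idea concrete, a minimal Django sketch of that relationship could look like this (illustrative only; it's not the actual Post model from YK's commit, and the field names are just taken from the description above):

```python
from django.conf import settings
from django.contrib.postgres.fields import JSONField  # Django ~2.x style
from django.db import models


class TweetData(models.Model):
    # unique_id (pk): Django adds an auto-incrementing "id" primary key by default.
    # One user row -> many tweet_data rows.
    user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    tweets = JSONField()  # raw tweet payload from the Twitter API
```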

Jonathantsho commented 5 years ago

Currently stuck on an issue where I cannot send tasks to worker dynos. I've posted a Stack Overflow question about it: https://stackoverflow.com/questions/54433958/python-heroku-how-to-send-tasks-to-background-workers-in-heroku?noredirect=1#comment95677393_54433958

Jonathantsho commented 5 years ago

Fixed the issue of zombie workers spinning up. The rule of thumb is to not run python manage.py rqworker "queuename" on the web dyno, only on the worker dyno.

To do this, you scale up worker dynos via heroku ps:scale worker=x, with x being the number of workers.
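
In Procfile terms, that looks roughly like this (the module path, queue name, and process names are placeholders, not our exact setup):

```
web: gunicorn editdojo.wsgi
worker: python manage.py rqworker "queuename"
```

Then the worker dyno is scaled separately with something like heroku ps:scale worker=1.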

Jonathantsho commented 5 years ago

For the data integrity check, I'm thinking of creating a separate queue called "twit_check". It will run under a qa_worker process defined in the Procfile.

I'll need to discuss the finer details of what it'll check with YK, besides catching cases where the Twitter API fails (when users don't get added to the list properly).

Jonathantsho commented 5 years ago

We have some issues with storing tweets in the Post model that I'll need to fix.

Jonathantsho commented 5 years ago

We have some issues with storing tweets in the Post model that I'll need to fix.

  • [x] When create_post_samples.py is run more than once, the same tweets can appear and get duplicated. When tweets are retrieved, they should replace the rows with existing tweet_id_strs (see the sketch below).
  • [x] Retrieving tweets retrieves all of a user's tweets... since the beginning of time. Is this what we want? What if Donald Trump signs up and we ingest 5,000 tweets? Correct me if I'm wrong. @ykdojo

Duplication fixed.
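
For reference, a possible way to do that replacement is an upsert keyed on tweet_id_str (just a sketch; the Post fields and import path here are assumptions and may not match the actual model):

```python
from core.models import Post  # assumed app path


def store_tweets(tweets):
    # Upsert each tweet on its tweet_id_str so re-running the script
    # doesn't create duplicate rows.
    for tweet in tweets:
        Post.objects.update_or_create(
            tweet_id_str=tweet["id_str"],
            defaults={"text": tweet["text"]},
        )
```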

ykdojo commented 5 years ago

@Jonathantsho For the second point, maybe retrieve the most recent 200 tweets or something? Not sure if it's the best strategy yet, though.

ykdojo commented 5 years ago

Also, we should almost always retrieve tweets from one of the lists, not from individual users (as we discussed on our call).

Jonathantsho commented 5 years ago

Finished writing code to retrieve all tweets from the lists and store them in the Post model.

It seems that retrieving the most recent 200 tweets is possible; the API looks like it can support this. I'll need to investigate.

Will test and push a PR.

ykdojo commented 5 years ago

Sounds good. You might only be able to retrieve 20 tweets or so per call, so you might need to make about 10 calls to get those 200 tweets. I think it depends on the type of call you make.
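
If it does turn out to be around 20 per call, paging backwards with max_id is the usual v1.1 pattern. A rough sketch (auth would be an OAuth1 object like in the earlier snippet, and list_id / page size are placeholders):

```python
import requests

LIST_STATUSES_URL = "https://api.twitter.com/1.1/lists/statuses.json"


def fetch_recent_tweets(auth, list_id, target=200, per_page=20):
    # Page backwards through the list timeline via max_id until we have ~target tweets.
    tweets, max_id = [], None
    while len(tweets) < target:
        params = {"list_id": list_id, "count": per_page}
        if max_id is not None:
            params["max_id"] = max_id
        page = requests.get(LIST_STATUSES_URL, auth=auth, params=params).json()
        if not page:
            break
        tweets.extend(page)
        max_id = page[-1]["id"] - 1  # skip the oldest tweet we already have
    return tweets[:target]
```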