zenhack / ttrss-sandstorm

Sandstorm port of Tiny Tiny RSS
GNU General Public License v3.0
6 stars 2 forks source link

Cloned/restored grains do not update feeds periodically. #25

Open zenhack opened 3 years ago

zenhack commented 3 years ago

I noticed when working on https://github.com/sandstorm-io/sandstorm/pull/3519 that a cloned grain does not have any scheduled tasks -- this is to be expected as the restoring process doesn't do anything to create them. But this means that if you backup & restore a ttrss grain, or clone one, it will forever fail to fetch feeds in the background. We should try to re-arm the task somehow.

It's possible some support in Sandstorm would make this kind of thing easier to deal with.

jdougan commented 2 years ago

Any progress on this? It is kind of important for me.

jdougan commented 2 years ago

For that matter is there anything else the scheduled update does besides poll feeds? I'm built a test sandstorm with a restored ttrss grain and it is showing very different disk usage on very similar VPS configurations.

ocdtrekkie commented 2 years ago

I believe the scheduled task should also scavenge old post content and remove it. But my TTRSS bloats significantly over time, I think that may be somewhat blamable on MySQL in general but I've never dug into what the grain is retaining.

zenhack commented 2 years ago

Note that these tasks should still happen while the grain is running, just not on a schedule if you're not opening it periodically. Which is still annoying and which we should still fix.

I think the way I'd want to go about tackling this would be to add an id field to ScheduledJob in grain.capnp, and have sandstorm de-dup jobs that have the same id -- this way we could just always register the job on startup and rely on it being idempotent (right now it would create a new, separate job on each run).

Can't say when I'd get to it though.

ocdtrekkie commented 2 years ago

Is there any way jdougan could tamper with the grain backup so it does register for cron when he restores it to a new server?

zenhack commented 2 years ago

/me looks back at code I wrote almost two years ago

Looks like it looks for the presence of the file /var/updates-scheduled to see if it has already registered the job. So if you delete that, it should re-register it on startup.

ocdtrekkie commented 2 years ago

Since he is only really worrying about moving one grain here, I think, that might be a simple enough solution.

I do wonder if it'd be reasonable to add a button that deletes that file to our TTRSS to reschedule. Grain Settings shows what the server has scheduled, but I'm not sure if it could clean up extraneous events from someone hitting the button multiple times though.

zenhack commented 2 years ago

Quoting Jacob Weisz (2022-06-05 22:16:35)

Since he is only really worrying about moving one grain here, I think, that might be a simple enough solution.

I do wonder if it'd be reasonable to add a button that deletes that file to our TTRSS to reschedule. Grain Settings shows what the server has scheduled, but I'm not sure if it could clean up extraneous events from someone hitting the button multiple times though.

The solution I outlined above is little enough work that I don't know if I think a stopgap like this is even worth the trouble (I'm not sure it would even be easier, and it would be a patch we'd have to carry).

ocdtrekkie commented 2 years ago

Yeah there's no way to either drop or disable cron jobs on Sandstorm's side so revealing that in UI would be dangerous. I'm much more comfortable recommending people delete a file in a grain backup then end up needing to modify their server database.

jdougan commented 2 years ago

Would it be reasonable to just not backup the updates-scheduled file?

ocdtrekkie commented 2 years ago

Not really, Sandstorm wants to backup everything, and that file is a TTRSS-specific choice. But you can delete it out of the backup zip file before restoring it to your new server.

jdougan commented 2 years ago

Grain Settings shows what the server has scheduled, but I'm not sure if it could clean up extraneous events from someone hitting the button multiple times though.

Huh? Where is grain setting in the UI?

But you can delete it out of the backup zip file before restoring it to your new server.

And It apparently works if you just shut everything down, remove the file and start everything again.

For that matter, assuming grain IDs are unique, you could just stick the grain ID that created the updates-scheduled file in the file, and when it comes back up check against the current grain id.

ocdtrekkie commented 2 years ago

There is a gear on the top bar titled Grain Settings. Presently it only shows you if scheduled tasks exist, but more will be added to it later.

The issue with your idea there is that grain IDs are not intended to be visible to the grain itself.

jdougan commented 2 years ago

There is a gear on the top bar titled Grain Settings. Presently it only shows you if scheduled tasks exist, but more will be added to it later.

Good. One of sandstorm's major weak points is the admin tools.

The issue with your idea there is that grain IDs are not intended to be visible to the grain itself.

There's no capability so the grain can get the grain id? Does it at least know it is running in sandstorm?

zenhack commented 2 years ago

Quoting John Dougan (2022-06-05 23:55:43)

Does it at least know it is running in sandstorm?

I mean, stuff has to be specifically packaged for sandstorm, so the packager can indicate this however they want (including hard-coding it somehow). An environment variable is often a suggested approach, but apps that are written without sandstorm in mind and then ported usually either don't care or need patching.

ocdtrekkie commented 2 years ago

I'd agree, there's a lot of stuff Sandstorm does or handles that just isn't super visible. Auditing capability was a big goal of Sandstorm originally, so exposing a lot more of that functionality in the UI is something that really excites me.

Generally apps have a SANDSTORM=1 HTTP header passed to them, so apps can easily enable Sandstorm-specific behavior. TTRSS does, it's defined here: https://github.com/zenhack/ttrss-sandstorm/blob/master/.sandstorm/sandstorm-pkgdef.capnp#L122

However, we generally do not want Sandstorm apps to know anything about the "outside world", so generally they cannot see their grain ID, the sharing URLs, etc. (Offer templates, such as used for TTRSS's mobile app support, are in an iframe. The app itself can't see those values.) This helps ensure everything runs through and is secured by the Sandstorm platform.

jdougan commented 2 years ago

we generally do not want Sandstorm apps to know anything about the "outside world"

That is a really limiting view of what sandstorm could do. It's like insisting the iOS apps can't ever be aware of what the system configuration looks like. I mean, for security reasons you may decide that a given app should not have that info (GPS, for instance) but no one (except competing GPS vendors) would insist that NO apps should ever have access.

One of the alleged wins with a capabilities system is you can build caps that support special cases without compromising the general case. I would think a capability(ies) to support system identification of some sort would be an obvious place to start. Doesn't have to be a direct grain id, could be some cookie/token that sandstorm gives you that that you can ask sandstone about later to see if it is still valid.

ocdtrekkie commented 2 years ago

Sure, but adding that to Sandstorm is on par with or more work than what Ian already would preferably do with the scheduling API to solve this. And it'd be arguably better/cleaner for the scheduling API to handle this correctly, over implementing an unrelated system identification API to guess that since the system has changed, updates are probably not scheduled.

zenhack commented 2 years ago

I think @ocdtrekkie omitted an implied "by default" -- indeed, especially once the powerbox is more mature we want connecting apps to things to be pretty fluid -- but apps should be isolated by default.

That said, I don't see an enormous problem with the grain knowing its grain id. But I also don't see compelling use cases, and it seems like a good way for apps to break when cloned/restored/etc.

@ocdtrekkie is correct in that the solution I outlined above isn't actually harder than doing this by exposing the grain id. I still think that's the "right" solution: making schedule() idempotent by supplying an id field.

ocdtrekkie commented 2 years ago

Yeah, that was my intent better stated, thank you.

jdougan commented 2 years ago

I think @ocdtrekkie omitted an implied "by default" -- indeed, especially once the powerbox is more mature we want connecting apps to things to be pretty fluid -- but apps should be isolated by default.

Ah. I failed my "read between the lines" roll. The schedule ID change seem reasonable to me, but it isn't like I understand sandstorm internals to any depth. I've looked around but C++ isn't my forte.

zenhack commented 2 years ago

The relevant code for this is on the javascript side, actually. If you're interested in taking a whack at it, the main files of interest are:

zenhack commented 2 years ago

Ack, actually there is a bit of C++ logic, in supervisor.c++, but it's just forwarding stuff along to the js. Also relevant:

jdougan commented 2 years ago

After restarting the updater and letting it go for a bit, the disk bandwidth usage has dropped considerably and more closely resembles the original installation.

zenhack commented 1 year ago

Quoting John Dougan (2022-06-05 23:20:47)

Huh? Where is grain setting in the UI?

It's the gear icon in the top bar.

And It apparently works if you just shut everything down, remove the file and start everything again.

Yep, glad you figured it out.

For that matter, assuming grain IDs are unique, you could just stick the grain ID that created the updates-scheduled file in the file, and when it comes back up check against the current grain id.

I don't think we actually have access to the grain id at the point when this is scheduled (it happens during startup, before any requests are handled).