wezm / rsspls

Generate RSS feeds from websites
https://rsspls.7bit.org/
Apache License 2.0
267 stars, 9 forks

Add key-value argument for dynamic feed config #33

Closed Lcchy closed 3 months ago

Lcchy commented 4 months ago

Hi there,

For my personal usage I have been adding a feature to the feeds config.

I am subscribing to a website that has different "profiles" (a bit like mastodon or instagram). The website is the same but the content is different for each profile and I would like to subscribe to different profiles as separate feeds.

I could duplicate the feed config in the feeds.toml but I have close to 100 subscriptions so that would become cumbersome.

That is why I implemented this feature which enables the passing of an arbitrary key-value parameter that will be inserted into the config.

Example:

rsspls --parameter username=john

with the config:

[rsspls]

[[feed]]
title = "%<username> - Feed"
filename = "feed_%<username>.rss"

[feed.config]
url = "https://www.example.com/profile/%<username>/"
...

will simply replace each "%<username>" with "john" and then generate the feed as before.
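
For illustration only, the substitution described above could be sketched in Python as follows. This is not the code from the PR; the function name `substitute` and the rule that unknown keys are left untouched are assumptions made for the sketch.

```python
import re

# Matches placeholders of the form %<key>, e.g. %<username>.
PLACEHOLDER = re.compile(r"%<([A-Za-z_][A-Za-z0-9_]*)>")


def substitute(template: str, params: dict) -> str:
    """Replace each %<key> with params[key]; unknown keys are left as-is (an assumption)."""
    def repl(match):
        key = match.group(1)
        return params.get(key, match.group(0))
    return PLACEHOLDER.sub(repl, template)


config = '''[[feed]]
title = "%<username> - Feed"
filename = "feed_%<username>.rss"

[feed.config]
url = "https://www.example.com/profile/%<username>/"
'''

# Equivalent in spirit to: rsspls --parameter username=john
rendered = substitute(config, {"username": "john"})
print(rendered)
```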

I already opened the PR so that it's easier to discuss, but feel free to reject it if you don't think it's applicable.

wezm commented 4 months ago

Hmm I appreciate the solid use case but I'm not sure introducing templating is the way to go. It kinda feels like an external tool that generated the config would be a better option. It would also have the benefit of fetching multiple feeds in parallel.

I've recently been playing with Nickel and it would be pretty neat for doing this. Of course you could do similar with a Python script or something. E.g.

if you had accounts.json (or accounts.toml) with the accounts you wanted to follow, like:

{
  "accounts": [
    "one",
    "two",
    "three"
  ]
}

You could generate rsspls.toml using Nickel code like this:

let accounts = import "accounts.json" in

let account_feed = fun account =>
  {
    title = "Posts from %{account}",
    filename = "%{account}.rss",
    config = {
      url = "https://example.com/user/%{account}",
      item = "article",
      heading = "h3",
      link = "h3 a",
      summary = ".post-body",
      date = "time"
    }
  }
in

{
  rsspls = {
    output = "/tmp"
  },
  feed = std.array.map account_feed accounts
}

The config would be generated with nickel export -f toml rsspls.ncl, which from the above two files looks like this:

[[feed]]
filename = "one.rss"
title = "Posts from one"

[feed.config]
date = "time"
heading = "h3"
item = "article"
link = "h3 a"
summary = ".post-body"
url = "https://example.com/user/one"

[[feed]]
filename = "two.rss"
title = "Posts from two"

[feed.config]
date = "time"
heading = "h3"
item = "article"
link = "h3 a"
summary = ".post-body"
url = "https://example.com/user/two"

[[feed]]
filename = "three.rss"
title = "Posts from three"

[feed.config]
date = "time"
heading = "h3"
item = "article"
link = "h3 a"
summary = ".post-body"
url = "https://example.com/user/three"

[rsspls]
output = "/tmp"
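
The Python route mentioned above might look roughly like the sketch below. It assumes the same accounts.json layout and the same selectors as the Nickel example; the helper names (`toml_str`, `feed_toml`, `config_toml`) are made up for illustration, and quoting values with `json.dumps` is a shortcut that is only safe for simple account names.

```python
import json


def toml_str(value: str) -> str:
    # json.dumps produces a double-quoted, escaped string, which is also
    # a valid TOML basic string for plain ASCII names like these.
    return json.dumps(value)


def feed_toml(account: str) -> str:
    """Render one [[feed]] table, mirroring the Nickel account_feed function."""
    lines = [
        "[[feed]]",
        f"title = {toml_str('Posts from ' + account)}",
        f"filename = {toml_str(account + '.rss')}",
        "",
        "[feed.config]",
        f"url = {toml_str('https://example.com/user/' + account)}",
        'item = "article"',
        'heading = "h3"',
        'link = "h3 a"',
        'summary = ".post-body"',
        'date = "time"',
    ]
    return "\n".join(lines)


def config_toml(accounts, output="/tmp"):
    """Render the whole config: the [rsspls] table followed by one feed per account."""
    sections = ["[rsspls]\noutput = " + toml_str(output)]
    sections += [feed_toml(account) for account in accounts]
    return "\n\n".join(sections) + "\n"


# In practice the list would be read from accounts.json on disk.
accounts = json.loads('{"accounts": ["one", "two", "three"]}')["accounts"]
rsspls_toml = config_toml(accounts)
print(rsspls_toml)
```

Redirecting the printed output to rsspls.toml would give a file equivalent to the Nickel export, apart from key ordering.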

Lcchy commented 3 months ago

Sorry for the late response!

Yes I actually agree, this is not how I've been using rsspls until now but I think you're right that using an external tool will be better for rsspls as it will stay focused. I will close the PR.

One other problem I might run into, is that when having a lot of "profile page" feeds being scraped from the same website, fetching all of those pages at once in parallel might end up in a rate limit or IP ban from the website.

Would you be interested in a PR to add the option to fetch a single feed? (something like --single-feed feed_name)

I can open a separate issue for this if you prefer.

wezm commented 3 months ago

> One other problem I might run into, is that when having a lot of "profile page" feeds being scraped from the same website, fetching all of those pages at once in parallel might end up in a rate limit or IP ban from the website.

Yeah, I can't seem to find a low per-domain connection limit in reqwest/hyper like the one browsers have, so this could be a problem. An option to fetch one feed is one possibility. Another would be a rate limit of sorts, like --wait in wget:

       -w seconds
       --wait=seconds
           Wait the specified number of seconds between the retrievals.
           Use of this option is recommended, as it lightens the server
           load by making the requests less frequent.
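
As a rough sketch of what wget-style waiting means (not something rsspls does today), the Python below runs jobs one at a time with a pause between them. The name `run_with_wait` and the stand-in jobs are hypothetical; a real job would fetch one feed.

```python
import time


def run_with_wait(jobs, wait_seconds, sleep=time.sleep):
    """Run jobs sequentially, pausing wait_seconds between consecutive jobs."""
    results = []
    for index, job in enumerate(jobs):
        if index > 0:
            sleep(wait_seconds)  # no pause before the first job
        results.append(job())
    return results


# Stand-in jobs; the sleep function is swapped for a recorder so the
# demo finishes instantly while still showing when pauses would occur.
pauses = []
jobs = [lambda name=name: f"fetched {name}" for name in ["one", "two", "three"]]
results = run_with_wait(jobs, wait_seconds=2, sleep=pauses.append)
print(results)
print(pauses)
```

Three jobs lead to two pauses, since the wait only happens between retrievals.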

Lcchy commented 3 months ago

I actually just realized that when dynamically generating the config, I could also just write the single feed I want to fetch with rsspls in the config, making the addition of --single-feed superfluous.
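
A minimal Python sketch of that idea: write a config containing only the one feed to refresh, then point rsspls at it. The helper names, the selectors (borrowed from the Nickel example above) and the file naming are assumptions; the `--config` flag should be checked against `rsspls --help` before relying on it.

```python
import json
import tempfile
from pathlib import Path


def single_feed_config(account: str, output_dir: str) -> str:
    """Render a config containing exactly one feed, for the given account."""
    q = json.dumps  # double-quoted string, fine for simple names and paths
    return "\n".join([
        "[rsspls]",
        f"output = {q(output_dir)}",
        "",
        "[[feed]]",
        f"title = {q('Posts from ' + account)}",
        f"filename = {q(account + '.rss')}",
        "",
        "[feed.config]",
        f"url = {q('https://example.com/user/' + account)}",
        'item = "article"',
        'heading = "h3"',
        'link = "h3 a"',
        "",
    ])


def write_config(account: str, directory: str) -> Path:
    """Write the single-feed config into directory and return its path."""
    path = Path(directory) / f"rsspls-{account}.toml"
    path.write_text(single_feed_config(account, directory), encoding="utf-8")
    return path


with tempfile.TemporaryDirectory() as tmp:
    config_path = write_config("one", tmp)
    written = config_path.read_text(encoding="utf-8")
    # Assumed invocation; run it with subprocess.run(command, check=True)
    # on a machine where rsspls is installed.
    command = ["rsspls", "--config", str(config_path)]
    print(command)
```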

With this I am actually all settled with the current state.

I see that --wait could be useful in some cases, but I don't need it right now.

Just for context, I am looking into these specific ways of using rsspls because I then serve the RSS files locally to my FreshRSS instance, which I use to keep track of my read items and other feeds. As FreshRSS has features to spread out the feed refresh calls, I want to let it trigger rsspls on single feeds.

Lcchy commented 3 months ago

I/you can close the issue if you feel content too.

wezm commented 3 months ago

> With this I am actually all settled with the current state.

Cool. I will close this off then.