webrecorder / browsertrix

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
https://browsertrix.com
GNU Affero General Public License v3.0

[Feature]: Add date of event being archived #1808

Open osintalex opened 1 month ago

osintalex commented 1 month ago

What change would you like to see?

I would like to be able to add a metadata field when archiving an event. For instance, suppose I am archiving a Twitter post about Biden's inauguration. I would like to be able to add metadata recording that this took place on January 20th, 2021. I would also like to be able to search by this date later.

Context

I think this would be helpful when archiving historically significant events. It would really help for serving content to end users, who will likely want to search for data within given time ranges. At the moment, from what I can see, the only time metadata a crawl saves is the time of the crawl run.

For example, this model could get a field like: timeMetadata: Optional[datetime] = None. The UI could allow a user to add this when creating a crawl config and allow searching by this field when filtering past crawls.
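To make the suggestion concrete, here is a minimal sketch of what such an optional field might look like. It uses a standard-library dataclass as a stand-in; `CrawlConfigSketch` and its `name` field are hypothetical and not the actual Browsertrix model, and only `timeMetadata` comes from the proposal above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class CrawlConfigSketch:
    """Hypothetical stand-in for the real crawl config model."""

    name: str
    # Date of the event being archived; stays None when the user leaves it blank.
    timeMetadata: Optional[datetime] = None


# A crawl with a manual event date, and one without.
inauguration = CrawlConfigSketch(
    name="Inauguration tweet",
    timeMetadata=datetime(2021, 1, 20),
)
undated = CrawlConfigSketch(name="Undated crawl")
```

Because the field defaults to `None`, existing configs would not need to supply it.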

Happy to work on this if there is support for the feature.

Shrinks99 commented 1 month ago

Interesting! We've purposefully kept the metadata page fairly sparse, wanting to see what requests would be made before working on anything based on assumptions.

RE: Search: This is something we are addressing at a later date. I know discovery of content within Browsertrix could be improved :)

Not against the addition of an optional date field but I do have some questions:

  1. When searching, how would you expect this to work? If a user has not entered a manual date, should the finish time date be used? Should manual dates and finish time dates be separate fields? Ranked differently in relevance? etc?
  2. Where else might you use this information in the app? Collections perhaps?
  3. Instead of an additional metadata field for this use case, might you be better served with improved search (imagine the full text indices were searchable — or even just the archived item descriptions)?

Would also like @tw4l's feedback :)

osintalex commented 1 month ago

Hey!

Cool :-)

For 1, I think these should be separate fields as they denote different information. Given that manual date is optional, I probably wouldn’t expect it to be ranked and only factored in when I explicitly search for it.

For 2, I’d definitely want it in collections. Main use case for me would be something like ‘collection of events relating to 2020 election’ or something like that. To this end, it could be useful to create a collection based on all archives that have a tag and have a manual date within a range.
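A rough sketch of the selection described in points 1 and 2, assuming the manual date is stored separately from the crawl finish time. All names here (`ArchivedItemSketch`, `event_date`, `select_for_collection`) are hypothetical and only illustrate the idea; they are not existing Browsertrix code.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class ArchivedItemSketch:
    """Hypothetical archived item with separate finish time and manual date."""

    name: str
    finished: datetime                     # when the crawl run ended
    event_date: Optional[datetime] = None  # manual date, optional
    tags: List[str] = field(default_factory=list)


def select_for_collection(items, tag, start, end):
    """Return items carrying `tag` whose manual date lies in [start, end].

    Items without a manual date are skipped; the finish time is never
    used as a fallback, since the two fields denote different things.
    """
    return [
        item
        for item in items
        if tag in item.tags
        and item.event_date is not None
        and start <= item.event_date <= end
    ]


items = [
    ArchivedItemSketch("debate", datetime(2024, 5, 1), datetime(2020, 10, 22), ["election"]),
    ArchivedItemSketch("inauguration", datetime(2024, 5, 2), datetime(2021, 1, 20), ["election"]),
    ArchivedItemSketch("undated", datetime(2024, 5, 3), None, ["election"]),
    ArchivedItemSketch("other", datetime(2024, 5, 4), datetime(2020, 11, 3), ["sports"]),
]

selected = select_for_collection(
    items, "election", datetime(2020, 1, 1), datetime(2020, 12, 31)
)
```

Here only the item named "debate" is selected: "inauguration" falls outside the range, "undated" has no manual date, and "other" lacks the tag.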


tw4l commented 1 month ago

Hi @osintalex, thanks for the issue! The use case and rationale make perfect sense to me. I think if we're going to expand the metadata fields on the archived items, I'd like to do it in a way that is extensible for adding additional fields in the future and ideally aligned with a descriptive metadata standard (or at least fairly easily crosswalk-able) such as Dublin Core (in which case "coverage" seems likely the most applicable term here).

I'm going to think a little on how we might achieve these aims, but overall I'm in favor!
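One way a crosswalk to Dublin Core could look, purely as an illustration: the internal field names (`name`, `description`, `event_date`) and the `to_dublin_core` helper are assumptions made for this sketch, while `title`, `description`, and `coverage` are elements of the Dublin Core element set.

```python
from datetime import date
from typing import Any, Dict

# Hypothetical internal field names mapped to Dublin Core element names.
# Adding a new field later only means adding another entry here.
FIELD_TO_DC: Dict[str, str] = {
    "name": "title",
    "description": "description",
    "event_date": "coverage",
}


def to_dublin_core(item_metadata: Dict[str, Any]) -> Dict[str, str]:
    """Map known internal fields to Dublin Core elements, skipping the rest."""
    record: Dict[str, str] = {}
    for field_name, value in item_metadata.items():
        dc_term = FIELD_TO_DC.get(field_name)
        if dc_term is None or value is None:
            continue  # unmapped or empty fields are left out of the record
        if isinstance(value, date):
            value = value.isoformat()  # dates are written in ISO 8601 form
        record[dc_term] = str(value)
    return record


record = to_dublin_core(
    {
        "name": "Inauguration tweet",
        "event_date": date(2021, 1, 20),
        "crawl_id": "abc",  # no Dublin Core mapping, so it is dropped
    }
)
```

Keeping the mapping in a single table is what makes the approach extensible: the stored fields can grow without changing the export logic.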