webrecorder / replayweb.page

Serverless replay of web archives directly in the browser
https://replayweb.page
GNU Affero General Public License v3.0

Should display metadata records in the Info tab #109

Open gwiedeman opened 2 years ago

gwiedeman commented 2 years ago

WARC files can have metadata records. It seems relatively common for these metadata records to be arbitrary JSON key-value pairs.

As a consumer of WARC files, I would like to view metadata included within a WARC file. I would expect this information to be displayed in replayweb.page's "Info" tab.

Our current use case is to preserve email messages as WARC files for improved encoding support and the possible inclusion of externally hosted resources. We are writing email headers as WARC metadata records. While I wouldn't think this would be a canonical use case for replayweb.page, it may still serve as a helpful example. It seems common enough for key-value metadata to be stored this way, and adding replayweb.page support would go a long way toward making metadata records more transparent and useful.

Here is a sample metadata record:

WARC/1.0
WARC-Type: metadata
WARC-Record-ID: <urn:uuid:ac47520f-4814-4916-b210-49033004ab3f>
WARC-Target-URI: http://mailbag/12/headers.json
WARC-Date: 2022-06-22T15:13:29Z
WARC-Payload-Digest: sha1:67PHPDHFC57INPX2GBJAJL46ZRZBDBOW
WARC-Block-Digest: sha1:67PHPDHFC57INPX2GBJAJL46ZRZBDBOW
Content-Type: application/json
Content-Length: 12199

{
    "ARC-Authentication-Results": "i=2; mx.microsoft.com 1; spf=pass (sender ip is\r\n 0.0.0.0) smtp.rcpttodomain=albany.edu\r\n smtp.mailfrom=thecrowleycompany.com; dmarc=pass (p=reject sp=none pct=100)\r\n action=none header.from=thecrowleycompany.com; dkim=pass (signature was\r\n verified) header.d=crowleycompany.onmicrosoft.com; arc=pass (0 oda=0 ltdi=1)",
    "ARC-Message-Signature": "i=2; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com;\r\n s=arcselector9901;\r\n h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1;\r\n bh=W6cfyKoMz0nRFK3Vocfaz75BgMptiCqM/7Bju49f+pw=;\r\n b=as3W1kNwciH8IA1oCRBzmjg0xfgtMjTkdGBQI1q8ca14lEXhpJji7G5onKlf4PIzkfsxOi8fTZ7V0z5W5qK1W1jG6WILs5SoO0y1vteRI3m7rd7yLTW6xLhU3nhUeVeGva3GbFPAyrsVpRcCP4LfrmAP0b2KDewQFv6NXcoDZwuPghpCgfcV1Apy2MnFK0MGOdyzPRPpyWOUhuwzh2Cl1wBkkDtfVD75zdOzV/Rcm2fgKkR/VVW3mPTIjsKU+DYa7qdWONDkgW1R5Qz9+2MZt8g7RuGySW3nwVKMeZwaW93S419ZPCAlVzykoocKW7kkAIqxr1a+DoeoakK8ndJ8uA==",
    "ARC-Seal": "i=2; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=pass;\r\n b=YYjUThqy4dhCeVb6PllD7wHzjiriQDKtzfTj38wZuS2e4zVLW/TklT5s3rtBIGJLbHKHbI5A8mQG9mKpSh0E6CPgU4V3jlgfhnj+m2QxTyPZ+bs0m7dNKe6ArQ7+ZAwQJ2jVk1EvMPihK/4UQ1e0+xPqne/ns+6tKeFutIft3cbwItBjxjG37GI65DTOIRGCOo7KAN+lWqPZjFOzV+w2DabeFB0RGqSew4yXdwXL5UHby4AlXcwhof2dC1pIVEAhDlzUxpNDuq19pyLiBRMtbf9hFh9SLMCFronW0E2sJ3XcETnBI8WsGB/E7VA41AoZnCqjFJ3WTFGad6zjSncOSA==",
    "Accept-Language": "en-US",
    "Authentication-Results": "spf=pass (sender IP is 0.0.0.0)\r\n smtp.mailfrom=TheCrowleyCompany.com; dkim=pass (signature was verified)\r\n header.d=crowleycompany.onmicrosoft.com;dmarc=pass action=none\r\n header.from=TheCrowleyCompany.com;compauth=pass reason=100",
    "Authentication-Results-Original": "dkim=none (message not signed)\r\n header.d=none;dmarc=none action=none header.from=TheCrowleyCompany.com;",
    "Content-Language": "en-US",
    "Content-Type": "multipart/mixed;\r\n\tboundary=\"_007_MN2PR19MB407776A0698BD977B5EE2F3FB3289MN2PR19MB4077namp_\"",
    "DKIM-Signature": "v=1; a=rsa-sha256; c=relaxed/relaxed;\r\n d=crowleycompany.onmicrosoft.com; s=selector2-crowleycompany-onmicrosoft-com;\r\n h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck;\r\n bh=W6cfyKoMz0nRFK3Vocfaz75BgMptiCqM/7Bju49f+pw=;\r\n b=pIU57W0tHLUYUYyjSstHPRTUELHiegsY8buAMb47UttnXwOPbR7IdIo62NLDglcysBUecFSOgoyTDw+l9IxnJy0o+JXgqniAb6NGTH3tHDV3agN4tk5RMklEd5c+3Z8SA8MUSWulkG1IdWuKVGIhACMyv9eDUpVAbphSEPZjDj4=",
    "Date": "Thu, 3 Feb 2022 20:01:10 +0000",
    "From": "Redac Ted <redacted@TheCrowleyCompany.com>",
    "MIME-Version": "1.0",
    "Message-ID": "\r\n <MN2PR19MB407776A0698BD977B5EE2F3FB3289@MN2PR19MB4077.namprd19.prod.outlook.com>",
    "Received": "from DM6PR04MB5769.namprd04.prod.outlook.com (redacted)\r\n by PH0PR04MB7160.namprd04.prod.outlook.com with HTTPS; Thu, 3 Feb 2022\r\n 20:01:21 +0000",
    "Received-SPF": "Pass (protection.outlook.com: domain of TheCrowleyCompany.com\r\n designates 0.0.0.0 as permitted sender)\r\n receiver=protection.outlook.com; client-ip=0.0.0.0;\r\n helo=NAM12-MW2-obe.outbound.protection.outlook.com;",
    "Return-Path": "redacted@TheCrowleyCompany.com",
    "Subject": "The Crowley Company - Digitization & Archiving Solutions",
    "Thread-Index": "AdgZMCT+iB3EgJAoQkCtadDKhOeWwg==",
    "Thread-Topic": "The Crowley Company - Digitization & Archiving Solutions",
    "To": "Redac Ted <redacted@TheCrowleyCompany.com>",
    "X-EOPAttributedMessage": "0",
    "X-EOPTenantAttributedMessage": "b5d22194-31d5-473f-9e1d-804fdcbd88ac:0",
    "X-Forefront-Antispam-Report": "\r\n CIP:0.0.0.0;CTRY:US;LANG:en;SCL:1;SRV:;IPV:NLI;SFV:NSPM;H:NAM12-MW2-obe.outbound.protection.outlook.com;PTR:mail-mw2nam12on2104.outbound.protection.outlook.com;CAT:NONE;SFS:(13230001)(4636009)(9686003)(1096003)(7696005)(45080400002)(58800400005)(55016003)(6506007)(8676002)(40140700001)(33656002)(15974865002)(6862004)(450100002)(166002)(86362001)(52536014)(34756004)(82310400004)(28085005)(7636003)(7066003)(22186003)(26005)(8636004)(83280400002)(83290400002)(83300400002)(83310400002)(83320400002)(83380400001)(21480400003)(336012)(5660300002)(356005)(6200100001)(85282002);DIR:INB;",
    "X-Forefront-Antispam-Report-Untrusted": "\r\n CIP:255.255.255.255;CTRY:;LANG:en;SCL:1;SRV:;IPV:NLI;SFV:NSPM;H:MN2PR19MB4077.namprd19.prod.outlook.com;PTR:;CAT:NONE;SFS:(13230001)(4636009)(366004)(66476007)(66446008)(7066003)(6862004)(66946007)(166002)(76116006)(8936002)(26005)(7696005)(8676002)(64756008)(2906002)(40140700001)(6506007)(9686003)(38070700005)(6200100001)(186003)(71200400001)(316002)(33656002)(296002)(55016003)(7416002)(45080400002)(508600001)(99936003)(38100700002)(122000001)(86362001)(5660300002)(83380400001)(52536014)(66556008)(85282002);DIR:OUT;SFP:1102;",
    "X-MS-Exchange-ABP-GUID": "c9fc9e99-7bba-4af1-80a7-1da392a88d97",
    "X-MS-Exchange-AntiSpam-MessageData-Original-0": "\r\n udWyLKTc+8MN9gE3feq4uUZMQClI8hw0+hTJOsxpU+7/whbR8oM6JzuieUSI rElD/1RhAVn73Fy+Xcs+IT93g8LbIQCbKV6YjvEncxdUqIOug7u5agSQYNcm riErQdY746sL557cWeLPEr9sg4QUnJJp5w6joYnWRNIak+KPbEn9DfCnA+un 32s894LxkdpofxWKWT9YJCiwGuV5Ha+HHmP81OX4UIaC9Aw8btilKjlDNUbR xgzUbLSjz7KDi0bv0U2tb1CFCrXX3lrNtav+mYq7AVWjHOEJx6Ub4l8FOU2Q ju9oOrBUyRh2tWjR7GQzpjQnc+JEfA0JrD/3UGdUGBLMk+cMQZUF8CMxohaI YHkYSzDOSeXfPL4SqMo93I+k24sfuYorqxuucPoSjCBgjwrFc1ZbnN+lW/0i 4N350muIEWz3Aq7sdB7HXKx1UN5yloOkrdnbUKmUhOYwujABK1/6cMjYIRh8 fReL+LutObAaNGJw8OYScnunMQrlnbWJiPxABZ9dET4ooKi20+qDWRnC0+qB mtiqLdMcUnmYP4vfquJJgp2KtvcktUC6nhY7h5Uw1O8JDeVg1qxSIPjS2irL aisqiELDyJSKz8VvvErz1sVHcoUKnLttNkC7QhmRKY00cyjwvBau2uREtCWs /RpQ1LZMo2Ecd1QpKRZdh/il++lmPDiDOW0hhBwbjOeyLGemI4gujn8XaLTQ qlBCXgiVOaknbNa/aBcjB1G9rP6IeM13uPt7F0MB1Vuu5pl3IfPI1ghoOFrZ 924cZ95KnToBCR0Mws4LIuNJzgs+/78jf94yFQUMvYs8+obvWBp6WoFd/8yR UuLaXOpy/5/d4Cw10BT1jWBOBQcXjalJZkSvboytPUtk+mWrd9b/Tb2K5gzW XOrkRPTADZnCSc2rUEQD53fYoRaDrTPJoaeid4TFkLnUdpplnBgu9hH/b6MA eesN24dklfjR0FIgPHjbVitj/EpWcbJieJtHAkaXT98lO7ts4kXb3ObrKh6h 1mugAHbfrMjp1nUDNwOhjJAGFGAqNBBfUbZyBcT4KsOYahGAFmqI6P4j6AdQ FRkX3L2C1IBac/Hbr8Gu08pfhMhpsNpbmSoh6/27+jn//Pv1rAzLoEo2wZV/ HSaYQfK0T1O0SVgAQyXsotKcMuDh+gNAmFEyOjZyyiaSRAIDrnArSMfF/AVT 7ZyyxxVW6QUHXlRV0oisSkCR1UWgRaNxyQFbksDZH4MVjWEFO4TNOe9YVXL9 JG8xpmlnsEFb5DWSAeT3eixl5KA4JJTo7PMVKixw7WH9nsHMU4PTkLvl5hvy kTL88LPatvlihmm3imDN7OoEbj2jRp++HConrWhNF4fMAvL4be370HD89+nL NkK6EzN3v9ewlLKDj/+KSOxeSAEwIvug4nVKD15j/vrogZ4D82B+gKfADojw NMDLobG1EM25+On+ClaleeQtOBQOO9uFDYMqNhVP/KBaGaJoXwwGLc0CgEyQ aoljHf+P5NAToM2TI5YlcUwTTGZb387tt11p9Jh3raFRnqZBP6sTxcVHYOI4 C77G/Z6PhQipJklbWOc5HxBxSH9fLYNnvUtRpXuFjIVz+FEJzzEQxcD6ACrW rBLXFilSWl+OOtGVG9zG8Opj45lf9YQt2qEYCHb1b9rOojtvaIAq6Tm3uBJE lR3NlA==",
    "X-MS-Exchange-AntiSpam-MessageData-Original-ChunkCount": "1",
    "X-MS-Exchange-CrossTenant-AuthAs": "Anonymous",
    "X-MS-Exchange-CrossTenant-AuthSource": "\r\n MW2NAM04FT019.eop-NAM04.prod.protection.outlook.com",
    "X-MS-Exchange-CrossTenant-FromEntityHeader": "Internet",
    "X-MS-Exchange-CrossTenant-Id": "b5d22194-31d5-473f-9e1d-804fdcbd88ac",
    "X-MS-Exchange-CrossTenant-Network-Message-Id": "d459af4b-1589-4fec-e194-08d9e74fef87",
    "X-MS-Exchange-CrossTenant-OriginalArrivalTime": "03 Feb 2022 20:01:15.5634\r\n (UTC)",
    "X-MS-Exchange-Organization-AuthAs": "Anonymous",
    "X-MS-Exchange-Organization-AuthSource": "\r\n MW2NAM04FT019.eop-NAM04.prod.protection.outlook.com",
    "X-MS-Exchange-Organization-ExpirationInterval": "1:00:00:00.0000000",
    "X-MS-Exchange-Organization-ExpirationIntervalReason": "OriginalSubmit",
    "X-MS-Exchange-Organization-ExpirationStartTime": "03 Feb 2022 20:01:15.6728\r\n (UTC)",
    "X-MS-Exchange-Organization-ExpirationStartTimeReason": "OriginalSubmit",
    "X-MS-Exchange-Organization-MessageDirectionality": "Incoming",
    "X-MS-Exchange-Organization-Network-Message-Id": "\r\n d459af4b-1589-4fec-e194-08d9e74fef87",
    "X-MS-Exchange-Organization-SCL": "1",
    "X-MS-Exchange-Processed-By-BccFoldering": "15.20.4951.014",
    "X-MS-Exchange-Transport-CrossTenantHeadersPromoted": "\r\n MW2NAM04FT019.eop-NAM04.prod.protection.outlook.com",
    "X-MS-Exchange-Transport-CrossTenantHeadersStamped": "BY5PR19MB4033",
    "X-MS-Exchange-Transport-CrossTenantHeadersStripped": "\r\n MW2NAM04FT019.eop-NAM04.prod.protection.outlook.com",
    "X-MS-Exchange-Transport-EndToEndLatency": "00:00:06.2449819",
    "X-MS-Has-Attach": "yes",
    "X-MS-Office365-Filtering-Correlation-Id": "d459af4b-1589-4fec-e194-08d9e74fef87",
    "X-MS-Office365-Filtering-Correlation-Id-Prvs": "\r\n e9f32e6f-281d-46d7-f7b0-08d9e74fecd7",
    "X-MS-PublicTrafficType": "Email",
    "X-MS-TNEF-Correlator": "",
    "X-Microsoft-Antispam": "BCL:0;",
    "X-Microsoft-Antispam-Mailbox-Delivery": "\r\n\tucf:0;jmr:0;auth:0;dest:I;ENG:(910001)(944506458)(944626604)(920097)(930097);",
    "X-Microsoft-Antispam-Message-Info": "NOha04mTymhtlRbe9mtqUhXudpE56lOWEpiK4cMOqxuhqlj2FQmaqhA5g5i/ J3hHXkUR/xO8Ih+NjhF63WULKXQvftn58tkNCXDobQ6QOL6b4RvmSOcdciIX 1PXf8XYy4pBWuR/V76K1MTlqtpEumEONrGzvAFTT5msObFu3gyVN0rhT9lLZ +PR0r0rmlCqI3lSY9xNHR0pulMkHmTdvKUm3p9xvoZBey3nJ7CNGae6KDI5v vMz/lPe+SsrhhCN8tdpVT0N7LTbHeBEn/o0NyS8DZkebThvJ11jQKqmPtr5b k8lcinanrUb8+XBtKD2g+x9IPvrIiOnTOrfg2SxUkiZtduQkYyvR6je6KJB+ F/zpHw1o1zDWkmprKQwVCQ7JICvaqSXI5kPn+8LYoKxGlbTyyqzdCi1/jdww W+pS/vI4Nk3cQPnvfTAQg7+MNvaFM4WqQwEhOK5aC34ZJqnmxJ8TTEMGYwUM KqXbKYXU40hKcsaJyHTQVFTksSiiHVff0m8ioNMmhIQEWsYEsGUXHhJfyo7Z emc+OShIPVpXDFKAOWSG2l+GghJapiqiLJZdljNsGMCTXpTqTPH/aTs0CbOh BUvdDehXYTWMAfqWOHAJBYrF+pL3hqzi0ItTLWZyXkc2bnFeGx6hciOTVjdq DaJQGO0EGkr9KqKeoMvxH0w4fDnlQZHWFVUyxCXfWEYLRXjEYwqTTv5aelNx 27ogcBL13t3IsORmGb0ubrso0tH+4rKEUpP+3kDGmCpq5jrsxnfqQhZ6aQZr oR/fWdSzUybgTymeT4kP50UpbYeGZ/xfd9Wc9WLWD6Wc8SUgulTGDswcuwqm b55D9jhyUQbC0unzAjUjgj86MmJVkbsxfi+FWdCQqP8j3+9gvygbHxPlrYVV XkkzwD6bLfkfzeK6bd60pkYEuxT4ZccJVRHmEhNA0HP4vC2SqhvnH3M2lWbC 6dhA7WQiC3nXQw3RuIop0PBhcdoL0AIFLBorONYrGcz2BZLFSg/Fyh67dREw i1g+OEf2ypzoVwmuym3Yo9eapA8H8Kes9RHf7Q6pmPT9zPbAL3om+EJtovOD LD4eAaLKUAc0Ykz9wnFyRSI5UmyXb2xYnIOAQuiCxnhhFKN/tUHZ2288XpSr c9qVjT0LuYPjE4JqBsqUaUvtUOweoNHBNjANVfNSze1yft2MuJOLcqhb9308 dO/9FuxS/ouR3itao/wuapwJJ5jzWN3+FHty8ouhYz6fnwpb4ygpc+z6E8qp LnbMmyCEI/r6494WOkXtRT1i9sxzYfNSx9jrAwcqy30oWv/d5/tgOxaXQEG2 A/Vd0LJEGOA4qphPzk8ObPUIgGNcBu3R1oef2xWEBFdKZc2q9drHaYPKhsDI 5ZiiiFE4otMpI3GoBMP8CBXCru5ojI3ZUSt8Z2aUPHmPb61nscVyuAaZ6Z0P Sw/6+OIXiEAsI4ZIxvQC7g4luHLoC2ZV+VQ22yLvNJDSR84KC2NJDdtAeYYl q5CNyGa03SyU82e8E/alo3sa0ULMA04LwmNj9vot58zyqNAPq/mlaAtSDrkH QwGQkxXiDhBvoYq7aU3D+aueJ4Z+JAyiixi8ufNB0qeNpzRCqdAS3kgW7JU2 7i6L2vF1uT/UgrfLXqnt0XHh4uMj2WdhVmyYV+4pXzSLnm4qPqq/mTpXqT8u oKNgroH5AoR/cKJO8VR/JNiDb6yx2ru0m4qkqDw90WDBsFW4aGRcB3Dpfd67 GjeAwxHf930EIxuLeVjbSbWT388pInhTARW7tUwAO7uu15/KW6bNJmt3jQHa chjt0SpvluiYYfSYpxeIIFDFThkAZFBOIPPQ5u0e71ZWf8SaiGvOdVMjNiMP bHPONocnqbrhysjRIhBn0VNt19xUay9vJYknaajTeLyjyo25yvJXt/2irUe2",
    "X-Microsoft-Antispam-Message-Info-Original": "\r\n 0zbQqxcfrVIJQ3Jfze3w0VzZhr49oWdkRJ/feC/gmCfaBc5bEr0clMwW91CtK1oJLZRUkZhePY1r41z/MwZUE4fWWPHfqN1stT5aQnZQGV4r9fCHHm4bAIw8wCss1w/kKOYXEUZO3bbzYIiny89GLQrZAq+262/8OnlJEY3p1fPc7EKOP8i3m2URNKIv8iUZgDf2CTM1wpnBEtY1bseqA1x8X22tsElcRj8ibpKMM3y3WKgcnzVUM99XxF7N99N1FQUkwONbIr2NLwN1P40ig7SyFNbglJvwXwQ2G4BjxPVfh64AsLHdldXtnfkrHVthyhL5KzZ/D5gODT3rFXSHROhpEQdYghE+5BYtRfygq0XGj906eOPFPjZpVScrtwYphsgk00gZljnpcON2mZpNVedvGWSA/yRq354PWbaXO4Ub7K0Szx7QSirLxuMKeMeNv9R//rJ6iRXVsPm6rya+ztg/MeVpCsMt30XhTKS8+OpPAy8jh2qJEbJMsE30YEU/X0oRQid+z6o0VvyzwEq8L6QSOHTJJV4U4XqsE60T6BiGKXYwAVuj87UOSrbWZqreSks/+1zQC5UkGqP+zNXTZvnM0jsED5C8hUwLiv9IopkXnPrqwe72B5AYVPREgcptgkL0LH3cUwDKSC2284VTAqiW7rigqoaXY6lQyqLJBpLPXacpAJSkLGgE93QysfEyBnpWd7bmPL3FdEH3fHLSBGCrJ+37kjmK8NySTRfkGezNDbYEqIk2nI9F0aeXbbqSqg6lBaFxKQdwcx8hMQhaGi0VvOVO+15B2bu8R11crus=",
    "X-Microsoft-Antispam-Untrusted": "BCL:0;",
    "date": "Thu, 3 Feb 2022 20:01:10 +0000",
    "x-microsoft-antispam-prvs": "\r\n <BY5PR19MB40334749137576D3A0E7B7AFB3289@BY5PR19MB4033.namprd19.prod.outlook.com>",
    "x-ms-exchange-antispam-relay": "0",
    "x-ms-exchange-senderadcheck": "1",
    "x-ms-oob-tlc-oobclassifiers": "OLM:826;OLM:826;",
    "x-ms-traffictypediagnostic": "\r\n BY5PR19MB4033:EE_|MW2NAM04FT019:EE_|DM6PR04MB5769:EE_"
}
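For readers unfamiliar with the layout of such a record, the following is a minimal sketch of how one uncompressed metadata record could be split into its WARC headers and JSON block. It uses only the Python standard library; the function names `parse_warc_record` and `metadata_as_json` and the shortened sample payload are illustrative assumptions, not part of replayweb.page or of the tool that wrote the record above.

```python
import json


def parse_warc_record(raw: bytes) -> dict:
    """Split one uncompressed WARC record into version, headers, and payload."""
    # The header block ends at the first blank line (CRLF CRLF).
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, so values such as URLs stay intact.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    # Content-Length gives the size of the record block in bytes.
    length = int(headers.get("Content-Length", "0"))
    return {"version": version, "headers": headers, "payload": rest[:length]}


def metadata_as_json(record: dict):
    """Return the decoded JSON block of a metadata record, or None otherwise."""
    headers = record["headers"]
    if headers.get("WARC-Type") != "metadata":
        return None
    if not headers.get("Content-Type", "").startswith("application/json"):
        return None
    return json.loads(record["payload"].decode("utf-8"))


# Hypothetical, shortened stand-in for the sample record above.
body = json.dumps({"Subject": "Example", "Accept-Language": "en-US"}).encode("utf-8")
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: metadata\r\n"
    b"WARC-Target-URI: http://mailbag/12/headers.json\r\n"
    b"Content-Type: application/json\r\n"
    b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"
    b"\r\n" + body + b"\r\n\r\n"
)

record = parse_warc_record(sample)
fields = metadata_as_json(record)
print(record["headers"]["WARC-Target-URI"])
print(sorted(fields))
```

A real WARC file is usually gzip-compressed per record and contains many records, so a production reader would rely on an existing WARC library rather than this hand-rolled parser.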
ikreymer commented 2 years ago

The Info tab is more for a fixed amount of metadata for an entire archive (e.g. data from a warcinfo record, which is usually found at the beginning, although there is no guarantee that there is one, and there can be multiple). With WACZ, hopefully we'll have more defined file-level metadata that can easily be accessed in this way.

But the metadata records are similar to response records in the sense that they have a URL and there can be a whole lot of them (some crawlers write a metadata record for every response record). I think it could make sense to add a category on the URL Search tab that filters and lists metadata records. That would require indexing them, which currently isn't done, but is definitely doable. It would be a category similar to HTML or 'Audio/Video', for example.
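As a rough illustration of that idea, here is a minimal sketch of an index that keeps metadata records alongside response records and exposes them as their own filter category. It is not replayweb.page's actual indexing code: the `IndexEntry` shape, the simplified input dictionaries, the "Metadata" and "Other" category names, and the `body.html` URL are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class IndexEntry:
    url: str
    date: str
    mime: str
    category: str


def categorize(rec_type: str, mime: str) -> str:
    """Map a record to a search category (category names are hypothetical)."""
    if rec_type == "metadata":
        return "Metadata"
    if mime.startswith("text/html"):
        return "HTML"
    if mime.startswith(("audio/", "video/")):
        return "Audio/Video"
    return "Other"


def build_index(records):
    """Index response and metadata records; skip everything else."""
    entries = []
    for rec in records:
        if rec["type"] not in ("response", "metadata"):
            continue
        entries.append(
            IndexEntry(
                url=rec["url"],
                date=rec["date"],
                mime=rec["mime"],
                category=categorize(rec["type"], rec["mime"]),
            )
        )
    return entries


def filter_by_category(entries, category):
    """Return only the entries in the given category."""
    return [e for e in entries if e.category == category]


# Simplified, hypothetical records as a WARC reader might hand them over.
records = [
    {"type": "warcinfo", "url": "", "date": "2022-06-22T15:13:29Z",
     "mime": "application/warc-fields"},
    {"type": "response", "url": "http://mailbag/12/body.html",
     "date": "2022-06-22T15:13:29Z", "mime": "text/html"},
    {"type": "metadata", "url": "http://mailbag/12/headers.json",
     "date": "2022-06-22T15:13:29Z", "mime": "application/json"},
]

index = build_index(records)
for entry in filter_by_category(index, "Metadata"):
    print(entry.url, entry.mime)
```

Treating the record type as the deciding factor, rather than the MIME type, keeps a JSON metadata record from being mixed in with ordinary JSON responses.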

gwiedeman commented 2 years ago

Sounds reasonable. Being able to facet using the search dropdown would fulfill our use case. Just getting metadata records to display like response records would help a lot, as they are currently not accessible from what I can tell.

WACZ is also interesting, as I suppose this info could go in datapackage.json. Definitely something we'll consider long term.