Open vbanos opened 10 months ago
CDX Summary comparison of the two captures (Good and Bad):
$ zcat good/indexes/index.cdx.gz | ~/bin/cdxj2cdx.py | cdxsummary
Summarizing piped data: STDIN
CDX Overview
───────────────────────────────
Total Captures in CDX 202
Consecutive Unique URLs 202
Consecutive Unique Hosts 11
Total WARC Records Size 4.2 MB
First Memento Date Dec 01
Last Memento Date Dec 01
───────────────────────────────
MIME Type and Status Code Distribution
──────────────────────────────────────
MIME 2XX 3XX 4XX 5XX Other TOTAL
──────────────────────────────────────
HTML 3 0 0 0 0 3
Image 7 0 0 0 0 7
CSS 1 0 0 0 0 1
JavaScript 169 0 0 0 0 169
JSON 5 0 1 0 0 6
Text 0 1 0 0 0 1
Font 1 0 0 0 0 1
Other 13 0 1 0 0 14
──────────────────────────────────────
TOTAL 199 1 2 0 0 202
──────────────────────────────────────
Path and Query Segments
─────────────────────────────────
Path Q0 Q1 Q2 Q3 Q4 Other TOTAL
─────────────────────────────────
P0 0 1 0 0 0 0 1
P1 3 1 0 0 0 0 4
P2 4 0 3 0 0 2 9
P3 170 0 3 7 0 2 182
P4 5 0 0 0 0 0 5
Other 1 0 0 0 0 0 1
─────────────────────────────────
TOTAL 183 2 6 7 0 4 202
─────────────────────────────────
Year and Month Distribution
────────────────────────────────────────────────
Year 01 02 03 04 05 06 07 08 09 10 11 12 TOTAL
────────────────────────────────────────────────
2023 0 0 0 0 0 0 0 0 0 0 0 202 202
────────────────────────────────────────────────
Top 10 Out of 11 Hosts
───────────────────────────────
Host Captures
───────────────────────────────
abs.twimg.com 171
api.twitter.com 12
twitter.com 5
accounts.google.com 4
pbs.twimg.com 4
t.co 1
static.ads-twitter.com 1
dict.brave.com 1
appleid.cdn-apple.com 1
fonts.gstatic.com 1
───────────────────────────────
OTHERS (1 Hosts) 1
───────────────────────────────
Random Sample of 3 OK HTML Mementos
────────────────────────────────────────────────
* https://web.archive.org/web/20231201171243/https://accounts.google.com/gsi/button?theme=outline&size=large&shape=circle&logo_alignment=center&text=signup_with&width=300&client_id=49625052041-kgt0hghf445lmcmhijv46b715m2mpbct.apps.googleusercontent.com&iframe_id=gsi_762763_93711&as=2t8mhTvTtFQiEtW8SmlNCA&hl=en
* https://web.archive.org/web/20231201171242/https://twitter.com/?logout=1701450762854
* https://web.archive.org/web/20231201171240/https://twitter.com/ShitpostGate/status/1730213204825895325
$ zcat bad/indexes/index.cdx.gz | ~/bin/cdxj2cdx.py | cdxsummary
Summarizing piped data: STDIN
CDX Overview
───────────────────────────────
Total Captures in CDX 126
Consecutive Unique URLs 126
Consecutive Unique Hosts 11
Total WARC Records Size 2.9 MB
First Memento Date Dec 01
Last Memento Date Dec 01
───────────────────────────────
MIME Type and Status Code Distribution
──────────────────────────────────────
MIME 2XX 3XX 4XX 5XX Other TOTAL
──────────────────────────────────────
HTML 2 0 0 0 0 2
Image 5 0 0 0 0 5
CSS 1 0 0 0 0 1
JavaScript 104 0 0 0 0 104
JSON 3 0 1 0 0 4
Text 0 1 0 0 0 1
Other 8 0 1 0 0 9
──────────────────────────────────────
TOTAL 123 1 2 0 0 126
──────────────────────────────────────
Path and Query Segments
─────────────────────────────────
Path Q0 Q1 Q2 Q3 Q4 Other TOTAL
─────────────────────────────────
P0 0 1 0 0 0 0 1
P1 4 1 0 0 0 0 5
P2 4 0 1 0 0 1 6
P3 104 0 2 3 0 2 111
P4 2 0 0 0 0 0 2
Other 1 0 0 0 0 0 1
─────────────────────────────────
TOTAL 115 2 3 3 0 3 126
─────────────────────────────────
Year and Month Distribution
────────────────────────────────────────────────
Year 01 02 03 04 05 06 07 08 09 10 11 12 TOTAL
────────────────────────────────────────────────
2023 0 0 0 0 0 0 0 0 0 0 0 126 126
────────────────────────────────────────────────
Top 10 Out of 11 Hosts
───────────────────────────────
Host Captures
───────────────────────────────
abs.twimg.com 103
api.twitter.com 7
twitter.com 5
accounts.google.com 3
pbs.twimg.com 2
t.co 1
static.ads-twitter.com 1
dict.brave.com 1
appleid.cdn-apple.com 1
google-analytics.com 1
───────────────────────────────
OTHERS (1 Hosts) 1
───────────────────────────────
Random Sample of 2 OK HTML Mementos
────────────────────────────────────────────────
* https://web.archive.org/web/20231201170828/https://twitter.com/?logout=1701450507965
* https://web.archive.org/web/20231201170822/https://twitter.com/ShitpostGate/status/1730213204825895325
There are 84 URLs captured in the good one but not in the bad one, 8 captured in the bad one but not in the good one, and 118 captured in both:
$ zcat good/indexes/index.cdx.gz | cut -d' ' -f1 | uniq > /tmp/good-surts.txt
$ zcat bad/indexes/index.cdx.gz | cut -d' ' -f1 | uniq > /tmp/bad-surts.txt
$ comm -123 --total /tmp/good-surts.txt /tmp/bad-surts.txt
84 8 118 total
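The three counts can be expanded into the actual entries. The following is a minimal sketch, not part of the original analysis: the function names are hypothetical, and it assumes both input files are already sorted in byte order (which `comm` requires), so `LC_ALL=C` is forced to avoid locale-dependent ordering.

```shell
# Hypothetical helpers; each takes two sorted files of SURT keys, one per line.

# Entries present only in the first file (e.g. good-only URLs).
only_in_first() {
  LC_ALL=C comm -23 "$1" "$2"
}

# Entries present only in the second file (e.g. bad-only URLs).
only_in_second() {
  LC_ALL=C comm -13 "$1" "$2"
}

# Read SURT keys on stdin and count them per host prefix
# (the part before the first ')'), most frequent first.
count_by_host() {
  cut -d')' -f1 | LC_ALL=C sort | uniq -c | sort -rn
}

# Example usage with the files created above:
#   only_in_first  /tmp/good-surts.txt /tmp/bad-surts.txt | count_by_host
#   only_in_second /tmp/good-surts.txt /tmp/bad-surts.txt
```

Grouping the good-only entries by host gives a quick view of which origin accounts for most of the missing captures.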
Further investigation suggests that the status codes of all the common URLs are likely the same in both captures.
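One way to check the status codes of the common URLs is sketched below. It rests on assumptions: each index line is taken to be of the form `SURT timestamp {json}` with a `"status"` field in the JSON part, and the function names are hypothetical. It uses `gzip -dc` instead of `zcat` only for portability.

```shell
# Hypothetical helper: print sorted, unique "SURT status" pairs from a
# gzipped CDXJ index. Assumes lines look like: SURT timestamp {..."status": "200"...}
surt_status() {
  gzip -dc "$1" |
    sed -nE 's/^([^ ]+) [^ ]+ .*"status": *"?([0-9]+)"?.*$/\1 \2/p' |
    LC_ALL=C sort -u
}

# Print "SURT status_in_first status_in_second" for every SURT present in
# both indexes whose status differs. No output means no differences were found.
status_diff() {
  a=$(mktemp)
  b=$(mktemp)
  surt_status "$1" > "$a"
  surt_status "$2" > "$b"
  LC_ALL=C join "$a" "$b" | awk '$2 != $3'
  rm -f "$a" "$b"
}

# Example usage:
#   status_diff good/indexes/index.cdx.gz bad/indexes/index.cdx.gz
```

Because `join` only pairs keys found in both inputs, URLs captured in just one of the two crawls are ignored here.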
URLs that were archived in both the Good and the Bad capture (the 118 common entries) are the following:
$ comm -12 /tmp/good-surts.txt /tmp/bad-surts.txt
com,ads-twitter,static)/uwt.js
com,brave,dict)/edgedl/chrome/dict/en-us-10-1.bdic
com,cdn-apple,appleid)/appleauth/static/jsapi/appleid/1/en_us/appleid.auth.js
com,google,accounts)/gsi/client
com,google,accounts)/gsi/style
com,twimg,abs)/favicons/twitter.3.ico
com,twimg,abs)/responsive-web/client-serviceworker/serviceworker.c378aaea.js
com,twimg,abs)/responsive-web/client-web/bundle.birdwatch.9d172afa.js
com,twimg,abs)/responsive-web/client-web/bundle.communities.21137dca.js
com,twimg,abs)/responsive-web/client-web/bundle.conversation.f0ed625a.js
com,twimg,abs)/responsive-web/client-web/bundle.conversationwithrelay.9dece1ea.js
com,twimg,abs)/responsive-web/client-web/bundle.networkinstrument.1ded521a.js
com,twimg,abs)/responsive-web/client-web/bundle.ocf.0664409a.js
com,twimg,abs)/responsive-web/client-web/bundle.userprofile.93df711a.js
com,twimg,abs)/responsive-web/client-web/chirp-bold.ebb56aba.woff2
com,twimg,abs)/responsive-web/client-web/chirp-heavy.f44ae4ea.woff2
com,twimg,abs)/responsive-web/client-web/chirp-regular.80fda27a.woff2
com,twimg,abs)/responsive-web/client-web/feature-switch-manifest.9d67d6aa.js
com,twimg,abs)/responsive-web/client-web/i18n/en.12ebd59a.js
com,twimg,abs)/responsive-web/client-web/loader.appmodules.e96a390a.js
com,twimg,abs)/responsive-web/client-web/loader.audiodock.0b2c8c4a.js
com,twimg,abs)/responsive-web/client-web/loader.audioonlyvideoplayer.78c9304a.js
com,twimg,abs)/responsive-web/client-web/loader.confetti.a50e5e0a.js
com,twimg,abs)/responsive-web/client-web/loader.dividerhandler.2d61beda.js
com,twimg,abs)/responsive-web/client-web/loader.exploresidebar.79efdada.js
com,twimg,abs)/responsive-web/client-web/loader.newtweetspill.f8696e9a.js
com,twimg,abs)/responsive-web/client-web/loader.sidenav.06efec2a.js
com,twimg,abs)/responsive-web/client-web/loader.signupmodule.82431b6a.js
com,twimg,abs)/responsive-web/client-web/loader.timelinecardhandler.e98ba0ca.js
com,twimg,abs)/responsive-web/client-web/loader.timelinerenderer.98ea95ea.js
com,twimg,abs)/responsive-web/client-web/loader.tweethandler.1a8a559a.js
com,twimg,abs)/responsive-web/client-web/loader.widelayout.dab2609a.js
com,twimg,abs)/responsive-web/client-web/loaders.video.playerhls1.1.153b3bba.js
com,twimg,abs)/responsive-web/client-web/main.84a9074a.js
com,twimg,abs)/responsive-web/client-web/ondemand.dropdown.938dadaa.js
com,twimg,abs)/responsive-web/client-web/ondemand.s.60da2a4a.js
com,twimg,abs)/responsive-web/client-web/shared~bundle.aboutthisad~bundle.notmyaccount~bundle.multiaccount~bundle.articles~bundle.audiospacepeek~bundl.f51bec1a.js
com,twimg,abs)/responsive-web/client-web/shared~bundle.aboutthisad~bundle.notmyaccount~bundle.multiaccount~bundle.audiospacepeek~bundle.birdwatch~bund.31f309ea.js
com,twimg,abs)/responsive-web/client-web/shared~bundle.articles~bundle.audiospacedetail~bundle.audiospacediscovery~bundle.audiospacebarscreen~bundle.b.fcbfe55a.js
com,twimg,abs)/responsive-web/client-web/shared~bundle.audiospacedetail~bundle.audiospacediscovery~bundle.audiospacebarscreen~bundle.birdwatch~bundle..5fe5f9da.js
com,twimg,abs)/responsive-web/client-web/shared~bundle.audiospacepeek~bundle.birdwatch~bundle.bookmarkfolders~bundle.communities~bundle.twitterarticle.37bbdfca.js
com,twimg,abs)/responsive-web/client-web/shared~bundle.birdwatch~bundle.communities~bundle.twitterarticles~bundle.accountverification~ondemand.setting.3de566ba.js
com,twimg,abs)/responsive-web/client-web/shared~bundle.birdwatch~bundle.communities~bundle.twitterarticles~bundle.composemedia~ondemand.settingsintern.4e00a5aa.js
com,twimg,abs)/responsive-web/client-web/shared~bundle.birdwatch~bundle.twitterarticles~bundle.compose~bundle.settings~bundle.display~bundle.ocf~bundl.4eb8ab7a.js
com,twimg,abs)/responsive-web/client-web/shared~bundle.birdwatch~loader.inlinetombstonehandler~loader.tweethandler.0dd527da.js
com,twimg,abs)/responsive-web/client-web/shared~bundle.birdwatch~loader.inlinetombstonehandler~loader.tweethandler~loader.immersivetweethandler.0c4bcd7a.js
com,twimg,abs)/responsive-web/client-web/shared~bundle.birdwatch~loader.inlinetombstonehandler~loader.tweethandler~loader.tweetcurationactionmenu.1110732a.js
com,twimg,abs)/responsive-web/client-web/shared~bundle.birdwatch~ondemand.settingsinternals~bundle.explore~bundle.topics~bundle.trends~loader.explores.b86b7caa.js
com,twimg,abs)/responsive-web/client-web/shared~bundle.bookmarks~bundle.communities~bundle.twitterarticles~bundle.explore~bundle.liveevent~bundle.home.354a82ca.js
com,twimg,abs)/responsive-web/client-web/shared~bundle.conversation~bundle.tweetmediadetail~bundle.immersivemediaviewer.95a3fcaa.js
com,twimg,abs)/responsive-web/client-web/shared~bundle.conversation~bundle.tweetmediadetail~bundle.immersivemediaviewer~loader.inlinetombstonehandler~.9c72de7a.js
com,twimg,abs)/responsive-web/client-web/shared~bundle.liveevent~ondemand.inlineplayer~loader.audioonlyvideoplayer.386f204a.js
com,twimg,abs)/responsive-web/client-web/shared~bundle.ocf~bundle.loggedouthome~bundle.search~loader.timelinerenderer~loader.signupmodule.aba4d59a.js
com,twimg,abs)/responsive-web/client-web/shared~bundle.readermode~bundle.conversation~bundle.tweetmediadetail~bundle.immersivemediaviewer.18bcadda.js
com,twimg,abs)/responsive-web/client-web/shared~bundle.twitterarticles~bundle.composemedia~loaders.video.videoplayerdefaultui~loaders.video.videoplaye.546e4a5a.js
com,twimg,abs)/responsive-web/client-web/shared~bundle.twitterarticles~bundle.composemedia~ondemand.inlineplayer~loaders.video.playerbase~loader.audio.1312b38a.js
com,twimg,abs)/responsive-web/client-web/shared~loader.appmodules~bundle.conversation.2fdf48fa.js
com,twimg,abs)/responsive-web/client-web/shared~loader.appmodules~bundle.loggedouthome~bundle.search.e27e584a.js
com,twimg,abs)/responsive-web/client-web/shared~loader.appmodules~bundle.loggedouthome~bundle.search~ondemand.settingsrevamp~bundle.settings.c2fc49ea.js
com,twimg,abs)/responsive-web/client-web/shared~loader.appmodules~bundle.ocf.bb2c825a.js
com,twimg,abs)/responsive-web/client-web/shared~loader.appmodules~loader.loggedoutnotifications.949004fa.js
com,twimg,abs)/responsive-web/client-web/shared~loader.audiodock~bundle.audiospacepeek~bundle.audiospaceanalytics~bundle.audiospacereport~bundle.birdw.bf851eba.js
com,twimg,abs)/responsive-web/client-web/shared~loader.audiodock~loader.dashmenu~loader.sidenav~loader.typeahead~loader.appmodules~loader.dmdrawer~bun.0e71ec1a.js
com,twimg,abs)/responsive-web/client-web/shared~loader.audiodock~loader.dashmenu~loader.sidenav~loader.typeahead~loader.dmdrawer~bundle.account~bundle.6b1a2aaa.js
com,twimg,abs)/responsive-web/client-web/shared~loader.audiodock~loader.dashmenu~loader.sidenav~loader.typeahead~loader.dmdrawer~bundle.multiaccount~b.6c81a84a.js
com,twimg,abs)/responsive-web/client-web/shared~loader.audiodock~loader.dmdrawer~bundle.articles~bundle.audiospacedetail~bundle.audiospacepeek~bundle..971dd09a.js
com,twimg,abs)/responsive-web/client-web/shared~loader.audiodock~loader.dmdrawer~bundle.articles~bundle.audiospacepeek~bundle.audiospacediscovery~bund.3dc827ea.js
com,twimg,abs)/responsive-web/client-web/shared~loader.audiodock~loader.dmdrawer~bundle.audiospacedetail~bundle.audiospacepeek~bundle.audiospacediscov.8056aafa.js
com,twimg,abs)/responsive-web/client-web/shared~loader.audiodock~loader.dmdrawer~bundle.audiospacepeek~bundle.audiospaceanalytics~bundle.audiospacerep.e3f6f4aa.js
com,twimg,abs)/responsive-web/client-web/shared~loader.audiodock~loader.typeahead~loader.appmodules~loader.dmdrawer~bundle.articles~bundle.audiospaced.86ed62ba.js
com,twimg,abs)/responsive-web/client-web/shared~loader.audiodock~loader.typeahead~loader.dmdrawer~bundle.articles~bundle.audiospacedetail~bundle.audio.b9a9150a.js
com,twimg,abs)/responsive-web/client-web/shared~loader.dashmenu~loader.dmdrawer~bundle.accountanalytics~bundle.articles~bundle.audiospacepeek~bundle.a.e89e025a.js
com,twimg,abs)/responsive-web/client-web/shared~loader.dashmenu~loader.sidenav~bundle.multiaccount~bundle.communities~ondemand.settingsmonetization~bu.9a52b5ca.js
com,twimg,abs)/responsive-web/client-web/shared~loader.dashmenu~loader.sidenav~bundle.multiaccount~bundle.jobsearch.49f1f64a.js
com,twimg,abs)/responsive-web/client-web/shared~loader.dashmenu~loader.sidenav~loader.appmodules~loader.dmdrawer~bundle.multiaccount~bundle.birdwatch~.f756cfba.js
com,twimg,abs)/responsive-web/client-web/shared~loader.dashmenu~loader.sidenav~loader.dmdrawer~bundle.multiaccount~bundle.accountanalytics~bundle.comm.7adf520a.js
com,twimg,abs)/responsive-web/client-web/shared~loader.dmdrawer~bundle.aboutthisad~bundle.notmyaccount~bundle.multiaccount~bundle.articles~bundle.audi.12d1403a.js
com,twimg,abs)/responsive-web/client-web/shared~loader.dmdrawer~bundle.accountanalytics~bundle.audiospacepeek~bundle.birdwatch~bundle.bookmarkfolders~.ee3ddada.js
com,twimg,abs)/responsive-web/client-web/shared~loader.dmdrawer~bundle.articles~bundle.audiospacedetail~bundle.audiospacediscovery~bundle.audiospaceba.c599d51a.js
com,twimg,abs)/responsive-web/client-web/shared~loader.dmdrawer~bundle.articles~bundle.audiospacepeek~bundle.birdwatch~~bundle.communities~bundle.twit.5624ffba.js
com,twimg,abs)/responsive-web/client-web/shared~loader.dmdrawer~bundle.articles~bundle.directmessages~bundle.dmrichtextcompose~bundle.liveevent~bundle.662cc33a.js
com,twimg,abs)/responsive-web/client-web/shared~loader.dmdrawer~bundle.audiospacepeek~bundle.birdwatch~bundle.twitterarticles~bundle.compose~~bundle.s.2739ccda.js
com,twimg,abs)/responsive-web/client-web/shared~loader.dmdrawer~bundle.audiospacepeek~bundle.compose~~bundle.dmrichtextcompose~bundle.directmessages~b.2071379a.js
com,twimg,abs)/responsive-web/client-web/shared~loader.dmdrawer~bundle.birdwatch~bundle.bookmarkfolders~bundle.communities~bundle.twitterarticles~bund.6d598a5a.js
com,twimg,abs)/responsive-web/client-web/shared~loader.dmdrawer~bundle.birdwatch~bundle.communities~bundle.compose~bundle.directmessages~bundle.dmrich.c5f3d57a.js
com,twimg,abs)/responsive-web/client-web/shared~loader.dmdrawer~bundle.birdwatch~bundle.communities~bundle.twitterarticles~bundle.compose~ondemand.com.49260e5a.js
com,twimg,abs)/responsive-web/client-web/shared~loader.dmdrawer~bundle.bookmarks~bundle.communities~bundle.twitterarticles~bundle.directmessages~bundl.68fbd60a.js
com,twimg,abs)/responsive-web/client-web/shared~loader.dmdrawer~bundle.compose~bundle.directmessages~bundle.dmrichtextcompose~bundle.jobsearch~bundle..4a25bb7a.js
com,twimg,abs)/responsive-web/client-web/shared~loader.dmdrawer~bundle.compose~bundle.directmessages~bundle.dmrichtextcompose~bundle.liveevent~bundle..f801acea.js
com,twimg,abs)/responsive-web/client-web/shared~loader.dmdrawer~bundle.compose~bundle.directmessages~bundle.dmrichtextcompose~loader.hwcard~loader.tim.6d69199a.js
com,twimg,abs)/responsive-web/client-web/shared~loader.dmdrawer~bundle.directmessages~bundle.liveevent~bundle.userprofile~loader.timelinerenderer.7dc10f2a.js
com,twimg,abs)/responsive-web/client-web/shared~loader.inlinetombstonehandler~loader.tweethandler.400b75fa.js
com,twimg,abs)/responsive-web/client-web/shared~loader.sidenav~bundle.accountanalytics~bundle.communities~ondemand.settingsinternals~ondemand.settings.09d4794a.js
com,twimg,abs)/responsive-web/client-web/shared~loader.sidenav~bundle.jobsearch.c1467cba.js
com,twimg,abs)/responsive-web/client-web/shared~loader.sidenav~bundle.multiaccount~bundle.jobsearch.0cb9b3ca.js
com,twimg,abs)/responsive-web/client-web/shared~loader.typeahead~loader.appmodules~bundle.audiospacediscovery~bundle.loggedouthome~bundle.search.b84f4c7a.js
com,twimg,abs)/responsive-web/client-web/shared~loader.typeahead~loader.appmodules~loader.dmdrawer~bundle.articles~bundle.audiospacedetail~bundle.audi.f491141a.js
com,twimg,abs)/responsive-web/client-web/shared~loader.typeahead~loader.dmdrawer~bundle.audiospacepeek~bundle.birdwatch~bundle.liveevent~bundle.commun.ffa58d2a.js
com,twimg,abs)/responsive-web/client-web/shared~loader.typeahead~loader.dmdrawer~bundle.multiaccount~bundle.articles~bundle.audiospacedetail~bundle.au.b2a74a8a.js
com,twimg,abs)/responsive-web/client-web/shared~loader.typeahead~loader.dmdrawer~bundle.multiaccount~bundle.birdwatch~bundle.communities~bundle.compos.5736180a.js
com,twimg,abs)/responsive-web/client-web/shared~loader.typeahead~ondemand.settingsinternals~bundle.loggedouthome~bundle.search~bundle.userlists~loader.2a97c93a.js
com,twimg,abs)/responsive-web/client-web/shared~loader.widelayout~bundle.conversation.1ded365a.js
com,twimg,abs)/responsive-web/client-web/shared~loader.widelayout~loader.profileclusterfollow.662180ca.js
com,twimg,abs)/responsive-web/client-web/shared~loaders.video.videoplayerdefaultui~loaders.video.videoplayerminiui~loaders.video.videoplayerhashtaghig.bcac88fa.js
com,twimg,abs)/responsive-web/client-web/shared~ondemand.inlineplayer~loader.audioonlyvideoplayer~loader.immersivetweethandler.9163109a.js
com,twimg,abs)/responsive-web/client-web/shared~ondemand.settingsinternals~bundle.explore~bundle.trends~loader.exploresidebar.cac7a51a.js
com,twimg,abs)/responsive-web/client-web/shared~ondemand.settingsrevamp~bundle.twitterblue~bundle.conversation~bundle.twittercoinsmanagement~ondemand..84ea96da.js
com,twimg,abs)/responsive-web/client-web/vendor.1b81224a.js
com,twimg,pbs)/profile_images/1655280202824417287/xgibiqze_normal.jpg
com,twimg,pbs)/profile_images/1707175399560814592/hjpiobvf_bigger.jpg
com,twitter)/home?precache=1
com,twitter)/manifest.json
com,twitter)/shitpostgate/status/1730213204825895325
com,twitter)/sw.js
com,twitter,api)/1.1/hashflags.json
com,twitter,api)/2/guide.json?cards_platform=web-12&count=20&display_location=web_sidebar&entity_tokens=false&ext=mediastats,highlightedlabel,hasnftavatar,voiceinfo,birdwatchpivot,superfollowmetadata,unmentioninfo,editcontrol&focal_tweet_id=1730213204825895325&include_blocked_by=1&include_blocking=1&include_can_dm=1&include_can_media_tag=1&include_cards=1&include_entities=true&include_ext_alt_text=true&include_ext_has_nft_avatar=1&include_ext_is_blue_verified=1&include_ext_limited_action_results=true&include_ext_media_availability=true&include_ext_media_color=true&include_ext_profile_image_shape=1&include_ext_sensitive_media_warning=true&include_ext_trusted_friends_metadata=true&include_ext_verified_type=1&include_ext_views=true&include_followed_by=1&include_mute_edge=1&include_page_configuration=false&include_profile_interstitial_type=1&include_quote_count=true&include_reply_count=1&include_user_entities=true&include_want_retweets=1&requestcontext=launch&send_error_codes=true&simple_quoted_tweet=true&skip_status=1&tweet_mode=extended
com,twitter,api)/graphql/ansatahghwrk9d7hk92_mg/usersbyrestids?features={"responsive_web_graphql_exclude_directive_enabled":true,"verified_phone_label_enabled":false,"responsive_web_graphql_skip_user_profile_image_extensions_enabled":false,"responsive_web_graphql_timeline_navigation_enabled":true}&variables={"userids":["1474091860025122817"]}
com,twitter,api)/graphql/r5zbjueowsdimws3cye0iw/tweetresultbyrestid?features={"creator_subscriptions_tweet_preview_api_enabled":true,"c9s_tweet_anatomy_moderator_badge_enabled":true,"tweetypie_unmention_optimization_enabled":true,"responsive_web_edit_tweet_api_enabled":true,"graphql_is_translatable_rweb_tweet_is_translatable_enabled":true,"view_counts_everywhere_api_enabled":true,"longform_notetweets_consumption_enabled":true,"responsive_web_twitter_article_tweet_consumption_enabled":false,"tweet_awards_web_tipping_enabled":false,"responsive_web_home_pinned_timelines_enabled":true,"freedom_of_speech_not_reach_fetch_enabled":true,"standardized_nudges_misinfo":true,"tweet_with_visibility_results_prefer_gql_limited_actions_policy_enabled":true,"longform_notetweets_rich_text_read_enabled":true,"longform_notetweets_inline_media_enabled":true,"responsive_web_graphql_exclude_directive_enabled":true,"verified_phone_label_enabled":false,"responsive_web_media_download_video_enabled":false,"responsive_web_graphql_skip_user_profile_image_extensions_enabled":false,"responsive_web_graphql_timeline_navigation_enabled":true,"responsive_web_enhance_cards_enabled":false}&variables={"tweetid":"1730213204825895325","withcommunity":false,"includepromotedcontent":false,"withvoice":false}
Thank you for this analysis, @ibnesayeed!
We are trying to archive a Tweet and observe the following inconsistent behavior: we run exactly the same command on a busy server and on an idle laptop. On the laptop it succeeds, but on the server the capture sometimes succeeds and sometimes fails. (By failure, I mean that the capture is incomplete: the full tweet is not archived correctly, and sometimes we get only the sidebar or not even that.) The logs don't show any warnings or errors in either case.
Details:
We use the following command in all runs:
The laptop is a MacBook Air M2. The server is a 30-core, 44 GB RAM machine that runs various tasks, e.g. it may be running 10 docker commands similar to the one I mentioned. Memory/CPU usage could be around 60-70%, so it's not overloaded.
I'm attaching the crawl logs for a successful and a failed capture: bad-crawl-20231201170818604.log good-crawl-20231201171237538.log
Observations:
There are no warning or error logs in the failed capture log. Also note that there is no "timeout" or other reference to errors.
Hypothesis: It's as if some network requests time out (timedRun reaches timeouts?) and we don't capture the target URLs, but there isn't any indication of that.
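The observation about the logs could be double-checked mechanically. The sketch below assumes the crawl logs are JSON lines carrying a level field named `logLevel`. That key is an assumption and may need adjusting to the actual log format. The function names are hypothetical.

```shell
# Hypothetical helper: tally log lines by level in a JSON-lines crawl log.
# Assumes each line contains something like "logLevel":"info" (assumed key name).
count_log_levels() {
  grep -oE '"logLevel": *"[a-zA-Z]+"' "$1" |
    sed -E 's/.*"([a-zA-Z]+)"$/\1/' |
    LC_ALL=C sort | uniq -c | awk '{print $2, $1}'
}

# Print every line that mentions a timeout, whatever its level
# (matches "timeout", "timed out", "time out", case-insensitively).
find_timeouts() {
  grep -iE 'time(d)? ?out' "$1"
}

# Example usage:
#   count_log_levels bad-crawl-20231201170818604.log
#   find_timeouts    bad-crawl-20231201170818604.log
```

Running both helpers on the good and the bad log and comparing the tallies would show whether the failed run really logged nothing beyond informational messages.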