xwmx / nb

CLI and local web plain text note‑taking, bookmarking, and archiving with linking, tagging, filtering, search, Git versioning & syncing, Pandoc conversion, + more, in a single portable script.
https://xwmx.github.io/nb
GNU Affero General Public License v3.0
6.64k stars 188 forks

Some web pages are not downloaded when bookmarking #279

Closed mozgwar closed 10 months ago

mozgwar commented 11 months ago

Hi, I'm evaluating this nice software to see if it fits my needs, and so far so good. But I noticed that for some sites only the URL gets bookmarked. The following are not working:

The following work as expected:

mozgwar commented 11 months ago

I did some tests, and it looks like the problem is in the _file_is_text function.

xwmx commented 11 months ago

@mozgwar Thanks for the information. All of those work for me on macOS. What is your OS? Could you share the output of the following command?

uname -a; file --version; bash --version

mozgwar commented 11 months ago

I'm on Linux, and my main shell is zsh.

Linux jessica 6.1.38-cachyos #1 SMP PREEMPT Fri Jul 7 15:28:36 EDT 2023 x86_64 Intel(R) Core(TM) i9-10900X CPU @ 3.70GHz GenuineIntel GNU/Linux
file-5.45
magic file from /usr/share/misc/magic
seccomp support included
GNU bash, version 5.2.15(1)-release (x86_64-pc-linux-gnu)
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

$ zsh --version
zsh 5.9 (x86_64-pc-linux-gnu)

xwmx commented 11 months ago

Thanks. That looks normal so far. When you bookmark one of the URLs that doesn’t download, does nb display the message, “Unable to download page at [url]”?

mozgwar commented 11 months ago

No, it just adds the URL to the bookmark file and stops there.

xwmx commented 11 months ago

@mozgwar Thanks. I've still been unable to reproduce or determine the cause of the issue. nb uses curl or wget to download the page to a temp file and then reads the content from there. It sounds like the temp file is being created, but might not be getting recognized as a text file. The _file_is_text() function is pretty simple and the only external program it uses is file, which it looks like you have a recent version of. I'll have to keep thinking about this.
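The check described above (look at a downloaded temp file and decide whether it is text) can be sketched roughly as follows. This is not nb's actual code: `mime_is_text` and `file_looks_like_text` are hypothetical names, and the rule "the MIME type starts with text/" is an assumption made for illustration only.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not from nb): succeed when a MIME type string
# belongs to the text/* family.
mime_is_text() {
  case "${1:-}" in
    text/*) return 0 ;;
    *)      return 1 ;;
  esac
}

# Hypothetical helper (not from nb): ask `file` for the MIME type of a
# path and classify the answer. Fails when the path is not a regular file.
file_looks_like_text() {
  local _path="${1:-}"
  [ -f "${_path}" ] || return 1
  mime_is_text "$(file -b --mime-type "${_path}")"
}
```

Under this assumed rule, any page that `file` classifies outside text/* would be treated as non-text even though the download itself succeeded.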

mozgwar commented 11 months ago

I just did:

wget https://b-ark.ca/2020/04/22/diy-kindle-news.html

and I get the following when I run file on it.

1) $ file diy-kindle-news.html
diy-kindle-news.html: HTML document, Unicode text, UTF-8 text, with very long lines (1447)

2) $ file --exclude=apptype \
  --exclude=encoding \
  --exclude=tokens \
  --exclude=cdf \
  --exclude=compress \
  --exclude=elf \
  --exclude=tar \
  -b --mime-type diy-kindle-news.html
application/javascript
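To repeat this comparison on any downloaded page, a small wrapper along these lines may help. It is only a sketch: `compare_mime_detection` is a hypothetical name, and the flag list is copied from the command quoted in this thread. What it prints depends on the installed `file` version, so no expected output is shown.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the MIME type `file` reports for a path,
# once with default settings and once with the --exclude flags quoted
# in this thread.
compare_mime_detection() {
  local _path="${1:-}"
  if [ ! -f "${_path}" ]; then
    printf 'not a file: %s\n' "${_path}" >&2
    return 1
  fi

  printf 'default:  %s\n' "$(file -b --mime-type "${_path}")"
  printf 'excludes: %s\n' "$(
    file \
      --exclude=apptype \
      --exclude=encoding \
      --exclude=tokens \
      --exclude=cdf \
      --exclude=compress \
      --exclude=elf \
      --exclude=tar \
      -b --mime-type "${_path}"
  )"
}

# Usage:
#   compare_mime_detection diy-kindle-news.html
```

If the two lines differ, the exclusions are changing how the page is classified on that system.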

xwmx commented 11 months ago

@mozgwar Does it save the title (first line of the bookmark file) as # Clang/Bootstrapping - Gentoo wiki (wiki.gentoo.org) or # (wiki.gentoo.org)?

mozgwar commented 11 months ago

No title, just # (wiki.gentoo.org)

akashpal-21 commented 11 months ago

It fails to render any Wikipedia page for me on Linux :(

akashpal-21 commented 11 months ago

Here is a video of the issue: wiki.debian is rendered, but no Wikipedia link is.

https://github.com/xwmx/nb/assets/46517170/62d63c1f-6345-4bf0-b672-646cb81eb478

mozgwar commented 11 months ago

I just tried 2 wikipedia pages and I got the following results:

1) $ file ARM_Cortex-A76
ARM_Cortex-A76: HTML document, Unicode text, UTF-8 text, with very long lines (5793)

$ file --exclude=apptype \
  --exclude=encoding \
  --exclude=tokens \
  --exclude=cdf \
  --exclude=compress \
  --exclude=elf \
  --exclude=tar ARM_Cortex-A76
ARM_Cortex-A76: JavaScript source, Unicode text, UTF-8 text, with very long lines (5793)

2) $ file Alaska_Day
Alaska_Day: HTML document, Unicode text, UTF-8 text, with very long lines (4067)

$ file --exclude=apptype \
  --exclude=encoding \
  --exclude=tokens \
  --exclude=cdf \
  --exclude=compress \
  --exclude=elf \
  --exclude=tar Alaska_Day
Alaska_Day: JavaScript source, Unicode text, UTF-8 text, with very long lines (4067)

leamsi commented 10 months ago

I'm hitting the same issue, so I went ahead and removed the --exclude=encoding flag and that fixed it for me.

Makes me think that this is maybe a bug with recent file versions or maybe one of the extra libmagic database entries that some programs add is causing conflicts?

Just for completeness' sake:

$ uname -a; file --version; bash --version
Linux remote-desktop-1 6.2.0-1016-gcp #18-Ubuntu SMP Fri Sep 22 16:23:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
file-5.44
magic file from /etc/magic:/usr/share/misc/magic
GNU bash, version 5.2.15(1)-release (x86_64-pc-linux-gnu)
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

xwmx commented 10 months ago

@leamsi Thanks for the info! I confirmed that this is an issue in file that started at some point between versions 5.41 and 5.44 and exists on both Arch and Ubuntu. These options are intended as a performance optimization. I don’t notice any difference in benchmarks at the moment, so I’ve removed them. This change is in the repo and will be in the next release version.
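For anyone who wants to check the performance claim on their own machine, a rough timing loop could look like the sketch below. `run_file_n_times` is a hypothetical helper, not the benchmark used here, and the results will vary with the machine, the input file, and the `file` version.

```shell
#!/usr/bin/env bash

# Hypothetical helper: run `file` N times on a path, passing any extra
# arguments (for example --exclude flags) straight through to `file`.
run_file_n_times() {
  [ "$#" -ge 2 ] || return 1
  local _runs="${1}" _path="${2}"
  shift 2
  local _i=0
  while [ "${_i}" -lt "${_runs}" ]; do
    file "$@" -b --mime-type "${_path}" >/dev/null
    _i=$((_i + 1))
  done
}

# Usage (bash's `time` keyword wraps the whole loop):
#   time run_file_n_times 200 page.html
#   time run_file_n_times 200 page.html --exclude=encoding --exclude=tar
```

Comparing the two timings gives a rough sense of whether the exclusions save anything measurable.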

xwmx commented 10 months ago

This should be fixed as of version 7.8.0. Let me know if you run into any more issues with it. Thanks!
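To check whether an installed copy is new enough to include the fix, a dotted version comparison like the one below may be handy. It is a sketch: `version_at_least` is a hypothetical helper, it assumes purely numeric dot-separated components, and the usage line assumes `nb --version` prints a bare version string.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeed when dotted version $1 is greater than or
# equal to dotted version $2. Missing components are treated as 0.
version_at_least() {
  local _have="${1:-}" _want="${2:-}" _h _w
  while [ -n "${_have}" ] || [ -n "${_want}" ]; do
    # Take the first remaining component of each version.
    _h="${_have%%.*}"
    _w="${_want%%.*}"
    _h="${_h:-0}"
    _w="${_w:-0}"
    if [ "${_h}" -gt "${_w}" ]; then return 0; fi
    if [ "${_h}" -lt "${_w}" ]; then return 1; fi
    # Drop the component just compared, along with its dot if present.
    case "${_have}" in *.*) _have="${_have#*.}" ;; *) _have="" ;; esac
    case "${_want}" in *.*) _want="${_want#*.}" ;; *) _want="" ;; esac
  done
  return 0
}

# Usage (assumes `nb --version` prints something like 7.8.0):
#   version_at_least "$(nb --version)" 7.8.0 && echo "fix should be included"
```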