zpeters / stashbox

Your personal Internet Archive
GNU General Public License v3.0
47 stars 20 forks

Figure out a better way to generate a file name #7

Closed zpeters closed 4 years ago

zpeters commented 4 years ago

Currently the file name is generated from a hash of the path. It would be nice to use the web page title instead, similar to what a web browser does when you save a page.

lucasturci commented 4 years ago

Hi! I think the web browser extracts what is in the <title> tag of the HTML. Is that what you want? I can do this.

zpeters commented 4 years ago

I think that would work perfectly.

lucasturci commented 4 years ago

Great! Can you assign me the issue?

zpeters commented 4 years ago

Assigned, thank you!

AceroM commented 4 years ago

@lucasturci you could use a regexp to find the title, or parse the page with the golang.org/x/net/html package (https://pkg.go.dev/golang.org/x/net/html#Parse).

So something like this could work:

```go
var title string
htmlBody, err := ioutil.ReadAll(resp.Body)
if err != nil {
	return err // assumes the enclosing function returns an error
}
// (?is) makes the match case-insensitive and lets . span newlines.
titleRegex := `(?is)<title[^>]*>(.*?)</title>`
re := regexp.MustCompile(titleRegex)
matches := re.FindSubmatch(htmlBody)
if len(matches) > 1 {
	title = string(matches[1])
}
site.title = title
```

lucasturci commented 4 years ago

Hey, so, after I implemented this, I realized there may be two types of issues: encoding, and titles that have '/' (https://stackoverflow.com/questions/9847288/is-it-possible-to-use-in-a-filename). I don't use Windows, but there may be some issues with '\' too. How should I escape these characters? I think we will have to choose a different character to replace them with, because I don't think there is a way to escape them. To see what I'm talking about, clone my fork (https://github.com/lucasturci/stashbox) and try running `go run cmd/stashbox/main.go -url https://github.com/zpeters/stashbox`.

The encoding issue is that some sites are not encoded in UTF-8, so their titles will contain bytes the filesystem does not understand, but I don't know whether this PR should tackle that problem or a separate one should.

sharadbhat commented 4 years ago

@lucasturci I guess we could do what most applications do: replace all invalid characters with underscores.

zpeters commented 4 years ago

I think that is a good solution for now.
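
For reference, the underscore replacement discussed in this thread could look roughly like the sketch below. It is only an illustration, not the code that ended up in stashbox. `sanitizeTitle` is a hypothetical helper name, and the set of replaced characters (those commonly rejected on Windows or Unix) and the `untitled` fallback are assumptions. It also replaces invalid UTF-8 bytes, which only partly addresses the encoding concern raised above, since it does not convert other encodings to UTF-8.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeTitle is a hypothetical helper (not part of stashbox) that turns a
// page title into something safer to use as a file name.
func sanitizeTitle(title string) string {
	// Replace each run of invalid UTF-8 bytes with a single underscore.
	title = strings.ToValidUTF8(title, "_")

	// Replace path separators and other characters that common filesystems
	// reject. The exact set is an assumption made for this sketch.
	title = strings.Map(func(r rune) rune {
		switch r {
		case '/', '\\', ':', '*', '?', '"', '<', '>', '|':
			return '_'
		}
		if r < 0x20 || r == 0x7f { // control characters
			return '_'
		}
		return r
	}, title)

	title = strings.TrimSpace(title)
	if title == "" {
		// Assumed fallback for pages with an empty or missing title.
		return "untitled"
	}
	return title
}

func main() {
	fmt.Println(sanitizeTitle("zpeters/stashbox: Your personal Internet Archive"))
	// prints "zpeters_stashbox_ Your personal Internet Archive"
}
```

Two different titles can still map to the same file name after replacement, so a caller may want to append a short hash or counter to the result. That choice is left open here.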