yatima1460 / Drill

Search files without indexing, but fast crawling
https://drill.software/
GNU General Public License v2.0
268 stars, 21 forks

File content search #31

Closed: sojusnik closed this issue 5 years ago

sojusnik commented 5 years ago

The Linux ecosystem desperately needs a reliable, fast and intuitive way to search not only for files, but for their content as well. Have you thought about implementing it in Drill too?

yatima1460 commented 5 years ago

I have already thought about it, but first I want to test how much it impacts performance

And I will implement it with a config file specifying which file extensions are allowed to be scanned

If the performance impact is near zero I will add .txt files by default

Also this probably needs parallel crawling inside the file as well

sojusnik commented 5 years ago

Sounds great, thanks!

yatima1460 commented 5 years ago

As you can imagine, this is not very high on the priority list. Heck, Drill lacks A TON of other stuff. Unless you can give me a good example of "the average user needs to search inside .txt files for this reason", I will not increase the priority of this 🤔

yatima1460 commented 5 years ago

@sojusnik

Do you actually need to search inside .txt files often? Or can you give me a use case where a user would do this often?

I could implement something like content:searchthisinsidefiles

sojusnik commented 5 years ago

Sorry for the delay, somehow overlooked this issue.

I actually search the content of text files quite a lot, not only inside .txt, but .md, .odt/.doc, .pdf and .epub files as well. Primarily with GNOME's Tracker, but it's not the most reliable tool.

My use case is to search through my notes, mostly in .txt and .md (mostly for personal use), find documents faster through keywords within .odt/.doc files (mainly for work) and find important text passages in .pdfs (and .epub, if appropriate) for citing (mainly for academia).

yatima1460 commented 5 years ago

@sojusnik I will add .txt and .md support with "content:contentToSearchHere"

Do you think "content:" should search only inside files and not the filename? Or should it merge the normal search with the content search?

yatima1460 commented 5 years ago

About .pdf and other formats like .doc I need to test if they can be opened like regular .txt or if they need to be processed

sojusnik commented 5 years ago

@yatima1460 I think "content:" should only search inside files, not filenames too. That way the results don't get cluttered with content matches when someone is searching only for filenames.
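
To make the proposed behaviour concrete, here is a minimal Python sketch of a query parser in which "content:" switches the search to file contents only, and anything else matches filenames only. Drill itself is written in D; the names `ParsedQuery`, `parse_query` and `matches`, and the plain case-sensitive substring match, are assumptions for illustration and not Drill's actual code.

```python
from typing import NamedTuple

CONTENT_PREFIX = "content:"  # prefix discussed in this thread


class ParsedQuery(NamedTuple):
    search_content: bool  # True: look inside files only; False: filenames only
    term: str             # the text to look for, without the prefix


def parse_query(raw: str) -> ParsedQuery:
    """Split a raw search string into a mode and a search term."""
    if raw.startswith(CONTENT_PREFIX):
        return ParsedQuery(True, raw[len(CONTENT_PREFIX):])
    return ParsedQuery(False, raw)


def matches(query: ParsedQuery, filename: str, text: str) -> bool:
    """Hypothetical matcher: the two modes are kept strictly separate."""
    if query.search_content:
        return query.term in text
    return query.term in filename
```

With this split, `content:budget` does not return a file named `budget.txt` unless the word also appears inside the file.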

yatima1460 commented 5 years ago

This is how I have implemented it:

A file will be searched for content if:

  1. The search string begins with "content:"
  2. The extension is approved as a "plain text file" in the Internet media type registry at ftp://ftp.iana.org/assignments/media-types/
  3. Or if it's a .md markdown file, because strangely enough the standard is actually .markdown and not .md
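
The three conditions above can be sketched in Python as follows: condition 1 must hold, and then either condition 2 or condition 3. This is only an illustration, not the code from the commit. The function name `should_search_content` is hypothetical, and the small hard-coded extension set is an assumed stand-in for whatever list is derived from the IANA registry.

```python
import os

CONTENT_PREFIX = "content:"

# Assumption: a tiny illustrative stand-in for the extensions that map to
# plain-text media types in the IANA registry. The real list is not shown
# in this thread.
PLAIN_TEXT_EXTENSIONS = {".txt", ".csv", ".html", ".css", ".xml", ".markdown"}

# Condition 3: .md is allowed explicitly, as described in the comment above.
EXTRA_EXTENSIONS = {".md"}


def should_search_content(query: str, path: str) -> bool:
    """Return True if the file at `path` should be opened and scanned."""
    # Condition 1: the search string must begin with "content:".
    if not query.startswith(CONTENT_PREFIX):
        return False
    # Conditions 2 and 3: the extension is an approved plain-text one, or .md.
    extension = os.path.splitext(path)[1].lower()
    return extension in PLAIN_TEXT_EXTENSIONS or extension in EXTRA_EXTENSIONS
```

Under these assumptions, `should_search_content("content:todo", "notes.md")` is true, while a `.pdf` path or a query without the prefix is skipped.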

PDFs and DOCs need processing, and that would slow things down even further. I would also need a PDF and a DOC library. Do they even exist for D? I would probably need to use a C one, bind it and test it. That's a lot of work for a marginal case for now

yatima1460 commented 5 years ago

https://github.com/yatima1460/Drill/commit/54d532526cc61675bb86223afbdedc9eb4ae8311

yatima1460 commented 5 years ago

https://github.com/yatima1460/Drill/releases/tag/2.1.0