zammitjames / dataparksearch

Full featured web search engine
GNU General Public License v2.0

Suppress Links During Searches #29

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Hello Maxime,

We are using the latest 4.54 snapshot.

We have a URL (call it indexList.html) that links to 16 other pages:
indexList0.html, indexList1.html, and so on through indexListF.html. Each of
these 16 pages contains about 2,000 links and provides a means to access
files that are stored in a database file vault. Without these generated
pages, the indexer can't find the files in the vault. We want the indexer to
find and index all files linked from the 16 HTML pages labeled 0 through F,
but not to return the URLs indexList0.html through indexListF.html
themselves in search results.

Since each of the pages 0-F has thousands of links, searches tend to find
the indexList0.html-type pages and rank them higher than the contents and
files linked from these pages.

We've tried various combinations of HrefOnly and can't seem to get the
desired functionality. It appears you can control whether the contents of a
page are indexed, but not whether a link is indexed. It seems that if a link
is "allowed", it is indexed. We want to "allow" a link, but not index it.

We want to scan the links and contents of page
We want to scan all the files and URLs on the page with all

What is the expected output? What do you see instead?

Original issue reported on code.google.com by Imlbr...@gmail.com on 11 May 2010 at 3:12

GoogleCodeExporter commented 9 years ago
Do you generate the indexList* pages only to feed URLs into the DataparkSearch
database? If so, try doing it this way:
- put all the URLs you want indexed into one plain text file, e.g. links.txt,
one URL per line.
- insert this list of URLs into the search database with the command:
  ./indexer -if /path/to/links.txt
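
If the indexList* pages already exist, the links.txt file described above could be produced from them rather than written by hand. The following is only a rough sketch, not part of DataparkSearch: the names LinkCollector, extract_links and write_link_list are hypothetical, and it assumes the pages are ordinary HTML whose <a href="..."> attributes point at the vault files and that you know the URL each page is served from.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # URL the page is served from (assumed known)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html_text, base_url):
    """Return every <a href> target in html_text as an absolute URL."""
    parser = LinkCollector(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.links


def write_link_list(pages, out_path):
    """Write the unique URLs found in pages to out_path, one per line.

    pages is an iterable of (base_url, html_text) pairs.
    Returns the number of URLs written.
    """
    seen = set()
    with open(out_path, "w", encoding="utf-8") as out:
        for base_url, html_text in pages:
            for url in extract_links(html_text, base_url):
                if url not in seen:
                    seen.add(url)
                    out.write(url + "\n")
    return len(seen)
```

The resulting file can then be passed to ./indexer -if as in the command above. Because the indexList* pages themselves never appear in links.txt, they would not need to be crawled at all under this approach.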

Original comment by dp.max...@gmail.com on 13 May 2010 at 10:32