wfth / wo

0 stars 0 forks source link

Develop a database structure and queries #2

Closed aiwilliams closed 7 years ago

aiwilliams commented 7 years ago

It is unclear what the database for searching needs to look like. Experiment with structures and queries that will provide for the required search results.

aiwilliams commented 7 years ago

PostgreSQL can do amazing things for search. We'll keep working through that. Here is what I've developed so far.

CREATE TABLE search_documents (
  document_table VARCHAR(50) NOT NULL,
  document_id INTEGER NOT NULL,
  language VARCHAR(20) NOT NULL,
  content tsvector NOT NULL,
  CONSTRAINT document_constraint UNIQUE(document_table, document_id, language)
);

INSERT INTO search_documents (document_table, document_id, language, content)
VALUES ('sermons', 1, 'english', (
  setweight(to_tsvector('english','Profiling Christians'), 'A') ||
  setweight(to_tsvector('english','Before dying in a concentration camp in Nazi Germany, Betsy Ten Boom powerfully stated to her sister, "There is no pit so deep God''s love isn''t deeper still." That is the Apostle Peter''s message to these scattered, marginalized and persecuted Christians living in Asia Minor. No matter how difficult the trial; no matter how dark the night; God''s promise still stands.'), 'B')
));

SELECT document_table, document_id FROM search_documents WHERE content @@ to_tsquery('english','die');
isaacjwilliams commented 7 years ago

Ranking

This document

{ document_id: 1,
  title: "Dying",
  body: "Death means dying; but does die mean dice?" }

INSERT INTO search_documents (document_table, document_id, language, content)
VALUES ('sermons', 1, 'english', (
  setweight(to_tsvector('english','Dying'), 'A') ||
  setweight(to_tsvector('english','Death means dying; but does die mean dice?'), 'B')
));

should rank significantly higher than this document when searching for "die".

{ document_id: 2,
  title: "Heaven",
  body "When someone dies, do they go to heaven?" }

INSERT INTO search_documents (document_table, document_id, language, content)
VALUES ('sermons', 2, 'english', (
  setweight(to_tsvector('english','Heaven'), 'A') ||
  setweight(to_tsvector('english','When someone dies, do they go to heaven?'), 'B')
));

And indeed, it does! Significantly!

SELECT document_id, ts_rank(content, keywords) AS rank
FROM search_documents, to_tsquery('english', 'die') keywords
WHERE content @@ keywords;

 document_id |   rank   
-------------+----------
           1 | 0.695739
           2 | 0.243171

So what do the letter weights add to the ranking?

A: 1.0 B: 0.4 C: 0.2 D: 0.1

This can be overridden if necessary by passing an array with the weights you want for each rank as the first argument to ts_rank.

isaacjwilliams commented 7 years ago

Letter weights don't add static amounts to the ranking. My last comment may have made it seem as if they did.

Really, the number that corresponds to a letter (E.G. 1.0 for A) is the fraction of the ranking kept. Something with a raw rank of 1.0 stays 1.0 when weighted as A, and becomes 0.1 when weighted as D.

isaacjwilliams commented 7 years ago

Here are some sample search documents (data from real sermon series).

INSERT INTO sermon_series (document_table, document_id, language, content)
VALUES ('sermon_series', 1, 'english', (
  setweight(to_tsvector('english','Genesis Vol. 1 - Origins'), 'A') ||
  setweight(to_tsvector('english','Genesis 1:1-23, Genesis 1:1, Genesis 2:1-23, Genesis 2:23-25, Genesis 3, Genesis 4, Genesis 5, Genesis 6:1-9:17, Genesis 9:20-11:25'), 'B') ||
  setweight(to_tsvector('english','Imagine, at His command, animals and vegetation spring into existence. Not to mention the entire universe! Origins will take you from the creation of man to judgment of mankind in the great flood. You will also discover the forming of nations and races in the ancient past. And in the process, discover the relevant truth for every believer living today!'), 'C')
));
INSERT INTO sermon_series (document_table, document_id, language, content)
VALUES ('sermon_series', 2, 'english', (
  setweight(to_tsvector('english','Trees . . . and the Christmas Story'), 'A') ||
  setweight(to_tsvector('english','Genesis 2:17, Galatians 3:13, Revelation 22:2'), 'B') ||
  setweight(to_tsvector('english','Were evergreen trees a symbol of pagan worship? Did the Romans really decorate their temples with holly, ivy and mistletoe? Should sincere Christians have a Christmas tree in their living room? In this special holiday message, Stephen will trace the extraordinary relationship between trees...and the Christmas story.'), 'C')
));
INSERT INTO sermon_series (document_table, document_id, language, content)
VALUES ('sermon_series', 3, 'english', (
  setweight(to_tsvector('english','Genesis Vol. 2 - The Patriarchs'), 'A') ||
  setweight(to_tsvector('english','Genesis 12:1-13:4, Genesis 12:1-3, Genesis 13-19, Genesis 14, Genesis 15-18, Genesis 22, Genesis 24, Genesis 25-27, Genesis 28, Genesis 29-31, Genesis 32, Genesis 33-49'), 'B') ||
  setweight(to_tsvector('english','What would you do if God came calling? Imagine the faith it took Abraham to leave his homeland and follow God''s calling. His faith allowed him to experience incredible promises; including the greatest promise of them all - the promise of the Messiah. That''s right, the Messiah would come from the line of Abraham. In this fascinating study, journey with the Patriarchs - Abraham, Isaac, and Jacob - as they experience the furnace of affliction. Watch them grow as they rest on the faithfulness of God.'), 'C')
));

Trying a bunch of different queries on these has reinforced my faith in PostgreSQL searching. Feel feel to play around with them yourself.

SELECT document_id, ts_rank(content, keywords) AS rank
FROM sermon_series, to_tsquery('english', 'christmas') keywords
WHERE content @@ keywords;

 document_id |   rank   
-------------+----------
           2 | 0.651833
SELECT document_id, ts_rank(content, keywords) AS rank
FROM sermon_series, to_tsquery('english', 'abraham & faith') keywords
WHERE content @@ keywords;

 document_id |  rank   
-------------+---------
           3 | 0.24602
SELECT document_id, ts_rank(content, keywords) AS rank
FROM sermon_series, to_tsquery('english', 'Genesis') keywords
WHERE content @@ keywords;

 document_id |   rank   
-------------+----------
           1 | 0.741615
           2 | 0.243171
           3 | 0.746752
isaacjwilliams commented 7 years ago

If you look at the references for Sermon Series, you'll notice that some (Genesis Vol. 2, for instance) have way more references than others (Genesis Vol. 1). This shouldn't necessarily make it the first hit when searching for "genesis", but with the current system it's overweighted by the repetition of "Genesis" in its many references. This can be resolved by using ts_rank's normalization integer, a value passed to ts_rank as the fourth argument that evens the playing field when it comes to document length. See PostgreSQL's search documentation (about 3/4 down the page) and "Normal ranking, the fair way".

aiwilliams commented 7 years ago

I think for our first pass we can just weight the scripture references as a D. The title and the description are more likely to provide the best indication of what the content is about. We'll see! I think we can call this research done.

isaacjwilliams commented 7 years ago

I think we may have to change the way we structure our search_documents table if we want to highlight results. Read the section "Highlight your results!" (about halfway down this page) and I think you'll see why.

isaacjwilliams commented 7 years ago

As it turns out, our previous structure was just fine. To implement highlighting (or headlining), we simply needed to select from the regular tables using ts_headline. Wee!