第三章節：關於 PostgresSQL 沒有 Clustered Index 與 INCLUDE clause - Kyle

如文章所述，在 MySQL 的 innoDB 儲存引擎中索引分為 clustered index 與 non-nlustered index，通常 DB 會自動用 Primary Key 當作 clustered index，白話一點就是完整的資料在儲存空間裡面用 clustered index 當作排序依據， clustered index 相近的資料在 storage 中也會在比較接近的位置。且因爲他是實際上 storage 在 partition 資料時的依據，所以只能有一個 clustered index。而我們自己建立的索引則屬於 non-nlustered index(secondary index)

example query: SELECT * WHERE Height = 175

不過這個架構有一些潛在的問題，先從 non-clustered index 取得 clustered index 再去查詢，看起來只是從查詢一次變成兩次而已，然而實際上如果進行非常大範圍的 range query，例如變成 SELECT * WHERE Height > 175 non-clustered index一個 range 中對應的clustered-index 資料在實際硬碟的存放位置是非常散亂的，這個查詢clustered-index tree 的步驟會在 Storage 中發生非常巨量的 Random Access，可能導致效能的大幅下降。（雖然我自己覺得大部分狀況下不是什麼大問題）

因此 PostgresSQL 在看到這個痛點後採取另一個策略 - 不使用 clustered index。沒有 clustered index 意味著的就是完整的原始資料無法進行排序（因為沒有排序的依據），在 PostgreSQL 的中存放完整資料的地方稱之為 Heap。這個 Heap 是個無序的結構，應該就是文章中提到的堆檔案（heap file）

而 non-clustered index 的部分,，PostgreSQL 則是從在每個 node 裡面儲存 clustered index，改成儲存一個指向 Heap 中對應完整資料的 row 所在位置的指標

因為 Heap 不保證資料順序,所以任何資料被 update 或 insert ，現有的其他資料都不需要變更存放的位置, 相較之下 Innodb 因為是依照 clustered index 來排序,任何資料的 clustered index 被更改或插入新的資料都會造成其他資料的存放位置依照 index 發生相對應的變化。

所以在PostgreSQL 中，用 Non-clustered index 查詢資料，不需要再跑去 clustered index tree 裡面查詢，直接拿指標位置去 storage 裡面對應的 Page 取出資料即可。

Trade Off:

如果是用 primary key（以Innodb 來說就是 clustered index ）來查詢，反而會多了一個用指標查詢的步驟。
無法像 MySQL InnoDB 一樣以 clustered index 進行 range query 時有 locality 的優勢 (clustered index 才有辦法保證 index 跟 storage 中物理位置的相似性)

所以假設有一個 table 叫 user，其中 id(PK)、name、 height 都加了 index，則 MySQL 與 PostgresSQL 可能會有一些差別：

SELECT u.* FROM user u WHERE u.height < 175; ：因為要 select 全部的欄位，而查詢條件是 non-clustered index，所以根據前面所述，因為 MySQL-Innodb 還必須去 clustered index tree 裡面搜尋一遍，是可以預期 PostgreSQL 是比較快的。（實際上我覺得可能還是不會相差很多）
SELECT u.* FROM user u WHERE p.id > 25; ：因為 id 是 clustered index，用 id 來當作查詢條件時對 Innodb 來說可以直接在 clustered index tree 裡面搜尋，找到就可以直接回傳。而對 PostgreSQL 來說因為 id 仍然是 non-clusterd index (跟 Heap 分開的)，所以 PostgreSQL 反而還會多了一次用指標去資料庫把資料撈出來的步驟。因此這樣的操作預期 MySQL 可能更有優勢。（實際上我覺得可能還是不會相差很多）
SELECT u.name FROM user u WHERE u.name = "Kyle"; ：兩個應該差不多（因為都是 non-clustered index 且資料已經存在於 non-clustered index tree 上）

還有什麼方式可以讓兩者效率沒有太大差別嗎？-> INCLUDE clause

問題：在 create index 時可以下 INCLUDE clause 把一些欄位帶到 non-clustered index tree node 裡面存放，當要 select 的欄位全部都在 non-clustered index 的資料結構中的話，就可以免去再多一次查詢的時間。不過如果要存放的 column 太多，每次要 create 新的 index 時創造出來的新的 tree 就會過大，對 storage 是一個負荷，想問大家平常在建立索引的時候是怎麼考量要不要 include 欄位的呢？

weblab-tw / ddia-study-group

第三章節：關於 PostgresSQL 沒有 Clustered Index 與 INCLUDE clause - Kyle #33