Designing Data-Intensive Applications, Chapter 6

yagikota commented 1 month ago

What do you do?

Read Chapter 6

TODO

[x] 6.1 Partitioning and Replication
[x] #32
[x] 6.3 Partitioning and Secondary Indexes
[x] 6.4 Partitioning and Rebalancing
[x] 6.5 Request Routing

yagikota commented 1 month ago

Ref

yagikota commented 1 month ago

6.1 Partitioning and Replication

yagikota commented 1 month ago

6.2 Partitioning of Key-Value Data

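skew
- A state in which some nodes receive more data or load than others.
- A node that ends up with disproportionately high load is called a hot spot.
Partitioning by hash of key
- Consistent hashing
  - Used in systems such as CDNs.
  - Does not work well for databases, so it is rarely used there.
Detecting and handling skew remains a challenge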
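To make the notes on hash partitioning and skew concrete, here is a minimal sketch. The function names (`partition_for`, `skew_ratio`) and the choice of md5 are assumptions for illustration, not something the book or any particular database prescribes. It shows that hashing spreads distinct keys evenly, while a single very popular key still produces a hot spot.

```python
import hashlib
from collections import Counter

def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (md5 is used purely for illustration)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

def skew_ratio(keys: list, num_partitions: int) -> float:
    """Load of the busiest partition divided by the mean load (1.0 means perfectly even)."""
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    mean = len(keys) / num_partitions
    return max(counts.values()) / mean

# Many distinct keys are spread almost evenly by the hash.
uniform = [f"user-{i}" for i in range(10_000)]
# One very popular key always lands on the same partition, creating a hot spot.
hot = uniform + ["celebrity"] * 5_000

print(skew_ratio(uniform, 8))  # close to 1.0
print(skew_ratio(hot, 8))      # well above 1.0
```

A ratio like this is one simple way to detect skew. Hashing alone cannot fix the second case, because all requests for the same key go to the same partition.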

Asked Phind

Skew refers to an uneven distribution of data or load within a database system, particularly in the context of sharding. Here are the key points about skew in databases:

### Definition of Skew

Skew occurs when certain shards or partitions of a database receive significantly more data or traffic than others, leading to uneven resource utilization and potential performance issues [1][2].

### Causes of Skew

1. Poor shard key selection:
   - Choosing a sharding key that doesn't evenly distribute data across shards can lead to skew [2].
2. Unequal data distribution:
   - Some data may naturally cluster around certain values, causing imbalance [2].
3. Query patterns:
   - Certain queries or operations may disproportionately target specific shards [2].

### Effects of Skew

1. Performance degradation:
   - Overloaded shards may experience slower response times and increased latency [1][2].
2. Resource inefficiency:
   - Underutilized shards waste computational resources [2].
3. Reduced scalability:
   - As skew increases, the ability to add new shards becomes less effective [2].

### Examples of Skew

1. Customer name sharding:
   - If most customers' names start with a particular letter, that shard may become overloaded [2].
2. User ID sharding:
   - If certain user IDs are more popular or generate more activity, those shards may suffer skew [2].

### Mitigation Strategies

1. Rebalancing:
   - Periodically redistribute data to balance loads across shards [1][2].
2. Dynamic sharding:
   - Adjust shard boundaries dynamically based on changing data distributions [2].
3. Load balancing:
   - Implement mechanisms to distribute incoming requests across shards more evenly [2].
4. Adaptive query routing:
   - Route queries to less busy shards when possible [2].
5. Careful shard key selection:
   - Choose keys that promote even data distribution [4].
6. Monitoring and analysis:
   - Continuously monitor shard performance and analyze data distribution patterns [2].

### Best Practices

1. Regularly review and adjust sharding strategies as data patterns evolve [2].
2. Implement automated monitoring and alerting systems for early detection of skew [2].
3. Design flexible sharding architectures that can adapt to changing data distributions [2].
4. Consider using multiple shard keys or composite keys to better distribute data [2].

By addressing skew, database administrators can maintain optimal performance, ensure efficient resource utilization, and maximize the benefits of sharding in large-scale database systems.

yagikota commented 1 month ago

6.3 Partitioning and Secondary Indexes

What is a secondary index, and how does it differ from a primary index?

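### Example: `users` table

| UserID (Primary Key) | Name  | Email             | Age |
|----------------------|-------|-------------------|-----|
| 1                    | Alice | alice@example.com | 30  |
| 2                    | Bob   | bob@example.com   | 25  |
| 3                    | Carol | carol@example.com | 28  |

#### 1. Primary index

- **Primary key**: `UserID` is set as the primary key and uniquely identifies each row (record).
- **Primary index**: an index that is **created automatically** on the primary key column (`UserID`). It is the most efficient way to identify a row, and it makes lookups by `UserID` very fast.

**Example**: a query that looks up the user with `UserID = 2`

```sql
SELECT * FROM users WHERE UserID = 2;
```

In this case, the database uses the primary index to find the record with `UserID` 2 (Bob) directly.

#### 2. Secondary index

- **Secondary index**: an additional index created on a column other than the primary key. For example, creating a secondary index on the `Age` column makes searches by `Age` more efficient.
- A secondary index improves performance for queries that search on a column that is not the primary key.

**Example**: a query that looks up users with `Age = 30`

```sql
SELECT * FROM users WHERE Age = 30;
```

In this case, the `Age` column has no index by default, so the database has to scan the whole table. Once a secondary index is created on `Age`, the query becomes more efficient.

#### Summary of differences

| Feature          | Primary index                                             | Secondary index                                                                   |
|------------------|-----------------------------------------------------------|-----------------------------------------------------------------------------------|
| **Created on**   | Created automatically on the primary key column (`UserID`) | Can be created manually on non-primary-key columns (`Name`, `Email`, `Age`, etc.) |
| **Uniqueness**   | Uniquely identifies each row                              | Can also be created on columns with duplicate values                              |
| **Lookup speed** | Lookups by primary key are the fastest                    | Lookups on secondary columns are also sped up, but not as much as primary key lookups |
| **Purpose**      | Uniquely identify rows and access them efficiently        | Optimize searches based on a specific column                                      |
| **Example**      | `SELECT * FROM users WHERE UserID = 2;`                   | `SELECT * FROM users WHERE Age = 30;`                                             |

The primary index is optimized for uniquely identifying rows, while a secondary index is used to speed up searches based on a specific column.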
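The effect of the secondary index described above can be checked with a small experiment. This is a sketch using Python's built-in `sqlite3` module with an in-memory database. The index name `idx_users_age` and the helper `query_plan` are hypothetical, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (UserID INTEGER PRIMARY KEY, Name TEXT, Email TEXT, Age INTEGER)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [
        (1, "Alice", "alice@example.com", 30),
        (2, "Bob", "bob@example.com", 25),
        (3, "Carol", "carol@example.com", 28),
    ],
)

def query_plan(sql: str) -> str:
    """Return SQLite's human-readable plan for a query (the 4th column of each plan row)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " / ".join(row[3] for row in rows)

query = "SELECT * FROM users WHERE Age = 30"

plan_before = query_plan(query)  # no index on Age yet, so the table is scanned
conn.execute("CREATE INDEX idx_users_age ON users (Age)")
plan_after = query_plan(query)   # the planner can now search via the secondary index

print(plan_before)
print(plan_after)
```

The query result is the same before and after; only the access path changes from a scan to an index search.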

Local indexes and global indexes
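The difference between the two can be sketched in a few lines. In this toy model (all names and the modulo-based partitioning are assumptions for illustration), a local index is kept per data partition, so a search by `age` has to ask every partition (scatter/gather). A global index is itself partitioned by the indexed value, so a search goes to a single index partition, at the cost of writes touching more than one partition.

```python
from collections import defaultdict

NUM_PARTITIONS = 2

users = {
    1: {"name": "Alice", "age": 30},
    2: {"name": "Bob", "age": 25},
    3: {"name": "Carol", "age": 25},
    4: {"name": "Dave", "age": 30},
}

def doc_partition(user_id: int) -> int:
    """Partition that stores the document (toy rule: user ID modulo partition count)."""
    return user_id % NUM_PARTITIONS

def term_partition(age: int) -> int:
    """Partition that owns the index entry for a given age (toy rule)."""
    return age % NUM_PARTITIONS

# Local index: each partition indexes only the documents it stores.
local_index = [defaultdict(set) for _ in range(NUM_PARTITIONS)]
# Global index: entries are placed according to the indexed value, not the document.
global_index = [defaultdict(set) for _ in range(NUM_PARTITIONS)]

for uid, doc in users.items():
    local_index[doc_partition(uid)][doc["age"]].add(uid)
    global_index[term_partition(doc["age"])][doc["age"]].add(uid)

def find_by_age_local(age: int) -> set:
    """Scatter/gather: every partition has to be queried and the results merged."""
    result = set()
    for index in local_index:
        result |= index.get(age, set())
    return result

def find_by_age_global(age: int) -> set:
    """Only the single index partition that owns this value is queried."""
    return set(global_index[term_partition(age)].get(age, set()))

print(find_by_age_local(30), find_by_age_global(30))
```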

yagikota commented 1 month ago

6.4 Partitioning and Rebalancing

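Rebalancing
- Moving data between nodes in response to changes in the database cluster (e.g. adding CPUs or disks).
Rebalancing by taking the hash of the key mod N: not recommended
- With mod N, most keys have to move to a different node whenever the number of nodes changes.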
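How much data moves under hash mod N can be measured directly. The sketch below uses hypothetical helpers (`stable_hash`, `moved_fraction`) with md5 as an arbitrary stable hash. It counts the keys whose node changes when a cluster grows from 10 to 11 nodes; in expectation about 10 out of 11 keys move.

```python
import hashlib

def stable_hash(key: str) -> int:
    """A hash that is stable across runs (md5 is used purely for illustration)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")

def moved_fraction(keys: list, old_nodes: int, new_nodes: int) -> float:
    """Fraction of keys assigned to a different node after the node count changes."""
    moved = sum(
        1 for k in keys
        if stable_hash(k) % old_nodes != stable_hash(k) % new_nodes
    )
    return moved / len(keys)

keys = [f"key-{i}" for i in range(10_000)]
print(moved_fraction(keys, 10, 11))  # most keys move
```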
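Keep the number of partitions fixed and larger than the number of nodes
- When a node is added, it takes over a few partitions from each existing node.
- The number of partitions does not change.
- Riak, Elasticsearch, Couchbase, Voldemort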
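The fixed-partition scheme can be sketched as a partition-to-node map in which a new node steals whole partitions until it holds its fair share. The function `add_node` and the rule for choosing donors are assumptions for illustration; real systems choose which partitions to move in their own ways.

```python
from collections import Counter

def add_node(assignment: dict, new_node: str) -> dict:
    """Return a new partition -> node map after new_node takes its fair share of partitions."""
    assignment = dict(assignment)                  # do not mutate the caller's map
    load = Counter(assignment.values())            # partitions currently held by each old node
    fair_share = len(assignment) // (len(load) + 1)
    for _ in range(fair_share):
        donor = max(load, key=load.get)            # take from the most loaded old node
        partition = next(p for p, n in sorted(assignment.items()) if n == donor)
        assignment[partition] = new_node
        load[donor] -= 1
    return assignment

# 12 partitions spread over 3 nodes; the partition count never changes.
before = {p: f"node-{p % 3}" for p in range(12)}
after = add_node(before, "node-3")
moved = [p for p in before if before[p] != after[p]]
print(moved, Counter(after.values()))
```

Only the partitions handed to the new node move. Keys never change partition, so the key-to-partition mapping stays valid.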
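Dynamic partitioning
- A partition is split once it grows beyond a configured size.
- If a large amount of data is deleted from a partition, it is merged into an adjacent partition.
- In other words, the number of partitions adapts to the size of the dataset.
- HBase, RethinkDB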
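The splitting half of dynamic partitioning can be sketched with key-range partitions. The class `RangePartitions` is hypothetical: it uses a key count (`max_keys`) as a stand-in for a size limit in bytes, assumes distinct keys, and leaves out merging.

```python
import bisect

class RangePartitions:
    """Key-range partitions that split in two when one grows past max_keys."""

    def __init__(self, max_keys: int):
        self.max_keys = max_keys
        self.boundaries = []    # sorted lower bounds of every partition except the first
        self.partitions = [[]]  # each partition keeps its keys in sorted order

    def insert(self, key: str) -> None:
        # The number of boundaries <= key is the index of the owning partition.
        idx = bisect.bisect_right(self.boundaries, key)
        part = self.partitions[idx]
        bisect.insort(part, key)
        if len(part) > self.max_keys:
            mid = len(part) // 2
            # Split at the median key; the upper half becomes a new partition.
            self.boundaries.insert(idx, part[mid])
            self.partitions[idx:idx + 1] = [part[:mid], part[mid:]]

store = RangePartitions(max_keys=4)
for i in range(10):
    store.insert(f"k{i:02d}")
print(store.boundaries)
print(store.partitions)
```

The partition count starts at one and grows only as data arrives, which is the property the notes describe.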
Partitioning proportional to the number of nodes
- Cassandra, Ketama
- Close to consistent hashing
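A rough way to picture this approach is a hash ring where every node owns a fixed number of tokens, so the total number of partitions grows with the number of nodes. The class `HashRing` below is a simplified sketch under that assumption, not how Cassandra or Ketama are actually implemented.

```python
import bisect
import hashlib

def h(value: str) -> int:
    """Stable hash used for both tokens and keys (md5 purely for illustration)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")

class HashRing:
    def __init__(self, nodes: list, tokens_per_node: int = 64):
        self.tokens_per_node = tokens_per_node
        self.ring = []  # sorted list of (token, node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        """Place a fixed number of tokens for the node at pseudo-random ring positions."""
        for i in range(self.tokens_per_node):
            bisect.insort(self.ring, (h(f"{node}#{i}"), node))

    def node_for(self, key: str) -> str:
        """The owner is the node of the first token at or after the key's hash, wrapping around."""
        idx = bisect.bisect_right(self.ring, (h(key), "")) % len(self.ring)
        return self.ring[idx][1]

keys = [f"key-{i}" for i in range(10_000)]
ring = HashRing([f"node-{i}" for i in range(4)])
before = {k: ring.node_for(k) for k in keys}
ring.add_node("node-4")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(len(moved) / len(keys))
```

Every key that moves ends up on the new node, and the moved fraction is far smaller than with hash mod N.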
Rebalancing should combine automation with manual control

ref

Database of Databases
- https://dbdb.io/

yagikota commented 1 month ago

6.5 Request Routing

Nothing in particular to note.

yagikota commented 1 month ago

I want to read about DynamoDB: https://docs.aws.amazon.com/ja_jp/amazondynamodb/latest/developerguide/Introduction.html

yagikota / study