KH log: Studying BigTable

Friday, October 30, 2009

Studying BigTable

There is a Wikipedia article on this: http://ja.wikipedia.org/wiki/BigTable

Column-oriented DBMS
Large data is compressed.
Uses the Google File System.
Chubby lock service ...?
Used by many Google applications, such as MapReduce and Google Earth.

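(MapReduce seems to be used for things like maintaining the DB.)

"Google's reasons for developing its own database include cost, scalability, and better control of performance characteristics," apparently.
~~Each table is multidimensional. Fields hold a snapshot as of that point in time. Versioning is possible.~~
The structure is map: (key:string, column:string, timestamp:int64) -> string. It has versioning and GC features.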

Notes:

It is a single-master-server design, but the client library uses caching well to avoid degrading performance.

BigTable, too, seems to be co-designed with its client library.

"Co-design Application and File System API" (GFS)

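Failure-oblivious computing (ja.wikipedia.org)

GFS, BigTable, and MapReduce are said to apply this concept.
In GFS, concurrent writes and atomic appends would be examples, I think.

To get performance, the design gives up on providing only a fully consistent foundation and has the application side compensate, for example with a mechanism that uses checkpoints to check for problems.
Total cost, performance, a simple design that solves the problem at hand, and so on. Are the old aesthetics now relics?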

"re-examined traditional choices!"

That said, I feel Google is not so much oblivious to errors as good at taming them.

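Column-oriented DBMS (ja.wikipedia.org)

All data in a column has the same type. Fetching an entire row is slow.
It looks well suited to processing large amounts of data in bulk.

Impressions:

Well, I more or less get it, but I need to understand how it is used.
It seems to aim for a "simple but high-performance foundation" rather than an "all-purpose foundation".
It is probably the cost-optimal answer to real-world problems.

I read the phrase "infinite information" somewhere; the "real-world problem" is probably handling infinite information well. Or world domination.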

Oh, there is a paper on this too ( bigtable-osdi06.pdf ) ...
Introduction:

distributed storage system for managing structured data.
Bigtable does not support a full relational data model

simple data model, that supports dynamic control over data layout and format, and allows clients to reason about the locality ... in the underlying storage!!!

Data is indexed using row and column names that can be arbitrary strings.
treat data as uninterpreted strings.
Bigtable schema parameters let clients dynamically control whether to serve data out of memory or from disk!!!

Data Model:

A Bigtable is a sparse, distributed, persistent multi-dimensional sorted map.

map: (row:string, column:string , timestamp:int64) --> string
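To make the map signature concrete for myself, here is a minimal in-memory sketch. TinyTable and its method names are made up for this note and are not Bigtable's API; it ignores distribution, persistence, and sparseness tricks entirely.

```python
from collections import defaultdict


class TinyTable:
    """Toy model of (row, column, timestamp) -> value. Hypothetical, not Bigtable's API."""

    def __init__(self):
        # row -> column -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        self._rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (newest overall if None)."""
        versions = self._rows.get(row, {}).get(column, {})
        candidates = [ts for ts in versions if timestamp is None or ts <= timestamp]
        if not candidates:
            return None
        return versions[max(candidates)]

    def scan(self, start_row, end_row):
        """Yield cells with start_row <= row < end_row, rows and columns in sorted order."""
        for row in sorted(self._rows):
            if start_row <= row < end_row:
                for column in sorted(self._rows[row]):
                    # Newest version first within a cell.
                    for ts in sorted(self._rows[row][column], reverse=True):
                        yield row, column, ts, self._rows[row][column][ts]
```

Sorting on every scan is wasteful; the point is only that rows come back in lexicographic order and each cell keeps several timestamped versions.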

We settled ... after examining a variety of potential uses of a Bigtable-like system!!

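e.g. Webtable

Row keys: typically 10-100 bytes, up to 64KB. Reads and writes under a single row key are atomic!!

Each row range is called a tablet. Reads of short row ranges are efficient. Rows are kept in lexicographic order.

e.g. keys like com.google.maps/* keep pages of the same domain closer together than maps.google.com/* would, which is nicer.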
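A small sketch of that hostname-reversal trick. The function name row_key_for_url is my own invention for illustration; the only idea taken from the paper is reversing the hostname components so that a plain lexicographic sort groups a domain's pages together.

```python
def row_key_for_url(host, path="/"):
    """Build a row key with the hostname components reversed (hypothetical helper)."""
    reversed_host = ".".join(reversed(host.split(".")))
    return reversed_host + path


hosts = ["maps.google.com", "www.cnn.com", "mail.google.com", "www.google.com"]

# Plain hostnames: the google.com pages are scattered around www.cnn.com.
print(sorted(hosts))

# Reversed hostnames: all com.google.* keys end up next to each other.
print(sorted(row_key_for_url(h) for h in hosts))
```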

Column Families:

basic unit of access control. read/write/? for different types of apps.
Data in a family is usually of the same type.
compress data in the same column family.
A column key belongs to a column family. The number of column families should be kept small: in the hundreds at most, and they rarely change.

In contrast, ... unbounded number of columns!!

column key syntax: family:qualifier
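As a quick illustration of the family:qualifier syntax, a sketch that splits a column key and groups keys by family. Both function names are hypothetical, and I am assuming the qualifier may be empty (as in "contents:") while the family may not.

```python
from collections import defaultdict


def split_column_key(column_key):
    """Split 'family:qualifier' at the first colon; the qualifier may be empty."""
    family, sep, qualifier = column_key.partition(":")
    if not sep or not family:
        raise ValueError(f"not a family:qualifier column key: {column_key!r}")
    return family, qualifier


def group_by_family(column_keys):
    """Group qualifiers under their family, the unit of access control and compression."""
    groups = defaultdict(list)
    for key in column_keys:
        family, qualifier = split_column_key(key)
        groups[family].append(qualifier)
    return dict(groups)
```

The families are few and fixed, while the qualifiers under each family can grow without bound.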

Timestamps:

Real time in microseconds, assigned by Bigtable, or explicitly assigned by the client.

collision avoidance (applications must generate unique timestamps themselves)

GC: the last N versions, new-enough versions etc.
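The two GC policies can be sketched as plain functions over one cell's versions. The names and the dict representation (timestamp -> value) are assumptions for this note; how Bigtable actually applies the per-column-family settings is not shown here.

```python
def keep_last_n(versions, n):
    """Keep only the n newest versions of a cell (versions: dict timestamp -> value)."""
    newest = sorted(versions, reverse=True)[:n]
    return {ts: versions[ts] for ts in newest}


def keep_newer_than(versions, now_us, max_age_us):
    """Keep only versions written within the last max_age_us microseconds."""
    cutoff = now_us - max_age_us
    return {ts: value for ts, value in versions.items() if ts >= cutoff}
```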

API:

allows cells to be used as integer counters
supports the execution of client-supplied scripts!!! in the address spaces of the servers. ... written in Sawzall [28]
can be used with MapReduce, a framework for running large-scale parallel computations developed at Google.

Building blocks:

GFS
SSTable file format provides a persistent, ordered immutable map from keys to values.
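To picture "ordered immutable map", a toy in-memory stand-in. ToySSTable is a made-up name; the real SSTable is an on-disk file format with blocks and an index, none of which is modeled here. The sketch only shows the two operations that ordering makes cheap: point lookup and range scan via binary search.

```python
import bisect


class ToySSTable:
    """Immutable sorted key -> value map (in-memory stand-in, not the real file format)."""

    def __init__(self, items):
        pairs = sorted(dict(items).items())
        # Tuples, so the contents cannot be modified after construction.
        self._keys = tuple(k for k, _ in pairs)
        self._values = tuple(v for _, v in pairs)

    def get(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None

    def scan(self, start, end):
        """Return all (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return list(zip(self._keys[lo:hi], self._values[lo:hi]))
```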
distributed lock service called Chubby[8]

Here too the client seems to be a bit smart. Unavailability was 0.0047%... wow.

Implementation:

client library, one master server, many tablet servers.
Tablet Location

B+-tree
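A rough sketch of why a sorted index plus a client-side cache keeps the master out of the lookup path. Everything here is simplified and hypothetical: a single flat sorted list stands in for the B+-tree-like hierarchy, and TabletLocator / CachingClient are names I made up. Cache invalidation when tablets move is not modeled.

```python
import bisect


class TabletLocator:
    """Maps a row key to the tablet (row range) that contains it. Hypothetical."""

    def __init__(self, tablets):
        # tablets: iterable of (start_row, server); the first tablet starts at "".
        pairs = sorted(tablets)
        if not pairs or pairs[0][0] != "":
            raise ValueError("the first tablet must start at the empty row key")
        self._starts = [start for start, _ in pairs]
        self._servers = [server for _, server in pairs]
        self.lookups = 0  # counts how often the index is actually consulted

    def locate(self, row):
        self.lookups += 1
        i = bisect.bisect_right(self._starts, row) - 1
        end = self._starts[i + 1] if i + 1 < len(self._starts) else None
        return self._starts[i], end, self._servers[i]


class CachingClient:
    """Remembers tablet ranges it has seen so repeated lookups skip the index."""

    def __init__(self, locator):
        self._locator = locator
        self._cached = []  # list of (start_row, end_row_or_None, server)

    def server_for(self, row):
        for start, end, server in self._cached:
            if start <= row and (end is None or row < end):
                return server
        start, end, server = self._locator.locate(row)
        self._cached.append((start, end, server))
        return server
```

Once a range is cached, any other row in the same tablet is resolved locally, which is the effect the notes above attribute to the client library.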

Tablet Assignment, Tablet serving, Compactions

Refinements:

locality group
compression
Caching for read performance.
Bloom filter
Commit-log impl.
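Of the refinements listed above, the Bloom filter is the easiest to sketch: a compact bit array that can answer "definitely not present" without touching disk. This is a generic textbook version, not Bigtable's implementation; the sizes and the use of SHA-256 as the hash are arbitrary choices of mine.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self._bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self._bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        """False means the key was never added; True means it may have been."""
        return all(
            self._bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )
```

A read for a row/column pair that the filter rejects can skip that SSTable entirely, which is where the saving comes from.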

Fault tolerance, I see...

Speeding up tablet recovery
Exploiting immutability

effective concurrency control!!

Performance Evaluation
Real Applications
Lessons:

vulnerable to many types of failures.

memory and network corruption, large clock skew, hung machines, extended and asymmetric network partitions, bugs in other systems that we are using, overflow of GFS quotas, planned and unplanned HW maintenance.
checksumming for RPC
stopped assuming a given Chubby operation could return only one of a fixed set of errors.
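The RPC checksumming lesson, sketched with the standard library. The framing (4-byte big-endian CRC32 in front of the payload) and the function names are my own assumptions; the paper only says checksumming was added to the RPC mechanism.

```python
import struct
import zlib


def frame(payload: bytes) -> bytes:
    """Prefix the payload with its CRC32 (hypothetical wire format)."""
    return struct.pack(">I", zlib.crc32(payload)) + payload


def unframe(message: bytes) -> bytes:
    """Verify the CRC32 prefix and return the payload, or raise ValueError."""
    if len(message) < 4:
        raise ValueError("message too short to contain a checksum")
    (expected,) = struct.unpack(">I", message[:4])
    payload = message[4:]
    if zlib.crc32(payload) != expected:
        raise ValueError("checksum mismatch: payload corrupted in memory or on the wire")
    return payload
```

A flipped bit that would otherwise pass silently through the system now surfaces as an error at the receiver.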

it is important to delay adding new features until it is clear how the new features will be used. That way you can tell which features are really needed.
importance of proper system-level monitoring (Bigtable itself, as well as the clients)
simple designs

Drop complex algorithms that rely on immature features and keep it simple. Otherwise you lose track of what you are debugging.

Related Work:

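Impressions:

It is well made. Or rather than well made, it sounds exhausting, like it would shorten your life.

Here too it is a battle against failures, or rather coexistence with them? It has been toughened up, down to the cost of maintenance.
Co-design is possible precisely because they do both the application and the DB.
So MapReduce is used behind the scenes for BigTable...
I am starting to see how it is used.

It makes sense at this point that GFS has no cache. Whether data sits in memory or not greatly affects application performance, and it has to be controllable through the application's design. BigTable gains that kind of controllability through its simple, reasonable behavior.
I did not take notes on it, but the Real Applications section is quite interesting.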

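Talk along the lines of "if only one could operate a large data center efficiently..." sounds ridiculous now.

I also see claims that large companies are better off owning their own data centers, but I wonder. Considering the growth (and falling cost) of global data centers against the offsetting advantages of local data centers and their growth curve, it may indeed be possible. At the very least, outsourcing management and adopting battle-tested software seem necessary. Rather than a qualitative argument about which is the better deal, you have to look quantitatively at what it actually costs; I suspect the cost difference due to the level of technology and operations is huge. I still do not really understand it. More study needed...

Also, reading only Wikipedia left me with some misunderstandings. Glad I did not skip the paper.
Also, there was a video of a talk... It did not seem to have any new information.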
