[Solved] How to understand the concept of wide row and related concepts in Cassandra? [closed]

Question

Are a wide row and a partition synonyms?

A partition and a (storage-level) row can be treated as synonyms. A "wide row" is the situation where the chosen partition key produces a very large number of cells under a single key. For example, if a table holds every person in a country and the partition key is the city, there is one row per city and every person in that city is a cell in that row. For a large metropolitan city, this produces a wide row. Another example is sensor data received every few seconds with sensorId as the partition key, which accumulates a huge number of cells after a few years.
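To put a rough number on the sensor example, here is a small back-of-the-envelope sketch. The 5-second interval is an assumption standing in for "every few seconds", and the function name is made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # ignores leap years


def readings_in_partition(interval_seconds: int, years: int) -> int:
    """Readings that pile up under ONE sensorId partition key."""
    return (SECONDS_PER_YEAR * years) // interval_seconds


# Assumed interval: one reading every 5 seconds.
for years in (1, 3, 5):
    print(years, "year(s):", readings_in_partition(5, years), "readings")
```

With these assumed numbers a single sensor already reaches several million readings per year, and every one of them lives in the same partition.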

Since a partition key is for a wide row, why are there multiple “rows”
(does “rows” here mean “wide rows”)?

Same as above.

How does the partition key “determine the nodes on which rows are
stored”?

A hash (token) is generated from the partition key (Murmur3, via Murmur3Partitioner, is the default), and each node in Cassandra is responsible for a range of hash values. For example, if the hash of the partition key value is 20 and Node1 is responsible for the range 1 to 100, then that partition resides on Node1.
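The range lookup can be sketched in a few lines of Python. This is a toy model only: the three-node ring and its ranges are invented, and `toy_token` uses MD5 from the standard library as a stand-in because Murmur3 is not available there. It also ignores replication, virtual nodes and ring wrap-around.

```python
import bisect
import hashlib

# Hypothetical ring: each node owns tokens up to and including its upper bound,
# so Node1 owns 1..100, Node2 owns 101..200, Node3 owns 201..300.
RING = [(100, "Node1"), (200, "Node2"), (300, "Node3")]
UPPER_BOUNDS = [upper for upper, _ in RING]
RING_SIZE = 300


def toy_token(partition_key: str) -> int:
    """Stand-in hash mapping a key into 1..RING_SIZE (NOT Murmur3)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % RING_SIZE + 1


def node_for_token(token: int) -> str:
    """Find the node whose range contains the token."""
    index = bisect.bisect_left(UPPER_BOUNDS, token)
    return RING[index][1]


print(node_for_token(20))                 # the example from the answer
print(node_for_token(toy_token("test")))  # the same key always maps to the same node
```

The point to notice is that the node is derived purely from the partition key value, so any client or node can work out where a partition lives without a lookup table.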

How can a partition key be used for “each partition is uniquely
identified by a partition key”?

As explained above, the partition key decides which node the data resides on. The data can be thought of as a huge map, and a map can only have unique keys, so each partition key value identifies exactly one partition.

What is a clustering column, for example, what are the clustering
columns in the figure?

Consider a table created with CREATE TABLE test (a text, b int, c text, PRIMARY KEY (a, b)). Here a is the partition key and b is the clustering column. In the attached figure, the clustering key is the clustering column, and the whole enclosing box is a cell.

How do the clustering columns “control how data is sorted for storage
within a partition”?

Cassandra sorts the data within each partition by column b (in the example table above) in ascending order. This can be changed to descending with the WITH CLUSTERING ORDER BY (b DESC) table option.

INSERT INTO test (a, b, c) VALUES ('test', 2, 'test2');
INSERT INTO test (a, b, c) VALUES ('test', 1, 'test1');
INSERT INTO test (a, b, c) VALUES ('test-new', 1, 'test1');

If you run the above statements in this order, Cassandra stores the data in the following order (the actual data representation contains much more than this; just look at the order of column b):

test -> [b:1,c=test1] [b:2,c=test2]
test-new -> [b:1,c=test1]
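The same behaviour can be imitated with a small in-memory model. It is only a sketch of the idea (a map from partition key to entries kept sorted by b), not how Cassandra actually lays data out on disk, and the `insert` helper is hypothetical.

```python
import bisect
from collections import defaultdict

# partition key a -> list of (b, c) tuples, kept sorted by clustering column b
partitions = defaultdict(list)


def insert(a: str, b: int, c: str) -> None:
    """Place the new entry at its sorted position inside partition a."""
    bisect.insort(partitions[a], (b, c))


# Same order as the INSERT statements above: b=2 is written before b=1.
insert("test", 2, "test2")
insert("test", 1, "test1")
insert("test-new", 1, "test1")

for key, rows in partitions.items():
    print(key, "->", rows)
# test -> [(1, 'test1'), (2, 'test2')]
# test-new -> [(1, 'test1')]
```

The insertion order does not matter; within each partition the entries come back ordered by b.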

A partition is a synonym of a wide row, what does it mean by “the rows
within a partition”?

The clustering column is used to identify cells ("cell" is a better term than "row") within a partition. For example, SELECT * FROM test WHERE a='test' AND b=1 picks up the cell with b:1 for partition key test. (String literals in CQL take single quotes; double quotes denote identifiers.)
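One way to picture this is a two-level lookup: the partition key selects the partition, then the clustering value selects the entry inside it. The sketch below is a simplified model with hypothetical `upsert` and `select` helpers. It also shows that writing the same (a, b) pair again replaces the existing entry instead of adding a second one, which is how a CQL INSERT on an existing primary key behaves.

```python
# Model: partition key a -> {clustering value b -> remaining columns}
table = {}


def upsert(a, b, c):
    """Write the entry identified by (a, b); an existing entry is overwritten."""
    table.setdefault(a, {})[b] = {"c": c}


def select(a, b):
    """Mimics SELECT * FROM test WHERE a=... AND b=... (None if absent)."""
    return table.get(a, {}).get(b)


upsert("test", 2, "test2")
upsert("test", 1, "test1")
upsert("test-new", 1, "test1")

print(select("test", 1))       # {'c': 'test1'}
upsert("test", 1, "changed")   # same (a, b): replaces, does not duplicate
print(select("test", 1))       # {'c': 'changed'}
print(len(table["test"]))      # 2
```

Because (a, b) together form the key, b=1 can exist in both the test and test-new partitions without conflict.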

How “the clustering keys are used to uniquely identify the rows within
a partition”?

The answer above should explain this as well.
