Cassandra Table Optimization Following a Generic Schema Technique
In the relational database world, tables are usually normalized using the primary key-foreign key concept. In the NoSQL Cassandra world, that concept does not exist, and data is completely denormalized in order to support the distributed nature of a NoSQL database.
In practice, this generally means roughly one table per query pattern. In order to support multiple query patterns, multiple tables are needed.
To see it from a different perspective, each table should pre-build the “answer” to a high-level query that an application needs to support. Supporting different types of answers requires different tables. This is how the Cassandra NoSQL database is optimized for reads. One point to note here is that with this approach, facilitating 5K distinct answers might require 5K tables in the Cassandra database.
Needless to say, with such a data modeling approach, systems can end up with too many tables. Having too many tables in a Cassandra cluster can cause substantial problems: degraded read/write latency, a maintenance nightmare, data going out of sync, and so on.
Now we might ask ourselves: why do we even need to consider 5K questions in the first place? Surely the API should be more compact. Here comes the twist. In some application migration cases, C* may not be the source of truth, i.e. the application is not writing to C* directly but only reading from it, with data arriving in Cassandra via a queuing or batch architecture. Or in some cases, the infrastructure might have to support both existing applications running against the old database and new applications running against the new C* database.
Let’s explore a scenario now.
In the traditional relational world, tables are normalized to at least 3NF, so we may see many tables. Suppose an application has 400 unique reference tables (data that does not change hourly or daily) along with a few hundred transactional tables in the relational world. We can’t flatten them out in the C* world because they are not related to each other. If the chosen path were to migrate all 400 reference tables as-is to Cassandra and then answer the related 400+ questions from the API side, the table count would max out on just this subset of reference tables.
In Cassandra, each table should pre-build the “answer” to a high-level query or API call. Hence migrating 400 reference tables as-is, following the common/traditional Cassandra data model best practices, would quickly hit the practical upper bound on the number of tables in Cassandra.
Given the above, proceeding with the traditional Cassandra data model was clearly not an option.
So is there an optimized way to migrate these reference tables that not only answers the 400+ questions against them, but also avoids creating too many tables at the same time?
RDBMS: Suppose there are 4 tables — 2 transactional and 2 reference.
The transactional tables are CUSTOMER and ORDER. The CUSTOMER table holds customer information keyed by customer id, and the ORDER table holds placed orders keyed by order id, with a relationship to a customer.
The reference tables are PRODUCT_INFO and REBATES. They are completely segregated tables, storing two different types of information and hence answering two different sets of questions:
The PRODUCT_INFO table holds product and feature information. It answers “give me feature id(s) based on product id(s)” and vice versa.
The REBATES table holds rebate codes and rebate-related information. It answers “give me all rebate-related information based on rebate id.”
We are going to explore the design of the reference tables further.
Approaches:
Traditional data-model approach: Create/migrate to Cassandra based on pre-built answers.
Reference Table #1: FEATURE_INFO_BY_PRODUCT_ID
Note: K = Partition key, C = Clustering key
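As a minimal CQL sketch, consistent with the queries below (the column types, and the FEAT_DESC payload column, are assumptions for illustration):

CREATE TABLE FEATURE_INFO_BY_PRODUCT_ID (
    PROD_ID   text,   -- K: one partition per product
    FEAT_ID   text,   -- C: one row per feature within the partition
    FEAT_DESC text,   -- hypothetical non-key payload column
    PRIMARY KEY (PROD_ID, FEAT_ID)
);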
Questions answered from the table defined above in Cassandra after migration:
Question: Give me feature information based on product id?
Answer via Cassandra Query: SELECT FEAT_ID FROM FEATURE_INFO_BY_PRODUCT_ID WHERE PROD_ID = 'P1234';
We need one more rendition of the PRODUCT_INFO table in Cassandra, since we also need product id information based on FEAT_ID.
Reference Table #2: PRODUCT_INFO_BY_FEATURE_ID
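A corresponding hedged sketch, simply inverting the key order of the previous table:

CREATE TABLE PRODUCT_INFO_BY_FEATURE_ID (
    FEAT_ID text,   -- K: one partition per feature
    PROD_ID text,   -- C: one row per product carrying that feature
    PRIMARY KEY (FEAT_ID, PROD_ID)
);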
Question: Give me product information based on feature id?
Answer via Cassandra Query: SELECT PROD_ID FROM PRODUCT_INFO_BY_FEATURE_ID WHERE FEAT_ID = 'F1234';
Reference Table #3: REBATES_BY_REBATE_ID
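Again as a hedged sketch; the key layout mirrors the WHERE clause of the query below, and the MAX_OFFER_COUNT type is an assumption:

CREATE TABLE REBATES_BY_REBATE_ID (
    REBATE_ID       text,   -- K: composite partition key, part 1
    REBATE_TYPE_CD  text,   -- K: composite partition key, part 2
    REGION_CD       text,   -- C: clustering key
    MAX_OFFER_COUNT int,    -- non-key payload column
    PRIMARY KEY ((REBATE_ID, REBATE_TYPE_CD), REGION_CD)
);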
Questions answered from the table defined above in Cassandra after migration:
Question: Give me rebate information based on rebate_id, rebate_type and region?
Answer via Cassandra Query: SELECT REBATE_ID, REBATE_TYPE_CD, MAX_OFFER_COUNT FROM REBATES_BY_REBATE_ID WHERE REBATE_ID = 'R-123' AND REBATE_TYPE_CD = 'R-CD-123' AND REGION_CD = 'W';
Proposed approach:
A generic table structure was proposed to accommodate any number of reference tables, based on immediate need and actual requirements. With this approach, 2 tables were created:
1. A main reference table to hold the actual data with columns, datatypes etc.
2. A METADATA_REFERENCE table to hold metadata about the main reference table.
The main reference table contained as many Cassandra partition key and clustering key columns as the requirements called for. For the sake of simplicity, let’s assume 3 partition key columns, 3 clustering key columns and 4 non-key columns. Since the idea is to have a generic table, each column gets a generic name.
Let’s visualize the Cassandra generic table structure:
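A plausible CQL rendering might look like the following sketch. The SOURCE_TABLE/PKn_VALUE/CKn_VALUE/NKn_VALUE names match the queries used later in this article; every column is typed text, since values from many differently typed source tables must share the same columns:

CREATE TABLE REFERENCE_TABLE (
    SOURCE_TABLE text,   -- 1st partition key: source RDBMS table name
    PK1_VALUE    text,   -- 2nd partition key: source table's primary key value
    PK2_VALUE    text,   -- 3rd partition key: composite key value, 'NULL' if unused
    CK1_VALUE    text,   -- clustering keys: additional key or ordering columns,
    CK2_VALUE    text,   --   with the string 'NULL' as the default when unused
    CK3_VALUE    text,
    NK1_VALUE    text,   -- generic non-key (payload) columns
    NK2_VALUE    text,
    NK3_VALUE    text,
    NK4_VALUE    text,
    PRIMARY KEY ((SOURCE_TABLE, PK1_VALUE, PK2_VALUE), CK1_VALUE, CK2_VALUE, CK3_VALUE)
);

Storing everything as text is the main trade-off of the generic design: original type information survives only in the metadata table described below.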
Column descriptions:
First partition key: This column holds the name of the source RDBMS table from which the actual data is migrating. It helps the API side correctly identify the source table.
Second partition key: This column is mapped to the source RDBMS table’s primary key value.
Third partition key: This column is mapped to the remaining column of the source RDBMS table’s composite primary key and helps keep each row in Cassandra unique. It is useful when the source table has multiple key columns. If it is not needed, the string ‘NULL’ is used as a default value.
First clustering key: If more key columns exist in the RDBMS table or ordering is required, this clustering key can be used. If it is not needed, the string ‘NULL’ is used as a default value.
Second clustering key: Serves the same purpose as the first clustering key, with the string ‘NULL’ as a default when unused.
Third clustering key: Serves the same purpose as the first clustering key, with the string ‘NULL’ as a default when unused.
First non-key column: This is a non-primary-key (payload) column.
Second non-key column: This is a non-primary-key column.
Third non-key column: This is a non-primary-key column.
Fourth non-key column: This is a non-primary-key column.
Now let’s talk about the metadata reference table structure.
METADATA_REFERENCE: Since the main reference table was designed with generic column names, a second table was still needed to store the actual column name mapped to each generic column, per source table. This table contains the actual column name information for the main reference table.
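As a minimal sketch, assuming one row per (source table, generic column) pair so the API can fetch the whole mapping with a single partition read (the SOURCE_TYPE column is an assumption):

CREATE TABLE METADATA_REFERENCE (
    SOURCE_TABLE   text,   -- partition key: source RDBMS table name
    GENERIC_COLUMN text,   -- clustering key: 'PK1_VALUE', 'CK1_VALUE', 'NK1_VALUE', ...
    ACTUAL_COLUMN  text,   -- original column name, e.g. 'PROD_ID'
    SOURCE_TYPE    text,   -- original RDBMS data type, e.g. 'VARCHAR2(10)'
    PRIMARY KEY (SOURCE_TABLE, GENERIC_COLUMN)
);

-- The API resolves generic names once per source table:
SELECT GENERIC_COLUMN, ACTUAL_COLUMN FROM METADATA_REFERENCE
WHERE SOURCE_TABLE = 'FEATURE_INFO_BY_PRODUCT_ID';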
Now that we are aware of the generic table structures, let’s migrate our PRODUCT_INFO and REBATES tables into this generic format.
Reference Table #1 (with new format): Table FEATURE_INFO_BY_PRODUCT_ID migrated into the REFERENCE_TABLE:
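With hypothetical sample values, migrating one source row (PROD_ID = 'P1234', FEAT_ID = 'F1234') might look like this; FEAT_ID lands in CK1_VALUE and the unused key slots take the string 'NULL':

INSERT INTO REFERENCE_TABLE (SOURCE_TABLE, PK1_VALUE, PK2_VALUE, CK1_VALUE, CK2_VALUE, CK3_VALUE)
VALUES ('FEATURE_INFO_BY_PRODUCT_ID', 'P1234', 'NULL', 'F1234', 'NULL', 'NULL');

-- Matching metadata rows let the API translate the generic names back:
INSERT INTO METADATA_REFERENCE (SOURCE_TABLE, GENERIC_COLUMN, ACTUAL_COLUMN)
VALUES ('FEATURE_INFO_BY_PRODUCT_ID', 'PK1_VALUE', 'PROD_ID');
INSERT INTO METADATA_REFERENCE (SOURCE_TABLE, GENERIC_COLUMN, ACTUAL_COLUMN)
VALUES ('FEATURE_INFO_BY_PRODUCT_ID', 'CK1_VALUE', 'FEAT_ID');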
If the same question is answered again following the new approach:
Question: Give me feature information based on product id?
Answer via Cassandra Query: SELECT CK1_VALUE FROM REFERENCE_TABLE WHERE SOURCE_TABLE = 'FEATURE_INFO_BY_PRODUCT_ID' AND PK1_VALUE = 'P1234' AND PK2_VALUE = 'NULL'; (only generic column names exist in REFERENCE_TABLE; the metadata tells the API that CK1_VALUE holds FEAT_ID for this source table)
Reference Table #2 (with new format): Table PRODUCT_INFO_BY_FEATURE_ID migrated into the REFERENCE_TABLE:
If the same question is answered again following the new approach:
Question: Give me product information based on feature id?
Answer via Cassandra Query: SELECT CK1_VALUE FROM REFERENCE_TABLE WHERE SOURCE_TABLE = 'PRODUCT_INFO_BY_FEATURE_ID' AND PK1_VALUE = 'F1234' AND PK2_VALUE = 'NULL'; (here the metadata maps CK1_VALUE to PROD_ID)
Reference Table #3 (with new format): Table REBATES_BY_REBATE_ID migrated into the REFERENCE_TABLE:
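Here PK2_VALUE and CK1_VALUE carry real values rather than the 'NULL' default, again with hypothetical sample values; note that MAX_OFFER_COUNT is stored as the string '5' because every generic column is typed text:

INSERT INTO REFERENCE_TABLE (SOURCE_TABLE, PK1_VALUE, PK2_VALUE, CK1_VALUE, CK2_VALUE, CK3_VALUE, NK1_VALUE)
VALUES ('REBATES_BY_REBATE_ID', 'R-123', 'R-CD-123', 'W', 'NULL', 'NULL', '5');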
Question: Give me rebate information based on rebate_id, rebate_type and region?
Answer via Cassandra Query: SELECT PK1_VALUE, PK2_VALUE, NK1_VALUE FROM REFERENCE_TABLE WHERE SOURCE_TABLE = 'REBATES_BY_REBATE_ID' AND PK1_VALUE = 'R-123' AND PK2_VALUE = 'R-CD-123' AND CK1_VALUE = 'W'; (per the metadata, these generic columns map to REBATE_ID, REBATE_TYPE_CD and MAX_OFFER_COUNT)
With this solution, the migration of 450+ reference tables into a single Cassandra table became possible. This reference table was a generic table with 3 generic partition keys, 3 generic clustering keys and 4 generic non-key columns, along with some mandatory columns.
The proposed approach addresses the disadvantages of the traditional one-table-per-“answer” data model by providing a novel schema-based optimization technique. With this approach, the number of tables required in Cassandra is substantially reduced.
Note: We avoided secondary indexes, SASI, materialized views (MV) and DSE Search.