Understanding Cassandra Data Modelling: Partition and Clustering Keys

Let’s get started.

Overview of Cassandra

Cassandra is a NoSQL database originally developed at Facebook. It is a great database that lets you run queries effectively on large amounts of structured and semi-structured data.

To know when Cassandra should be your database of choice, you need an understanding of the CAP theorem.

The CAP theorem states that it is impossible for a distributed system to satisfy all three of the following guarantees at the same time; you have to give one up.

C is Consistency: every node sees the same data at the same time.

P is Partition tolerance: the system continues to operate even when network failures split the cluster.

A is Availability: every request receives a response.

In Cassandra, availability and partition tolerance are considered more important than consistency. However, you can also tune consistency with the replication factor and the consistency level to meet C when you need it.
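For example, in cqlsh you can set the consistency level for the current session; a minimal sketch (QUORUM requires a majority of replicas to acknowledge each read or write, so it only matters once your replication factor is greater than 1):

CONSISTENCY QUORUM;
CONSISTENCY;

Running CONSISTENCY with no argument prints the level currently in effect.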

Imagine Facebook: it is not the end of the world if someone is not shown the last message from one second ago, but you cannot at all tolerate losing data or the system being unavailable.

Cassandra doesn’t have a master node, so there is no single node whose failure can take down the whole cluster. Data is distributed evenly across the nodes based on partition keys.

Data Modelling in Cassandra

In Cassandra the terminology differs a bit from relational databases: the whole database is called a keyspace, and rather than tables you have column families, partitioned by row key and sorted by column name/key.

Data in Cassandra is not structured in flat tables but rather like a nested map of maps:

Map<PartitionKey, SortedMap<ColumnKey, ColumnValue>>

A Cassandra column family is a map of sorted maps: the data is partitioned by partition key, and within each partition it is sorted by column key with the associated values.
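To make the nested-map picture concrete, here is how two rows of the users table we create below would look in that model (a conceptual sketch with illustrative values, not Cassandra’s literal storage format):

-- Partition key (user_id) -> sorted map of column name -> value
-- users[1] -> { age: 24, gender: 'M', occupation: 'technician', zip: '85711' }
-- users[2] -> { age: 53, gender: 'F', occupation: 'other', zip: '94043' }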

1. Create a keyspace

Before creating tables and digging into keys and other data modelling concepts in Cassandra, let us create a keyspace:

CREATE KEYSPACE movielens
WITH REPLICATION = {
'class' : 'SimpleStrategy',
'replication_factor' : 1
};
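SimpleStrategy with a replication factor of 1 is fine for a single-node development setup, but it keeps only one copy of each row. On a production cluster you would typically use NetworkTopologyStrategy instead; a sketch, where the datacenter name 'datacenter1' is an assumption that depends on your snitch configuration:

CREATE KEYSPACE movielens
WITH REPLICATION = {
'class' : 'NetworkTopologyStrategy',
'datacenter1' : 3
};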

After the keyspace has been created, we need to start using it so we can create tables in it. This command specifies which keyspace to use:

USE movielens;
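As a quick check, cqlsh’s DESCRIBE commands confirm that the keyspace exists and, later, list the tables you create in it:

DESCRIBE KEYSPACE movielens;
DESCRIBE TABLES;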

2. Understanding the components of the Cassandra data model

Primary Key

The primary key is responsible for distributing data between the nodes: the partition key (the first part of the primary key) determines which node stores a given row.


In this example, I am creating a table for users in the MovieLens database, using user_id as the primary key.

CREATE TABLE users (user_id int, age int, gender text, occupation text, zip text, PRIMARY KEY (user_id));
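With user_id as the partition key, a lookup by user_id is routed straight to the partition that owns it. A small sketch with illustrative values (the token() function shows the hash Cassandra uses for placement):

INSERT INTO users (user_id, age, gender, occupation, zip)
VALUES (1, 24, 'M', 'technician', '85711');
SELECT * FROM users WHERE user_id = 1;
SELECT token(user_id), user_id FROM users;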

Compound Primary Keys

A compound primary key uses multiple columns, but partitioning is still done by the first column (the partition key); the remaining columns are clustering keys, used to sort the data within each partition. The choice of key depends entirely on how you want to query the table.

For example, in the MovieLens dataset above, if you’re looking for people in a particular zip code, it doesn’t make sense to have the user ID as the partition key, because the data will be spread across all the nodes. Instead, if we use a compound key with zip code and user ID, all the people with the same zip code end up in the same partition, so we can find them easily and then use the user ID to locate a specific user. Clustering keys are sorted in ascending order by default, but we can change this using ‘CLUSTERING ORDER BY’.

CREATE TABLE users (user_id int, age int, gender text, occupation text, zip text, PRIMARY KEY (zip, user_id));
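With this compound key, a query for a zip code touches a single partition, and the rows within it come back sorted by user_id (since the table name users is reused here, you would need to drop the earlier definition before re-creating it). An illustrative sketch:

SELECT * FROM users WHERE zip = '85711';
SELECT * FROM users WHERE zip = '85711' AND user_id = 1;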

Composite Keys

Now, in our example above, let’s say we have 10,000 users in a particular zip code; you end up with a single partition holding 10,000 rows on one node.

To solve this you can use composite partition keys. These combine multiple columns into the partition key, and the partitioner hashes the combined values to place the data. For example, we can use zip code and age together as the partition key and further break up the 10,000 users from a particular zip code.

CREATE TABLE users (user_id int, age int, gender text, occupation text, zip text, PRIMARY KEY ((zip, age), user_id))
WITH CLUSTERING ORDER BY (user_id DESC);

When issuing a CQL query, you must include all partition key columns at a minimum. You can then filter further by adding clustering keys, in the order in which they appear in the primary key definition.
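Against the composite-key table above, that rule plays out as follows (the values are illustrative):

-- Valid: both partition key columns are supplied
SELECT * FROM users WHERE zip = '85711' AND age = 24;
-- Valid: the full partition key plus a filter on the clustering key
SELECT * FROM users WHERE zip = '85711' AND age = 24 AND user_id > 100;
-- Invalid: the age part of the partition key is missing
-- SELECT * FROM users WHERE zip = '85711';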

Setting up a Cassandra database in Google Cloud

Rather than setting up Cassandra on my local system, I used Google Cloud services to deploy a Cassandra cluster.

You can run CQL queries from the command line by starting the CQL shell:

cqlsh

Integrating PySpark with Cassandra

Rather than working from the command line, another way to use Cassandra is to connect it to Apache Spark. This way you can read data from Cassandra as a DataFrame for processing, and store processed DataFrames back in Cassandra.

The code I am sharing below is from the Udemy course I used to learn the basics of big data.

 
 

from pyspark.sql import SparkSession
from pyspark.sql import Row

def parseInput(line):
    # Split one pipe-delimited line of u.user into typed fields
    fields = line.split('|')
    return Row(user_id = int(fields[0]), age = int(fields[1]), gender = fields[2], occupation = fields[3], zip = fields[4])

if __name__ == "__main__":
    # Create a SparkSession
    spark = SparkSession.builder.appName("CassandraIntegration").config("spark.cassandra.connection.host", "127.0.0.1").getOrCreate()

    # Get the raw data
    lines = spark.sparkContext.textFile("hdfs:///location/u.user")
    # Convert it to a RDD of Row objects with (userID, age, gender, occupation, zip)
    users = lines.map(parseInput)
    # Convert that to a DataFrame
    usersDataset = spark.createDataFrame(users)

    # Write it into Cassandra
    usersDataset.write\
        .format("org.apache.spark.sql.cassandra")\
        .mode('append')\
        .options(table="users", keyspace="movielens")\
        .save()

    # Read it back from Cassandra into a new Dataframe
    readUsers = spark.read\
        .format("org.apache.spark.sql.cassandra")\
        .options(table="users", keyspace="movielens")\
        .load()

    readUsers.createOrReplaceTempView("users")

    sqlDF = spark.sql("SELECT * FROM users WHERE age < 20")
    sqlDF.show()

    # Stop the session
    spark.stop()

When you execute a similar Spark job on your cluster, make sure you include the correct connector package:

spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.2 sparkcassandra.py

If your Cassandra database is running on a separate cluster, make sure you use the right internal IP for that cluster. If you are not sure about the IP, you can find it in /etc/cassandra/cassandra.yaml in the broadcast_rpc_address field.

Any changes you need to make regarding authentication in Cassandra are also done in this file.

