Introduction
Intraway is widely known for its mastery of DOCSIS technology, including provisioning and service assurance. Our products deal with the complexity of large volumes of data when interacting with DOCSIS devices. Several of our customers have more than a million subscribers each, which means that Intraway handles many millions of devices.
Once the provisioning workflow is complete, Service Assurance use cases become a priority.
The key feature of Service Assurance is data gathering; therefore, having the status of the entire network within reach in real time is the first step.
When a customer's network is centralized in one data center and the number of devices stays under a million, it is safe to say that traditional technical approaches work fine. However, DOCSIS penetration grows every day, and CSPs are getting bigger and more geographically distributed. These days, it is common to manage three to five million devices spread over several regions and grouped in different data centers. Collecting the status of the whole network poses a considerable challenge.
In this paper, we will analyze two different technical solutions to this use case. Tests were conducted on a simulated system.
Use case description:
- 3 million devices to be monitored
- Full poll every 30 minutes
- Online events that must be handled
- Three data centers with regional distribution
- Latency between data centers: 50 ms
- Three types of operations (sketched after this list):
○ Update device status (insert or update on the device table)
○ Audit log (insert on the audit table)
○ Random queries over the previous tables
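To make these operation types concrete, below is a minimal sketch of what the two tables could look like in the traditional relational (MySQL) model, created through plain JDBC. The connection URL, credentials, and all table and column names are assumptions for illustration, not the actual product schema.

// Hypothetical relational schema for the device and audit tables named above.
// Connection URL, credentials, and all table/column names are illustrative.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class TraditionalSchemaSketch {

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/assurance", "user", "password");
             Statement stmt = conn.createStatement()) {

            // Device status: one row per CPE, updated (upserted) on every poll.
            stmt.execute("CREATE TABLE IF NOT EXISTS cpe_status ("
                    + " cpe_mac     VARCHAR(17) PRIMARY KEY,"
                    + " status      VARCHAR(16),"
                    + " data_center VARCHAR(32),"
                    + " updated_at  TIMESTAMP)");

            // Audit log: append-only, one row per event.
            stmt.execute("CREATE TABLE IF NOT EXISTS audit_log ("
                    + " id         BIGINT AUTO_INCREMENT PRIMARY KEY,"
                    + " cpe_mac    VARCHAR(17),"
                    + " event      VARCHAR(64),"
                    + " created_at TIMESTAMP)");
        }
    }
}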
Traditional Approach
Our current solution is based on this architecture: a single database cluster receives all the information from the Network Collectors (hereinafter NCs) that talk to the devices. The database is generally located in the main (and biggest) data center.
Each NC manages a part of the network, depending on the customer topology, and therefore a subset of the database.
The Problem
When we start handling more than a million devices with regional distribution, the scenario gets complicated. We have to deal with at least 700 operations per second just to keep the data updated. The load average on the database starts to increase dramatically. Up to now, we have managed to operate with fine tuning and proper maintenance. However, how much can operators grow with this model?
Another issue is availability. If a data center is isolated, the solution becomes inconsistent: all data collected from that data center is lost until it reconnects.
Reliability is important, but it is not critical in our case: every half hour, the full poll rebuilds all the data.
At Intraway, we develop carrier-class products. Three of the six key features of carrier-class products are at risk.
Distributed Solution
We started looking for options, and we still are. There are several NoSQL technologies these days that can contribute to the solution, but in this paper we will concentrate on what we believe can be the answer: Cassandra.
Cassandra is a masterless, peer-to-peer database: every node plays the same role and, if configured correctly, there is no single point of failure.
To deal with this use case, a possible architecture is to place a group of Cassandra nodes in each data center. Cassandra replicates the data across all data centers, providing fault tolerance and scalability, and each NC is assigned to the group of nodes in its data center.
Where is the catch? The problem resides in the CAP theorem: it is not possible to have both consistency and availability when the network is partitioned.
“Cassandra is typically classified as an AP system, meaning that availability and partition tolerance are generally considered to be more important than consistency in Cassandra. But Cassandra can be tuned with replication factor and consistency level to also meet C.”
Luckily, we are not dealing with financial transactions, so we can tolerate temporary inconsistency when reading. When a node is queried, it may not return the most recent data; this is called eventual consistency. In our use case, this is not a problem.
Important concepts
If you are familiar with distributed databases, you can skip this part. If not, it is essential that you understand these concepts well.
- Data Modeling
Data modeling in Cassandra has some particularities. Table relationships (joins and foreign keys) do not exist, and Cassandra uses a wide-column data model: each row can have different columns, much like a sorted map.
Every table must have a primary key composed of a partition key and optional clustering keys. Instead of SQL, CQL (Cassandra Query Language) is used.
The Cassandra data model is organized in keyspaces (analogous to databases). Each keyspace can be configured separately according to business needs.
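As a minimal sketch, and assuming hypothetical keyspace, table, and column names, the CPE status data from our use case could be modeled as follows, creating the table from Java with the DataStax driver. The partition key groups all rows for one CPE, and the clustering key keeps its status history sorted.

// Illustrative only: a possible CQL table for the CPE status data in our use
// case. Keyspace, table, and column names are assumptions, not product schema.
// Assumes the "assurance" keyspace already exists (replication is covered below).
import com.datastax.oss.driver.api.core.CqlSession;

public class CpeSchemaSketch {

    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // Primary key = partition key (cpe_mac) + clustering key (updated_at).
            // All rows for the same CPE live in one partition, newest status first.
            session.execute("CREATE TABLE IF NOT EXISTS assurance.cpe_status ("
                    + " cpe_mac text,"
                    + " updated_at timestamp,"
                    + " status text,"
                    + " data_center text,"
                    + " PRIMARY KEY ((cpe_mac), updated_at)"
                    + ") WITH CLUSTERING ORDER BY (updated_at DESC)");
        }
    }
}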
- Data Partitioning
Cassandra partitions data by default. A partitioner determines how data is distributed across the nodes in the cluster: it is a hash function that computes a token from the partition key. Each row of data is uniquely identified by its partition key and distributed across the cluster according to the value of that token.
There are different algorithms to implement partitioning; Murmur3Partitioner is the default one. A simplified illustration of the idea follows.
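The sketch below is a deliberately simplified, hypothetical illustration of the mechanism, not the real Murmur3 computation: a partition key is hashed to a token, and the token decides which node owns the row.

// Simplified, hypothetical illustration of partitioning. Cassandra's real
// Murmur3Partitioner computes 64-bit tokens and assigns token ranges per node;
// here a plain hashCode() and a modulo stand in for that machinery.
import java.util.List;

public class PartitioningSketch {

    public static void main(String[] args) {
        List<String> nodes = List.of("node-dc1", "node-dc2", "node-dc3");

        for (String cpeMac : List.of("00:1a:2b:3c:4d:5e", "00:1a:2b:3c:4d:5f")) {
            int token = cpeMac.hashCode();                  // stand-in hash, not Murmur3
            int owner = Math.floorMod(token, nodes.size()); // which node owns this row
            System.out.printf("partition key %s -> token %d -> %s%n",
                    cpeMac, token, nodes.get(owner));
        }
    }
}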
- Data Replication
In order to ensure reliability and fault tolerance, Cassandra uses data replication. The total number of replicas across the cluster is referred to as the replication factor. If the operator configures a keyspace with a replication factor of 3, there will be three copies of each row distributed across the cluster.
There are two replication strategies: SimpleStrategy and NetworkTopologyStrategy. The latter is highly recommended when the operator has multiple data centers, as in the sketch below.
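For the three regional data centers in our use case, the keyspace could be created with NetworkTopologyStrategy and a per-data-center replication factor. The data center names and the factor of 3 per data center are assumptions for illustration.

// Illustrative: a keyspace replicated across three regional data centers.
// Data center names ("dc1", "dc2", "dc3") and the per-DC factor are assumptions.
import com.datastax.oss.driver.api.core.CqlSession;

public class ReplicationSketch {

    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            session.execute("CREATE KEYSPACE IF NOT EXISTS assurance"
                    + " WITH replication = {"
                    + "  'class': 'NetworkTopologyStrategy',"
                    + "  'dc1': 3, 'dc2': 3, 'dc3': 3 }");
        }
    }
}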
- Data Consistency
Consistency refers to how up-to-date and synchronized a row of Cassandra data is on all of its replicas. Writes and reads can be configured separately with different consistency levels according to business needs, as sketched below.
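As a minimal sketch of how this tuning looks with the DataStax Java driver (reusing the illustrative table from the data modeling example), a write and a read can carry different consistency levels: LOCAL_QUORUM keeps the write acknowledgement within the local data center, while LOCAL_ONE accepts possibly stale reads in exchange for the lowest latency.

// Illustrative: per-statement consistency levels with the DataStax Java driver.
// Keyspace, table, and values are the same assumptions used earlier.
import java.time.Instant;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class ConsistencySketch {

    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // Write acknowledged by a quorum of replicas in the local data center.
            session.execute(SimpleStatement
                    .newInstance("INSERT INTO assurance.cpe_status"
                            + " (cpe_mac, updated_at, status, data_center)"
                            + " VALUES (?, ?, ?, ?)",
                            "00:1a:2b:3c:4d:5e", Instant.now(), "online", "dc1")
                    .setConsistencyLevel(DefaultConsistencyLevel.LOCAL_QUORUM));

            // Read from a single local replica: fastest, but possibly stale.
            session.execute(SimpleStatement
                    .newInstance("SELECT status FROM assurance.cpe_status"
                            + " WHERE cpe_mac = ? LIMIT 1", "00:1a:2b:3c:4d:5e")
                    .setConsistencyLevel(DefaultConsistencyLevel.LOCAL_ONE));
        }
    }
}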
Testing time
A special test was built for this purpose. We coded three endpoints in Spring Boot:
- /updateCpe: For a given resource, it performs an upsert on the CPE table. The purpose is to test concurrent updates.
- /audit: For a given resource, it inserts a row into the audit table.
- /queryCpe: Queries a random CPE.
Every endpoint has both a MySQL and a Cassandra implementation; a sketch of the Cassandra flavour follows.
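Below is a hedged sketch of what the Cassandra flavour of these endpoints could look like. The endpoint paths come from the list above; the statements, schema, and the way the session is obtained are simplifications, not the actual test code (a real application would inject a configured session bean, and the MySQL flavour would use a pooled JDBC datasource instead).

// Illustrative sketch of the three test endpoints (Cassandra flavour).
// Endpoint paths come from the text; schema and statements are assumptions.
// GET is used everywhere purely to keep the load-test harness simple.
import java.time.Instant;
import java.util.concurrent.ThreadLocalRandom;

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

@RestController
public class AssuranceController {

    // Simplified: a real application would inject a configured CqlSession bean.
    private final CqlSession session = CqlSession.builder().build();

    // /updateCpe: upsert the status of a given CPE (inserts in Cassandra are upserts).
    @GetMapping("/updateCpe")
    public void updateCpe(@RequestParam String cpeMac, @RequestParam String status) {
        session.execute(SimpleStatement.newInstance(
                "INSERT INTO assurance.cpe_status (cpe_mac, updated_at, status)"
                        + " VALUES (?, ?, ?)", cpeMac, Instant.now(), status));
    }

    // /audit: append one audit row for the given CPE.
    @GetMapping("/audit")
    public void audit(@RequestParam String cpeMac, @RequestParam String event) {
        session.execute(SimpleStatement.newInstance(
                "INSERT INTO assurance.audit_log (cpe_mac, created_at, event)"
                        + " VALUES (?, ?, ?)", cpeMac, Instant.now(), event));
    }

    // /queryCpe: read the latest status of a randomly chosen CPE.
    @GetMapping("/queryCpe")
    public String queryCpe() {
        String randomMac = String.format("00:1a:2b:3c:4d:%02x",
                ThreadLocalRandom.current().nextInt(256));
        Row row = session.execute(SimpleStatement.newInstance(
                "SELECT status FROM assurance.cpe_status WHERE cpe_mac = ? LIMIT 1",
                randomMac)).one();
        return row == null ? "unknown" : row.getString("status");
    }
}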
The test starts by loading only one server, and after a while the other servers join. By the end, 108 threads are distributed across the three servers running the Spring Boot API.
Hardware:
- Four dedicated servers, each with four cores, 8 GB of RAM, and a 7,200 rpm spinning disk.
Results
We ran several tests with different scenarios, varying duration, load ramp-up, and data randomization. In order to present clear results, we will share the following scenario:
- Test duration: 15 min
- Peak of 108 concurrent threads. Each server received 36 parallel threads with a ramp-up of 30 seconds (a simplified load-generator sketch follows this list).
- The first server starts at minute 0, the second at minute 5, and the third at minute 10.
- MySQL and Cassandra ran with their default configuration.
- Spring Boot was configured with a pooled datasource of 10 connections.
- Queries always ran against indexed columns.
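The following is a deliberately simplified, hypothetical load-generator sketch that captures the ramp-up idea for one API server (36 threads added over 30 seconds, each looping over the three endpoints); the host name, parameters, and the harness actually used in the test may differ.

// Simplified, hypothetical load generator: ramps up 36 threads against one API
// server over ~30 seconds, each thread looping over the three test endpoints.
// Host, port, and query parameters are assumptions for illustration.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LoadSketch {

    public static void main(String[] args) throws Exception {
        String base = "http://api-server-1:8080";
        HttpClient client = HttpClient.newHttpClient();
        ExecutorService pool = Executors.newFixedThreadPool(36);

        for (int t = 0; t < 36; t++) {
            Thread.sleep(30_000 / 36);                 // spread the ramp-up over ~30 s
            pool.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    for (String path : new String[] {
                            "/updateCpe?cpeMac=00:1a:2b:3c:4d:5e&status=online",
                            "/audit?cpeMac=00:1a:2b:3c:4d:5e&event=poll",
                            "/queryCpe" }) {
                        try {
                            HttpRequest request = HttpRequest.newBuilder(
                                    URI.create(base + path)).GET().build();
                            client.send(request, HttpResponse.BodyHandlers.discarding());
                        } catch (Exception e) {
                            // a real harness would count errors; ignored in this sketch
                        }
                    }
                }
            });
        }
        // A real run would stop the pool after the 15-minute test duration.
    }
}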
We also ran the same tests for several hours and got the same results, but those runs are impossible to graph due to the amount of data.
For the graphs, we used Grafana.
MySQL
- Throughput: 860 ops
- Avg response time: 82.6 ms
- 99th response time: 281 ms
- 95th response time: 181 ms
- 90th response time: 146 ms
Load and Throughput
Cassandra
- Throughput: 2.5k ops
- Avg response time: 24.5 ms
- 99th response time: 96 ms
- 95th response time: 67 ms
- 90th response time: 51 ms
Load and Throughput
Comparison Graphs
In the next graph, both results can be compared side by side.
Conclusion
This limited test demonstrates what we experience with a centralized database. When it receives transactions from a single client, performance is fine and Cassandra shows no advantage. However, as we add load from the other servers, the operations per second do not increase and at some point start to decrease, while the global response time grows linearly with the load.
In this test we did not have to deal with latency, but if you consider that NCs from different data centers may hit the same database, performance is reduced significantly.
Cassandra, on the other hand, not only increases operations per second as new threads join the test; its average response time also remains constant. The average response time was more than three times lower than MySQL's (24.5 ms vs. 82.6 ms), and the throughput was almost 3x higher. Cassandra starts to make sense when multiple nodes handle the requests; with only one node, there is no benefit.
In the end, Cassandra allows us to increase processing power by adding more nodes without any disruption; for instance, restarting the cluster is not necessary.
Another significant benefit is fault tolerance. With a properly configured cluster and a replication factor of at least 3, you can survive the loss of nodes without losing data or impacting the application. In general, when a centralized database is lost, there is nothing more you can do; the game changes with Cassandra. Even if the connection between data centers is lost, you can continue operating, and when the connection is re-established, Cassandra reconciles the data for you.
It is important to mention that there are some disadvantages. In a centralized transactional database, the last thing you have to worry about is consistency; in Cassandra, it has to be managed very carefully. If your business cannot handle eventual consistency, this is not the appropriate solution.
Consistency is tunable in Cassandra; it can be configured from eventual to strong consistency and every blend in between. My recommendation is that strong consistency is not the best use of this database. You can always read with consistency level ALL (which reads all replicas to get the latest value), but performance will suffer. In our use case, we can live without having the latest update on every row.
Deployment complexity increases with Cassandra. Your team will have to maintain several servers instead of one (or two, if you have replicas). Cassandra configuration and tuning is not a simple task: your deployment must be in line with the network topology and data center distribution, and any change in the topology might require changing Cassandra's deployment and configuration.
What I miss the most is the tooling around MySQL. Tools like MySQL Enterprise Monitor or even MySQL Workbench have no equivalent in Cassandra; everything has to be done by hand, without the friendly tools that help non-experts. Almost every developer or operations person knows how MySQL works and can work with its engine, whereas Cassandra is another paradigm that must be learned.
In the end, Cassandra cannot compete with traditional databases when the use case does not warrant it. However, new edge use cases appear every day, and it is good to know that there are alternatives.
For us, Cassandra might be the way to improve our products and solve these problems.
References
The CAP Theorem | Learn Cassandra
Teddyma.gitbooks.io. (2017). The CAP Theorem | Learn Cassandra. [online] Available at: https://teddyma.gitbooks.io/learncassandra/content/about/the_cap_theorem.html [Accessed 11 Sep. 2017].
Apache Cassandra
Cassandra.apache.org. (2017). Apache Cassandra. [online] Available at: https://cassandra.apache.org/ [Accessed 11 Sep. 2017].
Cassandra Parameters for Dummies
Ecyrd.com. (2017). Cassandra Parameters for Dummies. [online] Available at: https://www.ecyrd.com/cassandracalculator/ [Accessed 11 Sep. 2017].
Cassandra Data Modelling – Tables
Ali, S. (2013). Cassandra Data Modelling – Tables. [online] Intellidzine.blogspot.com.ar. Available at: http://intellidzine.blogspot.com.ar/2013/11/cassandra-data-modelling-tables.html [Accessed 11 Sep. 2017].
Cassandra Data Modelling – Primary Keys
Intellidzine.blogspot.com.ar. (2014). Cassandra Data Modelling – Primary Keys. [online] Available at: http://intellidzine.blogspot.com.ar/2014/01/cassandra-data-modelling-primary-keys.html [Accessed 11 Sep. 2017].