It happened on just another working day: our Terraform job managing Confluent resources suddenly stopped working. The error message was clear: we had hit the ACL limit of our Confluent-hosted cluster. After a closer look and a purge of our ACLs, we cut the ACL count by 50%. In this article I want to introduce a practical approach to setting up topics and their ACLs for a scaled organization.
What Is An ACL In Kafka?
An access control list (ACL) is a list of rules that specifies which users or systems are granted or denied access to a particular system resource. In the Kafka ecosystem, a resource can be one of Cluster, Topic, Group, Token, or Transactional Id. The most frequently touched ACL resource is the topic, which is involved in both producing and consuming messages; correspondingly, the most often configured operations on a topic are Read and Write. For details about Kafka ACL resources and operations, you can read the documentation here.
Requirements And Challenges When Designing An ACL Strategy On Topics
1. System Limit
There are hard and soft limits when using Kafka from a cloud provider. For example, as in the story above, Confluent's free cluster tier limits you to 1,000 ACLs by default; paid tiers allow far more, but a limit still exists. The limit can be raised, but with a poorly designed strategy it will be hit frequently. For example, if you only use the LITERAL pattern type for topic ACLs, every topic with one producer and one consumer needs 2 ACLs; by simple math, 500 topics will drain a 1,000-ACL quota.
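The back-of-the-envelope arithmetic can be sketched as a tiny function. This is only an illustration of the counting, not a Kafka API; the quota value comes from the free-tier limit mentioned above:

```python
# Back-of-the-envelope ACL count for a LITERAL-only strategy.

def literal_acl_count(topics: int, producers_per_topic: int = 1,
                      consumers_per_topic: int = 1) -> int:
    """With only LITERAL pattern types, each producer needs a WRITE ACL
    and each consumer a READ ACL per topic."""
    return topics * (producers_per_topic + consumers_per_topic)

quota = 1000  # default ACL limit on Confluent's free tier

print(literal_acl_count(500))           # 1000
print(literal_acl_count(500) >= quota)  # True: the quota is fully drained
```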
2. Security Concern
It is always a good sign to see a network effect from adopting Kafka inside an organization, where events are used for different purposes across different business domains. However, it puts extra pressure on ACL strategy design, because an ACL with the wrong scope creates risk on both sides. On the producer side, an irrelevant producer may wrongly produce messages to your topic; on the consumer side, an irrelevant consumer may see messages containing sensitive data. For example, if you use messages to capture users' profile changes, then without a properly scoped ACL any employee can capture this information with a kcat consumer.
3. Operational Effort
Operational effort may be ignored when the number of topics is small, but at scale it becomes a problem. With a fast development pace, topic creation and ACL setup for producers and consumers can happen multiple times a day. If a dedicated team handles approval of topic creation and ACL setup, this is a non-trivial and distracting workload; if not, there may be misconfiguration or security issues.
Beyond the daily operational work, the software used to manage topics and ACLs needs to be maintained as well. As a Terraform user, I find the Confluent provider relatively easy to use, but I hit a rate-limit problem at work simply because too many ACLs needed to be created.
4. Scope Mismatch
ACLs are not only used by programs, but also by programmers, and the access control for the two may differ. For example, Confluent uses RBAC to manage developer authorization. It comes with finer granularity but is different from ACLs (although it is ultimately mapped to ACLs). So when designing an ACL strategy, you also need to take human users' RBAC into consideration.
5. CDC/Outbox Connector Authorization
CDC/Outbox authorization can be classified as one of the producer-side security issues mentioned above, but it needs special attention because it is easily ignored and can be tricky to fix in a production system. The root cause is that authorization for database writes and for Kafka messages belongs to different systems. A program can be authorized to write to an outbox table in the database, yet fill in the topic column with a topic the program is not authorized to produce to. This problem is more obvious and impactful in a monolith, where call paths are vague and easy to copy and reuse.
Practical Setup Suggestions
1. Group Topics In A Meaningful Way And Use Prefixes
Follow the Domain-Driven Design way of creating topics, for example with the pattern com_{DOMAIN}_{SUBDOMAIN}_{VERSION}_{command or event name}. If you own a subscription service, you can create a service account for the service with WRITE access to topics prefixed with com_subscription, and this one ACL covers all topic production in the service. If some topics need to be consumed for "internal" usage, like building a state machine, you can also grant a domain- or subdomain-level READ ACL to the same service account, which adds only one more ACL to cover consumption of those topics.
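The difference between the two pattern types can be shown with a small sketch that mimics Kafka's matching semantics (this is an illustration, not the broker's actual authorizer code; the topic names are hypothetical):

```python
# Sketch of how LITERAL vs PREFIXED ACL pattern types match topic names.

def acl_matches(pattern_type: str, pattern: str, topic: str) -> bool:
    if pattern_type == "LITERAL":
        return topic == pattern          # exact name match only
    if pattern_type == "PREFIXED":
        return topic.startswith(pattern) # covers every topic under the prefix
    raise ValueError(f"unknown pattern type: {pattern_type}")

# One PREFIXED WRITE ACL covers every topic the subscription service owns...
owned = [
    "com_subscription_billing_v1_invoice_created",
    "com_subscription_plan_v1_plan_changed",
]
print(all(acl_matches("PREFIXED", "com_subscription", t) for t in owned))  # True

# ...while a topic from another domain is not covered.
print(acl_matches("PREFIXED", "com_subscription", "com_payment_card_v1_charged"))  # False
```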
2. Limit The Scope Of ACLs For Consumers
When defining ACLs for consumers, try to be as specific as possible. Consumers should only see what they really need, and nothing more. Be generous in creating LITERAL topic ACLs here, because a data leak is far more severe than hitting the ACL count limit.
3. Be Generous To CDC/Outbox Producers To Save Operational Effort
For a database attached to an entangled monolith, the database itself may be very busy as well. If multiple small-scope connectors are attached, the extra replication slots may add a non-negligible burden to the database. In this scenario, I suggest creating domain-level PREFIXED topic ACLs for all the possible domains inside the monolith. In my practical experience, if you stick with LITERAL ACLs for topic production from a monolith with Outbox/CDC, it is very common (on a weekly basis) to see unauthorized topics sneak into the outbox table, and the usual fix, such as advancing the replication slot and the offset topic, can take one to several hours.
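Even with a generous prefixed ACL, it helps to catch bad topic names before they reach the outbox table. Below is a lightweight guard the outbox writer could run; the function name and prefixes are hypothetical, and the check simply mirrors the prefixes granted to the connector's service account:

```python
# Guard for the outbox writer: reject rows whose topic column falls outside
# the prefixes the connector's service account is authorized to produce to.

ALLOWED_PREFIXES = ("com_subscription", "com_billing")  # hypothetical grants

def validate_outbox_topic(topic: str) -> None:
    """Raise before an unauthorized topic lands in the outbox table,
    instead of failing later inside the connector."""
    if not topic.startswith(ALLOWED_PREFIXES):
        raise ValueError(
            f"topic {topic!r} is outside the prefixes authorized for this "
            "connector; fix the producing code path before the row is written"
        )

validate_outbox_topic("com_billing_invoice_v1_invoice_paid")  # passes silently
```

Failing at write time keeps the fix cheap: a rejected insert is far easier to handle than advancing a replication slot after the connector has stalled.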
4. Automate As Much As Possible
Managing topics and their ACLs manually is really tedious, and engineering effort should not be wasted here. However, it is always good to keep an eye on infrastructure-level changes. Below is a suggested CI/CD flow for topics and ACLs. Dotted lines and boxes represent steps that need human intervention; solid ones are programmatic.
a. Developers define their schemas for business purposes in any format (e.g., Avro), and the file path can follow the topic pattern suggested above. If no schema is needed, as with CDC, they can create topics directly, as in step c.1.
b. If the schema's path follows a certain pattern, the topic can be inferred and automatically generated by creating a pull request in place or in the topic management repo.
c.1. A new PR is created with the newly created topics; this can happen in the same repository as the schema definitions, or in a separate repo.
c.2. Developers can manually change configuration to link a topic or its prefix to a service account. For consumers, developers can also configure the consumer group. To add a consumer group for an existing topic, a developer can start from this step.
d. Developers, including the topic's producers/consumers, and the Kafka operator step in to approve the PR.
e. Topic creation and ACL changes are applied with software such as Terraform.
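Step b is the only non-obvious piece of automation. A minimal sketch of the inference, assuming schemas live under a layout like schemas/{domain}/{subdomain}/{version}/{event}.avsc (this layout is an assumption for illustration, not a Confluent convention):

```python
# Sketch of step b: infer a topic name from a schema file path, assuming the
# repo layout schemas/{domain}/{subdomain}/{version}/{event}.avsc.

from pathlib import PurePosixPath

def topic_from_schema_path(path: str) -> str:
    # strip the ".avsc" suffix, split the path into components
    parts = PurePosixPath(path).with_suffix("").parts
    # drop the leading "schemas" directory and join the rest per the
    # com_{DOMAIN}_{SUBDOMAIN}_{VERSION}_{event} pattern from section 1
    domain, subdomain, version, event = parts[1:5]
    return f"com_{domain}_{subdomain}_{version}_{event}"

print(topic_from_schema_path("schemas/subscription/plan/v1/plan_changed.avsc"))
# com_subscription_plan_v1_plan_changed
```

A CI job can run this over the changed files in a PR and open the topic-creation PR automatically, leaving only the approval in step d to humans.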
5. Move To RBAC
It seems that more and more Kafka cloud providers are moving toward RBAC-based authorization, even for service accounts. According to official documents such as Confluent's, it offers better granularity and can provide consistent configuration across human users and service accounts. It may take some effort to migrate all the ACLs to RBAC, but it is worth trying for the convenience and stronger security in the long term.