Monitoring Tools
This section walks through the tools and configuration commonly used to monitor a Kafka cluster.
Monitoring Architecture
Typical Setup
┌─────────────────────────────────────────────────────────────────┐
│ Kafka Monitoring Stack │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────┐ JMX ┌─────────────┐ ┌─────────────┐ │
│ │ Broker │ ───────► │ JMX Exporter│ ───► │ Prometheus │ │
│ └─────────┘ └─────────────┘ └─────────────┘ │
│ │ │
│ ┌─────────┐ JMX ┌─────────────┐ │ │
│ │Producer │ ───────► │ JMX Exporter│ ───────────┤ │
│ └─────────┘ └─────────────┘ │ │
│ ▼ │
│ ┌─────────┐ JMX ┌─────────────┐ ┌─────────────┐ │
│ │Consumer │ ───────► │ JMX Exporter│ ───► │ Grafana │ │
│ └─────────┘ └─────────────┘ └─────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Alerting │ │
│ │ (PagerDuty) │ │
│ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
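Every arrow into Prometheus in this diagram is an HTTP scrape of plain-text metrics. To make that concrete, here is a minimal Python sketch that parses text in the Prometheus exposition format into a dictionary. The function name parse_exposition and the sample payload are illustrative assumptions, and the parser assumes sample lines carry no trailing timestamp, so it is a sketch rather than a complete parser.

```python
def parse_exposition(text):
    """Parse Prometheus text-format lines into {series: value}.

    Assumes each sample line is '<name>{<labels>} <value>' with no
    trailing timestamp; comment lines (# HELP, # TYPE) are skipped.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated token on the line.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Hypothetical payload, shaped like what an exporter endpoint might return.
SAMPLE = """\
# TYPE kafka_server_replicamanager_underreplicatedpartitions gauge
kafka_server_replicamanager_underreplicatedpartitions 0.0
kafka_server_replicamanager_leadercount 12.0
"""

metrics = parse_exposition(SAMPLE)
print(metrics["kafka_server_replicamanager_leadercount"])  # 12.0
```

In a real deployment Prometheus does this parsing itself; the sketch is only meant to show what travels between the exporter and the scraper.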
JMX (Java Management Extensions)
Broker JMX Configuration
# Set as environment variables, or add to kafka-server-start.sh.
# These are JVM flags, so they do not belong in server.properties.
export KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.ssl=false \
-Dcom.sun.management.jmxremote.port=9999 \
-Dcom.sun.management.jmxremote.rmi.port=9999 \
-Djava.rmi.server.hostname=<broker-hostname>"
# Alternatively, set only the JMX_PORT environment variable
export JMX_PORT=9999
JMX Configuration with Authentication
# jmxremote.password (example credentials; the JVM requires this file to be readable by its owner only)
admin password123
# jmxremote.access
admin readwrite
# JMX options
-Dcom.sun.management.jmxremote.authenticate=true
-Dcom.sun.management.jmxremote.password.file=/path/to/jmxremote.password
-Dcom.sun.management.jmxremote.access.file=/path/to/jmxremote.access
Using JMX Tools
# JConsole
jconsole localhost:9999
# VisualVM
visualvm --openjmx localhost:9999
# jmxterm (CLI)
java -jar jmxterm.jar -l localhost:9999
# Example commands inside the jmxterm session
# (beans lists every MBean; info and get inspect a single one)
beans
info -b kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
get -b kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec Count
Prometheus + JMX Exporter
Installing JMX Exporter
# Download JMX Exporter
wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.19.0/jmx_prometheus_javaagent-0.19.0.jar
# Attach the agent when starting Kafka (metrics are served over HTTP on port 7071)
export KAFKA_OPTS="-javaagent:/path/to/jmx_prometheus_javaagent-0.19.0.jar=7071:/path/to/kafka-broker.yml"
JMX Exporter Configuration (kafka-broker.yml)
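Each rule in kafka-broker.yml pairs a regular expression over the flattened MBean name with a metric-name template. As a rough illustration of that rewrite, the Python sketch below applies the per-topic Count rule from this file to one bean attribute string. The helper apply_rule and the topic name orders are hypothetical, and the real exporter does more (value handling, caching, type metadata), so this only models the naming step.

```python
import re

# Mirrors the per-topic Count rule in kafka-broker.yml.
RULE_PATTERN = re.compile(
    r"kafka.server<type=BrokerTopicMetrics, name=(.+), topic=(.+)><>Count"
)
RULE_NAME = "kafka_server_brokertopicmetrics_{0}_total"


def apply_rule(bean_attribute, lowercase=True):
    """Return (metric_name, labels) if the rule matches, else None."""
    match = RULE_PATTERN.match(bean_attribute)
    if match is None:
        return None
    name = RULE_NAME.format(match.group(1))
    if lowercase:  # models lowercaseOutputName: true
        name = name.lower()
    return name, {"topic": match.group(2)}


result = apply_rule(
    "kafka.server<type=BrokerTopicMetrics, name=MessagesInPerSec, topic=orders><>Count"
)
print(result)
# ('kafka_server_brokertopicmetrics_messagesinpersec_total', {'topic': 'orders'})
```

Note that the captured MBean name keeps its PerSec suffix, so queries against this rule need the full name kafka_server_brokertopicmetrics_messagesinpersec_total.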
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
  # Broker metrics
  - pattern: kafka.server<type=BrokerTopicMetrics, name=(.+), topic=(.+)><>Count
    name: kafka_server_brokertopicmetrics_$1_total
    labels:
      topic: "$2"
    type: COUNTER
  - pattern: kafka.server<type=BrokerTopicMetrics, name=(.+)><>Count
    name: kafka_server_brokertopicmetrics_$1_total
    type: COUNTER
  - pattern: kafka.server<type=BrokerTopicMetrics, name=(.+), topic=(.+)><>OneMinuteRate
    name: kafka_server_brokertopicmetrics_$1_rate
    labels:
      topic: "$2"
    type: GAUGE
  # Partition metrics
  - pattern: kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value
    name: kafka_server_replicamanager_underreplicatedpartitions
    type: GAUGE
  - pattern: kafka.server<type=ReplicaManager, name=PartitionCount><>Value
    name: kafka_server_replicamanager_partitioncount
    type: GAUGE
  - pattern: kafka.server<type=ReplicaManager, name=LeaderCount><>Value
    name: kafka_server_replicamanager_leadercount
    type: GAUGE
  # Controller metrics
  - pattern: kafka.controller<type=KafkaController, name=(.+)><>Value
    name: kafka_controller_kafkacontroller_$1
    type: GAUGE
  # Request handling metrics
  - pattern: kafka.network<type=RequestMetrics, name=RequestsPerSec, request=(.+), version=(.+)><>Count
    name: kafka_network_requestmetrics_requests_total
    labels:
      request: "$1"
      version: "$2"
    type: COUNTER
  - pattern: kafka.network<type=RequestMetrics, name=TotalTimeMs, request=(.+)><>(\w+)
    name: kafka_network_requestmetrics_totaltimems
    labels:
      request: "$1"
      aggregate: "$2"
    type: GAUGE
  # Log metrics
  - pattern: kafka.log<type=Log, name=Size, topic=(.+), partition=(.+)><>Value
    name: kafka_log_log_size
    labels:
      topic: "$1"
      partition: "$2"
    type: GAUGE
  # Fetcher lag metrics (on a broker, FetcherLagMetrics reports follower replica lag,
  # not consumer group lag, despite the ConsumerLag name)
  - pattern: kafka.server<type=FetcherLagMetrics, name=ConsumerLag, clientId=(.+), topic=(.+), partition=(.+)><>Value
    name: kafka_server_fetcherlagmetrics_consumerlag
    labels:
      client_id: "$1"
      topic: "$2"
      partition: "$3"
    type: GAUGE
Prometheus Configuration
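The relabel_configs entry in the kafka job of prometheus.yml rewrites the instance label so that it holds the host name without the port. The following is a rough Python model of that single relabel step. It assumes Prometheus's documented behavior that relabel regexes are fully anchored and that a non-matching replace action leaves the target label as it was; the helper name relabel_instance is hypothetical.

```python
import re


def relabel_instance(address, current_instance=None):
    """Model the 'replace' step: regex '([^:]+):\\d+', replacement '${1}'."""
    match = re.fullmatch(r"([^:]+):\d+", address)
    if match is None:
        return current_instance  # no match: the target label is left unchanged
    return match.group(1)


print(relabel_instance("kafka1:7071"))  # kafka1
```

With this rewrite, alert annotations such as {{ $labels.instance }} show kafka1 rather than kafka1:7071.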
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'kafka'
    static_configs:
      - targets:
          - 'kafka1:7071'
          - 'kafka2:7071'
          - 'kafka3:7071'
    relabel_configs:
      # Keep only the host part of the scrape address as the instance label
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):\d+'
        replacement: '${1}'
  - job_name: 'kafka-producer'
    static_configs:
      - targets:
          - 'producer1:7072'
  - job_name: 'kafka-consumer'
    static_configs:
      - targets:
          - 'consumer1:7073'
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager:9093'
rule_files:
  - 'kafka_alerts.yml'
Alert Rules (kafka_alerts.yml)
groups:
  - name: kafka-alerts
    rules:
      # Offline partitions
      - alert: KafkaOfflinePartitions
        expr: kafka_controller_kafkacontroller_offlinepartitionscount > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Kafka offline partitions detected"
          description: "{{ $value }} offline partitions"
      # Under-replicated partitions
      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Under-replicated partitions detected"
          description: "{{ $value }} under-replicated partitions"
      # Active controller count is not exactly 1 (none, or more than one)
      - alert: KafkaNoActiveController
        expr: sum(kafka_controller_kafkacontroller_activecontrollercount) != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Active Kafka controller count is not 1"
      # High lag (this broker-side metric is FetcherLagMetrics, i.e. follower replica lag;
      # consumer group lag needs a tool such as kafka-consumer-groups.sh or Burrow)
      - alert: KafkaHighConsumerLag
        expr: kafka_server_fetcherlagmetrics_consumerlag > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High consumer lag detected"
          description: "Consumer lag is {{ $value }}"
      # Broker down
      - alert: KafkaBrokerDown
        expr: up{job="kafka"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Kafka broker is down"
          description: "Broker {{ $labels.instance }} is not responding"
Grafana Dashboards
Dashboard Layout
┌─────────────────────────────────────────────────────────────────┐
│ Kafka Overview Dashboard │
├─────────────────────────────────────────────────────────────────┤
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Active │ │ Offline │ │ Under- │ │
│ │ Controllers │ │ Partitions │ │ Replicated │ │
│ │ 1 │ │ 0 │ │ 0 │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
├─────────────────────────────────────────────────────────────────┤
│ Messages In/Out Per Second │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ ▲ │ │
│ │ /│\ /\ │ │
│ │ / │ \ / \ Messages In │ │
│ │ / │ \/ \___/ │ │
│ │ ────┼─────────────────────────────────────────────► time │ │
│ └───────────────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────┤
│ Consumer Lag by Group │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ group-a: ████████████████░░░░ 80,000 │ │
│ │ group-b: ██████░░░░░░░░░░░░░░ 30,000 │ │
│ │ group-c: ██░░░░░░░░░░░░░░░░░░ 5,000 │ │
│ └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
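The Messages In/Out panel plots a per-second rate, while the brokers only export ever-increasing counters. The Python sketch below shows one simplified way to turn counter samples into a rate, including handling of a counter reset. It is not how Prometheus implements rate() (which also extrapolates to the window edges); the function name and sample values are assumptions for illustration.

```python
def per_second_rate(samples):
    """Approximate a per-second rate from (timestamp_s, counter_value) pairs.

    Samples must be in ascending time order. A drop in value is treated as
    a counter reset (for example after a broker restart).
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        # After a reset the counter restarts near zero, so count the new value itself.
        increase += current - previous if current >= previous else current
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Hypothetical scrapes taken 15 seconds apart.
print(per_second_rate([(0, 100), (15, 250), (30, 400)]))  # 10.0
```

Averaging over a window of several scrapes, as here, smooths out the jitter that a two-sample difference would show.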
Grafana Dashboard JSON Example
{
  "dashboard": {
    "title": "Kafka Overview",
    "panels": [
      {
        "title": "Messages In Per Second",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(kafka_server_brokertopicmetrics_messagesinpersec_total[5m])) by (topic)",
            "legendFormat": "{{ topic }}"
          }
        ]
      },
      {
        "title": "Consumer Lag",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(kafka_server_fetcherlagmetrics_consumerlag) by (client_id, topic)",
            "legendFormat": "{{ client_id }} - {{ topic }}"
          }
        ]
      },
      {
        "title": "Under-Replicated Partitions",
        "type": "singlestat",
        "targets": [
          {
            "expr": "sum(kafka_server_replicamanager_underreplicatedpartitions)"
          }
        ],
        "thresholds": "1,5",
        "colors": ["green", "yellow", "red"]
      }
    ]
  }
}
Kafka Built-in Tools
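The command-line tools in this section print plain-text tables, which makes them easy to post-process. As a rough sketch, the Python below totals the LAG column per group and topic from text shaped like the kafka-consumer-groups.sh --describe output. Columns are looked up by header name because real output carries extra columns such as CONSUMER-ID and HOST. The function name total_lag is hypothetical, and the parser assumes no field contains spaces.

```python
from collections import defaultdict


def total_lag(describe_output):
    """Sum the LAG column per (group, topic) from --describe style text."""
    totals = defaultdict(int)
    header = None
    for line in describe_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if "GROUP" in fields and "LAG" in fields:
            header = fields  # column positions come from the header row
            continue
        if header is None or len(fields) < len(header):
            continue
        row = dict(zip(header, fields))
        if row["LAG"].isdigit():  # skip '-' shown when no offset is committed
            totals[(row["GROUP"], row["TOPIC"])] += int(row["LAG"])
    return dict(totals)


SAMPLE = """\
GROUP     TOPIC     PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
my-group  my-topic  0          1000            1500            500
my-group  my-topic  1          2000            2100            100
"""
print(total_lag(SAMPLE))  # {('my-group', 'my-topic'): 600}
```

In practice the input would come from capturing the command's standard output, for example with subprocess.run(..., capture_output=True, text=True).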
kafka-consumer-groups.sh
# List consumer groups
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
# Describe a consumer group (includes lag)
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group my-group
# Output:
# GROUP     TOPIC     PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
# my-group  my-topic  0          1000            1500            500
# my-group  my-topic  1          2000            2100            100
# State of all groups
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --all-groups --state
kafka-log-dirs.sh
# Log directory information for specific brokers
kafka-log-dirs.sh --bootstrap-server localhost:9092 \
  --describe --broker-list 0,1,2
# Restrict to a specific topic
kafka-log-dirs.sh --bootstrap-server localhost:9092 \
  --describe --topic-list my-topic
kafka-broker-api-versions.sh
# Check the API versions each broker supports
kafka-broker-api-versions.sh --bootstrap-server localhost:9092
Third-Party Tools
Kafka Manager (CMAK)
# docker-compose.yml
version: '3'
services:
  kafka-manager:
    image: ghcr.io/eshepelyuk/dckr/kafka-manager
    ports:
      - "9000:9000"
    environment:
      ZK_HOSTS: "zookeeper:2181"
      APPLICATION_SECRET: "random-secret"
Kafdrop
# docker-compose.yml
version: '3'
services:
  kafdrop:
    image: obsidiandynamics/kafdrop
    ports:
      - "9000:9000"
    environment:
      KAFKA_BROKERCONNECT: "kafka:9092"
      JVM_OPTS: "-Xms32M -Xmx64M"
Burrow (Consumer Lag Monitoring)
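# burrow.toml
# Burrow 1.x reads TOML, so named sections are written as [cluster.<name>],
# [consumer.<name>] and [httpserver.<name>] rather than [kafka "<name>"].
[general]
pidfile = "/var/run/burrow.pid"
stdout-logfile = "/var/log/burrow/burrow.log"
[logging]
level = "info"
[zookeeper]
servers = ["zk1:2181", "zk2:2181", "zk3:2181"]
[cluster.local]
class-name = "kafka"
servers = ["kafka1:9092", "kafka2:9092", "kafka3:9092"]
[consumer.local]
class-name = "kafka"
cluster = "local"
servers = ["kafka1:9092", "kafka2:9092", "kafka3:9092"]
# Newer Burrow releases name this option group-allowlist
group-whitelist = ".*"
[httpserver.default]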
address = ":8000"
Conduktor
A commercial GUI tool. Features:
- Cluster management
- Topic and consumer monitoring
- Message browsing
- Schema Registry integration
- Security management
Best Practices
1. Layered Monitoring
Level 1: Infrastructure (CPU, memory, disk, network)
↓
Level 2: Kafka core (broker state, partition state)
↓
Level 3: Throughput (messages/sec, bytes/sec)
↓
Level 4: Latency (request latency, consumer lag)
↓
Level 5: Application (business metrics)
2. Alert Priorities
P1 (immediately):
- OfflinePartitions > 0
- ActiveController != 1
- Broker Down
P2 (within 5 minutes):
- UnderReplicatedPartitions > 0
- ConsumerLag > critical_threshold
P3 (within 30 minutes):
- ConsumerLag > warning_threshold
- HighRequestLatency
P4 (business hours):
- FrequentRebalance
- LowDiskSpace
3. Dashboard Layout
Overview Dashboard:
- Cluster status (singlestat)
- Throughput graphs
- Error rate graphs
- Top 10 topics
Detail Dashboard:
- Per-broker metrics
- Per-topic metrics
- Per-consumer-group metrics
Troubleshooting Dashboard:
- Request latency distribution
- GC metrics
- Network metrics
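The tiers under Alert Priorities can be encoded as a small lookup table, for example to decide how an incoming alert is routed. The Python sketch below is one possible encoding. The alert keys, the helper names, the example critical threshold of 100000, and the choice to send unknown alerts to P4 are all assumptions made for illustration; only the tier contents and response windows come from the list.

```python
# Tier membership follows the Alert Priorities list; the key names are illustrative.
ALERT_PRIORITY = {
    "OfflinePartitions": "P1",
    "NoActiveController": "P1",
    "BrokerDown": "P1",
    "UnderReplicatedPartitions": "P2",
    "ConsumerLagCritical": "P2",
    "ConsumerLagWarning": "P3",
    "HighRequestLatency": "P3",
    "FrequentRebalance": "P4",
    "LowDiskSpace": "P4",
}
RESPONSE_WINDOW = {
    "P1": "immediately",
    "P2": "within 5 minutes",
    "P3": "within 30 minutes",
    "P4": "business hours",
}


def consumer_lag_alert(lag, warning_threshold, critical_threshold):
    """Map a lag value to an alert key, or None when below both thresholds."""
    if lag > critical_threshold:
        return "ConsumerLagCritical"
    if lag > warning_threshold:
        return "ConsumerLagWarning"
    return None


def route(alert_name):
    """Return (priority, response window); unknown alerts fall back to P4 (assumption)."""
    priority = ALERT_PRIORITY.get(alert_name, "P4")
    return priority, RESPONSE_WINDOW[priority]


alert = consumer_lag_alert(50000, warning_threshold=10000, critical_threshold=100000)
print(alert, route(alert))  # ConsumerLagWarning ('P3', 'within 30 minutes')
```

Keeping the mapping in data rather than in branching code makes it straightforward to review the tiers alongside the alert rules.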