Monitoring Tools

This section covers the tools and configuration used to monitor a Kafka cluster.

Monitoring Architecture

Typical Setup

┌─────────────────────────────────────────────────────────────────┐
│                    Kafka Monitoring Stack                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────┐   JMX    ┌─────────────┐      ┌─────────────┐     │
│  │ Broker  │ ───────► │ JMX Exporter│ ───► │ Prometheus  │     │
│  └─────────┘          └─────────────┘      └─────────────┘     │
│                                                   │              │
│  ┌─────────┐   JMX    ┌─────────────┐            │              │
│  │Producer │ ───────► │ JMX Exporter│ ───────────┤              │
│  └─────────┘          └─────────────┘            │              │
│                                                   ▼              │
│  ┌─────────┐   JMX    ┌─────────────┐      ┌─────────────┐     │
│  │Consumer │ ───────► │ JMX Exporter│ ───► │   Grafana   │     │
│  └─────────┘          └─────────────┘      └─────────────┘     │
│                                                   │              │
│                                                   ▼              │
│                                            ┌─────────────┐     │
│                                            │   Alerting  │     │
│                                            │ (PagerDuty) │     │
│                                            └─────────────┘     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

JMX (Java Management Extensions)

Broker JMX Configuration

# JMX is enabled through environment variables, not server.properties.
 
# Export before running kafka-server-start.sh (or add the export to the script)
export KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote \
    -Dcom.sun.management.jmxremote.authenticate=false \
    -Dcom.sun.management.jmxremote.ssl=false \
    -Dcom.sun.management.jmxremote.port=9999 \
    -Dcom.sun.management.jmxremote.rmi.port=9999 \
    -Djava.rmi.server.hostname=<broker-hostname>"
 
# Or set only JMX_PORT; kafka-run-class.sh then enables remote JMX on that port
export JMX_PORT=9999

JMX Configuration with Authentication

# jmxremote.password (must be readable by its owner only, e.g. chmod 600)
admin password123
 
# jmxremote.access
admin readwrite
 
# JMX options
-Dcom.sun.management.jmxremote.authenticate=true
-Dcom.sun.management.jmxremote.password.file=/path/to/jmxremote.password
-Dcom.sun.management.jmxremote.access.file=/path/to/jmxremote.access

Using JMX Tools

# JConsole
jconsole localhost:9999
 
# VisualVM
visualvm --openjmx localhost:9999
 
# jmxterm (CLI)
java -jar jmxterm.jar -l localhost:9999
 
# Example commands (run inside the jmxterm session)
beans                              # list all MBeans
info -b kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
get -b kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec Count

Prometheus + JMX Exporter

Installing JMX Exporter

# Download JMX Exporter
wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.19.0/jmx_prometheus_javaagent-0.19.0.jar
 
# Attach the agent when starting Kafka (metrics are served on port 7071)
export KAFKA_OPTS="-javaagent:/path/to/jmx_prometheus_javaagent-0.19.0.jar=7071:/path/to/kafka-broker.yml"
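
Once the agent is attached, the broker should expose metrics over HTTP in the Prometheus text format. The following is a minimal sketch for checking that endpoint by hand. It assumes the agent listens on `localhost:7071` and serves `/metrics`, and that samples carry no timestamps; `parse_metrics` and `fetch_metrics` are hypothetical helper names, not part of JMX Exporter.

```python
import urllib.request


def parse_metrics(text):
    """Parse Prometheus text exposition lines into {series: value}.

    Comment lines (# HELP / # TYPE) and blank lines are skipped. The series
    key keeps its label set verbatim, e.g. 'up{job="kafka"}'.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Assumption: no trailing timestamp, so the value is the last field.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def fetch_metrics(url="http://localhost:7071/metrics", timeout=5):
    """Fetch and parse the exporter endpoint (URL is an assumption)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# Hand-written sample of what the endpoint might return.
SAMPLE = """\
# HELP kafka_server_replicamanager_underreplicatedpartitions sample help text
# TYPE kafka_server_replicamanager_underreplicatedpartitions gauge
kafka_server_replicamanager_underreplicatedpartitions 0.0
kafka_log_log_size{topic="my-topic",partition="0"} 1048576.0
"""
```

Calling `parse_metrics(SAMPLE)` exercises the parser without a running broker; `fetch_metrics()` does the same against a live agent.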

JMX Exporter Configuration (kafka-broker.yml)

lowercaseOutputName: true
lowercaseOutputLabelNames: true
 
rules:
  # Broker metrics. "PerSec" is matched outside the capture group, so
  # MessagesInPerSec is exported as kafka_server_brokertopicmetrics_messagesin_total.
  - pattern: kafka.server<type=BrokerTopicMetrics, name=(.+)PerSec, topic=(.+)><>Count
    name: kafka_server_brokertopicmetrics_$1_total
    labels:
      topic: "$2"
    type: COUNTER
 
  - pattern: kafka.server<type=BrokerTopicMetrics, name=(.+)PerSec><>Count
    name: kafka_server_brokertopicmetrics_$1_total
    type: COUNTER
 
  - pattern: kafka.server<type=BrokerTopicMetrics, name=(.+)PerSec, topic=(.+)><>OneMinuteRate
    name: kafka_server_brokertopicmetrics_$1_rate
    labels:
      topic: "$2"
    type: GAUGE
 
  # Partition metrics
  - pattern: kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value
    name: kafka_server_replicamanager_underreplicatedpartitions
    type: GAUGE
 
  - pattern: kafka.server<type=ReplicaManager, name=PartitionCount><>Value
    name: kafka_server_replicamanager_partitioncount
    type: GAUGE
 
  - pattern: kafka.server<type=ReplicaManager, name=LeaderCount><>Value
    name: kafka_server_replicamanager_leadercount
    type: GAUGE
 
  # Controller metrics
  - pattern: kafka.controller<type=KafkaController, name=(.+)><>Value
    name: kafka_controller_kafkacontroller_$1
    type: GAUGE
 
  # Request handling metrics
  - pattern: kafka.network<type=RequestMetrics, name=RequestsPerSec, request=(.+), version=(.+)><>Count
    name: kafka_network_requestmetrics_requests_total
    labels:
      request: "$1"
      version: "$2"
    type: COUNTER
 
  - pattern: kafka.network<type=RequestMetrics, name=TotalTimeMs, request=(.+)><>(\w+)
    name: kafka_network_requestmetrics_totaltimems
    labels:
      request: "$1"
      aggregate: "$2"
    type: GAUGE
 
  # Log metrics
  - pattern: kafka.log<type=Log, name=Size, topic=(.+), partition=(.+)><>Value
    name: kafka_log_log_size
    labels:
      topic: "$1"
      partition: "$2"
    type: GAUGE
 
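  # Fetcher lag metrics. Note: the broker-side FetcherLagMetrics MBean reports how far
  # follower replicas are behind the leader, not consumer group lag. For consumer group
  # lag, use the consumer's own JMX metrics, kafka-consumer-groups.sh, or Burrow.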
  - pattern: kafka.server<type=FetcherLagMetrics, name=ConsumerLag, clientId=(.+), topic=(.+), partition=(.+)><>Value
    name: kafka_server_fetcherlagmetrics_consumerlag
    labels:
      client_id: "$1"
      topic: "$2"
      partition: "$3"
    type: GAUGE
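
To see how a rule turns an MBean attribute into a Prometheus series, the matching step can be imitated in a few lines. This is only a sketch of the idea, not the exporter's actual implementation: it assumes the exporter presents each attribute as `domain<key=value, ...><>attribute`, matches the pattern unanchored, and substitutes `$1`, `$2` from the capture groups. `apply_rule` is a hypothetical name.

```python
import re


def apply_rule(mbean_attr, pattern, name, labels, lowercase=True):
    """Return (metric_name, labels) if the rule matches, else None."""
    match = re.search(pattern, mbean_attr)  # unanchored, as assumed above
    if match is None:
        return None

    def expand(template):
        # Replace $1, $2, ... with the corresponding capture group.
        return re.sub(r"\$(\d+)",
                      lambda m: match.group(int(m.group(1))),
                      template)

    metric = expand(name)
    if lowercase:  # mirrors lowercaseOutputName: true
        metric = metric.lower()
    return metric, {key: expand(value) for key, value in labels.items()}


# The log size rule from the configuration, applied to one sample attribute.
LOG_SIZE_PATTERN = r"kafka.log<type=Log, name=Size, topic=(.+), partition=(.+)><>Value"
result = apply_rule(
    "kafka.log<type=Log, name=Size, topic=my-topic, partition=0><>Value",
    LOG_SIZE_PATTERN,
    "kafka_log_log_size",
    {"topic": "$1", "partition": "$2"},
)
```

Here `result` holds the metric name `kafka_log_log_size` with the `topic` and `partition` labels filled in from the MBean name. Testing patterns this way before deploying them can catch capture-group mistakes early.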

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
 
scrape_configs:
  - job_name: 'kafka'
    static_configs:
      - targets:
        - 'kafka1:7071'
        - 'kafka2:7071'
        - 'kafka3:7071'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):\d+'
        replacement: '${1}'
 
  - job_name: 'kafka-producer'
    static_configs:
      - targets:
        - 'producer1:7072'
 
  - job_name: 'kafka-consumer'
    static_configs:
      - targets:
        - 'consumer1:7073'
 
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - 'alertmanager:9093'
 
rule_files:
  - 'kafka_alerts.yml'
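
The `relabel_configs` entry for the `kafka` job strips the port from each target so that `instance` holds only the host name. A rough Python equivalent of that one step is sketched below. It assumes Prometheus anchors the relabel regex at both ends and leaves the label untouched when the regex does not match; `relabel_instance` is a hypothetical name.

```python
import re


def relabel_instance(address, regex=r"([^:]+):\d+", replacement="${1}"):
    """Imitate the 'replace' relabel action on the __address__ label."""
    match = re.fullmatch(regex, address)  # fully anchored match
    if match is None:
        return address  # no match: the value stays as it was
    # Expand ${1}, ${2}, ... from the capture groups.
    return re.sub(r"\$\{(\d+)\}",
                  lambda g: match.group(int(g.group(1))),
                  replacement)
```

With the targets listed above, `relabel_instance("kafka1:7071")` gives `kafka1`, which is the value that later appears as `{{ $labels.instance }}` in alert descriptions.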

Alert Rules (kafka_alerts.yml)

groups:
  - name: kafka-alerts
    rules:
      # Offline partitions
      - alert: KafkaOfflinePartitions
        expr: kafka_controller_kafkacontroller_offlinepartitionscount > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Kafka offline partitions detected"
          description: "{{ $value }} offline partitions"
 
      # Under-replicated partitions
      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Under-replicated partitions detected"
          description: "{{ $value }} under-replicated partitions"
 
      # Active controller count is not exactly 1
      - alert: KafkaNoActiveController
        expr: sum(kafka_controller_kafkacontroller_activecontrollercount) != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "No active Kafka controller"
 
      # High consumer lag
      - alert: KafkaHighConsumerLag
        expr: kafka_server_fetcherlagmetrics_consumerlag > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High consumer lag detected"
          description: "Consumer lag is {{ $value }}"
 
      # Broker down
      - alert: KafkaBrokerDown
        expr: up{job="kafka"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Kafka broker is down"
          description: "Broker {{ $labels.instance }} is not responding"
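
Each rule above carries a `for:` clause, which delays firing until the expression has stayed true for that long. The sketch below simulates that behaviour for a simple `value > threshold` expression, assuming one sample per evaluation interval. `alert_states` and its parameters are hypothetical names used only for illustration.

```python
def alert_states(samples, threshold, for_seconds, interval=15):
    """Return the alert state after each evaluation.

    samples: one metric value per evaluation, `interval` seconds apart.
    An alert is 'pending' while the condition holds but has not yet held
    for `for_seconds`, 'firing' once it has, and 'inactive' otherwise.
    """
    states = []
    active_since = None
    for i, value in enumerate(samples):
        now = i * interval
        if value > threshold:
            if active_since is None:
                active_since = now  # condition just became true
            held = now - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
    return states
```

For example, with `for_seconds=60` and `interval=30`, a condition must hold across three consecutive evaluations before the alert fires. A single evaluation that falls back to the threshold or below resets the timer, which is why short spikes do not page anyone.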

Grafana Dashboard

Dashboard Layout

┌─────────────────────────────────────────────────────────────────┐
│                     Kafka Overview Dashboard                     │
├─────────────────────────────────────────────────────────────────┤
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐            │
│  │ Active       │ │ Offline      │ │ Under-       │            │
│  │ Controllers  │ │ Partitions   │ │ Replicated   │            │
│  │     1        │ │     0        │ │     0        │            │
│  └──────────────┘ └──────────────┘ └──────────────┘            │
├─────────────────────────────────────────────────────────────────┤
│  Messages In/Out Per Second                                     │
│  ┌───────────────────────────────────────────────────────────┐ │
│  │     ▲                                                      │ │
│  │    /│\    /\                                               │ │
│  │   / │ \  /  \     Messages In                              │ │
│  │  /  │  \/    \___/                                         │ │
│  │ ────┼─────────────────────────────────────────────► time  │ │
│  └───────────────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────┤
│  Consumer Lag by Group                                          │
│  ┌───────────────────────────────────────────────────────────┐ │
│  │ group-a: ████████████████░░░░ 80,000                      │ │
│  │ group-b: ██████░░░░░░░░░░░░░░ 30,000                      │ │
│  │ group-c: ██░░░░░░░░░░░░░░░░░░  5,000                      │ │
│  └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

Grafana Dashboard JSON Example

{
  "dashboard": {
    "title": "Kafka Overview",
    "panels": [
      {
        "title": "Messages In Per Second",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(kafka_server_brokertopicmetrics_messagesin_total[5m])) by (topic)",
            "legendFormat": "{{ topic }}"
          }
        ]
      },
      {
        "title": "Consumer Lag",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(kafka_server_fetcherlagmetrics_consumerlag) by (client_id, topic)",
            "legendFormat": "{{ client_id }} - {{ topic }}"
          }
        ]
      },
      {
        "title": "Under-Replicated Partitions",
        "type": "singlestat",
        "targets": [
          {
            "expr": "sum(kafka_server_replicamanager_underreplicatedpartitions)"
          }
        ],
        "thresholds": "1,5",
        "colors": ["green", "yellow", "red"]
      }
    ]
  }
}
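
When a dashboard grows beyond a few panels, generating the JSON from a short list can be easier than editing it by hand. The sketch below produces the same structure as the example above. `build_dashboard` is a hypothetical helper, and the panel types are copied from the example rather than checked against a particular Grafana version.

```python
import json


def build_dashboard(title, panels):
    """Build a dashboard dict from (title, type, expr, legend) tuples.

    `legend` may be None, in which case legendFormat is omitted.
    """
    built = []
    for panel_title, panel_type, expr, legend in panels:
        target = {"expr": expr}
        if legend:
            target["legendFormat"] = legend
        built.append({"title": panel_title,
                      "type": panel_type,
                      "targets": [target]})
    return {"dashboard": {"title": title, "panels": built}}


dashboard = build_dashboard("Kafka Overview", [
    ("Messages In Per Second", "graph",
     "sum(rate(kafka_server_brokertopicmetrics_messagesin_total[5m])) by (topic)",
     "{{ topic }}"),
    ("Under-Replicated Partitions", "singlestat",
     "sum(kafka_server_replicamanager_underreplicatedpartitions)", None),
])
dashboard_json = json.dumps(dashboard, indent=2)
```

The resulting `dashboard_json` string can be saved to a file and imported, or adapted to whatever provisioning mechanism is in use.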

Built-in Kafka Tools

kafka-consumer-groups.sh

# List consumer groups
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
 
# Describe a consumer group (includes lag)
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
    --describe --group my-consumer-group
 
# Output:
# GROUP              TOPIC      PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
# my-consumer-group  my-topic   0          1000            1500            500
# my-consumer-group  my-topic   1          2000            2100            100
 
# State of all groups
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
    --describe --all-groups --state
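
The `--describe` output is whitespace-separated, so it can be summed per group and topic with a short script. This is a sketch that assumes the header contains `GROUP`, `TOPIC` and `LAG` columns as in the sample output above. Rows whose lag is not a number, such as `-` for partitions without a committed offset, are skipped. `lag_by_topic` is a hypothetical name.

```python
def lag_by_topic(describe_output):
    """Sum the LAG column per (group, topic) from --describe output."""
    totals = {}
    columns = None
    for line in describe_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if "GROUP" in fields and "TOPIC" in fields and "LAG" in fields:
            # Header row: remember where each column sits.
            columns = {name: i for i, name in enumerate(fields)}
            continue
        if columns is None or len(fields) <= columns["LAG"]:
            continue  # not a data row
        lag = fields[columns["LAG"]]
        if not lag.isdigit():
            continue  # e.g. "-" when no offset has been committed
        key = (fields[columns["GROUP"]], fields[columns["TOPIC"]])
        totals[key] = totals.get(key, 0) + int(lag)
    return totals


SAMPLE_DESCRIBE = """\
GROUP              TOPIC      PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
my-consumer-group  my-topic   0          1000            1500            500
my-consumer-group  my-topic   1          2000            2100            100
my-consumer-group  other      0          -               300             -
"""
```

Piping the tool's output into such a function is a quick way to get a single lag figure per topic for ad hoc checks. For continuous monitoring, a dedicated tool is usually the better fit.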

kafka-log-dirs.sh

# Log directory information
kafka-log-dirs.sh --bootstrap-server localhost:9092 \
    --describe --broker-list 0,1,2
 
# A specific topic
kafka-log-dirs.sh --bootstrap-server localhost:9092 \
    --describe --topic-list my-topic
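
`kafka-log-dirs.sh` prints its result as a JSON document after a couple of status lines, which makes disk usage per broker easy to total. The sketch below assumes the JSON has a top-level `brokers` list whose entries contain `logDirs` and, within those, `partitions` with a `size` in bytes. Verify the field names against the output of your Kafka version. `size_by_broker` is a hypothetical name and the sample is hand-written.

```python
import json


def size_by_broker(tool_output):
    """Return {broker_id: total_bytes} from kafka-log-dirs.sh output."""
    # Skip the status lines and take the first line that looks like JSON.
    json_line = next(
        (line for line in tool_output.splitlines()
         if line.lstrip().startswith("{")),
        None,
    )
    if json_line is None:
        raise ValueError("no JSON document found in tool output")
    doc = json.loads(json_line)
    sizes = {}
    for broker in doc["brokers"]:
        total = 0
        for log_dir in broker["logDirs"]:
            for partition in log_dir["partitions"]:
                total += partition["size"]
        sizes[broker["broker"]] = total
    return sizes


SAMPLE_OUTPUT = """\
Querying brokers for log directories information
Received log directory information from brokers 0,1
{"version":1,"brokers":[{"broker":0,"logDirs":[{"logDir":"/var/kafka-logs","error":null,"partitions":[{"partition":"my-topic-0","size":1024,"offsetLag":0,"isFuture":false}]}]},{"broker":1,"logDirs":[{"logDir":"/var/kafka-logs","error":null,"partitions":[{"partition":"my-topic-0","size":1024,"offsetLag":0,"isFuture":false},{"partition":"my-topic-1","size":2048,"offsetLag":0,"isFuture":false}]}]}]}
"""
```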

kafka-broker-api-versions.sh

# Check broker API versions
kafka-broker-api-versions.sh --bootstrap-server localhost:9092

Third-Party Tools

Kafka Manager (CMAK)

# docker-compose.yml
version: '3'
services:
  kafka-manager:
    image: ghcr.io/eshepelyuk/dckr/kafka-manager
    ports:
      - "9000:9000"
    environment:
      ZK_HOSTS: "zookeeper:2181"
      APPLICATION_SECRET: "random-secret"

Kafdrop

# docker-compose.yml
version: '3'
services:
  kafdrop:
    image: obsidiandynamics/kafdrop
    ports:
      - "9000:9000"
    environment:
      KAFKA_BROKERCONNECT: "kafka:9092"
      JVM_OPTS: "-Xms32M -Xmx64M"

Burrow (Consumer Lag Monitoring)

# burrow.toml
[general]
pidfile = "/var/run/burrow.pid"
stdout-logfile = "/var/log/burrow/burrow.log"
 
[logging]
level = "info"
 
[zookeeper]
servers = ["zk1:2181", "zk2:2181", "zk3:2181"]
 
# Burrow 1.x uses TOML tables named [cluster.<name>], [consumer.<name>], and so on
[cluster.local]
class-name = "kafka"
servers = ["kafka1:9092", "kafka2:9092", "kafka3:9092"]
 
[consumer.local]
class-name = "kafka"
cluster = "local"
servers = ["kafka1:9092", "kafka2:9092", "kafka3:9092"]
group-whitelist = ".*"   # named group-allowlist in newer Burrow releases
 
[httpserver.default]
address = ":8000"
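
With the HTTP server on port 8000, Burrow's evaluation of a group can be read over its REST API. The sketch below summarizes one response. It assumes the v3 lag endpoint (`/v3/kafka/local/consumer/<group>/lag`) and a response whose `status` object has `group`, `status`, `totallag` and `partitions` fields. Check these names against your Burrow version. The sample response is hand-written, and `summarize_burrow_status` is a hypothetical name.

```python
import json


def summarize_burrow_status(response_text):
    """Reduce a Burrow consumer lag response to the fields worth alerting on."""
    body = json.loads(response_text)
    if body.get("error"):
        raise ValueError(body.get("message", "Burrow returned an error"))
    status = body["status"]
    # Collect partitions that Burrow did not rate as OK.
    bad = [
        (p["topic"], p["partition"], p["status"])
        for p in status.get("partitions", [])
        if p["status"] != "OK"
    ]
    return {
        "group": status["group"],
        "status": status["status"],
        "total_lag": status["totallag"],
        "bad_partitions": bad,
    }


SAMPLE_RESPONSE = json.dumps({
    "error": False,
    "message": "consumer status returned",
    "status": {
        "cluster": "local",
        "group": "my-consumer-group",
        "status": "WARN",
        "partitions": [
            {"topic": "my-topic", "partition": 0, "status": "OK"},
            {"topic": "my-topic", "partition": 1, "status": "WARN"},
        ],
        "totallag": 600,
    },
})
```

Unlike a fixed lag threshold, Burrow rates each group by how its offsets move over time, so the `status` field is usually a better alert signal than `total_lag` alone.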

Conduktor

A commercial GUI tool. Features:
- Cluster management
- Topic and consumer monitoring
- Message browsing
- Schema Registry integration
- Security management

Best Practices

1. Layered Monitoring

Level 1: Infrastructure (CPU, Memory, Disk, Network)
    ↓
Level 2: Kafka core (broker status, partition status)
    ↓
Level 3: Throughput (Messages/sec, Bytes/sec)
    ↓
Level 4: Latency (request latency, consumer lag)
    ↓
Level 5: Application (business metrics)

2. Alert Priorities

P1 (immediate):
  - OfflinePartitions > 0
  - ActiveController != 1
  - Broker Down
 
P2 (within 5 minutes):
  - UnderReplicatedPartitions > 0
  - ConsumerLag > critical_threshold
 
P3 (within 30 minutes):
  - ConsumerLag > warning_threshold
  - HighRequestLatency
 
P4 (business hours):
  - FrequentRebalance
  - LowDiskSpace
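
The priority tiers above amount to an ordered set of checks in which the most urgent match wins. A minimal sketch of that ordering follows. The metric keys, the default thresholds and the `classify_alert` name are all assumptions made for illustration; only the tier definitions come from the list above.

```python
def classify_alert(metrics, lag_warning=10_000, lag_critical=100_000,
                   latency_warning_ms=500):
    """Return the most urgent priority ("P1".."P4"), or None if all is well.

    `metrics` is a dict with hypothetical keys; missing keys count as healthy.
    """
    get = metrics.get
    # P1: respond immediately.
    if (get("offline_partitions", 0) > 0
            or get("active_controllers", 1) != 1
            or get("brokers_down", 0) > 0):
        return "P1"
    lag = get("consumer_lag", 0)
    # P2: respond within 5 minutes.
    if get("under_replicated_partitions", 0) > 0 or lag > lag_critical:
        return "P2"
    # P3: respond within 30 minutes.
    if lag > lag_warning or get("request_latency_ms", 0) > latency_warning_ms:
        return "P3"
    # P4: handle during business hours.
    if get("frequent_rebalance", False) or get("low_disk_space", False):
        return "P4"
    return None
```

Checking the tiers from P1 downward means a cluster with both an offline partition and low disk space is reported once, at the higher priority.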

3. Dashboard Organization

Overview Dashboard:
  - Cluster status (singlestat)
  - Throughput graphs
  - Error rate graphs
  - Top 10 topics

Detail Dashboard:
  - Per-broker metrics
  - Per-topic metrics
  - Per-consumer-group metrics

Troubleshooting Dashboard:
  - Request latency distribution
  - GC metrics
  - Network metrics

Related Documents