Leader Election

Leader election is the process by which Kafka selects the leader replica for a partition.

Leader Election Overview

Basic Concepts

┌─────────────────────────────────────────────────────────────────┐
│                    Leader Election                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Before (leader failure):                                       │
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐           │
│  │   Broker 1  │   │   Broker 2  │   │   Broker 3  │           │
│  │ Leader ✗    │   │  Follower   │   │  Follower   │           │
│  │   (DOWN)    │   │   (ISR)     │   │   (ISR)     │           │
│  └─────────────┘   └─────────────┘   └─────────────┘           │
│                                                                 │
│  After (Leader Election):                                       │
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐           │
│  │   Broker 1  │   │   Broker 2  │   │   Broker 3  │           │
│  │   (DOWN)    │   │ New Leader  │   │  Follower   │           │
│  │             │   │     ✓       │   │   (ISR)     │           │
│  └─────────────┘   └─────────────┘   └─────────────┘           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

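Election Triggers

Trigger	Description
Broker failure	The broker hosting the leader goes down
Preferred leader election	Leadership moves back to the preferred replica
Manual election	An administrator runs an election command
Controller restart	The controller fails over, and the new controller elects leaders for partitions left without one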

Election Types

1. Clean Election (within the ISR)

ISR = [Broker1(L), Broker2, Broker3]
           ↓ (failure)

The controller picks the new leader from the remaining ISR:
ISR = [Broker2(L), Broker3]

Characteristics:
✓ No loss of committed data
✓ Default behavior

2. Unclean Election (outside the ISR)

ISR = [Broker1(L)]  (all followers are out-of-sync replicas, OSR)
           ↓ (failure)

unclean.leader.election.enable = true:
→ One of the out-of-sync replicas becomes leader
→ Data loss is possible!

unclean.leader.election.enable = false:
→ The partition becomes unavailable (offline)
→ No data loss
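
The two election types can be compared side by side with a small model. The following is a minimal sketch, not Kafka's implementation: the function name `elect_leader` and its parameters are hypothetical, and it assumes candidates are tried in replica-assignment order.

```python
from typing import Optional, Sequence


def elect_leader(replicas: Sequence[int], isr: Sequence[int], live: Sequence[int],
                 unclean_enabled: bool = False) -> Optional[int]:
    """Return the new leader's broker id, or None if the partition goes offline.

    Hypothetical helper for illustration only.
    """
    live_set, isr_set = set(live), set(isr)
    # Clean election: the first live replica that is still in the ISR.
    for broker in replicas:
        if broker in live_set and broker in isr_set:
            return broker
    # Unclean election: fall back to any live replica, accepting possible data loss.
    if unclean_enabled:
        for broker in replicas:
            if broker in live_set:
                return broker
    # No eligible replica: the partition stays offline.
    return None


if __name__ == "__main__":
    # Broker 1 (the old leader) is down in every case below.
    print(elect_leader([1, 2, 3], isr=[1, 2, 3], live=[2, 3]))
    print(elect_leader([1, 2, 3], isr=[1], live=[2, 3]))
    print(elect_leader([1, 2, 3], isr=[1], live=[2, 3], unclean_enabled=True))
```

The second call returns `None` because the only in-sync replica is down and unclean election is disabled, which corresponds to the offline case above.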

Configuration

# Default: false (recommended)
unclean.leader.election.enable = false

# Per-topic override
kafka-configs.sh --alter \
    --entity-type topics \
    --entity-name my-topic \
    --bootstrap-server localhost:9092 \
    --add-config unclean.leader.election.enable=false

The Role of the Controller

What the Controller Is

┌─────────────────────────────────────────────────────────────────┐
│                       Kafka Cluster                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                  Controller (Broker 1)                   │   │
│  │                                                          │   │
│  │  - Elects partition leaders                              │   │
│  │  - Manages cluster metadata                              │   │
│  │  - Tracks broker membership                              │   │
│  │  - Coordinates partition reassignment                    │   │
│  │                                                          │   │
│  └─────────────────────────────────────────────────────────┘   │
│                           │                                     │
│                           ▼                                     │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐          │
│  │   Broker 2   │  │   Broker 3   │  │   Broker 4   │          │
│  └──────────────┘  └──────────────┘  └──────────────┘          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Controller Election

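ZooKeeper mode:
- Brokers race to create the ephemeral /controller znode
- The first broker to create the znode becomes the controller

KRaft mode:
- The active controller is elected with the Raft protocol
- The controller nodes form a quorum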
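
The ZooKeeper-mode race can be illustrated with a toy model. This sketch does not talk to ZooKeeper: `FakeZooKeeper`, `create_ephemeral`, `expire_session`, and `run_election` are hypothetical names that imitate only the "first creator wins, node disappears with its owner's session" behavior.

```python
class FakeZooKeeper:
    """In-memory stand-in for the single ZooKeeper behavior used here (hypothetical)."""

    def __init__(self):
        self.nodes = {}  # path -> id of the broker that owns the ephemeral node

    def create_ephemeral(self, path, owner):
        # Creation fails if the node already exists, so only one caller can win.
        if path in self.nodes:
            return False
        self.nodes[path] = owner
        return True

    def expire_session(self, owner):
        # Ephemeral nodes are removed when the owning session ends.
        self.nodes = {p: o for p, o in self.nodes.items() if o != owner}


def run_election(zk, candidates):
    """Every candidate tries to create /controller; the first attempt succeeds."""
    for broker_id in candidates:
        zk.create_ephemeral("/controller", broker_id)
    return zk.nodes.get("/controller")


if __name__ == "__main__":
    zk = FakeZooKeeper()
    print(run_election(zk, [2, 1, 3]))  # broker 2 tried first
    zk.expire_session(2)                # controller broker dies
    print(run_election(zk, [1, 3]))     # the remaining brokers race again
```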

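Leader Election Process

1. The controller detects the broker failure
     ↓
2. It identifies the affected partitions
     ↓
3. It checks the ISR of each partition
     ↓
4. It selects a new leader from the ISR
     ↓
5. It persists the new leader and ISR (ZooKeeper, or the metadata log in KRaft)
     ↓
6. It notifies the brokers (ZooKeeper mode: LeaderAndIsr requests to the replicas and UpdateMetadata requests to all brokers; KRaft mode: brokers replicate the metadata log)
     ↓
7. The brokers apply the new leader information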
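
Steps 2 through 5 can be sketched as a loop over partition state. This is a simplified model under stated assumptions, not the controller's code: `PartitionState` and `handle_broker_failure` are hypothetical names, only partitions led by the failed broker are touched, and broker notification (steps 6 and 7) is left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class PartitionState:
    replicas: List[int]      # assignment order; the first entry is the preferred leader
    leader: Optional[int]    # None means the partition is offline
    isr: List[int]
    leader_epoch: int = 0


def handle_broker_failure(partitions: Dict[str, PartitionState],
                          failed: int, live: Set[int]) -> List[str]:
    """Update state in place and return the partitions whose leader changed."""
    changed = []
    for name, p in partitions.items():
        if p.leader != failed:
            continue  # step 2: only partitions led by the failed broker
        # Step 3: replicas that are both in the ISR and still alive.
        live_isr = [b for b in p.replicas if b in p.isr and b in live]
        if live_isr:
            p.leader, p.isr = live_isr[0], live_isr  # step 4: clean election
        else:
            p.leader = None  # nothing eligible: the partition goes offline
        p.leader_epoch += 1  # step 5: the recorded state gets a new leader epoch
        changed.append(name)
    return changed


if __name__ == "__main__":
    cluster = {
        "my-topic-0": PartitionState([1, 2, 3], leader=1, isr=[1, 2, 3]),
        "my-topic-1": PartitionState([2, 3, 1], leader=2, isr=[2, 3, 1]),
    }
    print(handle_broker_failure(cluster, failed=1, live={2, 3}))
    print(cluster["my-topic-0"])
```

In this model `my-topic-1` keeps Broker 1 in its ISR because its leader did not change; a fuller model would also shrink the ISR of partitions where the failed broker was only a follower.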

Preferred Leader

Concept

# The first replica in the assignment list is the preferred leader
Topic: my-topic, Partition: 0
Replicas: [1, 2, 3]
          ↑
    Preferred Leader = Broker 1

How Imbalance Arises

Initial state (balanced):
Partition 0: Leader=1, Replicas=[1,2,3]
Partition 1: Leader=2, Replicas=[2,3,1]
Partition 2: Leader=3, Replicas=[3,1,2]

After Broker 1 fails and recovers:
Partition 0: Leader=2, Replicas=[1,2,3]  ← imbalanced
Partition 1: Leader=2, Replicas=[2,3,1]
Partition 2: Leader=3, Replicas=[3,1,2]

Broker 2 now leads two partitions while Broker 1 leads none!

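Automatic Rebalancing

# Automatic preferred leader election (default: true)
auto.leader.rebalance.enable = true

# How often the controller checks for imbalance, in seconds
leader.imbalance.check.interval.seconds = 300

# Imbalance allowed per broker, in percent, before a rebalance is triggered
leader.imbalance.per.broker.percentage = 10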
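
The threshold is easier to reason about with numbers. The sketch below assumes the per-broker imbalance is the share of partitions for which a broker is the preferred leader but not the current leader; the function names and the input shape are hypothetical.

```python
from typing import Dict, List, Tuple

# partition id -> (replica assignment, current leader)
Assignments = Dict[int, Tuple[List[int], int]]


def imbalance_by_broker(assignments: Assignments) -> Dict[int, float]:
    """Map each preferred-leader broker to the fraction of those partitions it does not lead."""
    preferred_total: Dict[int, int] = {}
    not_leading: Dict[int, int] = {}
    for replicas, leader in assignments.values():
        preferred = replicas[0]
        preferred_total[preferred] = preferred_total.get(preferred, 0) + 1
        if leader != preferred:
            not_leading[preferred] = not_leading.get(preferred, 0) + 1
    return {b: not_leading.get(b, 0) / total for b, total in preferred_total.items()}


def brokers_needing_rebalance(assignments: Assignments,
                              threshold_percent: float = 10) -> List[int]:
    """Brokers whose imbalance exceeds the allowed percentage."""
    ratios = imbalance_by_broker(assignments)
    return sorted(b for b, r in ratios.items() if r * 100 > threshold_percent)


if __name__ == "__main__":
    after_recovery = {
        0: ([1, 2, 3], 2),  # Broker 1 is preferred, but Broker 2 leads
        1: ([2, 3, 1], 2),
        2: ([3, 1, 2], 3),
    }
    print(imbalance_by_broker(after_recovery))
    print(brokers_needing_rebalance(after_recovery))
```

With the post-recovery state from the example, Broker 1 leads none of the partitions it is preferred for, so its ratio is 1.0 and it exceeds the 10 percent threshold.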

Manual Rebalancing

# Preferred leader election for a single partition of a topic
kafka-leader-election.sh \
    --bootstrap-server localhost:9092 \
    --election-type preferred \
    --topic my-topic \
    --partition 0
 
# Preferred leader election for every partition in the cluster
kafka-leader-election.sh \
    --bootstrap-server localhost:9092 \
    --election-type preferred \
    --all-topic-partitions
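
When only a handful of partitions need an election, kafka-leader-election.sh also accepts a JSON file through --path-to-json-file instead of --topic/--partition or --all-topic-partitions. The helper below is a small sketch for generating that file; `election_json` is a hypothetical name, and the output file name is arbitrary.

```python
import json
from typing import Iterable, Tuple


def election_json(partitions: Iterable[Tuple[str, int]]) -> str:
    """Build the JSON document listing the (topic, partition) pairs to elect."""
    payload = {
        "partitions": [{"topic": topic, "partition": part} for topic, part in partitions]
    }
    return json.dumps(payload, indent=2)


if __name__ == "__main__":
    # Redirect this output to a file, for example election.json.
    print(election_json([("my-topic", 0), ("my-topic", 1)]))
```

The resulting file would then be passed as `--path-to-json-file election.json` together with `--bootstrap-server` and `--election-type`.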

Election Type Comparison

The election-type Option

# Preferred election: make the preferred replica the leader, provided it is in the ISR
kafka-leader-election.sh \
    --election-type preferred \
    ...
 
# Unclean election: force an election that may pick an out-of-sync replica (OSR)
kafka-leader-election.sh \
    --election-type unclean \
    ...

Usage Scenarios

Type	Use case	Caveats
Preferred	Rebalancing, normal operations	Safe
Unclean	Emergency recovery, availability first	Possible data loss

Monitoring

JMX Metrics

Active Controller:
kafka.controller:type=KafkaController,name=ActiveControllerCount
→ Summed across the cluster, this should be exactly 1

Leader Election Rate:
kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs

Unclean Leader Election:
kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec
→ Should stay at 0
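
The two expectations above translate into a simple check. This sketch assumes the metric values have already been collected by some other means; `controller_alerts` and its parameters are hypothetical names, not part of any Kafka client.

```python
from typing import Dict, List


def controller_alerts(active_controller_counts: Dict[int, int],
                      unclean_elections: int) -> List[str]:
    """Return human-readable alerts based on already-collected metric values.

    active_controller_counts maps broker id -> ActiveControllerCount on that broker.
    unclean_elections is the number of unclean elections observed in the window.
    """
    alerts = []
    active = sum(active_controller_counts.values())
    if active != 1:
        alerts.append(f"expected exactly 1 active controller, found {active}")
    if unclean_elections > 0:
        alerts.append(f"{unclean_elections} unclean leader election(s) observed")
    return alerts


if __name__ == "__main__":
    print(controller_alerts({1: 1, 2: 0, 3: 0}, unclean_elections=0))
    print(controller_alerts({1: 1, 2: 1, 3: 0}, unclean_elections=2))
```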

Checking Logs

# Look for election activity in the controller log
grep -i "leader" /var/log/kafka/controller.log
 
# Look for elections involving a specific partition
grep "topic=my-topic, partition=0" /var/log/kafka/controller.log

Failure Scenarios

Scenario 1: Single Broker Failure

Before:
Partition 0: Leader=1, ISR=[1,2,3]

Broker 1 fails:
Partition 0: Leader=2, ISR=[2,3]

→ Clean election, no loss of committed data

Scenario 2: Multiple Broker Failure

Before:
Partition 0: Leader=1, ISR=[1,2,3]

Brokers 1 and 2 fail at the same time:
Partition 0: Leader=3, ISR=[3]

→ Clean election, no loss of committed data
   However, with min.insync.replicas=2, writes with acks=all are rejected
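
The write restriction in scenario 2 comes down to comparing the ISR size with min.insync.replicas. A minimal sketch, with the hypothetical helper name `accepts_acks_all_writes`:

```python
from typing import Sequence


def accepts_acks_all_writes(isr: Sequence[int], min_insync_replicas: int) -> bool:
    """A partition accepts acks=all writes only while the ISR is large enough."""
    return len(isr) >= min_insync_replicas


if __name__ == "__main__":
    scenarios = {
        "scenario 1 (ISR=[2,3])": [2, 3],
        "scenario 2 (ISR=[3])": [3],
    }
    for name, isr in scenarios.items():
        ok = accepts_acks_all_writes(isr, min_insync_replicas=2)
        print(name, "accepts acks=all writes" if ok else "rejects acks=all writes")
```

The partition in scenario 2 still has a leader and can serve reads; only producers that require acknowledgement from all in-sync replicas are turned away until another replica rejoins the ISR.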

Scenario 3: Entire ISR Fails

Before:
Partition 0: Leader=1, ISR=[1]

Broker 1 fails:

unclean.leader.election.enable=false:
→ The partition goes offline

unclean.leader.election.enable=true:
→ A leader is elected from the out-of-sync replicas
→ Data loss is possible

Best Practices

1. Disable Unclean Election

# Prioritize data integrity
unclean.leader.election.enable = false

2. Keep Enough In-Sync Replicas

# Recommended: replication factor 3 with min.insync.replicas=2
default.replication.factor = 3
min.insync.replicas = 2

3. Enable Automatic Rebalancing

auto.leader.rebalance.enable = true
leader.imbalance.check.interval.seconds = 300

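4. Set Up Monitoring

# Alert on unclean leader elections
# (the exported metric name depends on your JMX exporter rules; adjust it to match)
- alert: KafkaUncleanLeaderElection
  expr: rate(kafka_controller_ControllerStats_UncleanLeaderElectionsPerSec[5m]) > 0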
  labels:
    severity: critical

Yong's Park
