add: envoy redis
@@ -0,0 +1,95 @@
|
||||
# Envoy Gateway Configuration
|
||||
|
||||
This document explains how the Envoy gateway is configured and how to modify it.
|
||||
|
||||
## Files
|
||||
|
||||
- envoy.yaml: ConfigMap + Deployment + Service for Envoy
|
||||
|
||||
## Current Behavior
|
||||
|
||||
- Envoy listens on port 8080 in the Pod and exposes port 80 via a ClusterIP Service.
|
||||
- All HTTP traffic is routed to user-api only.
|
||||
- gRPC is not exposed by this gateway.
|
||||
|
||||
## Routing
|
||||
|
||||
In envoy.yaml, routes are defined under:
|
||||
|
||||
static_resources -> listeners -> http_connection_manager -> route_config -> virtual_hosts
|
||||
|
||||
The current routing rules are:
|
||||
|
||||
- All requests (prefix: "/") -> cluster: user-api
|
||||
|
||||
To add a new HTTP service, add a new route above the default route and define a new cluster.
|
||||
|
||||
Example: route /order to order-api-svc:8899
|
||||
|
||||
1) Add a route match:

    - match:
        prefix: "/order"
      route:
        cluster: order-api

2) Add a cluster:

    - name: order-api
      connect_timeout: 2s
      type: STRICT_DNS
      lb_policy: ROUND_ROBIN
      load_assignment:
        cluster_name: order-api
        endpoints:
        - lb_endpoints:
          - endpoint:
              address:
                socket_address:
                  address: order-api-svc.juwan.svc.cluster.local
                  port_value: 8899
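The reason the new route goes above the default route is that routes are evaluated in order and the first match wins. The following is a minimal Go sketch of that behavior, not Envoy code; the `route` type and `pickCluster` function are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// route pairs a path prefix with the cluster that should receive the request.
type route struct {
	prefix  string
	cluster string
}

// pickCluster returns the cluster of the first route whose prefix matches the path.
// It models first-match-wins evaluation; an empty string means nothing matched.
func pickCluster(routes []route, path string) string {
	for _, r := range routes {
		if strings.HasPrefix(path, r.prefix) {
			return r.cluster
		}
	}
	return ""
}

func main() {
	// Specific route first, catch-all last: /order traffic reaches order-api.
	specificFirst := []route{{"/order", "order-api"}, {"/", "user-api"}}
	// Catch-all first: the "/" prefix swallows every request.
	catchAllFirst := []route{{"/", "user-api"}, {"/order", "order-api"}}

	fmt.Println(pickCluster(specificFirst, "/order/42")) // order-api
	fmt.Println(pickCluster(catchAllFirst, "/order/42")) // user-api
}
```

Note that a plain prefix match on "/order" also matches paths such as "/orders"; whether that is acceptable depends on the API layout.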
|
||||
|
||||
## CSRF Protection
|
||||
|
||||
Envoy uses a Lua filter for CSRF validation:
|
||||
|
||||
- Safe methods (GET/HEAD/OPTIONS):
  - If csrf_token cookie is missing, Envoy generates one and sets it in the response.
- Unsafe methods (POST/PUT/PATCH/DELETE, etc):
  - Requires BOTH:
    - header: X-CSRF-Token
    - cookie: csrf_token
  - Values must match, otherwise Envoy returns 403.
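The rules above amount to a double-submit check. The sketch below models that decision in Go; it is not the Lua filter itself. The function names are hypothetical, and treating an empty header or cookie as a failure is an assumption based on the "Requires BOTH" rule.

```go
package main

import (
	"fmt"
	"net/http"
)

// isSafe reports whether the method is treated as read-only (GET/HEAD/OPTIONS).
func isSafe(method string) bool {
	switch method {
	case http.MethodGet, http.MethodHead, http.MethodOptions:
		return true
	}
	return false
}

// csrfStatus returns the status for the CSRF check alone:
// 200 means the request may continue, 403 means it is rejected.
// Assumption: a missing header or cookie counts as a mismatch.
func csrfStatus(method, headerToken, cookieToken string) int {
	if isSafe(method) {
		return http.StatusOK
	}
	if headerToken == "" || cookieToken == "" || headerToken != cookieToken {
		return http.StatusForbidden
	}
	return http.StatusOK
}

func main() {
	fmt.Println(csrfStatus("GET", "", ""))        // 200
	fmt.Println(csrfStatus("POST", "abc", "abc")) // 200
	fmt.Println(csrfStatus("POST", "abc", "xyz")) // 403
	fmt.Println(csrfStatus("DELETE", "", "abc"))  // 403
}
```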
|
||||
|
||||
If you want a different cookie name or header name, update these in the Lua code:
|
||||
|
||||
- Header: x-csrf-token
|
||||
- Cookie: csrf_token
|
||||
|
||||
To relax or tighten rules, edit the functions:
|
||||
|
||||
- is_safe(method)
|
||||
- envoy_on_request(request_handle)
|
||||
|
||||
## Cookie Attributes
|
||||
|
||||
Current Set-Cookie:
|
||||
|
||||
csrf_token=<value>; Path=/; SameSite=Strict
|
||||
|
||||
To add Secure or HttpOnly, update the string in envoy_on_response.
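As an illustration of how the cookie string changes, here is a small Go sketch that builds the same Set-Cookie value with optional attributes. The real string lives in the Lua code; `csrfCookie` is a hypothetical helper.

```go
package main

import "fmt"

// csrfCookie builds the Set-Cookie value shown above and appends the
// optional Secure and HttpOnly attributes when requested.
func csrfCookie(value string, secure, httpOnly bool) string {
	c := "csrf_token=" + value + "; Path=/; SameSite=Strict"
	if secure {
		c += "; Secure"
	}
	if httpOnly {
		c += "; HttpOnly"
	}
	return c
}

func main() {
	fmt.Println(csrfCookie("abc123", false, false)) // current behavior
	fmt.Println(csrfCookie("abc123", true, false))  // with Secure
}
```

Secure cookies are only sent over HTTPS. HttpOnly hides the cookie from JavaScript, so it is worth checking whether the frontend reads csrf_token to fill the X-CSRF-Token header before enabling it.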
|
||||
|
||||
## Deployment
|
||||
|
||||
Apply or update:
|
||||
|
||||
kubectl apply -f deploy/k8s/envoy/envoy.yaml
|
||||
|
||||
## Common Changes
|
||||
|
||||
- Change listening port:
  - Update listener port_value and Service targetPort/port.
- Change service namespace:
  - Update cluster DNS addresses (e.g. service.ns.svc.cluster.local).
- Add more services:
  - Add route + add cluster, as shown above.
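For the namespace change, the cluster address follows the `service.ns.svc.cluster.local` pattern. A tiny Go sketch of that naming rule, assuming the default `cluster.local` cluster domain; `serviceFQDN` and the "staging" namespace are hypothetical.

```go
package main

import "fmt"

// serviceFQDN builds the in-cluster DNS name of a Service.
// Assumption: the cluster uses the default domain "cluster.local".
func serviceFQDN(service, namespace string) string {
	return fmt.Sprintf("%s.%s.svc.cluster.local", service, namespace)
}

func main() {
	fmt.Println(serviceFQDN("order-api-svc", "juwan"))   // address used in the example above
	fmt.Println(serviceFQDN("order-api-svc", "staging")) // same Service after moving to a made-up namespace
}
```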
|
||||
@@ -0,0 +1,385 @@
|
||||
# Kubernetes Deployment Troubleshooting and Resolution Log

**Date**: February 23, 2026
**Issue**: user-rpc and Redis deployments failed
**Status**: Diagnosed, fix in progress
|
||||
|
||||
---
|
||||
|
||||
## 📋 Problem Description

After running `kubectl apply -f test.yaml`, the resources were created successfully, but the application pods never reached a running state:
|
||||
|
||||
```
|
||||
kubectl apply -f ..\test.yaml
|
||||
✓ deployment.apps/user-rpc created
|
||||
✓ service/user-rpc-svc created
|
||||
✓ horizontalpodautoscaler.autoscaling/user-rpc-hpa-c created
|
||||
✓ horizontalpodautoscaler.autoscaling/user-rpc-hpa-m created
|
||||
✓ redisreplication.redis.redis.opstreelabs.in/user-redis created
|
||||
✓ redissentinel.redis.redis.opstreelabs.in/user-redis-sentinel created
|
||||
✓ cluster.postgresql.cnpg.io/user-db created
|
||||
```
|
||||
|
||||
But `kubectl get all` then showed:
- ❌ **user-rpc pods were not created** (Deployment at 0/3 ready replicas)
- ❌ **Redis pods were not created** (the RedisReplication resource exists but has no pods)
- ✅ user-db pods are running normally (3/3)
|
||||
|
||||
---
|
||||
|
||||
## 🔍 Investigation

### Step 1: Check the Deployment status
|
||||
|
||||
```bash
|
||||
kubectl describe deployment user-rpc
|
||||
```
|
||||
|
||||
**Findings**:
|
||||
```
|
||||
Conditions:
|
||||
Type Status Reason
|
||||
---- ------ ------
|
||||
Progressing True NewReplicaSetCreated
|
||||
Available False MinimumReplicasUnavailable
|
||||
ReplicaFailure True FailedCreate
|
||||
```
|
||||
|
||||
### Step 2: Inspect the ReplicaSet
|
||||
|
||||
```bash
|
||||
kubectl describe replicaset user-rpc-6bf77fbcd9
|
||||
```
|
||||
|
||||
**Key error found**:
|
||||
```
|
||||
Events:
|
||||
Type Reason Age From Message
|
||||
---- ------ ---- ---- -------
|
||||
Warning FailedCreate 3m53s replicaset-controller Error creating:
|
||||
pods "user-rpc-6bf77fbcd9-" is forbidden: error looking up service
|
||||
account default/find-endpoints: serviceaccount "find-endpoints" not found
|
||||
```
|
||||
|
||||
**Issue #1 diagnosed**: ❌ **ServiceAccount "find-endpoints" is missing**
|
||||
|
||||
### Step 3: List existing ServiceAccounts
|
||||
|
||||
```bash
|
||||
kubectl get serviceaccount
|
||||
```
|
||||
|
||||
**Result**:
|
||||
```
|
||||
NAME AGE
|
||||
cluster-example 4d10h
|
||||
default 13d
|
||||
redis-operator 9h
|
||||
user-db 4m9s
|
||||
```
|
||||
|
||||
Confirmed: `find-endpoints` does not exist.

### Step 4: Check Secrets
|
||||
|
||||
```bash
|
||||
kubectl get secrets
|
||||
```
|
||||
|
||||
**Result**: the expected Secrets all exist, including:
- ✅ user-db-app
- ✅ user-redis
- ✅ user-db-ca, user-db-replication, user-db-server

### Step 5: Check the Redis deployment
|
||||
|
||||
```bash
|
||||
kubectl get redisreplication
|
||||
kubectl get pods | grep redis
|
||||
```
|
||||
|
||||
**Findings**:
- ✅ The RedisReplication resource exists
- ❌ Redis pods were **not created at all**

**Preliminary diagnosis (Redis, tracked as issue #3 below)**: ❌ **The Redis Operator is not acting on the RedisReplication resource**
|
||||
|
||||
---
|
||||
|
||||
## 🔧 First Fix Attempt

### Create the missing ServiceAccount
|
||||
|
||||
```bash
|
||||
kubectl create serviceaccount find-endpoints
|
||||
```
|
||||
|
||||
**Result**: ✅ ServiceAccount created

### Restart the Deployment
|
||||
|
||||
```bash
|
||||
kubectl rollout restart deployment user-rpc
|
||||
```
|
||||
|
||||
**Wait 5-10 seconds, then check again**:
|
||||
|
||||
```bash
|
||||
kubectl get pods -o wide
|
||||
```
|
||||
|
||||
**New findings**:
|
||||
|
||||
```
|
||||
NAME READY STATUS RESTARTS AGE
|
||||
user-rpc-66f97fbdcc-ws7rc 0/1 ErrImagePull 0 26s
|
||||
user-rpc-6bf77fbcd9-njm2z 0/1 ErrImagePull 0 29s
|
||||
user-rpc-6bf77fbcd9-nwjtw 0/1 ImagePullBackOff 0 29s
|
||||
user-rpc-6bf77fbcd9-wjrf8 0/1 ErrImagePull 0 29s
|
||||
```
|
||||
|
||||
✅ **Good news**: the pods are now being created (so the ServiceAccount issue is resolved)
❌ **New problem**: the image pull fails
|
||||
|
||||
---
|
||||
|
||||
## 🐛 Root Cause Analysis

### Issue #1: Missing ServiceAccount ✅ Resolved

**Root cause**: the Deployment manifest in test.yaml specifies:
|
||||
```yaml
|
||||
spec:
  template:
    spec:
      serviceAccountName: find-endpoints  # this ServiceAccount does not exist
|
||||
```
|
||||
|
||||
but test.yaml never creates that ServiceAccount resource.

**Fix**:
|
||||
```bash
|
||||
kubectl create serviceaccount find-endpoints
|
||||
```
|
||||
|
||||
Or add this to test.yaml:
|
||||
```yaml
|
||||
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: find-endpoints
  namespace: default
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Issue #2: Image pull failure ❌ Needs a fix
|
||||
|
||||
```bash
|
||||
kubectl describe pod user-rpc-6bf77fbcd9-njm2z
|
||||
```
|
||||
|
||||
**Detailed error log**:
|
||||
|
||||
```
|
||||
Events:
|
||||
Warning Failed 38s kubelet Failed to pull image
|
||||
"103.236.53.208:4418/library/user-rpc@sha256:76b27d3eb4d5d44e...":
|
||||
Error response from daemon: Get "https://103.236.53.208:4418/v2/":
|
||||
context deadline exceeded (Client.Timeout exceeded while awaiting headers)
|
||||
|
||||
Warning Failed 23s kubelet Failed to pull image
|
||||
"103.236.53.208:4418/library/user-rpc@sha256:76b27d3eb4d5d44e...":
|
||||
http: server gave HTTP response to HTTPS client
|
||||
```
|
||||
|
||||
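**Root cause analysis**:

1. **Network connection failure**: `context deadline exceeded` means the node could not reach the image registry before the request timed out.
2. **Protocol mismatch**: `http: server gave HTTP response to HTTPS client` means:
   - The registry at `103.236.53.208:4418` serves plain HTTP, not HTTPS.
   - The Docker daemon tries to connect over HTTPS, but the server answers with HTTP.

**Possible causes**:
- The registry address is wrong or unreachable
- The registry needs specific network configuration
- The registry server is offline or misconfigured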
|
||||
|
||||
---
|
||||
|
||||
### Issue #3: Redis deployment failure ❌ Needs diagnosis

**Symptoms**:
- The RedisReplication and RedisSentinel custom resources were created successfully
- But no corresponding Redis pods were created
- `kubectl get pods | grep redis` returns nothing

**Possible causes**:

1. **The Redis Operator is not working properly**
   - The Operator pod may be unhealthy
   - The Operator is not watching the new RedisReplication resource

2. **CRD or API version mismatch**
   - The API version `v1beta2` used in the manifest may not match the Operator version

3. **Resource limits or permissions**
   - The Operator lacks permission to create pods
   - Cluster resource limits block pod creation
|
||||
|
||||
---
|
||||
|
||||
## ✅ Fixes Applied So Far

| # | Issue | Fix | Status |
|---|-------|-----|--------|
| 1 | Missing ServiceAccount | `kubectl create serviceaccount find-endpoints` | ✅ Done |
| 2 | Image pull failure | Update the image address or fix the network | ⏳ Pending |
| 3 | Redis pods not created | Inspect the Operator logs | ⏳ To be diagnosed |
|
||||
|
||||
---
|
||||
|
||||
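## 🚀 Next Steps

### Priority 1: Fix the user-rpc image pull

**Option A: Use a local or internal image**
```yaml
# Change the image address in test.yaml
image: localhost:5000/user-rpc:latest  # local private registry
# or
image: user-rpc:latest                 # local image (if already imported with docker load)
# Note: with the :latest tag, imagePullPolicy defaults to Always;
# set it to IfNotPresent or Never so the kubelet uses the local image.
```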
|
||||
|
||||
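**Option B: Fix the registry address**
```yaml
# If 103.236.53.208:4418 really is the right registry, keep the reference scheme-less:
# an image reference cannot carry an http:// prefix.
image: 103.236.53.208:4418/library/user-rpc:latest
# Plain HTTP has to be allowed in the container runtime on each node instead
# (for Docker, by listing the registry under "insecure-registries" in daemon.json).
```

**Verification**:
```bash
# Check connectivity to the image registry
curl -v http://103.236.53.208:4418/v2/
```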
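Since an image reference has the form `host:port/repository:tag` with no URL scheme, a small helper can normalize a registry address before it is written into a manifest. This is only a sketch; `imageRef` is a hypothetical function, not part of any Kubernetes tooling.

```go
package main

import (
	"fmt"
	"strings"
)

// imageRef joins a registry address, repository, and tag into an image
// reference, dropping any http:// or https:// prefix and trailing slash
// from the registry address.
func imageRef(registry, repository, tag string) string {
	registry = strings.TrimPrefix(registry, "http://")
	registry = strings.TrimPrefix(registry, "https://")
	registry = strings.TrimSuffix(registry, "/")
	return registry + "/" + repository + ":" + tag
}

func main() {
	fmt.Println(imageRef("http://103.236.53.208:4418", "library/user-rpc", "latest"))
	// 103.236.53.208:4418/library/user-rpc:latest
}
```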
|
||||
|
||||
### Priority 2: Diagnose the Redis Operator

```bash
# Follow the Operator logs
kubectl logs -l app.kubernetes.io/name=redis-operator -f

# Find the Operator pod
kubectl get pods -A | grep redis-operator

# Show details of the RedisReplication
kubectl describe redisreplication user-redis

# Check the Operator's permissions (RBAC)
kubectl get role,rolebinding,clusterrole,clusterrolebinding | grep redis
```
|
||||
|
||||
### Priority 3: Harden test.yaml

Add the missing resource definitions to test.yaml:
|
||||
|
||||
```yaml
|
||||
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: find-endpoints
  namespace: default

---
apiVersion: v1
kind: Secret
metadata:
  name: registry-credentials
  namespace: default
type: kubernetes.io/dockercfg
data:
  .dockercfg: <base64-encoded-credentials>  # only if the private registry requires authentication
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 📊 Current Cluster State

### Pod status summary

| Application | Desired replicas | Running | Status |
|------|---------|---------|------|
| user-db | 3 | 3 | ✅ Healthy |
| user-rpc | 3 | 0 | ❌ Image pull failure |
| Redis | 3 | 0 | ❌ Not created by the Operator |
| Sentinel | 3 | 0 | ❌ Not created by the Operator |
|
||||
|
||||
### Service status

```
✅ kubernetes (built-in)
✅ user-rpc-svc:9001
✅ user-db-r:5432 (read, any instance)
✅ user-db-ro:5432 (read-only replicas)
✅ user-db-rw:5432 (read-write primary)
```
|
||||
|
||||
### HPA configuration

```
✅ user-rpc-hpa-c (CPU target: 80%) - inactive (pods not running)
✅ user-rpc-hpa-m (memory target: 80%) - inactive (pods not running)
```
|
||||
|
||||
---
|
||||
|
||||
## 📝 Command Cheat Sheet

```bash
# Show the Deployment status
kubectl describe deployment user-rpc

# Show ReplicaSet error events
kubectl describe replicaset user-rpc-6bf77fbcd9

# Show detailed Pod errors
kubectl describe pod user-rpc-6bf77fbcd9-njm2z

# Show Pod logs
kubectl logs user-rpc-6bf77fbcd9-njm2z

# List all events (sorted by time)
kubectl get events --sort-by='.lastTimestamp'

# List all resources in a given namespace
kubectl get all -n default

# Restart the deployment (forces the pods to be recreated)
kubectl rollout restart deployment user-rpc

# Show the Operator logs
kubectl logs -l app.kubernetes.io/name=redis-operator

# Check that the CRDs are registered
kubectl api-resources | grep redis
```
|
||||
|
||||
---
|
||||
|
||||
## 🎯 Summary

| Issue | Cause | Status |
|------|------|---------|
| **Missing ServiceAccount** | Referenced in the manifest but never created | ✅ **Resolved** |
| **Image pull failure** | Registry unreachable or protocol mismatch | ⏳ **Pending** |
| **Redis not deployed** | Operator not acting on the custom resources | ⏳ **To be diagnosed** |

**Recommended actions**:
1. Confirm or fix the user-rpc image address
2. Diagnose the Redis Operator
3. Verify that every required ServiceAccount and Secret exists
4. Consider defining all required resources in test.yaml to avoid manual creation
|
||||
|
||||
@@ -0,0 +1,743 @@
|
||||
# Redis Kubernetes Services Explained

**Question:** Why does Redis have 8 Services when the application config only uses `user-redis-sentinel-sentinel.juwan.svc.cluster.local:26379`?

**Date:** February 22, 2026
|
||||
|
||||
---
|
||||
|
||||
## 📋 目录
|
||||
|
||||
1. [Service 概览](#service-概览)
|
||||
2. [Kubernetes Service 基础](#kubernetes-service-基础)
|
||||
3. [8 个 Service 的详细说明](#8-个-service-的详细说明)
|
||||
4. [为什么使用哪个 Service](#为什么使用哪个-service)
|
||||
5. [Service 创建原理](#service-创建原理)
|
||||
6. [网络流量路由](#网络流量路由)
|
||||
7. [故障排查](#故障排查)
|
||||
|
||||
---
|
||||
|
||||
## 📊 Service 概览
|
||||
|
||||
### The 8 Redis Services currently deployed
|
||||
|
||||
```bash
|
||||
$ kubectl get svc -n juwan | grep redis
|
||||
|
||||
NAME                                      TYPE        CLUSTER-IP       PORTS               AGE
user-redis                                ClusterIP   10.103.91.84     6379/TCP,9121/TCP   33m
user-redis-additional                     ClusterIP   10.107.228.48    6379/TCP            33m
user-redis-headless                       ClusterIP   None             6379/TCP            33m
user-redis-master                         ClusterIP   10.97.120.76     6379/TCP            33m
user-redis-replica                        ClusterIP   10.100.213.103   6379/TCP            33m
user-redis-sentinel-sentinel              ClusterIP   10.105.28.231    26379/TCP           32m
user-redis-sentinel-sentinel-additional   ClusterIP   10.97.111.42     26379/TCP           32m
user-redis-sentinel-sentinel-headless     ClusterIP   None             26379/TCP           32m
|
||||
```
|
||||
|
||||
### By function

| Category | Service name | Role |
|-----|-------------|------|
| **Redis data layer** | user-redis | General entry point |
| | user-redis-additional | Additional entry point |
| | user-redis-master | Master only |
| | user-redis-replica | Replicas only |
| | user-redis-headless | Pod-to-Pod communication |
| **Sentinel monitoring layer** | user-redis-sentinel-sentinel | Sentinel entry point ⭐ |
| | user-redis-sentinel-sentinel-additional | Additional entry point |
| | user-redis-sentinel-sentinel-headless | Sentinel-to-Sentinel communication |
|
||||
|
||||
---
|
||||
|
||||
## 🔷 Kubernetes Service 基础
|
||||
|
||||
### Service 的作用
|
||||
|
||||
**Kubernetes 中的 Service 是什么?**
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────┐
|
||||
│ Kubernetes Cluster │
|
||||
│ │
|
||||
│ Service (虚拟 IP + DNS) │
|
||||
│ ↓ │
|
||||
│ Endpoints (实际 Pod IP 列表) │
|
||||
│ ├─ 10.244.0.10:6379 (Pod 1) │
|
||||
│ ├─ 10.244.1.20:6379 (Pod 2) │
|
||||
│ └─ 10.244.2.30:6379 (Pod 3) │
|
||||
│ │
|
||||
│ 客户端 ──→ Service IP (稳定) ──→ Pod IP (变化) │
|
||||
└─────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
### Three Service variants

| Type | CLUSTER-IP | Purpose | Example |
|-----|-----------|------|------|
| **ClusterIP** | ✅ Yes | Access from inside the cluster | 10.103.91.84 |
| **ClusterIP<br/>(Headless)** | ❌ None | Direct Pod-to-Pod communication | None |
| **NodePort** | ✅ Yes | Access from outside the cluster | 10.103.91.84 |
|
||||
|
||||
---
|
||||
|
||||
## 🔍 8 个 Service 的详细说明
|
||||
|
||||
### 第一组:Redis 数据层 Service(端口 6379)
|
||||
|
||||
#### 1️⃣ user-redis (ClusterIP)

**Basic information:**
```yaml
Name: user-redis
Type: ClusterIP (load-balanced)
Cluster IP: 10.103.91.84
Ports: 6379/TCP, 9121/TCP
DNS: user-redis.juwan.svc.cluster.local
```

**Endpoints:**
```bash
$ kubectl get endpoints user-redis -n juwan

NAME         ENDPOINTS
user-redis   10.244.0.10:6379,10.244.1.20:6379,10.244.2.30:6379
```

**Load balancing:**
```
Client request ──→ Service IP (10.103.91.84)
        ↓
  kube-proxy (iptables/ipvs)
        ↓
  Picks one Pod
  ├─ 10.244.0.10 (redis-0)
  ├─ 10.244.1.20 (redis-1)  ← possible
  └─ 10.244.2.30 (redis-2)
```

**Characteristics:**
- ✅ Load-balances across all Pods, master and replicas alike
- ✅ Exposes both the Redis data port (6379) and the exporter (9121)
- ⚠️ A write can land on a replica and fail

**Use cases:**
- Metrics scraping (Prometheus scrapes port 9121)
- Simple queries that do not care about read/write splitting

**Why two ports?**
```
6379: Redis data service
9121: Prometheus exporter port
  └─ Exposes Redis metrics to Prometheus
     (redis_up, redis_memory_used, etc.)
```

**Why the application does not use it:**
```
❌ Reading and writing directly through user-redis:
  ├─ Writes may be routed to a replica (error)
  ├─ No automatic failover
  └─ Depends on manual configuration updates
```
|
||||
|
||||
---
|
||||
|
||||
#### 2️⃣ user-redis-additional(ClusterIP)
|
||||
|
||||
**基本信息:**
|
||||
```yaml
|
||||
名称: user-redis-additional
|
||||
类型: ClusterIP (有负载均衡)
|
||||
Cluster IP: 10.107.228.48
|
||||
端口: 6379/TCP
|
||||
Endpoints: 同 user-redis
|
||||
```
|
||||
|
||||
**作用:**
|
||||
- 功能完全同 `user-redis`
|
||||
- 提供额外的访问入口
|
||||
- 用于多租户/网络隔离场景
|
||||
|
||||
**为什么有这个?**
|
||||
```
|
||||
场景:某些网络策略可能只允许访问特定 Service
|
||||
└─ 额外的 Service 提供备用入口
|
||||
```
|
||||
|
||||
**不常用的原因:**
|
||||
- 大多数场景用 `user-redis` 就足够
|
||||
- `user-redis-additional` 是备用
|
||||
|
||||
---
|
||||
|
||||
#### 3️⃣ user-redis-headless(ClusterIP: None)
|
||||
|
||||
**基本信息:**
|
||||
```yaml
|
||||
名称: user-redis-headless
|
||||
类型: ClusterIP (Headless Service)
|
||||
Cluster IP: None ← 关键:无虚拟 IP
|
||||
端口: 6379/TCP
|
||||
DNS: user-redis-headless.juwan.svc.cluster.local
|
||||
```
|
||||
|
||||
**特殊之处:无虚拟 IP**
|
||||
|
||||
```bash
|
||||
# 正常 Service 查询返回虚拟 IP
|
||||
$ nslookup user-redis.juwan.svc.cluster.local
|
||||
Name: user-redis.juwan.svc.cluster.local
|
||||
Address: 10.103.91.84 ← 虚拟 IP
|
||||
|
||||
# Headless Service 查询返回所有 Pod IP
|
||||
$ nslookup user-redis-headless.juwan.svc.cluster.local
|
||||
Name: user-redis-headless.juwan.svc.cluster.local
|
||||
Address: 10.244.0.10 ← Pod 1 实际 IP
|
||||
Address: 10.244.1.20 ← Pod 2 实际 IP
|
||||
Address: 10.244.2.30 ← Pod 3 实际 IP
|
||||
```
|
||||
|
||||
**使用场景:**
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────┐
|
||||
│ StatefulSet (Redis Cluster/Replication) │
|
||||
│ │
|
||||
│ redis-0 (主) redis-1 (从) redis-2 (从) │
|
||||
│ ↓ ↓ ↓ │
|
||||
│ 10.244.0.10 10.244.1.20 10.244.2.30 │
|
||||
│ ↑ │
|
||||
│ 需要直接连接到特定 Pod: │
|
||||
│ redis-0.user-redis-headless (连接主节点) │
|
||||
│ redis-1.user-redis-headless (连接从节点) │
|
||||
└─────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
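**Who uses it?**
- Redis replication: replicas need to connect to a known master
- Sentinel monitoring: needs direct access to specific Redis instances
- The Redis Operator, internally

**Why doesn't the application use it?**
```
❌ Per-Pod DNS names only resolve from inside the cluster
  └─ An application would have to know each Pod's specific DNS name

✅ Advantage of a Service with a virtual IP
  └─ No need to track changes in the underlying Pods
```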
|
||||
|
||||
---
|
||||
|
||||
#### 4️⃣ user-redis-master(ClusterIP)
|
||||
|
||||
**基本信息:**
|
||||
```yaml
|
||||
名称: user-redis-master
|
||||
类型: ClusterIP
|
||||
Cluster IP: 10.97.120.76
|
||||
端口: 6379/TCP
|
||||
Endpoints: 10.244.0.10:6379 (只有 1 个 Pod)
|
||||
DNS: user-redis-master.juwan.svc.cluster.local
|
||||
```
|
||||
|
||||
**特点:只指向主节点**
|
||||
|
||||
```bash
|
||||
$ kubectl get endpoints user-redis-master -n juwan
|
||||
|
||||
NAME ENDPOINTS
|
||||
user-redis-master 10.244.0.10:6379 ← 仅主节点
|
||||
```
|
||||
|
||||
**对比所有 Endpoints:**
|
||||
```
|
||||
user-redis-master: 10.244.0.10 (主)
|
||||
user-redis-replica: 10.244.1.20, 10.244.2.30 (从)
|
||||
user-redis: 所有 Pod
|
||||
```
|
||||
|
||||
**Why are they separate?**
```
┌─────────────────────────────────────────┐
│  Redis master-replica architecture      │
│                                         │
│  Redis Master (10.244.0.10)             │
│    ├─ Handles all writes                │
│    └─ Replicates data to the replicas   │
│                                         │
│  Redis Replica 1 (10.244.1.20)          │
│    └─ Serves read-only traffic          │
│                                         │
│  Redis Replica 2 (10.244.2.30)          │
│    └─ Serves read-only traffic          │
└─────────────────────────────────────────┘

Request routing:
┌───────────────────────┐
│ SET key value         │ ──→ user-redis-master (10.97.120.76)
│ HSET user:1 name john │
└───────────────────────┘

┌───────────────────────┐
│ GET key               │ ──→ user-redis-replica (10.100.213.103)
│ HGET user:1 name      │
└───────────────────────┘
```
|
||||
|
||||
**Use cases:**
- ✅ Read/write splitting
- ✅ Better read performance (replicas serve reads)
- ✅ Less load on the master

**Why applications usually don't use them directly:**
```
❌ The application has to tell reads from writes
  ├─ Writes → user-redis-master
  ├─ Reads  → user-redis-replica
  └─ More complex code

✅ Sentinel mode handles this automatically
  └─ The application does not need to care which node is the master
```
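To make the "application has to tell reads from writes" cost concrete, here is a minimal Go sketch of the routing decision an application would need. The command list is a deliberately tiny, illustrative subset, and `addrFor` is a hypothetical helper, not part of any Redis client.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	masterAddr  = "user-redis-master.juwan.svc.cluster.local:6379"
	replicaAddr = "user-redis-replica.juwan.svc.cluster.local:6379"
)

// readCommands lists the commands known to be read-only (illustrative subset).
var readCommands = map[string]bool{"GET": true, "HGET": true, "EXISTS": true, "TTL": true}

// addrFor picks the Service a command should be sent to.
// Anything not known to be read-only goes to the master, so an
// unrecognized write is never sent to a replica by mistake.
func addrFor(command string) string {
	if readCommands[strings.ToUpper(command)] {
		return replicaAddr
	}
	return masterAddr
}

func main() {
	for _, cmd := range []string{"SET", "HSET", "GET", "hget"} {
		fmt.Printf("%-5s -> %s\n", cmd, addrFor(cmd))
	}
}
```

Defaulting unknown commands to the master trades some read offloading for safety; the opposite default would be cheaper but could send writes to a read-only node.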
|
||||
|
||||
---
|
||||
|
||||
#### 5️⃣ user-redis-replica(ClusterIP)
|
||||
|
||||
**基本信息:**
|
||||
```yaml
|
||||
名称: user-redis-replica
|
||||
类型: ClusterIP
|
||||
Cluster IP: 10.100.213.103
|
||||
端口: 6379/TCP
|
||||
Endpoints: 10.244.1.20:6379, 10.244.2.30:6379 (两个从节点)
|
||||
DNS: user-redis-replica.juwan.svc.cluster.local
|
||||
```
|
||||
|
||||
**特点:只指向从节点,支持负载均衡**
|
||||
|
||||
```bash
|
||||
$ kubectl get endpoints user-redis-replica -n juwan
|
||||
|
||||
NAME ENDPOINTS
|
||||
user-redis-replica 10.244.1.20:6379, 10.244.2.30:6379
|
||||
```
|
||||
|
||||
**读流量分散:**
|
||||
```
|
||||
应用发送 GET 请求
|
||||
↓
|
||||
user-redis-replica (10.100.213.103)
|
||||
↓
|
||||
随机选择一个从节点
|
||||
├─ 10.244.1.20 (redis-1) ← 可能
|
||||
└─ 10.244.2.30 (redis-2) ← 可能
|
||||
```
|
||||
|
||||
**适用场景:**
|
||||
- 除了 Sentinel 模式外的读优化
|
||||
- 需要手动管理读写分离
|
||||
|
||||
---
|
||||
|
||||
### 第二组:Sentinel 监控层 Service(端口 26379)
|
||||
|
||||
#### 6️⃣ user-redis-sentinel-sentinel(ClusterIP)⭐⭐⭐
|
||||
|
||||
**基本信息:**
|
||||
```yaml
|
||||
名称: user-redis-sentinel-sentinel
|
||||
类型: ClusterIP
|
||||
Cluster IP: 10.105.28.231
|
||||
端口: 26379/TCP
|
||||
Endpoints: 10.244.0.50:26379, 10.244.1.70:26379, 10.244.2.90:26379
|
||||
(3 个 Sentinel 实例)
|
||||
DNS: user-redis-sentinel-sentinel.juwan.svc.cluster.local
|
||||
```
|
||||
|
||||
**Why does the application use this one?**

```
Application configuration:
┌──────────────────────────────────────────────┐
│ Redis:                                       │
│   Host: user-redis-sentinel-sentinel         │
│   Port: 26379                                │
│   Type: sentinel                             │
│   MasterName: mymaster                       │
└──────────────────────────────────────────────┘

Connection flow:
┌─────────────────────────────────────────────┐
│ Application                                 │
└────────────────────┬────────────────────────┘
                     │
                     ↓
┌─────────────────────────────────────────────┐
│ user-redis-sentinel-sentinel (26379)        │
│  ├─ Sentinel 1: 10.244.0.50:26379           │
│  ├─ Sentinel 2: 10.244.1.70:26379           │
│  └─ Sentinel 3: 10.244.2.90:26379           │
└────────────────────┬────────────────────────┘
                     │
     Application asks: "Where is mymaster?"
                     ↓
     Sentinel answers: "At 10.244.0.10:6379"
                     ↓
┌─────────────────────────────────────────────┐
│ Redis Master: 10.244.0.10:6379              │
│ (the application connects directly)         │
└─────────────────────────────────────────────┘

Failover:
Master fails → Sentinels detect it → a replica is promoted
  → on its next query the application gets the new master address
  → and connects to the new master
```
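The discovery flow can be modeled with an in-memory stand-in for Sentinel. This Go sketch is a simulation only: it does not talk to Redis, the `sentinel` type is hypothetical, and `masterAddr` merely plays the role of the `SENTINEL get-master-addr-by-name` query.

```go
package main

import (
	"errors"
	"fmt"
)

// sentinel is an in-memory stand-in for a Sentinel node:
// it maps a master name to the current master address.
type sentinel struct {
	masters map[string]string
}

// masterAddr answers "where is <name>?"; unknown names are an error.
func (s *sentinel) masterAddr(name string) (string, error) {
	addr, ok := s.masters[name]
	if !ok {
		return "", errors.New("unknown master name: " + name)
	}
	return addr, nil
}

// failover simulates Sentinel promoting a replica to master.
func (s *sentinel) failover(name, newAddr string) {
	s.masters[name] = newAddr
}

func main() {
	s := &sentinel{masters: map[string]string{"mymaster": "10.244.0.10:6379"}}

	addr, _ := s.masterAddr("mymaster")
	fmt.Println("connect to", addr)

	// The master fails and a replica is promoted; the application asks again.
	s.failover("mymaster", "10.244.1.20:6379")
	addr, _ = s.masterAddr("mymaster")
	fmt.Println("reconnect to", addr)

	// In this sketch the lookup is an exact string match.
	_, err := s.masterAddr("myMaster")
	fmt.Println(err)
}
```

The application never stores a master address in its configuration; it only keeps the Sentinel address and the master name, which is why a failover needs no config change.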
|
||||
|
||||
**Why is this the best choice?**

1. **Automatic failover**
   ```
   Master goes down (✗) → Sentinels elect a new master → the application reconnects automatically
   ```

2. **High availability**
   ```
   Sentinel group (3 nodes) → failover can still be authorized with 1 Sentinel down (a majority, 2 of 3, must agree)
   ```

3. **Transparent to the application**
   ```
   The application only configures MasterName: mymaster
   It does not track master/replica address changes
   ```

4. **Standard practice**
   ```
   ✅ Widely used Redis high-availability setup
   ✅ Minimal application changes
   ✅ Highest degree of automation
   ```
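How many Sentinel failures a group survives follows from the majority rule: a failover must be authorized by more than half of the Sentinels. A short Go sketch of that arithmetic, with hypothetical function names:

```go
package main

import "fmt"

// majority is the number of Sentinels that must agree to authorize a failover.
func majority(sentinels int) int {
	return sentinels/2 + 1
}

// tolerableFailures is how many Sentinels can be down while a failover
// can still be authorized.
func tolerableFailures(sentinels int) int {
	return sentinels - majority(sentinels)
}

func main() {
	for _, n := range []int{3, 5} {
		fmt.Printf("sentinels=%d majority=%d tolerable failures=%d\n",
			n, majority(n), tolerableFailures(n))
	}
}
```

With the three Sentinels used here, the majority is 2, so one Sentinel can be lost; with two down, the survivor can still answer address queries but cannot authorize a failover.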
|
||||
|
||||
**Why not the other Services?**

```
❌ user-redis-master/user-redis-replica
  └─ The application must split reads and writes; a master switch requires restarting the application

❌ user-redis/user-redis-additional
  └─ No failover; the application errors out when a node fails

✅ user-redis-sentinel-sentinel
  └─ Discovers the new master automatically; no restart needed
```
|
||||
|
||||
---
|
||||
|
||||
#### 7️⃣ user-redis-sentinel-sentinel-additional(ClusterIP)
|
||||
|
||||
**说明:** 功能同 `user-redis-sentinel-sentinel`,备用入口
|
||||
|
||||
---
|
||||
|
||||
#### 8️⃣ user-redis-sentinel-sentinel-headless(ClusterIP: None)
|
||||
|
||||
**说明:** 供 Sentinel 内部通信和选举使用
|
||||
|
||||
---
|
||||
|
||||
## 🎯 为什么使用哪个 Service
|
||||
|
||||
### Choosing the application configuration

#### ⭐⭐⭐ Sentinel mode (recommended for production)

```yaml
# Application config
Redis:
  Host: user-redis-sentinel-sentinel.juwan.svc.cluster.local:26379
  Type: sentinel
  MasterName: mymaster
  Pass: ${REDIS_PASSWORD}
```

**Advantages:**
- ✅ Automatic failover (RTO < 30 seconds)
- ✅ No application restart
- ✅ Automatic discovery of the new master
- ✅ Standard production practice
|
||||
|
||||
---
|
||||
|
||||
#### ⭐⭐ Master/replica split mode (optional)

```yaml
# Application config (needs two hosts)
Redis:
  Master:
    Host: user-redis-master.juwan.svc.cluster.local:6379
  Slave:
    Host: user-redis-replica.juwan.svc.cluster.local:6379
```

**Use cases:**
- Workloads with a clear read/write split
- Very high read-performance requirements

**Drawbacks:**
- Master/replica failures need a manual switch
- More complexity in the application layer
|
||||
|
||||
---
|
||||
|
||||
#### ❌ Not recommended

```yaml
# ❌ Connecting straight to a single node
Redis:
  Host: user-redis-0.user-redis-headless.juwan.svc.cluster.local:6379
# Problem: pins the client to one Pod; after a failover that Pod may no longer be the master

# ❌ Connecting to the general Service (no failover)
Redis:
  Host: user-redis.juwan.svc.cluster.local:6379
# Problem: no automatic failover; the application errors out when a node fails

# ❌ Hard-coding a Pod IP
Redis:
  Host: 10.244.0.10:6379
# Problem: the Pod IP changes on restart and the application breaks immediately
```
|
||||
|
||||
---
|
||||
|
||||
## 🔌 Service 创建原理
|
||||
|
||||
### 为什么会自动创建这么多 Service?
|
||||
|
||||
**由 Redis Operator 自动创建:**
|
||||
|
||||
```go
|
||||
// Redis Operator logic (illustrative sketch, not the Operator's actual code)
package main

import "fmt"

type service struct{ name, endpoints string }

func servicesFor(redis, sentinel string) []service {
	return []service{
		// Data-layer Services
		{redis, "all Redis nodes"},
		{redis + "-additional", "all Redis nodes"},
		{redis + "-master", "master node only"},
		{redis + "-replica", "replica nodes only"},
		{redis + "-headless", "all Redis nodes (headless)"},
		// Sentinel-layer Services
		{sentinel, "all Sentinel nodes"},
		{sentinel + "-additional", "all Sentinel nodes"},
		{sentinel + "-headless", "all Sentinel nodes (headless)"},
	}
}

func main() {
	for _, s := range servicesFor("user-redis", "user-redis-sentinel-sentinel") {
		fmt.Printf("%s -> %s\n", s.name, s.endpoints)
	}
}
|
||||
```
|
||||
|
||||
**为什么这样设计?**
|
||||
|
||||
| Service | 原因 |
|
||||
|---------|------|
|
||||
| 多个 ClusterIP | 不同场景需要不同的 Endpoints 配置 |
|
||||
| 包含 additional | 网络隔离/多租户支持 |
|
||||
| 包含 headless | StatefulSet 需要 Pod 间直接通信 |
|
||||
|
||||
**类比:**
|
||||
```
|
||||
Redis Operator 就像一个完整的产品
|
||||
└─ 提供多种方式使用 Redis
|
||||
├─ 简单: user-redis
|
||||
├─ 高级: user-redis-master/replica
|
||||
├─ HA: user-redis-sentinel-sentinel
|
||||
└─ 内部: headless services
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 🌐 网络流量路由
|
||||
|
||||
### 查询 Service 背后的 Pod
|
||||
|
||||
**查看 Service Endpoints:**
|
||||
|
||||
```bash
|
||||
# 查看 user-redis 关联的 Pod
|
||||
$ kubectl get endpoints user-redis -n juwan
|
||||
NAME ENDPOINTS
|
||||
user-redis 10.244.0.10:6379,10.244.1.20:6379,10.244.2.30:6379
|
||||
|
||||
# 查看 user-redis-master 关联的 Pod
|
||||
$ kubectl get endpoints user-redis-master -n juwan
|
||||
NAME ENDPOINTS
|
||||
user-redis-master 10.244.0.10:6379
|
||||
|
||||
# 查看 user-redis-replica 关联的 Pod
|
||||
$ kubectl get endpoints user-redis-replica -n juwan
|
||||
NAME ENDPOINTS
|
||||
user-redis-replica 10.244.1.20:6379,10.244.2.30:6379
|
||||
```
|
||||
|
||||
**Pod 和 Service 的映射关系:**
|
||||
|
||||
```
|
||||
Pods (实际运行的实例) Services (虚拟 IP)
|
||||
└─ redis-0 (主) └─ user-redis (所有)
|
||||
├─ 10.244.0.10 ├─ 10.103.91.84
|
||||
└─ :6379
|
||||
└─ user-redis-master (仅主)
|
||||
└─ redis-1 (从) ├─ 10.97.120.76
|
||||
├─ 10.244.1.20
|
||||
└─ :6379
|
||||
└─ user-redis-replica (仅从)
|
||||
└─ redis-2 (从) ├─ 10.100.213.103
|
||||
├─ 10.244.2.30
|
||||
└─ :6379
|
||||
```
|
||||
|
||||
**DNS 解析过程:**
|
||||
|
||||
```
|
||||
应用 DNS 查询
|
||||
└─ user-redis-master.juwan.svc.cluster.local
|
||||
↓
|
||||
CoreDNS (Kubernetes DNS)
|
||||
└─ 查询并返回 Service IP:
|
||||
├─ 10.97.120.76 (user-redis-master)
|
||||
├─ 或 10.100.213.103 (user-redis-replica)
|
||||
├─ 或 10.103.91.84 (user-redis)
|
||||
└─ 或 Sentinel 的 IP
|
||||
```
|
||||
|
||||
**Sentinel 模式的特殊之处:**
|
||||
|
||||
```
|
||||
应用查询 Sentinel
|
||||
└─ user-redis-sentinel-sentinel.juwan.svc.cluster.local:26379
|
||||
↓
|
||||
Sentinel Service (负载均衡到 3 个 Sentinel 节点)
|
||||
↓
|
||||
Sentinel 节点 (任选一个)
|
||||
↓
|
||||
应用询问: "mymaster 主节点 IP 是什么?"
|
||||
↓
|
||||
Sentinel 回答: "10.244.0.10:6379"
|
||||
↓
|
||||
应用直接连接 Redis Master: 10.244.0.10:6379
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 🔧 故障排查
|
||||
|
||||
### Issue 1: Why does the application fail to connect?

**Checks:**

```bash
# 1. Verify the Service exists
kubectl get svc user-redis-sentinel-sentinel -n juwan

# 2. Verify the Endpoints are not empty
kubectl get endpoints user-redis-sentinel-sentinel -n juwan

# 3. Test DNS resolution
kubectl run -it --rm nettest --image=busybox --restart=Never -n juwan -- \
  nslookup user-redis-sentinel-sentinel.juwan.svc.cluster.local

# 4. Test connectivity
kubectl run -it --rm nettest --image=busybox --restart=Never -n juwan -- \
  nc -zv user-redis-sentinel-sentinel.juwan.svc.cluster.local 26379

# 5. Check the application logs
kubectl logs -f user-rpc-xxx -n juwan
```

### Issue 2: Why is a Service missing?

```bash
# Make sure you are in the right namespace
kubectl get svc -n juwan | grep redis

# If the Redis Operator is unhealthy, the Services may never be created
# Check the Operator logs
kubectl logs -n default deployment/redis-operator
```

### Issue 3: Does the Service IP keep changing?

```bash
# A Service IP is stable (unless the Service is deleted and recreated)
# If it changes often, the Service is being recreated repeatedly

# Check the Service events
kubectl describe svc user-redis-sentinel-sentinel -n juwan

# Check whether the Operator reports problems
kubectl describe redissentinel user-redis-sentinel -n juwan
```
|
||||
|
||||
---
|
||||
|
||||
## 📚 总结
|
||||
|
||||
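### At a glance

| Service | Purpose | Used by the application? |
|---------|------|-----------|
| **user-redis-sentinel-sentinel** | ⭐ Sentinel high availability | ✅ **Recommended for production** |
| user-redis-master | Direct connection to the master | ⚠️ Requires read/write splitting |
| user-redis-replica | Direct connection to the replicas | ⚠️ Requires read/write splitting |
| user-redis | General entry point | ❌ Not recommended (no HA) |
| headless services | Internal communication | ❌ Not used by applications |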
|
||||
|
||||
### 为什么有这么多 Service?
|
||||
|
||||
**答案:** 为了提供灵活的使用方式
|
||||
|
||||
```
|
||||
Redis Operator 的设计理念:
|
||||
┌─────────────────────────────────────────┐
|
||||
│ 提供完整的 Redis 高可用解决方案 │
|
||||
│ │
|
||||
│ ├─ 简单使用场景 │
|
||||
│ │ └─ user-redis (所有节点) │
|
||||
│ │ │
|
||||
│ ├─ 高级使用场景 │
|
||||
│ │ ├─ user-redis-master (写) │
|
||||
│ │ └─ user-redis-replica (读) │
|
||||
│ │ │
|
||||
│ ├─ 生产场景 (推荐) │
|
||||
│ │ └─ user-redis-sentinel-sentinel │
|
||||
│ │ │
|
||||
│ └─ 内部通信 │
|
||||
│ └─ headless services │
|
||||
└─────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
### Which one should the application use?

**In one sentence: use `user-redis-sentinel-sentinel:26379` in Sentinel mode**

```yaml
# Recommended configuration
Redis:
  Host: user-redis-sentinel-sentinel.juwan.svc.cluster.local:26379
  Type: sentinel
  MasterName: mymaster
```

**Why?**
- ✅ Automatic failover
- ✅ No application restart
- ✅ No manual intervention
- ✅ Industry standard
|
||||
|
||||
---
|
||||
|
||||
**文档版本:** 1.0
|
||||
**创建日期:** 2026年2月22日
|
||||
**维护者:** DevOps Team
|
||||
@@ -0,0 +1,779 @@
|
||||
# Redis Sentinel Deployment: Diagnosis and Fix Report

**Date of issue:** February 22, 2026
**Namespace:** juwan
**Resources involved:** user-rpc deployment, RedisSentinel
|
||||
|
||||
---
|
||||
|
||||
## 📋 目录
|
||||
|
||||
1. [问题背景](#问题背景)
|
||||
2. [问题现象](#问题现象)
|
||||
3. [诊断过程](#诊断过程)
|
||||
4. [根因分析](#根因分析)
|
||||
5. [解决方案](#解决方案)
|
||||
6. [修复步骤](#修复步骤)
|
||||
7. [验证结果](#验证结果)
|
||||
8. [后续建议](#后续建议)
|
||||
|
||||
---
|
||||
|
||||
## 🎯 问题背景
|
||||
|
||||
### Deployment goal
Deploy a simple three-node Redis Sentinel group as a cache for the user-rpc service, to be extended into a sharded cluster later if needed.

### Initial configuration
`deploy/k8s/service/user/user-rpc.yaml` defines:
- user-rpc Deployment (3 replicas)
- user-rpc Service
- HPAs (CPU and memory)
- **RedisSentinel resource**
- PostgreSQL Cluster
|
||||
|
||||
---

## 🔴 Symptoms

### Command executed
```powershell
kubectl apply -f .\deploy\k8s\service\user\user-rpc.yaml
```

### Output
```
deployment.apps/user-rpc configured
service/user-rpc-svc unchanged
horizontalpodautoscaler.autoscaling/user-rpc-hpa-c unchanged
horizontalpodautoscaler.autoscaling/user-rpc-hpa-m unchanged
redissentinel.redis.redis.opstreelabs.in/user-redis unchanged
cluster.postgresql.cnpg.io/user-db unchanged
```

### Observed anomaly
List the resources in the namespace:
```bash
kubectl get all -n juwan
```

**Findings:**
- ✅ user-api pods are running normally
- ✅ user-rpc pods are running normally
- ✅ PostgreSQL clusters are running normally
- ❌ **There are no Redis-related Pods at all**
- ❌ **There is no Redis Service**

---

## 🔍 Diagnosis

### Step 1: Check the RedisSentinel resource status

**Purpose:** Confirm that the RedisSentinel resource was created.

**Command:**
```bash
kubectl get redissentinel user-redis -n juwan
```

**Output:**
```
NAME         AGE
user-redis   9m56s
```

**Analysis:**
- ✅ The RedisSentinel resource exists
- ❌ No Pods were created for it
- **Conclusion:** The Operator did not create the workloads described by the RedisSentinel spec

---

### Step 2: Inspect the RedisSentinel details

**Purpose:** Check the resource's detailed configuration and events.

**Command:**
```bash
kubectl describe redissentinel user-redis -n juwan
```

**Key output:**
```yaml
API Version:  redis.redis.opstreelabs.in/v1beta2
Kind:         RedisSentinel
Metadata:
  Creation Timestamp:  2026-02-22T12:41:47Z
  Finalizers:
    redisSentinelFinalizer
  Generation:  2
Spec:
  Cluster Size:  3
  Redis Sentinel Config:
    Redis Replication Name:  user-redis  # ⚠️ key setting
    Master Group Name:       myMaster
    Redis Port:              6379
    Quorum:                  2
Events:  <none>  # ⚠️ no events at all
```

**Analysis:**
- The spec references `redisReplicationName: user-redis`
- There are no Events, which suggests the Operator may be waiting for some precondition
- **The dependent resources need to be checked next**

---

### Step 3: Check the Secret

**Purpose:** Confirm that the password Secret required by RedisSentinel exists.

**Command:**
```bash
kubectl get secret user-redis -n juwan
```

**Output:**
```
NAME         TYPE     DATA   AGE
user-redis   Opaque   1      5h48m
```

**Analysis:**
- ✅ The Secret exists, which rules out an authentication configuration problem

---

### Step 4: Look for Redis Pods

**Purpose:** Check whether Redis Pods exist under a different naming scheme.

**Command:**
```powershell
kubectl get pods -n juwan | Select-String redis
```

**Output:**
```
(empty output: no Redis Pods found)
```

**Analysis:**
- ❌ Confirmed that no Redis Pod was created

---

### Step 5: Check StatefulSets

**Purpose:** Redis is usually deployed as a StatefulSet, so check whether one exists.

**Command:**
```bash
kubectl get statefulset -n juwan
```

**Output:**
```
No resources found in juwan namespace.
```

**Analysis:**
- ❌ No StatefulSet was created
- **Conclusion:** The Operator did not perform any creation at all

---

### Step 6: Check the Redis Operator status

**Purpose:** Confirm that the Redis Operator itself is running.

**Command:**
```powershell
kubectl get pods -A | Select-String redis-operator
```

**Output:**
```
default        redis-operator-7dcf4468c9-gfbcm   1/1   Running             0   4h50m
ot-operators   redis-operator-69bb645f5b-6kxkv   0/1   ErrImageNeverPull   0   4h48m
```

**Analysis:**
- ✅ The Operator in the default namespace is running
- ⚠️ The Operator in the ot-operators namespace cannot start because its image is unavailable (`ErrImageNeverPull`); this does not affect the scenario here

---

### Step 7: Read the Operator logs

**Purpose:** Look for clues in the Operator logs.

**Command:**
```bash
kubectl logs redis-operator-7dcf4468c9-gfbcm -n default --tail=50
```

**Key output:**
```json
{"level":"info","ts":"2026-02-22T08:01:56Z","msg":"Starting Controller","controller":"redissentinel"}
{"level":"info","ts":"2026-02-22T08:01:56Z","msg":"Starting workers","controller":"redissentinel","worker count":1}
```

**Analysis:**
- ✅ The RedisSentinel controller has started
- ✅ There are no error entries
- ❌ There are also no entries about processing the user-redis resource
- **Hypothesis:** The Operator is waiting for a dependent resource

---

### Step 8: Check the RedisReplication resource (key finding)

**Purpose:** The RedisSentinel spec sets `redisReplicationName: user-redis`, so check whether the matching RedisReplication exists.

**Command:**
```bash
kubectl get redisreplication -n juwan
```

**Output:**
```
No resources found in juwan namespace.
```

**Analysis:**
- ❌ **The RedisReplication resource does not exist!**
- 🔎 **This is the root cause of the problem**

---

## 💡 Root Cause Analysis

### Root cause

**RedisSentinel depends on RedisReplication, but the configuration created only the RedisSentinel and never created a RedisReplication.**

### How the Redis Operator models this

In the OpsTree Redis Operator, the two resources relate as follows:

```
┌─────────────────────────────────────────┐
│  RedisSentinel (sentinel layer)         │
│  - 3 Sentinel nodes                     │
│  - Monitoring and automatic failover    │
│  - References: redisReplicationName     │
└──────────────┬──────────────────────────┘
               │ monitors
               ↓
┌─────────────────────────────────────────┐
│  RedisReplication (data layer)          │
│  - 1 master + N replicas                │
│  - Serves the actual cache traffic      │
│  - Master-replica replication           │
└─────────────────────────────────────────┘
```

### What was wrong with the configuration

The original configuration created a RedisSentinel directly, but:

1. **Nothing to monitor:** Sentinel needs a RedisReplication to watch
2. **Dangling reference:** `redisReplicationName: user-redis` points to a RedisReplication that does not exist
3. **Operator behavior:** Because the referenced RedisReplication is missing, the Operator does not create the Sentinel Pods

### Why was there no error message?

- CRD validation checks only syntax and field types
- Cross-resource references are checked by the Operator at runtime
- In this case the Operator waited for the dependency instead of reporting an error

---

## ✅ Solution

### Correct deployment order

1. **Create the RedisReplication first** (this builds the Redis master-replica set)
2. **Then create the RedisSentinel** (this monitors the replication set above)

### Configuration structure

```yaml
# Step 1: create the Redis master-replica set (data layer)
apiVersion: redis.redis.opstreelabs.in/v1beta2
kind: RedisReplication
metadata:
  name: user-redis # the Sentinel references this name
  namespace: juwan
spec:
  clusterSize: 3 # 1 master + 2 replicas
  kubernetesConfig:
    image: quay.io/opstree/redis:v7.0.12
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 512Mi
    redisSecret:
      name: user-redis
      key: password
  storage:
    volumeClaimTemplate:
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi # 1 GiB of storage per Redis node

---
# Step 2: create the Sentinel (monitoring layer)
apiVersion: redis.redis.opstreelabs.in/v1beta2
kind: RedisSentinel
metadata:
  name: user-redis-sentinel # a different name, to avoid confusion
  namespace: juwan
spec:
  clusterSize: 3 # 3 Sentinel nodes (an odd number is recommended)
  kubernetesConfig:
    image: quay.io/opstree/redis-sentinel:v7.0.12 # dedicated Sentinel image
  redisSentinelConfig:
    redisReplicationName: user-redis # references the RedisReplication above
    masterGroupName: mymaster
    quorum: "2" # 2 Sentinels must agree before a failover starts
```

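
The `quorum` value interacts with the number of Sentinels in a way that is easy to get wrong: in Redis Sentinel, the quorum decides when the master is flagged as down, while the failover itself must additionally be authorized by a majority of all Sentinels. The sketch below encodes that rule; the function names are hypothetical and the code only models the arithmetic, not Sentinel itself.

```go
package main

import "fmt"

// majority is the number of Sentinels that must authorize a failover.
func majority(total int) int {
	return total/2 + 1
}

// failoverPossible reports whether a failover can go ahead when
// `agreeing` reachable Sentinels consider the master down.
// The quorum flags the master as objectively down; a majority of all
// Sentinels is still required to authorize the failover.
func failoverPossible(total, quorum, agreeing int) bool {
	return agreeing >= quorum && agreeing >= majority(total)
}

func main() {
	// The configuration above: 3 Sentinels, quorum 2.
	fmt.Println(failoverPossible(3, 2, 1)) // false: only one Sentinel left
	fmt.Println(failoverPossible(3, 2, 2)) // true: quorum and majority both met

	// With 5 Sentinels, a quorum of 2 is reached before the majority of 3.
	fmt.Println(failoverPossible(5, 2, 2)) // false: master flagged, failover not authorized
}
```

With three Sentinels and a quorum of 2 the two thresholds coincide, which is one reason this combination is a common starting point.
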
---

## 🔧 Fix Steps

### Step 1: Delete the misconfigured RedisSentinel resource

**Command:**
```bash
kubectl delete redissentinel user-redis -n juwan
```

**Output:**
```
redissentinel.redis.redis.opstreelabs.in "user-redis" deleted
```

**Note:** This removes the custom resource that had been created but never produced any Pods.

---

### Step 2: Update the configuration file

Edit `deploy/k8s/service/user/user-rpc.yaml` and replace the standalone RedisSentinel with:
1. A RedisReplication (data layer)
2. A RedisSentinel (monitoring layer)

**Changes:**
- Add the `RedisReplication` resource definition
- Add the `storage.volumeClaimTemplate` configuration
- Change the RedisSentinel `metadata.name` to `user-redis-sentinel`
- Use the correct Sentinel image: `quay.io/opstree/redis-sentinel:v7.0.12`
- Complete the Sentinel configuration parameters

---

### Step 3: Apply the updated configuration

**Command:**
```powershell
kubectl apply -f .\deploy\k8s\service\user\user-rpc.yaml
```

**Output:**
```
deployment.apps/user-rpc configured
service/user-rpc-svc unchanged
horizontalpodautoscaler.autoscaling/user-rpc-hpa-c unchanged
horizontalpodautoscaler.autoscaling/user-rpc-hpa-m unchanged
redisreplication.redis.redis.opstreelabs.in/user-redis created ✅
redissentinel.redis.redis.opstreelabs.in/user-redis-sentinel created ✅
cluster.postgresql.cnpg.io/user-db unchanged
```

**Analysis:**
- ✅ The RedisReplication was created
- ✅ The RedisSentinel was created
- 🎯 Both resources are reported as `created`, as expected

---

## ✅ Verification

### Verification 1: Check that the Pods were created (after waiting 30 seconds)

**Command:**
```powershell
kubectl get statefulset,pods -n juwan | Select-String -Pattern "user-redis|NAME"
```

**Output:**
```
NAME                                            READY   AGE
statefulset.apps/user-redis                     3/3     81s ✅
statefulset.apps/user-redis-sentinel-sentinel   3/3     24s ✅

NAME                                 READY   STATUS    RESTARTS   AGE
pod/user-redis-0                     2/2     Running   0          80s ✅
pod/user-redis-1                     2/2     Running   0          52s ✅
pod/user-redis-2                     2/2     Running   0          47s ✅
pod/user-redis-sentinel-sentinel-0   1/1     Running   0          24s ✅
pod/user-redis-sentinel-sentinel-1   1/1     Running   0          8s ✅
pod/user-redis-sentinel-sentinel-2   1/1     Running   0          5s ✅
```

**Analysis:**
- ✅ **RedisReplication** created 3 Pods (user-redis-0/1/2)
  - Each Pod has 2 containers (2/2): Redis + Exporter
  - All Pods are in the Running state
- ✅ **RedisSentinel** created 3 Pods (user-redis-sentinel-sentinel-0/1/2)
  - Each Pod has 1 container (1/1): Sentinel
  - All Pods are in the Running state
- ✅ 2 StatefulSets were created, both with READY 3/3

---

### Verification 2: Check the Services

**Command:**
```powershell
kubectl get svc -n juwan | Select-String -Pattern "redis|NAME"
```

**Output:**
```
NAME                                      TYPE        CLUSTER-IP       PORT(S)             AGE
user-redis                                ClusterIP   10.103.91.84     6379/TCP,9121/TCP   95s ✅
user-redis-additional                     ClusterIP   10.107.228.48    6379/TCP            95s
user-redis-headless                       ClusterIP   None             6379/TCP            95s ✅
user-redis-master                         ClusterIP   10.97.120.76     6379/TCP            95s ✅
user-redis-replica                        ClusterIP   10.100.213.103   6379/TCP            95s ✅
user-redis-sentinel-sentinel              ClusterIP   10.105.28.231    26379/TCP           40s ✅
user-redis-sentinel-sentinel-additional   ClusterIP   10.97.111.42     26379/TCP           39s
user-redis-sentinel-sentinel-headless     ClusterIP   None             26379/TCP           41s
```

**What each Service is for:**

#### Redis data-layer Services (port 6379)
- **user-redis-master**: the master node, for writes
- **user-redis-replica**: the replica nodes, for reads
- **user-redis**: general-purpose entry point (load-balanced across all nodes)
- **user-redis-headless**: headless Service for communication between the StatefulSet Pods
- **user-redis-additional**: an additional entry point

#### Sentinel monitoring-layer Services (port 26379)
- **user-redis-sentinel-sentinel**: Sentinel entry point
- **user-redis-sentinel-sentinel-headless**: communication between Sentinel nodes
- **user-redis-sentinel-sentinel-additional**: an additional Sentinel entry point

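
When wiring an application to these Services, it helps to derive the address from the intended purpose rather than hard-coding host names. The sketch below maps the three application-facing purposes listed above to their in-cluster DNS names; the `endpoints` table and `address` function are hypothetical, and the `svc.cluster.local` suffix assumes the default cluster domain.

```go
package main

import "fmt"

// endpoint pairs a Service name with its port, as listed in the
// kubectl output above.
type endpoint struct {
	Service string
	Port    int
}

// endpoints maps an access purpose to the Service that serves it.
var endpoints = map[string]endpoint{
	"sentinel": {Service: "user-redis-sentinel-sentinel", Port: 26379},
	"write":    {Service: "user-redis-master", Port: 6379},
	"read":     {Service: "user-redis-replica", Port: 6379},
}

// address builds the fully qualified in-cluster address for a purpose.
// It assumes the default cluster domain "cluster.local".
func address(purpose, namespace string) (string, error) {
	ep, ok := endpoints[purpose]
	if !ok {
		return "", fmt.Errorf("unknown purpose %q", purpose)
	}
	return fmt.Sprintf("%s.%s.svc.cluster.local:%d", ep.Service, namespace, ep.Port), nil
}

func main() {
	for _, p := range []string{"sentinel", "write", "read"} {
		addr, err := address(p, "juwan")
		fmt.Println(p, addr, err)
	}
}
```

Within the same namespace the short form (for example `user-redis-master:6379`) resolves as well; the fully qualified form is only needed across namespaces.
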
---

### Verification 3: Check the overall cluster state

**Command:**
```bash
kubectl get all -n juwan
```

**Final state summary:**

| Resource type | Name | Count | Status |
|---------|------|------|------|
| **Deployment** | user-api | 3/3 | ✅ Running |
| **Deployment** | user-rpc | 3/3 | ✅ Running |
| **StatefulSet** | cluster-example (PostgreSQL) | 3/3 | ✅ Running |
| **StatefulSet** | user-db (PostgreSQL) | 3/3 | ✅ Running |
| **StatefulSet** | user-redis (Redis data) | 3/3 | ✅ Running |
| **StatefulSet** | user-redis-sentinel-sentinel | 3/3 | ✅ Running |

**Total Pods:** 18 (all Running)
**Total Services:** 13
**Total HPAs:** 6

---

## 📊 Architecture Diagram

### Redis architecture after deployment

```
┌────────────────────────────────────────────────────────────┐
│                Application layer (user-rpc)                │
│                                                            │
│        [Redis connection config still to be added]         │
└──────────┬─────────────────────────────┬───────────────────┘
           │                             │
           │ writes                      │ reads
           ↓                             ↓
    ┌─────────────┐               ┌─────────────┐
    │ user-redis- │               │ user-redis- │
    │   master    │               │   replica   │
    │   Service   │               │   Service   │
    └─────────────┘               └─────────────┘
           │                             │
           └──────────────┬──────────────┘
                          ↓
     ┌──────────────────────────────────────────┐
     │      RedisReplication (data layer)       │
     │                                          │
     │ ┌──────────┐  ┌──────────┐  ┌──────────┐ │
     │ │  Master  │→ │ Replica  │→ │ Replica  │ │
     │ │ redis-0  │  │ redis-1  │  │ redis-2  │ │
     │ └──────────┘  └──────────┘  └──────────┘ │
     └──────────────────────────────────────────┘
                          ↑
                          │ monitoring & failover
                          │
     ┌──────────────────────────────────────────┐
     │     RedisSentinel (monitoring layer)     │
     │                                          │
     │ ┌──────────┐  ┌──────────┐  ┌──────────┐ │
     │ │Sentinel-0│  │Sentinel-1│  │Sentinel-2│ │
     │ └──────────┘  └──────────┘  └──────────┘ │
     │                                          │
     │     Quorum: 2/3 (majority decision)      │
     └──────────────────────────────────────────┘
```

---

## 📝 Follow-up Recommendations

### 1. Integrate Redis into the application

The user-rpc service does not have a Redis connection configured yet. The following changes are needed.

#### Edit the configuration file `app/users/rpc/etc/pb.yaml`
```yaml
Name: pb.rpc
ListenOn: 0.0.0.0:8080

# Add the Redis configuration (Sentinel mode).
# Written as a single mapping so that it matches the single
# redis.RedisConf field declared in config.go.
# NOTE: check that the go-zero version in use accepts "Type: sentinel"
# and "MasterName"; as far as we know, redis.RedisConf documents only
# the "node" and "cluster" types.
Redis:
  Host: user-redis-sentinel-sentinel:26379
  Type: sentinel
  MasterName: mymaster
  Pass: ${REDIS_PASSWORD}

# Or connect to the master and replicas directly
# (two entries would need a list-typed field in Config)
# Redis:
#   - Host: user-redis-master:6379 # writes
#     Type: node
#     Pass: ${REDIS_PASSWORD}
#   - Host: user-redis-replica:6379 # reads
#     Type: node
#     Pass: ${REDIS_PASSWORD}

Etcd:
  Hosts:
    - etcd-service:2379 # replace with the actual Etcd address
  Key: pb.rpc
```

#### Update the Config struct in `app/users/rpc/internal/config/config.go`
```go
package config

import (
	"github.com/zeromicro/go-zero/core/stores/redis"
	"github.com/zeromicro/go-zero/zrpc"
)

type Config struct {
	zrpc.RpcServerConf
	Redis redis.RedisConf // Redis configuration
}
```

#### Initialize the Redis client in `app/users/rpc/internal/svc/serviceContext.go`
```go
package svc

import (
	"github.com/zeromicro/go-zero/core/stores/redis"
	"juwan-backend/app/users/rpc/internal/config"
)

type ServiceContext struct {
	Config config.Config
	Redis  *redis.Redis // Redis client
}

func NewServiceContext(c config.Config) *ServiceContext {
	return &ServiceContext{
		Config: c,
		Redis:  redis.MustNewRedis(c.Redis), // create the Redis client at startup
	}
}
```

#### Update the Deployment environment variables
```yaml
# deploy/k8s/service/user/user-rpc.yaml
env:
  - name: DB_URI
    valueFrom:
      secretKeyRef:
        name: user-db-app
        key: uri
  - name: REDIS_PASSWORD # Redis password
    valueFrom:
      secretKeyRef:
        name: user-redis
        key: password
```

---

### 2. Redis performance monitoring

The Redis Exporter is already enabled (port 9121), so Prometheus monitoring can be configured:

```yaml
apiVersion: monitoring.coreos.com/v1 # ServiceMonitor is a Prometheus Operator CRD, not a core v1 resource
kind: ServiceMonitor
metadata:
  name: user-redis-metrics
  namespace: juwan
spec:
  selector:
    matchLabels:
      app: user-redis
  endpoints:
    - port: redis-exporter
      interval: 30s
```

**Metrics to watch:**
- redis_up: instance status
- redis_connected_clients: number of connected clients
- redis_memory_used_bytes: memory usage
- redis_commands_processed_total: number of commands processed
- redis_master_repl_offset: replication offset

---

### 3. High-availability testing

#### Test master failover
```bash
# 1. Find the current master
kubectl exec -it user-redis-sentinel-sentinel-0 -n juwan -- redis-cli -p 26379 SENTINEL get-master-addr-by-name mymaster

# 2. Simulate a master failure (assumes step 1 showed user-redis-0 as the master)
kubectl delete pod user-redis-0 -n juwan

# 3. Watch Sentinel perform the failover
kubectl logs -f user-redis-sentinel-sentinel-0 -n juwan

# 4. Confirm the new master
kubectl exec -it user-redis-sentinel-sentinel-0 -n juwan -- redis-cli -p 26379 SENTINEL get-master-addr-by-name mymaster
```

#### Expected results
- Sentinel detects that the master is down (about 5 seconds is expected here; the actual delay is governed by Sentinel's `down-after-milliseconds` setting)
- 2 of the 3 Sentinel nodes reach agreement (quorum=2)
- One replica is automatically promoted to master
- Clients reconnect to the new master automatically

---

### 4. Scale out to a sharded cluster (future)

When the cached data grows to the point where horizontal scaling is needed, the setup can be migrated to a RedisCluster:

```yaml
apiVersion: redis.redis.opstreelabs.in/v1beta2
kind: RedisCluster
metadata:
  name: user-redis-cluster
  namespace: juwan
spec:
  clusterSize: 6 # 3 leaders + 3 followers
  kubernetesConfig:
    image: quay.io/opstree/redis:v7.0.12
  redisLeader:
    replicas: 3
  redisFollower:
    replicas: 3
  storage:
    volumeClaimTemplate:
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 5Gi
```

**Migration steps:**
1. Deploy the new RedisCluster
2. Migrate the data with `redis-cli --cluster import`
3. Update the application configuration to point to the new cluster
4. Decommission the old Sentinel setup

---

### 5. Backup strategy

The Redis Operator does not provide automatic backups, so a scheduled job is recommended:

```yaml
# CronJob that periodically triggers BGSAVE
apiVersion: batch/v1
kind: CronJob
metadata:
  name: redis-backup
  namespace: juwan
spec:
  schedule: "0 2 * * *" # every day at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: redis:7.0.12
              env:
                - name: REDIS_PASSWORD # required by the command below
                  valueFrom:
                    secretKeyRef:
                      name: user-redis
                      key: password
              command:
                - /bin/sh
                - -c
                - |
                  redis-cli -h user-redis-master -a "$REDIS_PASSWORD" BGSAVE
                  # BGSAVE writes /data/dump.rdb on the Redis server Pod, not in
                  # this container; upload that file to object storage from there
          restartPolicy: OnFailure
```

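---

## 📚 Summary

### Key lessons

1. **Understand resource dependencies:** RedisSentinel depends on RedisReplication, so the deployment order matters
2. **Name resources clearly:** use distinct names for resources at different layers (for example user-redis and user-redis-sentinel)
3. **Diagnostic approach:**
   - Work from the symptom (missing Pods) → resource status (the custom resource exists) → Operator logs → dependency check
   - Eliminate one layer at a time until the missing RedisReplication is found
4. **Verify completely:** check not only the Pods but also the Services, StatefulSets, and all other related resources

### What this document is for

This document can serve as:
- ✅ A record for sharing knowledge within the team
- ✅ A quick troubleshooting guide for similar problems
- ✅ Redis Operator learning material for new team members
- ✅ A postmortem and lessons-learned summary
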
---

**Last updated:** February 22, 2026
**Document status:** ✅ Problem resolved; the Redis deployment is running normally
**Next action:** Configure the application to connect to Redis