add: envoy redis
# Kubernetes Deployment Troubleshooting and Resolution Log

**Date**: February 23, 2026
**Issue**: user-rpc and Redis failed to deploy
**Status**: Diagnosed; fix in progress

---

## 📋 Problem Description

After running `kubectl apply -f test.yaml`, every resource was created successfully, but the application pods never started running:
```
kubectl apply -f ..\test.yaml
✓ deployment.apps/user-rpc created
✓ service/user-rpc-svc created
✓ horizontalpodautoscaler.autoscaling/user-rpc-hpa-c created
✓ horizontalpodautoscaler.autoscaling/user-rpc-hpa-m created
✓ redisreplication.redis.redis.opstreelabs.in/user-redis created
✓ redissentinel.redis.redis.opstreelabs.in/user-redis-sentinel created
✓ cluster.postgresql.cnpg.io/user-db created
```

`kubectl get all` then showed:
- ❌ **No user-rpc pods were created** (Deployment at 0/3 ready replicas)
- ❌ **No Redis pods were created** (the RedisReplication resource exists but has no pods)
- ✅ user-db pods are running normally (3/3)

---

## 🔍 Troubleshooting

### Step 1: Check the Deployment status

```bash
kubectl describe deployment user-rpc
```

**Findings**:
```
Conditions:
  Type             Status  Reason
  ----             ------  ------
  Progressing      True    NewReplicaSetCreated
  Available        False   MinimumReplicasUnavailable
  ReplicaFailure   True    FailedCreate
```

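The same conditions can be read in a compact, script-friendly form. This is a minimal sketch, assuming the Deployment lives in the `default` namespace; the function name `deployment_conditions` is hypothetical and not part of the original notes.

```bash
# Print one "Type=Status (Reason)" line per Deployment condition.
# Arguments: deployment name, namespace (defaults to "default").
deployment_conditions() {
  kubectl get deployment "$1" -n "${2:-default}" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status} ({.reason}){"\n"}{end}'
}
```

Calling `deployment_conditions user-rpc` should list the three conditions shown above; a `ReplicaFailure=True` line means the ReplicaSet controller could not create pods, which is why the next step looks at the ReplicaSet.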
### Step 2: Inspect the ReplicaSet

```bash
kubectl describe replicaset user-rpc-6bf77fbcd9
```

**Key error**:
```
Events:
  Type     Reason        Age    From                   Message
  ----     ------        ----   ----                   -------
  Warning  FailedCreate  3m53s  replicaset-controller  Error creating:
    pods "user-rpc-6bf77fbcd9-" is forbidden: error looking up service
    account default/find-endpoints: serviceaccount "find-endpoints" not found
```

**Problem #1 diagnosed**: ❌ **The ServiceAccount "find-endpoints" is missing**

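The same error can be surfaced without knowing the ReplicaSet hash by filtering events on their reason. This is a minimal sketch, assuming the resources are in the `default` namespace; the function name `failed_create_events` is hypothetical.

```bash
# List FailedCreate events in a namespace, oldest first.
# Argument: namespace (defaults to "default").
failed_create_events() {
  kubectl get events -n "${1:-default}" \
    --field-selector reason=FailedCreate \
    --sort-by=.lastTimestamp
}
```

Running `failed_create_events default` should show the `replicaset-controller` warning quoted above, if the events have not yet expired.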
### Step 3: List the existing ServiceAccounts

```bash
kubectl get serviceaccount
```

**Result**:
```
NAME              AGE
cluster-example   4d10h
default           13d
redis-operator    9h
user-db           4m9s
```

This confirms that `find-endpoints` does not exist.

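When the list is long, checking for one name directly is less error-prone than scanning it by eye. This is a minimal sketch; the function name `service_account_status` is hypothetical, and it treats any `kubectl get` failure (including permission errors) as "missing".

```bash
# Print "present" or "missing" for a ServiceAccount.
# Arguments: ServiceAccount name, namespace (defaults to "default").
service_account_status() {
  if kubectl get serviceaccount "$1" -n "${2:-default}" >/dev/null 2>&1; then
    echo "present"
  else
    echo "missing"
  fi
}
```

At this point in the investigation, `service_account_status find-endpoints` would be expected to print `missing`.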
### Step 4: Check the Secrets

```bash
kubectl get secrets
```

**Result**: the expected Secrets all exist, including:
- ✅ user-db-app
- ✅ user-redis
- ✅ user-db-ca, user-db-replication, user-db-server

### Step 5: Check the Redis deployment

```bash
kubectl get redisreplication
kubectl get pods | grep redis
```

**Findings**:
- ✅ The RedisReplication resource exists
- ❌ **No Redis pods were created at all**

**Problem #3 diagnosed** (numbered to match the root cause analysis): ❌ **The Redis Operator is not acting on the RedisReplication resource**

---

## 🔧 First Fix Attempt

### Create the missing ServiceAccount

```bash
kubectl create serviceaccount find-endpoints
```

**Result**: ✅ ServiceAccount created

### Restart the Deployment

```bash
kubectl rollout restart deployment user-rpc
```

**Wait 5-10 seconds, then check again**:

```bash
kubectl get pods -o wide
```

**New findings**:

```
NAME                        READY   STATUS             RESTARTS   AGE
user-rpc-66f97fbdcc-ws7rc   0/1     ErrImagePull       0          26s
user-rpc-6bf77fbcd9-njm2z   0/1     ErrImagePull       0          29s
user-rpc-6bf77fbcd9-nwjtw   0/1     ImagePullBackOff   0          29s
user-rpc-6bf77fbcd9-wjrf8   0/1     ErrImagePull       0          29s
```

The listing mixes pods from the original ReplicaSet (`6bf77fbcd9`) and from the one created by the restart (`66f97fbdcc`), because the rollout is still in progress.

✅ **Good news**: pods are now being created, so the ServiceAccount problem is resolved.
❌ **New problem**: the image pull fails.

---

## 🐛 Root Cause Analysis

### Problem #1: Missing ServiceAccount ✅ Resolved

**Root cause**: the Deployment manifest in test.yaml specifies:
```yaml
spec:
  template:
    spec:
      serviceAccountName: find-endpoints  # this ServiceAccount does not exist
```

However, test.yaml never defines the ServiceAccount itself.

**Fix**:
```bash
kubectl create serviceaccount find-endpoints
```

Or add it to test.yaml:
```yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: find-endpoints
  namespace: default
```

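This class of error can be caught before `kubectl apply` by listing every ServiceAccount the manifest references and checking each one. The sketch below is a simple text scan, not a YAML parser: it assumes unquoted `serviceAccountName:` values on their own line, and both function names are hypothetical. For example, `missing_service_accounts test.yaml default` prints each referenced name that does not exist in the namespace.

```bash
# List the distinct ServiceAccount names referenced by a manifest file.
# Argument: path to the manifest.
referenced_service_accounts() {
  sed -n 's/^[[:space:]]*serviceAccountName:[[:space:]]*\([^[:space:]#]*\).*/\1/p' "$1" | sort -u
}

# Print each referenced ServiceAccount that is missing from the namespace.
# Arguments: path to the manifest, namespace (defaults to "default").
missing_service_accounts() {
  for sa in $(referenced_service_accounts "$1"); do
    kubectl get serviceaccount "$sa" -n "${2:-default}" >/dev/null 2>&1 || echo "$sa"
  done
}
```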

---

### Problem #2: Image pull failure ❌ Needs a fix

```bash
kubectl describe pod user-rpc-6bf77fbcd9-njm2z
```

**Detailed error log**:

```
Events:
  Warning  Failed  38s  kubelet  Failed to pull image
    "103.236.53.208:4418/library/user-rpc@sha256:76b27d3eb4d5d44e...":
    Error response from daemon: Get "https://103.236.53.208:4418/v2/":
    context deadline exceeded (Client.Timeout exceeded while awaiting headers)

  Warning  Failed  23s  kubelet  Failed to pull image
    "103.236.53.208:4418/library/user-rpc@sha256:76b27d3eb4d5d44e...":
    http: server gave HTTP response to HTTPS client
```

**Root cause analysis**:

1. **Network connection failure**: `context deadline exceeded` means the request to the image registry timed out.
2. **Protocol mismatch**: `http: server gave HTTP response to HTTPS client`
   - The registry at `103.236.53.208:4418` serves plain HTTP, not HTTPS.
   - The Docker daemon tries to connect over HTTPS, but the server answers in HTTP.

**Possible causes**:
- The registry address is wrong or unreachable
- The registry requires specific network configuration
- The registry server is offline or misconfigured

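The two failure modes above call for different fixes, so it can help to label each pull error before acting on it. This is a minimal sketch that matches only the two messages seen in this log; the function name `classify_pull_error` and its labels are hypothetical. For example, `classify_pull_error "http: server gave HTTP response to HTTPS client"` prints `protocol-mismatch`.

```bash
# Map a kubelet image pull error message to a short diagnosis label.
# Argument: the error message text.
classify_pull_error() {
  case "$1" in
    *"server gave HTTP response to HTTPS client"*)
      echo "protocol-mismatch" ;;   # registry speaks HTTP, runtime expects HTTPS
    *"context deadline exceeded"*|*"Client.Timeout exceeded"*)
      echo "timeout" ;;             # registry did not answer in time
    *)
      echo "unknown" ;;
  esac
}
```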

---

### Problem #3: Redis deployment failure ❌ Needs diagnosis

**Symptoms**:
- The RedisReplication and RedisSentinel custom resources were created successfully
- No corresponding Redis pods were created
- `kubectl get pods | grep redis` returns nothing

**Possible causes**:

1. **The Redis Operator is not working properly**
   - The Operator pod may be unhealthy
   - The Operator is not watching the new RedisReplication resource

2. **CRD or API version mismatch**
   - The API version `v1beta2` used in the manifest may not match the Operator version

3. **Resource limits or permissions**
   - The Operator lacks permission to create pods
   - Cluster resource limits block pod creation

---

## ✅ Fixes Applied

| # | Problem | Fix | Status |
|---|---------|-----|--------|
| 1 | Missing ServiceAccount | `kubectl create serviceaccount find-endpoints` | ✅ Done |
| 2 | Image pull failure | Update the image address or fix registry access | ⏳ Pending |
| 3 | Redis pods not created | Inspect the Operator logs | ⏳ To be diagnosed |

---

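## 🚀 Next Steps

### Priority 1: Fix the user-rpc image pull

**Option A: Use a local or internal image**
```yaml
# Change the image reference in test.yaml
image: localhost:5000/user-rpc:latest  # local private registry
# or
image: user-rpc:latest  # local image (if already imported with docker load)
# Note: a :latest tag defaults to imagePullPolicy: Always, so a preloaded image
# also needs imagePullPolicy: IfNotPresent (or Never) and must exist on every node.
```
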
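**Option B: Fix access to the registry**

An image reference cannot include a scheme such as `http://`, so the protocol mismatch cannot be fixed in test.yaml. If `103.236.53.208:4418` really is the correct registry, it has to be marked as an insecure (plain HTTP) registry in the container runtime on every node. The `Error response from daemon` message suggests the runtime here is Docker, where this is the `insecure-registries` setting in `/etc/docker/daemon.json`, followed by a Docker restart. The image reference itself stays scheme-free:
```yaml
# If 103.236.53.208:4418 really is the correct registry
image: 103.236.53.208:4418/library/user-rpc:latest  # no http:// prefix
```
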
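For nodes that run Docker, the registry can be allowed over plain HTTP through the daemon configuration. The sketch below only prints the JSON fragment; it assumes Docker is the container runtime on every node and that no other settings need to be preserved, and the function name `insecure_registry_config` is hypothetical. Merge the output of `insecure_registry_config 103.236.53.208:4418` into `/etc/docker/daemon.json` on each node, then restart Docker. Nodes using containerd or another runtime need that runtime's own registry configuration instead.

```bash
# Print a daemon.json fragment that marks one registry as plain-HTTP.
# Argument: registry host:port.
insecure_registry_config() {
  cat <<EOF
{
  "insecure-registries": ["$1"]
}
EOF
}
```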
**Verification**:
```bash
# Check connectivity to the image registry
curl -v http://103.236.53.208:4418/v2/
```

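Probing both schemes in one pass helps separate a protocol mismatch from a plain network failure. This is a minimal sketch, assuming `curl` is available on the machine running the check (ideally a cluster node); the function name `probe_registry` and the 5-second timeout are arbitrary choices.

```bash
# Probe the registry's /v2/ endpoint over HTTPS and HTTP.
# Prints the HTTP status code per scheme, or "failed" if curl could not complete the request.
# Argument: registry host:port.
probe_registry() {
  for scheme in https http; do
    if code=$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "${scheme}://$1/v2/"); then
      echo "${scheme}: ${code}"
    else
      echo "${scheme}: failed"
    fi
  done
}
```

With `probe_registry 103.236.53.208:4418`, a `200` or `401` over HTTP combined with a failure over HTTPS would point to the protocol mismatch, while failures on both schemes would point to a network or availability problem.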
### Priority 2: Diagnose the Redis Operator

```bash
# Follow the Operator logs
kubectl logs -l app.kubernetes.io/name=redis-operator -f

# Find the Operator pod
kubectl get pods -A | grep redis-operator

# Show RedisReplication details
kubectl describe redisreplication user-redis

# Check the Operator's permissions (RBAC)
kubectl get role,rolebinding,clusterrole,clusterrolebinding | grep redis
```

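`kubectl logs -l ...` only searches the current namespace, so it returns nothing if the Operator runs elsewhere. The sketch below first locates the Operator pods across all namespaces and then reads their logs with an explicit `-n`. It assumes the label `app.kubernetes.io/name=redis-operator` used above is correct for this installation; both function names are hypothetical.

```bash
# Print "namespace/pod" for every pod carrying the Operator label, in any namespace.
# Argument: label selector (defaults to the label used in this log).
find_operator_pods() {
  kubectl get pods -A -l "${1:-app.kubernetes.io/name=redis-operator}" \
    -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}'
}

# Show the last 50 log lines of each Operator pod found above.
operator_logs() {
  find_operator_pods "${1:-}" | while IFS=/ read -r ns pod; do
    if [ -n "$pod" ]; then
      kubectl logs -n "$ns" "$pod" --tail=50
    fi
  done
}
```

If `find_operator_pods` prints nothing, the Operator is either not running or labeled differently, which would itself explain why no Redis pods appear.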
### Priority 3: Make test.yaml self-contained

Add the missing resource definitions to test.yaml:

```yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: find-endpoints
  namespace: default

---
apiVersion: v1
kind: Secret
metadata:
  name: registry-credentials
  namespace: default
type: kubernetes.io/dockercfg
data:
  .dockercfg: <base64-encoded-credentials>  # only if the private registry requires authentication
```

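A Secret is only used for pulls once a pod or its ServiceAccount references it through `imagePullSecrets`. As an alternative to hand-encoding the Secret in YAML, the sketch below creates it with `kubectl create secret docker-registry` (which produces a `kubernetes.io/dockerconfigjson` Secret) and attaches it to a ServiceAccount. It assumes the registry actually requires authentication, which the log has not established; the function name `attach_pull_secret` is hypothetical. An example call would be `attach_pull_secret 103.236.53.208:4418 <username> <password> find-endpoints`.

```bash
# Create a registry pull secret named "registry-credentials" and attach it to a ServiceAccount.
# Arguments: registry server, username, password, ServiceAccount name,
#            namespace (defaults to "default").
attach_pull_secret() {
  kubectl create secret docker-registry registry-credentials \
    --docker-server="$1" --docker-username="$2" --docker-password="$3" \
    -n "${5:-default}" &&
  kubectl patch serviceaccount "$4" -n "${5:-default}" \
    -p '{"imagePullSecrets":[{"name":"registry-credentials"}]}'
}
```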

---

## 📊 Current Cluster State

### Pod status summary

| Application | Desired replicas | Running | Status |
|-------------|------------------|---------|--------|
| user-db | 3 | 3 | ✅ Healthy |
| user-rpc | 3 | 0 | ❌ Image pull failure |
| Redis | 3 | 0 | ❌ Not created by the Operator |
| Sentinel | 3 | 0 | ❌ Not created by the Operator |

### Services

```
✅ kubernetes (built-in)
✅ user-rpc-svc:9001
✅ user-db-r:5432 (read, any instance)
✅ user-db-ro:5432 (read-only replicas)
✅ user-db-rw:5432 (read-write primary)
```

### HPA configuration

```
✅ user-rpc-hpa-c (CPU target: 80%) - inactive (pods not running)
✅ user-rpc-hpa-m (memory target: 80%) - inactive (pods not running)
```

---

## 📝 Command Cheat Sheet

```bash
# Check the Deployment status
kubectl describe deployment user-rpc

# Show ReplicaSet error events
kubectl describe replicaset user-rpc-6bf77fbcd9

# Show detailed Pod errors
kubectl describe pod user-rpc-6bf77fbcd9-njm2z

# Show Pod logs
kubectl logs user-rpc-6bf77fbcd9-njm2z

# Show all events, sorted by time
kubectl get events --sort-by='.lastTimestamp'

# Show all resources in a given namespace
kubectl get all -n default

# Restart the deployment (forces pods to be recreated)
kubectl rollout restart deployment user-rpc

# Show the Operator logs
kubectl logs -l app.kubernetes.io/name=redis-operator

# Check that the CRDs are registered
kubectl api-resources | grep redis
```

---

## 🎯 Summary

| Problem | Cause | Status |
|---------|-------|--------|
| **Missing ServiceAccount** | Referenced in the manifest but never created | ✅ **Resolved** |
| **Image pull failure** | Registry unreachable or HTTP/HTTPS mismatch | ⏳ **Pending** |
| **Redis not deployed** | Operator is not acting on the custom resources | ⏳ **To be diagnosed** |

**Recommended actions**:
1. Confirm or fix the user-rpc image address
2. Diagnose the Redis Operator
3. Verify that every ServiceAccount and Secret the manifests depend on exists
4. Consider defining all required resources in test.yaml to avoid manual creation