# Kubernetes Deployment Troubleshooting Log

**Date**: February 23, 2026
**Issue**: user-rpc and Redis failed to deploy
**Status**: Diagnosed; fixes in progress

---

## 📋 Problem Description

After running `kubectl apply -f test.yaml`, every resource was created successfully, but the application pods never reached a running state:

```
kubectl apply -f ..\test.yaml
✓ deployment.apps/user-rpc created
✓ service/user-rpc-svc created
✓ horizontalpodautoscaler.autoscaling/user-rpc-hpa-c created
✓ horizontalpodautoscaler.autoscaling/user-rpc-hpa-m created
✓ redisreplication.redis.redis.opstreelabs.in/user-redis created
✓ redissentinel.redis.redis.opstreelabs.in/user-redis-sentinel created
✓ cluster.postgresql.cnpg.io/user-db created
```

Running `kubectl get all` afterwards showed:
- ❌ **No user-rpc pods were created** (Deployment at 0/3 ready replicas)
- ❌ **No Redis pods were created** (the RedisReplication resource exists but has no pods)
- ✅ The user-db pods are running normally (3/3)

---

## 🔍 Troubleshooting Process

### Step 1: Check the Deployment status

```bash
kubectl describe deployment user-rpc
```

**Findings**:
```
Conditions:
  Type             Status  Reason
  ----             ------  ------
  Progressing      True    NewReplicaSetCreated
  Available        False   MinimumReplicasUnavailable
  ReplicaFailure   True    FailedCreate
```

`ReplicaFailure` with reason `FailedCreate` indicates that the ReplicaSet could not create its pods, so the next step is to look at the ReplicaSet's events.

### Step 2: Inspect the ReplicaSet

```bash
kubectl describe replicaset user-rpc-6bf77fbcd9
```

**Key error found**:
```
Events:
  Type     Reason        Age    From                   Message
  ----     ------        ----   ----                   -------
  Warning  FailedCreate  3m53s  replicaset-controller  Error creating:
    pods "user-rpc-6bf77fbcd9-" is forbidden: error looking up service
    account default/find-endpoints: serviceaccount "find-endpoints" not found
```

**Problem #1 diagnosed**: ❌ **the ServiceAccount `find-endpoints` is missing**
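
The ServiceAccount name can also be pulled out of the event text mechanically, which helps when several ReplicaSets fail at once. This is a minimal sketch, assuming the message keeps the `serviceaccount "<name>" not found` wording shown above; the helper name `missing_sa_from_event` is made up for this example.

```bash
# Print the ServiceAccount name found in a
# 'serviceaccount "<name>" not found' message.
# Prints nothing when the message does not contain that pattern.
missing_sa_from_event() {
  printf '%s\n' "$1" | sed -n 's/.*serviceaccount "\([^"]*\)" not found.*/\1/p'
}

# Usage (requires cluster access):
#   msg="$(kubectl get events --field-selector reason=FailedCreate \
#            -o jsonpath='{.items[*].message}')"
#   missing_sa_from_event "$msg"
```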

### Step 3: List the existing ServiceAccounts

```bash
kubectl get serviceaccount
```

**Result**:
```
NAME              AGE
cluster-example   4d10h
default           13d
redis-operator    9h
user-db           4m9s
```

This confirms that `find-endpoints` does not exist in the current namespace.

### Step 4: Check the Secrets

```bash
kubectl get secrets
```

**Result**: the expected secrets all exist, including:
- ✅ user-db-app
- ✅ user-redis
- ✅ user-db-ca, user-db-replication, user-db-server

### Step 5: Check the Redis deployment

```bash
kubectl get redisreplication
kubectl get pods | grep redis
```

**Findings**:
- ✅ The RedisReplication resource exists
- ❌ **No Redis pods were created at all**

**Redis problem diagnosed** (Problem #3 in the root cause analysis): ❌ **the Redis Operator is not acting on the RedisReplication resource**

---

## 🔧 First Fix Attempt

### Create the missing ServiceAccount

```bash
kubectl create serviceaccount find-endpoints
```

**Result**: ✅ ServiceAccount created

### Restart the Deployment

```bash
kubectl rollout restart deployment user-rpc
```

**Check again after waiting 5-10 seconds**:

```bash
kubectl get pods -o wide
```

**New findings**:

```
NAME                        READY   STATUS             RESTARTS   AGE
user-rpc-66f97fbdcc-ws7rc   0/1     ErrImagePull       0          26s
user-rpc-6bf77fbcd9-njm2z   0/1     ErrImagePull       0          29s
user-rpc-6bf77fbcd9-nwjtw   0/1     ImagePullBackOff   0          29s
user-rpc-6bf77fbcd9-wjrf8   0/1     ErrImagePull       0          29s
```
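
With several pods in different failure states, a per-status count is easier to scan than the raw table. This is a minimal sketch, assuming the `kubectl get pods` column layout shown above with STATUS in the third column; `count_pod_status` is a hypothetical helper name.

```bash
# Read `kubectl get pods` table output on stdin and print
# "<STATUS> <count>" for each status, sorted by status name.
# The header row is skipped; STATUS is taken from the third column.
count_pod_status() {
  awk 'NR > 1 && NF >= 3 { n[$3]++ } END { for (s in n) print s, n[s] }' | sort
}

# Usage (requires cluster access):
#   kubectl get pods | count_pod_status
```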
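
✅ **Good news**: pods are now being created, which confirms the ServiceAccount problem is resolved. Pods from two ReplicaSets appear because the restart started a new rollout (`user-rpc-66f97fbdcc`) while pods of the previous ReplicaSet (`user-rpc-6bf77fbcd9`) were still present.

❌ **New problem**: the image pull fails

---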

## 🐛 Root Cause Analysis

### Problem #1: Missing ServiceAccount ✅ Resolved

**Root cause**: the Deployment manifest in test.yaml specifies:
```yaml
spec:
  template:
    spec:
      serviceAccountName: find-endpoints  # this ServiceAccount does not exist
```

but test.yaml never defines the ServiceAccount resource itself.

**Solution**:
```bash
kubectl create serviceaccount find-endpoints
```

Or add the following to test.yaml:
```yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: find-endpoints
  namespace: default
```
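
To catch this class of error before pods fail, the ServiceAccount a Deployment references can be checked against the cluster. This is a rough sketch, assuming `kubectl` points at the right cluster and namespace; both function names are invented for illustration.

```bash
# Print the ServiceAccount referenced by a Deployment's pod template.
# Falls back to "default", which is what Kubernetes uses when the field is unset.
deployment_sa() {
  sa="$(kubectl get deployment "$1" \
        -o jsonpath='{.spec.template.spec.serviceAccountName}')" || return 1
  printf '%s\n' "${sa:-default}"
}

# Succeed only if that ServiceAccount exists in the current namespace.
check_deployment_sa() {
  sa="$(deployment_sa "$1")" || return 1
  if kubectl get serviceaccount "$sa" >/dev/null 2>&1; then
    echo "ok: serviceaccount/$sa exists"
  else
    echo "missing: serviceaccount/$sa"
    return 1
  fi
}

# Usage (requires cluster access):
#   check_deployment_sa user-rpc
```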

---

### Problem #2: Image pull failure ❌ Needs a fix

```bash
kubectl describe pod user-rpc-6bf77fbcd9-njm2z
```

**Detailed error log**:

```
Events:
  Warning  Failed  38s  kubelet  Failed to pull image
    "103.236.53.208:4418/library/user-rpc@sha256:76b27d3eb4d5d44e...":
    Error response from daemon: Get "https://103.236.53.208:4418/v2/":
    context deadline exceeded (Client.Timeout exceeded while awaiting headers)

  Warning  Failed  23s  kubelet  Failed to pull image
    "103.236.53.208:4418/library/user-rpc@sha256:76b27d3eb4d5d44e...":
    http: server gave HTTP response to HTTPS client
```

**Root cause analysis**:

1. **Connection timeout**: `context deadline exceeded` means the request to the registry did not receive a response in time
2. **Protocol mismatch**: `http: server gave HTTP response to HTTPS client`
   - The registry at `103.236.53.208:4418` serves plain HTTP, not HTTPS
   - The Docker daemon connects over HTTPS by default, but the server answers in HTTP

**Possible causes**:
- The registry address is wrong or unreachable
- The registry requires specific network configuration
- The registry server is offline or misconfigured
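
The two failure signatures in the log call for different fixes, so it can help to tell them apart automatically when scanning many events. This is a small sketch that matches only the two message fragments quoted above; the category labels and the function name `classify_pull_error` are assumptions made for this example.

```bash
# Map an image pull error message to a coarse category.
# Only the two signatures seen in the log above are recognized.
classify_pull_error() {
  case "$1" in
    *"server gave HTTP response to HTTPS client"*) echo "protocol-mismatch" ;;
    *"context deadline exceeded"*)                 echo "timeout" ;;
    *)                                             echo "other" ;;
  esac
}

# Usage:
#   classify_pull_error "http: server gave HTTP response to HTTPS client"
```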

---

### Problem #3: Redis deployment failure ❌ Needs diagnosis

**Symptoms**:
- The RedisReplication and RedisSentinel custom resources were created successfully
- No corresponding Redis pods were created
- `kubectl get pods | grep redis` returns nothing

**Possible causes**:

1. **The Redis Operator is not working properly**
   - The Operator pod may be unhealthy
   - The Operator did not pick up the new RedisReplication resource

2. **CRD or API version mismatch**
   - The API version `v1beta2` used in the manifest may not match the Operator version

3. **Resource limits or permissions**
   - The Operator lacks permission to create pods
   - Cluster resource limits block pod creation

---

## ✅ Fixes Applied

| # | Problem | Fix | Status |
|---|---------|-----|--------|
| 1 | Missing ServiceAccount | `kubectl create serviceaccount find-endpoints` | ✅ Done |
| 2 | Image pull failure | Update the image address or fix network access | ⏳ Pending |
| 3 | Redis pods not created | Inspect the Operator logs | ⏳ To be diagnosed |

---
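
## 🚀 Next Steps

### Priority 1: Fix the user-rpc image pull

**Option A: Use a local or internal image**
```yaml
# Change the image reference in test.yaml
image: localhost:5000/user-rpc:latest  # local private registry
# or
image: user-rpc:latest  # local image (if already imported with docker load)
# With the :latest tag the default imagePullPolicy is Always, so a purely
# local image also needs:
imagePullPolicy: IfNotPresent
```

**Option B: Fix access to the registry**
```yaml
# If 103.236.53.208:4418 is the correct registry, keep the reference scheme-free:
# an image reference cannot carry an http:// prefix. Plain HTTP must instead be
# allowed in each node's container runtime (an "insecure registry" setting).
image: 103.236.53.208:4418/library/user-rpc:latest
```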
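
Because the error comes from the Docker daemon, one way to allow plain HTTP for this registry is Docker's `insecure-registries` setting. The sketch below only prints the JSON fragment. Where it gets merged (commonly `/etc/docker/daemon.json` on Linux nodes) and whether plain HTTP is acceptable at all are assumptions to verify for this cluster; `insecure_registry_json` is a hypothetical helper name.

```bash
# Print a Docker daemon configuration fragment that marks one registry
# as insecure, meaning the daemon may talk to it over plain HTTP.
insecure_registry_json() {
  printf '{\n  "insecure-registries": ["%s"]\n}\n' "$1"
}

# Usage:
#   insecure_registry_json 103.236.53.208:4418
# Merge the output into the daemon configuration on every node that pulls
# the image, then restart the Docker daemon on that node.
```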

**Verification**:
```bash
# Check connectivity to the image registry
curl -v http://103.236.53.208:4418/v2/
```
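
Since the log shows both a timeout and a protocol mismatch, it may be worth probing both schemes from a cluster node, where the pull actually happens. This is a minimal sketch built on `curl`; it treats HTTP 200 or 401 from `/v2/` as a sign that a registry is answering, and the function names are invented for this example.

```bash
# Print the HTTP status code returned by <scheme>://<registry>/v2/.
# curl prints 000 when no HTTP response was received.
probe_registry() {
  curl -s -o /dev/null -m 5 -w '%{http_code}' "$1://$2/v2/"
}

# Print the first scheme (https, then http) on which the registry answers,
# or "unreachable" (with a non-zero exit status) if neither does.
# An HTTPS endpoint with an untrusted certificate also counts as no answer here.
registry_scheme() {
  for scheme in https http; do
    code="$(probe_registry "$scheme" "$1")" || true
    case "$code" in
      200|401) echo "$scheme"; return 0 ;;
    esac
  done
  echo "unreachable"
  return 1
}

# Usage (run on a cluster node):
#   registry_scheme 103.236.53.208:4418
```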

### Priority 2: Diagnose the Redis Operator

```bash
# Follow the Operator logs
kubectl logs -l app.kubernetes.io/name=redis-operator -f

# Find the Operator pod
kubectl get pods -A | grep redis-operator

# Show details of the RedisReplication resource
kubectl describe redisreplication user-redis

# Check the Operator's permissions (RBAC)
kubectl get role,rolebinding,clusterrole,clusterrolebinding | grep redis
```
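
Two of the suspected causes, an API version mismatch and missing permissions, can be checked directly rather than inferred from logs. This is a sketch under stated assumptions: the CRD name `redisreplications.redis.redis.opstreelabs.in` is derived from the `kubectl apply` output and the usual plural naming, the Operator is assumed to run as the `redis-operator` ServiceAccount in `default`, and managing Redis through StatefulSets is an assumption about the Operator. Both function names are hypothetical.

```bash
# Succeed if the given CRD declares the given API version.
crd_declares_version() {
  kubectl get crd "$1" -o jsonpath='{.spec.versions[*].name}' \
    | tr ' ' '\n' | grep -qxF "$2"
}

# Ask the API server whether a ServiceAccount may perform a verb on a resource.
# Arguments: <namespace> <serviceaccount> <verb> <resource>
sa_can_i() {
  kubectl auth can-i "$3" "$4" \
    --as="system:serviceaccount:$1:$2" -n "$1"
}

# Usage (requires cluster access; names are assumptions, see above):
#   crd_declares_version redisreplications.redis.redis.opstreelabs.in v1beta2
#   sa_can_i default redis-operator create statefulsets.apps
#   sa_can_i default redis-operator create pods
```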

### Priority 3: Harden test.yaml

Add the missing resource definitions to test.yaml:

```yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: find-endpoints
  namespace: default

---
apiVersion: v1
kind: Secret
metadata:
  name: registry-credentials
  namespace: default
type: kubernetes.io/dockercfg
data:
  .dockercfg: <base64-encoded-credentials>  # only if the private registry requires authentication
```
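
If the registry does turn out to require authentication, the Secret can also be generated by `kubectl` instead of hand-encoding the credentials. This is a minimal sketch: `kubectl create secret docker-registry` produces a Secret of type `kubernetes.io/dockerconfigjson` rather than the legacy `kubernetes.io/dockercfg` shown above, and attaching it to the `find-endpoints` ServiceAccount is one possible wiring, not something the original manifest confirms. The function names and credential placeholders are hypothetical.

```bash
# Create an image pull secret.
# Arguments: <secret-name> <registry-server> <username> <password>
create_pull_secret() {
  kubectl create secret docker-registry "$1" \
    --docker-server="$2" --docker-username="$3" --docker-password="$4"
}

# Attach a pull secret to a ServiceAccount so that pods running under it
# use the secret for image pulls.
# Arguments: <serviceaccount> <secret-name>
attach_pull_secret() {
  kubectl patch serviceaccount "$1" \
    -p "{\"imagePullSecrets\": [{\"name\": \"$2\"}]}"
}

# Usage (requires cluster access; <user> and <password> are placeholders):
#   create_pull_secret registry-credentials 103.236.53.208:4418 <user> <password>
#   attach_pull_secret find-endpoints registry-credentials
```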

---

## 📊 Current Cluster State

### Pod status summary

| Application | Desired replicas | Running | Status |
|-------------|------------------|---------|--------|
| user-db | 3 | 3 | ✅ Healthy |
| user-rpc | 3 | 0 | ❌ Image pull failure |
| Redis | 3 | 0 | ❌ Not created by the Operator |
| Sentinel | 3 | 0 | ❌ Not created by the Operator |

### Service status

```
✅ kubernetes (built-in)
✅ user-rpc-svc:9001
✅ user-db-r:5432 (any instance, for reads)
✅ user-db-ro:5432 (read-only replicas)
✅ user-db-rw:5432 (read-write primary)
```

### HPA configuration

```
✅ user-rpc-hpa-c (CPU target: 80%) - inactive (pods not running)
✅ user-rpc-hpa-m (Memory target: 80%) - inactive (pods not running)
```

---

## 📝 Key Command Cheat Sheet

```bash
# Show the Deployment status
kubectl describe deployment user-rpc

# Show the ReplicaSet's error events
kubectl describe replicaset user-rpc-6bf77fbcd9

# Show detailed errors for a pod
kubectl describe pod user-rpc-6bf77fbcd9-njm2z

# Show a pod's logs
kubectl logs user-rpc-6bf77fbcd9-njm2z

# List all events, sorted by time
kubectl get events --sort-by='.lastTimestamp'

# List the common workload resources in a namespace
# (`get all` omits custom resources, ServiceAccounts, and Secrets)
kubectl get all -n default

# Restart the Deployment (forces the pods to be recreated)
kubectl rollout restart deployment user-rpc

# Show the Operator logs
kubectl logs -l app.kubernetes.io/name=redis-operator

# Check that the Redis CRDs are registered
kubectl api-resources | grep redis
```

---
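
## 🎯 Summary

| Problem | Cause | Resolution status |
|---------|-------|-------------------|
| **Missing ServiceAccount** | Referenced in the manifest but never created | ✅ **Resolved** |
| **Image pull failure** | Registry unreachable or protocol mismatch | ⏳ **Pending** |
| **Redis not deployed** | Operator not acting on the custom resources | ⏳ **To be diagnosed** |

**Recommended actions**:
1. Confirm or fix the user-rpc image address
2. Diagnose the state of the Redis Operator
3. Verify that every required ServiceAccount and Secret exists
4. Consider defining all resources in test.yaml to avoid manual creation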