name: kubernetes
description: |
  Comprehensive Kubernetes and OpenShift cluster management skill covering operations, troubleshooting, manifest generation, security, and GitOps. Use this skill for:
  (1) Cluster operations: upgrades, backups, node management, scaling, monitoring setup
  (2) Troubleshooting: Pod failures, networking issues, storage problems, performance analysis
  (3) Manifest authoring: Deployment, StatefulSet, Service, Ingress, NetworkPolicy, RBAC
  (4) Security: audits, Pod Security Standards, RBAC, secret management, vulnerability scanning
  (5) GitOps: ArgoCD, Flux, Kustomize, Helm, CI/CD pipelines, progressive delivery
  (6) OpenShift specifics: SCC, Route, Operator, Build, ImageStream
  (7) Multi-cloud: AKS, EKS, GKE, ARO, ROSA operations
metadata:
  author: cluster-skills
  version: "1.0.0"
Comprehensive Kubernetes and OpenShift cluster management skill covering operations, troubleshooting, manifests, security, and GitOps.
| Platform | Version | Docs |
|---|---|---|
| Kubernetes | 1.31.x | https://kubernetes.io/docs/ |
| OpenShift | 4.17.x | https://docs.openshift.com/ |
| EKS | 1.31 | https://docs.aws.amazon.com/eks/ |
| AKS | 1.31 | https://learn.microsoft.com/azure/aks/ |
| GKE | 1.31 | https://cloud.google.com/kubernetes-engine/docs |
| Tool | Version | Purpose |
|---|---|---|
| ArgoCD | v2.13.x | GitOps deployment |
| Flux | v2.4.x | GitOps toolkit |
| Kustomize | v5.5.x | Manifest customization |
| Helm | v3.16.x | Package management |
| Velero | 1.15.x | Backup/restore |
| Trivy | 0.58.x | Security scanning |
| Kyverno | 1.13.x | Policy engine |
Important: use kubectl on standard Kubernetes; use oc on OpenShift/ARO.
# List nodes
kubectl get nodes -o wide
# Drain a node for maintenance
kubectl drain ${NODE} --ignore-daemonsets --delete-emptydir-data --grace-period=60
# Make the node schedulable again after maintenance
kubectl uncordon ${NODE}
# Show node resource usage
kubectl top nodes
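Draining honors PodDisruptionBudgets, so workloads that must keep a quorum should declare one before maintenance windows. A minimal sketch (the name and `minAvailable` value are placeholders to adapt):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ${APP_NAME}-pdb
spec:
  # Drain will be blocked until at least 2 replicas remain available elsewhere
  minAvailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: ${APP_NAME}
```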
AKS:
az aks get-upgrades -g ${RG} -n ${CLUSTER} -o table
az aks upgrade -g ${RG} -n ${CLUSTER} --kubernetes-version ${VERSION}
EKS:
aws eks update-cluster-version --name ${CLUSTER} --kubernetes-version ${VERSION}
GKE:
gcloud container clusters upgrade ${CLUSTER} --master --cluster-version ${VERSION}
OpenShift:
oc adm upgrade --to=${VERSION}
oc get clusterversion
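Regardless of platform, the control plane should only be upgraded one minor version at a time (the Kubernetes version-skew policy). A small pure-bash helper to sanity-check a target version before running any of the upgrade commands above; the function names are illustrative, not part of any CLI:

```shell
#!/usr/bin/env bash
# Illustrative pre-upgrade guard: reject upgrades that skip a minor version.

minor_of() {
  local v="${1#v}"           # strip a leading "v", e.g. v1.31.2 -> 1.31.2
  echo "${v}" | cut -d. -f2  # return the minor component, e.g. 31
}

can_upgrade() {
  local cur tgt
  cur=$(minor_of "$1")
  tgt=$(minor_of "$2")
  if (( tgt - cur > 1 )); then
    echo "blocked: $1 -> $2 skips a minor version"
    return 1
  fi
  echo "ok: $1 -> $2"
}

can_upgrade v1.30.4 v1.31.2   # prints "ok: v1.30.4 -> v1.31.2"
```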
# Install Velero
velero install --provider ${PROVIDER} --bucket ${BUCKET} --secret-file ${CREDS}
# Create a backup
velero backup create ${BACKUP_NAME} --include-namespaces ${NS}
# Restore
velero restore create --from-backup ${BACKUP_NAME}
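Beyond one-off backups, Velero can run on a cron via a Schedule resource. A sketch assuming a daily 2 a.m. backup with 30-day retention (the name and values are placeholders):

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-${NS}
  namespace: velero
spec:
  schedule: "0 2 * * *"   # cron expression, cluster-local time
  template:
    includedNamespaces:
      - ${NS}
    ttl: 720h0m0s          # keep backups for 30 days
```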
Run the bundled script for a comprehensive health check:
bash scripts/cluster-health-check.sh
| Status | Meaning | Suggested action |
|---|---|---|
| Pending | Scheduling problem | Check resources, nodeSelector, tolerations |
| CrashLoopBackOff | Container keeps crashing | Check logs: `kubectl logs ${POD} --previous` |
| ImagePullBackOff | Image unavailable | Verify image name and registry access |
| OOMKilled | Out of memory | Raise the memory limit |
| Evicted | Node pressure | Check node resources |
# Pod logs (current and previous container)
kubectl logs ${POD} -c ${CONTAINER} --previous
# Multi-pod logs with stern
stern ${LABEL_SELECTOR} -n ${NS}
# Exec into a Pod
kubectl exec -it ${POD} -- /bin/sh
# Pod events
kubectl describe pod ${POD} | grep -A 20 Events
# Cluster events (sorted by time)
kubectl get events -A --sort-by='.lastTimestamp' | tail -50
# Test DNS
kubectl run -it --rm debug --image=busybox -- nslookup kubernetes.default
# Test service connectivity
kubectl run -it --rm debug --image=curlimages/curl -- curl -v http://${SVC}.${NS}:${PORT}
# Check endpoints
kubectl get endpoints ${SVC}
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ${APP_NAME}
  namespace: ${NAMESPACE}
  labels:
    app.kubernetes.io/name: ${APP_NAME}
    app.kubernetes.io/version: "${VERSION}"
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app.kubernetes.io/name: ${APP_NAME}
  template:
    metadata:
      labels:
        app.kubernetes.io/name: ${APP_NAME}
    spec:
      serviceAccountName: ${APP_NAME}
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: ${APP_NAME}
          image: ${IMAGE}:${TAG}
          ports:
            - name: http
              containerPort: 8080
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
          livenessProbe:
            httpGet:
              path: /healthz
              port: http
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: http
            initialDelaySeconds: 5
            periodSeconds: 5
          volumeMounts:
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: tmp
          emptyDir: {}
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app.kubernetes.io/name: ${APP_NAME}
                topologyKey: kubernetes.io/hostname
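The Deployment pins `replicas: 3`; for load-driven scaling, a HorizontalPodAutoscaler can manage the replica count instead. A sketch with illustrative bounds and target (tune per workload):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ${APP_NAME}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ${APP_NAME}
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

If an HPA manages the Deployment, drop the fixed `replicas` field from the Deployment so the two controllers don't fight over it.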
apiVersion: v1
kind: Service
metadata:
  name: ${APP_NAME}
spec:
  selector:
    app.kubernetes.io/name: ${APP_NAME}
  ports:
    - name: http
      port: 80
      targetPort: http
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ${APP_NAME}
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - ${HOST}
      secretName: ${APP_NAME}-tls
  rules:
    - host: ${HOST}
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ${APP_NAME}
                port:
                  name: http
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: ${APP_NAME}
spec:
  to:
    kind: Service
    name: ${APP_NAME}
  port:
    targetPort: http
  tls:
    termination: edge
    insecureEdgeTerminationPolicy: Redirect
Generate manifests with the bundled script:
bash scripts/generate-manifest.sh deployment myapp production
Run the bundled script:
bash scripts/security-audit.sh [namespace]
apiVersion: v1
kind: Namespace
metadata:
  name: ${NAMESPACE}
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: baseline
    pod-security.kubernetes.io/warn: restricted
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ${APP_NAME}-policy
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: ${APP_NAME}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: frontend
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: database
      ports:
        - protocol: TCP
          port: 5432
    # Allow DNS
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
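Allow-list policies like the one above are most effective when layered on top of a namespace-wide default deny; a common baseline sketch:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}   # selects every Pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```

With this in place, traffic flows only where a more specific policy explicitly allows it.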
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ${APP_NAME}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ${APP_NAME}-role
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ${APP_NAME}-binding
subjects:
  - kind: ServiceAccount
    name: ${APP_NAME}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: ${APP_NAME}-role
# Scan an image with Trivy
trivy image ${IMAGE}:${TAG}
# Filter the scan by severity
trivy image --severity HIGH,CRITICAL ${IMAGE}:${TAG}
# Generate an SBOM
trivy image --format spdx-json -o sbom.json ${IMAGE}:${TAG}
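Scanning detects vulnerable images after the fact; the Kyverno policy engine from the tools table can also block untrusted images at admission time. A hedged sketch that only allows images from a trusted registry (the registry prefix and policy name are placeholders):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-image-registries
spec:
  validationFailureAction: Enforce   # use Audit to report without blocking
  rules:
    - name: trusted-registry-only
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Images must come from the trusted registry."
        pattern:
          spec:
            containers:
              - image: "myregistry.example.com/*"
```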
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ${APP_NAME}
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  source:
    repoURL: ${GIT_REPO}
    targetRevision: main
    path: k8s/overlays/${ENV}
  destination:
    server: https://kubernetes.default.svc
    namespace: ${NAMESPACE}
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
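For teams on Flux rather than ArgoCD, the rough equivalent is a GitRepository source plus a Flux Kustomization; a sketch assuming the same repository layout:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: ${APP_NAME}
  namespace: flux-system
spec:
  interval: 1m          # how often to poll the repo
  url: ${GIT_REPO}
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: ${APP_NAME}
  namespace: flux-system
spec:
  interval: 10m         # how often to reconcile the cluster state
  sourceRef:
    kind: GitRepository
    name: ${APP_NAME}
  path: ./k8s/overlays/${ENV}
  prune: true           # delete resources removed from Git
```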
k8s/
├── base/
│   ├── kustomization.yaml
│   ├── deployment.yaml
│   └── service.yaml
└── overlays/
    ├── dev/
    │   └── kustomization.yaml
    ├── staging/
    │   └── kustomization.yaml
    └── prod/
        └── kustomization.yaml
base/kustomization.yaml:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
overlays/prod/kustomization.yaml:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
namePrefix: prod-
namespace: production
replicas:
  - name: myapp
    count: 5
images:
  - name: myregistry/myapp
    newTag: v1.2.3
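A dev overlay typically goes the other way, shrinking the footprint with a patch. An illustrative overlays/dev/kustomization.yaml (the patched memory limit is a placeholder):

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
namespace: dev
replicas:
  - name: myapp
    count: 1
patches:
  # Inline JSON6902 patch lowering the container memory limit for dev
  - target:
      kind: Deployment
      name: myapp
    patch: |
      - op: replace
        path: /spec/template/spec/containers/0/resources/limits/memory
        value: 256Mi
```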
name: Build and Deploy
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push image
        uses: docker/build-push-action@v5
        with:
          push: true
          tags: ${{ secrets.REGISTRY }}/${{ github.event.repository.name }}:${{ github.sha }}
      - name: Update Kustomize image
        run: |
          cd k8s/overlays/prod
          kustomize edit set image myapp=${{ secrets.REGISTRY }}/${{ github.event.repository.name }}:${{ github.sha }}
      - name: Commit and push
        run: |
          git config user.name "github-actions"
          git config user.email "github-actions@github.com"
          git add .
          git commit -m "Update image to ${{ github.sha }}"
          git push
Use the bundled script for ArgoCD syncs:
bash scripts/argocd-app-sync.sh ${APP_NAME} --prune
This skill ships automation scripts under the scripts/ directory:
| Script | Purpose |
|---|---|
| cluster-health-check.sh | Comprehensive cluster health assessment with scoring |
| security-audit.sh | Security posture audit (privileged containers, root, RBAC, NetworkPolicy) |
| node-maintenance.sh | Safe node drain and maintenance preparation |
| pre-upgrade-check.sh | Pre-upgrade validation checklist |
| generate-manifest.sh | Generate production-ready K8s manifests |
| argocd-app-sync.sh | ArgoCD application sync helper |
Run any script with:
bash scripts/<script-name>.sh [args]