Kubernetes运行ZooKeeper,一个分布式系统协调器

运行ZooKeeper,一个分布式系统协调器

目标

在本教程之后,您将了解以下内容。

  • 如何使用StatefulSet部署ZooKeeper集合
  • 如何使用ConfigMaps一致地配置集合。
  • 如何在集合中扩展ZooKeeper服务器的部署。
  • 如何使用PodDisruptionBudgets确保计划维护期间的服务可用性。

创建ZooKeeper综合

下面的清单包含Headless ServiceServicePodDisruptionBudgetStatefulSet

apiVersion: v1
kind: Service
metadata:
  name: zk-hs
  labels:
    app: zk
spec:
  ports:
  - port: 2888
    name: server
  - port: 3888
    name: leader-election
  clusterIP: None
  selector:
    app: zk
---
apiVersion: v1
kind: Service
metadata:
  name: zk-cs
  labels:
    app: zk
spec:
  ports:
  - port: 2181
    name: client
  selector:
    app: zk
---
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  selector:
    matchLabels:
      app: zk
  maxUnavailable: 1
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zk
spec:
  selector:
    matchLabels:
      app: zk
  serviceName: zk-hs
  replicas: 3
  updateStrategy:
    type: RollingUpdate
  podManagementPolicy: Parallel
  template:
    metadata:
      labels:
        app: zk
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: "app"
                    operator: In
                    values:
                    - zk
              topologyKey: "kubernetes.io/hostname"
      containers:
      - name: kubernetes-zookeeper
        imagePullPolicy: Always
        image: "k8s.gcr.io/kubernetes-zookeeper:1.0-3.4.10"
        resources:
          requests:
            memory: "1Gi"
            cpu: "0.5"
        ports:
        - containerPort: 2181
          name: client
        - containerPort: 2888
          name: server
        - containerPort: 3888
          name: leader-election
        command:
        - sh
        - -c
        - "start-zookeeper \
          --servers=3 \
          --data_dir=/var/lib/zookeeper/data \
          --data_log_dir=/var/lib/zookeeper/data/log \
          --conf_dir=/opt/zookeeper/conf \
          --client_port=2181 \
          --election_port=3888 \
          --server_port=2888 \
          --tick_time=2000 \
          --init_limit=10 \
          --sync_limit=5 \
          --heap=512M \
          --max_client_cnxns=60 \
          --snap_retain_count=3 \
          --purge_interval=12 \
          --max_session_timeout=40000 \
          --min_session_timeout=4000 \
          --log_level=INFO"
        readinessProbe:
          exec:
            command:
            - sh
            - -c
            - "zookeeper-ready 2181"
          initialDelaySeconds: 10
          timeoutSeconds: 5
        livenessProbe:
          exec:
            command:
            - sh
            - -c
            - "zookeeper-ready 2181"
          initialDelaySeconds: 10
          timeoutSeconds: 5
        volumeMounts:
        - name: datadir
          mountPath: /var/lib/zookeeper
      securityContext:
        runAsUser: 1000
        fsGroup: 1000
  volumeClaimTemplates:
  - metadata:
      name: datadir
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi

打开终端,然后使用kubectl apply命令创建清单。

kubectl apply -f https://k8s.io/examples/application/zookeeper/zookeeper.yaml

这将创建Headless Service为zk-hs,Service为zk-cs,PodDisruptionBudget是zk-pdb,StatefulSet为zk

service/zk-hs created
service/zk-cs created
poddisruptionbudget.policy/zk-pdb created
statefulset.apps/zk created

使用kubectl get来监视StatefulSet控制器创建StatefulSetPod

kubectl get pods -w -l app=zk

一旦zk-2 运行并准备就绪,使用CTRL-C终止kubectl

NAME      READY     STATUS    RESTARTS   AGE
zk-0      0/1       Pending   0          0s
zk-0      0/1       Pending   0         0s
zk-0      0/1       ContainerCreating   0         0s
zk-0      0/1       Running   0         19s
zk-0      1/1       Running   0         40s
zk-1      0/1       Pending   0         0s
zk-1      0/1       Pending   0         0s
zk-1      0/1       ContainerCreating   0         0s
zk-1      0/1       Running   0         18s
zk-1      1/1       Running   0         40s
zk-2      0/1       Pending   0         0s
zk-2      0/1       Pending   0         0s
zk-2      0/1       ContainerCreating   0         0s
zk-2      0/1       Running   0         19s
zk-2      1/1       Running   0         40s

StatefulSet控制器创建三个Pod,每个Pod都有一个带ZooKeeper服务器的容器。

促进leader选举

由于在匿名网络中选择leader没有终止算法,因此Zab需要显式成员资格配置来执行leader选举。 集合中的每个服务器都需要具有唯一标识符,所有服务器都需要知道全局标识符集,并且每个标识符需要与网络地址相关联。

使用kubectl exec获取zk StatefulSetPod的主机名。

for i in 0 1 2; do kubectl exec zk-$i -- hostname; done

StatefulSet控制器根据其序数索引为每个Pod提供唯一的主机名。 主机名采用<statefulset name> - <ordinal index>的形式。 由于zk StatefulSet的副本字段设置为3,因此Set的控制器创建三个Pod,其主机名设置为zk-0zk-1zk-2

zk-0
zk-1
zk-2

ZooKeeper集合中的服务器使用自然数作为唯一标识符,并将每个服务器的标识符存储在服务器数据目录中名为myid的文件中。

要检查每个服务器的myid文件的内容,请使用以下命令。

for i in 0 1 2; do echo "myid zk-$i";kubectl exec zk-$i -- cat /var/lib/zookeeper/data/myid; done

因为标识符是自然数,而序数索引是非负整数,所以可以通过向序数加1来生成标识符。

myid zk-0
1
myid zk-1
2
myid zk-2
3

要获取zk StatefulSet中每个Pod的完全限定域名(FQDN),请使用以下命令。

for i in 0 1 2; do kubectl exec zk-$i -- hostname -f; done

zk-hs服务为所有Pod创建一个域,

zk-hs.default.svc.cluster.local.

zk-0.zk-hs.default.svc.cluster.local
zk-1.zk-hs.default.svc.cluster.local
zk-2.zk-hs.default.svc.cluster.local

Kubernetes DNS中的A记录将FQDN解析为Pod的IP地址。如果Kubernetes重新调度Pod,它将使用Pod的新IP地址更新A记录,但A记录名称不会更改。

ZooKeeper将其应用程序配置存储在名为zoo.cfg的文件中。 使用kubectl exec查看zk-0Pod中zoo.cfg文件的内容。

kubectl exec zk-0 -- cat /opt/zookeeper/conf/zoo.cfg

在文件底部的server.1server.2server.3属性中,1,23对应于ZooKeeper服务器的myid文件中的标识符。 它们被设置为zkStatefulSet中Pod的FQDN。

clientPort=2181
dataDir=/var/lib/zookeeper/data
dataLogDir=/var/lib/zookeeper/log
tickTime=2000
initLimit=10
syncLimit=2000
maxClientCnxns=60
minSessionTimeout= 4000
maxSessionTimeout= 40000
autopurge.snapRetainCount=3
autopurge.purgeInterval=0
server.1=zk-0.zk-hs.default.svc.cluster.local:2888:3888
server.2=zk-1.zk-hs.default.svc.cluster.local:2888:3888
server.3=zk-2.zk-hs.default.svc.cluster.local:2888:3888

达成共识

If two Pods are launched with the same ordinal, two ZooKeeper servers would both identify themselves as the same server.

共识协议要求每个参与者的标识符是唯一的。Zab协议中没有两个参与者应该声明相同的唯一标识符。这对于允许系统中的进程就哪些进程提交了哪些数据达成一致是必要的。如果使用相同的序号启动两个Pod,则两个ZooKeeper服务器都将自己标识为同一服务器。

kubectl get pods -w -l app=zk

NAME      READY     STATUS    RESTARTS   AGE
zk-0      0/1       Pending   0          0s
zk-0      0/1       Pending   0         0s
zk-0      0/1       ContainerCreating   0         0s
zk-0      0/1       Running   0         19s
zk-0      1/1       Running   0         40s
zk-1      0/1       Pending   0         0s
zk-1      0/1       Pending   0         0s
zk-1      0/1       ContainerCreating   0         0s
zk-1      0/1       Running   0         18s
zk-1      1/1       Running   0         40s
zk-2      0/1       Pending   0         0s
zk-2      0/1       Pending   0         0s
zk-2      0/1       ContainerCreating   0         0s
zk-2      0/1       Running   0         19s
zk-2      1/1       Running   0         40s

当Pod变为就绪时,输入每个Pod的A记录。因此,ZooKeeper服务器的FQDN将解析为单个端点,该端点将是声称在其myid文件中配置的身份的唯一ZooKeeper服务器。

zk-0.zk-hs.default.svc.cluster.local
zk-1.zk-hs.default.svc.cluster.local
zk-2.zk-hs.default.svc.cluster.local

这可确保ZooKeepers的zoo.cfg文件中的服务器属性表示正确配置的集合。

server.1=zk-0.zk-hs.default.svc.cluster.local:2888:3888
server.2=zk-1.zk-hs.default.svc.cluster.local:2888:3888
server.3=zk-2.zk-hs.default.svc.cluster.local:2888:3888

当服务器使用Zab协议尝试提交value时,他们将达成共识并提交value(如果leader选举成功并且至少有两个Pod正在运行和就绪),或者他们将无法做到(如果不符合任何一个条件)。如果一个服务器代表另一个服务器确认写入,则不会出现任何状态。

综合测试

最基本的健全性测试是将数据写入一个ZooKeeper服务器并从另一个服务器读取数据。

The command below executes the zkCli.sh script to write world to the path /hello on the zk-0 Pod in the ensemble.
下面的命令执行zkCli.sh脚本,将world写入集合中zk-0 Pod的路径/hello

kubectl exec zk-0 zkCli.sh create /hello world

WATCHER::

WatchedEvent state:SyncConnected type:None path:null
Created /hello

zk-1获取数据。

kubectl exec zk-1 zkCli.sh get /hello

你在zk-0上创建的数据在所有服务器上都可用。

WATCHER::

WatchedEvent state:SyncConnected type:None path:null
world
cZxid = 0x100000002
ctime = Thu Dec 08 15:13:30 UTC 2016
mZxid = 0x100000002
mtime = Thu Dec 08 15:13:30 UTC 2016
pZxid = 0x100000002
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 5
numChildren = 0

提供耐用的存储

ZooKeeper Basics部分所述,ZooKeeper将所有条目提交给持久的WAL,并定期将内存状态的快照写入存储介质。使用WAL来提供持久性是使用共识协议来实现复制状态机的应用程序的常用技术。

使用kubectl delete删除zk StatefulSet

kubectl delete statefulset zk
statefulset.apps "zk" deleted

观察StatefulSet中Pod的终止。

kubectl get pods -w -l app=zk

zk-0完全终止时,使用CTRL-C终止kubectl

zk-2      1/1       Terminating   0         9m
zk-0      1/1       Terminating   0         11m
zk-1      1/1       Terminating   0         10m
zk-2      0/1       Terminating   0         9m
zk-2      0/1       Terminating   0         9m
zk-2      0/1       Terminating   0         9m
zk-1      0/1       Terminating   0         10m
zk-1      0/1       Terminating   0         10m
zk-1      0/1       Terminating   0         10m
zk-0      0/1       Terminating   0         11m
zk-0      0/1       Terminating   0         11m
zk-0      0/1       Terminating   0         11m

重新applyzookeeper.yaml。

kubectl apply -f https://k8s.io/examples/application/zookeeper/zookeeper.yaml

这将创建zk StatefulSet对象,但清单中的其他API对象不会被修改,因为它们已经存在。

观察StatefulSet控制器重新创建StatefulSetPod

kubectl get pods -w -l app=zk
Once the zk-2 Pod is Running and Ready, use CTRL-C to terminate kubectl.
NAME      READY     STATUS    RESTARTS   AGE
zk-0      0/1       Pending   0          0s
zk-0      0/1       Pending   0         0s
zk-0      0/1       ContainerCreating   0         0s
zk-0      0/1       Running   0         19s
zk-0      1/1       Running   0         40s
zk-1      0/1       Pending   0         0s
zk-1      0/1       Pending   0         0s
zk-1      0/1       ContainerCreating   0         0s
zk-1      0/1       Running   0         18s
zk-1      1/1       Running   0         40s
zk-2      0/1       Pending   0         0s
zk-2      0/1       Pending   0         0s
zk-2      0/1       ContainerCreating   0         0s
zk-2      0/1       Running   0         19s
zk-2      1/1       Running   0         40s

使用以下命令从zk-2 Pod获取在完整性测试期间输入的值。

kubectl exec zk-2 zkCli.sh get /hello

即使你终止并重新创建了zk StatefulSet中的所有Pod,该集合仍然提供原始值。

WATCHER::

WatchedEvent state:SyncConnected type:None path:null
world
cZxid = 0x100000002
ctime = Thu Dec 08 15:13:30 UTC 2016
mZxid = 0x100000002
mtime = Thu Dec 08 15:13:30 UTC 2016
pZxid = 0x100000002
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 5
numChildren = 0

zk StatefulSet规范的volumeClaimTemplates字段指定为每个Pod配置的PersistentVolume

volumeClaimTemplates:
  - metadata:
      name: datadir
      annotations:
        volume.alpha.kubernetes.io/storage-class: anything
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 20Gi

StatefulSet控制器为StatefulSet中的每个Pod生成PersistentVolumeClaim

使用以下命令获取StatefulSet的PersistentVolumeClaims

kubectl get pvc -l app=zk
When the StatefulSet recreated its Pods, it remounts the Pods’ PersistentVolumes.
NAME           STATUS    VOLUME                                     CAPACITY   ACCESSMODES   AGE
datadir-zk-0   Bound     pvc-bed742cd-bcb1-11e6-994f-42010a800002   20Gi       RWO           1h
datadir-zk-1   Bound     pvc-bedd27d2-bcb1-11e6-994f-42010a800002   20Gi       RWO           1h
datadir-zk-2   Bound     pvc-bee0817e-bcb1-11e6-994f-42010a800002   20Gi       RWO           1h

StatefulSet容器模板的volumeMounts部分在ZooKeeper服务器的数据目录中安装PersistentVolumes

volumeMounts:
        - name: datadir
          mountPath: /var/lib/zookeeper

当(重新)调度zk StatefulSet中的Pod时,它将始终具有安装到ZooKeeper服务器的数据目录的相同PersistentVolume。 即使重新调度Pod,对ZooKeeper服务器的WAL及其所有快照的所有写入都保持持久。

确保一致的配置

如“促进leader选举和实现共识”部分所述,ZooKeeper集合中的服务器需要一致的配置来选举领导者并形成法定人数。 它们还需要一致配置Zab协议,以使协议在网络上正常工作。在我们的示例中,我们通过将配置直接嵌入清单来实现一致的配置。

获取zk StatefulSet。

kubectl get sts zk -o yaml
…
command:
      - sh
      - -c
      - "start-zookeeper \
        --servers=3 \
        --data_dir=/var/lib/zookeeper/data \
        --data_log_dir=/var/lib/zookeeper/data/log \
        --conf_dir=/opt/zookeeper/conf \
        --client_port=2181 \
        --election_port=3888 \
        --server_port=2888 \
        --tick_time=2000 \
        --init_limit=10 \
        --sync_limit=5 \
        --heap=512M \
        --max_client_cnxns=60 \
        --snap_retain_count=3 \
        --purge_interval=12 \
        --max_session_timeout=40000 \
        --min_session_timeout=4000 \
        --log_level=INFO"

用于启动ZooKeeper服务器的命令将配置作为命令行参数传递。 您还可以使用环境变量将配置传递。

配置日志记录

zkGenConfig.sh脚本生成的其中一个文件控制着ZooKeeper的日志记录。 ZooKeeper使用Log4j,默认情况下,它使用基于时间和大小的滚动文件追加器进行日志记录配置。

使用以下命令从zk StatefulSet中的一个Pod获取日志记录配置。

kubectl exec zk-0 cat /usr/etc/zookeeper/log4j.properties

下面的日志记录配置将导致ZooKeeper进程将其所有日志写入标准输出文件流。

zookeeper.root.logger=CONSOLE
zookeeper.console.threshold=INFO
log4j.rootLogger=${zookeeper.root.logger}
log4j.appender.CONSOLE=org.apache.log4j.ConsoleAppender
log4j.appender.CONSOLE.Threshold=${zookeeper.console.threshold}
log4j.appender.CONSOLE.layout=org.apache.log4j.PatternLayout
log4j.appender.CONSOLE.layout.ConversionPattern=%d{ISO8601} [myid:%X{myid}] - %-5p [%t:%C{1}@%L] - %m%n

这是安全日志容器的最简单方法。由于应用程序将日志写入标准输出,因此Kubernetes将为您处理日志循环。Kubernetes还实施了一种理智的保留策略,确保写入标准输出和标准错误的应用程序日志不会耗尽本地存储介质。

使用kubectl日志从其中一个Pod中检索最后20个日志行。

kubectl logs zk-0 --tail 20

您可以使用kubectl logsKubernetes Dashboard查看写入标准输出或标准错误的应用程序日志。

2016-12-06 19:34:16,236 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52740
2016-12-06 19:34:16,237 [myid:1] - INFO  [Thread-1136:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52740 (no session established for client)
2016-12-06 19:34:26,155 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52749
2016-12-06 19:34:26,155 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52749
2016-12-06 19:34:26,156 [myid:1] - INFO  [Thread-1137:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52749 (no session established for client)
2016-12-06 19:34:26,222 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52750
2016-12-06 19:34:26,222 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52750
2016-12-06 19:34:26,226 [myid:1] - INFO  [Thread-1138:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52750 (no session established for client)
2016-12-06 19:34:36,151 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52760
2016-12-06 19:34:36,152 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52760
2016-12-06 19:34:36,152 [myid:1] - INFO  [Thread-1139:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52760 (no session established for client)
2016-12-06 19:34:36,230 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52761
2016-12-06 19:34:36,231 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52761
2016-12-06 19:34:36,231 [myid:1] - INFO  [Thread-1140:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52761 (no session established for client)
2016-12-06 19:34:46,149 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52767
2016-12-06 19:34:46,149 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52767
2016-12-06 19:34:46,149 [myid:1] - INFO  [Thread-1141:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52767 (no session established for client)
2016-12-06 19:34:46,230 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52768
2016-12-06 19:34:46,230 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52768
2016-12-06 19:34:46,230 [myid:1] - INFO  [Thread-1142:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52768 (no session established for client)

Kubernetes supports more powerful, but more complex, logging integrations with Stackdriver and Elasticsearch and Kibana. For cluster level log shipping and aggregation, consider deploying a sidecar container to rotate and ship your logs.
Kubernetes支持与Stackdriver,Elasticsearch和Kibana进行更强大更复杂的日志记录集成。对于集群级日志传送和聚合,请考虑部署sidecar容器以轮询和发送日志。

配置非特权用户

允许应用程序作为特权用户在容器内运行的最佳实践是一个有争议的问题。如果你们要求应用程序作为非特权用户运行,则可以使用SecurityContext来控制入口点运行的用户。

zk StatefulSet的Pod模板包含SecurityContext。

securityContext:
  runAsUser: 1000
  fsGroup: 1000

在Pods的容器中,UID 1000对应于zookeeper用户,GID 1000对应于zookeeper组。

zk-0 Pod获取ZooKeeper进程信息。

kubectl exec zk-0 -- ps -elf

由于securityContext对象的runAsUser字段设置为1000,而不是以root身份运行,因此ZooKeeper进程作为zookeeper用户运行。

F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
4 S zookeep+     1     0  0  80   0 -  1127 -      20:46 ?        00:00:00 sh -c zkGenConfig.sh && zkServer.sh start-foreground
0 S zookeep+    27     1  0  80   0 - 1155556 -    20:46 ?        00:00:19 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Dzookeeper.log.dir=/var/log/zookeeper -Dzookeeper.root.logger=INFO,CONSOLE -cp /usr/bin/../build/classes:/usr/bin/../build/lib/*.jar:/usr/bin/../share/zookeeper/zookeeper-3.4.9.jar:/usr/bin/../share/zookeeper/slf4j-log4j12-1.6.1.jar:/usr/bin/../share/zookeeper/slf4j-api-1.6.1.jar:/usr/bin/../share/zookeeper/netty-3.10.5.Final.jar:/usr/bin/../share/zookeeper/log4j-1.2.16.jar:/usr/bin/../share/zookeeper/jline-0.9.94.jar:/usr/bin/../src/java/lib/*.jar:/usr/bin/../etc/zookeeper: -Xmx2G -Xms2G -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=false org.apache.zookeeper.server.quorum.QuorumPeerMain /usr/bin/../etc/zookeeper/zoo.cfg

默认情况下,当Pod的PersistentVolumes挂载到ZooKeeper服务器的数据目录时,只有root用户才能访问它。 此配置可防止ZooKeeper进程写入其WAL并存储其快照。

使用以下命令获取zk-0 Pod上ZooKeeper数据目录的文件权限。

kubectl exec -ti zk-0 -- ls -ld /var/lib/zookeeper/data

由于securityContext对象的fsGroup字段设置为1000,因此Pods的PersistentVolumes的所有权设置为zookeeper组,ZooKeeper进程可以读取和写入其数据。

drwxr-sr-x 3 zookeeper zookeeper 4096 Dec  5 20:45 /var/lib/zookeeper/data

管理ZooKeeper进程

ZooKeeper文档提到“你将需要一个管理每个ZooKeeper服务器进程(JVM)的监督进程。” 利用监视程序(监督进程)重新启动分布式系统中的失败进程是一种常见的模式。在Kubernetes中部署应用程序时,不应使用外部实用程序作为监督过程,而应使用Kubernetes作为应用程序的监视程序。

更新整体

zk StatefulSet配置为使用RollingUpdate更新策略。

您可以使用kubectl patch来更新分配给服务器的cpu数量。

kubectl patch sts zk --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/cpu", "value":"0.3"}]'

statefulset.apps/zk patched

使用kubectl rollout status来监控更新的状态。

kubectl rollout status sts/zk

waiting for statefulset rolling update to complete 0 pods at revision zk-5db4499664...
Waiting for 1 pods to be ready...
Waiting for 1 pods to be ready...
waiting for statefulset rolling update to complete 1 pods at revision zk-5db4499664...
Waiting for 1 pods to be ready...
Waiting for 1 pods to be ready...
waiting for statefulset rolling update to complete 2 pods at revision zk-5db4499664...
Waiting for 1 pods to be ready...
Waiting for 1 pods to be ready...
statefulset rolling update complete 3 pods at revision zk-5db4499664...

This terminates the Pods, one at a time, in reverse ordinal order, and recreates them with the new configuration. This ensures that quorum is maintained during a rolling update.
这将按顺序以反向顺序终止Pod,并使用新配置重新创建它们。这可确保在滚动更新期间维护quorum

使用kubectl rollout history命令查看历史记录或以前的配置。

kubectl rollout history sts/zk

statefulsets "zk"
REVISION
1
2

使用kubectl rollout undo命令回滚修改。

kubectl rollout undo sts/zk

statefulset.apps/zk rolled back

处理过程失败

Restart Policies control how Kubernetes handles process failures for the entry point of the container in a Pod. For Pods in a StatefulSet, the only appropriate RestartPolicy is Always, and this is the default value. For stateful applications you should never override the default policy.

Use the following command to examine the process tree for the ZooKeeper server running in the zk-0 Pod.

kubectl exec zk-0 -- ps -ef

The command used as the container’s entry point has PID 1, and the ZooKeeper process, a child of the entry point, has PID 27.

UID        PID  PPID  C STIME TTY          TIME CMD
zookeep+     1     0  0 15:03 ?        00:00:00 sh -c zkGenConfig.sh && zkServer.sh start-foreground
zookeep+    27     1  0 15:03 ?        00:00:03 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Dzookeeper.log.dir=/var/log/zookeeper -Dzookeeper.root.logger=INFO,CONSOLE -cp /usr/bin/../build/classes:/usr/bin/../build/lib/*.jar:/usr/bin/../share/zookeeper/zookeeper-3.4.9.jar:/usr/bin/../share/zookeeper/slf4j-log4j12-1.6.1.jar:/usr/bin/../share/zookeeper/slf4j-api-1.6.1.jar:/usr/bin/../share/zookeeper/netty-3.10.5.Final.jar:/usr/bin/../share/zookeeper/log4j-1.2.16.jar:/usr/bin/../share/zookeeper/jline-0.9.94.jar:/usr/bin/../src/java/lib/*.jar:/usr/bin/../etc/zookeeper: -Xmx2G -Xms2G -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=false org.apache.zookeeper.server.quorum.QuorumPeerMain /usr/bin/../etc/zookeeper/zoo.cfg

In another terminal watch the Pods in the zk StatefulSet with the following command.

kubectl get pod -w -l app=zk

In another terminal, terminate the ZooKeeper process in Pod zk-0 with the following command.

kubectl exec zk-0 -- pkill java

The termination of the ZooKeeper process caused its parent process to terminate. Because the RestartPolicy of the container is Always, it restarted the parent process.

NAME      READY     STATUS    RESTARTS   AGE
zk-0      1/1       Running   0          21m
zk-1      1/1       Running   0          20m
zk-2      1/1       Running   0          19m
NAME      READY     STATUS    RESTARTS   AGE
zk-0      0/1       Error     0          29m
zk-0      0/1       Running   1         29m
zk-0      1/1       Running   1         29m

If your application uses a script (such as zkServer.sh) to launch the process that implements the application’s business logic, the script must terminate with the child process. This ensures that Kubernetes will restart the application’s container when the process implementing the application’s business logic fails.

Testing for Liveness

Configuring your application to restart failed processes is not enough to keep a distributed system healthy. There are scenarios where a system’s processes can be both alive and unresponsive, or otherwise unhealthy. You should use liveness probes to notify Kubernetes that your application’s processes are unhealthy and it should restart them.

The Pod template for the zk StatefulSet specifies a liveness probe. ``
livenessProbe:
exec:
command:

        - sh
        - -c
        - "zookeeper-ready 2181"
      initialDelaySeconds: 15
      timeoutSeconds: 5

The probe calls a bash script that uses the ZooKeeper ruok four letter word to test the server’s health.

OK=$(echo ruok | nc 127.0.0.1 $1)
if [ "$OK" == "imok" ]; then
    exit 0
else
    exit 1
fi

In one terminal window, use the following command to watch the Pods in the zk StatefulSet.

kubectl get pod -w -l app=zk

In another window, using the following command to delete the zkOk.sh script from the file system of Pod zk-0.

kubectl exec zk-0 -- rm /usr/bin/zookeeper-ready

When the liveness probe for the ZooKeeper process fails, Kubernetes will automatically restart the process for you, ensuring that unhealthy processes in the ensemble are restarted.

kubectl get pod -w -l app=zk

NAME      READY     STATUS    RESTARTS   AGE
zk-0      1/1       Running   0          1h
zk-1      1/1       Running   0          1h
zk-2      1/1       Running   0          1h
NAME      READY     STATUS    RESTARTS   AGE
zk-0      0/1       Running   0          1h
zk-0      0/1       Running   1         1h
zk-0      1/1       Running   1         1h
Testing for Readiness

Readiness is not the same as liveness. If a process is alive, it is scheduled and healthy. If a process is ready, it is able to process input. Liveness is a necessary, but not sufficient, condition for readiness. There are cases, particularly during initialization and termination, when a process can be alive but not ready.

If you specify a readiness probe, Kubernetes will ensure that your application’s processes will not receive network traffic until their readiness checks pass.

For a ZooKeeper server, liveness implies readiness. Therefore, the readiness probe from the zookeeper.yaml manifest is identical to the liveness probe.

readinessProbe:
exec:
  command:
  - sh
  - -c
  - "zookeeper-ready 2181"
initialDelaySeconds: 15
timeoutSeconds: 5

Even though the liveness and readiness probes are identical, it is important to specify both. This ensures that only healthy servers in the ZooKeeper ensemble receive network traffic.

Tolerating Node Failure

ZooKeeper needs a quorum of servers to successfully commit mutations to data. For a three server ensemble, two servers must be healthy for writes to succeed. In quorum based systems, members are deployed across failure domains to ensure availability. To avoid an outage, due to the loss of an individual machine, best practices preclude co-locating multiple instances of the application on the same machine.

By default, Kubernetes may co-locate Pods in a StatefulSet on the same node. For the three server ensemble you created, if two servers are on the same node, and that node fails, the clients of your ZooKeeper service will experience an outage until at least one of the Pods can be rescheduled.

You should always provision additional capacity to allow the processes of critical systems to be rescheduled in the event of node failures. If you do so, then the outage will only last until the Kubernetes scheduler reschedules one of the ZooKeeper servers. However, if you want your service to tolerate node failures with no downtime, you should set podAntiAffinity.

Use the command below to get the nodes for Pods in the zk StatefulSet.

for i in 0 1 2; do kubectl get pod zk-$i --template {{.spec.nodeName}}; echo ""; done

All of the Pods in the zk StatefulSet are deployed on different nodes.

kubernetes-minion-group-cxpk
kubernetes-minion-group-a5aq
kubernetes-minion-group-2g2d

This is because the Pods in the zk StatefulSet have a PodAntiAffinity specified.

  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: "app"
                operator: In
                values:
                - zk
          topologyKey: "kubernetes.io/hostname"

The requiredDuringSchedulingIgnoredDuringExecution field tells the Kubernetes Scheduler that it should never co-locate two Pods which have app label as zk in the domain defined by the topologyKey. The topologyKey kubernetes.io/hostname indicates that the domain is an individual node. Using different rules, labels, and selectors, you can extend this technique to spread your ensemble across physical, network, and power failure domains.

Surviving Maintenance

In this section you will cordon and drain nodes. If you are using this tutorial on a shared cluster, be sure that this will not adversely affect other tenants.

The previous section showed you how to spread your Pods across nodes to survive unplanned node failures, but you also need to plan for temporary node failures that occur due to planned maintenance.

Use this command to get the nodes in your cluster.

kubectl get nodes

Use kubectl cordon to cordon all but four of the nodes in your cluster.

kubectl cordon <node-name>

Use this command to get the zk-pdb PodDisruptionBudget.

kubectl get pdb zk-pdb

The max-unavailable field indicates to Kubernetes that at most one Pod from zk StatefulSet can be unavailable at any time.

NAME      MIN-AVAILABLE   MAX-UNAVAILABLE   ALLOWED-DISRUPTIONS   AGE
zk-pdb    N/A             1                 1

In one terminal, use this command to watch the Pods in the zk StatefulSet.

kubectl get pods -w -l app=zk

In another terminal, use this command to get the nodes that the Pods are currently scheduled on.

for i in 0 1 2; do kubectl get pod zk-$i --template {{.spec.nodeName}}; echo ""; done

kubernetes-minion-group-pb41
kubernetes-minion-group-ixsl
kubernetes-minion-group-i4c4

Use kubectl drain to cordon and drain the node on which the zk-0 Pod is scheduled.

kubectl drain $(kubectl get pod zk-0 --template {{.spec.nodeName}}) --ignore-daemonsets --force --delete-local-data
node "kubernetes-minion-group-pb41" cordoned

WARNING: Deleting pods not managed by ReplicationController, ReplicaSet, Job, or DaemonSet: fluentd-cloud-logging-kubernetes-minion-group-pb41, kube-proxy-kubernetes-minion-group-pb41; Ignoring DaemonSet-managed pods: node-problem-detector-v0.1-o5elz
pod "zk-0" deleted
node "kubernetes-minion-group-pb41" drained

As there are four nodes in your cluster, kubectl drain, succeeds and the zk-0 is rescheduled to another node.

NAME      READY     STATUS    RESTARTS   AGE
zk-0      1/1       Running   2          1h
zk-1      1/1       Running   0          1h
zk-2      1/1       Running   0          1h
NAME      READY     STATUS        RESTARTS   AGE
zk-0      1/1       Terminating   2          2h
zk-0      0/1       Terminating   2         2h
zk-0      0/1       Terminating   2         2h
zk-0      0/1       Terminating   2         2h
zk-0      0/1       Pending   0         0s
zk-0      0/1       Pending   0         0s
zk-0      0/1       ContainerCreating   0         0s
zk-0      0/1       Running   0         51s
zk-0      1/1       Running   0         1m

Keep watching the StatefulSet’s Pods in the first terminal and drain the node on which zk-1 is scheduled.

kubectl drain $(kubectl get pod zk-1 --template {{.spec.nodeName}}) --ignore-daemonsets --force --delete-local-data "kubernetes-minion-group-ixsl" cordoned

WARNING: Deleting pods not managed by ReplicationController, ReplicaSet, Job, or DaemonSet: fluentd-cloud-logging-kubernetes-minion-group-ixsl, kube-proxy-kubernetes-minion-group-ixsl; Ignoring DaemonSet-managed pods: node-problem-detector-v0.1-voc74
pod "zk-1" deleted
node "kubernetes-minion-group-ixsl" drained

The zk-1 Pod cannot be scheduled because the zk StatefulSet contains a PodAntiAffinity rule preventing co-location of the Pods, and as only two nodes are schedulable, the Pod will remain in a Pending state.

kubectl get pods -w -l app=zk

NAME      READY     STATUS    RESTARTS   AGE
zk-0      1/1       Running   2          1h
zk-1      1/1       Running   0          1h
zk-2      1/1       Running   0          1h
NAME      READY     STATUS        RESTARTS   AGE
zk-0      1/1       Terminating   2          2h
zk-0      0/1       Terminating   2         2h
zk-0      0/1       Terminating   2         2h
zk-0      0/1       Terminating   2         2h
zk-0      0/1       Pending   0         0s
zk-0      0/1       Pending   0         0s
zk-0      0/1       ContainerCreating   0         0s
zk-0      0/1       Running   0         51s
zk-0      1/1       Running   0         1m
zk-1      1/1       Terminating   0         2h
zk-1      0/1       Terminating   0         2h
zk-1      0/1       Terminating   0         2h
zk-1      0/1       Terminating   0         2h
zk-1      0/1       Pending   0         0s
zk-1      0/1       Pending   0         0s

Continue to watch the Pods of the stateful set, and drain the node on which zk-2 is scheduled.

kubectl drain $(kubectl get pod zk-2 --template {{.spec.nodeName}}) --ignore-daemonsets --force --delete-local-data
node "kubernetes-minion-group-i4c4" cordoned

WARNING: Deleting pods not managed by ReplicationController, ReplicaSet, Job, or DaemonSet: fluentd-cloud-logging-kubernetes-minion-group-i4c4, kube-proxy-kubernetes-minion-group-i4c4; Ignoring DaemonSet-managed pods: node-problem-detector-v0.1-dyrog
WARNING: Ignoring DaemonSet-managed pods: node-problem-detector-v0.1-dyrog; Deleting pods not managed by ReplicationController, ReplicaSet, Job, or DaemonSet: fluentd-cloud-logging-kubernetes-minion-group-i4c4, kube-proxy-kubernetes-minion-group-i4c4
There are pending pods when an error occurred: Cannot evict pod as it would violate the pod's disruption budget.
pod/zk-2

Use CTRL-C to terminate to kubectl.

You cannot drain the third node because evicting zk-2 would violate zk-budget. However, the node will remain cordoned.

Use zkCli.sh to retrieve the value you entered during the sanity test from zk-0.

kubectl exec zk-0 zkCli.sh get /hello

The service is still available because its PodDisruptionBudget is respected.

WatchedEvent state:SyncConnected type:None path:null
world
cZxid = 0x200000002
ctime = Wed Dec 07 00:08:59 UTC 2016
mZxid = 0x200000002
mtime = Wed Dec 07 00:08:59 UTC 2016
pZxid = 0x200000002
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 5
numChildren = 0

Use kubectl uncordon to uncordon the first node.

kubectl uncordon kubernetes-minion-group-pb41

node "kubernetes-minion-group-pb41" uncordoned

zk-1 is rescheduled on this node. Wait until zk-1 is Running and Ready.

kubectl get pods -w -l app=zk

NAME      READY     STATUS    RESTARTS   AGE
zk-0      1/1       Running   2          1h
zk-1      1/1       Running   0          1h
zk-2      1/1       Running   0          1h
NAME      READY     STATUS        RESTARTS   AGE
zk-0      1/1       Terminating   2          2h
zk-0      0/1       Terminating   2         2h
zk-0      0/1       Terminating   2         2h
zk-0      0/1       Terminating   2         2h
zk-0      0/1       Pending   0         0s
zk-0      0/1       Pending   0         0s
zk-0      0/1       ContainerCreating   0         0s
zk-0      0/1       Running   0         51s
zk-0      1/1       Running   0         1m
zk-1      1/1       Terminating   0         2h
zk-1      0/1       Terminating   0         2h
zk-1      0/1       Terminating   0         2h
zk-1      0/1       Terminating   0         2h
zk-1      0/1       Pending   0         0s
zk-1      0/1       Pending   0         0s
zk-1      0/1       Pending   0         12m
zk-1      0/1       ContainerCreating   0         12m
zk-1      0/1       Running   0         13m
zk-1      1/1       Running   0         13m

Attempt to drain the node on which zk-2 is scheduled.

kubectl drain $(kubectl get pod zk-2 --template {{.spec.nodeName}}) --ignore-daemonsets --force --delete-local-data

The output:

node "kubernetes-minion-group-i4c4" already cordoned
WARNING: Deleting pods not managed by ReplicationController, ReplicaSet, Job, or DaemonSet: fluentd-cloud-logging-kubernetes-minion-group-i4c4, kube-proxy-kubernetes-minion-group-i4c4; Ignoring DaemonSet-managed pods: node-problem-detector-v0.1-dyrog
pod "heapster-v1.2.0-2604621511-wht1r" deleted
pod "zk-2" deleted
node "kubernetes-minion-group-i4c4" drained

This time kubectl drain succeeds.

Uncordon the second node to allow zk-2 to be rescheduled.

kubectl uncordon kubernetes-minion-group-ixsl

node "kubernetes-minion-group-ixsl" uncordoned

You can use kubectl drain in conjunction with PodDisruptionBudgets to ensure that your services remain available during maintenance. If drain is used to cordon nodes and evict pods prior to taking the node offline for maintenance, services that express a disruption budget will have that budget respected. You should always allocate additional capacity for critical services so that their Pods can be immediately rescheduled.

Cleaning up

Use kubectl uncordon to uncordon all the nodes in your cluster.
You will need to delete the persistent storage media for the PersistentVolumes used in this tutorial. Follow the necessary steps, based on your environment, storage configuration, and provisioning method, to ensure that all storage is reclaimed.






发表于: 1月前   最后更新时间: 1月前   游览量:146
上一条: kubernetes的pod eviction
下一条: Kubernetes之ConfigMap

评论…


  • 评论…
    • in this conversation