When Kafka is rebalancing (reassigning) partitions, with 15 partitions and 160 GB of data per partition, roughly how long will the process take?

lq0317。 Posted: 2020-08-11   Last updated: 2020-08-11 23:08:01   2,054 views

It has been running for a whole day now and still hasn't finished.

半兽人 -> lq0317。 4 years ago

A guy here once took 3 days...
The gap between how fast new messages keep pouring in and how fast the migration copies data is what determines how long your migration takes.
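
As a rough back-of-envelope (every number below is an assumption; measure your own inbound rate and spare replication bandwidth):

# data to copy ≈ partitions × partition size × new replicas to populate
#             ≈ 15 × 160 GB × 2 ≈ 4.8 TB
# at an assumed ~50 MB/s of spare replication bandwidth across the cluster:
echo "4.8*1024*1024/50/3600" | bc -l   # ≈ 28 hours, before adding the backlog from new traffic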

lq0317。 -> 半兽人 4 years ago

Could it break off partway through? It's been 2 days for me and it still isn't done. Is there any way to tell how long it will take?

lq0317。 -> 半兽人 4 years ago
[root@prd-kafka-01 opt]# /usr/hdp/2.6.4.0-91/kafka/bin/kafka-topics.sh --describe --zookeeper 172.19.38.217:2181 --topic ods_be_monitor_item_detail
Topic:ods_be_monitor_item_detail        PartitionCount:15       ReplicationFactor:3     Configs:retention.ms=172800000
        Topic: ods_be_monitor_item_detail       Partition: 0    Leader: 1006    Replicas: 1006,1005,1008        Isr: 1006,1005,1008
        Topic: ods_be_monitor_item_detail       Partition: 1    Leader: 1008    Replicas: 1008,1006,1009,1005   Isr: 1008,1006,1005
        Topic: ods_be_monitor_item_detail       Partition: 2    Leader: 1005    Replicas: 1005,1006,1008,1010,1009      Isr: 1005,1008,1006
        Topic: ods_be_monitor_item_detail       Partition: 3    Leader: 1006    Replicas: 1005,1006,1008,1010,1009      Isr: 1006,1008,1005
        Topic: ods_be_monitor_item_detail       Partition: 4    Leader: 1008    Replicas: 1005,1010,1006,1008   Isr: 1008,1006,1005
        Topic: ods_be_monitor_item_detail       Partition: 5    Leader: 1005    Replicas: 1006,1008,1009,1005   Isr: 1005,1008,1006
        Topic: ods_be_monitor_item_detail       Partition: 6    Leader: 1006    Replicas: 1005,1006,1008,1010,1009      Isr: 1006,1008,1005
        Topic: ods_be_monitor_item_detail       Partition: 7    Leader: 1008    Replicas: 1005,1006,1008,1010,1009      Isr: 1008,1006,1005
        Topic: ods_be_monitor_item_detail       Partition: 8    Leader: 1005    Replicas: 1010,1005,1006,1008   Isr: 1005,1008,1006
        Topic: ods_be_monitor_item_detail       Partition: 9    Leader: 1006    Replicas: 1005,1006,1008        Isr: 1006,1005,1008
        Topic: ods_be_monitor_item_detail       Partition: 10   Leader: 1008    Replicas: 1005,1006,1008,1010,1009      Isr: 1008,1006,1005
        Topic: ods_be_monitor_item_detail       Partition: 11   Leader: 1005    Replicas: 1008,1010,1005,1006   Isr: 1005,1008,1006
        Topic: ods_be_monitor_item_detail       Partition: 12   Leader: 1006    Replicas: 1009,1005,1006,1008   Isr: 1006,1008,1005
        Topic: ods_be_monitor_item_detail       Partition: 13   Leader: 1008    Replicas: 1010,1006,1008,1005   Isr: 1008,1006,1005
        Topic: ods_be_monitor_item_detail       Partition: 14   Leader: 1005    Replicas: 1005,1008,1009,1006   Isr: 1005,1008,1006
Is this normal?
lq0317。 -> 半兽人 4 years ago
[root@prd-kafka-01 opt]# /usr/hdp/2.6.4.0-91/kafka/bin/kafka-reassign-partitions.sh --zookeeper 172.19.38.217:2181 --reassignment-json-file expand-cluster-ods-be-reassignment.json --verify
Status of partition reassignment: 
Reassignment of partition [ods_be_monitor_item_detail,8] is still in progress
Reassignment of partition [ods_be_monitor_item_detail,9] completed successfully
Reassignment of partition [ods_be_monitor_item_detail,6] is still in progress
Reassignment of partition [ods_be_monitor_item_detail,14] is still in progress
Reassignment of partition [ods_be_monitor_item_detail,5] is still in progress
Reassignment of partition [ods_be_monitor_item_detail,11] is still in progress
Reassignment of partition [ods_be_monitor_item_detail,13] is still in progress
Reassignment of partition [ods_be_monitor_item_detail,3] is still in progress
Reassignment of partition [ods_be_monitor_item_detail,2] is still in progress
Reassignment of partition [ods_be_monitor_item_detail,4] is still in progress
Reassignment of partition [ods_be_monitor_item_detail,1] is still in progress
Reassignment of partition [ods_be_monitor_item_detail,10] is still in progress
Reassignment of partition [ods_be_monitor_item_detail,12] is still in progress
Reassignment of partition [ods_be_monitor_item_detail,0] completed successfully
Reassignment of partition [ods_be_monitor_item_detail,7] is still in progress
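
Re-running that same --verify command periodically and counting the lines still in progress gives a crude progress number (a sketch reusing the command above):

/usr/hdp/2.6.4.0-91/kafka/bin/kafka-reassign-partitions.sh --zookeeper 172.19.38.217:2181 --reassignment-json-file expand-cluster-ods-be-reassignment.json --verify | grep -c 'in progress'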

Why are you rebalancing at all? Is there some problem with your cluster? With 15 partitions of 160 GB each, the rebalance costs enormous time and resources, and it must be seriously hurting Kafka's throughput and efficiency.

We scaled out by adding 2 nodes, so the data has to be rebalanced.

半兽人 -> lq0317。 4 years ago

It won't. Keep an eye on the offsets the partition sync has reached. And with 3 replicas, this data volume really is huge.

lq0317。 -> 半兽人 4 years ago

OK, I'll keep watching it for another day or two. Thanks a lot.

lq0317。 -> 半兽人 4 years ago

Hi, one more question: how exactly do I observe the offsets of the partition sync?

lq0317。 -> 半兽人 4 years ago

Mine has been going for almost 5 days now and still isn't done. I'm worried something has gone wrong.

lq0317。 -> 半兽人 4 years ago

Now every write fails with NotLeaderForPartitionError.

lq0317。 -> 半兽人 4 years ago
[2020-08-14 17:39:01,727] INFO [KafkaApi-1009] Closing connection due to error during produce request with correlation id 572 from client id producer-1 with ack=0
Topic and partition to exceptions: pshop_sell_status_topic-12 -> org.apache.kafka.common.errors.NotLeaderForPartitionException (kafka.server.KafkaApis)
[2020-08-14 17:39:01,877] INFO [KafkaApi-1009] Closing connection due to error during produce request with correlation id 578 from client id producer-1 with ack=0
Topic and partition to exceptions: pshop_sell_status_topic-2 -> org.apache.kafka.common.errors.NotLeaderForPartitionException (kafka.server.KafkaApis)
[2020-08-14 17:39:01,928] INFO [KafkaApi-1009] Closing connection due to error during produce request with correlation id 583 from client id producer-1 with ack=0
Topic and partition to exceptions: pshop_sell_status_topic-7 -> org.apache.kafka.common.errors.NotLeaderForPartitionException (kafka.server.KafkaApis)
[2020-08-14 17:39:02,079] INFO [KafkaApi-1009] Closing connection due to error during produce request with correlation id 589 from client id producer-1 with ack=0
Topic and partition to exceptions: pshop_sell_status_topic-12 -> org.apache.kafka.common.errors.NotLeaderForPartitionException (kafka.server.KafkaApis)
[2020-08-14 17:39:02,129] INFO [KafkaApi-1009] Closing connection due to error during produce request with correlation id 594 from client id producer-1 with ack=0
Topic and partition to exceptions: pshop_sell_status_topic-2 -> org.apache.kafka.common.errors.NotLeaderForPartitionException (kafka.server.KafkaApis)
[2020-08-14 17:40:06,193] INFO [KafkaApi-1009] Closing connection due to error during produce request with correlation id 601 from client id producer-1 with ack=0
Topic and partition to exceptions: pshop_sell_status_topic-7 -> org.apache.kafka.common.errors.NotLeaderForPartitionException (kafka.server.KafkaApis)
[2020-08-14 17:40:06,295] INFO [KafkaApi-1009] Closing connection due to error during produce request with correlation id 608 from client id producer-1 with ack=0
Topic and partition to exceptions: pshop_sell_status_topic-12 -> org.apache.kafka.common.errors.NotLeaderForPartitionException (kafka.server.KafkaApis)
半兽人 -> lq0317。 4 years ago

Those are INFO-level logs, you can ignore them.
You can go onto the machines and look at the physical files to see what offset the sync has reached (typing from my phone).
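
Concretely, that check can look something like this (a sketch: the log directory path is an assumption, take the real one from log.dirs in server.properties; the broker address is the one from the logs later in this thread):

LOG_DIR=/path/to/kafka-logs
# how much of each moving partition has landed on this broker so far:
du -sh ${LOG_DIR}/ods_be_monitor_item_detail-*
# highest replicated offset (high watermark) recorded per partition:
grep ods_be_monitor_item_detail ${LOG_DIR}/replication-offset-checkpoint
# compare against the leader's log-end offset (GetOffsetShell ships with 0.10.x):
/usr/hdp/2.6.4.0-91/kafka/bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list kafka1.sh-internal.com:6667 --topic ods_be_monitor_item_detail --time -1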

lq0317。 -> 半兽人 4 years ago
[2020-08-14 18:21:58,656] ERROR [KafkaApi-1009] Error when handling request {controller_id=1005,controller_epoch=22,partition_states=[{topic=monitor_shop_selltime_status_v3,partition=2,controller_epoch=22,leader=1005,leader_epoch=2,isr=[1005,1006,1008],zk_version=13,replicas=[1005,1006,1008,1010,1009]}],live_leaders=[{id=1005,host=kafka1.sh-internal.com,port=6667}]} (kafka.server.KafkaApis)
java.io.IOException: Malformed line in offset checkpoint file: pshop sell status topic 7 0'
        at kafka.server.OffsetCheckpoint.malformedLineException$1(OffsetCheckpoint.scala:81)
        at kafka.server.OffsetCheckpoint.liftedTree2$1(OffsetCheckpoint.scala:104)
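
The checkpoint parser splits each line on whitespace and expects exactly three fields (topic, partition, offset), so a topic name containing spaces can never parse. The file layout is roughly as follows (the offsets below are made up for illustration):

0                                        <-- format version
2                                        <-- number of entries
ods_be_monitor_item_detail 0 12345678    <-- topic partition offset: 3 tokens, fine
pshop sell status topic 7 0              <-- 6 tokens: "Malformed line"
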
lq0317。 -> 半兽人 4 years ago

Now I'm getting this error. I deleted the earlier partition reassignment task, but data still can't be written.

半兽人 -> lq0317。 4 years ago

Did you touch the migration?

lq0317。 -> 半兽人 4 years ago

No, but someone created a topic like this for me:
pshop sell status topic, with spaces in the middle

lq0317。 -> 半兽人 4 years ago

The recovery-point-offset-checkpoint and replication-offset-checkpoint files keep coming back with this "pshop sell status topic" entry. Deleting the files and restarting doesn't help either. How do I fix this now? It's production and it's urgent, please reply.

半兽人 -> lq0317。 4 years ago

It's caused by that format problem. How did a topic with spaces in its name even get created successfully?
What Kafka version is this? I'm not sure whether your offsets are stored in ZK or in Kafka's own __consumer_offsets.
You need to delete it from wherever it lives.
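
Before deleting anything, it is worth confirming where the bad topic actually lives (a sketch; the ZK address is the one used earlier in this thread, the log directory path is an assumption):

# list the topics registered in ZooKeeper:
/usr/hdp/2.6.4.0-91/kafka/bin/zookeeper-shell.sh 172.19.38.217:2181 ls /brokers/topics
# on each broker, look for partition directories of the space-named topic;
# as long as these exist, the broker keeps rewriting the checkpoint entry on restart:
ls -d /path/to/kafka-logs/pshop\ sell\ status\ topic-* 2>/dev/null

Note that zookeeper-shell splits its command on whitespace, so a znode path containing spaces cannot be addressed from it directly; removing such a node would take a programmatic ZooKeeper client.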

lq0317。 -> 半兽人 4 years ago

[root@prd-kafka-01 kafka]# find ./libs/ -name *kafka_* | head -1 | grep -o '\kafka[^\n]*'
kafka_2.11-0.10.1.2.6.4.0-91.jar

lq0317。 -> 半兽人 4 years ago

What should I do now? I've tried a bunch of approaches and none of them fix it.

lq0317。 -> lq0317。 4 years ago

Could you give me a way to contact you?

半兽人 -> lq0317。 4 years ago

Here's my suggestion. This is your production Kafka, and you need to manually clean out the problem topic, but doing that in production is a fairly high-risk operation, and on top of it you're still mid-migration.
1. Stand up a new Kafka cluster and route the business traffic to it.
2. Once the traffic has moved over, you can repair the old cluster with peace of mind.

lq0317。 -> 半兽人 4 years ago

I've already stopped the data migration in ZK.
There were five nodes before; one of them still has problems now. It keeps recovering data and its port never starts listening.

lq0317。 -> 半兽人 4 years ago
[2020-08-15 00:17:37,087] INFO Recovering unflushed segment 307489432 in log raw_shop_business_detail-2. (kafka.log.Log)
[2020-08-15 00:17:37,709] INFO Recovering unflushed segment 76574100245 in log ods_eleme_monitor_item_detail-7. (kafka.log.Log)
[2020-08-15 00:17:39,330] INFO Recovering unflushed segment 76634617423 in log ods_eleme_monitor_item_detail-11. (kafka.log.Log)
[2020-08-15 00:17:39,730] INFO Recovering unflushed segment 76885515852 in log ods_eleme_monitor_item_detail-0. (kafka.log.Log)
[2020-08-15 00:17:44,113] INFO Recovering unflushed segment 307582694 in log raw_shop_business_detail-2. (kafka.log.Log)
[2020-08-15 00:17:46,124] INFO Recovering unflushed segment 76635155355 in log ods_eleme_monitor_item_detail-11. (kafka.log.Log)
[2020-08-15 00:17:48,664] INFO Recovering unflushed segment 76574648701 in log ods_eleme_monitor_item_detail-7. (kafka.log.Log)
[2020-08-15 00:17:48,928] INFO Recovering unflushed segment 76886065583 in log ods_eleme_monitor_item_detail-0. (kafka.log.Log)
[2020-08-15 00:17:50,801] INFO Recovering unflushed segment 307675968 in log raw_shop_business_detail-2. (kafka.log.Log)
[2020-08-15 00:17:52,978] INFO Recovering unflushed segment 76635693417 in log ods_eleme_monitor_item_detail-11. (kafka.log.Log)
[2020-08-15 00:17:57,688] INFO Recovering unflushed segment 76886609961 in log ods_eleme_monitor_item_detail-0. (kafka.log.Log)
[2020-08-15 00:17:57,825] INFO Recovering unflushed segment 307770638 in log raw_shop_business_detail-2. (kafka.log.Log)
[2020-08-15 00:17:59,792] INFO Recovering unflushed segment 76636232706 in log ods_eleme_monitor_item_detail-11. (kafka.log.Log)
[2020-08-15 00:17:59,811] INFO Recovering unflushed segment 76575198496 in log ods_eleme_monitor_item_detail-7. (kafka.log.Log)
半兽人 -> lq0317。 4 years ago

Point the problem node's log.dir at a new directory (you can keep the old data on disk) and let that broker re-sync its data from scratch.
If all your topics have a replication factor greater than 1, you can afford to be brutal about it.
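
Roughly like this (a sketch; the new directory and the config path are assumptions, under HDP the config usually lives in /etc/kafka/conf):

# in server.properties on the broken broker, point log.dirs at an empty directory:
#   log.dirs=/data/kafka-logs-new      # old data stays on disk under the old path
/usr/hdp/2.6.4.0-91/kafka/bin/kafka-server-stop.sh
/usr/hdp/2.6.4.0-91/kafka/bin/kafka-server-start.sh -daemon /etc/kafka/conf/server.properties
# the broker rejoins with empty logs and re-replicates everything it hosts from the current leaders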

lq0317。 -> 半兽人 4 years ago

We ended up rebuilding a new cluster.
