Kafka Cluster: Asynchronous Communication (2016-11-10)

It has been a long time since I last wrote anything. I recently went back through the asynchronous communication logic of the new Kafka client and felt it was worth taking notes. In what follows, fetching a partition's offset is used as the example to show how the client obtains the offset from the server. Start with the entry method:

private long listOffset(TopicPartition partition, long timestamp) {
    while (true) {
        // Build the ListOffset request for the Kafka cluster. Nothing is sent over the wire yet;
        // the request is cached in a queue, and the returned RequestFuture is what makes the call asynchronous.
        RequestFuture<Long> future = sendListOffsetRequest(partition, timestamp);
        // This is the key step: it both sends the queued request and collects the returned data.
        client.poll(future);

        // The request was sent and a response was received successfully
        if (future.succeeded())
            return future.value();

        if (!future.isRetriable())
            throw future.exception();

        if (future.exception() instanceof InvalidMetadataException)
            client.awaitMetadataUpdate();
        else
            time.sleep(retryBackoffMs);
    }
}

/**
 * Does not actually send the request; it only assembles the request object and caches it.
 */
private RequestFuture<Long> sendListOffsetRequest(final TopicPartition topicPartition, long timestamp) {
    Map<TopicPartition, ListOffsetRequest.PartitionData> partitions = new HashMap<>(1);
    partitions.put(topicPartition, new ListOffsetRequest.PartitionData(timestamp, 1));
    PartitionInfo info = metadata.fetch().partition(topicPartition);
    if (info == null) {
        metadata.add(topicPartition.topic());
        log.debug("Partition {} is unknown for fetching offset, wait for metadata refresh", topicPartition);
        return RequestFuture.staleMetadata();
    } else if (info.leader() == null) {
        log.debug("Leader for partition {} unavailable for fetching offset, wait for metadata refresh", topicPartition);
        return RequestFuture.leaderNotAvailable();
    } else {
        // Find the partition's leader; the leader receives the request and returns the corresponding data to the client
        Node node = info.leader();
        ListOffsetRequest request = new ListOffsetRequest(-1, partitions);
        return client.send(node, ApiKeys.LIST_OFFSETS, request)
                .compose(new RequestFutureAdapter<ClientResponse, Long>() {
                    @Override
                    public void onSuccess(ClientResponse response, RequestFuture<Long> future) {
                        handleListOffsetResponse(topicPartition, response, future);
                    }
                });
    }
}

Now look at the send method itself:

/**
 * Send a new request. Note that the request is not actually transmitted on the
 * network until one of the {@link #poll(long)} variants is invoked. At this
 * point the request will either be transmitted successfully or will fail.
 * Use the returned future to obtain the result of the send. Note that there is no
 * need to check for disconnects explicitly on the {@link ClientResponse} object;
 * instead, the future will be failed with a {@link DisconnectException}.
 * @param node The destination of the request
 * @param api The Kafka API call
 * @param request The request payload
 * @return A future which indicates the result of the send.
 */
// As the Javadoc says, the request is not sent right away; the unsent requests
// queued in the list are only processed on the next poll.
public RequestFuture<ClientResponse> send(Node node,
                                          ApiKeys api,
                                          AbstractRequest request) {
    return send(node, api, ProtoUtils.latestVersion(api.id), request);
}

private RequestFuture<ClientResponse> send(Node node,
                                          ApiKeys api,
                                          short version,
                                          AbstractRequest request) {
    long now = time.milliseconds();
    // Create the completion handler that will be invoked asynchronously when the request finishes
    RequestFutureCompletionHandler completionHandler = new RequestFutureCompletionHandler();
    RequestHeader header = client.nextRequestHeader(api, version);
    RequestSend send = new RequestSend(node.idString(), header, request.toStruct());
    // Put the request into the unsent list
    put(node, new ClientRequest(now, true, send, completionHandler));

    // wakeup the client in case it is blocking in poll so that we can send the queued request
    // Wake up any I/O operation blocked in poll on another thread, so it can return and handle the newly queued request
    client.wakeup();
    return completionHandler.future;
}

Next, the compose method from the call chain:

/**
 * Convert from a request future of one type to another type
 * @param adapter The adapter which does the conversion
 * @param <S> The type of the future adapted to
 * @return The new future
 */
public <S> RequestFuture<S> compose(final RequestFutureAdapter<T, S> adapter) {
    // Create a new RequestFuture that only holds the converted result
    final RequestFuture<S> adapted = new RequestFuture<>();
    addListener(new RequestFutureListener<T>() {
        @Override
        public void onSuccess(T value) {
            adapter.onSuccess(value, adapted);
        }

        @Override
        public void onFailure(RuntimeException e) {
            adapter.onFailure(e, adapted);
        }
    });
    // Return the newly created future
    return adapted;
}

The purpose of this method is to convert a future of one type into a future of another type; here a ClientResponse is converted into a Long. The actual conversion is done by the adapter's onSuccess method, shown below:

private void handleListOffsetResponse(TopicPartition topicPartition,
                                      ClientResponse clientResponse,
                                      RequestFuture<Long> future) {
    ListOffsetResponse lor = new ListOffsetResponse(clientResponse.responseBody());
    short errorCode = lor.responseData().get(topicPartition).errorCode;
    if (errorCode == Errors.NONE.code()) {
        List<Long> offsets = lor.responseData().get(topicPartition).offsets;
        if (offsets.size() != 1)
            throw new IllegalStateException("This should not happen.");
        long offset = offsets.get(0);
        log.debug("Fetched offset {} for partition {}", offset, topicPartition);
        // Hand the result to the RequestFuture returned by compose
        future.complete(offset);
    } else if (errorCode == Errors.NOT_LEADER_FOR_PARTITION.code()
            || errorCode == Errors.UNKNOWN_TOPIC_OR_PARTITION.code()) {
        log.debug("Attempt to fetch offsets for partition {} failed due to obsolete leadership information, retrying.",
                topicPartition);
        future.raise(Errors.forCode(errorCode));
    } else {
        log.warn("Attempt to fetch offsets for partition {} failed due to: {}",
                topicPartition, Errors.forCode(errorCode).message());
        future.raise(new StaleMetadataException());
    }
}

With the request prepared, the client now actually sends it over the socket and receives the data returned by the server:

/**
 * Block indefinitely until the given request future has finished.
 * @param future The request future to await.
 * @throws WakeupException if {@link #wakeup()} is called from another thread
 */
public void poll(RequestFuture<?> future) {
    // Keep polling until the future is done and its result is available
    while (!future.isDone())
        poll(Long.MAX_VALUE, time.milliseconds(), future);
}

/**
 * Block until the provided request future request has finished or the timeout has expired.
 * @param future The request future to wait for
 * @param timeout The maximum duration (in ms) to wait for the request
 * @return true if the future is done, false otherwise
 * @throws WakeupException if {@link #wakeup()} is called from another thread
 */
public boolean poll(RequestFuture<?> future, long timeout) {
    long begin = time.milliseconds();
    long remaining = timeout;
    long now = begin;
    do {
        poll(remaining, now, future);
        now = time.milliseconds();
        long elapsed = now - begin;
        remaining = timeout - elapsed;
    } while (!future.isDone() && remaining > 0);
    return future.isDone();
}

public void poll(long timeout, long now, PollCondition pollCondition) {
    // there may be handlers which need to be invoked if we woke up the previous call to poll
    // Handle requests that already completed, before doing any sending
    firePendingCompletedRequests();

    synchronized (this) {
        // send all the requests we can send now
        trySend(now);

        // check whether the poll is still needed by the caller. Note that if the expected completion
        // condition becomes satisfied after the call to shouldBlock() (because of a fired completion
        // handler), the client will be woken up.
        if (pollCondition == null || pollCondition.shouldBlock()) {
            client.poll(timeout, now);
            now = time.milliseconds();
        } else {
            client.poll(0, now);
        }

        // handle any disconnects by failing the active requests. note that disconnects must
        // be checked immediately following poll since any subsequent call to client.ready()
        // will reset the disconnect status
        checkDisconnects(now);

        // trigger wakeups after checking for disconnects so that the callbacks will be ready
        // to be fired on the next call to poll()
        maybeTriggerWakeup();

        // try again to send requests since buffer space may have been
        // cleared or a connect finished in the poll
        trySend(now);

        // fail requests that couldn't be sent if they have expired
        failExpiredRequests(now);
    }

    // called without the lock to avoid deadlock potential if handlers need to acquire locks
    firePendingCompletedRequests();
}

private void firePendingCompletedRequests() {
    boolean completedRequestsFired = false;
    for (;;) {
        // Process each pending completion handler
        RequestFutureCompletionHandler completionHandler = pendingCompletion.poll();
        if (completionHandler == null)
            break;

        completionHandler.fireCompletion();
        completedRequestsFired = true;
    }

    // wakeup the client in case it is blocking in poll for this future's completion
    if (completedRequestsFired)
        client.wakeup();
}

The trySend(now) step is implemented by the following method. It walks the list of queued unsent requests and, for each node, sends a request only when client.ready reports the node as ready; while a node still has an unfinished in-flight request, no new request is sent to it:

private boolean trySend(long now) {
    // send any requests that can be sent now
    boolean requestsSent = false;
    // Iterate over the unsent request list and send what we can
    for (Map.Entry<Node, List<ClientRequest>> requestEntry: unsent.entrySet()) {
        Node node = requestEntry.getKey();
        Iterator<ClientRequest> iterator = requestEntry.getValue().iterator();
        while (iterator.hasNext()) {
            ClientRequest request = iterator.next();
            if (client.ready(node, now)) {
                client.send(request, now);
                iterator.remove();
                requestsSent = true;
            }
        }
    }
    return requestsSent;
}

public void send(ClientRequest request, long now) {
    String nodeId = request.request().destination();
    if (!canSendRequest(nodeId))
        throw new IllegalStateException("Attempt to send a request to node " + nodeId + " which is not ready.");
    doSend(request, now);
}

private void doSend(ClientRequest request, long now) {
    request.setSendTimeMs(now);
    // Track the request as in-flight while it is being sent
    this.inFlightRequests.add(request);
    selector.send(request.request());
}

public void send(Send send) {
    KafkaChannel channel = channelOrFail(send.destination());
    try {
        channel.setSend(send);
    } catch (CancelledKeyException e) {
        this.failedSends.add(send.destination());
        close(channel);
    }
}

public void setSend(Send send) {
    if (this.send != null)
        throw new IllegalStateException("Attempt to begin a send operation with prior send operation still in progress.");
    this.send = send;
    this.transportLayer.addInterestOps(SelectionKey.OP_WRITE);
}

Once the request has been staged on the channel, the actual sending and receiving of data happens here:

public List<ClientResponse> poll(long timeout, long now) {
    long metadataTimeout = metadataUpdater.maybeUpdate(now);
    try {
        // Perform the actual socket sends and receives
        this.selector.poll(Utils.min(timeout, metadataTimeout, requestTimeoutMs));
    } catch (IOException e) {
        log.error("Unexpected error during I/O", e);
    }

    // process completed actions
    long updatedNow = this.time.milliseconds();
    List<ClientResponse> responses = new ArrayList<>();
    handleCompletedSends(responses, updatedNow);
    handleCompletedReceives(responses, updatedNow);
    handleDisconnections(responses, updatedNow);
    handleConnections();
    handleTimedOutRequests(responses, updatedNow);

    // invoke callbacks
    for (ClientResponse response : responses) {
        if (response.request().hasCallback()) {
            try {
                response.request().callback().onComplete(response);
            } catch (Exception e) {
                log.error("Uncaught error in request completion:", e);
            }
        }
    }

    return responses;
}

public void poll(long timeout) throws IOException {
    if (timeout < 0)
        throw new IllegalArgumentException("timeout should be >= 0");

    // Clear results buffered by the previous I/O pass
    clear();

    if (hasStagedReceives() || !immediatelyConnectedKeys.isEmpty())
        timeout = 0;

    /* check ready keys */
    long startSelect = time.nanoseconds();
    int readyKeys = select(timeout);
    long endSelect = time.nanoseconds();
    this.sensors.selectTime.record(endSelect - startSelect, time.milliseconds());

    if (readyKeys > 0 || !immediatelyConnectedKeys.isEmpty()) {
        // Process the selected keys: this is where the reads and writes happen
        pollSelectionKeys(this.nioSelector.selectedKeys(), false, endSelect);
        pollSelectionKeys(immediatelyConnectedKeys, true, endSelect);
    }

    addToCompletedReceives();

    long endIo = time.nanoseconds();
    this.sensors.ioTime.record(endIo - endSelect, time.milliseconds());

    // we use the time at the end of select to ensure that we don't close any connections that
    // have just been processed in pollSelectionKeys
    maybeCloseOldestConnection(endSelect);
}

private void pollSelectionKeys(Iterable<SelectionKey> selectionKeys,
                               boolean isImmediatelyConnected,
                               long currentTimeNanos) {
    Iterator<SelectionKey> iterator = selectionKeys.iterator();
    while (iterator.hasNext()) {
        SelectionKey key = iterator.next();
        iterator.remove();
        KafkaChannel channel = channel(key);

        // register all per-connection metrics at once
        sensors.maybeRegisterConnectionMetrics(channel.id());
        if (idleExpiryManager != null)
            idleExpiryManager.update(channel.id(), currentTimeNanos);

        try {

            /* complete any connections that have finished their handshake (either normally or immediately) */
            if (isImmediatelyConnected || key.isConnectable()) {
                // For a connect event, finish establishing the connection first
                if (channel.finishConnect()) {
                    this.connected.add(channel.id());
                    this.sensors.connectionCreated.record();
                    SocketChannel socketChannel = (SocketChannel) key.channel();
                    log.debug("Created socket with SO_RCVBUF = {}, SO_SNDBUF = {}, SO_TIMEOUT = {} to node {}",
                            socketChannel.socket().getReceiveBufferSize(),
                            socketChannel.socket().getSendBufferSize(),
                            socketChannel.socket().getSoTimeout(),
                            channel.id());
                } else
                    continue;
            }

            /* if channel is not ready finish prepare */
            if (channel.isConnected() && !channel.ready())
                channel.prepare();

            // Reads and writes are handled in the same method body; the surrounding loop over the
            // selection keys means that once a send completes, reads can follow immediately afterwards
            /* if channel is ready read from any connections that have readable data */
            if (channel.ready() && key.isReadable() && !hasStagedReceive(channel)) {
                // Read whatever is available from the channel
                NetworkReceive networkReceive;
                while ((networkReceive = channel.read()) != null)
                    // Stage each complete receive in a queue for later processing
                    addToStagedReceives(channel, networkReceive);
            }

            /* if channel is ready write to any sockets that have space in their buffer and for which we have data */
            if (channel.ready() && key.isWritable()) {
                // Write: push the send staged earlier out onto the channel
                Send send = channel.write();
                if (send != null) {
                    // Record the completed send
                    this.completedSends.add(send);
                    this.sensors.recordBytesSent(channel.id(), send.size());
                }
            }

            /* cancel any defunct sockets */
            if (!key.isValid()) {
                // The key is no longer valid: close the channel
                close(channel);
                // and record it as disconnected
                this.disconnected.add(channel.id());
            }

        } catch (Exception e) {
            String desc = channel.socketDescription();
            if (e instanceof IOException)
                log.debug("Connection with {} disconnected", desc, e);
            else
                log.warn("Unexpected error from {}; closing connection", desc, e);
            close(channel);
            this.disconnected.add(channel.id());
        }
    }
}

private void addToCompletedReceives() {
    // Go through the receives staged by the read path above and move them into completedReceives
    if (!this.stagedReceives.isEmpty()) {
        Iterator<Map.Entry<KafkaChannel, Deque<NetworkReceive>>> iter = this.stagedReceives.entrySet().iterator();
        while (iter.hasNext()) {
            Map.Entry<KafkaChannel, Deque<NetworkReceive>> entry = iter.next();
            KafkaChannel channel = entry.getKey();
            if (!channel.isMute()) {
                Deque<NetworkReceive> deque = entry.getValue();
                NetworkReceive networkReceive = deque.poll();
                this.completedReceives.add(networkReceive);
                this.sensors.recordBytesReceived(channel.id(), networkReceive.payload().limit());
                if (deque.isEmpty())
                    iter.remove();
            }
        }
    }
}

After the sends and receives have completed, the received data is processed as follows:

1. First, requests that have been sent successfully (and expect no response) are completed and removed from the queue so they will not be sent again:

private void handleCompletedSends(List<ClientResponse> responses, long now) {
    // if no response is expected then when the send is completed, return it
    for (Send send : this.selector.completedSends()) {
        ClientRequest request = this.inFlightRequests.lastSent(send.destination());
        if (!request.expectResponse()) {
            this.inFlightRequests.completeLastSent(send.destination());
            responses.add(new ClientResponse(request, now, false, null));
        }
    }
}

2. Process the data that has been received:

private void handleCompletedReceives(List<ClientResponse> responses, long now) {
    for (NetworkReceive receive : this.selector.completedReceives()) {
        String source = receive.source();
        // Look up the request this response corresponds to. The trySend logic above ensures a single source
        // never accumulates multiple outstanding requests, so the request taken here is guaranteed to be the
        // one this response belongs to
        ClientRequest req = inFlightRequests.completeNext(source);
        Struct body = parseResponse(receive.payload(), req.request().header());
        if (!metadataUpdater.maybeHandleCompletedReceive(req, now, body))
            responses.add(new ClientResponse(req, now, false, body));
    }
}

3. Handle connections that have been disconnected:

private void handleDisconnections(List<ClientResponse> responses, long now) {
    for (String node : this.selector.disconnected()) {
        log.debug("Node {} disconnected.", node);
        processDisconnection(responses, node, now);
    }
    // we got a disconnect so we should probably refresh our metadata and see if that broker is dead
    if (this.selector.disconnected().size() > 0)
        metadataUpdater.requestUpdate();
}

4. Handle newly established connections:

private void handleConnections() {
    for (String node : this.selector.connected()) {
        log.debug("Completed connection to node {}", node);
        this.connectionStates.connected(node);
    }
}

5. Handle requests that have timed out:

private void handleTimedOutRequests(List<ClientResponse> responses, long now) {
    List<String> nodeIds = this.inFlightRequests.getNodesWithTimedOutRequests(now, this.requestTimeoutMs);
    for (String nodeId : nodeIds) {
        // close connection to the node
        this.selector.close(nodeId);
        log.debug("Disconnecting from node {} due to request timeout.", nodeId);
        processDisconnection(responses, nodeId, now);
    }

    // we disconnected, so we should probably refresh our metadata
    if (nodeIds.size() > 0)
        metadataUpdater.requestUpdate();
}

Finally, callbacks are invoked for all responses; this is where the asynchrony really shows. The callback comes from the request itself: back when the request object was first assembled, a RequestFutureCompletionHandler was created as that request's callback:

for (ClientResponse response : responses) {
        if (response.request().hasCallback()) {
            try {
                // Invoke the callback's onComplete method
                response.request().callback().onComplete(response);
            } catch (Exception e) {
                log.error("Uncaught error in request completion:", e);
            }
        }
    }

Here is the definition of RequestFutureCompletionHandler:

public class RequestFutureCompletionHandler implements RequestCompletionHandler {
    private final RequestFuture<ClientResponse> future;
    private ClientResponse response;
    private RuntimeException e;

    public RequestFutureCompletionHandler() {
        this.future = new RequestFuture<>();
    }

    public void fireCompletion() {
        if (e != null) {
            future.raise(e);
        } else if (response.wasDisconnected()) {
            ClientRequest request = response.request();
            RequestSend send = request.request();
            ApiKeys api = ApiKeys.forId(send.header().apiKey());
            int correlation = send.header().correlationId();
            log.debug("Cancelled {} request {} with correlation id {} due to node {} being disconnected",
                    api, request, correlation, send.destination());
            future.raise(DisconnectException.INSTANCE);
        } else {
            // Call RequestFuture.complete
            future.complete(response);
        }
    }

    public void onFailure(RuntimeException e) {
        this.e = e;
        pendingCompletion.add(this);
    }

    @Override
    public void onComplete(ClientResponse response) {
        // Store the response on this handler
        this.response = response;
        // Queue the handler so it is processed later, in firePendingCompletedRequests;
        // it is not handled right here in order to avoid potential deadlocks
        pendingCompletion.add(this);
    }
}

private void firePendingCompletedRequests() {
    boolean completedRequestsFired = false;
    for (;;) {
        RequestFutureCompletionHandler completionHandler = pendingCompletion.poll();
        if (completionHandler == null)
            break;
        // Invoke the handler's fireCompletion() method
        completionHandler.fireCompletion();
        completedRequestsFired = true;
    }

    // wakeup the client in case it is blocking in poll for this future's completion
    if (completedRequestsFired)
        client.wakeup();
}

Next, a few key methods of RequestFuture:

/**
 * Complete the request successfully. After this call, {@link #succeeded()} will return true
 * and the value can be obtained through {@link #value()}.
 * @param value corresponding value (or null if there is none)
 * @throws IllegalStateException if the future has already been completed
 * @throws IllegalArgumentException if the argument is an instance of {@link RuntimeException}
 */
public void complete(T value) {
    if (value instanceof RuntimeException)
        throw new IllegalArgumentException("The argument to complete can not be an instance of RuntimeException");
    // Atomically set result to the given value
    if (!result.compareAndSet(INCOMPLETE_SENTINEL, value))
        throw new IllegalStateException("Invalid attempt to complete a request future which is already complete");
    fireSuccess();
}

private void fireSuccess() {
    T value = value();
    while (true) {
        // Invoke each listener's onSuccess method
        RequestFutureListener<T> listener = listeners.poll();
        if (listener == null)
            break;
        listener.onSuccess(value);
    }
}

You might wonder where the listener comes from: it was registered at the very beginning, inside the compose method:

addListener(new RequestFutureListener<T>() {
        @Override
        public void onSuccess(T value) {
            // This invokes the adapter's onSuccess method
            adapter.onSuccess(value, adapted);
        }

        @Override
        public void onFailure(RuntimeException e) {
            adapter.onFailure(e, adapted);
        }
    });

new RequestFutureAdapter<ClientResponse, Long>() {
                    @Override
                    public void onSuccess(ClientResponse response, RequestFuture<Long> future) {
                        // Process the final result: parse the offset out of the response and assign it to the future's result
                        handleListOffsetResponse(topicPartition, response, future);
                    }
                }

private void handleListOffsetResponse(TopicPartition topicPartition,
                                      ClientResponse clientResponse,
                                      RequestFuture<Long> future) {
    ListOffsetResponse lor = new ListOffsetResponse(clientResponse.responseBody());
    short errorCode = lor.responseData().get(topicPartition).errorCode;
    if (errorCode == Errors.NONE.code()) {
        List<Long> offsets = lor.responseData().get(topicPartition).offsets;
        if (offsets.size() != 1)
            throw new IllegalStateException("This should not happen.");
        long offset = offsets.get(0);
        log.debug("Fetched offset {} for partition {}", offset, topicPartition);
        // complete is called again here, but this time no listener's onSuccess gets triggered, because no
        // listener was registered on this future; it simply assigns the offset to the future's result
        future.complete(offset);
    } else if (errorCode == Errors.NOT_LEADER_FOR_PARTITION.code()
            || errorCode == Errors.UNKNOWN_TOPIC_OR_PARTITION.code()) {
        log.debug("Attempt to fetch offsets for partition {} failed due to obsolete leadership information, retrying.",
                topicPartition);
        future.raise(Errors.forCode(errorCode));
    } else {
        log.warn("Attempt to fetch offsets for partition {} failed due to: {}",
                topicPartition, Errors.forCode(errorCode).message());
        future.raise(new StaleMetadataException());
    }
}

At this point the whole flow is essentially complete; finally, future.value() returns the value that was assigned to the future's result.

The trickiest part is the layering of callbacks inside RequestFuture; reading through it a few times is usually enough to get a firm grasp of it.
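To make the chain easier to hold in your head, here is a minimal, self-contained sketch of the same future/adapter/listener pattern. The class and method names (MiniFuture, Adapter, FutureChainDemo) are simplified stand-ins, not Kafka's real API; only the shape of the interaction mirrors RequestFuture.compose and complete:

import java.util.ArrayList;
import java.util.List;

// A stripped-down future: a value plus listeners (no error path, no thread safety).
class MiniFuture<T> {
    interface Listener<V> { void onSuccess(V value); }
    interface Adapter<F, S> { void onSuccess(F value, MiniFuture<S> adapted); }

    private T value;
    private boolean done;
    private final List<Listener<T>> listeners = new ArrayList<>();

    void addListener(Listener<T> listener) { listeners.add(listener); }

    // Counterpart of RequestFuture.complete(): store the result, then fire every listener.
    void complete(T v) {
        value = v;
        done = true;
        for (Listener<T> listener : listeners)
            listener.onSuccess(v);
    }

    boolean isDone() { return done; }
    T value() { return value; }

    // Counterpart of RequestFuture.compose(): returns a future of another type,
    // fed by a listener registered on this one.
    <S> MiniFuture<S> compose(Adapter<T, S> adapter) {
        MiniFuture<S> adapted = new MiniFuture<>();
        addListener(v -> adapter.onSuccess(v, adapted));
        return adapted;
    }
}

public class FutureChainDemo {
    public static void main(String[] args) {
        // Plays the role of the RequestFuture<ClientResponse> returned by client.send(...)
        MiniFuture<String> rawResponse = new MiniFuture<>();
        // Plays the role of compose(new RequestFutureAdapter<ClientResponse, Long>() { ... })
        MiniFuture<Long> offsetFuture =
                rawResponse.compose((response, adapted) -> adapted.complete(Long.parseLong(response)));

        // Later the I/O layer "receives" the response and completes the inner future,
        // which fires the compose() listener and completes the adapted future.
        rawResponse.complete("42");
        System.out.println(offsetFuture.isDone() + " -> " + offsetFuture.value());   // prints: true -> 42
    }
}

Completing the inner future is what RequestFutureCompletionHandler.fireCompletion does for the ClientResponse future; the listener installed by compose then forwards the converted value into the adapted future, which is what listOffset eventually reads via future.value().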

Kafka Cluster: How the Leader Is Determined (2016-01-13)

After Kafka starts up, it initializes a very important class: KafkaController. This class does a lot of work while Kafka is running, including controlling partition and replica state, electing the Kafka (controller) leader, auto-rebalancing, and so on. The controller leader election is carried out through:

private val controllerElector = new ZookeeperLeaderElector(controllerContext, ZkUtils.ControllerPath, onControllerFailover,
onControllerResignation, config.brokerId)

which completes the election.

After startup, ZookeeperLeaderElector's elect method writes the current broker as the leader into ZooKeeper's /controller node and registers a LeaderChangeListener on that node. When the node's data changes, it means another broker has already been elected leader, and the current broker only needs to update its in-memory state. If an exception occurs during election, the node is deleted, which triggers a new round of election.

In other words, broker (controller) leader election is nowhere near as complex as ZooKeeper's own leader election. Put simply, whichever broker starts first becomes the leader; brokers that finish starting later read the /controller node and update their own in-memory state accordingly.
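As a rough sketch of that election flow, assuming the org.I0Itec.zkclient API this Kafka version depends on (the real ZookeeperLeaderElector additionally writes JSON, handles resignation callbacks, session expiry and the controller epoch, so treat this purely as an illustration):

import org.I0Itec.zkclient.IZkDataListener;
import org.I0Itec.zkclient.ZkClient;
import org.I0Itec.zkclient.exception.ZkNodeExistsException;

// Minimal sketch of the controller election idea: try to create the ephemeral
// /controller node; whoever succeeds is the leader, everyone else just caches
// the current leader id and watches the node.
class ControllerElectorSketch {
    private final ZkClient zkClient;
    private final int brokerId;
    private volatile int leaderId = -1;

    ControllerElectorSketch(ZkClient zkClient, int brokerId) {
        this.zkClient = zkClient;
        this.brokerId = brokerId;
    }

    void startup() {
        // Watch /controller so we notice when another broker becomes leader,
        // or re-elect when the current leader's ephemeral node disappears.
        zkClient.subscribeDataChanges("/controller", new IZkDataListener() {
            public void handleDataChange(String path, Object data) {
                leaderId = Integer.parseInt(data.toString());   // someone else won: just update the cached leader
            }
            public void handleDataDeleted(String path) {
                elect();                                        // leader gone: try to become leader again
            }
        });
        elect();
    }

    boolean elect() {
        try {
            zkClient.createEphemeral("/controller", String.valueOf(brokerId));
            leaderId = brokerId;                                // we won the election
        } catch (ZkNodeExistsException e) {
            // Somebody was faster; read who the leader is and carry on.
            leaderId = Integer.parseInt(zkClient.<String>readData("/controller"));
        }
        return leaderId == brokerId;
    }
}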

Next, when producing data to Kafka, a series of steps is performed through the DefaultEventHandler class:

1. From the given topic, pick a partition whose replica leader is not null as the partition to store the data.

2. Send the data to that partition's leader broker.

3. Data that failed to be sent to a partition is retried; after a certain number of retries an exception is thrown (see the sketch below).
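A condensed sketch of that produce-and-retry control flow follows. The three helper methods are hypothetical stand-ins for DefaultEventHandler internals (which actually batches messages per broker, refreshes metadata and backs off between attempts), so this only illustrates the three steps listed above:

// Sketch of the produce/retry behavior described above; the abstract helpers are
// hypothetical stand-ins, not real Kafka APIs.
abstract class ProducerRetrySketch {
    abstract int selectPartitionWithLeader(String topic);                       // step 1: skip partitions whose leader is unknown
    abstract boolean sendToLeader(String topic, int partition, byte[] message); // step 2: ship the data to the partition's leader broker
    abstract void refreshMetadata(String topic);

    void sendWithRetry(String topic, byte[] message, int maxRetries) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            int partition = selectPartitionWithLeader(topic);
            if (sendToLeader(topic, partition, message))
                return;
            refreshMetadata(topic);          // leadership may have moved; refresh before retrying
        }
        // step 3: after exhausting the retries, surface the failure to the caller
        throw new RuntimeException("Failed to send to " + topic + " after " + (maxRetries + 1) + " attempts");
    }
}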

When the partition receives the produced data:

def appendMessagesToLeader(messages: ByteBufferMessageSet, requiredAcks: Int = 0) = {
inReadLock(leaderIsrUpdateLock) {
  // Only write to the log when the local replica is the leader
  val leaderReplicaOpt = leaderReplicaIfLocal()
  leaderReplicaOpt match {
    case Some(leaderReplica) =>
      val log = leaderReplica.log.get
      val minIsr = log.config.minInSyncReplicas
      val inSyncSize = inSyncReplicas.size

      // Avoid writing to leader if there are not enough insync replicas to make it safe
      if (inSyncSize < minIsr && requiredAcks == -1) {
        throw new NotEnoughReplicasException("Number of insync replicas for partition [%s,%d] is [%d], below required minimum [%d]"
          .format(topic, partitionId, inSyncSize, minIsr))
      }

      val info = log.append(messages, assignOffsets = true)
      // probably unblock some follower fetch requests since log end offset has been updated
      replicaManager.tryCompleteDelayedFetch(new TopicPartitionOperationKey(this.topic, this.partitionId))
      // we may need to increment high watermark since ISR could be down to 1
      maybeIncrementLeaderHW(leaderReplica)
      info
    case None =>
      throw new NotLeaderForPartitionException("Leader not local for partition [%s,%d] on broker %d"
        .format(topic, partitionId, localBrokerId))
  }
}
}

The code first checks whether the data was sent to the leader; if not, it does not write the data. Under normal circumstances the broker receiving the data is the leader, because DefaultEventHandler explicitly sends data to the leader:

def leaderReplicaIfLocal(): Option[Replica] = {
leaderReplicaIdOpt match {
  case Some(leaderReplicaId) =>
    // If the leader replica lives on this broker, return that replica; otherwise return None
    if (leaderReplicaId == localBrokerId)
      getReplica(localBrokerId)
    else
      None
  case None => None
}
}

There are still many Kafka details worth digging into; the next article will cover how data is replicated between replicas.

Kafka Cluster: Consuming Data, Continued (2015-10-15)

The previous article left a question open: who puts data into the queue, and when? Here is the answer.

As mentioned before, when the consumer is initialized in ZookeeperConsumerConnector, the queue that stores the data is also initialized. Let's look at what happens after that initialization:

private def addPartitionTopicInfo(currentTopicRegistry: Pool[String, Pool[Int, PartitionTopicInfo]],
                                  partition: Int, topic: String,
                                  offset: Long, consumerThreadId: ConsumerThreadId) {
      val partTopicInfoMap = currentTopicRegistry.getAndMaybePut(topic)

      // The queue that will hold the fetched data
      val queue = topicThreadIdAndQueues.get((topic, consumerThreadId))
      val consumedOffset = new AtomicLong(offset)
      val fetchedOffset = new AtomicLong(offset)
      val partTopicInfo = new PartitionTopicInfo(topic,
                                                 partition,
                                                 queue,
                                                 consumedOffset,
                                                 fetchedOffset,
                                                 new AtomicInteger(config.fetchMessageMaxBytes),
                                                 config.clientId)
      partTopicInfoMap.put(partition, partTopicInfo)
      debug(partTopicInfo + " selected new offset " + offset)
      checkpointedZkOffsets.put(TopicAndPartition(topic, partition), offset)
    }
}

Here, val queue = topicThreadIdAndQueues.get((topic, consumerThreadId)) is the queue that was initialized earlier; as you can see, each consumer thread of a topic maps to exactly one queue. A PartitionTopicInfo is then instantiated, which means the queue holds only the data received for one specific partition of the topic.
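The handoff between the fetcher thread and the consumer is the classic producer/consumer pattern on a BlockingQueue. A minimal standalone sketch of the idea (with FetchedDataChunk replaced by a plain String and a "shutdown" marker standing in for the shutdown command):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal sketch of the fetcher-thread / consumer handoff around one queue.
public class QueueHandoffSketch {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(100);  // plays the role of the per-(topic, consumer thread) chunk queue

        // "Fetcher thread": puts fetched chunks into the queue, blocking when the queue is full.
        Thread fetcher = new Thread(() -> {
            try {
                for (int offset = 0; offset < 3; offset++)
                    queue.put("chunk@" + offset);
                queue.put("shutdown");                                 // plays the role of the shutdown command chunk
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        fetcher.start();

        // "Consumer iterator": blocks on take() until a chunk arrives, stops on the shutdown marker.
        while (true) {
            String chunk = queue.take();
            if (chunk.equals("shutdown"))
                break;
            System.out.println("consumed " + chunk);
        }
        fetcher.join();
    }
}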

Back to creating the fetcher thread for the partitions, as mentioned earlier:

override def createFetcherThread(fetcherId: Int, sourceBroker: BrokerEndPoint): AbstractFetcherThread = {
new ConsumerFetcherThread(
  "ConsumerFetcherThread-%s-%d-%d".format(consumerIdString, fetcherId, sourceBroker.id),
  config, sourceBroker, partitionMap, this)
}

ConsumerFetcherThread extends AbstractFetcherThread; AbstractFetcherThread's doWork() method:

override def doWork() {

    inLock(partitionMapLock) {
      partitionMap.foreach {
        case((topicAndPartition, partitionFetchState)) =>
          if(partitionFetchState.isActive)
            fetchRequestBuilder.addFetch(topicAndPartition.topic, topicAndPartition.partition,
              partitionFetchState.offset, fetchSize)
      }
    }

    val fetchRequest = fetchRequestBuilder.build()

    if (!fetchRequest.requestInfo.isEmpty)
      // Process the fetch request
      processFetchRequest(fetchRequest)
    else {
      trace("There are no active partitions. Back off for %d ms before sending a fetch request".format(fetchBackOffMs))
      partitionMapCond.await(fetchBackOffMs, TimeUnit.MILLISECONDS)
    }
}

In processFetchRequest, data is first requested from the topic via SimpleConsumer.fetch(); the fetched data is then put into the queue mentioned above by ConsumerFetcherThread.processPartitionData:

def processPartitionData(topicAndPartition: TopicAndPartition, fetchOffset: Long, partitionData: FetchResponsePartitionData) {
val pti = partitionMap(topicAndPartition)
if (pti.getFetchOffset != fetchOffset)
  throw new RuntimeException("Offset doesn't match for partition [%s,%d] pti offset: %d fetch offset: %d"
                            .format(topicAndPartition.topic, topicAndPartition.partition, pti.getFetchOffset, fetchOffset))
pti.enqueue(partitionData.messages.asInstanceOf[ByteBufferMessageSet])
}

The definition of pti.enqueue:

/**
 * Enqueue a message set for processing.
 */
def enqueue(messages: ByteBufferMessageSet) {
    val size = messages.validBytes
    if(size > 0) {
      val next = messages.shallowIterator.toSeq.last.nextOffset
      trace("Updating fetch offset = " + fetchedOffset.get + " to " + next)
      // chunkQueue here is the queue passed in when PartitionTopicInfo was instantiated; see ZookeeperConsumerConnector.addPartitionTopicInfo
      chunkQueue.put(new FetchedDataChunk(messages, this, fetchedOffset.get))
      fetchedOffset.set(next)
      debug("updated fetch offset of (%s) to %d".format(this, next))
      consumerTopicStats.getConsumerTopicStats(topic).byteRate.mark(size)
      consumerTopicStats.getConsumerAllTopicStats().byteRate.mark(size)
    } else if(messages.sizeInBytes > 0) {
      chunkQueue.put(new FetchedDataChunk(messages, this, fetchedOffset.get))
    }
}

This is how the data goes from nonexistent to present in the queue.

At this point it is basically clear where the data consumed by the consumer comes from, but two questions remain:

1. In AbstractFetcherThread's doWork() method, where do those fetch requests come from?

2. How is the data actually fetched through SimpleConsumer?

For the first question: when the fetcher threads are created, the currently consumed offsets are put into partitionMap by default (this is the map that doWork builds the fetch requests from):

fetcherThreadMap(brokerAndFetcherId).addPartitions(partitionAndOffsets.map { case (topicAndPartition, brokerAndInitOffset) =>
      topicAndPartition -> brokerAndInitOffset.initOffset
    })
// The method that actually stores the partitions and offsets to fetch
def addPartitions(partitionAndOffsets: Map[TopicAndPartition, Long]) {
partitionMapLock.lockInterruptibly()
    try {
      for ((topicAndPartition, offset) <- partitionAndOffsets) {
        // If the partitionMap already has the topic/partition, then do not update the map with the old offset
        if (!partitionMap.contains(topicAndPartition))
          partitionMap.put(
            topicAndPartition,
            if (PartitionTopicInfo.isOffsetInvalid(offset)) new PartitionFetchState(handleOffsetOutOfRange(topicAndPartition))
            else new PartitionFetchState(offset)
          )}
      partitionMapCond.signalAll()
    } finally {
      partitionMapLock.unlock()
    }
}

For the second question, first look at SimpleConsumer's sendRequest method:

private def sendRequest(request: RequestOrResponse): Receive = {
    lock synchronized {
      var response: Receive = null
      try {
        // Get (or create) the connection to the broker
        getOrMakeConnection()
        // Send the request to the broker
        blockingChannel.send(request)
        // Block until the broker's response arrives
        response = blockingChannel.receive()
      } catch {
        case e : Throwable =>
          info("Reconnect due to socket error: %s".format(e.toString))
          // retry once
          try {
            reconnect()
            blockingChannel.send(request)
            response = blockingChannel.receive()
          } catch {
            case e: Throwable =>
              disconnect()
              throw e
          }
      }
      response
    }

After the request reaches the broker, it is ultimately handled by KafkaApis:

def handleFetchRequest(request: RequestChannel.Request) {
    val fetchRequest = request.requestObj.asInstanceOf[FetchRequest]

    // the callback for sending a fetch response
    def sendResponseCallback(responsePartitionData: Map[TopicAndPartition, FetchResponsePartitionData]) {
      responsePartitionData.foreach { case (topicAndPartition, data) =>
        // we only print warnings for known errors here; if it is unknown, it will cause
        // an error message in the replica manager already and hence can be ignored here
        if (data.error != ErrorMapping.NoError && data.error != ErrorMapping.UnknownCode) {
          debug("Fetch request with correlation id %d from client %s on partition %s failed due to %s"
            .format(fetchRequest.correlationId, fetchRequest.clientId,
            topicAndPartition, ErrorMapping.exceptionNameFor(data.error)))
        }

        // record the bytes out metrics only when the response is being sent
        BrokerTopicStats.getBrokerTopicStats(topicAndPartition.topic).bytesOutRate.mark(data.messages.sizeInBytes)
        BrokerTopicStats.getBrokerAllTopicsStats().bytesOutRate.mark(data.messages.sizeInBytes)
      }

      val response = FetchResponse(fetchRequest.correlationId, responsePartitionData)
      requestChannel.sendResponse(new RequestChannel.Response(request, new FetchResponseSend(response)))
    }

    // call the replica manager to fetch messages from the local replica
    replicaManager.fetchMessages(
      fetchRequest.maxWait.toLong,
      fetchRequest.replicaId,
      fetchRequest.minBytes,
      fetchRequest.requestInfo,
      sendResponseCallback)
}

How KafkaApis receives these requests will not be repeated here; it was covered in the earlier article on SocketServer.

To sum up: Kafka's design is elegant, with a single control path dispatching all the different kinds of requests; reading its code is a pleasure.

The end :)

Kafka Cluster: Consuming Data (2015-09-28)

As mentioned two articles ago, the class that actually holds the data for clients to consume is KafkaStream, defined as:

class KafkaStream[K,V](private val queue: BlockingQueue[FetchedDataChunk],
                    consumerTimeoutMs: Int,
                    private val keyDecoder: Decoder[K],
                    private val valueDecoder: Decoder[V],
                    val clientId: String)
extends Iterable[MessageAndMetadata[K,V]] with java.lang.Iterable[MessageAndMetadata[K,V]]

As you can see, it is itself an iterable; its queue stores the data fetched from the brokers. The class's iterator() method returns a ConsumerIterator instance:

private val iter: ConsumerIterator[K,V] = new ConsumerIterator[K,V](queue, consumerTimeoutMs, keyDecoder, valueDecoder, clientId)

This iterator blocks until there is data to read from the queue, and stops reading when it encounters the shutdownCommand. The client-side consumption flow is while(hasNext()) -> next(), which ultimately becomes: while(IteratorTemplate.hasNext()) -> IteratorTemplate.maybeComputeNext() -> ConsumerIterator.makeNext() -> ConsumerIterator.next() -> IteratorTemplate.next(). The key makeNext() method is shown below:

protected def makeNext(): MessageAndMetadata[K, V] = {
    var currentDataChunk: FetchedDataChunk = null
    // if we don't have an iterator, get one
    var localCurrent = current.get()
    if(localCurrent == null || !localCurrent.hasNext) {
      if (consumerTimeoutMs < 0)
        // Block until the next fetched data chunk is available
        currentDataChunk = channel.take
      else {
        currentDataChunk = channel.poll(consumerTimeoutMs, TimeUnit.MILLISECONDS)
        if (currentDataChunk == null) {
          // reset state to make the iterator re-iterable
          // No data arrived: reset the iterator to its initial state so it can keep blocking on the next call
          resetState()
          throw new ConsumerTimeoutException
        }
      }
      // If the chunk is the shutdownCommand, end the iteration
      if(currentDataChunk eq ZookeeperConsumerConnector.shutdownCommand) {
        // Only the shutdown command makes the iterator report that the queue has no more messages
        debug("Received the shutdown command")
        return allDone
      } else {
        currentTopicInfo = currentDataChunk.topicInfo
        // The offset this chunk was fetched at
        val cdcFetchOffset = currentDataChunk.fetchOffset
        // The offset consumed so far
        val ctiConsumeOffset = currentTopicInfo.getConsumeOffset
        // The fetch offset takes precedence
        if (ctiConsumeOffset < cdcFetchOffset) {
          error("consumed offset: %d doesn't match fetch offset: %d for %s;\n Consumer may lose data"
            .format(ctiConsumeOffset, cdcFetchOffset, currentTopicInfo))
          currentTopicInfo.resetConsumeOffset(cdcFetchOffset)
        }
        localCurrent = currentDataChunk.messages.iterator

        current.set(localCurrent)
      }
      // if we just updated the current chunk and it is empty that means the fetch size is too small!
      if(currentDataChunk.messages.validBytes == 0)
        throw new MessageSizeTooLargeException("Found a message larger than the maximum fetch size of this consumer on topic " +
                                               "%s partition %d at fetch offset %d. Increase the fetch size, or decrease the maximum message size the broker will allow."
                                               .format(currentDataChunk.topicInfo.topic, currentDataChunk.topicInfo.partitionId, currentDataChunk.fetchOffset))
    }
    var item = localCurrent.next()
    // reject the messages that have already been consumed
    while (item.offset < currentTopicInfo.getConsumeOffset && localCurrent.hasNext) {
      item = localCurrent.next()
    }
    consumedOffset = item.nextOffset

    item.message.ensureValid() // validate checksum of message to ensure it is valid

    new MessageAndMetadata(currentTopicInfo.topic, currentTopicInfo.partitionId, item.message, item.offset, keyDecoder, valueDecoder)
}

The iterator has the following four states:

class State
object DONE extends State
object READY extends State
object NOT_READY extends State
object FAILED extends State

The initial state is NOT_READY; during normal consumption the state is READY; DONE means the iteration has ended; FAILED means an error occurred while consuming. During normal consumption the state keeps switching back and forth between NOT_READY and READY:

def hasNext(): Boolean = {
    if(state == FAILED)
      throw new IllegalStateException("Iterator is in failed state")
    state match {
        // The state becomes DONE once the shutdownCommand has been read
      case DONE => false
        // READY means the next item has already been computed, so hasNext() is immediately true
      case READY => true
      case _ => maybeComputeNext()
    }
}

def next(): T = {
    if(!hasNext())
      throw new NoSuchElementException()
    state = NOT_READY
    if(nextItem == null)
      throw new IllegalStateException("Expected item but none found.")
    nextItem
}

def maybeComputeNext(): Boolean = {
    state = FAILED
    nextItem = makeNext()
    if(state == DONE) {
      false
    } else {
      state = READY
      true
    }
}

That is, when hasNext() is called in the initial NOT_READY state, the case _ => maybeComputeNext() branch computes the next item and sets the state to READY; calling next() then sets the state back to NOT_READY. This keeps the iterator able to iterate indefinitely.
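Seen from application code, this state machine is driven by an ordinary iterator loop; roughly (a sketch using the high-level consumer types discussed here, with error handling and the surrounding connector setup omitted):

import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.message.MessageAndMetadata;

// Sketch of a typical consumption loop over one KafkaStream: hasNext() blocks
// (cycling NOT_READY -> READY as described above) until a chunk arrives or the
// shutdown command ends the iteration.
class StreamConsumerSketch {
    static void consume(KafkaStream<byte[], byte[]> stream) {
        ConsumerIterator<byte[], byte[]> it = stream.iterator();
        while (it.hasNext()) {                                   // blocks on the underlying chunk queue
            MessageAndMetadata<byte[], byte[]> msg = it.next();  // flips the state back to NOT_READY
            System.out.println("partition=" + msg.partition() + " offset=" + msg.offset()
                    + " bytes=" + msg.message().length);
        }
    }
}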

Finally, a question to leave open: who puts data into the queue mentioned above, and when?

Investigating and Fixing a Kafka Problem (2015-09-25)

These past few days have been taken up by CTR-prediction work; only now do I finally have time to write about the Kafka problem we ran into a few days ago.

Our Kafka monitoring uses the open-source KafkaOffsetMonitor tool, whose source code was analyzed in an earlier article; it is a very simple tool. A few days ago, however, the monitoring page suddenly showed only one partition of a certain topic being consumed; the other partition was not being consumed, and watching its lag keep climbing was nerve-racking (those numbers translate directly into money). I immediately checked the stats program's logs: no errors, it was happily processing data from the partition that was being consumed. Switching to the Kafka logs, I found that one partition's offset was indeed not advancing (only an advancing offset means data is actually being consumed). I then switched to the ZooKeeper znodes and checked that group's consumers with get /consumers/group/ids, and found that numChildren did not equal the number of topics. Next I checked the consumers of that topic's partitions for the group with get /consumers/group/owners/topic/0 and get /consumers/group/owners/topic/1, and found the two owners were different. As the previous article explained, when consuming data Kafka registers the topic's consumer in ZooKeeper, and within one group a topic partition can only be consumed by a single consumer; if a topic has multiple consumers, some other group must also be consuming it. That was the problem: after deleting the other consumer (delete /consumers/group/ids/consumeridString) and restarting the consuming program, everything went back to normal.

-------------- (just a divider so this doesn't run too long) --------------

But then the monitoring page stopped showing any consumption information for that topic at all; not a single partition was listed. The stats program logs showed it happily consuming, and checking the partition offsets with get /consumers/group/offsets/topic/partitionId showed both partitions' offsets happily climbing. Going back to the earlier article analyzing the KafkaOffsetMonitor source: the page displays information read from /brokers/topics/topic/partitions. Inspecting that znode showed its numChildren was 0, which is clearly wrong; normally there should be a child path named partitions storing each partition's information, e.g. get /brokers/topics/topic/partitions/partitionId/state returns a value like {"controller_epoch":4,"leader":4,"version":1,"leader_epoch":1,"isr":[1,2,3,4]}. Back in the Kafka source, this znode is maintained by the AddPartitionsListener, which handles partition additions and deletions. Checking Kafka's controller.log showed an error, which was actually a relief:

java.util.NoSuchElementException: key not found: [topic,1]
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:58)
    at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
    at kafka.controller.ControllerContext$$anonfun$replicasForPartition$1.apply(KafkaController.scala:109)
    at kafka.controller.ControllerContext$$anonfun$replicasForPartition$1.apply(KafkaController.scala:108)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.immutable.Set$Set1.foreach(Set.scala:74)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.AbstractSet.scala$collection$SetLike$$super$map(Set.scala:47)
    at scala.collection.SetLike$class.map(SetLike.scala:93)
    at scala.collection.AbstractSet.map(Set.scala:47)
    at kafka.controller.ControllerContext.replicasForPartition(KafkaController.scala:108)
    at kafka.controller.KafkaController.onNewPartitionCreation(KafkaController.scala:472)
    at kafka.controller.PartitionStateMachine$AddPartitionsListener$$anonfun$handleDataChange$1.apply$mcV$sp(PartitionStateMachine.scala:503)
    at kafka.controller.PartitionStateMachine$AddPartitionsListener$$anonfun$handleDataChange$1.apply(PartitionStateMachine.scala:492)
    at kafka.controller.PartitionStateMachine$AddPartitionsListener$$anonfun$handleDataChange$1.apply(PartitionStateMachine.scala:492)
    at kafka.utils.Utils$.inLock(Utils.scala:538)
    at kafka.controller.PartitionStateMachine$AddPartitionsListener.handleDataChange(PartitionStateMachine.scala:491)
    at org.I0Itec.zkclient.ZkClient$6.run(ZkClient.java:547)
    at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)

Back in the source, the line that throws is:

val replicas = partitionReplicaAssignment(p)

In other words, partitionReplicaAssignment has no entry for that key. Reading further, TopicChangeListener is what populates partitionReplicaAssignment; it is triggered when topics are added or deleted, i.e. when the data under /brokers/topics changes. I recreated the topic (deleted it, then added it again) and still got the error. Analyzing the error messages showed that during topic recreation the listeners were not executed in the order TopicChangeListener first, then AddPartitionsListener, but the other way around, so AddPartitionsListener could not find the corresponding data and the error occurred before anything was written to ZooKeeper.

-------------- (just a divider so this doesn't run too long) --------------

In the end I had no choice but to create the partitions path by hand; the final path is /brokers/topics/topic/partitions/partitionId/state, with the value {"controller_epoch":4,"leader":4,"version":1,"leader_epoch":1,"isr":[1,2,3,4]}. After that, the partition consumption showed up on the monitoring page again.
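For reference, the manual repair can also be scripted with the same ZkClient API; a rough sketch (the state JSON must of course match your actual controller epoch, leader and ISR, and whether plain-string serialization applies depends on how the ZkClient is configured, so this is an illustration rather than a recommended procedure):

import org.I0Itec.zkclient.ZkClient;

// Illustration only: manually recreate the missing partition state znode.
class PartitionStateRepairSketch {
    static void repair(ZkClient zkClient, String topic, int partitionId, String stateJson) {
        String path = "/brokers/topics/" + topic + "/partitions/" + partitionId + "/state";
        zkClient.createPersistent(path, true);   // create the whole path, parent nodes included
        zkClient.writeData(path, stateJson);     // e.g. {"controller_epoch":4,"leader":4,"version":1,"leader_epoch":1,"isr":[1,2,3,4]}
    }
}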

The Kafka version we use is 0.8.1.1, not the latest; I will continue writing after upgrading.

Kafka Cluster: Preparation Before Consuming Data (2015-09-22)

When consuming data, Kafka creates a ConsumerConnector through the Consumer object; once created, the ConsumerConnector is used to create KafkaStream instances. A KafkaStream can be understood as a stream of data, and data is handed out through it. How data is actually consumed will be analyzed in detail in the next article; this one focuses on the ConsumerConnector.

One subclass of ConsumerConnector is ZookeeperConsumerConnector, which mainly maintains the interaction between consumers and ZooKeeper:

(1) Each consumer has a unique id within its consumer group, generated as follows:

val consumerIdString = {
    var consumerUuid : String = null
    config.consumerId match {
      case Some(consumerId) // for testing only
      => consumerUuid = consumerId
      case None // generate unique consumerId automatically
      => val uuid = UUID.randomUUID()
      consumerUuid = "%s-%d-%s".format(
        InetAddress.getLocalHost.getHostName, System.currentTimeMillis,
        uuid.getMostSignificantBits().toHexString.substring(0,8))
    }
    config.groupId + "_" + consumerUuid
}

This id is stored in ZooKeeper at /consumers/[group_id]/ids/[consumer_id] -> topic1,...topicN; each consumer registers its id as an ephemeral znode and sets the topics it consumes as the znode's value.

(2) While consuming, each partition of a topic can only be consumed by a single consumer within a consumer group (i.e. a partition can be consumed by multiple consumers, but those consumers must belong to different groups). The partition-to-consumer mapping is /consumers/[group_id]/owner/[topic]/[broker_id-partition_id] --> consumer_node_id.

(3) As mentioned above, a partition can be consumed by consumers in different groups, so each group must maintain its own latest consumed offset for that partition. The offset-to-partition mapping is /consumers/[group_id]/offsets/[topic]/[broker_id-partition_id] --> offset_counter_value.

Next, the ConsumerConnector connects to ZooKeeper, sets up the data-fetching machinery, and so on:

// Connect to ZooKeeper
connectZk()
// Initialize ConsumerFetcherManager, which manages the data-fetching threads
createFetcher()
// Set up the connection used for offset management operations
ensureOffsetManagerConnected()

// Periodically commit offsets
if (config.autoCommitEnable) {
    scheduler.startup
    info("starting auto committer every " + config.autoCommitIntervalMs + " ms")
    scheduler.schedule("kafka-consumer-autocommit",
                       autoCommit,
                       delay = config.autoCommitIntervalMs,
                       period = config.autoCommitIntervalMs,
                       unit = TimeUnit.MILLISECONDS)
}

The data-fetching threads are managed by the ConsumerFetcherManager class. createFetcher() does not create the fetching threads immediately; they are initialized inside createMessageStreams, which is discussed later.

ensureOffsetManagerConnected() mainly creates the socket connection to the broker that manages offsets, primarily through the channelToOffsetManager method:

// This method establishes the connection to the broker that acts as the offset manager (coordinator) for the group
def channelToOffsetManager(group: String, zkClient: ZkClient, socketTimeoutMs: Int = 3000, retryBackOffMs: Int = 1000) = {
 // Create a BlockingChannel to an arbitrary broker
 var queryChannel = channelToAnyBroker(zkClient)

 var offsetManagerChannelOpt: Option[BlockingChannel] = None

 while (!offsetManagerChannelOpt.isDefined) {

   var coordinatorOpt: Option[BrokerEndPoint] = None

   while (!coordinatorOpt.isDefined) {
     try {
       if (!queryChannel.isConnected)
         queryChannel = channelToAnyBroker(zkClient)
       debug("Querying %s:%d to locate offset manager for %s.".format(queryChannel.host, queryChannel.port, group))
       // Ask the broker cluster for the consumer (offset) coordinator metadata for this group
       queryChannel.send(ConsumerMetadataRequest(group))
       val response = queryChannel.receive()
       val consumerMetadataResponse =  ConsumerMetadataResponse.readFrom(response.buffer)
       debug("Consumer metadata response: " + consumerMetadataResponse.toString)
       if (consumerMetadataResponse.errorCode == ErrorMapping.NoError)
         coordinatorOpt = consumerMetadataResponse.coordinatorOpt
       else {
         debug("Query to %s:%d to locate offset manager for %s failed - will retry in %d milliseconds."
              .format(queryChannel.host, queryChannel.port, group, retryBackOffMs))
         Thread.sleep(retryBackOffMs)
       }
     }
     catch {
       case ioe: IOException =>
         info("Failed to fetch consumer metadata from %s:%d.".format(queryChannel.host, queryChannel.port))
         queryChannel.disconnect()
     }
   }

   val coordinator = coordinatorOpt.get
    // If the coordinator broker is the one we are already connected to, reuse the existing BlockingChannel as the offsetManagerChannel
   if (coordinator.host == queryChannel.host && coordinator.port == queryChannel.port) {
     offsetManagerChannelOpt = Some(queryChannel)
   } else {
      // Otherwise create a new connection to that broker
     val connectString = "%s:%d".format(coordinator.host, coordinator.port)
     var offsetManagerChannel: BlockingChannel = null
     try {
       debug("Connecting to offset manager %s.".format(connectString))
       offsetManagerChannel = new BlockingChannel(coordinator.host, coordinator.port,
                                                  BlockingChannel.UseDefaultBufferSize,
                                                  BlockingChannel.UseDefaultBufferSize,
                                                  socketTimeoutMs)
       offsetManagerChannel.connect()
       offsetManagerChannelOpt = Some(offsetManagerChannel)
       // Close the earlier connection to the arbitrary broker
       queryChannel.disconnect()
     }
     catch {
       case ioe: IOException => // offsets manager may have moved
         info("Error while connecting to %s.".format(connectString))
         if (offsetManagerChannel != null) offsetManagerChannel.disconnect()
         Thread.sleep(retryBackOffMs)
         offsetManagerChannelOpt = None // just in case someone decides to change shutdownChannel to not swallow exceptions
     }
   }
 }

 offsetManagerChannelOpt.get
}

The periodic offset commit logic is straightforward: if the storage medium is ZooKeeper, offsets are written directly to ZooKeeper; if it is Kafka, they are written through the BlockingChannel created above. Note the dual.commit.enabled option: by default it is true when offsets.storage is set to kafka, meaning offsets are written to both Kafka and ZooKeeper, and on reads the larger of the two values is used. This option exists to avoid offset errors while migrating from ZooKeeper-based offset storage to Kafka-based storage; if no such migration will ever happen, it can be set to false.
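The read side of that behavior boils down to a few lines; a sketch of the decision described above (not the actual offset-fetch code):

// Sketch of the offset-read decision described above.
class OffsetStorageSketch {
    static long resolveFetchOffset(boolean dualCommitEnabled, String offsetsStorage,
                                   long zookeeperOffset, long kafkaOffset) {
        if (dualCommitEnabled)
            return Math.max(zookeeperOffset, kafkaOffset);  // read both stores, trust the larger offset
        return "kafka".equals(offsetsStorage) ? kafkaOffset : zookeeperOffset;
    }
}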

With that, the initialization of ZookeeperConsumerConnector is complete. Next, KafkaStream instances are created through the connector's createMessageStreams method:

def createMessageStreams[K,V](topicCountMap: Map[String,Int], keyDecoder: Decoder[K], valueDecoder: Decoder[V])
  : Map[String, List[KafkaStream[K,V]]] = {
    if (messageStreamCreated.getAndSet(true))
      throw new MessageStreamsExistException(this.getClass.getSimpleName +
                                   " can create message streams at most once",null)
    consume(topicCountMap, keyDecoder, valueDecoder)
}

The topicCountMap parameter maps each topic to its number of consumers. The consume method is defined as follows:

// This method does not actually read any data; it only initializes the queues that will hold the data.
// Consumption happens by iterating over those queues (a shallow iterator that never stops);
// while Kafka runs, other threads put data into the queue associated with each partition.
def consume[K, V](topicCountMap: scala.collection.Map[String,Int], keyDecoder: Decoder[K], valueDecoder: Decoder[V])
  : Map[String,List[KafkaStream[K,V]]] = {
    debug("entering consume ")
    if (topicCountMap == null)
      throw new RuntimeException("topicCountMap is null")

    // Create the specified number of consumer (thread) ids
    val topicCount = TopicCount.constructTopicCount(consumerIdString, topicCountMap)

    val topicThreadIds = topicCount.getConsumerThreadIdsPerTopic

    // make a list of (queue,stream) pairs, one pair for each threadId
    val queuesAndStreams = topicThreadIds.values.map(threadIdSet =>
      threadIdSet.map(_ => {
        val queue =  new LinkedBlockingQueue[FetchedDataChunk](config.queuedMaxMessages)
        val stream = new KafkaStream[K,V](
          queue, config.consumerTimeoutMs, keyDecoder, valueDecoder, config.clientId)
        (queue, stream)
      })
    ).flatten.toList

    val dirs = new ZKGroupDirs(config.groupId)
    // Register the consumer information in ZooKeeper
    registerConsumerInZK(dirs, consumerIdString, topicCount)
    reinitializeConsumer(topicCount, queuesAndStreams)

    // Return the KafkaStreams
    loadBalancerListener.kafkaMessageAndMetadataStreams.asInstanceOf[Map[String, List[KafkaStream[K,V]]]]
}

The core of this method is reinitializeConsumer(topicCount, queuesAndStreams), which mainly does the following:

(1) Put each topic's consumer thread id and its corresponding LinkedBlockingQueue into topicThreadIdAndQueues; the LinkedBlockingQueue is the queue that actually stores the data, and it is discussed in detail below.

(2) Register the sessionExpirationListener, which is invoked when the session expires and a new one is created:

def handleNewSession() {
  /**
   *  When we get a SessionExpired event, we lost all ephemeral nodes and zkclient has reestablished a
   *  connection for us. We need to release the ownership of the current consumer and re-register this
   *  consumer in the consumer registry and trigger a rebalance.
   */
  info("ZK expired; release old broker parition ownership; re-register consumer " + consumerIdString)
  // A new session was created, so clear the old topic registration state
  loadBalancerListener.resetState()
  // Re-register this consumer in ZooKeeper
  registerConsumerInZK(dirs, consumerIdString, topicCount)
  // explicitly trigger load balancing (a rebalance) for this consumer
  loadBalancerListener.syncedRebalance()
  // There is no need to resubscribe to child and state changes.
  // The child change watchers will be set inside rebalance when we read the children list.
}

(3) Register the loadBalancerListener on /consumers/group/ids; when the children of that path change (i.e. consumers are added or removed), handleChildChange is called, which triggers syncedRebalance:

def syncedRebalance() {
  rebalanceLock synchronized {
    rebalanceTimer.time {
      for (i <- 0 until config.rebalanceMaxRetries) {
        if(isShuttingDown.get())  {
          return
        }
        info("begin rebalancing consumer " + consumerIdString + " try #" + i)
        var done = false
        var cluster: Cluster = null
        try {
          cluster = getCluster(zkClient)
          done = rebalance(cluster)
        } catch {
          case e: Throwable =>
            /** occasionally, we may hit a ZK exception because the ZK state is changing while we are iterating.
              * For example, a ZK node can disappear between the time we get all children and the time we try to get
              * the value of a child. Just let this go since another rebalance will be triggered.
              **/
            info("exception during rebalance ", e)
        }
        info("end rebalancing consumer " + consumerIdString + " try #" + i)
        if (done) {
          return
        } else {
          /* Here the cache is at a risk of being stale. To take future rebalancing decisions correctly, we should
           * clear the cache */
          info("Rebalancing attempt failed. Clearing the cache before the next rebalancing operation is triggered")
        }
        // The rebalance hit a problem, so temporarily stop the message-fetching threads to avoid duplicated data when retrying
        // stop all fetchers and clear all the queues to avoid data duplication
        closeFetchersForQueues(cluster, kafkaMessageAndMetadataStreams, topicThreadIdAndQueues.map(q => q._2))
        Thread.sleep(config.rebalanceBackoffMs)
      }
    }
  }

  throw new ConsumerRebalanceFailedException(consumerIdString + " can't rebalance after " + config.rebalanceMaxRetries +" retries")
}

The rebalance method mainly does the following:

<1> Stop the data-fetching threads, get the data queues previously set up for the topics, and clear them.

<2> Reassign thread ids to the partitions.

<3> Fetch each partition's latest offset and build a new PartitionTopicInfo(topic, partition, queue, consumedOffset, fetchedOffset, new AtomicInteger(config.fetchMessageMaxBytes), config.clientId), where queue is the data queue mentioned above and both consumedOffset and fetchedOffset are the partition's latest offset.

<4> Write the partitions' new consumer ownership information back to ZooKeeper.

<5> Recreate the fetcher threads for the partitions.

The full code:

private def rebalance(cluster: Cluster): Boolean = {
  val myTopicThreadIdsMap = TopicCount.constructTopicCount(
    group, consumerIdString, zkClient, config.excludeInternalTopics).getConsumerThreadIdsPerTopic
  // Get all brokers in the cluster for the rebalance
  val brokers = getAllBrokersInCluster(zkClient)
  if (brokers.size == 0) {
    // This can happen in a rare case when there are no brokers available in the cluster when the consumer is started.
    // We log an warning and register for child changes on brokers/id so that rebalance can be triggered when the brokers
    // are up.
    warn("no brokers found when trying to rebalance.")
    // Any change under /brokers/ids triggers loadBalancerListener.handleChildChange
    zkClient.subscribeChildChanges(ZkUtils.BrokerIdsPath, loadBalancerListener)
    true
  }
  else {
    /**
     * fetchers must be stopped to avoid data duplication, since if the current
     * rebalancing attempt fails, the partitions that are released could be owned by another consumer.
     * But if we don't stop the fetchers first, this consumer would continue returning data for released
     * partitions in parallel. So, not stopping the fetchers leads to duplicate data.
     */
    // fetch the message queues of the given topics and clear them to avoid duplicated data
    closeFetchers(cluster, kafkaMessageAndMetadataStreams, myTopicThreadIdsMap)
    if (consumerRebalanceListener != null) {
      info("Invoking rebalance listener before relasing partition ownerships.")
      consumerRebalanceListener.beforeReleasingPartitions(
        if (topicRegistry.size == 0)
          new java.util.HashMap[String, java.util.Set[java.lang.Integer]]
        else
          mapAsJavaMap(topicRegistry.map(topics =>
            topics._1 -> topics._2.keys
          ).toMap).asInstanceOf[java.util.Map[String, java.util.Set[java.lang.Integer]]]
      )
    }
    releasePartitionOwnership(topicRegistry)
    val assignmentContext = new AssignmentContext(group, consumerIdString, config.excludeInternalTopics, zkClient)
    // reassign thread ids to the partitions
    val globalPartitionAssignment = partitionAssignor.assign(assignmentContext)
    // the topic-partition -> threadIds mapping assigned to the current consumer id
    val partitionAssignment = globalPartitionAssignment.get(assignmentContext.consumerId)
    val currentTopicRegistry = new Pool[String, Pool[Int, PartitionTopicInfo]](
      valueFactory = Some((topic: String) => new Pool[Int, PartitionTopicInfo]))

    // fetch current offsets for all topic-partitions
    val topicPartitions = partitionAssignment.keySet.toSeq

    // fetch the current offset of each partition
    val offsetFetchResponseOpt = fetchOffsets(topicPartitions)

    if (isShuttingDown.get || !offsetFetchResponseOpt.isDefined)
      false
    else {
      val offsetFetchResponse = offsetFetchResponseOpt.get
      topicPartitions.foreach(topicAndPartition => {
        val (topic, partition) = topicAndPartition.asTuple
        val offset = offsetFetchResponse.requestInfo(topicAndPartition).offset
        val threadId = partitionAssignment(topicAndPartition)
        // store the topic/partition info into topicRegistry, attaching the queue that holds the data consumed from this partition
        addPartitionTopicInfo(currentTopicRegistry, partition, topic, offset, threadId)
      })

      /**
       * move the partition ownership here, since that can be used to indicate a truly successful rebalancing attempt
       * A rebalancing attempt is completed successfully only after the fetchers have been started correctly
       */
      // write the new partition ownership back into zookeeper
      if(reflectPartitionOwnershipDecision(partitionAssignment)) {
        allTopicsOwnedPartitionsCount = partitionAssignment.size

        partitionAssignment.view.groupBy { case(topicPartition, consumerThreadId) => topicPartition.topic }
                                  .foreach { case (topic, partitionThreadPairs) =>
          newGauge("OwnedPartitionsCount",
            new Gauge[Int] {
              def value() = partitionThreadPairs.size
            },
            ownedPartitionsCountMetricTags(topic))
        }

        topicRegistry = currentTopicRegistry
        // Invoke beforeStartingFetchers callback if the consumerRebalanceListener is set.
        if (consumerRebalanceListener != null) {
          info("Invoking rebalance listener before starting fetchers.")

          // Partition assignor returns the global partition assignment organized as a map of [TopicPartition, ThreadId]
          // per consumer, and we need to re-organize it to a map of [Partition, ThreadId] per topic before passing
          // to the rebalance callback.
          val partitionAssginmentGroupByTopic = globalPartitionAssignment.values.flatten.groupBy[String] {
            case (topicPartition, _) => topicPartition.topic
          }
          val partitionAssigmentMapForCallback = partitionAssginmentGroupByTopic.map({
            case (topic, partitionOwnerShips) =>
              val partitionOwnershipForTopicScalaMap = partitionOwnerShips.map({
                case (topicAndPartition, consumerThreadId) =>
                  topicAndPartition.partition -> consumerThreadId
              })
              topic -> mapAsJavaMap(collection.mutable.Map(partitionOwnershipForTopicScalaMap.toSeq:_*))
                .asInstanceOf[java.util.Map[java.lang.Integer, ConsumerThreadId]]
          })
          consumerRebalanceListener.beforeStartingFetchers(
            consumerIdString,
            mapAsJavaMap(collection.mutable.Map(partitionAssigmentMapForCallback.toSeq:_*))
          )
        }
        updateFetcher(cluster)
        true
      } else {
        false
      }
    }
  }
}

(3) Register topicPartitionChangeListener on /brokers/topics/topic; it fires when the topic's data changes:

def handleDataChange(dataPath : String, data: Object) {
  try {
    info("Topic info for path " + dataPath + " changed to " + data.toString + ", triggering rebalance")
    // queue up the rebalance event
    // wake the load-balancer listener to perform a rebalance
    loadBalancerListener.rebalanceEventTriggered()
    // There is no need to re-subscribe the watcher since it will be automatically
    // re-registered upon firing of this event by zkClient
  } catch {
    case e: Throwable => error("Error while handling topic partition change for data path " + dataPath, e )
  }
}

(4) Explicitly call loadBalancerListener.syncedRebalance(), i.e. run the rebalance method above to finish initializing the consumer.

That is all the preparation needed before consuming data. In one sentence: create a consumer thread for each partition of the subscribed topics. The next post will look at how the data is actually consumed.

Producing Data in a Kafka Cluster 2015-09-06T00:00:00+00:00 uohzoaix http://uohzoaix.github.io/2015/09/06/kafka集群之produce数据

Kafka produces data through the Producer class, which supports two modes: sync and async. As the names suggest, the sync mode processes each message as soon as it is produced, while the async mode puts produced messages into a queue and lets a dedicated thread (ProducerSendThread) process them in batches. In both modes the messages are ultimately handled by DefaultEventHandler's handle method.

In async mode there are three ways a produced message can be placed into the queue, depending on queue.enqueue.timeout.ms (a small sketch of the three modes follows the list):

1) queue.enqueue.timeout.ms = 0: offer the message to the queue and return true immediately; if the queue is full, return false immediately.

2) queue.enqueue.timeout.ms < 0: put the message into the queue; if the queue is full, block until space becomes available.

3) queue.enqueue.timeout.ms > 0: offer the message to the queue and return true; if the queue is full, wait up to queue.enqueue.timeout.ms for space to free up, and return false if the timeout expires first.
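
A minimal sketch of these three modes on top of a standard LinkedBlockingQueue (illustrative only, not the actual Producer code; enqueue, queue and timeoutMs are made-up names):

import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}

// illustrative only: mirrors the three enqueue behaviours described above
def enqueue[T](queue: LinkedBlockingQueue[T], event: T, timeoutMs: Long): Boolean = {
  if (timeoutMs == 0)
    queue.offer(event)                                    // non-blocking; false when the queue is full
  else if (timeoutMs < 0) {
    queue.put(event)                                      // block until space becomes available
    true
  } else
    queue.offer(event, timeoutMs, TimeUnit.MILLISECONDS)  // wait at most timeoutMs, then give up
}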

The ProducerSendThread keeps taking messages off the queue until it dequeues the shutdownCommand. Whenever the number of messages collected so far reaches batch.num.messages, or a poll returns null (the queue wait timed out), it immediately processes everything collected.
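
A rough sketch of that loop (illustrative, not the real ProducerSendThread; handle stands in for DefaultEventHandler.handle and queueTimeMs for the configured queue wait):

import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}
import scala.collection.mutable.ArrayBuffer

// illustrative batching loop: collect events and flush on batch size, poll timeout or shutdown
def processEvents[T <: AnyRef](queue: LinkedBlockingQueue[T], shutdownCommand: T,
                               batchSize: Int, queueTimeMs: Long)(handle: Seq[T] => Unit): Unit = {
  val batch = new ArrayBuffer[T]
  var running = true
  while (running) {
    val event = queue.poll(queueTimeMs, TimeUnit.MILLISECONDS)  // null when the wait times out
    if (event eq shutdownCommand) running = false
    else if (event != null) batch += event
    if (!running || event == null || batch.size >= batchSize) {
      if (batch.nonEmpty) handle(batch.toSeq)                   // hand the whole batch over for sending
      batch.clear()
    }
  }
}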

The actual processing is ultimately done by DefaultEventHandler.handle():

def handle(events: Seq[KeyedMessage[K,V]]) {
    // serialize the data: keyEncoder serializes the key, valueEncoder the value
    val serializedData = serialize(events)
    serializedData.foreach {
      keyed =>
        val dataSize = keyed.message.payloadSize
        producerTopicStats.getProducerTopicStats(keyed.topic).byteRate.mark(dataSize)
        producerTopicStats.getProducerAllTopicsStats.byteRate.mark(dataSize)
    }
    var outstandingProduceRequests = serializedData
    var remainingRetries = config.messageSendMaxRetries + 1
    val correlationIdStart = correlationId.get()
    debug("Handling %d events".format(events.size))
    // give up once the retry limit is exhausted
    while (remainingRetries > 0 && outstandingProduceRequests.size > 0) {
      topicMetadataToRefresh ++= outstandingProduceRequests.map(_.topic)
      // the topic metadata refresh interval has elapsed
      if (topicMetadataRefreshInterval >= 0 &&
          SystemTime.milliseconds - lastTopicMetadataRefreshTime > topicMetadataRefreshInterval) {
        // refresh the topic metadata
        CoreUtils.swallowError(brokerPartitionInfo.updateInfo(topicMetadataToRefresh.toSet, correlationId.getAndIncrement))
        // sendPartitionPerTopicCache remembers which partition the data was last sent to, so that for a while we keep sending to that partition instead of recomputing the modulo every time; clearing it periodically keeps the data evenly distributed
        sendPartitionPerTopicCache.clear()
        topicMetadataToRefresh.clear
        lastTopicMetadataRefreshTime = SystemTime.milliseconds
      }
      outstandingProduceRequests = dispatchSerializedData(outstandingProduceRequests)
      // if some data is still unsent, clear sendPartitionPerTopicCache so the retry may pick a different partition, again for even distribution
      if (outstandingProduceRequests.size > 0) {
        // back off before retrying the send
        info("Back off for %d ms before retrying send. Remaining retries = %d".format(config.retryBackoffMs, remainingRetries-1))
        // back off and update the topic metadata cache before attempting another send operation
        Thread.sleep(config.retryBackoffMs)
        // get topics of the outstanding produce requests and refresh metadata for those
        CoreUtils.swallowError(brokerPartitionInfo.updateInfo(outstandingProduceRequests.map(_.topic).toSet, correlationId.getAndIncrement))
        sendPartitionPerTopicCache.clear()
        remainingRetries -= 1
        producerStats.resendRate.mark()
      }
    }
    if(outstandingProduceRequests.size > 0) {
      producerStats.failedSendRate.mark()
      val correlationIdEnd = correlationId.get()
      error("Failed to send requests for topics %s with correlation ids in [%d,%d]"
        .format(outstandingProduceRequests.map(_.topic).toSet.mkString(","),
        correlationIdStart, correlationIdEnd-1))
      throw new FailedToSendMessageException("Failed to send messages after " + config.messageSendMaxRetries + " tries.", null)
    }
}

The method that actually sends the data is dispatchSerializedData:

private def dispatchSerializedData(messages: Seq[KeyedMessage[K,Message]]): Seq[KeyedMessage[K, Message]] = {
    val partitionedDataOpt = partitionAndCollate(messages)
    partitionedDataOpt match {
      case Some(partitionedData) =>
        val failedProduceRequests = new ArrayBuffer[KeyedMessage[K, Message]]
        for ((brokerid, messagesPerBrokerMap) <- partitionedData) {
          if (logger.isTraceEnabled) {
            messagesPerBrokerMap.foreach(partitionAndEvent =>
              trace("Handling event for Topic: %s, Broker: %d, Partitions: %s".format(partitionAndEvent._1, brokerid, partitionAndEvent._2)))
          }
          val messageSetPerBrokerOpt = groupMessagesToSet(messagesPerBrokerMap)
          messageSetPerBrokerOpt match {
            case Some(messageSetPerBroker) =>
              // send to the partitions hosted on this broker
              val failedTopicPartitions = send(brokerid, messageSetPerBroker)
              failedTopicPartitions.foreach(topicPartition => {
                messagesPerBrokerMap.get(topicPartition) match {
                    // data of the topic-partitions that failed to send is collected into failedProduceRequests
                  case Some(data) => failedProduceRequests.appendAll(data)
                  case None => // nothing
                }
              })
            case None => // failed to group messages
              messagesPerBrokerMap.values.foreach(m => failedProduceRequests.appendAll(m))
          }
        }
        failedProduceRequests
      case None => // failed to collate messages
        messages
    }
}

partitionAndCollate decides which partition of the topic each message goes to and groups the messages per broker (a simplified sketch of the partition-selection rule follows the code):

// group the messages into an ArrayBuffer per target topic-partition
def partitionAndCollate(messages: Seq[KeyedMessage[K,Message]]): Option[Map[Int, collection.mutable.Map[TopicAndPartition, Seq[KeyedMessage[K,Message]]]]] = {
    val ret = new HashMap[Int, collection.mutable.Map[TopicAndPartition, Seq[KeyedMessage[K,Message]]]]
    try {
      for (message <- messages) {
        // fetch all partitions of the topic
        val topicPartitionsList = getPartitionListForTopic(message)
        // decide which partition to use: if sendPartitionPerTopicCache already holds one (the refresh interval has not passed) return it, otherwise compute one (modulo) and cache it in sendPartitionPerTopicCache
        val partitionIndex = getPartition(message.topic, message.partitionKey, topicPartitionsList)
        val brokerPartition = topicPartitionsList(partitionIndex)

        // postpone the failure until the send operation, so that requests for other brokers are handled correctly
        val leaderBrokerId = brokerPartition.leaderBrokerIdOpt.getOrElse(-1)

        var dataPerBroker: HashMap[TopicAndPartition, Seq[KeyedMessage[K,Message]]] = null
        ret.get(leaderBrokerId) match {
          case Some(element) =>
            dataPerBroker = element.asInstanceOf[HashMap[TopicAndPartition, Seq[KeyedMessage[K,Message]]]]
          case None =>
            dataPerBroker = new HashMap[TopicAndPartition, Seq[KeyedMessage[K,Message]]]
            ret.put(leaderBrokerId, dataPerBroker)
        }

        // append the message to the buffer of the leader's topic-partition
        val topicAndPartition = TopicAndPartition(message.topic, brokerPartition.partitionId)
        var dataPerTopicPartition: ArrayBuffer[KeyedMessage[K,Message]] = null
        dataPerBroker.get(topicAndPartition) match {
          case Some(element) =>
            dataPerTopicPartition = element.asInstanceOf[ArrayBuffer[KeyedMessage[K,Message]]]
          case None =>
            dataPerTopicPartition = new ArrayBuffer[KeyedMessage[K,Message]]
            dataPerBroker.put(topicAndPartition, dataPerTopicPartition)
        }
        dataPerTopicPartition.append(message)
      }
      Some(ret)
    }catch {    // Swallow recoverable exceptions and return None so that they can be retried.
      case ute: UnknownTopicOrPartitionException => warn("Failed to collate messages by topic,partition due to: " + ute.getMessage); None
      case lnae: LeaderNotAvailableException => warn("Failed to collate messages by topic,partition due to: " + lnae.getMessage); None
      case oe: Throwable => error("Failed to collate messages by topic, partition due to: " + oe.getMessage); None
    }
}
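
The getPartition step referenced above can be pictured roughly like this (a simplified sketch, not the real implementation: the actual code also validates the partition id, only caches partitions that currently have a leader, and delegates keyed messages to the configured partitioner):

import scala.collection.mutable
import scala.util.Random

// illustrative: choose a partition index, caching the choice for key-less messages
val sendPartitionPerTopicCache = mutable.HashMap.empty[String, Int]

def getPartition(topic: String, partitionKey: AnyRef, numPartitions: Int): Int = {
  if (partitionKey eq null)
    // no key: stick to the cached partition until the cache is cleared, then pick a random one
    sendPartitionPerTopicCache.getOrElseUpdate(topic, Random.nextInt(numPartitions))
  else
    // keyed message: the default partitioner hashes the key modulo the partition count
    (partitionKey.hashCode & 0x7fffffff) % numPartitions
}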

When sending to a given broker, the topic-partitions whose data failed to send are collected so the data can be fetched again and retried:

// returns the topic-partitions that failed to send
private def send(brokerId: Int, messagesPerTopic: collection.mutable.Map[TopicAndPartition, ByteBufferMessageSet]) = {
    if(brokerId < 0) {
      warn("Failed to send data since partitions %s don't have a leader".format(messagesPerTopic.map(_._1).mkString(",")))
      messagesPerTopic.keys.toSeq
    } else if(messagesPerTopic.size > 0) {
      val currentCorrelationId = correlationId.getAndIncrement
      val producerRequest = new ProducerRequest(currentCorrelationId, config.clientId, config.requestRequiredAcks,
        config.requestTimeoutMs, messagesPerTopic)
      var failedTopicPartitions = Seq.empty[TopicAndPartition]
      try {
        // get the SyncProducer for this broker from the producer pool
        val syncProducer = producerPool.getProducer(brokerId)
        debug("Producer sending messages with correlation id %d for topics %s to broker %d on %s:%d"
          .format(currentCorrelationId, messagesPerTopic.keySet.mkString(","), brokerId, syncProducer.config.host, syncProducer.config.port))
        val response = syncProducer.send(producerRequest)
        debug("Producer sent messages with correlation id %d for topics %s to broker %d on %s:%d"
          .format(currentCorrelationId, messagesPerTopic.keySet.mkString(","), brokerId, syncProducer.config.host, syncProducer.config.port))
        if(response != null) {
          // incomplete response: some partitions are missing from it
          if (response.status.size != producerRequest.data.size)
            throw new KafkaException("Incomplete response (%s) for producer request (%s)".format(response, producerRequest))
          if (logger.isTraceEnabled) {
            val successfullySentData = response.status.filter(_._2.error == ErrorMapping.NoError)
            successfullySentData.foreach(m => messagesPerTopic(m._1).foreach(message =>
              trace("Successfully sent message: %s".format(if(message.message.isNull) null else message.message.toString()))))
          }
          // partitions whose data failed to send or was processed with an error
          val failedPartitionsAndStatus = response.status.filter(_._2.error != ErrorMapping.NoError).toSeq
          failedTopicPartitions = failedPartitionsAndStatus.map(partitionStatus => partitionStatus._1)
          if(failedTopicPartitions.size > 0) {
            val errorString = failedPartitionsAndStatus
              .sortWith((p1, p2) => p1._1.topic.compareTo(p2._1.topic) < 0 ||
                                    (p1._1.topic.compareTo(p2._1.topic) == 0 && p1._1.partition < p2._1.partition))
              .map{
                case(topicAndPartition, status) =>
                  topicAndPartition.toString + ": " + ErrorMapping.exceptionFor(status.error).getClass.getName
              }.mkString(",")
            warn("Produce request with correlation id %d failed due to %s".format(currentCorrelationId, errorString))
          }
          failedTopicPartitions
        } else {
          Seq.empty[TopicAndPartition]
        }
      } catch {
        case t: Throwable =>
          warn("Failed to send producer request with correlation id %d to broker %d with data for partitions %s"
            .format(currentCorrelationId, brokerId, messagesPerTopic.map(_._1).mkString(",")), t)
          messagesPerTopic.keys.toSeq
      }
    } else {
      List.empty
    }
}

At this point the produce path is complete. Note that nothing has touched disk yet: the batched requests have merely been written to the socket (BlockingChannel). The actual persistence happens once the broker's network layer receives the request and hands it to the appropriate classes (the details will be covered in the post on KafkaApis).

KafkaController: Being Re-elected as Leader 2015-09-01T00:00:00+00:00 uohzoaix http://uohzoaix.github.io/2015/09/01/kafka集群之controllerResignation

The earlier post on the leader election process mentioned that when an exception occurs during the election, a new election is held; this fires LeaderChangeListener's handleDataDeleted. If the current broker was the leader at that moment, it first calls back onControllerResignation (via onResigningAsLeader) before taking part in the election again:

// called when the current broker gives up the controller role (before possibly being re-elected)
def onControllerResignation() {
    // de-register listeners
    deregisterReassignedPartitionsListener()
    deregisterPreferredReplicaElectionListener()

    // shutdown delete topic manager
    if (deleteTopicManager != null)
      deleteTopicManager.shutdown()

    // shutdown leader rebalance scheduler
    if (config.autoLeaderRebalanceEnable)
      autoRebalanceScheduler.shutdown()

    inLock(controllerContext.controllerLock) {
      // de-register partition ISR listener for on-going partition reassignment task
      deregisterReassignedPartitionsIsrChangeListeners()
      // shutdown partition state machine
      partitionStateMachine.shutdown()
      // shutdown replica state machine
      replicaStateMachine.shutdown()
      // shutdown controller channel manager
      if(controllerContext.controllerChannelManager != null) {
        controllerContext.controllerChannelManager.shutdown()
        controllerContext.controllerChannelManager = null
      }
      // reset controller context
      controllerContext.epoch=0
      controllerContext.epochZkVersion=0
      brokerState.newState(RunningAsBroker)

      info("Broker %d resigned as the controller".format(config.brokerId))
    }
}

This method mainly does the following:

1. Remove the PartitionsReassignedListener from /admin/reassign_partitions.

2. Remove the PreferredReplicaElectionListener from /admin/preferred_replica_election.

3. Remove the ReassignedPartitionsIsrChangeListener from /brokers/topics/topic/partitions/partitionId/state.

The roles of these listeners were described in the previous post.

Once the state left over from the previous leadership has been cleared, elect is called again during the next leader election:

def handleDataDeleted(dataPath: String) {
      inLock(controllerContext.controllerLock) {
        debug("%s leader change listener fired for path %s to handle data deleted: trying to elect as a leader"
          .format(brokerId, dataPath))
        if(amILeader)
          onResigningAsLeader() // this broker was the leader, so resign before re-electing
        elect
      }
}

If this broker wins the election again, elect calls back onControllerFailover, which re-creates the necessary data structures and re-registers the listeners on the relevant paths for the new leader (see the previous post for details).

KafkaController: Leader Election Succeeded 2015-08-27T00:00:00+00:00 uohzoaix http://uohzoaix.github.io/2015/08/27/kafka集群之controllerFailOver

The previous post described how the leader of a Kafka cluster is elected. Once a broker wins the election it calls onControllerFailover, which handles the following:

1. Read the epoch and version that the previous controller wrote to ZooKeeper.

2. Increment the epoch read in step 1 and write it back to ZooKeeper, so that other brokers can tell a controller has already been elected (by comparing the epoch in ZooKeeper with their own).

3. Register PartitionsReassignedListener on /admin/reassign_partitions. This listener watches for data changes on that path; the data is written by the ReassignPartitionsCommand command line tool in a format like {version:1,partitions:{topic:xxx,partition:1,replicas:[1,2]}}. Once data is written there, the listener fires handleDataChange:

def handleDataChange(dataPath: String, data: Object) {
    debug("Partitions reassigned listener fired for path %s. Record partitions to be reassigned %s"
      .format(dataPath, data))
    val partitionsReassignmentData = ZkUtils.parsePartitionReassignmentData(data.toString)
    // from the requested partitions, drop those whose reassignment is already in progress
    val partitionsToBeReassigned = inLock(controllerContext.controllerLock) {
      partitionsReassignmentData.filterNot(p => controllerContext.partitionsBeingReassigned.contains(p._1))
    }
    partitionsToBeReassigned.foreach { partitionToBeReassigned =>
      inLock(controllerContext.controllerLock) {
        // skip the partition if its topic is queued for deletion
        if(controller.deleteTopicManager.isTopicQueuedUpForDeletion(partitionToBeReassigned._1.topic)) {
          error("Skipping reassignment of partition %s for topic %s since it is currently being deleted"
            .format(partitionToBeReassigned._1, partitionToBeReassigned._1.topic))
          controller.removePartitionFromReassignedPartitions(partitionToBeReassigned._1)
        } else {
          val context = new ReassignedPartitionsContext(partitionToBeReassigned._2)
          controller.initiateReassignReplicasForTopicPartition(partitionToBeReassigned._1, context)
        }
      }
    }
}

removePartitionFromReassignedPartitions removes the given partition from /admin/reassign_partitions and writes the remaining set of partitions still being reassigned back to that path:

def removePartitionFromReassignedPartitions(topicAndPartition: TopicAndPartition) {
    if(controllerContext.partitionsBeingReassigned.get(topicAndPartition).isDefined) {
      // stop watching the ISR changes for this partition
      zkClient.unsubscribeDataChanges(ZkUtils.getTopicPartitionLeaderAndIsrPath(topicAndPartition.topic, topicAndPartition.partition),
        controllerContext.partitionsBeingReassigned(topicAndPartition).isrChangeListener)
    }
    // read the current list of reassigned partitions from zookeeper
    // /admin/reassign_partitions holds the partitions whose replicas are currently being reassigned; a partition is removed once its reassignment completes
    val partitionsBeingReassigned = ZkUtils.getPartitionsBeingReassigned(zkClient)
    // remove this partition from that list
    // drop this partition from the in-progress set
    val updatedPartitionsBeingReassigned = partitionsBeingReassigned - topicAndPartition
    // write the new list to zookeeper
    // if nothing is being reassigned any more the path is deleted outright, otherwise its data is updated
    ZkUtils.updatePartitionReassignmentData(zkClient, updatedPartitionsBeingReassigned.mapValues(_.newReplicas))
    // update the cache. NO-OP if the partition's reassignment was never started
    controllerContext.partitionsBeingReassigned.remove(topicAndPartition)
}

initiateReassignReplicasForTopicPartition decides whether the partition really needs reassignment: replicas are reassigned only when the requested replica set differs from the one previously assigned to the partition and all of the new replicas are alive. The reassignment itself is carried out by onPartitionReassignment:

def onPartitionReassignment(topicAndPartition: TopicAndPartition, reassignedPartitionContext: ReassignedPartitionsContext) {
    val reassignedReplicas = reassignedPartitionContext.newReplicas
    // check whether all of the requested replicas are already in the partition's ISR
    areReplicasInIsr(topicAndPartition.topic, topicAndPartition.partition, reassignedReplicas) match {
      case false =>
        // some requested replicas are not in the ISR yet, so the assignment data must be updated first
        info("New replicas %s for partition %s being ".format(reassignedReplicas.mkString(","), topicAndPartition) +
          "reassigned not yet caught up with the leader")
        // replicas newly added by this reassignment
        val newReplicasNotInOldReplicaList = reassignedReplicas.toSet -- controllerContext.partitionReplicaAssignment(topicAndPartition).toSet
        // the union of old and new replicas
        val newAndOldReplicas = (reassignedPartitionContext.newReplicas ++ controllerContext.partitionReplicaAssignment(topicAndPartition)).toSet
        //1. Update AR in ZK with OAR + RAR.
        updateAssignedReplicasForPartition(topicAndPartition, newAndOldReplicas.toSeq)
        //2. Send LeaderAndIsr request to every replica in OAR + RAR (with AR as OAR + RAR).
        // send a LeaderAndIsr update request to every replica
        updateLeaderEpochAndSendRequest(topicAndPartition, controllerContext.partitionReplicaAssignment(topicAndPartition),
          newAndOldReplicas.toSeq)
        //3. replicas in RAR - OAR -> NewReplica
        // move the newly added replicas to the NewReplica state
        startNewReplicasForReassignedPartition(topicAndPartition, reassignedPartitionContext, newReplicasNotInOldReplicaList)
        info("Waiting for new replicas %s for partition %s being ".format(reassignedReplicas.mkString(","), topicAndPartition) +
          "reassigned to catch up with the leader")
      case true =>
        // all requested replicas are already in the ISR
        //4. Wait until all replicas in RAR are in sync with the leader.
        // old replicas that are no longer part of the assignment
        val oldReplicas = controllerContext.partitionReplicaAssignment(topicAndPartition).toSet -- reassignedReplicas.toSet
        //5. replicas in RAR -> OnlineReplica
        // move the reassigned replicas to the OnlineReplica state
        reassignedReplicas.foreach { replica =>
          replicaStateMachine.handleStateChanges(Set(new PartitionAndReplica(topicAndPartition.topic, topicAndPartition.partition,
            replica)), OnlineReplica)
        }
        //6. Set AR to RAR in memory.
        //7. Send LeaderAndIsr request with a potential new leader (if current leader not in RAR) and
        //   a new AR (using RAR) and same isr to every broker in RAR
        // cache the new replica assignment and send a LeaderAndIsr update to every replica;
        // if the current leader is not among the reassigned replicas, a new leader is elected from them
        moveReassignedPartitionLeaderIfRequired(topicAndPartition, reassignedPartitionContext)
        //8. replicas in OAR - RAR -> Offline (force those replicas out of isr)
        //9. replicas in OAR - RAR -> NonExistentReplica (force those replicas to be deleted)
        // move the replicas to be removed to the NonExistentReplica state
        stopOldReplicasOfReassignedPartition(topicAndPartition, reassignedPartitionContext, oldReplicas)
        //10. Update AR in ZK with RAR.
        // update the assignment data in zookeeper
        updateAssignedReplicasForPartition(topicAndPartition, reassignedReplicas)
        //11. Update the /admin/reassign_partitions path in ZK to remove this partition.
        // once done, update /admin/reassign_partitions; the path is deleted if no partition is left to reassign
        removePartitionFromReassignedPartitions(topicAndPartition)
        info("Removed partition %s from the list of reassigned partitions in zookeeper".format(topicAndPartition))
        controllerContext.partitionsBeingReassigned.remove(topicAndPartition)
        //12. After electing leader, the replicas and isr information changes, so resend the update metadata request to every broker
        // send an UpdateMetadataRequest to every broker
        sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq, Set(topicAndPartition))
        // signal delete topic thread if reassignment for some partitions belonging to topics being deleted just completed
        deleteTopicManager.resumeDeletionForTopics(Set(topicAndPartition.topic))
    }
}

4. Register PreferredReplicaElectionListener on /admin/preferred_replica_election. The data on that path is written by the PreferredReplicaLeaderElectionCommand command line tool, which triggers a fresh leader-replica election for the listed partitions; the data format looks like {partitions:[{topic:foo,partition:1},{topic:foobar,partition:2}]}. When the data on /admin/preferred_replica_election changes, PreferredReplicaElectionListener's handleDataChange fires:

def handleDataChange(dataPath: String, data: Object) {
    debug("Preferred replica election listener fired for path %s. Record partitions to undergo preferred replica election %s"
            .format(dataPath, data.toString))
    inLock(controllerContext.controllerLock) {
      val partitionsForPreferredReplicaElection = PreferredReplicaLeaderElectionCommand.parsePreferredReplicaElectionData(data.toString)
      // partitions already undergoing preferred replica election
      if(controllerContext.partitionsUndergoingPreferredReplicaElection.size > 0)
        info("These partitions are already undergoing preferred replica election: %s"
          .format(controllerContext.partitionsUndergoingPreferredReplicaElection.mkString(",")))
      // drop the partitions whose election is already in progress
      val partitions = partitionsForPreferredReplicaElection -- controllerContext.partitionsUndergoingPreferredReplicaElection
      // also drop partitions whose topic is queued for deletion
      val partitionsForTopicsToBeDeleted = partitions.filter(p => controller.deleteTopicManager.isTopicQueuedUpForDeletion(p.topic))
      if(partitionsForTopicsToBeDeleted.size > 0) {
        error("Skipping preferred replica election for partitions %s since the respective topics are being deleted"
          .format(partitionsForTopicsToBeDeleted))
      }
      controller.onPreferredReplicaElection(partitions -- partitionsForTopicsToBeDeleted)
    }
}

onPreferredReplicaElection is defined as:

def onPreferredReplicaElection(partitions: Set[TopicAndPartition], isTriggeredByAutoRebalance: Boolean = false) {
    info("Starting preferred replica leader election for partitions %s".format(partitions.mkString(",")))
    try {
      // record these partitions as undergoing preferred replica election
      controllerContext.partitionsUndergoingPreferredReplicaElection ++= partitions
      // mark these topics as temporarily ineligible for deletion
      deleteTopicManager.markTopicIneligibleForDeletion(partitions.map(_.topic))
      // move the partitions to OnlinePartition and elect a leader for each of them; the selector simply takes the first replica in the assigned replica list as the leader
      partitionStateMachine.handleStateChanges(partitions, OnlinePartition, preferredReplicaPartitionLeaderSelector)
    } catch {
      case e: Throwable => error("Error completing preferred replica leader election for partitions %s".format(partitions.mkString(",")), e)
    } finally {
      removePartitionsFromPreferredReplicaElection(partitions, isTriggeredByAutoRebalance)
      deleteTopicManager.resumeDeletionForTopics(partitions.map(_.topic))
    }
}

The logic is straightforward: elect a leader replica for each of the given partitions, and the election simply takes the first replica of the assigned list (the preferred replica) as the leader.
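
A minimal sketch of that rule (the real PreferredReplicaPartitionLeaderSelector additionally checks that the preferred replica's broker is alive and that the replica is in the ISR before making it leader; the names below are illustrative):

// illustrative: the preferred replica is simply the head of the assigned replica list
def pickPreferredLeader(assignedReplicas: Seq[Int], liveBrokers: Set[Int], isr: Set[Int]): Option[Int] = {
  val preferred = assignedReplicas.head
  if (liveBrokers.contains(preferred) && isr.contains(preferred)) Some(preferred)
  else None  // otherwise the election for this partition fails and the current leader is kept
}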

5. Register TopicChangeListener and DeleteTopicsListener (the latter actually watches /admin/delete_topics). Whenever the set of topics changes, TopicChangeListener's handleChildChange fires:

def handleChildChange(parentPath : String, children : java.util.List[String]) {
  inLock(controllerContext.controllerLock) {
    if (hasStarted.get) {
      try {
        // the current set of topics in zookeeper
        val currentChildren = {
          import JavaConversions._
          debug("Topic change listener fired for path %s with children %s".format(parentPath, children.mkString(",")))
          (children: Buffer[String]).toSet
        }
        // newly created topics
        val newTopics = currentChildren -- controllerContext.allTopics
        // deleted topics
        val deletedTopics = controllerContext.allTopics -- currentChildren
        controllerContext.allTopics = currentChildren

        // fetch the replica assignment of the newly created topics
        val addedPartitionReplicaAssignment = ZkUtils.getReplicaAssignmentForTopics(zkClient, newTopics.toSeq)
        // refresh partitionReplicaAssignment: drop deleted topics, then add the new assignments
        controllerContext.partitionReplicaAssignment = controllerContext.partitionReplicaAssignment.filter(p =>
          !deletedTopics.contains(p._1.topic))
        controllerContext.partitionReplicaAssignment.++=(addedPartitionReplicaAssignment)
        info("New topics: [%s], deleted topics: [%s], new partition replica assignment [%s]".format(newTopics,
          deletedTopics, addedPartitionReplicaAssignment))
        if(newTopics.size > 0)
          // register a partition-change listener for each new topic and move its partitions to OnlinePartition
          controller.onNewTopicCreation(newTopics, addedPartitionReplicaAssignment.keySet.toSet)
      } catch {
        case e: Throwable => error("Error while handling new topic", e )
      }
    }
  }
}

When a topic is deleted (through the TopicCommand command line tool), say topic xxx, the path /admin/delete_topics/xxx is created in ZooKeeper, which triggers DeleteTopicsListener's handleChildChange:

def handleChildChange(parentPath : String, children : java.util.List[String]) {
  inLock(controllerContext.controllerLock) {
    // topics requested for deletion
    var topicsToBeDeleted = {
      import JavaConversions._
      (children: Buffer[String]).toSet
    }
    debug("Delete topics listener fired for topics %s to be deleted".format(topicsToBeDeleted.mkString(",")))
    val nonExistentTopics = topicsToBeDeleted.filter(t => !controllerContext.allTopics.contains(t))
    if(nonExistentTopics.size > 0) {
      warn("Ignoring request to delete non-existing topics " + nonExistentTopics.mkString(","))
      // for topics that do not exist, simply delete their deletion path
      nonExistentTopics.foreach(topic => ZkUtils.deletePathRecursive(zkClient, ZkUtils.getDeleteTopicPath(topic)))
    }
    topicsToBeDeleted --= nonExistentTopics
    if(topicsToBeDeleted.size > 0) {
      info("Starting topic deletion for topics " + topicsToBeDeleted.mkString(","))
      // mark topic ineligible for deletion if other state changes are in progress
      topicsToBeDeleted.foreach { topic =>
        // the topic has a preferred replica election in progress
        val preferredReplicaElectionInProgress =
          controllerContext.partitionsUndergoingPreferredReplicaElection.map(_.topic).contains(topic)
        // the topic has a partition reassignment in progress
        val partitionReassignmentInProgress =
          controllerContext.partitionsBeingReassigned.keySet.map(_.topic).contains(topic)
        if(preferredReplicaElectionInProgress || partitionReassignmentInProgress)
          // postpone deletion of this topic
          controller.deleteTopicManager.markTopicIneligibleForDeletion(Set(topic))
      }
      // add topic to deletion list
      // enqueue the topics for deletion and wake up the TopicDeletionThread
      controller.deleteTopicManager.enqueueTopicsForDeletion(topicsToBeDeleted)
    }
  }
}

The DeleteTopicsThread does three main things:

(1) Send an UpdateMetadata request to all brokers so they stop serving requests for the topics being deleted.

(2) Move the topics' replicas to the OfflineReplica state. This sends a StopReplicaRequest to those replicas and a LeaderAndIsrRequest to the leader replica; if the leader replica is itself taken offline, the leader is set to -1.

(3) Move the replicas to the ReplicaDeletionStarted state, which sends a StopReplicaRequest to the brokers so that the replicas' local data is deleted.

The main code:

private def startReplicaDeletion(replicasForTopicsToBeDeleted: Set[PartitionAndReplica]) {
    replicasForTopicsToBeDeleted.groupBy(_.topic).foreach { case(topic, replicas) =>
      // replicas of this topic that are still alive
      var aliveReplicasForTopic = controllerContext.allLiveReplicas().filter(p => p.topic.equals(topic))
      // dead replicas
      val deadReplicasForTopic = replicasForTopicsToBeDeleted -- aliveReplicasForTopic
      // replicas already deleted successfully
      val successfullyDeletedReplicas = controller.replicaStateMachine.replicasInState(topic, ReplicaDeletionSuccessful)
      // replicas still to be deleted (or retried)
      val replicasForDeletionRetry = aliveReplicasForTopic -- successfullyDeletedReplicas
      // move dead replicas directly to failed state
      replicaStateMachine.handleStateChanges(deadReplicasForTopic, ReplicaDeletionIneligible)
      // send stop replica to all followers that are not in the OfflineReplica state so they stop sending fetch requests to the leader
      // move the replicas to be deleted to the OfflineReplica state
      replicaStateMachine.handleStateChanges(replicasForDeletionRetry, OfflineReplica)
      debug("Deletion started for replicas %s".format(replicasForDeletionRetry.mkString(",")))
      controller.replicaStateMachine.handleStateChanges(replicasForDeletionRetry, ReplicaDeletionStarted,
        new Callbacks.CallbackBuilder().stopReplicaCallback(deleteTopicStopReplicaCallback).build)
      if(deadReplicasForTopic.size > 0) {
        debug("Dead Replicas (%s) found for topic %s".format(deadReplicasForTopic.mkString(","), topic))
        markTopicIneligibleForDeletion(Set(topic))
      }
    }
}

6. Register BrokerChangeListener on /brokers/ids. The data under /brokers/ids is written by KafkaHealthcheck, whose startup method is called from KafkaServer and registers the current broker id under that path. Whenever brokers join or leave, the children of /brokers/ids change and the listener's handleChildChange fires:

def handleChildChange(parentPath : String, currentBrokerList : java.util.List[String]) {
  info("Broker change listener fired for path %s with children %s".format(parentPath, currentBrokerList.mkString(",")))
  inLock(controllerContext.controllerLock) {
    if (hasStarted.get) {
      ControllerStats.leaderElectionTimer.time {
        try {
          val curBrokerIds = currentBrokerList.map(_.toInt).toSet
          // newly added brokers
          val newBrokerIds = curBrokerIds -- controllerContext.liveOrShuttingDownBrokerIds
          val newBrokerInfo = newBrokerIds.map(ZkUtils.getBrokerInfo(zkClient, _))
          val newBrokers = newBrokerInfo.filter(_.isDefined).map(_.get)
          // brokers that have gone away
          val deadBrokerIds = controllerContext.liveOrShuttingDownBrokerIds -- curBrokerIds
          // refresh the in-memory broker list
          controllerContext.liveBrokers = curBrokerIds.map(ZkUtils.getBrokerInfo(zkClient, _)).filter(_.isDefined).map(_.get)
          info("Newly added brokers: %s, deleted brokers: %s, all live brokers: %s"
            .format(newBrokerIds.mkString(","), deadBrokerIds.mkString(","), controllerContext.liveBrokerIds.mkString(",")))
          // create channels to the new brokers for sending and receiving requests
          newBrokers.foreach(controllerContext.controllerChannelManager.addBroker(_))
          // remove the channels of the departed brokers
          deadBrokerIds.foreach(controllerContext.controllerChannelManager.removeBroker(_))
          if(newBrokerIds.size > 0)
            controller.onBrokerStartup(newBrokerIds.toSeq)
          if(deadBrokerIds.size > 0)
            controller.onBrokerFailure(deadBrokerIds.toSeq)
        } catch {
          case e: Throwable => error("Error while handling broker changes", e)
        }
      }
    }
  }
}

The onBrokerStartup and onBrokerFailure calls in this method are the important part: they are what let partitions and replicas move around as brokers join and leave the cluster (a tiny worked example of the broker-set arithmetic follows the two lists below).

onBrokerStartup mainly does the following:

(1) Send an UpdateMetadata request to the new brokers, which refreshes their in-memory state such as the partition and replica assignments.

(2) Move the replicas hosted on the new brokers to the OnlineReplica state.

(3) Move partitions that were previously left in the New or Offline state back to Online.

(4) Resume partition reassignments that involve the new brokers (onPartitionReassignment).

(5) Resume deletion of the to-be-deleted topics that have replicas on the new brokers.

onBrokerFailure mainly does the following:

(1) Move partitions whose leader was on the departed brokers to the Offline state.

(2) Move the replicas that lived on the departed brokers, for topics that are not being deleted, to the Offline state.

(3) Delete the topics that are marked for deletion.
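
The added/removed broker sets that decide which of these two callbacks run are plain set differences over the children of /brokers/ids, as in handleChildChange above. A tiny worked example with made-up broker ids:

// illustrative: previous live set is {1, 2, 3}, the current children of /brokers/ids are {2, 3, 4}
val liveOrShuttingDownBrokerIds = Set(1, 2, 3)
val curBrokerIds = Set(2, 3, 4)
val newBrokerIds  = curBrokerIds -- liveOrShuttingDownBrokerIds   // Set(4) -> onBrokerStartup(Seq(4))
val deadBrokerIds = liveOrShuttingDownBrokerIds -- curBrokerIds   // Set(1) -> onBrokerFailure(Seq(1))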

7. Initialize the ControllerContext: the currently live brokers, all topics, the partition/replica assignments, the leader and ISR of each partition, and so on:

private def initializeControllerContext() {
    // update controller cache with delete topic information
    controllerContext.liveBrokers = ZkUtils.getAllBrokersInCluster(zkClient).toSet
    controllerContext.allTopics = ZkUtils.getAllTopics(zkClient).toSet
    controllerContext.partitionReplicaAssignment = ZkUtils.getReplicaAssignmentForTopics(zkClient, controllerContext.allTopics.toSeq)
    controllerContext.partitionLeadershipInfo = new mutable.HashMap[TopicAndPartition, LeaderIsrAndControllerEpoch]
    controllerContext.shuttingDownBrokerIds = mutable.Set.empty[Int]
    // update the leader and isr cache for all existing partitions from Zookeeper
    // load the leader and ISR info for every partition from zookeeper into partitionLeadershipInfo
    updateLeaderAndIsrCache()
    // start the channel manager
    startChannelManager()
    // initialize partitionsUndergoingPreferredReplicaElection, keeping only the partitions whose election can still proceed
    initializePreferredReplicaElection()
    // initialize partitionsBeingReassigned, keeping only the partitions whose reassignment can still proceed
    initializePartitionReassignment()
    // initialize the TopicDeletionManager
    initializeTopicDeletion()
    info("Currently active brokers in the cluster: %s".format(controllerContext.liveBrokerIds))
    info("Currently shutting brokers in the cluster: %s".format(controllerContext.shuttingDownBrokerIds))
    info("Current list of topics in the cluster: %s".format(controllerContext.allTopics))
}

8. Register an AddPartitionsListener for every topic (path /brokers/topics/xxx); when a topic's data changes, handleDataChange fires:

def handleDataChange(dataPath : String, data: Object) {
  inLock(controllerContext.controllerLock) {
    try {
      info("Add Partition triggered " + data.toString + " for path " + dataPath)
      val partitionReplicaAssignment = ZkUtils.getReplicaAssignmentForTopics(zkClient, List(topic))
      // partitions newly added to the topic
      val partitionsToBeAdded = partitionReplicaAssignment.filter(p =>
        !controllerContext.partitionReplicaAssignment.contains(p._1))
      if(controller.deleteTopicManager.isTopicQueuedUpForDeletion(topic))
        error("Skipping adding partitions %s for topic %s since it is currently being deleted"
              .format(partitionsToBeAdded.map(_._1.partition).mkString(","), topic))
      else {
        if (partitionsToBeAdded.size > 0) {
          info("New partitions to be added %s".format(partitionsToBeAdded))
          // move the new partitions to OnlinePartition
          controller.onNewPartitionCreation(partitionsToBeAdded.keySet.toSet)
        }
      }
    } catch {
      case e: Throwable => error("Error while handling add partitions for data path " + dataPath, e )
    }
  }
}

9. Start the checkAndTriggerPartitionRebalance task. For each broker it checks what fraction of the partitions whose preferred replica (the first assigned replica) lives on that broker are currently led by some other broker; if that fraction exceeds the configured threshold, a preferred replica election is triggered for those partitions (a small numeric example follows the code):

private def checkAndTriggerPartitionRebalance(): Unit = {
    if (isActive()) {
      trace("checking need to trigger partition rebalance")
      // get all the active brokers
      var preferredReplicasForTopicsByBrokers: Map[Int, Map[TopicAndPartition, Seq[Int]]] = null
      inLock(controllerContext.controllerLock) {
        preferredReplicasForTopicsByBrokers =
          // e.g. assignments topic-0:[1,2], topic-1:[1,2,3], topic-2:[2,3]
          // group to 1:{topic-0:[1,2], topic-1:[1,2,3]}, 2:{topic-2:[2,3]}
          // i.e. group by the first element of each assigned replica list (the preferred replica)
          controllerContext.partitionReplicaAssignment.filterNot(p => deleteTopicManager.isTopicQueuedUpForDeletion(p._1.topic)).groupBy {
            case(topicAndPartition, assignedReplicas) => assignedReplicas.head
          }
      }
      debug("preferred replicas by broker " + preferredReplicasForTopicsByBrokers)
      // for each broker, check if a preferred replica election needs to be triggered
      preferredReplicasForTopicsByBrokers.foreach {
        // leaderBroker here is assignedReplicas.head, i.e. the preferred (first) replica
        case(leaderBroker, topicAndPartitionsForBroker) => {
          var imbalanceRatio: Double = 0
          var topicsNotInPreferredReplica: Map[TopicAndPartition, Seq[Int]] = null
          inLock(controllerContext.controllerLock) {
            // partitions whose current leader differs from the first assigned replica
            topicsNotInPreferredReplica =
              topicAndPartitionsForBroker.filter {
                case(topicPartition, replicas) => {
                  controllerContext.partitionLeadershipInfo.contains(topicPartition) &&
                  controllerContext.partitionLeadershipInfo(topicPartition).leaderAndIsr.leader != leaderBroker
                }
              }
            debug("topics not in preferred replica " + topicsNotInPreferredReplica)
            // number of partitions for which this broker holds the preferred replica
            val totalTopicPartitionsForBroker = topicAndPartitionsForBroker.size
            val totalTopicPartitionsNotLedByBroker = topicsNotInPreferredReplica.size
            imbalanceRatio = totalTopicPartitionsNotLedByBroker.toDouble / totalTopicPartitionsForBroker
            trace("leader imbalance ratio for broker %d is %f".format(leaderBroker, imbalanceRatio))
          }
          // check ratio and if greater than desired ratio, trigger a rebalance for the topic partitions
          // that need to be on this broker
          if (imbalanceRatio > (config.leaderImbalancePerBrokerPercentage.toDouble / 100)) {
            topicsNotInPreferredReplica.foreach {
              case(topicPartition, replicas) => {
                inLock(controllerContext.controllerLock) {
                  // do this check only if the broker is live and there are no partitions being reassigned currently
                  // and preferred replica election is not in progress
                  if (controllerContext.liveBrokerIds.contains(leaderBroker) &&
                      controllerContext.partitionsBeingReassigned.size == 0 &&
                      controllerContext.partitionsUndergoingPreferredReplicaElection.size == 0 &&
                      !deleteTopicManager.isTopicQueuedUpForDeletion(topicPartition.topic) &&
                      controllerContext.allTopics.contains(topicPartition.topic)) {
                    // trigger a preferred replica election for this partition
                    onPreferredReplicaElection(Set(topicPartition), true)
                  }
                }
              }
            }
          }
        }
      }
    }
}
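
As a concrete example of the ratio check above (assuming leader.imbalance.per.broker.percentage is left at its default of 10):

// illustrative numbers: broker 1 is the preferred replica for 8 partitions but currently leads only 6 of them
val totalTopicPartitionsForBroker = 8
val totalTopicPartitionsNotLedByBroker = 2
val imbalanceRatio = totalTopicPartitionsNotLedByBroker.toDouble / totalTopicPartitionsForBroker  // 0.25
val leaderImbalancePerBrokerPercentage = 10
// 0.25 > 0.10, so a preferred replica election is triggered for those 2 partitions
val needsRebalance = imbalanceRatio > leaderImbalancePerBrokerPercentage.toDouble / 100            // true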

That completes everything a broker does after being elected controller (leader). The next post covers what a broker does when, under certain circumstances, it has to go through the election again.

How the Kafka Cluster Leader (Controller) Is Elected 2015-08-20T00:00:00+00:00 uohzoaix http://uohzoaix.github.io/2015/08/20/kafka集群leader选举过程

When a Kafka cluster starts up, each KafkaServer starts its own KafkaController, so there is one per broker:

// at cluster startup every broker holds a KafkaController instance, but no leader has been elected yet
def startup() = {
    inLock(controllerContext.controllerLock) {
      info("Controller starting up");
      registerSessionExpirationListener()
      isRunning = true
      controllerElector.startup
      info("Controller startup complete")
    }
}

After controllerElector.startup is called, the cluster starts electing a leader through ZooKeeper:

def startup {
    inLock(controllerContext.controllerLock) {
      // any data change on electionPath notifies leaderChangeListener
      controllerContext.zkClient.subscribeDataChanges(electionPath, leaderChangeListener)
      elect
    }
}

LeaderChangeListener is defined as:

class LeaderChangeListener extends IZkDataListener with Logging {
    /**
     * Called when the leader information stored in zookeeper has changed. Record the new leader in memory
     * @throws Exception On any error.
     */
    @throws(classOf[Exception])
    def handleDataChange(dataPath: String, data: Object) {
      inLock(controllerContext.controllerLock) {
        leaderId = KafkaController.parseControllerId(data.toString)
        info("New leader is %d".format(leaderId))
      }
    }

    /**
     * Called when the leader information stored in zookeeper has been delete. Try to elect as the leader
     * @throws Exception
     *             On any error.
     */
    // if an error occurs during the election the path is deleted, which triggers this callback
    @throws(classOf[Exception])
    def handleDataDeleted(dataPath: String) {
      inLock(controllerContext.controllerLock) {
        debug("%s leader change listener fired for path %s to handle data deleted: trying to elect as a leader"
          .format(brokerId, dataPath))
        if(amILeader)
          onResigningAsLeader() // this broker was the leader, so resign before re-electing
        elect
      }
    }
}

The core elect method:

def elect: Boolean = {
    val timestamp = SystemTime.milliseconds.toString
    val electString = Json.encode(Map("version" -> 1, "brokerid" -> brokerId, "timestamp" -> timestamp))

    leaderId = getControllerID 
    /* 
     * We can get here during the initial startup and the handleDeleted ZK callback. Because of the potential race condition, 
     * it's possible that the controller has already been elected when we get here. This check will prevent the following 
     * createEphemeralPath method from getting into an infinite loop if this broker is already the controller.
     */
    if(leaderId != -1) {
       debug("Broker %d has been elected as leader, so stopping the election process.".format(leaderId))
       return amILeader
    }

    try {
      createEphemeralPathExpectConflictHandleZKBug(controllerContext.zkClient, electionPath, electString, brokerId,
        (controllerString : String, leaderId : Any) => KafkaController.parseControllerId(controllerString) == leaderId.asInstanceOf[Int],
        controllerContext.zkSessionTimeout)
      info(brokerId + " successfully elected as leader")
      leaderId = brokerId
      onBecomingLeader()
    } catch {
      case e: ZkNodeExistsException =>
        // If someone else has written the path, then
        leaderId = getControllerID 

        if (leaderId != -1)
          debug("Broker %d was elected as leader instead of broker %d".format(leaderId, brokerId))
        else
          warn("A leader has been elected but just resigned, this will result in another round of election")

      case e2: Throwable =>
        error("Error while electing or becoming leader on broker %d".format(brokerId), e2)
        resign()
    }
    amILeader
}

A broker simply writes its own information to the /controller path in ZooKeeper, which fires LeaderChangeListener.handleDataChange above and records that broker as the leader. The other brokers run elect as well; before writing to ZooKeeper they first read /controller, and if some broker has already won, they return immediately:

leaderId = getControllerID 
/* 
 * We can get here during the initial startup and the handleDeleted ZK callback. Because of the potential race condition, 
 * it's possible that the controller has already been elected when we get here. This check will prevent the following 
 * createEphemeralPath method from getting into an infinite loop if this broker is already the controller.
 */
if(leaderId != -1) {
   debug("Broker %d has been elected as leader, so stopping the election process.".format(leaderId))
   return amILeader
}

If no other broker has written to ZooKeeper yet, the broker writes its own data there and becomes the leader. If a Throwable is raised while writing to ZooKeeper, resign() is called:

def resign() = {
  leaderId = -1
  deletePath(controllerContext.zkClient, electionPath)
}

Deleting the path triggers LeaderChangeListener's handleDataDeleted, which starts a new round of election.

As you can see, there is nothing fancy about controller election in a Kafka cluster: whoever gets there first becomes the leader. Writing to ZooKeeper simply lets the other brokers know that a leader has already been elected and that they no longer need to take part.
