RSS

A short introduction to Channelz

Channelz is a tool that provides comprehensive runtime info about connections at different levels in gRPC. It is designed to help debug live programs, which may be suffering from network, performance, configuration issues, etc. The gRFC provides a detailed explanation of channelz design and is the canonical reference for all channelz implementations across languages. The purpose of this blog is to familiarize readers with channelz service and how to use it for debugging issues. The context of this post is set in gRPC-Go, but the overall idea should be applicable across languages. At the time of writing, channelz is available for gRPC-Go and gRPC-Java. Support for C++ and wrapped languages is coming soon.

Let’s learn channelz through a simple example which uses channelz to help debug an issue. The helloworld example from our repo is slightly modified to set up a buggy scenario. You can find the full source code here: client, server.

Client setup: The client will make 100 SayHello RPCs to a specified target and load balance the workload with the round robin policy. Each call has a 150ms timeout. RPC responses and errors are logged for debugging purposes.

Running the program, we notice in the log that there are intermittent errors with error code DeadlineExceeded, as shown in Figure 1.

However, there’s no clue about what is causing the deadline exceeded error and there are many possibilities:

  • Network issue, for example: connection lost
  • Proxy issue, for example: dropped requests/responses in the middle
  • Server issue, for example: lost requests or just slow to respond

Figure 1. Program log screenshot

Let’s turn on grpc INFO logging for more debug info and see if we can find something helpful.

Figure 2. gRPC INFO log

As shown in Figure 2, the info log indicates that all three connections to the server are connected and ready for transmitting RPCs. No suspicious event shows up in the log. One thing that can be inferred from the info log is that all connections are up all the time, therefore the lost connection hypothesis can be ruled out.

To further narrow down the root cause of the issue, we will ask channelz for help.

Channelz provides gRPC internal networking machinery stats through a gRPC service. To enable channelz, users just need to register the channelz service to a gRPC server in their program and start the server. The code snippet below shows the API for registering channelz service to a grpc.Server. Note that this has already been done for our example client.

import "google.golang.org/grpc/channelz/service"

// s is a *grpc.Server
service.RegisterChannelzServiceToServer(s)

// call s.Serve() to serve channelz service

A web tool called grpc-zpages has been developed to conveniently serve channelz data through a web page. First, configure the web app to connect to the gRPC port that’s serving the channelz service (see instructions from the previous link). Then, open the channelz web page in the browser. You should see a web page like Figure 3. Now we can start querying channelz!

Figure 3. Channelz main page

As the error is on the client side, let’s first click on TopChannels. TopChannels is a collection of root channels which don’t have parents. In gRPC-Go, a top channel is a ClientConn created by the user through Dial or DialContext, and used for making RPC calls. Top channels are of Channel type in channelz, which is an abstraction of a connection that an RPC can be issued to.

Figure 4. TopChannels result

So we click on the TopChannels, and a page like Figure 4 appears, which lists all the live top channel(s) with related info.

As shown in Figure 5, there is only one top channel with id = 2 (Note that text in square brackets is the reference name of the in memory channel object, which may vary across languages).

Looking at the Data section, we can see there are 15 calls failed out of 100 on this channel.

Figure 5. Top Channel (id = 2)

On the right hand side, it shows the channel has no child Channels, 3 Subchannels (as highlighted in Figure 6), and 0 Sockets.

Figure 6. Subchannels owned by the Channel (id = 2)

A Subchannel is an abstraction over a connection and used for load balancing. For example, you want to send requests to “google.com”. The resolver resolves “google.com” to multiple backend addresses that serve “google.com”. In this example, the client is set up with the round robin load balancer, so all live backends are sent equal traffic. Then the (logical) connection to each backend is represented as a Subchannel. In gRPC-Go, a SubConn can be seen as a Subchannel.

The three Subchannels owned by the parent Channel means there are three connections to three different backends for sending the RPCs to. Let’s look inside each of them for more info.

So we click on the first Subchannel ID (i.e. “4[]") listed, and a page like Figure 7 renders. We can see that all calls on this Subchannel have succeeded. Thus it’s unlikely this Subchannel is related to the issue we are having.

Figure 7. Subchannel (id = 4)

So we go back, and click on Subchannel 5 (i.e. “5[]"). Again, the web page indicates that Subchannel 5 also never had any failed calls.

Figure 8. Subchannel (id = 6)

And finally, we click on Subchannel 6. This time, there’s something different. As we can see in Figure 8, there are 15 out of 34 RPC calls failed on this Subchannel. And remember that the parent Channel also has exactly 15 failed calls. Therefore, Subchannel 6 is where the issue comes from. The state of the Subchannel is READY, which means it is connected and is ready to transmit RPCs. That rules out network connection problems. To dig up more info, let’s look at the Socket owned by this Subchannel.

A Socket is roughly equivalent to a file descriptor, and can be generally regarded as the TCP connection between two endpoints. In grpc-go, http2Client and http2Server correspond to Socket. Note that a network listener is also considered a Socket, and will show up in the channelz Server info.

Figure 9. Subchannel (id = 6) owns Socket (id = 8)

We click on Socket 8, which is at the bottom of the page (see Figure 9). And we now see a page like Figure 10.

The page provides comprehensive info about the socket like the security mechanism in use, stream count, message count, keepalives, flow control numbers, etc. The socket options info is not shown in the screenshot, as there are a lot of them and not related to the issue we are investigating.

The Remote Address field suggests that the backend we are having a problem with is “127.0.0.1:10003”. The stream counts here correspond perfectly to the call counts of the parent Subchannel. From this, we can know that the server is not actively sending DeadlineExceeded errors. This is because if the DeadlineExceeded error is returned by the server, then the streams would all be successful. A client side stream’s success is independent of whether the call succeeds. It is determined by whether a HTTP2 frame with EOS bit set has been received (refer to the gRFC for more info). Also, we can see that the number of messages sent is 34, which equals the number of calls, and it rules out the possibility of the client being stuck somehow and results in deadline exceeded. In summary, we can narrow down the issue to the server which serves on 127.0.0.1:10003. It may be that the server is slow to respond, or some proxy in front of it is dropping requests.

Figure 10. Socket (id = 8)

As you see, channelz has helped us pinpoint the potential root cause of the issue with just a few clicks. You can now concentrate on what’s happening with the pinpointed server. And again, channelz may help expedite the debugging at the server side too.

We will stop here and let readers explore the server side channelz, which is simpler than the client side. In channelz, a Server is also an RPC entry point like a Channel, where incoming RPCs arrive and get processed. In grpc-go, a grpc.Server corresponds to a channelz Server. Unlike Channel, Server only has Sockets (both listen socket(s) and normal connected socket(s)) as its children.

Here are some hints for the readers:

  • Look for the server with the address (127.0.0.1:10003).
  • Look at the call counts.
  • Go to the Socket(s) owned by the server.
  • Look at the Socket stream counts and message counts.

You should notice that the number of messages received by the server socket is the same as sent by the client socket (Socket 8), which rules out the case of having a misbehaving proxy (dropping request) in the middle. And the number of messages sent by the server socket is equal to the messages received at client side, which means the server was not able to send back the response before deadline. You may now look at the server code to verify whether it is indeed the cause.

Server setup: The server side program starts up three GreeterServers, with two of them using an implementation (server) that imposes no delay when responding to the client, and one using an implementation (slowServer) which injects a variable delay of 100ms - 200ms before sending the response.

As you can see through this demo, channelz helped us quickly narrow down the possible causes of an issue and is easy to use. For more resources, see the detailed channelz gRFC. Find us on GitHub at github.com/grpc/grpc-go.