Background
A feature needed to switch to a certain interface automatically: a Vue watch listened to the gRPC connection state (delivered through a subscription) and then called an RPC to fetch a value. Occasionally the value could not be retrieved, and the logs showed an Unavailable error:
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp xxxx: connectex: No connection could be made because the target machine actively refused it."
Two gRPC servers run locally on separate ports: one serves streaming RPCs and the other serves simple (unary) RPCs. Once the streaming server had started, it triggered the client to fetch the value through the simple-mode server. Clearly, the client sent the request before that server was ready to accept connections. gRPC follows the fail fast philosophy: an RPC fails immediately when the Channel is in the TRANSIENT_FAILURE state.
- Fail Fast Philosophy: A software design and system architecture concept whose core idea is to stop the current operation and report an error as soon as an error or exceptional situation is encountered, rather than continuing to execute logic that could lead to more severe issues. The goal is to detect problems early and limit error propagation, which lowers the cost of debugging and maintenance. Examples include input validation, system startup checks, assertions, and unit tests. In gRPC, this philosophy shows up as RPC calls failing immediately when the network connection has a problem (such as a temporary interruption), rather than waiting for a long time. The benefit is that resources are released quickly and no time or computation is wasted on an unavailable connection, which improves the overall availability and stability of the system. The downside is poor fault tolerance, so it does not suit every scenario.
- Channel State: In gRPC, a Channel manages the connection between the client and the server. It has several states, such as IDLE, CONNECTING, and READY. A Channel in the TRANSIENT_FAILURE state is temporarily unable to connect but may recover later. In this state, following the fail fast principle, RPC calls fail immediately rather than waiting for the connection to recover.
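The relationship between channel state and fail fast can be sketched as a small lookup. This is a toy model, not grpc-go code: the State type, the admit helper, and the error value are hypothetical names introduced for illustration, and the SHUTDOWN case follows the gRPC documentation quoted in the next section.

```go
package main

import (
	"errors"
	"fmt"
)

// State mirrors the channel states named in the gRPC documentation.
// It is a stand-in for illustration, not the grpc-go connectivity type.
type State int

const (
	Idle State = iota
	Connecting
	Ready
	TransientFailure
	Shutdown
)

// errFailFast is a hypothetical error for an RPC rejected without waiting.
var errFailFast = errors.New("unavailable: channel cannot transmit the RPC")

// admit reports whether a default (fail fast) RPC issued in state s is
// rejected immediately. Only TRANSIENT_FAILURE and SHUTDOWN reject.
func admit(s State) error {
	switch s {
	case TransientFailure, Shutdown:
		return errFailFast
	default:
		return nil
	}
}

func main() {
	names := []string{"IDLE", "CONNECTING", "READY", "TRANSIENT_FAILURE", "SHUTDOWN"}
	for s := Idle; s <= Shutdown; s++ {
		fmt.Printf("%-17s fails fast: %v\n", names[s], admit(s) != nil)
	}
}
```

In the scenario above, the simple-mode server was not yet listening, so the client's channel was in TRANSIENT_FAILURE when the call was issued.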
Wait for Ready
If an RPC is issued but the channel is in TRANSIENT_FAILURE or SHUTDOWN states, the RPC is unable to be transmitted promptly. By default, gRPC implementations SHOULD fail such RPCs immediately. This is known as “fail fast,” but usage of the term is historical. RPCs SHOULD NOT fail as a result of the channel being in other states (CONNECTING, READY, or IDLE).
gRPC implementations MAY provide a per-RPC option to not fail RPCs as a result of the channel being in TRANSIENT_FAILURE state. Instead, the implementation queues the RPCs until the channel is READY. This is known as “wait for ready.” The RPCs SHOULD still fail before READY if there are unrelated reasons, such as the channel is SHUTDOWN or the RPC’s deadline is reached.
When wait for ready is set to true, an RPC issued while the Channel is in the TRANSIENT_FAILURE state does not fail immediately. It is queued until the Channel becomes READY and is then sent. Two points are important to note:
- When the context is canceled or its deadline is exceeded, the RPC still ends with an error, so it does not wait indefinitely.
- If the context has neither a deadline nor a cancellation path and the connection never recovers, the RPC can stay queued forever, which risks resource leaks and even OOM (Out of Memory).
Example
To apply this to every RPC, append the option in a client interceptor:
func interceptor(ctx context.Context, method string, req, reply interface{}, cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
	// Make every unary call wait for the channel to become READY.
	opts = append(opts, grpc.WaitForReady(true))
	return invoker(ctx, method, req, reply, cc, opts...)
}
In my case only one specific RPC needs to wait, and every other RPC should still fail fast, so I set the option on that call alone:
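res, err := client.GetSth(context.Background(), req, grpc.WaitForReady(true))
Note that context.Background() carries no deadline, so this call waits for as long as the server stays unreachable. Passing a context created with context.WithTimeout bounds the wait.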