Erlang application stop order

An EMQX troubleshooting case study.

Key words: emqx, shutdown crash, shutdown order, race condition

EMQX is an open-source MQTT broker built on Erlang/OTP which can serve massive amount of TCP/TLS connections. The underlying library for listening and accepting MQTT connections is called esockd

The trouble

Often (especially under heavy load), when shutdown the broker, we may observe error logs like below.

2021-04-20 20:44:39.035 [error]     supervisor: 'esockd_connection_sup - <0.2384.0>'
    errorContext: connection_shutdown_error
    reason: {shutdown,
                {gen_server,call,
                    [<0.2355.0>,{subscribe,<<"device/ca80169898/+/">>}]}}
    offender: [{pid,[<0.17517.29>]},
               {name,connection},
               {mfargs,
                   {emqx_connection,start_link,
                       [[{deflate_options,[]},
                         {max_conn_rate,1000},
                         {active_n,100},
                         {zone,external}]]}}]

The only way to explain it, is that by the time when shutting down connections from esockd_connection_sup, the emqx_broker process is being shutdown (hence causing gen_server caller to EXIT with reason {shutdown, TheCall} .

To reason why such shutdown order may happen, we need to take a closer look at the emqxand esockd applications.

In EMQX, emqx app depends on esockd app, so esockd is start prior to emqx. Shutdown is, as expected, done in the reversed order: emqx is stopped before esockd

EMQX delegates its process supervision to esockd, in emqx_app:stop/1, it first tries to stop all esockd listeners (and supervised connection processes) like this:

 -spec(stop(State :: term()) -> term()).
 stop(_State) ->
     ok = emqx_alarm_handler:unload(),
     emqx_boot:is_enabled(listeners)
       andalso emqx_listeners:stop().

All should work as expected ? No, not really, from the injected io:format logs, it looks confusing: it seems that emqx_sup is stopped before socket showdown, but the logs are mixed.

(emqx@127.0.0.1)1> Stop dashboard listener on 0.0.0.0:18083 successfully.
Stop http:management listener on 0.0.0.0:8081 successfully.
2021-07-27T14:47:54.614910+02:00 [notice] stomp#1:tcp stopped on 0.0.0.0:61613
stopped all children esockd_connection_sup
Stop stomp stomp#1:tcp listener on 0.0.0.0:61613 successfully.
Stop mqttsn mqttsn#1:udp listener on 0.0.0.0:1884 successfully.
2021-07-27T14:47:54.616168+02:00 [notice] exproto#1:tcp stopped on 0.0.0.0:7993
stopped all children esockd_connection_sup
Stop exproto exproto#1:tcp listener on 0.0.0.0:7993 successfully.
terminated emqx_broker {broker_pool,40} # supervisor shutdown
terminated emqx_broker {broker_pool,39}
terminated emqx_broker {broker_pool,38}
....
emqx_app:stop/1 start
2021-07-27T14:47:54.650683+02:00 [notice] default:mqtt_ssl stopped on 0.0.0.0:8883
stopped all children esockd_connection_sup # esockd stop children done
2021-07-27T14:47:54.651288+02:00 [notice] default:mqtt_tcp stopped on 0.0.0.0:1883
stopped all children esockd_connection_sup
2021-07-27T14:47:54.652625+02:00 [notice] internal:mqtt_internal stopped on 127.0.0.1:11883
stopped all children esockd_connection_sup
emqx_app:stop/1 end
2021-07-27T14:47:54.655330+02:00 [info] [Modules] Unload emqx_mod_presence module successfully.
2021-07-27T14:47:54.655457+02:00 [info] [Modules] Unload emqx_mod_recon module successfully.
esockd_app:stop/1
[os_mon] memory supervisor port (memsup): Erlang has closed
[os_mon] cpu supervisor port (cpu_sup): Erlang has closed

Let's dig deeper:

In Erlang documentation

Module:stop(State)

This function is called whenever an application has stopped. It is intended to be the opposite of Module:start/2 and is to do any necessary cleaning up. The return value is ignored....

Confirm in source code: application_master.erl

loop_it(Parent, Child, Mod, AppState) ->
      receive
      {Parent, get_child, Ref} ->
          Parent ! {child, Ref, Child, Mod},
          loop_it(Parent, Child, Mod, AppState);
      {Parent, terminate} ->
          NewAppState = prep_stop(Mod, AppState),
          exit(Child, shutdown),
          receive
          {'EXIT', Child, _} -> ok
          end,
          catch Mod:stop(NewAppState),
          exit(normal);

Meaning, in EMQX, emqx_sup is stopped before the listeners.

The fix

The fix is simple, make use of the application behaviour's prep_stop/1 callback.

Try EMQX Cloud for Free
A fully managed, cloud-native MQTT service
Get Started →

Related Posts

Erlang VS Scala

Erlang is a general-purpose, concurrent, functional programming language, as well as a garbage-collected runtime system.

2018-09-12
Lightweight edge computing EMQX Kuiper and Azure IoT Hub integration solution

This article takes a common IoT usage scenario as an example and describes how to use edge computing to achieve fast, low-cost and efficient processing of business.

2019-11-20
MQTT for Elixir

This article introduces the MQTT protocol and its advantages, and demonstrates the use of MQTT in Elixir.

2022-01-19