My recommended IoT solution reliability checklist:
Ensure that the endpoint exposed by the IoT platform is highly available so that devices and external applications stay connected.
The IoT platform endpoint should be highly available to ensure that devices and external applications can always connect to it. Ways to implement:
- Load balancing: A load balancer distributes incoming traffic across multiple servers, ensuring no single server becomes overloaded. This improves the performance and reliability of the IoT platform.
- Multi-region deployment: Deploying the IoT platform in multiple regions can improve availability in case of a regional outage. If one region becomes unavailable, devices and applications can still connect to the platform in another region.
Expose a fallback endpoint and implement switching logic in devices to reduce outage risk during a major breakdown.
In case of a major breakdown of the primary endpoint, devices should be able to connect to a fallback endpoint (URL or IP address). Devices should automatically switch to the fallback endpoint if they cannot connect to the primary endpoint.
- Fallback endpoint: To increase resiliency, the fallback endpoint must not share underlying infrastructure with the primary one.
- Switching logic: Configure devices to automatically switch to the fallback endpoint if they cannot connect to the primary endpoint.
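The switching logic can be quite small. Below is a minimal Python sketch of one way a device might do it, assuming plain TCP connectivity is a good enough health signal. The function names, the retry count, and the example hostnames are all hypothetical, and a real device would likely add a delay or backoff between attempts.

```python
import socket
from typing import Any, Callable, Optional, Sequence, Tuple

Endpoint = Tuple[str, int]  # (host, port)


def tcp_connect(endpoint: Endpoint, timeout: float = 5.0) -> socket.socket:
    """Open a TCP connection to the endpoint; raises OSError on failure."""
    return socket.create_connection(endpoint, timeout=timeout)


def connect_with_fallback(
    endpoints: Sequence[Endpoint],
    connect: Callable[[Endpoint], Any] = tcp_connect,
    attempts_per_endpoint: int = 3,
) -> Tuple[Endpoint, Any]:
    """Try each endpoint in order and return (endpoint, connection) for the first success."""
    last_error: Optional[OSError] = None
    for endpoint in endpoints:
        for _ in range(attempts_per_endpoint):
            try:
                return endpoint, connect(endpoint)
            except OSError as exc:
                # Remember the failure and move on to the next attempt or endpoint.
                last_error = exc
    raise ConnectionError("all endpoints unreachable") from last_error
```

A device would call it with the primary endpoint first, for example `connect_with_fallback([("primary.example.com", 8883), ("fallback.example.net", 8883)])`, where both hostnames are placeholders.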
Use proper Quality of Service (QoS) levels to guarantee the delivery of essential messages and avoid overflows caused by non-critical telemetry messages while restoring communication.
QoS levels ensure that essential messages are delivered reliably. Send non-critical telemetry messages with a lower QoS level to avoid overflows during communication outages.
- The MQTT standard defines three QoS levels:
  - QoS 0: At most once delivery. Messages may be lost.
  - QoS 1: At least once delivery. Messages are guaranteed to be delivered at least once, but they may be duplicated.
  - QoS 2: Exactly once delivery. Messages are guaranteed to be delivered exactly once.
- Essential messages: Essential messages must be delivered reliably, even in the event of a communication outage. Examples of essential messages include alarms, alerts, and control commands.
- Non-critical telemetry messages: Non-critical telemetry messages are not essential for the system’s operation. Examples of non-critical telemetry messages include sensor readings and status updates.
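To make the split concrete, here is a small Python sketch of a device-side buffer that treats the two classes differently while the connection is down. It is an illustration under assumptions, not a prescribed design: the message kinds, the choice of QoS 1 for essential messages (QoS 2 may fit better where duplicates are unacceptable), and the buffer size are all hypothetical. Telemetry is kept in a bounded queue so that a long outage cannot cause an overflow on reconnect.

```python
from collections import deque
from typing import Deque, List, Tuple

# Hypothetical message kinds that this sketch treats as essential.
ESSENTIAL_KINDS = {"alarm", "alert", "command"}


def qos_for(kind: str) -> int:
    """Return QoS 1 for essential messages and QoS 0 for everything else."""
    return 1 if kind in ESSENTIAL_KINDS else 0


class OfflineBuffer:
    """Holds outbound messages while the device is disconnected."""

    def __init__(self, max_telemetry: int = 100) -> None:
        # Essential messages are never dropped by this buffer.
        self.essential: List[Tuple[str, bytes]] = []
        # deque(maxlen=...) discards the oldest entry once the limit is reached.
        self.telemetry: Deque[Tuple[str, bytes]] = deque(maxlen=max_telemetry)

    def add(self, kind: str, payload: bytes) -> None:
        if qos_for(kind) > 0:
            self.essential.append((kind, payload))
        else:
            self.telemetry.append((kind, payload))

    def drain(self) -> List[Tuple[str, bytes, int]]:
        """Return (kind, payload, qos) tuples, essential first, and empty the buffer."""
        out = [(kind, payload, qos_for(kind)) for kind, payload in self.essential]
        out += [(kind, payload, 0) for kind, payload in self.telemetry]
        self.essential.clear()
        self.telemetry.clear()
        return out
```

After reconnecting, the device would publish each drained tuple with the QoS value it carries, using whatever MQTT client it already has.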
Save inbound messages in reliable storage before processing to avoid losing data in case of backend application failure.
Save inbound messages in reliable storage, such as a message queue, before processing by the backend application to ensure that messages are not lost if the backend application fails.
- Reliable storage: Storage designed to protect data from loss or corruption. Examples of reliable storage include message queues and databases.
- Inbound messages: Inbound messages are messages transmitted from devices or external applications.
- Backend application: The backend application processes the inbound messages.
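The store-then-process pattern can be sketched in a few lines of Python. This example uses SQLite from the standard library as a stand-in for the reliable storage; a production system would more likely use a managed message queue or database. The table layout and the function names are hypothetical.

```python
import sqlite3
from typing import Callable


def open_inbox(path: str = "inbox.db") -> sqlite3.Connection:
    """Open (or create) the durable inbox table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS inbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload BLOB NOT NULL, "
        "processed INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()
    return conn


def receive(conn: sqlite3.Connection, payload: bytes) -> int:
    """Persist an inbound message before any processing and return its row id."""
    cur = conn.execute("INSERT INTO inbox (payload) VALUES (?)", (payload,))
    conn.commit()
    return cur.lastrowid


def process_pending(conn: sqlite3.Connection, handler: Callable[[bytes], object]) -> int:
    """Run handler on each unprocessed message; mark a message done only on success."""
    rows = conn.execute(
        "SELECT id, payload FROM inbox WHERE processed = 0 ORDER BY id"
    ).fetchall()
    done = 0
    for row_id, payload in rows:
        try:
            handler(payload)
        except Exception:
            # Leave this message and the ones after it stored for the next run.
            break
        conn.execute("UPDATE inbox SET processed = 1 WHERE id = ?", (row_id,))
        conn.commit()
        done += 1
    return done
```

Because a message is marked as processed only after the handler returns, a crash in the backend leaves it in storage and the next run picks it up again. The handler should therefore tolerate seeing the same message twice.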
Implement automatic credentials rotation to avoid connectivity loss with remote devices.
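Rotate credentials regularly to prevent unauthorized access to devices and strengthen the security posture of the IoT system. Automating the rotation also ensures that credentials are renewed before they expire, so remote devices do not lose connectivity. Automatic credentials rotation can be implemented using a certificate authority or a key management service.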
- Credentials: Credentials are the secrets used to authenticate devices and applications.
- Credentials rotation: Credentials rotation is the process of regularly changing the credentials used by devices and applications.
- Certificate authority: A certificate authority is a trusted third party that issues digital certificates.
- Key management service: A key management service provides secure storage and management of cryptographic keys.
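As a rough illustration, the core of automatic rotation is a periodic check that renews a credential before it expires. The Python sketch below assumes the device knows its credential's expiry time; `request_new_credential` is a hypothetical callback standing in for whatever the certificate authority or key management service actually provides, and the 30-day renewal window is an arbitrary example value.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional


def needs_rotation(
    not_after: datetime,
    now: datetime,
    renew_before: timedelta = timedelta(days=30),
) -> bool:
    """Return True when the credential expires within the renewal window or already has."""
    return now >= not_after - renew_before


def rotate_if_due(
    not_after: datetime,
    request_new_credential: Callable[[], datetime],
    now: Optional[datetime] = None,
    renew_before: timedelta = timedelta(days=30),
) -> datetime:
    """Return the expiry time of the credential in use after the check."""
    now = now or datetime.now(timezone.utc)
    if needs_rotation(not_after, now, renew_before):
        # Hypothetical callback: obtains and installs a new credential,
        # then returns its expiry time.
        return request_new_credential()
    return not_after
```

Running such a check on a schedule, well inside the renewal window, gives the device several chances to rotate before the old credential stops working.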
What would you add to this list?