Monitoring
Overview
System monitoring is an important part of maintaining a reliable system. Many open source packages are available for this purpose. They can track usage of CPU, network, and other resources; send messages when an error occurs; or shut down the system in an emergency. Examples of industry-standard systems are:
Nagios - https://www.nagios.org/
Ganglia - http://ganglia.info/
Zabbix - https://www.zabbix.com/
Cacti - https://www.cacti.net/
Example of a monitoring web page: https://advance.colorado.edu/computing/ClusterStatus
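The kind of check these packages perform can be illustrated with a short Python sketch that uses only the standard library. It is not taken from any of the packages listed above. The thresholds and the function names (check_load, check_disk, collect_alerts) are assumptions chosen for illustration, and os.getloadavg is available only on Unix-like systems.

```python
import os
import shutil

# Hypothetical thresholds; tune these for the actual system.
LOAD_PER_CPU_LIMIT = 1.5   # 1-minute load average divided by CPU count
DISK_USED_LIMIT = 0.90     # fraction of the filesystem in use


def check_load(load_1min, cpu_count, limit=LOAD_PER_CPU_LIMIT):
    """Return a warning string if the per-CPU load exceeds the limit, else None."""
    per_cpu = load_1min / max(cpu_count, 1)
    if per_cpu > limit:
        return f"high load: {per_cpu:.2f} per CPU (limit {limit})"
    return None


def check_disk(used_bytes, total_bytes, limit=DISK_USED_LIMIT):
    """Return a warning string if disk usage exceeds the limit, else None."""
    if total_bytes <= 0:
        return None  # nothing sensible to report for an empty filesystem
    fraction = used_bytes / total_bytes
    if fraction > limit:
        return f"disk nearly full: {fraction:.0%} used (limit {limit:.0%})"
    return None


def collect_alerts(path="/"):
    """Sample the local machine once and return a list of warning strings."""
    load_1min, _, _ = os.getloadavg()  # Unix-like systems only
    usage = shutil.disk_usage(path)
    candidates = (
        check_load(load_1min, os.cpu_count() or 1),
        check_disk(usage.used, usage.total),
    )
    return [alert for alert in candidates if alert]


if __name__ == "__main__":
    # A real monitoring system would send mail or page an operator here;
    # this sketch only prints the alerts.
    for alert in collect_alerts():
        print("ALERT:", alert)
```

A script like this could be run periodically, for example from cron. The packages listed above add what it lacks: history, graphing, notification, and monitoring of many hosts at once.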