Importance of TCP/IP Networking in Modern Mission Critical Applications

In the last two decades, Ethernet and the TCP/IP suite of protocols have become the de facto standards not only for home networks and the Internet, but also for mission-critical control, automation and protection systems used in the industrial and utility sectors. Originally used mostly for monitoring devices and systems, and for facilitating production monitoring and communication between departments, the network is becoming more and more essential to actual protection and automation functionality. This means that the network itself is becoming even more mission critical, and that stability and redundancy of communications are essential. This article will explore the criticality of the network, its initial planning, design, commissioning and installation, as well as the importance of having dedicated personnel to maintain and expand it as necessary.

The modern-day TCP/IP network can and should be viewed as the nervous system of an industrial or utility application, and should be treated with the same importance and respect. The network is often seen as just a secondary system that is not as important as the end devices connected to it. While this may be true at a high level, it is important to realize that the network services and supports all of these end devices, and as such is just as important as the end devices, if not more so. Having redundancy and backups for the end devices means nothing if the network does not allow said devices to communicate correctly with one another.

Due to this lack of understanding of the importance of the network, the wrong people are often put in charge of designing and/or maintaining it, leading to substandard implementations that are not working at full efficiency, are not stable or reliable enough, or are much more costly than they need to be. Alternatively, a good network could be designed but then not commissioned or monitored correctly, leading to a loss of efficiency in the future.

The initial planning and design of the network is one of the more critical phases. As with most systems, a communications network that is not planned correctly from the beginning can be inefficient and fail to provide the service, reliability and/or availability required for a mission-critical control or automation network. One can also end up with a bloated network that over-caters for future expansion and adaptation, meaning a very large capital expenditure that often is not fully utilized over time. Catering for expansion is important; however, these days this can be done using modular hardware that allows expansion via modules installed at a later stage, rather than by buying high-port-count (and high-cost) switches that are never expected to be fully populated with end devices.

A modular setup like this can also be used to cater for spares in a much more efficient way. Often on these networks, two or three general “categories” of switches may be identified, such as a small (low port count) switch, a larger switch (higher port count for end devices) and possibly a backbone switch (low port count but high bandwidth, connecting various sections of the network together). Modular switch options often allow modules to be shared between different switches, so one could hold one or two of each chassis type (small, large, backbone) on hand, with a selection of common modules which can be installed into any of the chassis at a moment’s notice to replace a failed switch or expand a network section. Similarly, switches with SFP (small form-factor pluggable) slots can be used, which accept SFP modules providing a variety of different copper and fibre options. A common option, for instance, could be to select a unit that comes with a set number of copper RJ45 ports and a set number of SFP slots. SFPs could then be kept separately and installed without delay when required, providing the relevant fibre or copper cable interface.

These types of replacement/spares strategies are cost-effective while still allowing network administrators to react promptly to any failure or change on the network, without having to wait through long procurement delays for new hardware. The procurement delay becomes non-critical, because a failed unit is replaced from on-hand stock and replacement stock is ordered while the network continues running. These strategies also allow for a much “slimmer” network that does not have a majority of unused ports, but which can still react very quickly to any requirement for expansion.

Once planned, designed and approved, the network still needs to be commissioned and configured for its required role. This again is an essential step that should not be underestimated. A proper and detailed network design means nothing if it is not implemented correctly. Configuration should be performed by qualified personnel in a controlled, unhurried fashion. Where possible, initial configuration and testing should be done in a lab environment rather than on the live system itself, especially where interruptions to the network could result in production impacts. This lab environment should match the planned final system as closely as possible from a logical point of view. This means that if software in a control room will speak with a device on site via a routed connection, this routed connection must be in place during testing, matching the site logically as closely as possible. Often a very well thought out design is wasted by an incorrect commissioning phase, leading to networks that do not fully match the design or have not been tested to identify previously unseen issues.

Proper testing at this stage is also critical, both to confirm that the configuration and commissioning were done correctly and to identify other possible issues as mentioned above. In this author’s experience, end systems are often tested across “flat” networks which do not run different VLANs, IP ranges, routers etc. This only confirms that the system itself works at a base level, yet the system is signed off on that basis. When the system is commissioned on site, it is often found that the site network is not as “flat” as the test network, and thus unseen issues can arise, especially when firewalls are in place between different sections of the network. As such, it is critical that testing be done on a closely matching logical network including routers, firewalls, correct end software etc. Rectifying an issue, whether major or minor, is much simpler on a controlled test system than on a live system, where troubleshooting is often simply impossible without arranging an entire shutdown of the site. Ensuring that the initial commissioning is done properly (including regular configuration backups, which are often overlooked) and making sure it matches the documented design makes future troubleshooting much simpler and less intrusive.
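As a rough illustration of testing across a non-flat network, the following Python sketch checks whether a set of TCP services can be reached from the machine it runs on, for example a control room workstation placed on its own VLAN in the lab. The function name, the example addresses (taken from a documentation-only IP range) and the chosen ports are assumptions for illustration; a check like this only shows that routing and firewall rules allow the connection, not that the end application behaves correctly.

```python
import socket


def check_reachability(targets, timeout_s=2.0, connect=socket.create_connection):
    """Try a TCP connection to each (host, port) pair and record the outcome.

    `connect` defaults to socket.create_connection; it is a parameter only so
    that the check can be exercised without a real network.
    """
    results = {}
    for host, port in targets:
        try:
            # A successful connect means routers and firewalls along the path
            # allow this flow and that something is listening on the port.
            with connect((host, port), timeout=timeout_s):
                results[(host, port)] = True
        except OSError:
            # Covers refused connections, timeouts and name resolution errors.
            results[(host, port)] = False
    return results


# Hypothetical devices on a different subnet from the test workstation.
EXAMPLE_TARGETS = [
    ("192.0.2.10", 502),  # assumed Modbus/TCP device
    ("192.0.2.20", 102),  # assumed IEC 61850 MMS device
]
```

Calling check_reachability(EXAMPLE_TARGETS) from each relevant network segment in the lab, and again during the final on-site test, gives a simple record of which flows work that can be compared against the documented design.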

On-site installation is generally a much simpler step, especially when most of the initial commissioning is done in a lab environment. During the installation phase the goal is for the hardware to be simply plug-and-play, meaning that once mounted, powered and connected with the relevant communications cables, the hardware should be ready to go. A final on-site test can then be performed. This does not have to be as comprehensive as the lab testing, but it must confirm that the end systems and network are performing to expectations. Where possible, no configuration of hardware should be required at this stage.

A critical part of industrial Ethernet networks is proper link redundancy, meaning that if certain backbone links are damaged or disconnected for any reason, a backup redundant link will be activated to allow traffic to continue around the network unimpeded. Networks are often planned and implemented with multiple “levels” of redundancy, providing the high availability and reliability that is key. However, if this redundancy is not monitored, which is often the case, then no one is reacting to failures, meaning that over time the network loses its redundancy.

For instance, if the backbone of the network is connected in a ring fashion, then we have a single level of link redundancy (meaning one link can be lost without experiencing network failure of any kind). If a link in this ring is damaged or disconnected for any reason, the network should still be fully available, i.e. the redundancy will do its job. However, if the failed link is not rectified, then the network is no longer redundant, and a second link failure will cause a break in communication between sections of the network. Correct monitoring and regular maintenance of the network, if implemented, would pick up the original link failure soon after it occurred, allowing proactive rather than reactive maintenance. Replacing the original link can then be done over a period of time, knowing that the network is still operating correctly while the link is being sorted out. In the event of two link failures, however, communications will be interrupted, meaning operation and safety may be compromised until the link failures are resolved, which could affect production negatively. Similarly, a device with two redundant power supplies can lose one and continue running, but if the faulty supply is not replaced, then the device has lost its power supply redundancy.
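The ring behaviour described above can be sketched with a small Python model. This is purely an illustration under stated assumptions: the four switch names are hypothetical, links are treated as simple undirected connections, and the model ignores the ring redundancy protocol that would block and unblock links on real hardware.

```python
from collections import defaultdict, deque


def build_ring(switches):
    """Return the set of links for switches connected in a closed ring."""
    n = len(switches)
    return {frozenset((switches[i], switches[(i + 1) % n])) for i in range(n)}


def is_fully_connected(switches, links):
    """Breadth-first search: True if every switch can still reach all others."""
    neighbours = defaultdict(set)
    for link in links:
        a, b = tuple(link)
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen = {switches[0]}
    queue = deque([switches[0]])
    while queue:
        current = queue.popleft()
        for nxt in neighbours[current] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen == set(switches)


# Hypothetical four-switch backbone ring.
switches = ["SW1", "SW2", "SW3", "SW4"]
ring = build_ring(switches)

# First failure: the ring still provides a path between all switches.
one_failure = ring - {frozenset(("SW1", "SW2"))}
# Second failure before the first is repaired: the network splits in two.
two_failures = one_failure - {frozenset(("SW3", "SW4"))}

print(is_fully_connected(switches, ring))          # True
print(is_fully_connected(switches, one_failure))   # True
print(is_fully_connected(switches, two_failures))  # False
```

The second result is the dangerous state: everything still works, so without monitoring nobody notices that the next failure will interrupt communications.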

Proactive monitoring and maintenance of the network and attached devices can be performed in a number of different ways, but a highly recommended method is to use some form of NMS (Network Management System). This is a software application that automatically monitors the network and/or attached devices. Using a common open protocol called SNMP (Simple Network Management Protocol), one can have the NMS actively query network and end devices on a schedule (good for non-critical information such as port utilization). Similarly, one could set the end devices or network devices themselves to send an active notification when an issue occurs, which is often used for more critical events such as a port going up or down. The NMS can then be used to store and consolidate all these network events, as well as to monitor port utilization etc. over time. Any detected issues can be configured to trigger user notifications, which can normally be pushed to an email address or cell number, allowing the engineer to step in and resolve the issue. These systems automate a large portion of the network monitoring and eliminate a large portion of the manual maintenance otherwise required. Often they can also be used to monitor end devices at the same time, reporting on things such as HDD utilization in servers or temperatures and conditions in cameras. They also provide other useful functionality such as asset accounting, statistics reporting, visual topologies and more.
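To give a feel for what an NMS does with scheduled SNMP queries, the sketch below turns two successive readings of an interface octet counter (such as ifInOctets from the standard interfaces MIB) into an average utilization figure. It deliberately does not perform the SNMP query itself; the function names and the 80% alert threshold are assumptions for illustration, and it assumes a 32-bit counter that wraps at most once between polls.

```python
COUNTER32_MAX = 2**32  # 32-bit SNMP counters wrap back to zero at this value


def octet_delta(previous, current, modulus=COUNTER32_MAX):
    """Octets transferred between two polls, allowing for one counter wrap."""
    return (current - previous) % modulus


def utilization_percent(previous, current, interval_s, link_speed_bps):
    """Average utilization of one direction of a link over the poll interval."""
    bits = octet_delta(previous, current) * 8
    return 100.0 * bits / (interval_s * link_speed_bps)


def needs_alert(utilization, threshold_percent=80.0):
    """True if the utilization should raise a notification (assumed threshold)."""
    return utilization >= threshold_percent


# Two polls taken 60 seconds apart on a 100 Mbit/s port (illustrative values).
usage = utilization_percent(
    previous=1_000_000,
    current=376_000_000,
    interval_s=60,
    link_speed_bps=100_000_000,
)
print(usage)               # 50.0
print(needs_alert(usage))  # False
```

On fast links a 32-bit counter can wrap more than once within a long poll interval, which is one reason shorter intervals or 64-bit counters are normally preferred there.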

Another important consideration at the beginning and throughout the lifetime of a network is the set of policies surrounding maintenance and changes on the network. This includes not only security considerations (such as password control and availability, access to devices, firewall implementations etc.), which are outside the scope of this article, but also more standard maintenance considerations, for instance keeping track of IP addresses. Often, especially where many third-party contractors or different departments within a single organisation are involved, IP addresses assigned to devices are not properly documented and administered. This can lead to duplicate IP addresses on the network, which will cause issues, or to incorrectly subnetted and supernetted IP ranges, leading in many cases to breakdowns in security and communications. IP address assignment (and other logical design changes or additions) should be handled by a single individual or team, with all requests being formally submitted, approved or denied, and then documented.
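Even a simple documented register of addresses can be checked automatically for the two problems mentioned above. The following sketch uses Python's standard ipaddress module; the device names, addresses and subnets are hypothetical, and the register is assumed to be a plain list that a real site would more likely keep in a spreadsheet or database.

```python
import ipaddress
from collections import Counter
from itertools import combinations


def find_duplicate_addresses(assignments):
    """Return the addresses that appear more than once in (device, address) pairs."""
    counts = Counter(ipaddress.ip_address(addr) for _, addr in assignments)
    return sorted(addr for addr, n in counts.items() if n > 1)


def find_overlapping_subnets(subnets):
    """Return every pair of documented subnets whose address ranges overlap."""
    nets = [ipaddress.ip_network(s) for s in subnets]
    return [(a, b) for a, b in combinations(nets, 2) if a.overlaps(b)]


# Hypothetical register entries, e.g. submitted by different contractors.
assignments = [
    ("PLC-1", "10.10.1.10"),
    ("RTU-7", "10.10.1.11"),
    ("HMI-2", "10.10.1.10"),  # clashes with PLC-1
]
subnets = ["10.10.1.0/24", "10.10.2.0/24", "10.10.0.0/22"]

duplicates = find_duplicate_addresses(assignments)
overlaps = find_overlapping_subnets(subnets)

print(duplicates)     # [IPv4Address('10.10.1.10')]
print(len(overlaps))  # 2 (the /22 overlaps both /24 ranges)
```

A check like this could be run whenever a new address request is approved, before the change is made on the network.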

It is important to realise that the long-term maintenance of a mission-critical network is not a hugely time-consuming operation, especially when the initial design and implementation of the network were done correctly and according to best practices. However, being able to react quickly to failures and/or changes is essential. As such, it is often not critical to have a permanent on-book staff member handling the day-to-day network maintenance, but equally it should not be handed off as an additional responsibility to an engineer whose focus should be elsewhere. Rather, it is worth having an agreement with a service provider who can provide the technical knowledge and professional services for network maintenance as required. The initial design, planning and implementation phases should enlist the services of a professional as well, to ensure a strong, reliable and cost-effective network that provides the uptime and reliability you require without completely breaking the bank.

For more information on industrial and utility networks, as well as all the planning, design, implementation and maintenance services mentioned above, contact H3iSquared Trading CC.

www.h3isquared.com

sales@h3isquared.com

+27 (0)11 454 6025