Choosing, Designing and Implementing the Right Network for Mission Critical Systems

Introduction

In any industrial or utility control, protection, production and/or safety system, the communication network is becoming one of the most critical components. However, this is often overlooked, and as a result a sub-par network can be designed and implemented. Such networks often barely meet the minimum requirements to work, meaning that while the attached systems all test out as working correctly, the network itself is not as stable and reliable as it appears. If the network is not directly tested and checked, this can lead to sign-off of a network that is one failed switch or cable away from crashing. Another common issue is sourcing switches that do not cater for all the requirements that might arise in the future. This could mean hardware that is not rugged enough for a harsh environment, or devices that do not offer certain diagnostic and troubleshooting tools, leading to long-term issues that can be very costly or difficult to resolve correctly.

Rectifying such issues can also be extremely time-consuming, and can affect production on a live plant, meaning even greater losses. Redesigning and recommissioning a network takes time and effort, and may require physical changes such as recabling if not correctly planned for from the beginning.

Another component that can have an even greater impact than incorrectly selected hardware is the logical design and configuration of the network. This includes everything from the VLAN and IP subnet design, to the layer 3 routing between and across subnets, to firewalls on WAN links or links to other organisations' networks. Fixing physical issues like hardware and cabling can in most cases at least be done piece by piece across a network: replacing a single switch or cable at a time can be planned properly so as to have minimal impact on the network and attached systems. However, redesigning something like an IP subnet or routing infrastructure can lead to outages across entire sections, as not only the network devices will require reconfiguration, but most of the end devices as well. This means not only downtime of these end devices for maintenance, but also assistance from users who know the edge device requirements and configuration, who may in some cases be from third-party companies and thus much harder to coordinate with than a single company handling the network itself.

Physical Layout and Site Specific Considerations

The most important step in putting together a truly reliable, resilient network that properly caters for the system it is built to support is the initial design, and specifically some early decisions that will greatly affect everything from cabling to hardware layout. Before anything else, we need to consider the physical site/system layout that this network is going to support, and how the cables will be laid across the site. If no cabling or trunking exists, this can be one of the most costly components of the network itself, as the civil work required to install the cable can be expensive. We need to cater for a few things here. First, we must of course make sure that network connection points exist for all end devices that require them. In most cases the end devices in industrial and utility environments now support fibre connections directly, and with multimode 100 Mbps fibre supporting distances of up to 2 km as standard, this normally allows us to be quite flexible. However, fibre (especially the ruggedized fibre required for long runs) can be quite costly, and so in other cases copper (Cat5e or Cat6) cable is used instead. The important thing here is to remember that copper cable is susceptible to Electro-Magnetic Interference (EMI), which can be quite strong in high-power environments such as substations or near arc furnaces and similar high-current machinery. In these cases either shielded copper cable should be used, or for highly critical end devices fibre (even at the increased cost) should be considered. Changing from copper to fibre at a later stage will of course cost a lot more than going with fibre from the beginning.

Physical and Logical Topology Design

Once we have decided upon the basic connection points for all edge devices, we can start looking at the actual physical topology of the network, meaning the interconnections between the switches to which the edge devices will connect. In some cases we may look at separate switches for edge device connections, which connect to a central backbone of switches running a high level of redundancy. This is the more common design in power grid environments such as substations, with each bay containing one or two edge switches for edge devices, connecting to a central mesh of backbone switches that interconnects the bays with the process and station levels, as well as any WAN connections back to the control centre. In these cases we can distinguish between backbone and edge switches specifically, with edge switches focusing on a high count of 100 Mbps connections, and backbone switches focusing on gigabit connections to each other and the bay switches.

In an industrial environment we may instead more commonly see a single loop of network switches, or a number of interconnected loops. Often each of these loops will service a separate function in the overall plant/factory, such as one loop for exterior security, another for monitoring of shipping and trucks, another for conveyors, and so on. These switches will all interconnect at one or more central points, such as in the control room for the site. In these cases it is harder to distinguish between backbone and edge switches, as most switches will be both. A branched-off switch for expansion or for connection of specific remote edge devices may clearly be an edge switch rather than a backbone switch, but most of the time in these types of networks we do not specifically distinguish between the two. Here it is recommended to look at a more flexible, modular network switch that allows on-the-fly changing of modules according to requirements.

Either way, we now need to design the backbone of the network. With the decisions above on where individual edge devices will connect, we can use those locations as the nodes of the backbone. We then need to interconnect all these individual locations, keeping in mind that this will translate to physical interconnections between the locations/nodes. This means cable runs must also be considered. Cables obviously cannot always be run directly between points, so the restrictions and limitations of the actual site must be considered with the relevant specialists involved. At this point we can start confirming all the distances between nodes, and can use the longest and shortest distances to decide what cabling to use.

Cabling Selections

Copper cabling has a maximum run of 100 m (it is recommended to stick to 95 m for actual cable runs due to losses from connections, patch leads, etc.), which really limits it to single buildings or very short outside runs. It is also susceptible to EMI as mentioned above, meaning that in certain environments it must be properly shielded. The general rule of thumb in utility and industrial environments is to only use copper within cabinets, and normally only in the control centre. For field devices multimode fibre is recommended instead. Multimode fibre offers distances of 2 km for 100 Mbps connections (normally edge device to edge switch) and 500 m for gigabit connections (backbone connections and uplinks to the backbone switches), and so is quite well suited to most sites. In cases where this is still too short, single-mode fibre can be used instead; however, the cost of single-mode fibre is higher than multimode in most cases, and increases further as the required distances increase (due to the requirement for more precise lasers, cabling, etc. to minimize signal loss).
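The distance rules of thumb above can be captured in a simple planning check. The sketch below is illustrative only: the media names, the single-mode figure and the example runs are assumptions, and real limits depend on the exact fibre grade and optics used.

```python
# Hypothetical sketch: validate planned cable runs against common media
# distance limits. Values are the rules of thumb from the text; the
# single-mode figure is purely illustrative, as it depends on the optics.
MEDIA_LIMITS_M = {
    "copper": 95,            # Cat5e/Cat6, derated from 100 m for patching losses
    "multimode_100M": 2000,  # 100 Mbps over multimode fibre
    "multimode_1G": 500,     # gigabit over multimode fibre
    "singlemode": 10000,     # single-mode; depends heavily on optics chosen
}

def check_run(name, media, length_m):
    """Return True if the planned run fits the chosen media's limit."""
    limit = MEDIA_LIMITS_M[media]
    ok = length_m <= limit
    verdict = "OK" if ok else f"EXCEEDS {limit} m limit"
    print(f"{name}: {length_m} m over {media} -> {verdict}")
    return ok

# A 1.2 km run to a field device fits multimode at 100 Mbps,
# but the same distance would not work as a gigabit multimode uplink.
check_run("bay 3 edge device", "multimode_100M", 1200)  # OK
check_run("backbone uplink", "multimode_1G", 1200)      # exceeds limit
```

Running such a check over every node-to-node distance confirmed on site quickly shows where multimode suffices and where single-mode (or a repeater location) must be budgeted for.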

In some cases one may be able to standardize a local site on multimode fibre throughout (plus some copper connections for HMIs, SCADA machines, etc.), with single-mode fibre only required for WAN breakout (or not required at all for some small sites). This is generally the case once again in the power grid environment, where substations are small enough to use multimode fibre, and these are then connected to each other and the control room through wider-scale single-mode fibre connections. In other cases, often seen in mining or similar applications, the majority of the network can use a combination of multimode fibre and copper (especially when EMI is not a major concern), with single-mode only used for certain longer cable runs (down shafts, etc.).

When using fibre for cable runs, one can also consider using multi-core fibre cables where possible/required. These cables include a number of fibre cores within a single armoured/protected cable. This can be much more efficient when multiple cables are required (such as cables out to the field), and is also great for future expansion and maintenance. Having a multi-core cable with a number of dark fibre cores (unconnected to the network) allows for quick resolution of individual core breaks in the future, as well as possible expansion when needed. The flip side of this coin, however, is not to over-rely on multi-core cable for redundancy. Having two redundant connections between sections of the site is good practice, but when both of these redundant connections run within a single multi-core cable, a complete break of that cable will break both. This is a commonly seen issue where the logical/physical topology of the network is not correctly tied to the actual physical site, leading to designs that on paper seem highly resilient but are actually very susceptible to single points of failure in the physical world. A similar issue is seen when devices are specified with redundant power inputs, but on installation both are bridged to a single power source, meaning once again a single point of failure.
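The shared-cable pitfall above lends itself to an automated design review: if the cable schedule records which physical cable each logical link runs in, redundant links that share one cable can be flagged before installation. The sketch below uses invented switch and cable names purely for illustration.

```python
# Illustrative sketch (all names and data are hypothetical): flag redundant
# links between the same two nodes that all run in one physical multi-core
# cable, i.e. a single point of failure hidden in a "redundant" design.
from collections import defaultdict

# Each logical link: (endpoint A, endpoint B, physical cable it runs in)
links = [
    ("ring-sw-1", "ring-sw-2", "trench-cable-A"),
    ("ring-sw-1", "ring-sw-2", "trench-cable-A"),  # "redundant" link, same cable!
    ("ring-sw-2", "ring-sw-3", "trench-cable-B"),
    ("ring-sw-2", "ring-sw-3", "trench-cable-C"),  # genuinely diverse path
]

def shared_cable_risks(links):
    """Return node pairs whose multiple links all share one physical cable."""
    cables_by_pair = defaultdict(set)
    link_count = defaultdict(int)
    for a, b, cable in links:
        pair = tuple(sorted((a, b)))
        cables_by_pair[pair].add(cable)
        link_count[pair] += 1
    # A pair with several links but only one distinct cable is at risk.
    return [pair for pair, cables in cables_by_pair.items()
            if link_count[pair] > 1 and len(cables) == 1]

for a, b in shared_cable_risks(links):
    print(f"WARNING: redundant links {a}<->{b} share one physical cable")
```

The same kind of check extends naturally to redundant power feeds bridged to a single source, the other single point of failure mentioned above.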

Link Redundancy

While working on the various options for cable runs, we must also keep redundancy in mind. We obviously need at least one physical connection to each location, but we can use the various redundancy protocols available to provide higher resiliency and reliability on the network. The choice of redundancy protocol will depend on, or dictate, the topology that must be used. Some redundancy protocols, such as the Media Redundancy Protocol (MRP), require a specific ring topology, while others such as the Rapid Spanning Tree Protocol (RSTP) allow for full meshes of a limited number of switches. Newer redundancy protocols are being developed all the time, such as the Parallel Redundancy Protocol (PRP), which actually requires two completely separate, independent networks running in parallel. As such it is critical to consider the options available from a site-specific view in terms of where cables can be run, as well as a logical view in terms of what redundancy can or will be used.

One must also be realistic about the requirements of the site. PRP, mentioned above, provides incredible levels of redundancy, being able to layer other redundancy protocols within its infrastructure while also providing instant recovery (rather than having to recover the network, the protocol duplicates all traffic from the beginning, so loss of an entire internal network still means the duplicate packet on the separate independent network is already in transmission). However, PRP requires not only two completely separate networks (normally exact physical duplicates), but also specialized hardware to allow edge devices to interface with both networks correctly. Very few systems outside of the utility market require such levels of redundancy, and even within utility systems one must be careful not to over-specify PRP, which can lead to extremely costly network capital and operational expenditure. Alternatively, one can use PRP only for certain high-criticality systems, while using other redundancy protocols for less critical parts of the system. PRP even allows one to use the internal networks as networks in their own right (with some restrictions on where data can flow), which again can be used to balance redundancy against the cost of implementing it.
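The trade-offs between these protocols can be summarized as a rough first-pass lookup. The sketch below is a deliberate simplification, not a design rule: the recovery figures are order-of-magnitude only (MRP and RSTP recovery depends on ring size and configuration), and real protocol selection involves many more factors than the two captured here.

```python
# Hedged illustration: the redundancy trade-offs discussed above, reduced
# to a tiny lookup. Recovery descriptions are rough orders of magnitude.
PROTOCOLS = {
    "RSTP": {"topology": "mesh or ring", "recovery": "tens of ms to seconds",
             "duplicate_network": False},
    "MRP":  {"topology": "ring only", "recovery": "tens to hundreds of ms",
             "duplicate_network": False},
    "PRP":  {"topology": "two independent LANs", "recovery": "zero (parallel duplicates)",
             "duplicate_network": True},
}

def suggest(zero_loss_required, ring_topology_available):
    """Very rough first-pass suggestion, not a substitute for proper design."""
    if zero_loss_required:
        return "PRP"   # needs two parallel networks and PRP-capable end devices
    if ring_topology_available:
        return "MRP"   # deterministic recovery on a defined ring
    return "RSTP"      # flexible meshes of a limited switch count

print(suggest(zero_loss_required=False, ring_topology_available=True))
```

Even this toy version makes the cost cliff visible: only the zero-loss requirement pulls in the duplicate-network expense of PRP, which is exactly the over-specification risk discussed above.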

Edge Device Considerations

At this stage it is already important that one has at least an intermediate understanding of industrial networking design, or employs someone to help with the process, even if just at a rough level for now. These initial decisions on the network can be critical to the entire process, but are often not given the respect they deserve, leading to sub-par or overly costly networks being implemented. In fact these early decisions are generally much more costly and time-consuming to fix than the decisions made in later stages of the design. An incorrect cable design and installation can require an upheaval of the entire site to solve, not to mention the man-hours required. Changing an IP subnet, on the other hand, while painful and disruptive, can normally be done in a few hours of downtime. As with many things in life, the cost of not doing things properly from the beginning generally far outweighs the cost of doing them correctly.

So far everything discussed has revolved mostly around layer 1 of the OSI model, the physical layer. There is one last major physical layer point that must be considered, and that is the actual port count available at each connection point of the network for edge devices. In some cases a switch may be exclusively for backbone interconnection, meaning no edge device ports are required, while in other cases the switch may sit closer to the logical edge of the network and will provide network connections for tens of devices. A switch/network's main purpose in any solution is to provide connectivity to the end devices and allow them to intercommunicate. With this very obvious functionality in mind, it is surprising how often the edge device count that a network must cater for is treated almost as an afterthought. The edge port connections required on the network should at this stage be at least partially understood, and these requirements should be kept in mind for the rest of the design phase.
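Keeping the edge port count front and centre can be as simple as maintaining a per-location device list and deriving the required port count from it. The device names and the 25% spare allowance below are invented for illustration; the spare fraction should be set per project.

```python
# Hypothetical sketch: tally required edge ports per switch location so the
# port count drives hardware selection rather than being an afterthought.
import math

# Invented example device lists, one entry per switch location.
edge_devices = {
    "bay-1-switch": ["protection-relay-1", "protection-relay-2", "merging-unit-1"],
    "bay-2-switch": ["protection-relay-3", "io-unit-1"],
    "control-room-switch": ["scada-server", "hmi-1", "hmi-2", "engineering-pc"],
}

SPARE_FRACTION = 0.25  # assumed spare allowance for future expansion

def required_ports(device_count, spare=SPARE_FRACTION):
    """Round the device count up after adding the spare allowance."""
    return math.ceil(device_count * (1 + spare))

for switch, devices in edge_devices.items():
    print(f"{switch}: {len(devices)} devices -> "
          f"specify at least {required_ports(len(devices))} edge ports")
```

A table like this, reviewed alongside the physical topology, feeds directly into the hardware selection discussed next.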

Hardware Choice and Spares Strategy

Different manufacturers offer different models of switches, most of which these days are at least slightly modular. This normally allows a degree of flexibility; however, whether it will cater for your requirements must be carefully checked, ideally by contacting your hardware supplier and asking for their assistance. Picking hardware at this stage based only on port count and physical details like power supply can easily lead to the incorrect hardware being provided, which often means overpaying for features that will never be implemented, or even worse, hardware that does not cater for a critical feature needed at some point. Another big consideration at this stage is spares and expansion management. Spares, especially in today's uncertain world where devices can take over half a year to be delivered due to component shortages, are quite essential to reliable network operations.

However, we also want to ensure that we do not waste storage space on too many spares of differing types. The best strategy is to standardize the network hardware as much as possible. In some networks this might be easier, with the network requirements leading to a single switch build throughout. In others we might want a system where we can install modules into a switch chassis to provide the port counts we require. For a more central backbone switch this could mean a few modules with high-bandwidth ports such as gigabit fibre ports, while for a more edge-focused switch we could swap some of the fibre modules for higher-count 100 Mbps copper or fibre modules, providing more ports with lower individual transfer speeds. In most mission critical networks these days the rule of thumb is a gigabit backbone, with 100 Mbps for edge device connections. Unless your network is also meant to handle high-traffic systems such as CCTV, a 100 Mbps edge connection/gigabit backbone philosophy should suffice in most cases. For higher-traffic networks such as those handling CCTV, however, gigabit edge connections may be needed, and in some cases even 10 gigabit backbones.
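A quick back-of-envelope calculation helps sanity-check the 100 Mbps edge/gigabit backbone rule of thumb against a specific site. The utilisation figure below is an assumption for illustration; typical control traffic is a small fraction of line rate, while CCTV streams run far higher.

```python
# Back-of-envelope sketch: estimate aggregate edge traffic against the
# backbone capacity. The utilisation figure is an invented example; real
# values should come from the actual protocols and devices on the site.
EDGE_PORT_MBPS = 100
BACKBONE_MBPS = 1000

def backbone_load(edge_ports, avg_utilisation):
    """Aggregate edge traffic in Mbps and its fraction of backbone capacity."""
    aggregate = edge_ports * EDGE_PORT_MBPS * avg_utilisation
    return aggregate, aggregate / BACKBONE_MBPS

aggregate, fraction = backbone_load(edge_ports=24, avg_utilisation=0.05)
print(f"Estimated aggregate edge traffic: {aggregate:.0f} Mbps "
      f"({fraction:.0%} of backbone capacity)")
```

If the same arithmetic with realistic CCTV bitrates pushes the fraction toward or past 100%, that is the signal to move to gigabit edge ports or a 10 gigabit backbone, as noted above.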

From Physical to Logical Design

At this stage we should have considered the physical topology of the network and have an idea of the edge device counts per switch. This will allow us to start getting an idea of which switches we can look at using, although there may still be some upcoming decisions that will determine which features the switches must support. More importantly, we should now have a rough idea of the size of the network and what devices are going to use it. The next step is to consider these end devices and how they can be logically segregated on the network. For instance, we can generally form different groups of devices for different functions: SCADA-related devices could be one group, engineering access (HMIs, etc.) another. In some cases we may have security-specific devices we want in their own group, or allow for guest access so that contractors and similar have limited internet and network access. On other sites the segregation may be geographical, or may relate to the actual functionality of the plant (such as separating different stages of manufacture into their own groups). This logical segregation is very site- and scenario-specific, and should be done with input from both the users of the end devices and a network design specialist.

VLANs and IP Subnetting

The end goal, however, is to split devices into these logical groups, which can then each be assigned their own VLAN and IP subnet. IP subnetting is common to all larger networks, whether a corporate office network or a mission critical network such as we are considering. IP subnetting means that devices cannot share traffic at layer 3 (the IP layer), meaning that to all intents and purposes they cannot communicate with one another without a router in place to route the traffic between the different subnets. However, even with IP subnets in place there are certain traffic types, such as various broadcast traffic, that can still reach between devices.
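Python's standard `ipaddress` module is a convenient way to sketch and sanity-check such a subnet plan before any device is configured. The site range, subnet sizes and group names below are examples only, not a recommended addressing scheme.

```python
# Illustration using Python's standard ipaddress module: carve a site
# supernet into per-function subnets, then verify the plan is consistent.
# All address ranges and group names here are invented examples.
import ipaddress

site = ipaddress.ip_network("10.20.0.0/16")
subnets = {
    "scada":       ipaddress.ip_network("10.20.1.0/24"),
    "engineering": ipaddress.ip_network("10.20.2.0/24"),
    "security":    ipaddress.ip_network("10.20.3.0/24"),
}

# Every subnet must fall inside the site range...
for name, net in subnets.items():
    assert net.subnet_of(site), f"{name} lies outside the site range"

# ...and no two subnets may overlap.
names = list(subnets)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        assert not subnets[a].overlaps(subnets[b]), f"{a} overlaps {b}"

# Membership checks then map any device address back to its group.
host = ipaddress.ip_address("10.20.1.15")
group = next(name for name, net in subnets.items() if host in net)
print(f"{host} belongs to the {group} subnet")
```

The same structure extends naturally once each group is also assigned a VLAN ID, tying the layer 2 and layer 3 segregation discussed here together in one plan.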

VLANs instead allow us to segregate the traffic at the switching level, meaning that we can stop any traffic from one VLAN reaching a device in another VLAN without us specifically routing and allowing this traffic across. This more stringent segregation has a number of advantages, not the least of which is security, as viruses and malware will often use abnormal methods to transmit themselves between devices, such as by exploiting broadcasts and similar mechanisms. It also increases reliability and availability, while reducing the traffic volumes that the end devices have to deal with. Problems are also much better contained within a VLAN, meaning that issues in one system are less likely to affect other systems on the network. In many cases as well, such as in utility networks that must comply with the IEC 61850 standard, VLANs and related protocols are not only recommended but in some cases required to provide the relevant functionality.

The final VLAN and IP subnet designs will both be very specific to the application in question, and many different approaches can be correct. These are two areas where the design will benefit greatly from the expertise of someone familiar with mission critical network design. Knowing the relevant industry best practices, as well as having hands-on familiarity with these types of systems, becomes quite essential to producing a well-balanced and efficient VLAN design. Similarly, the IP subnetting will depend on certain factors, not least of which is whether this network segment is part of a larger network that needs to be interfaced with at some point. For instance, in a substation network design for a new IPP that needs to interface to an existing government institute's systems, one needs to consider the existing networking layout to a degree and cater for things like routing and remote access where needed. The VLAN and IP subnetting setup is quite critical to the efficient and reliable operation of the network and attached systems, and also creates the foundation of the logical network design, upon which the rest of the design will be based.

Routing, Cyber-Security and Beyond

Our next major consideration will be routing on the network, both for cases where devices in one VLAN need to communicate with devices in another VLAN, and to potentially allow devices on the network to communicate out to a Wide Area Network such as a country-wide private network or, more commonly, the Internet. Routing between VLANs within your private network is generally secure and does not raise security concerns; however, any network not under your direct control should always be considered insecure, even if it is another company network such as the corporate IT network. From a security point of view we must always protect against attacks from such insecure networks, which can very easily originate within a corporate network. These attacks can be specifically targeted attacks from users with malicious intent, but can also originate from users without malicious intent who are simply in the wrong place doing the wrong things. In either case we should protect ourselves with an appropriate level of cyber security.

Cyber security is another very open-ended topic when talking about a mission critical network, and can also be very difficult to justify, especially in today's world where more than just a basic firewall is generally required. Depending on the level of security applied, this can run in the region of tens or hundreds of thousands of Rand, or even more. Security does not add convenience to the network or its systems (in fact it generally restricts convenience quite severely), and will not actively increase profits or productivity. It will generally have a detrimental effect on profits (especially when monthly or annual license renewals are required) and on productivity (increased security often adds extra steps to many processes). As such, security can be very difficult to quantify properly, especially when the security portions can outweigh the rest of the network in some cases. However, the cost of a security breach these days can be devastating, and the cost of not having proper cyber security can end up crippling a company or putting it out of business entirely. It is not uncommon to hear of a new major data breach every few days, and things like ransomware can not only completely shut down production on a site, but could also interfere with safety and security systems, leading to even further losses or, worse, a risk to employee lives. A full discussion on designing and choosing a proper network security system is beyond the scope of this editorial, but it should not be overlooked, and should have at the very least the same time and care dedicated to it as was spent on the rest of the network design.

Conclusion

In summary, it is important to realize that the network is effectively the nervous system of a modern site, allowing all the various protection, control, monitoring, security and safety devices to intercommunicate in a quick, reliable and efficient fashion. Often the design and implementation of the network is handed off as a side project to someone involved with one of the main control systems on a site, who is simply worried about getting basic communications to work. This generally leads to an under-designed and unreliable network. In many cases such a network will work fine for the first year or two, but as soon as devices and cables start aging or something gets damaged the network starts to struggle, and resolving the issue at that stage can be much more costly and disruptive than simply doing a proper resilient network design from the beginning. This design and implementation should be handled by specialists in industrial networking, who understand the process and have experience in dealing with the unforeseen problems that can arise on these systems. As with many components of a mission critical system, skimping on effort and cost early in the network design can lead to much more effort and money being expended in the long run, whereas spending the time and capex early on will lead to a resilient network that requires very minimal maintenance, leaving budget and time available to spend on maintaining and upgrading the actual control, protection and monitoring systems on the site.