April 5, 2026

How to Diagnose Random PLC Faults and Intermittent Machine Stoppages

Random machine stops are among the most expensive automation problems because normal troubleshooting begins after the evidence has disappeared. The machine restarts, all inputs look healthy and the fault may not return for hours. The event feels unpredictable, but most intermittent failures follow a condition that is simply rare, brief or poorly recorded. The objective is to make the invisible event observable.



Define “random” with precision

Replace the statement “it stops sometimes” with measurable facts. Which controller, equipment module and program state were active? Did the PLC fault, enter Program mode, lose remote I/O or execute a normal stop path? How often does it occur, how long after startup and under which product, speed, shift or weather condition?

Separate controller faults from process stoppages. A controller major fault may leave a fault code and task information. A machine sequence timeout means the PLC likely remained healthy but did not receive expected feedback. Network connection loss, safety demand and drive trip each require different evidence.

Preserve the first event

Secondary alarms can appear milliseconds after the initiating condition. Capture the first-out fault with a timestamp and do not overwrite it until an authorized reset. Store current sequence state, previous state, command source, permissive status, input word, output word and key analog values.
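
The first-out idea can be sketched in a few lines. The Python below is only an illustration of the logic, not PLC code: the class name, the fault names and the context fields are all hypothetical, and the list of active faults is assumed to arrive in the order the program evaluates them.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FirstOutLatch:
    """Holds the first fault seen until an authorized reset clears it."""
    first_fault: Optional[str] = None
    timestamp_ms: Optional[int] = None
    snapshot: dict = field(default_factory=dict)

    def scan(self, active_faults, now_ms, context):
        # Latch only when nothing is latched yet; later faults are ignored.
        if self.first_fault is None and active_faults:
            self.first_fault = active_faults[0]
            self.timestamp_ms = now_ms
            self.snapshot = dict(context)  # copy so later scans cannot alter it

    def reset(self, authorized):
        # Only an authorized reset may discard the captured evidence.
        if authorized:
            self.first_fault = None
            self.timestamp_ms = None
            self.snapshot = {}


# Hypothetical scans: the guard opens first, the drive trip follows.
latch = FirstOutLatch()
latch.scan([], 1000, {"state": "FILL", "prev_state": "IDLE"})
latch.scan(["GUARD_OPEN"], 1010, {"state": "FILL", "prev_state": "IDLE"})
latch.scan(["GUARD_OPEN", "DRIVE_TRIP"], 1020, {"state": "STOPPED", "prev_state": "FILL"})
print(latch.first_fault, latch.timestamp_ms, latch.snapshot)
```

The snapshot is copied at the moment of latching, so the stored sequence state reflects the initiating condition rather than the stopped machine.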

A circular event buffer can record the most recent state changes continuously. When a trigger occurs, freeze a portion of the pre-event data and continue recording briefly afterward. This view of the moments before, during and after the trip reveals whether a sensor dropped first, a voltage dip affected several devices or the PLC command disappeared before the actuator stopped.
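
A minimal sketch of such a triggered buffer follows, assuming one sample per call and fixed pre-trigger and post-trigger depths. The class and parameter names are hypothetical, and a real implementation would live in the controller or a data logger rather than in Python.

```python
from collections import deque


class TriggeredRecorder:
    """Rolling pre-trigger history plus a fixed number of post-trigger samples."""

    def __init__(self, pre_samples, post_samples):
        self.pre = deque(maxlen=pre_samples)  # oldest entries fall off automatically
        self.post_samples = post_samples
        self.post_remaining = 0
        self.capture = None                   # frozen record, None until triggered

    def record(self, sample, trigger=False):
        if self.capture is None:
            if trigger:
                # Freeze the history and include the sample that tripped.
                self.capture = list(self.pre) + [sample]
                self.post_remaining = self.post_samples
            else:
                self.pre.append(sample)
        elif self.post_remaining > 0:
            self.capture.append(sample)
            self.post_remaining -= 1
        # Once complete, the capture stays frozen until someone clears it.


rec = TriggeredRecorder(pre_samples=3, post_samples=2)
for t in range(10):
    rec.record(t, trigger=(t == 6))
print(rec.capture)
```

Here the capture holds three samples before the trigger, the trigger sample itself and two samples after it. Later samples are deliberately ignored so that a second event cannot overwrite the first.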

Use a common time base

Evidence from PLCs, drives, HMIs and managed switches is difficult to compare when clocks differ. Configure approved time synchronization and periodically verify it. Record event time at the source when possible; a historian arrival timestamp includes network and polling delay.
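
The difference between source time and arrival time is easy to demonstrate. In the sketch below every value is made up for illustration: three hypothetical devices each report one event, and the same events are ordered first by historian arrival time and then by the timestamp applied at the source.

```python
from typing import NamedTuple


class Event(NamedTuple):
    source: str
    source_ms: int   # time stamped by the device itself
    arrival_ms: int  # time the historian received the record
    text: str


# Invented example values; polling delay differs per device.
events = [
    Event("drive", 10_004, 10_950, "undervoltage trip"),
    Event("plc", 10_020, 10_120, "sequence timeout"),
    Event("switch", 9_998, 11_400, "port link down"),
]

by_arrival = [e.source for e in sorted(events, key=lambda e: e.arrival_ms)]
by_source = [e.source for e in sorted(events, key=lambda e: e.source_ms)]
print("arrival order:", by_arrival)
print("source order: ", by_source)
```

Sorted by arrival, the PLC timeout appears to come first. Sorted by source time, the link loss leads. The second ordering is only trustworthy if the device clocks are actually synchronized.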

For very fast phenomena, ordinary timestamps are insufficient. Use high-speed input capture, sequence-of-events modules, power-quality instruments or an oscilloscope appropriate to the circuit. Choose the sampling rate according to the duration of the suspected event, not according to convenient historian defaults.

Investigate power and grounding

Brief voltage dips can reset remote I/O, disturb sensors or trip drives without stopping the main PLC. Monitor the 24 VDC supply near the affected load, not only at the power supply terminals. Check loading, inrush, loose terminals, protective-device behavior and shared commons. Look for correlation with contactors, solenoids, heaters or large motors switching.
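
One way to look for that correlation is to count how many logged dips began shortly after each load switched. The sketch below assumes both logs share a time base. The function name, the device labels and the 50 ms window are all assumptions chosen for illustration, and a match shows coincidence, not cause.

```python
from collections import Counter


def correlate_dips(dip_times_ms, switch_events, window_ms=50):
    """Count, per device, the dips that began within window_ms after it switched."""
    hits = Counter()
    for dip in dip_times_ms:
        # A set ensures each device is counted at most once per dip.
        nearby = {dev for t, dev in switch_events if 0 <= dip - t <= window_ms}
        hits.update(nearby)
    return hits


# Invented logs: (time_ms, device) for switching, plain times for dips.
dips = [1020, 5015, 9000]
switching = [
    (1000, "heater contactor"),
    (3000, "solenoid valve"),
    (5000, "heater contactor"),
    (8900, "solenoid valve"),
]
hits = correlate_dips(dips, switching)
print(hits.most_common())
```

In this invented data two of the three dips follow the heater contactor within the window, which would justify measuring at that load next.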

Inspect grounding and shielding against the system design. Noise problems often depend on cable routing, shield termination and cabinet bonding. Do not mask a power problem by increasing software delays until the electrical cause is understood.

Examine intermittent field devices

Loose conductors, damaged flex cables and marginal sensors can produce pulses too short to notice on an HMI. Trend the raw input and the conditioned signal separately. Add a diagnostic counter for unexpected transitions and measure pulse duration where the platform permits.
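
As a sketch of such a diagnostic, the function below counts transitions in a logged raw input and reports pulses shorter than a chosen threshold. The function name, the sample format and the 10 ms threshold are assumptions, and the measured duration can be no finer than the interval at which the input was sampled.

```python
def find_short_pulses(samples, min_valid_ms):
    """samples: time-ordered list of (time_ms, state).

    Returns (transition_count, short_pulses), where each short pulse is
    (start_ms, duration_ms, state_during_pulse).
    """
    transitions = 0
    short = []
    prev_state = samples[0][1]
    changed_at = None  # unknown: the first state began before logging started
    for t, state in samples[1:]:
        if state == prev_state:
            continue
        transitions += 1
        if changed_at is not None and t - changed_at < min_valid_ms:
            short.append((changed_at, t - changed_at, prev_state))
        prev_state, changed_at = state, t
    return transitions, short


# Invented log: the input drops out for 2 ms at t = 203 ms.
raw = [(0, True), (100, True), (203, False), (205, True), (400, True)]
count, pulses = find_short_pulses(raw, min_valid_ms=10)
print(count, pulses)
```

A 2 ms dropout like this one would never be visible on an HMI that refreshes a few times per second, yet it is long enough for a fast task to act on.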

Mechanical movement can guide the search. If a fault follows a cable-carrier position, vibration level or cylinder stroke, inspect that physical region. Temperature-related failures may appear only after warm-up or washdown. Swapping parts can help, but label and document swaps so the experiment remains interpretable.

Analyze network evidence

Managed switches can reveal link flaps, errors, discards and topology changes. Controller diagnostics may show connection timeouts, rejected requests or resource limits. Check duplicate IP addresses, marginal connectors, duplex or speed negotiation where relevant, multicast handling and excessive broadcast traffic.

Avoid indiscriminate packet capture as the first step. Begin with the failing connection and time window, then collect targeted traffic if switch and device diagnostics cannot distinguish the cause. Confirm whether communication loss triggered the stop or resulted from a device power interruption.

Review software race conditions

Intermittent software faults often involve timing. An input changes near a state transition, two tasks write shared data, a one-shot is instantiated incorrectly or two machines wait for each other's acknowledgements in an unexpected order. Search for multiple writers, transitions with no timeout, reused timers and assumptions about task sequence.

Recreate the timing in simulation. Vary input order by a scan, delay acknowledgements and restart devices in different sequences. Add assertions or diagnostic codes at impossible transitions. If the code cannot explain how it reached the captured state, its observability or state model needs improvement.
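
Varying the input order can be automated once the sequence logic is modeled. The toy model below is hypothetical: it treats two signals as events, enumerates every arrival order and lists the orders that leave the sequence stuck. The state and event names are invented, and the dropped event is a deliberate bug included to show what the search finds.

```python
from itertools import permutations


def run_sequence(events):
    """Toy step logic: needs clamp_closed and part_present to reach DONE."""
    state = "WAIT_CLAMP"
    for ev in events:
        if state == "WAIT_CLAMP" and ev == "clamp_closed":
            state = "WAIT_PART"
        elif state == "WAIT_PART" and ev == "part_present":
            state = "DONE"
        # Deliberate bug: part_present arriving during WAIT_CLAMP is dropped.
    return state


signals = ("clamp_closed", "part_present")
failing = [order for order in permutations(signals) if run_sequence(order) != "DONE"]
print(failing)
```

The search reports the one order in which the part signal arrives before the clamp signal. On a real machine that order might occur only when a part is unusually early, which is exactly the kind of rare condition that looks random from the floor.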

Change one variable at a time

Under production pressure, teams may replace a sensor, move a cable, edit a timer and reset a drive simultaneously. If the fault disappears, nobody knows which action mattered. Use a hypothesis log: suspected cause, evidence, test, result and next decision. Make the smallest safe change that discriminates between explanations.
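
The hypothesis log needs no special tooling. As a sketch, it can be a list of records with the five fields named above. The entries here are invented examples, and the class and field names are only one possible layout.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    suspected_cause: str
    evidence: str
    test: str
    result: str = "pending"  # filled in after the test has run
    next_decision: str = ""


# Invented entries showing one closed and one open hypothesis.
log = [
    Hypothesis(
        "Loose 24 VDC terminal at remote I/O rack",
        "Rack resets coincide with conveyor start",
        "Re-torque terminals only; change nothing else",
        result="fault recurred",
        next_decision="test next hypothesis",
    ),
    Hypothesis(
        "Damaged flex cable in cable carrier",
        "Faults cluster at one carriage position",
        "Log raw input while jogging through that position",
    ),
]

open_items = [h.suspected_cause for h in log if h.result == "pending"]
print(open_items)
```

Recording the test as a single action per entry makes it obvious when someone is about to change two things at once.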

After identifying the cause, remove temporary instrumentation or convert valuable diagnostics into permanent features. Verify the repair under the conditions that previously correlated with failure. Monitor long enough to cover the original occurrence interval.

Intermittent-fault capture kit

Prepare reusable PLC blocks for first-out capture, transition history and triggered trends. Keep a portable power-quality logger, approved network tap, spare shielded cables and a standard incident form available. Preparation matters because the next event may last milliseconds while the outage around it lasts hours. A known toolkit reduces improvisation and preserves comparable evidence across incidents.

Intermittent faults become manageable when the system remembers what people cannot witness. First-out logic, synchronized clocks, event buffers, targeted electrical measurements and disciplined experiments transform “random” into a timeline. The engineer’s most valuable action is often not resetting the machine faster, but preserving the few milliseconds that explain why it stopped.
