Red Wolf Reliability

Papers and Case Studies

Availability vs. Reliability Part 2

As a follow-up to my last post, I thought I might add a bit more grit.

It's easy to talk about tracking failures in order to compute Reliability, but not so easy to do in the real world. Believe me, I know. I have spent far more of my career in the private sector than in academia.

In the real world, a work order is written, given some level of priority, and then the work is assigned. The craftsman performing the repair receives the work order tucked neatly within a stack of them to complete ASAP. He or she finds the parts required and then completes the work to the best of their ability given the current time constraints, tools, and parts available.

The time down is logged in the control room and the work order can be connected to that log. But here is the first obstacle: How many hours were on the asset since the last failure? Sure, the date and time of the failure are tracked, but who knows the actual run time on that asset? How many starts and stops occurred in between? Are starts recorded? Is this asset part of a redundant set and included in some type of functional switching program? From my point of view, the only reasonable way to determine run time is to install hour meters. I would be interested to hear other ideas.
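To make the gap between calendar time and run time concrete, here is a minimal sketch in Python. It assumes that starts and stops happen to be logged as timestamped events, which many sites will not have; the function name and the event format are hypothetical, and an hour meter would replace all of this with a single reading.

```python
from datetime import datetime

def run_hours_and_starts(events, since, until):
    """Return (run hours, number of starts) between two timestamps.

    events: (timestamp, state) pairs sorted by time, where state is
    "start" or "stop". This event format is an assumption for illustration.
    """
    hours = 0.0
    starts = 0
    running_since = None
    for ts, state in events:
        if state == "start" and running_since is None:
            running_since = ts
            if since <= ts < until:
                starts += 1
        elif state == "stop" and running_since is not None:
            # Only count the part of the run that falls inside the window.
            lo, hi = max(running_since, since), min(ts, until)
            if hi > lo:
                hours += (hi - lo).total_seconds() / 3600.0
            running_since = None
    # Asset was still running at the end of the window (e.g. at the failure).
    if running_since is not None and until > max(running_since, since):
        hours += (until - max(running_since, since)).total_seconds() / 3600.0
    return hours, starts

# Made-up log: two runs inside a 48-hour window between failures.
log = [
    (datetime(2024, 1, 1, 0, 0), "start"),
    (datetime(2024, 1, 1, 10, 0), "stop"),
    (datetime(2024, 1, 2, 0, 0), "start"),
    (datetime(2024, 1, 2, 5, 0), "stop"),
]
result = run_hours_and_starts(log, datetime(2024, 1, 1), datetime(2024, 1, 3))
# result is (15.0, 2): 15 run hours and 2 starts in 48 calendar hours
```

The point of the example is only that the 48 hours between failure timestamps and the 15 hours actually run can differ widely, which is why the failure date alone is not enough.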

That is only obstacle one. Now the question remains: Can the failure be tracked to a particular mode on a specific component? If so, then the failure can be included in a parametric analysis (see previous article) in order to compute Reliability. How can we know this? It's really up to the repair staff to record it. Typically there is a spot for free-form text in the work order documentation. But how does one "mine" these data? A routine could be written to look for keywords to trigger tracking, but what if the same failure is recorded with vastly different keywords? Typically a CMMS has the ability to use Problem, Cause, and Remedy codes in order to better capture data. The combination of such codes could be used to determine failure mode and component details.
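As a rough illustration of the two approaches, the sketch below compares a keyword search over free-form text with a lookup on a Problem/Cause/Remedy combination. Every code value, phrase, and function name here is hypothetical and not taken from any particular CMMS.

```python
# Hypothetical code combinations mapped to (component, failure mode).
FAILURE_MODES = {
    ("LEAKING", "SEAL_MALFUNCTION", "REPLACEMENT"): ("cartridge seal", "seal leak"),
}

# Hypothetical phrases that should trigger tracking of each failure mode.
KEYWORDS = {
    "seal leak": ["seal leak", "leaking seal", "leaky seal", "seal dripping", "mech seal"],
}

def classify_by_codes(problem, cause, remedy):
    """Look up the failure mode from a code combination, or None if unmapped."""
    return FAILURE_MODES.get((problem, cause, remedy))

def classify_by_text(text):
    """Return the first failure mode whose phrases appear in the text, or None."""
    lowered = text.lower()
    for mode, phrases in KEYWORDS.items():
        if any(phrase in lowered for phrase in phrases):
            return mode
    return None

hit = classify_by_text("Replaced leaking seal on P-101")
miss = classify_by_text("Pump dripping at shaft, swapped cartridge")
coded = classify_by_codes("LEAKING", "SEAL_MALFUNCTION", "REPLACEMENT")
# hit is "seal leak"; miss is None even though it describes the same failure;
# coded is ("cartridge seal", "seal leak")
```

The second work order describes the same seal failure but matches no phrase, so the keyword routine silently drops it. The coded entry does not depend on wording at all, which is the case for using codes.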

The real challenge with failure codes lies in two areas: the actual lists of codes, and the interpretation of those lists. The creator of those codes has to work backwards. By this I mean first determine which data are required in order to make useful calculations. For example, in order to track pump cartridge seal failures the codes might be: Problem = leaking, Cause = seal malfunction, Remedy = replacement. Then the Reliability Staff might be able to track hours (using the run-time hour meter) on a new seal installed in a specific pump.
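A minimal sketch of that tracking, assuming each work order carries the hour-meter reading taken at the time of the repair. The record layout, the code strings, and the pump tag are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical code combination that identifies a cartridge seal replacement.
SEAL_FAILURE = ("leaking", "seal malfunction", "replacement")

@dataclass
class WorkOrder:
    asset: str
    problem: str
    cause: str
    remedy: str
    meter_hours: float  # hour-meter reading when the repair was made (assumed field)

def seal_lives(work_orders, asset):
    """Run hours accumulated by each seal between successive replacements."""
    readings = sorted(
        wo.meter_hours
        for wo in work_orders
        if wo.asset == asset and (wo.problem, wo.cause, wo.remedy) == SEAL_FAILURE
    )
    # Each seal's life is the meter difference between consecutive replacements.
    return [later - earlier for earlier, later in zip(readings, readings[1:])]

# Made-up history for one pump, including an unrelated bearing job.
history = [
    WorkOrder("P-101", "leaking", "seal malfunction", "replacement", 1200.0),
    WorkOrder("P-101", "noisy", "bearing wear", "replacement", 3000.0),
    WorkOrder("P-101", "leaking", "seal malfunction", "replacement", 4700.0),
    WorkOrder("P-101", "leaking", "seal malfunction", "replacement", 6100.0),
]
lives = seal_lives(history, "P-101")
# lives is [3500.0, 1400.0]
```

Note that the life of the seal removed at the first recorded replacement cannot be computed unless the meter reading at its installation is also known.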

Creating the lists is one thing; interpreting the lists is another. Consider how many different causes would have to be generated in order to handle every failure in the system. And then consider how much time is given to the crafts to document the failure. Even using drop-down lists in the CMMS is cumbersome if the desired code doesn't pop up high on the list. So, those entering these data must see the value in taking the time, in spite of the pressures, to find the right combination of codes.

In my experience, most people will do amazing work if they feel their efforts are valuable. The only way to create that feeling is to actively present what is done as a result of the effort. The data mining must lead to actionable tasks that bring improved machine performance, and that improvement must be on display for all to see. For example, suppose the Reliability Staff takes the data from those pump seal failures and uses them to compute seal Reliability. If that Reliability value is then used to select a better seal, add a scheduled replacement, or change how the pump is operated, the value of the effort is recognized. The improvement in Reliability as a result of those actions is recorded and used as a benchmark for expanding the program.
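As one possible illustration of computing seal Reliability from recorded seal lives, the sketch below fits a two-parameter Weibull distribution by median rank regression, using Benard's approximation for the median ranks. The choice of Weibull is an assumption made for this example and is not necessarily the parametric model intended here; the failure times are made up, and the fit ignores censored (still running) seals.

```python
import math

def fit_weibull(times):
    """Fit shape (beta) and scale (eta) by median rank regression.

    Assumes complete data: every time is an observed failure, no censoring.
    """
    t = sorted(times)
    n = len(t)
    xs = [math.log(ti) for ti in t]
    # Benard's approximation of the median rank for the i-th ordered failure.
    ranks = [(i - 0.3) / (n + 0.4) for i in range(1, n + 1)]
    ys = [math.log(-math.log(1.0 - f)) for f in ranks]
    mx, my = sum(xs) / n, sum(ys) / n
    # Least-squares line y = beta * x - beta * ln(eta).
    beta = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    eta = math.exp(mx - my / beta)
    return beta, eta

def reliability(t, beta, eta):
    """Probability that a seal survives t run hours under the fitted model."""
    return math.exp(-((t / eta) ** beta))

# Made-up seal lives in run hours.
seal_hours = [1400.0, 2100.0, 2900.0, 3500.0, 4600.0]
beta, eta = fit_weibull(seal_hours)
r_2000 = reliability(2000.0, beta, eta)  # chance a new seal reaches 2000 hours
```

A value like `r_2000` is the kind of number that can be put on display: it can be compared before and after a seal change, or used to judge whether a scheduled replacement interval is reasonable. With only a handful of failures the fit is rough, so it is better treated as a trend indicator than as a precise figure.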

I recommend starting small: assign codes and give training on those codes for a short list of bad actors. Choose very low-hanging fruit to start. Learn from this pilot program, and then let it grow.