How To Build A Maintenance Shop

Mac Davis
2 days ago

A core function approach to eliminating maintenance reactivity.

Walk into almost any industrial facility and ask the maintenance manager how things are going.

You will hear some version of the same answer: "We're getting killed and everyone is mad at us."

It's typically not a laziness or staffing problem. It's usually an organizational issue and sometimes a matter of technical knowledge. But it's fixable. 

Running maintenance is like driving through a forest. It's easier if you don't hit all the trees.  

So avoidance is the game, and learning to avoid your biggest issues is how you make life easy.

I have done this. I have taught this. It's doable. It's repeatable.

First, about this premise:  

The Pareto rule says that, no matter where you are, roughly 80% of your losses come from 20% of your problems. Restated: today's problem is probably your biggest problem, and preventing it from happening again is probably the most valuable thing you can do.
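To see where your own losses concentrate, you can rank downtime by cause. The sketch below is one minimal way to do it; the log entries, cause names, and the `pareto_table` helper are hypothetical and only illustrate the idea.

```python
from collections import Counter

def pareto_table(events):
    """Rank causes by total downtime and attach the cumulative share of losses."""
    totals = Counter()
    for cause, hours in events:
        totals[cause] += hours
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    rows = []
    running = 0.0
    # most_common() returns causes sorted from largest to smallest total
    for cause, hours in totals.most_common():
        running += hours
        rows.append((cause, hours, running / grand_total))
    return rows

# Hypothetical downtime log: (cause, hours lost)
downtime_log = [
    ("conveyor belt", 24.0),
    ("conveyor belt", 16.0),
    ("photo eye", 1.5),
    ("air leak", 2.0),
    ("jammed chute", 0.5),
    ("conveyor belt", 4.0),
    ("loose guard", 2.0),
]

for cause, hours, cumulative in pareto_table(downtime_log):
    print(f"{cause:15s} {hours:6.1f} h  {cumulative:6.1%} cumulative")
```

In this made-up log, one cause out of five accounts for 44 of the 50 lost hours. That is the tree to avoid first.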

Winning in industrial maintenance is all about figuring out how not to have the same problem again: learning to avoid each tree after each impact. Then build the fix into the organizational processes so it becomes permanent.

That is, fundamentally, organizational learning. 

The way we win is we modify our system every day so we don't have to experience the same problems again.

The Gravitational Pull of the Floor

Here is the honest reality of industrial maintenance: firefighting has gravity.

Fires are loud, visible, and urgent. A line that is down demands a response right now. The pressure is immediate and the feedback is clear: you fix it, production resumes, and for a moment everyone is relieved.  

That relief feels like success. 

Decisions about the future are quiet. Nobody pages you because a PM needs to be revised. Nobody calls an emergency meeting because a critical spare isn't stocked. Nobody sounds an alarm because a technician doesn't remember how to test a three-phase motor.

Those things aren't urgent but they are important. In fact, handling them well is more important than handling the urgencies well.

Someone has to rise above the daily fire drill and focus on fixing the process. This means limiting participation in the reactive behavior of the floor and delivering calm thought and administrative actions that drive the organization forward.

That someone is the maintenance manager. The mechanism needed is a set of five operational decisions that have to be made every single day for every single problem.

The 5 Decisions That Change the Future

The daily analytical work of a maintenance manager begins with a simple premise: every failure from yesterday is a learning and organizational improvement opportunity today. 

For every problem that occurred yesterday, the manager works through five questions. Each one represents a lever the maintenance organization can pull to change its future state. Together, they cover every countermeasure available. 

1. Could this be eliminated with preventive maintenance?

If yes, the next question is whether an adequate PM already exists.  

If it does and wasn't followed, that is a standards enforcement failure and the manager holds the supervisor accountable for it. Not the technician. The supervisor owns execution-to-standard within their team; that accountability chain cannot be bypassed without destroying it.

If the PM doesn't exist or isn't sufficient, it gets fixed.

What PM doctrine do we use? 

Quadrants of Failure

The quadrants of failure chart plots how fast things fail on one axis and whether failures are random or age-based on the other.

So when we write PMs, we want them to do two things: create conditions that slow or prevent failures (this should be done frequently), and inspect the state of degradation of components so we can write work orders to replace them (this should be done on the scale at which degradation is expected).

2. Could the downtime have been reduced by stocking a spare part?

Some failures can't be prevented. PLC failures, for example. We run PLCs to failure because that's the most efficient way to handle them (you can keep them cool, clean, and all your connections tight, though).

However, extended downtime because the team was waiting on freight is a solvable problem. If the failure mode has a chance of repetition, the cost of carrying the part almost always beats the cost of the next unplanned event.
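A rough way to check that claim is to compare what the freight wait costs per year against what the part costs to carry. The sketch below assumes the spare removes the waiting time entirely and ignores the repair time itself; every number and function name in it is hypothetical.

```python
def annual_cost_without_spare(failures_per_year, lead_time_hours, downtime_cost_per_hour):
    """Expected yearly cost of waiting on freight when no spare is on the shelf."""
    return failures_per_year * lead_time_hours * downtime_cost_per_hour

def annual_cost_of_spare(part_price, carrying_rate):
    """Yearly cost of holding one spare, as a fraction of its purchase price."""
    return part_price * carrying_rate

def should_stock(failures_per_year, lead_time_hours, downtime_cost_per_hour,
                 part_price, carrying_rate):
    """True when carrying the part is cheaper than waiting for it."""
    waiting = annual_cost_without_spare(
        failures_per_year, lead_time_hours, downtime_cost_per_hour)
    carrying = annual_cost_of_spare(part_price, carrying_rate)
    return carrying < waiting

# Hypothetical belt: fails once every two years, 24 h freight wait,
# 2000 per hour of lost production, 400 part price, 25% yearly carrying rate.
print(annual_cost_without_spare(0.5, 24, 2000))   # 24000.0
print(annual_cost_of_spare(400, 0.25))            # 100.0
print(should_stock(0.5, 24, 2000, 400, 0.25))     # True
```

With your own downtime cost and lead times plugged in, the comparison also shows which parts are not worth carrying.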

Now, buying parts because you can't keep up with your stocking system is a different problem with a different doctrine to fix it. But for this one, most of the items that you use are items you will use again. Having a spare usually makes sense.

If you reference the quadrants of failure doctrine discussed in the previous section, good PM practices should allow low stock levels on everything but fast-random failures (electronics, largely).
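As a sketch, the rule in the paragraph above can be written down as a lookup on the two axes of the chart. The enum names, the `stock_posture` function, and the quadrant assigned to each example component are assumptions for illustration; place your own components where your failure history says they belong.

```python
from enum import Enum

class Speed(Enum):
    FAST = "fast"
    SLOW = "slow"

class Pattern(Enum):
    RANDOM = "random"
    AGE_BASED = "age-based"

def stock_posture(speed, pattern):
    """Fast-random failures give no warning, so a spare is the countermeasure.
    Everything else can run at low stock if inspections provide lead time."""
    if speed is Speed.FAST and pattern is Pattern.RANDOM:
        return "stock a spare"
    return "keep stock low; rely on inspection lead time"

# Hypothetical placements on the chart (assumed, not taken from any standard)
components = {
    "PLC": (Speed.FAST, Pattern.RANDOM),
    "sprocket": (Speed.SLOW, Pattern.AGE_BASED),
    "filter": (Speed.SLOW, Pattern.AGE_BASED),
}

for name, (speed, pattern) in components.items():
    print(f"{name}: {stock_posture(speed, pattern)}")
```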

Stocking stuff you don't need consumes money you do need. You shouldn't be buying a 2 year supply of anything.

It's a good idea to have some stock of even the inspection categories from the quadrants of failure, but you're unlikely to have a freak sprocket or filter failure. If you're losing those items in an emergency manner, it's either a catastrophic crash or it's because you didn't inspect and respond.

Building your inspection processes greatly alleviates the pressure to get your stock right because it gives you lead time on problems.

3. Could a machine modification eliminate the issue?

Some failures are not just maintenance problems. They are engineering problems wearing a maintenance uniform. If a failure mode could be prevented by a machine modification then it may be worth doing. Identifying modification opportunities is legitimate analytical output from this daily review.

Once identified, you would implement review and change management processes to handle this correctly.

4. Could a tool mitigate or eliminate the problem?

Diagnostic time is downtime. If the team spent two hours troubleshooting blind when the right instrument would have provided clarity in twenty minutes, that is a tool gap.

And it's not just instrumentation. Pullers instead of hammers, for instance, are much faster and less destructive. Buy tools and put them in an organized crib so you can tell they're accounted for. Check them daily. 

5. Could better information have helped?

This is the most nuanced question, because it forces a second decision: does this information belong in someone's memory, or does it belong in a reference?

These are not interchangeable. Tasks that technicians perform regularly (procedures, sequences, safety steps) need to be trained to retention through repetitive practice. Expecting people to perform reliably under pressure from a half-remembered walkthrough is a setup for failure.

Those tasks belong in an annual training plan so that they're retrained until they can be performed from memory.

Everything else: complex troubleshooting logic, rare failure modes, machine-specific schematics, vendor procedures... that all belongs in a job plan database that the team can access at the moment of need. A reference nobody can find is not a reference. It's a wasted resource.

Your reference information needs to be easy to search. It needs to be organized and new information needs to be added to it daily. Records of previous failures and how we solved them are absolutely necessary as references. Lessons learned are references too.

The discipline is deciding clearly which category each piece of knowledge belongs to and then building the infrastructure for both.

If you have a training policy and a database for referencing information you should be updating each as daily failures outline needed changes.

Why These Decisions Have to Happen Every Day

Each of these decisions, executed consistently, removes a source of variance from the department's future.

Failures that recur stop recurring. Diagnostic dead-ends get shorter. New technicians onboard faster because the knowledge is in the system rather than locked in senior heads.

But the effect only compounds if the decisions are made at operational tempo - daily, against fresh data, while the specific context of each failure is still intact. 

Review failures weekly and you lose granularity. Review them monthly and they become statistics, useful for trend analysis but nearly useless for operational decision-making.

The daily rhythm is not a preference. It is what makes the mechanism work. The documents that these 5 questions drive have to be updated daily. 

If the manager does not force these five decisions to happen every single day, everyone defaults to firefighting.

The gravitational pull of the floor takes over. The planner has no inputs to work from. The supervisors have no new standards to enforce. The technicians keep solving the same problems the same way. The department runs hard and goes nowhere.

The five decisions are the only things that change the future.

Everything else is just treading water.

An example to illustrate: a belt failure. When deciding how to react to an issue, frame it by the time it cost you.

For example: I had 24 hours of downtime because a belt broke. How can I reduce that downtime?

Scenario: a belt was hard to inspect, so the tech skipped it. It dry rotted and failed. The junior tech on night shift couldn't figure out what the issue was, and it took 24 hours to get another belt. The issue isn't that the belt failed from age.

1) Your team didn't follow the PM or the belt would have been checked. The supervisor needs to own that, be held accountable, and fix it. Supervisors are responsible for delivering compliance... they are the ones you hold accountable when compliance isn't delivered. If you don't hold them accountable they will never talk to techs about standards.

2) Maybe the PM needs to be updated with better instructions. If the belt is hard to inspect, consider changing the PM or replacing the belt annually instead of inspecting it. If it's dry rotted, you might want to start changing it more often.

3) If a tech couldn't figure out what to do, you need to either create repetitive training or a reference the tech can use. That reference could be a change in the job plan. You want competence next time this happens.

4) You didn't have the belt on-hand - this likely accounts for most of your downtime. You might need to stock it.

5) You may have an opportunity to modify the machine to make it easier to maintain.

That's five countermeasures from a single belt failure. This is the work maintenance managers (and planners) need to do.
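If it helps to see the daily review as a record, here is one minimal sketch of the belt failure run through the five questions. The `FailureReview` fields, the `actions` function, and the wording of each action are hypothetical. Owners are named only where the article assigns them, and the scenario produced no tool action, so `tool_gap` is set to false.

```python
from dataclasses import dataclass

@dataclass
class FailureReview:
    description: str
    downtime_hours: float
    pm_could_prevent: bool   # Question 1
    pm_exists: bool
    pm_followed: bool
    pm_adequate: bool
    spare_would_help: bool   # Question 2
    modification_possible: bool  # Question 3
    tool_gap: bool           # Question 4
    info_gap: str            # Question 5: "none", "memory", or "reference"

def actions(review):
    """Turn the answers to the five questions into a list of actions."""
    out = []
    if review.pm_could_prevent:
        if review.pm_exists and not review.pm_followed:
            out.append("Supervisor: account for the skipped PM and restore compliance")
        if not review.pm_exists or not review.pm_adequate:
            out.append("Planner: write or revise the PM")
    if review.spare_would_help:
        out.append("Evaluate stocking the part")
    if review.modification_possible:
        out.append("Scope a machine modification and escalate it")
    if review.tool_gap:
        out.append("Buy the tool and add it to the crib")
    if review.info_gap == "memory":
        out.append("Supervisor: add the task to the training plan")
    elif review.info_gap == "reference":
        out.append("Planner: add a reference to the job plan database")
    return out

belt = FailureReview(
    description="Belt broke, dry rotted, inspection skipped",
    downtime_hours=24.0,
    pm_could_prevent=True,
    pm_exists=True,
    pm_followed=False,
    pm_adequate=False,       # hard to inspect, so the instructions need work
    spare_would_help=True,
    modification_possible=True,
    tool_gap=False,
    info_gap="reference",
)

belt_actions = actions(belt)
for line in belt_actions:
    print("-", line)
```

A spreadsheet row with the same columns would do the same job. The point is that every failure gets all five questions, every day.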

The Structure That Makes It Possible

Here is where most well-intentioned maintenance improvement efforts break down: the decision logic is sound, but the organizational structure needed to execute on it doesn't exist.

A maintenance manager making five decisions a day will generate a continuous stream of required actions. The manager's staff is critical to help push these tasks into implementation. PMs need to be revised and written (and with daily changes, the techs absolutely must perform them with reference in-hand). Parts need to be evaluated and stocked. Reference documents need to be built. Training needs to be designed and drilled. Modifications need to be scoped and escalated.

None of that happens automatically. It happens because the right people, with the right competency profiles, are in the right seats. In a well-built maintenance organization, three distinct functional roles have to exist.

The Manager

The manager's job is to own the daily decision rhythm. Not to execute the downstream work but to ensure it happens through clear processes and accountable people.

The manager needs to work on processes and, for that reason, the supervisors must be compelled to own compliance to the standards and processes the manager creates.

The manager's time goes to three things: making the five daily decisions and assigning their implied actions, owning the procedural systems that ensure those actions get completed, and developing supervisors to be better at leading their teams.

This role requires genuine technical depth... enough to correctly evaluate whether a PM is insufficient or simply wasn't followed, whether a failure mode is a maintenance problem or an engineering problem, whether a tool gap is real or a training gap in disguise. A manager without that technical foundation cannot make these decisions accurately. They can only relay information.

This role also requires maintenance doctrine expertise: knowledge of reliability principles, PM strategy, information management, and how accountability structures function in an industrial organization. Technical skill alone is not enough. You have to understand the system you are building.

The Planner

Planning and scheduling are two distinct, critical functions. They are not afterthoughts. They are not shared duties. And they do not get absorbed into the supervisor role when headcount is tight.

The planner is the execution arm of the manager's decisions and is the primary player who handles the ingesting of information into the processes. When the manager determines that a PM needs to be revised, the planner is the person who rewrites it correctly. When a new procedure needs to be documented, the planner builds it.

This doesn't require a full beautiful procedure for every task. That's nice, but not necessary. If a job goes terribly today and the tech learns something important... that lesson alone, written in sufficient detail, is a good job plan if it's clear enough and delivered effectively enough to prevent recurrence.

This role requires two competencies that rarely live in the same person: the technical knowledge to understand how a task should actually be performed, and the administrative skill to write it clearly and executably. A PM written by someone with only one of those qualifications is a PM that technicians will either mis-execute or ignore. It doesn't have to be elegantly written, but it's got to be clear.

The planner also owns scheduling, ensuring that the right work gets done at the right time with the right resources. This is a cognitive function, not a clerical one. Done well, it is one of the highest-leverage positions in the department.

The Supervisor

The supervisor owns training execution and standards enforcement. These are not separate responsibilities. They are the same responsibility expressed in two directions. 

The supervisor must personally master every skill in the training deck. Not be familiar with it. Not have reviewed it. Master it. You cannot lead training if you do not know the material, and you cannot hold a technician accountable to a standard you cannot yourself demonstrate.

Once trained, the standard exists. The supervisor's job is to ensure it is followed consistently, visibly, without exception. When a failure event traces back to a standard that wasn't followed, the manager holds the supervisor accountable. The supervisor holds the technician accountable. That chain of accountability is not bureaucratic formality. It is the mechanism that keeps standards alive.

A supervisor who cannot train and cannot or will not enforce standards is a liability.

Building the Shop 

Most maintenance departments don't fail because they lack good people. They fail because they lack role clarity. The manager is absorbed in execution with no time for analysis and doesn't have clarity on how to quickly and efficiently change processes... or what changes are needed. The planner doesn't exist as a distinct function or is wasted making parts lists instead of handling organizational learning. The supervisor is running firefighting instead of ensuring everyone meets standards.

Everyone is doing something that's urgent, but not actually part of moving the organization forward.

The path out is deliberate construction.

Build the daily decision rhythm first, the manager protecting time every day to ask the five questions about yesterday's failures and route each one to an action. That is the engine. Without it, nothing downstream matters because there are no inputs.

Build the planning function as a real role with a real competency profile. Give the planner the technical depth and the administrative skill to turn decisions into executable standards. Invest in their development. 

Build the supervisor into a trainer and an accountable leader. Define the training deck: a finite number of tasks that must be performed from memory. Require mastery. Create the accountability structure that makes enforcement possible and expected.

Then run the system. Daily, consistently, without letting the gravitational pull of the floor collapse the systems that prevent recurrence.

The department will not transform overnight. But every day the five decisions are made, the future gets slightly different from the past. Failures stop recurring. Knowledge gaps stop costing downtime. The team gets faster and more reliable. The fires get smaller and further apart.

As more problems are prevented, the world keeps getting simpler. Time and resources become more available, problem solving gets easier, error rates decrease. Everything gets easier and every line runs better.

That is what a well-built maintenance shop looks like. And it starts with understanding that the structure has to be built on purpose because it will never assemble itself.

What does the planning function look like in your facility? I'd be interested to hear where maintenance organizations are finding the competency profile hardest to fill.
