I’d like to introduce a new concept to the designers of off-road vehicles. It’s the same one used to design complex IT systems when failure means losing a lot of money, reputation or even lives. It’s called graceful degradation, and it’s not that hard.
Let’s say you’re designing a high-volume retail website. It has to be available to customers 24×7, which means it needs to be what’s known as “highly available”. That’s because if the system goes down it has a big effect on the business. But what’s “down”? Like any site, complex retail applications have a large number of features. There’s a personalisation engine, product search, browsing, email notification, shopping cart, payment gateway, support for many types of browser and much more. Not all of those features are critically important to the site’s operation – for example, personalisation could be lost for a while and nobody would particularly care. But if payments go down or have errors…that’s serious, so you protect that. And here’s the concept – graceful degradation is the loss of capability due to failure in such a way that the core functions of a system are maintained as long as possible. So you design the site taking into account all these different requirements for availability, and to meet those requirements there are a few principles you’d follow.
The first one is separation of concerns. Let’s say the search engine develops a fault. That shouldn’t also take down the browse or payment engine, or anything else. I wouldn’t want to explain to the CEO why for example a failure of the embedded weather feed melted the entire system.
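To make that concrete, here is a minimal sketch of fault containment in Python. The feature names and the render_page function are my own inventions for illustration, not anything from a real retail platform. The point is only that each feature runs inside its own boundary, so a fault in one is recorded and the rest carry on.

```python
def render_page(features):
    """Run each feature in isolation so one fault cannot take down the rest."""
    page, failed = {}, []
    for name, feature in features.items():
        try:
            page[name] = feature()
        except Exception:
            # Contain the fault inside this one feature and note it for later.
            failed.append(name)
            page[name] = None
    return page, failed


def weather_feed():
    # Hypothetical optional feature that happens to be broken today.
    raise TimeoutError("weather provider not responding")


def product_search():
    # Hypothetical core feature that is working normally.
    return ["tyre", "winch", "snatch strap"]


page, failed = render_page({"weather": weather_feed, "search": product_search})
print(page["search"])  # ['tyre', 'winch', 'snatch strap']
print(failed)          # ['weather']
```

The weather feed dies, search still answers, and somebody gets a list of what failed rather than a dead site.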
The next one is redundancy. Certain systems are likely to fail, so you’d have multiple webservers sharing the load, and within that multiple CPUs, disks and so on – this is eliminating single points of failure. Allied to that is pre-emptive information about failure so action can be taken before things go wrong – excess use of memory here, higher than normal temperatures there, unusual log activity and so on.
Another principle is failure management. That starts with information about what’s failed, why, and how. If the failed component can’t report its state, then you should at least know what’s happened by the absence of a report (positives from other components) and be able to isolate the failed component. But there’s not much point in having information if you can’t do anything with it, and that’s why there is controllability, or the ability to do precisely what you want with the system.
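Knowing about a failure by the absence of a report is usually done with some kind of heartbeat. The sketch below assumes each component records the time of its last report. The component names, timestamps and five-second timeout are made up for the example.

```python
def find_silent_components(last_report, now, timeout_s=5.0):
    """Return the components whose last report is older than timeout_s seconds."""
    return sorted(
        name for name, reported_at in last_report.items()
        if now - reported_at > timeout_s
    )


# Hypothetical timestamps (in seconds) of the last report from each component.
last_report = {"search": 100.0, "payments": 99.0, "weather": 90.0}

silent = find_silent_components(last_report, now=101.0)
print(silent)  # ['weather']
```

Search and payments have reported within the last couple of seconds, so the silence from the weather feed stands out and that component can be isolated.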
Let’s say the payment system is sucking up lots of CPU cycles – you might decide to shut down other, fully-functional components in order to give the payment system whatever it needs even though it wants more than its share. Or let’s say there’s a problem with the security system and you decide to risk turning the profanity filter off to see if traffic gets through after that, trading security against function. Complete control.
These principles are interdependent and are used to design the world’s highest-performing computer systems where the consequences of failure are dire.
These exact same principles apply equally well to an overlanding vehicle, because again the consequences of failure there may be equally dire. However, no manufacturer truly implements them and even worse, several actually reverse the principles.
The reason the principles aren’t implemented in 4X4s is simple. They are not cost-effective. Following the principles costs money as it adds complexity and effort, all the way through the stages from initial concept to design to build, test and maintenance. Other important factors may also be compromised as there is only so much effort and cost to be spent on any given project. The problem is that the cost and effort involved won’t pay off in additional sales, and therefore the return on investment does not stack up. This is simple commercial logic.
However, I’m not writing about simple commercial logic here, I’m writing about what overlanders need from their cars, and that is indeed a measure of graceful degradation and high availability design, or at the very least car manufacturers not entirely ignoring the principles. Let’s look at a typical overland scenario to illustrate the point.
Overland travel means journeys in rough terrain, far from any form of assistance. Vehicle reliability becomes critical, as indeed it does for the likes of the emergency services and even people who rely on the car for their daily living. The reliability of a utility 4X4 may well be a matter of life and death, and this is more likely to be so than for a normal car. Yet 4X4s have systems that, from a high-availability perspective, are very poorly designed.
Most subsystems on a car are interlinked. The wheel speed sensors are used as an input to traction control, ABS, stability control, gearshift change patterns and much more. Failure of one component has a detrimental effect on many car functions, and when a failure occurs the car goes into limp mode. This is a state of reduced function designed to protect the vehicle from further damage. For example, the revs may be limited to 2000, speed restricted, gears limited to just second or third, low range made unavailable, the likes of traction control disabled and, in the case of Land Rover, the air suspension may lower. You’re left with a car that can just about move itself on flat, level ground. Apart from protecting the car this also protects the driver. If a driver is used to braking with ABS then suddenly having to do without will be a big shock and potentially lead to an accident. This design is also simple – anything wrong, default into limp mode. No complex and/or/if logic needed to determine what’s gone wrong and why, just kill it all. And most customers don’t care, but overlanders aren’t most customers.
So when a car fails and goes into limp mode it is usually the case that more functions than necessary are disabled. This is wrong. Let’s say a wheel speed sensor has failed. This need have no effect on the engine or gears, as the engine is unaffected and the vehicle speed can still be determined from the remaining three sensors. Even if all sensors fail then the system should fall back to determining shift points by other means as was done before electronics, and ultimately just give the driver control if it can’t work it out at all. And if the suspension isn’t faulty…leave it alone! This goes back to the principle of graceful degradation. One example I know of concerned a new 4X4 that refused to start. The fault lay in the satnav – I forget the details, but the satnav had a problem and as that was connected to the rest of the vehicle’s systems the end result was the car would not even start its engine.
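The wheel speed sensor case can be sketched as a fallback chain. This is my own illustration in Python, not how any manufacturer’s ECU works. A failed sensor is represented as None, and the gearbox output shaft reading is a stand-in I’ve assumed for the “other means” of working out speed.

```python
from statistics import median
from typing import Optional, Sequence, Tuple


def estimate_speed(
    wheel_kmh: Sequence[Optional[float]],
    gearbox_kmh: Optional[float] = None,
) -> Tuple[Optional[float], str]:
    """Return (speed, source), degrading one step at a time."""
    healthy = [v for v in wheel_kmh if v is not None]
    if healthy:
        # One or more wheel sensors still work, so use whatever is left.
        return median(healthy), "wheel sensors"
    if gearbox_kmh is not None:
        # Assumed secondary source: speed derived from the gearbox output shaft.
        return gearbox_kmh, "gearbox output shaft"
    # Nothing left to measure with, so hand the decision to the driver.
    return None, "driver control"


print(estimate_speed([40.0, None, 41.0, 42.0]))      # (41.0, 'wheel sensors')
print(estimate_speed([None] * 4, gearbox_kmh=38.0))  # (38.0, 'gearbox output shaft')
print(estimate_speed([None] * 4))                    # (None, 'driver control')
```

Each failure costs one step of capability, and at no point does a dead sensor touch the engine or the suspension.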
Next up are the tradeoffs. Let’s say you’re in the middle of a huge desert with lots of dunes, and your engine cooling system has failed, and the car is in limp mode. Unfortunately, 2000rpm won’t be enough to get over the dune. A solution would be to idle the engine with the bonnet open to cool it, then get up over the next dune with a few seconds of high revs, then stop again to cool, all the time keeping the engine temperature within acceptable limits. Or maybe wait until evening when it is cooler. But again a catch-all limp mode doesn’t allow the driver that sort of flexibility. The driver may also decide that as they have seriously injured passengers aboard they will sacrifice the engine and risk a complete failure in exchange for being able to drive quickly the last few kilometres to medical assistance. So in another scenario, can you imagine the conversation at the bottom of a valley with people needing help?
“What’s wrong with the car?”
“Limp mode, caused by a steering sensor failure.”
“How does that stop us moving?”
“The steering sensor is needed for the electronic stability systems, which have shut down. As a precaution, the car goes into limp mode and now the engine won’t give us any more than 2000rpm. That’s not enough to climb this hill to safety.”
“What’s wrong with the engine?”
“Nothing at all.”
“So why can’t we just go? He’s dying!!!”
Why not indeed. Should I ever be in that situation I’ll be sure to invite the vehicle designers and their accountants to the funeral.
So what can be done?
For utility 4X4 vehicles the high-availability design needs to be more robust than that of a normal car or even pseudo-4X4. Here are the important systems to keep running, in order:
Everything else can disintegrate. If need be, you can drive without traction control, ABS, stability control, satnav, electric windows, power steering, adaptive terrain systems and even, yes, brakes if you’re careful, especially in low range with good engine braking. Steering is pretty easy as that’s largely mechanical even today, so it leaves the engine and gears as the big ones. And if part of a system fails…don’t kill it all. If third gear is a problem leave first, second, fourth and above. If the engine is overheating, let it run anyway; if it must be rev limited, warn but do not place a hard limit. If the alternator fails let the electrics run off the battery without complaint, just a warning. All this is another of the design principles of giving the driver information about what’s happening and allowing them to make the decision about what will happen, when, and how. This again is where modern cars fail. Usually you just get a short message saying “something wrong” and that’s about it. There’s not enough information to be able to figure out what’s gone wrong and what can be done, or what tradeoffs can be made.
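Here is a rough sketch of what “don’t kill it all” and “warn but do not limit” could look like in software. The function names, the six-speed gearbox and the 2000rpm figure are assumptions for illustration only.

```python
def available_gears(all_gears, faulty):
    """Drop only the gears known to be faulty and keep every other one."""
    return [gear for gear in all_gears if gear not in faulty]


def check_revs(requested_rpm, advisory_rpm):
    """Warn when the driver exceeds the recommended revs, but never clamp them."""
    warning = None
    if requested_rpm > advisory_rpm:
        warning = (
            f"Engine hot: recommended maximum {advisory_rpm} rpm, "
            f"driver requested {requested_rpm} rpm"
        )
    # The driver's request is always passed through unchanged.
    return requested_rpm, warning


print(available_gears(range(1, 7), faulty={3}))  # [1, 2, 4, 5, 6]

rpm, warning = check_revs(requested_rpm=3500, advisory_rpm=2000)
print(rpm)      # 3500
print(warning)  # Engine hot: recommended maximum 2000 rpm, driver requested 3500 rpm
```

The computer still says what it thinks, and says it in detail, but the decision stays with the person who knows why they need 3500rpm right now.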
The basic physical design of the car also needs work. ECUs are electronic control units, and they fail. So make them FRUs, or field-replaceable units. Unsnap one from its mount, snap in a replacement just like the military do with their high-tech equipment, and just like you would with a desktop computer. In days gone by we carried spare axles. Nowadays we would carry spare ECUs, if it was possible to change them in the field. If the computer doesn’t have enough information to work out what to do, for example whether to lock the torque splitter, then hand control back to the driver.
Ultimately what’s needed is the electronic equivalent of an old Land Rover where you have complete control over the entire vehicle, and that in effect means complete control over what the computers do. Fully implementing these principles will be too expensive, but manufacturers should be able to at least go part-way by having sensible use of limp modes and more information about what’s gone wrong. Here’s a list of 4WD design principles:
Information – detailed data on what’s failed, about to fail, perhaps why and how
Graceful degradation – preserve steering, engine and gears at all costs
Manual override controls for everything – anything the computer controls the driver should be able to override, from locking diffs to gearshift points to air suspension
No computer-controlled limits – computers can recommend a maximum speed, revs, suspension settings or whatever it is, but the driver should be able to override
FRUs – make the critical parts of the critical components easily swappable field-replaceable units.
Now the average driver won’t know or care about it and in the wrong hands control is dangerous, so to solve that problem all we need is the override controls buried somewhere non-obvious. Those that take the time to know their vehicle will find them, those that don’t, won’t. And if you haven’t taken the time to research and learn your car before overlanding then you shouldn’t be on the trip in the first place. That will never change, no matter what sort of vehicle we use in the future. All we need is the ability to use our skills and knowledge when the situation demands it.