![post-mortem debugging](http://ithare.com/wp-content/uploads/BB_part209_Post-mortem_v1-640x427.png)
…sometimes I’ve believed as many as six impossible things before breakfast.
— The Queen from Alice in Wonderland —
Continued from Part I. Changes in IT Over Last 50 Years
Ok, as was discussed in Part I, a Damn Lot™ has changed in IT over the last 50 years.
However, there are LOTS of improvements OS users (ranging from driver developers to app developers to end-users) would like operating systems to provide (well, as soon as we realize that such improvements are possible).
Desirable Improvement #1. Flexibility – Choices in Deployment-Time rather than During Development
Indeed, even the same app (let alone the same library) can be used in tons of different deployment scenarios – and providing an ability, say, to run a DB process within the kernel to improve performance (which can fly if the DB server is a single-purpose box) – or within user space to improve stability – is a Good Thing™.
Since such Flexibility is not really available in existing operating systems, it is one of the improvements we’re speaking about; let’s write it down as #1.
Sure, achieving it would be difficult (if at all possible) with binary code – but with Source Code being available (which wasn’t really the case when Multics or VAX kernels were designed), it becomes a viable option.
Desirable Improvement #2. Same App/Driver Code from low-end MCUs to high-end CPUs
IF we manage to have the majority of the same code running on both low-end MCUs and high-end CPUs, this will simplify embedded development (which is extremely important today, in particular for IoT) a lot, and will also allow us to avoid a lot of abominations (such as busy-loops instead of WFI), which are much more frequent in embedded development than they should be, and which cause lots of problems down the road, especially for battery-powered devices. On the other hand, quite a few CPU-oriented systems happen to be horrible resource hogs and could take a page or three from the embedded books.
1 making it run on the world’s smallest computer [WuEtAl], with only 4K RAM and without ROM, is going to be challenging, but given time, we can still hope.
Desirable Improvement #3. Improved App/Driver Stability. Testability. Post-mortem Debugging of Production Crashes
One obvious thing which is highly desirable by everybody – is improving stability of apps and drivers running under the OS. While some people may say “hey, it is not a problem of the OS but of bad app devs” – I’d say that if there are two OS’s, everything else being equal, but one OS stimulating good development practices and providing tools which help to prevent app crashes, and another one which doesn’t – I’d take the 1st one any day of the week. Therefore, at least in my books (pun intended), improving stability of OS-related apps/drivers does qualify as an OS improvement.
Translating it into the admin/app developer world, I’d count at least the following items as very significant and highly desirable improvements:
- Ensuring that the app is testable. Non-testable apps carry an extremely high risk of crashing once per day on a client’s box, without any chance to debug it (if you did deploy your app to a million devices, you know that feeling when after your new release 0.1% of boxes – which means 1000 clients – start to experience a crash every few hours). BTW, testability implies determinism [Fowler].
- An ability to have a checkbox saying ‘move this driver into user space’ (and another one, saying ‘turn detection of memory bugs on’). Both options will mean a performance hit, but if an admin has a badly needed driver/app crashing, it might save everybody’s bacon in quite a few real-world cases – especially if we’re speaking about a 30% performance hit, not about a 30x one.
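To make the link between determinism, testability, and post-mortem debugging a bit more tangible, here is a minimal sketch in Python. It assumes that the app is written as an event handler and that ALL of its inputs (including the current time) arrive as explicit events; the names `RateLimiter`, `run_and_record`, and `replay` are hypothetical and are not taken from any existing OS API.

```python
import json


class RateLimiter:
    """Toy app: a state machine whose only inputs are the events it receives."""

    def __init__(self, limit_per_second):
        self.limit = limit_per_second
        self.window_start = 0.0
        self.count = 0

    def on_event(self, event):
        # 'now' comes from the event itself, never from a system clock call,
        # so the same event log always produces the same state and outputs.
        now = event["now"]
        if now - self.window_start >= 1.0:
            self.window_start = now
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return "accepted"
        return "rejected"


def run_and_record(app, events):
    """Feed events to the app, keeping a serialized log of every input."""
    log, outputs = [], []
    for event in events:
        log.append(json.dumps(event))  # this is what a crash report would carry
        outputs.append(app.on_event(event))
    return log, outputs


def replay(app, log):
    """Re-run a fresh app instance against a recorded input log."""
    return [app.on_event(json.loads(line)) for line in log]
```

Under these assumptions, replaying the log recorded on a client’s box against a fresh `RateLimiter` on a developer’s box reproduces exactly the same sequence of outputs, which is what makes both regression tests and post-mortem analysis feasible.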
Personally (as somebody who oversaw the deployment of Rather Serious Systems™ such as a G20 stock exchange), I see these three improvements in app/driver stability as so important that they alone justify migrating to a new OS.
Desirable Improvement #4. Built-In Fault Tolerance and Scalability to Multiple Boxes
It is a pity that in the XXI century we still need 3rd-party stuff to make our apps fault-tolerant (tolerant to hardware malfunctions, that is); this is not to mention that lots of these 3rd-party fault-tolerant mechanisms are themselves faulty and can easily decrease MTBF [Hare]. OTOH, good fault-tolerance designs DO exist – and IMNSHO they should be a part of a standard OS deployment.
The same goes for an ability to scale a single app – whether an interactive one or an HPC one – onto multiple boxes (this is a classical task of Load Balancing/HPC Scheduling, but for quite a few reasons it is not a part of existing OS designs).
Desirable Improvement #5. Improved Security (wherever desirable)
There is little doubt that security is one Big Fat Problem™.
2 to security purists who deny such a heretical choice outright – let’s not forget that the goal of any security is defined as increasing the cost of breaking in above a certain pre-defined level – usually a multiple of the cost of loss in case of a break-in; this is known as Cost-Benefit Analysis, often used as a part of Risk Analysis. As Bruce Schneier Himself once said: “Figure 5 shows all attacks that cost less than $100,000. If you are only concerned with attacks that are less expensive (maybe the contents of the safe are only worth $100,000), then you should only concern yourself with those attacks.” [Schneier]
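The cost-benefit filter described in this footnote can be sketched in a few lines of Python. The function name `attacks_to_consider`, the attack names, and all the dollar figures below are hypothetical and serve only as an illustration; they are not taken from [Schneier].

```python
def attacks_to_consider(attack_costs, threshold):
    """Return the names of attacks that cost the attacker less than the threshold.

    'threshold' is the pre-defined level from the text above, for example
    the value of the protected asset or a multiple of it.
    """
    return sorted(name for name, cost in attack_costs.items() if cost < threshold)


# Hypothetical numbers, for illustration only.
costs = {"pick lock": 30_000, "bribe guard": 60_000, "cut open safe": 250_000}
cheap = attacks_to_consider(costs, threshold=100_000)
```

With these made-up numbers, only the two cheaper attacks end up in `cheap`; whatever costs more than the threshold is, by this reasoning, not worth spending the defence budget on.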
Desirable Improvement #6. Trying to address Tragedy of the Commons
IF we could create an incentive for developers to use as few resources as possible – it would mitigate the problem at least to some extent. As one example of such an incentive, we could say that those apps which take more time to process their events will have lower priority compared to those apps which take less time (which, in turn, will make faster apps more responsive not only directly, but also indirectly via prioritizing them); this is expected to create at least some reason for apps to be less resource-hungry (in a way somewhat similar to Google policies on website access speed affecting ranking, creating a strong incentive to make more responsive sites). All other suggestions in this regard are very welcome too.
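One possible shape of such an incentive is sketched below in Python. It is only an illustration under my own assumptions: the class name `EventTimePriority`, the use of an exponential moving average, and the smoothing factor `alpha` are all hypothetical and are not a description of any existing scheduler.

```python
class EventTimePriority:
    """Prefers apps which, on average, spend less time handling an event."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha  # weight of the newest measurement (assumed value)
        self.avg = {}       # app name -> moving average of handling time, seconds

    def report(self, app, seconds):
        """Record how long the app took to process its latest event."""
        prev = self.avg.get(app)
        if prev is None:
            self.avg[app] = seconds
        else:
            self.avg[app] = (1 - self.alpha) * prev + self.alpha * seconds

    def pick_next(self, ready_apps):
        """Among apps with pending events, run the historically fastest first."""
        # Apps without history count as 0.0, i.e. fast until proven otherwise;
        # the app name is used only as a tie-breaker to keep the choice stable.
        return min(ready_apps, key=lambda app: (self.avg.get(app, 0.0), app))
```

A moving average (rather than a lifetime total) is used here so that an app which cleans up its act gets its priority back after a while; a real design would also need some protection against starvation of slow-but-legitimate apps.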
In addition, the OS SHOULD provide very clear tracking of the resources used by certain processes when they’re performing tasks on behalf of other processes. In other words, as an end-user I want to see not only that it is csrss which uses all the CPU, but also which processes are causing it to perform all that stuff.
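As a rough sketch of what such tracking could look like, here is a toy ledger in Python. It assumes that every piece of work a service performs is tagged with the process which requested it; the class `CpuLedger`, its methods, and the process names in the usage note are hypothetical.

```python
from collections import defaultdict


class CpuLedger:
    """Tracks CPU time per process, plus on whose behalf it was spent."""

    def __init__(self):
        # process -> total CPU seconds burned inside that process
        self.direct = defaultdict(float)
        # service -> requesting client -> CPU seconds spent for that client
        self.on_behalf = defaultdict(lambda: defaultdict(float))

    def charge(self, process, seconds, on_behalf_of=None):
        """Account CPU time to a process and, optionally, to its requester."""
        self.direct[process] += seconds
        if on_behalf_of is not None:
            self.on_behalf[process][on_behalf_of] += seconds

    def top_clients(self, service):
        """Clients which caused the most work in the service, biggest first."""
        clients = self.on_behalf.get(service, {})
        return sorted(clients.items(), key=lambda item: item[1], reverse=True)
```

For example, after `charge("csrss", 3.0, on_behalf_of="game.exe")` and `charge("csrss", 0.5, on_behalf_of="editor.exe")`, a call to `top_clients("csrss")` lists `game.exe` first, which is exactly the kind of answer an end-user is after.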
Desirable Improvement #7. Simplified Driver Development
Since the times of Multics, Interactive Programs were a step-child of development – and it still hurts drivers (which tend to hurt LOTS of people out there; in fact, it is drivers which are responsible for a vast majority of kernel panics/BSODs).
In particular, the following improvements would be IMNSHO desirable in this field:
- Less cryptic kernel-level APIs. In fact, <heresy>ideally I’d like to stop caring about being in kernel mode or in user mode</heresy> (sure, there are things which are not possible in Ring 3, but 99% of the time these things can be hidden behind an abstraction layer which – depending on the deployment-time decisions – either goes directly to the hardware if we’re in Ring 0, or goes the way of a microkernel).
- [already mentioned above] An ability to move any driver to user space (to isolate the problem, to debug, etc.); such an ability would greatly improve quality of life both for driver developers and for end-users.
- Direct support for purely event-driven drivers. Historically, support for Interactive Programs was usually added to OS’s as an afterthought, and programming them was traditionally ugly. Recently, for app-level there are significant improvements in this field (async frameworks are getting more and more popular every day), but they don’t cover driver development.
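The abstraction layer and the event-driven style mentioned in the list above can be illustrated with a small simulation in Python. Everything here is hypothetical: the hardware is a plain dictionary, "Ring 0" is modelled as direct access to that dictionary, and "user space" is modelled as messages passed to a privileged server function; none of the names come from a real kernel API.

```python
from abc import ABC, abstractmethod


class RegisterAccess(ABC):
    """The only interface the driver code is allowed to see."""

    @abstractmethod
    def read(self, addr): ...

    @abstractmethod
    def write(self, addr, value): ...


class DirectAccess(RegisterAccess):
    """Stands in for Ring 0: touches the (simulated) registers directly."""

    def __init__(self, device_memory):
        self.mem = device_memory

    def read(self, addr):
        return self.mem.get(addr, 0)

    def write(self, addr, value):
        self.mem[addr] = value


class MessageAccess(RegisterAccess):
    """Stands in for user space: each access is a message to a server."""

    def __init__(self, server):
        self.server = server

    def read(self, addr):
        return self.server(("read", addr))

    def write(self, addr, value):
        self.server(("write", addr, value))


def make_server(device_memory):
    """Simulated privileged side of the microkernel-style deployment."""
    def handle(message):
        if message[0] == "write":
            device_memory[message[1]] = message[2]
            return None
        return device_memory.get(message[1], 0)
    return handle


def make_access(deployment, device_memory):
    """The deployment-time 'checkbox': same driver, different plumbing."""
    if deployment == "kernel":
        return DirectAccess(device_memory)
    if deployment == "user":
        return MessageAccess(make_server(device_memory))
    raise ValueError(f"unknown deployment: {deployment}")


LED_REGISTER = 0x10  # hypothetical register address


def led_driver_on_event(hw, event):
    """Purely event-driven driver code; identical in both deployments."""
    if event == "button_pressed":
        hw.write(LED_REGISTER, hw.read(LED_REGISTER) ^ 1)
```

The point of the sketch is that `led_driver_on_event` never learns which deployment it is running in; moving the driver to user space is a change in the argument to `make_access`, not a change in the driver.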
Desirable Improvement #8. Improved Performance (both for Interactive Programming and for HPC)
- It is paramount to optimize both computing HPC-like loads, and Interactive Programs (with the latter being neglected way too often <sigh />).
- When speaking of performance, we DO need to distinguish between ‘latency’ (~= “how long a specific request takes”) and ‘throughput’ (~= “how many requests per hour a specific box can handle”).
- Battery life and CO2 footprint are closely related to throughput and are important.
- As usual, whenever performance is in conflict with some other goal (such as Security or Stability) – we want this choice to be a deployment-time decision.
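To see why latency and throughput have to be treated separately, consider a back-of-the-envelope model in Python. It assumes that all requests are already queued, that the box handles them strictly sequentially, and it ignores queueing time; the function names and all the timings in the usage note are hypothetical.

```python
def requests_per_hour(requests_done, wall_ms):
    """Throughput: how many requests the box completes per hour."""
    return requests_done * 3_600_000 / wall_ms


def one_at_a_time(n, ms_per_request):
    """Each request is processed on its own; returns (latency_ms, throughput)."""
    latency_ms = ms_per_request
    return latency_ms, requests_per_hour(n, n * ms_per_request)


def batched(n, batch_size, ms_per_batch):
    """Requests are processed in batches; returns (latency_ms, throughput)."""
    batches = -(-n // batch_size)  # ceiling division
    # A request is answered only when its whole batch is done, so the
    # processing latency of every request is that of the batch.
    latency_ms = ms_per_batch
    return latency_ms, requests_per_hour(n, batches * ms_per_batch)
```

With made-up numbers, `one_at_a_time(1000, 10)` gives 10 ms latency at 360,000 requests per hour, while `batched(1000, 10, 40)` gives 900,000 requests per hour but at 40 ms latency. In this toy model, batching is great for an HPC-like load and a 4x latency hit for an Interactive Program, which is why the two cannot be optimized with one knob.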
To Be Continued…
References
[WuEtAl] Xiao Wu, Inhee Lee, Qing Dong, Kaiyuan Yang, Dongkwun Kim, Jingcheng Wang, Yimai Peng, Yiqun Zhang, Mehdi Saligane, Makoto Yasuda, Kazuyuki Kumeno, Fumitaka Ohno, Satoru Miyoshi, Masaru Kawaminami, Dennis Sylvester, David Blaauw, “A 0.04mm^3 16nW Wireless and Batteryless Sensor System with Integrated Cortex-M0+ Processor and Optical Communication for Cellular Temperature Measurement”
[Fowler] Martin Fowler, “Eradicating Non-Determinism in Tests”
[Aldridge] David Aldridge, “I Shot You First: Networking the Gameplay of HALO: REACH”, GDC2011
[Hare] ‘No Bugs’ Hare, “The Importance of Back-of-Envelope Estimates”, Overload #137
[Wikipedia.Tragedy of Commons] “Tragedy of the commons”
[Schneier] Bruce Schneier, “Attack Trees”
Acknowledgement
Cartoons by Sergey Gordeev