Bringing Architecture of Operating Systems to XXI Century – Part II. Desirable Improvements

…sometimes I’ve believed as many as six impossible things before breakfast.

— The Queen from Alice in Wonderland —

Continued from Part I. Changes in IT Over Last 50 Years

OK, as discussed in Part I, a Damn Lot™ of things have happened in the outside world over the last 50 years, but maybe existing OS’s are already perfect, so nothing needs to be changed?

However, there are LOTS of improvements OS users (ranging from driver developers to app developers to end-users) would like operating systems to provide (well, as soon as we realize that such improvements are possible).

Desirable Improvement #1. Flexibility – Choices in Deployment-Time rather than During Development

First of all, let’s observe that some of the improvements we’ll see below might not be achievable at the same time – but we certainly DO have a very sizeable audience for each and every one of them; and whenever they might conflict to some extent (which, as we’ll see, is rarer than implied by our experience with existing OS’s) – well, it just means that certain things need to be decided OPTIONALLY – and at deployment time rather than at development time. In other words, we’d like to develop an app as a piece of code – and leave subtle details of its deployment to that-admin-who-deploys-our-app.

Indeed, even the same app (let alone the same library) can be used in tons of different deployment scenarios – and providing an ability, say, to run a DB process within the kernel to improve performance (which can fly if the DB server is a single-purpose box) – or within user space to improve stability – is a Good Thing™. Another example of a choice better made at deployment time is runtime memory protection for a certain app – which ideally should be switchable on or off depending on the balance between performance and security/stability in a particular deployment scenario; below we’ll also see other features which should be switchable on/off at deployment time.

As such Flexibility is not really the case with existing operating systems – it is one of those improvements which we’re speaking about; let’s write this improvement down as #1.

Sure, achieving it would be difficult (if at all possible) with binary code – but with Source Code being available (which wasn’t really the case when Multics or VAX kernels were designed), it becomes a viable option.
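To make the idea of deployment-time choices a bit more tangible, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `key=value` descriptor format, the `DeploymentOptions` and `Placement` names, and the two knobs themselves are invented here and do not describe any existing OS. The point is only that the app code stays the same while the admin picks the trade-off.

```python
from dataclasses import dataclass
from enum import Enum


class Placement(Enum):
    KERNEL = "kernel"    # faster, but a crash may take the whole box down
    USER_SPACE = "user"  # slower, but crashes stay isolated


@dataclass(frozen=True)
class DeploymentOptions:
    """Hypothetical per-app knobs chosen by the admin, not by the developer."""
    placement: Placement = Placement.USER_SPACE
    memory_protection: bool = True


def parse_deployment(text: str) -> DeploymentOptions:
    """Parse 'key=value' lines of an (invented) deployment descriptor."""
    raw = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, _, value = line.partition("=")
        raw[key.strip()] = value.strip()
    return DeploymentOptions(
        placement=Placement(raw.get("placement", "user")),
        memory_protection=raw.get("memory_protection", "on") == "on",
    )


# The same DB code, deployed in two different ways:
single_purpose_box = parse_deployment("placement=kernel\nmemory_protection=off")
shared_box = parse_deployment("# defaults favour stability\n")
```

Note that the defaults in this sketch lean towards stability, so an admin has to opt in to the faster-but-riskier configuration explicitly.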

Desirable Improvement #2. Same App/Driver Code from low-end MCUs to high-end CPUs

Closely related to the previous item is an ability to run the same app/driver code on a really wide range of processors, from $1 MCUs all the way to x64/SPARC/Power. Sure, I know of arguments that “this MUST be impossible” (because it is indeed impossible under existing operating systems) – but such an improvement is highly desirable. There is no doubt that it would be really nice to have one single implementation of a certain SPI/I2C/USB/… driver, or of a web server (OpenSSH, VNC server, SNMP agent, OpenVPN tunnel, etc.) app, rather than rewriting it several times. NB: here I don’t mean the niche covered by nommu Linux, which still happens to require megabytes of RAM; rather, I’m saying that low-end versions of the new OS should be lean enough to run on a ~$1 MCU (these days ~=4K RAM, 32K ROM)¹. Note that the requirement to run on smaller devices is not going to go away any time soon, due to power consumption restrictions for battery-powered devices.

IF we manage to get the majority of the same code running on both low-end MCUs and high-end CPUs, this will simplify embedded development (which is extremely important today, in particular for IoT) a lot, and will also help to avoid a lot of abominations (such as busy-loops instead of WFI), which are much more frequent in embedded development than they should be, and which cause lots of problems down the road, especially for battery-powered devices. On the other hand, quite a few CPU-oriented systems happen to be horrible resource hogs and can take a page or three from the embedded books.


¹ Making it run on the world’s smallest computer [WuEtAl], with only 4K RAM and without ROM, is going to be challenging, but given time, we can still hope.



Desirable Improvement #3. Improved App/Driver Stability. Testability. Post-mortem Debugging of Production Crashes


One obvious thing that everybody wants is improved stability of the apps and drivers running under the OS. While some people may say “hey, it is not a problem of the OS but of bad app devs” – I’d say that if there are two OS’s, everything else being equal, with one stimulating good development practices and providing tools which help to prevent app crashes, and the other one not – I’d take the first one any day of the week. Therefore, at least in my books (pun intended), improving stability of OS-related apps/drivers does qualify as an OS improvement.

Translating it into the admin/app developer world, I’d count at least the following items as very significant and highly desirable improvements:

  • Ensuring that the app is testable. Non-testable apps carry an extremely high risk of crashing once per day on a client’s box, without any chance to debug it (if you did deploy your app to a million devices, you know that feeling when after your new release 0.1% of boxes – which means 1000 clients – start to experience a crash every few hours). BTW, testability implies determinism [Fowler].
  • An ability to have a checkbox saying ‘move this driver into user space’ (and another one, saying ‘turn detection of memory bugs on’). Both options will mean a performance hit, but if an admin has a badly needed driver/app crashing, it might save everybody’s bacon in quite a few real-world cases – especially if we’re speaking about a 30% performance hit, not about a 30x one.
  • If possible at all (let’s dream on a little bit) – I’d like to be able to ask the user to turn on a checkbox which says ‘post-mortem debugging’, and then have them send me not a ‘crash dump’, but a ‘crash log’, with the last N minutes of the life of the program before the crash, replayable on my development box. BTW, in addition to pure debugging, this enables LOTS of improvements; see, for example, [Aldridge] for how such an approach was used to optimize game traffic.

Personally (as somebody who oversaw the deployment of Rather Serious Systems(tm) such as a G20 stock exchange), I see these three improvements in app/driver stability as so important that they alone justify migrating to a new OS.
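The link between determinism and a replayable ‘crash log’ can be sketched in a few lines of Python. This is a toy under stated assumptions, not a description of any real OS facility: `Counter`, `Recorder` and `replay` are names invented here, the app is assumed to change state only in response to logged input events (no clocks, no randomness, no threads), and for simplicity the log covers the whole run; a log limited to the last N minutes would additionally need a state snapshot to start the replay from.

```python
import json


class Counter:
    """Toy app whose state changes only via handle(): fully deterministic."""

    def __init__(self):
        self.total = 0

    def handle(self, event):
        if event["type"] == "add":
            self.total += event["value"]
        elif event["type"] == "div":
            self.total //= event["value"]  # raises ZeroDivisionError on 0


class Recorder:
    """Logs every input event BEFORE handing it to the app."""

    def __init__(self, app):
        self.app = app
        self.log = []

    def handle(self, event):
        self.log.append(event)  # logged first, so the fatal event is kept
        self.app.handle(event)

    def dump(self):
        return json.dumps(self.log)


def replay(crash_log):
    """Re-run a recorded log against a fresh app instance."""
    app = Counter()
    for event in json.loads(crash_log):
        app.handle(event)
    return app


# "Production": the last event crashes the app; the log is what gets sent.
recorder = Recorder(Counter())
crash_log = None
try:
    for ev in ({"type": "add", "value": 10}, {"type": "div", "value": 0}):
        recorder.handle(ev)
except ZeroDivisionError:
    crash_log = recorder.dump()
```

On the development box, calling `replay(crash_log)` hits the very same exception at the very same event, which is exactly what a plain crash dump cannot offer.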

Desirable Improvement #4. Built-In Fault Tolerance and Scalability to Multiple Boxes


It is a pity that in the XXI century we still need 3rd-party stuff to make our apps fault-tolerant (tolerant to hardware malfunctions, that is); this is not to mention that lots of these 3rd-party fault-tolerant mechanisms are themselves faulty and can easily decrease MTBF [Hare]. OTOH, good fault-tolerance designs DO exist – and IMNSHO they should be a part of a standard OS deployment.
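How a faulty fault-tolerance mechanism can decrease MTBF is easy to see with a back-of-envelope calculation. The Python sketch below is only an illustration under loudly stated assumptions: failures are independent, the redundant pair follows the textbook approximation MTBF ≈ 1 / (2·λ²·MTTR) (reasonable when repair time is much shorter than box MTBF), any failure of the failover mechanism takes the whole system down, and all the figures (5-year boxes, 24-hour repairs, a failover layer failing once a year) are made up.

```python
HOURS_PER_YEAR = 8760.0


def pair_mtbf(box_mtbf_h, repair_h):
    """Approximate MTBF of a redundant, repairable pair of boxes.

    Uses 1 / (2 * lambda^2 * MTTR), which assumes independent failures
    and a repair time much shorter than the MTBF of a single box.
    """
    rate = 1.0 / box_mtbf_h
    return 1.0 / (2.0 * rate * rate * repair_h)


def cluster_mtbf(box_mtbf_h, repair_h, failover_mtbf_h):
    """Pair plus a failover mechanism whose own failure kills the system."""
    # Failure rates of components "in series" simply add up.
    total_rate = 1.0 / pair_mtbf(box_mtbf_h, repair_h) + 1.0 / failover_mtbf_h
    return 1.0 / total_rate


box = 5 * HOURS_PER_YEAR  # made-up MTBF of one box
flaky = cluster_mtbf(box, repair_h=24.0, failover_mtbf_h=1 * HOURS_PER_YEAR)
ideal = cluster_mtbf(box, repair_h=24.0, failover_mtbf_h=float("inf"))
```

With these numbers, `ideal` is far above the MTBF of a single box, while `flaky` ends up below it: the cluster is only as good as its least reliable series component.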

The same goes for the ability to scale a single app – whether an interactive one or an HPC one – onto multiple boxes (this is the classical task of Load Balancing/HPC Scheduling, but for quite a few reasons it is not a part of existing OS designs).

Desirable Improvement #5. Improved Security (wherever desirable)


There is little doubt that security is one Big Fat Problem™ with modern computer systems, and that it should be improved wherever feasible. OTOH, whenever there is a performance cost of improved security (BTW, as we’ll see below, not all the security improvements carry runtime performance costs) – then according to the Flexibility improvement described above, we DO want to have a deployment-time decision on “what do we prefer on this particular box – a bit of security or a bit of performance?”²


² To security purists who deny such a heretical choice outright – let’s not forget that the goal of any security is defined as increasing the cost of breaking in above a certain pre-defined level – usually a multiple of the cost of the loss in case of a break-in; this is known as Cost-Benefit Analysis, often used as a part of Risk Analysis. As Bruce Schneier Himself once said: “Figure 5 shows all attacks that cost less than $100,000. If you are only concerned with attacks that are less expensive (maybe the contents of the safe are only worth $100,000), then you should only concern yourself with those attacks.”[Schneier]
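The cost-benefit reasoning from this footnote boils down to a simple filter, sketched here in Python. The attack names and dollar figures are made up purely for illustration (they are not taken from [Schneier]), and `attacks_worth_defending` and its `multiple` parameter are hypothetical names introduced for this sketch.

```python
from typing import NamedTuple


class Attack(NamedTuple):
    name: str
    cost_usd: int  # estimated cost for the attacker to pull it off


def attacks_worth_defending(attacks, asset_value_usd, multiple=1.0):
    """Keep only attacks cheaper than (asset value x chosen multiple)."""
    threshold = asset_value_usd * multiple
    return [a for a in attacks if a.cost_usd < threshold]


# Made-up attacks and costs, for illustration only.
attacks = [
    Attack("guess default password", 100),
    Attack("bribe an insider", 20_000),
    Attack("custom zero-day exploit", 500_000),
]
concerning = attacks_worth_defending(attacks, asset_value_usd=100_000)
```

For a box protecting $100,000 worth of data, only the first two attacks make the list; raising `multiple` is the knob for a more paranoid deployment.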



Desirable Improvement #6. Trying to address Tragedy of the Commons

One issue which would be very nice to address is the Tragedy of the Commons as applied to apps on the same box. In existing OS’s, whenever many apps are running, there is very little incentive for developers of any specific app to cap their resource use. In reality, each and every app running on my desktop/phone/… thinks that it is the only one running – and is very eager to use all the available RAM, all CPU cores, etc. – which is a typical [Wikipedia.Tragedy of Commons] scenario. It leads to a situation where all the apps (with very few exceptions) tend to be resource hogs, which wastes lots of resources (and we may even be already within Akerlof’s “market for lemons” in this regard – though this requires separate analysis); this, among other things, leads to a worse overall experience for end-users, and to shorter battery life – and also to a larger CO2 footprint.

IF we could create an incentive for developers to use as few resources as possible – it would mitigate the problem at least to some extent. As one example of such an incentive, we could say that those apps which take more time to process their events will have lower priority compared to those apps which take less time (which, in turn, will make faster apps more responsive not only directly, but also indirectly via prioritizing them); this is expected to create at least some reason for apps to be less resource-hungry (in a way somewhat similar to Google policies on website access speed affecting ranking, creating a strong incentive to make more responsive sites). All other suggestions in this regard are very welcome too.
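One very rough way to picture such an incentive is the Python sketch below. It is just one possible reading of the idea, with invented names (`AppStats`, `dispatch_order`) and invented timings; in particular, using a plain running mean of event-handling time as the priority key is an assumption of this sketch, and a real scheduler would also need some protection against starving the slow apps completely.

```python
class AppStats:
    """Running average of how long one app takes to handle an event."""

    def __init__(self, name):
        self.name = name
        self.events = 0
        self.avg_ms = 0.0

    def record(self, handling_ms):
        self.events += 1
        # Incremental mean: no need to store every sample.
        self.avg_ms += (handling_ms - self.avg_ms) / self.events


def dispatch_order(apps):
    """Apps with cheaper event handlers get served first."""
    return sorted(apps, key=lambda app: app.avg_ms)


lean, hog = AppStats("lean_editor"), AppStats("hog_browser")
for ms in (1.0, 3.0):
    lean.record(ms)
for ms in (40.0, 60.0):
    hog.record(ms)
order = [app.name for app in dispatch_order([hog, lean])]
```

Here the lean app averages 2 ms per event against 50 ms for the hog, so it is dispatched first; the developer of the hog now has a visible reason to slim down.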

In addition, the OS SHOULD provide very clear tracking of the resources used by certain processes when they’re performing tasks on behalf of other processes. In other words, as an end-user I want to see not only that it is csrss which uses all the CPU, but also which processes are causing it to perform all that stuff.
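Such on-behalf-of tracking could be as simple as keeping a ledger keyed by (worker, requester) pairs. The Python sketch below assumes that every request to a service process carries the identity of the process which caused it; `CpuLedger`, the app names and the millisecond figures are all invented for illustration.

```python
from collections import defaultdict


class CpuLedger:
    """Attributes CPU time spent by a worker to whoever asked for the work."""

    def __init__(self):
        self._ms = defaultdict(float)  # (worker, requester) -> CPU ms

    def charge(self, worker, on_behalf_of, cpu_ms):
        self._ms[(worker, on_behalf_of)] += cpu_ms

    def total_for(self, worker):
        return sum(ms for (w, _), ms in self._ms.items() if w == worker)

    def breakdown(self, worker):
        """Who caused this worker's CPU use, biggest offender first."""
        rows = [(req, ms) for (w, req), ms in self._ms.items() if w == worker]
        return sorted(rows, key=lambda row: row[1], reverse=True)


ledger = CpuLedger()
ledger.charge("csrss", on_behalf_of="chatty_app", cpu_ms=700.0)
ledger.charge("csrss", on_behalf_of="quiet_app", cpu_ms=50.0)
ledger.charge("csrss", on_behalf_of="chatty_app", cpu_ms=250.0)
```

With this, a task manager can show not just `ledger.total_for("csrss")` but also `ledger.breakdown("csrss")`, which points straight at the process responsible.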

Desirable Improvement #7. Simplified Driver Development


Since the times of Multics, Interactive Programs have been a stepchild of development – and it still hurts drivers (which tend to hurt LOTS of people out there; in fact, it is drivers which are responsible for a vast majority of kernel panics/BSODs).

In particular, the following improvements would be IMNSHO desirable in this field:

  • Less cryptic kernel-level APIs. In fact, <heresy>ideally I’d like to stop caring about being in kernel mode or in user mode</heresy> (sure, there are things which are not possible in Ring 3, but 99% of the time these things can be hidden behind an abstraction layer which – depending on the deployment-time decisions – either goes directly to the hardware if we’re in Ring 0, or goes the way of a microkernel).
  • [already mentioned above] An ability to move any driver to user space (to isolate the problem, to debug, etc.); such an ability would greatly improve quality of life both for driver developers and for end-users.
  • Direct support for purely event-driven drivers. Historically, support for Interactive Programs was usually added to OS’s as an afterthought, and programming them was traditionally ugly. Recently, for app-level there are significant improvements in this field (async frameworks are getting more and more popular every day), but they don’t cover driver development.
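The first and the last items of this list can be illustrated together with a small Python sketch. It is a simulation under stated assumptions, not real driver code: registers live in a plain dict, `DirectAccess` merely stands in for the Ring 0 path and `MessageAccess` for the microkernel-style path, and the driver, its register address and the `button_pressed` event are all invented. What matters is that `LedDriver` is purely event-driven and has no idea which path it was deployed with.

```python
from abc import ABC, abstractmethod


class RegisterAccess(ABC):
    """Abstraction layer: the driver never knows where it is running."""

    @abstractmethod
    def read(self, addr): ...

    @abstractmethod
    def write(self, addr, value): ...


class DirectAccess(RegisterAccess):
    """Stands in for Ring 0: touches the (simulated) registers directly."""

    def __init__(self, registers):
        self.registers = registers

    def read(self, addr):
        return self.registers.get(addr, 0)

    def write(self, addr, value):
        self.registers[addr] = value


class MessageAccess(RegisterAccess):
    """Stands in for the microkernel way: each access becomes a message."""

    def __init__(self, registers):
        self.registers = registers
        self.sent = []  # messages that would cross the process boundary

    def read(self, addr):
        self.sent.append(("read", addr))
        return self.registers.get(addr, 0)

    def write(self, addr, value):
        self.sent.append(("write", addr, value))
        self.registers[addr] = value


class LedDriver:
    """Purely event-driven: no threads, no blocking calls, just on_event()."""

    LED_REG = 0x10  # made-up register address

    def __init__(self, hw):
        self.hw = hw

    def on_event(self, event):
        if event == "button_pressed":
            self.hw.write(self.LED_REG, self.hw.read(self.LED_REG) ^ 1)


def run(access_cls, events):
    """Deployment-time choice: the same driver code with either access path."""
    registers = {}
    driver = LedDriver(access_cls(registers))
    for event in events:
        driver.on_event(event)
    return registers
```

Running the same three button presses through `run(DirectAccess, ...)` and `run(MessageAccess, ...)` leaves the simulated hardware in the same state; only the cost of getting there differs.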

Desirable Improvement #8. Improved Performance (both for Interactive Programming and for HPC)

Improved performance is always desirable; however, there are a few things to note here:

  • It is paramount to optimize both computing HPC-like loads, and Interactive Programs (with the latter being neglected way too often <sigh />).
  • When speaking of performance, we DO need to distinguish between ‘latency’ (~= “how long a specific request takes”) and ‘throughput’ (~= “how many requests per hour a specific box can handle”).
  • Battery life and CO2 footprint are closely related to throughput and are important.
  • As usual, whenever performance is in conflict with some other goal (such as Security or Stability) – we want this choice to be a deployment-time decision.
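The difference between latency and throughput is easy to blur, so here is a tiny Python illustration. Requests are modelled as made-up (start, end) timestamps in seconds, and the two helper functions are invented for this sketch; throughput is measured per second rather than per hour just to keep the numbers small.

```python
def mean_latency_s(requests):
    """Average time a single request takes, from start to end."""
    return sum(end - start for start, end in requests) / len(requests)


def throughput_per_s(requests):
    """Requests completed per second over the whole observed window."""
    first_start = min(start for start, _ in requests)
    last_end = max(end for _, end in requests)
    return len(requests) / (last_end - first_start)


# Two made-up boxes, each finishing 4 requests within 2 seconds:
serial = [(0.0, 0.5), (0.5, 1.0), (1.0, 1.5), (1.5, 2.0)]  # one at a time
parallel = [(0.0, 2.0)] * 4                                # all at once
```

Both boxes have the same throughput (2 requests per second), but the mean latency is 0.5 s on the first one and 2 s on the second. For an Interactive Program it is the latter number which the user feels.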

To Be Continued…

As we can see, not only has a Damn Lot™ of improvements happened in IT since the designs of currently existing OS’s were conceived, but <surprise! /> there is still room for improvement in existing OS’s. In Part III, we’ll try to see whether we can use those IT-improvements-over-last-50-years to improve quality of life for OS users (from developers to end-users).


References



[WuEtAl] Xiao Wu, Inhee Lee, Qing Dong, Kaiyuan Yang, Dongkwun Kim, Jingcheng Wang, Yimai Peng, Yiqun Zhang, Mehdi Saligane, Makoto Yasuda, Kazuyuki Kumeno, Fumitaka Ohno, Satoru Miyoshi, Masaru Kawaminami, Dennis Sylvester, David Blaauw, “A 0.04mm^3 16nW Wireless and Batteryless Sensor System with Integrated Cortex-M0+ Processor and Optical Communication for Cellular Temperature Measurement”

[Fowler] Martin Fowler, “Eradicating Non-Determinism in Tests”

[Aldridge] David Aldridge, “I Shot You First: Networking the Gameplay of HALO: REACH”, GDC2011

[Hare] ‘No Bugs’ Hare, “The Importance of Back-of-Envelope Estimates”, Overload #137

[Wikipedia.Tragedy of Commons] “Tragedy of the commons”

[Schneier] Bruce Schneier, “Attack Trees”


Acknowledgement


Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.
