Design and build the space shuttle

In this post, I write about my experience in programming the "space shuttle" which, in the end, is a Bootloader. I think that "space shuttle" is not the best Bootloader but this can be considered just as a Bootloader example. At the same time, I don't think that you can expect this post to be a general Bootloader programming tutorial, but you can look at it as one possible way how to write a bootloader. But what is a Bootloader? In very simple terms, it is a software that is stored on the Bootsector (or Partition Boot Record) that gets loaded into RAM during the Boot process and runs. In the end, this Bootloader will load much more software after itself which is the rest of the Operating System.

This will be the first software for something useful that I do because all that has been done until now was just a contribution to build up the know-how. Well before build up the space shuttle, I have to make a project. Before that, I have to list the requirements of the features and capabilities that my "space shuttle" has to have and I have to consider the constraints that will inevitably limit the design1. I will proceed with order and preciselly I will follow these steps:

The constraints

First

The address space of 8086 is organized as illustrated in Fig. A.

Real mode memory organization
Fig. A - Real mode memory organization

Note that on the right part of Fig. A created an alternative view of this space counted in blocks rather than bytes. The point is that LBA transfers blocks into RAM so I want to know how much "block space" rather than "byte space" is available and where. The decimal notation is always right, but the hexadecimal is wrong every time the division by 512 gives a rest. The RAM space I can use goes from 0x 00 05 00 to 0x 09 FF FF, with the Partition Boot Record placed by the GRUB at address 0x 7C 00.

Second

The maximum space available that I have for "space shuttle" considering everything (code, BPB, strings, etc) is 512 Bytes. This is a constrain that I made me myself in the sense that I could, if I would, use the reserved block and have more space available but this would defer my learning experience. What I mean is that, sooner or later, I will hit some space limits with the code increasing in size due to complexity. If I don't set this constrain now, probably I will be able to fit the code in 1024 Bytes very easily, but I won't learn how to squeeze things when necessary. Therefore, I set a limit of 512 Bytes by myself in order to be forced to squeeze as much as possible and probably learn by this experience.

Third

The shuttle will appear in RAM at address 0x 7C 00. All other software I have until now (AP-LAND, AP-PROBE, etc.) are designed to work loaded in memory at the same start address. Therefore, I have an address conflict here.

The requirements

First

"Space Shuttle" must be able to locate a file in the ROOT directory of the volume formatted with FAT16, load it into RAM and pass control to it.

Second

"Space Shuttle" must feedback the user with string messages on the video about what is going on (problems or successful completion).

The design implementations

What is this? Is it a requirement? Is it a constraint? Well, in my opinion, it is a mix of the two things. I considered the requirements and the constraints and I realized that there was something more that I had to take into account. In the end, the "design implementations" are something that I define by myself whose effects are similar to the constraints. In addition to that, the design implementations also solves conflicts between requirement in contradictions or not compatibles with constraints by redefining either the constraints or the requirements or both in a new way and explaining why it has to be so.

First: relocation of code

Considering that I have the constraint that any software I load in RAM wants to start at the same location where my "Space Shuttle" is located then I have to relocate the "Space Shuttle" in order to free the RAM space for the code that is going to be loaded next. This is not the only solution available. I could have redesigned all the previous software to be able to work starting at 0x 7E 00 or somewhere else. This brings me to the next consideration. Suppose that I am able to build the "Space Shuttle", after that I have to define the specification of the software that the "Space Shuttle" can handle. For instance, I should mention that the "Space Shuttle compatible" software is coded in such a way that it is located in RAM starting at address 0x 7C 00 and have the code entry point in the same place. In theory, I can do whatever I want as long as I say which is the compatibility standard for it.

Second: maximum size of file loadable into RAM

The RAM organization for our shuttle will look like the one in Fig. B.

RAM organization of the Space Shuttle
Fig. B - RAM organization of the "Space Shuttle"

From 0x0 to 0x500 is occupied with important information of the PC compatible standard as well as from 0x0A 00 00 to 0x0F FF FF. Now considering that I have to relocate the code, I decided to place it one block before the Partition Boot Record location in RAM i.e. starting at 0x 7A 00. In this way I cut the RAM in two regions: the first region comprises all that is before the relocated code (I would like to use it as working RAM); the second region comprises all that is after the relocated code. In this way, I have 623616 Bytes of free space (0x09 FF FF – 0x 7C 00 + 1). The maximum size of the file loadable by the "Space Shuttle" could be as big as this, but, in the end, I decided to limit it to something smaller. To understand why I took this decision I have to explain a little bit. 623616 Bytes of RAM is equivalent to 1218 LBA blocks (= 623616/512) and is equivalent to 304,5 clusters (= 1218/4 considering that each cluster is equivalent to 4 LBA blocks on my HDD). So if I want to use all these Bytes available I have to consider in my software a counter for the LBA blocks that are currently loaded in memory by each INT 13H request and once loading the last crunch of LBA block from the HDD I have to calculate a value of blocks that fits exactly the remaining memory space. It is of course technically possible to write such a code, the only disadvantage is that it will take quite a number of Bytes and I have space limitation for the "Space Shuttle" (it must fit the 512 Bytes all together). If I reduce a little bit the maximum size of the file that the "Space Shuttle" can load I could write the code in a much shorter form and this will help me. Which is the smart size I am looking at? To find it out, I considered another portion of RAM, the one that is before the relocated shuttle code and that I plan to dedicate as private use of the code. In this area, I need some space for the stack, some space for variables, constants and a buffer area where to store temporary the crunches of HDD containing the ROOT and the FAT that I need to scan to be able to locate and load my target file into RAM. So I started to find the maximum common divisor for the two areas and I found that 38 LBA blocks is a good number to read crunches of HDD by each INT 13H request. In this way I can hardcode this number in my code, making it as compact as possible. So, in the end, the maximum size of the file that can be loaded with the "Space Shuttle" is 622592 Bytes (0x09 FB FF – 0x 7C 00 + 1 to be aligned with 38 LBA boundaries). I think that this is a good example of a design implementation which isn't a constraint by itself but results having similar effects since now I am constrained to load files that are not bigger than that.
To be honest I have to say that the "design implementations" are feedbacks coming from bottom-up coding attempts. In the beginning, I really didn't want to have more constraints than the necessary ones so I tried to code the piece of routine that had to take care of monitoring the RAM during the filling process and change the size of the LBA read crunches in order to fully use the free space available. Then I realized that I was running out of space to code the "Space Shuttle" and I wanted absolutely to stay within the 512 Bytes (this is a learning target for me), so I came up with this idea of standardizing as much as possible the LBA address packet and hard code the LBA read crunches in such a way that I didn't have to calculate and change them at run time. In this way, I could save a lot of Bytes of code. So here you see another characteristic of the "design implementations": they are feedbacks from each and every attempt of coding. At the end, when the "Space Shuttle" will hopefully run I can look back again here and read all the points like this one, that I had to add in order to make it. This is automatically the list of limitation that the "Space Shuttle" software will have and, at the same time, the list of the compatibility specification. For instance, here I have another one: the file loadable by the "Space Shuttle" cannot be bigger than 622592 Bytes.

Third: video messages

One requirement is: the "Space Shuttle" must provide useful video message to prompt the user about what is going on. There is only one thing that can be ok (the file is loaded and ready for control transfer) but there are many things that can be wrong. Before analysing all the things that can be wrong, let me consider the successful termination of the program. If successful, the program jumps to 0x0000:7C00 and starts executing whatever is there, but before doing so I want the program to prompt on screen that it has correctly done its job. Imagine the following scenario: you start the PC and it immediately crashes. How can you distinguish if it was the "Space Shuttle" to crash or the software it has loaded and passed control to? So I came out with the idea to prompt a message about successful completion and wait for a keypress before passing control. In this way, I could distinguish which code is causing the crash. Giving the space constraint and the fact that each character on the strings occupies one Byte, messages have to be as short as possible and still meaningful. I decided to have just "Loaded!" as a successful string and to avoid any further string such as "press any key to continue..." simply because I don't have space for it. I think that every user will be a little bit disappointed by just reading the message "Loaded!" with nothing more but, one second after that, the user will question himself "ok loaded and now what happens?", and the second immediately after "... let me try to press a key to see what happens". I decided to rely on the natural curiosity of human being for my software to work. But what if, in some years from now, we have some artificial intelligence that is not stimulated by curiosity? The hypothetic robot will keep waiting, doing nothing since it is not written on the screen that it has to press a key. I know that this is provocative, but it is just to think a little bit out of the box.

The things that can go wrong are many:

  1. INT 13H read error (CF=set, AH=error code)
  2. File not found in the ROOT directory
  3. File is too big
  4. File is corrupted (cannot be read)
  5. ROOT is corrupted (cannot be read)
  6. All copies of FAT are corrupted (cannot be read)

It is a lot to distinguish and a lot of messages to prompt and consequently to keep saved inside the 512 Bytes of overall space. I started considering the INT 13H read errors. I could use the same technique used with stack_to_string i.e. to have a pre-formatted string such as "INT 13H error code: xx", convert the error Byte in ASCII code at run time using nibble_to_ASCII function, substitute the "xx" with the error code and then print the string on video. All this is proven to work since it has been done with stack_to_string, but I don't have space to implement it. I decided that it won't be too bad to give up on this message since if a read error occurs, it is either during reading the ROOT or during reading the FAT or during reading the file itself. The perfect solution would be messages somehow like these:

"Error reading ROOT"
"INT 13H error code: xx"
"Press any key to reboot..."

but that is a lot of text (Byte space), so if I give up with the string telling back which INT 13H error is, I can save a lot of space and code (nibble_to_ASCII, etc.) at the same time. In the end, the user still knows which kind of problem occurred (read error of ROOT, or FAT or File), he/she only miss the detail about the error code. Concerning the remaining error scenarios, I will try to implement all message strings.

Fourth: FAT16, only short names, file in ROOT directory

I know how to handle FAT16. In theory, I can find any file at any location on the disk but I have to store on the Partition Boot Record also the information concerning the path and the file name to search. The longer the path is, the more space I have to dedicate for it and the bigger will get the code that searches inside a subdirectory2. I thought that if I constraint the capability of the "Space shuttle" to load only files that are located in the ROOT directory, I increase my chances of fitting all the stuff inside the Partition Boot Record. For the same reason, I decided to limit the file name at 8.3 format. Even if the long file name is compatible with FAT16, the longer the name to search is, the more Bytes it occupies, and if I have to code an algorithm for that complication I reduce the chances to fit everything inside 512 Bytes of space. Now that I fixed this aspect of the design implementation (the "Space Shuttle" will search files only in ROOT directory with 8.3 format), I just have to define the name of the file to search. You can give any name you want for it and feel free to choose one that you like, but in my case, I decided to call it "SOFTWARE.BIG". It fits the 8.3 format, it tells that this file is a software (program or application call it as you like) and that it can be relatively big. Of course, you may observe that this is a binary file so one may decide to use ".BIN" instead. Again you can use the file name you like as long as it fits 8.3 format.

Fifth: Shuttle cannot be copied but must be generated at installation time

This design implementation is a very important one that I would like you to keep in mind for the future (just in case). If you just consider the requirements, the constraints and the design implementations, you may suppose that once the "Space Shuttle" is ready you can just copy it on every FAT16 volume and it will work fine. But why you cannot just copy the "Space Shuttle" on any FAT16 volume and run it? Well, I haven’t presented any line of code yet, and I can tell you part of the story (once again remember that design implementation is the place for the feedbacks from the bottom-up phase). At the beginning, I really had the idea that the code of the "Space Shuttle" has to read the information stored in the BIOS Parameter Block and calculate from them all the addresses required to locate the beginning of ROOT, FAT and file to work with them. After some attempts during the bottom-up trials, I realized that the Bytes of code necessary to transfer the data from BIOS Parameter Block to register, then do the algebra in CPU and finally store the result in the LBA address packet, were getting too much. I tried, as much as I could, to squeeze those Bytes of code down but still, I didn't succeed. And this teaches a lesson I am glad to share with you: one has to compromise between requirements. My solution was to calculate the values offline on a piece of paper on my desk and to save them in a pre-formatted, partially pre-filled LBA address packet. I could have done it in the other way if I gave up on the 512 Bytes constraints and used the reserved block, but I didn't want to give up on that so I had to accept the fact that the "Space Shuttle" is not a portable software. Another observation is that one places the bootloader at installation time in the Partition Boot Record, so the "Space Shuttle" may become virtually portable on any FAT volume if the installation routine doesn't just copy the code "as is" but it also takes care to calculate and update the pre-formatted and pre-filled LBA address packet based on the real parameters of the partition.

Sixth: tools available for coding

You may not think that this is a design implementation that has an impact on the final code. Indeed this is the point that has the biggest impact on the code. For instance, if you code in C/C++ or directly in assembly the binary of the code will look different even if it implements exactly the same flow chart3. In my case, I don't have any assembler except DEBUG.EXE and any other program except the Microsoft Office. So, what I did was to make a flow chart using PowerPoint (this was necessary to clarify the flow of the code) then I used DEBUG.EXE to code and test the small blocks of code. After that, I copied the binary in Excel, one Byte in every cell, so I could move the Bytes around "with my hands" by selecting them and drag-dropping with the mouse. It is maybe really a stupid way but it was so powerful indeed. The only thing one have to take care of is referencing within the code (do you still remember the problem shown in Fig. F of post "AP-PROBE"?). Here I learned an easy and stupid technique that helped me with that: it consists of just code every jump with the immediately following instruction. The advantage is that I could test in DEBUG.EXE every module standalone without needs that the jump target module was present; I saw if the jump occurred just checking the flags; It was easy (or let me say not so hard) to identify the jumps in the code because I looked at the Bytes "0x00" (or "0x0000") to find those jumps out (I think it will be clearer with the screenshots few lines ahead in this post).

The Engineering phase

It was a lot of stuff I had to consider before even build up the first project and all of that just for a relatively simple task. Now I have to admit that during my experimenting with the bottom-up phase there was a point I wasn't making any progress further: this was the indicator that it was time to move to the top-down method. The first PowerPoint flow chart looks like the one of Fig. C and Fig. D.

Space Shuttle basic flow chart (part 1)
Fig. C - "Space Shuttle" basic flow chart (part 1)


Space Shuttle basic flow chart (part 2)
Fig. D - "Space Shuttle" basic flow chart (part 2)

You can see that at the point when I did the flow chart, the decision of following only FAT16 and file location in ROOT was taken. Suppose that the algorithm should work with FAT32 then first you have to find the ROOT on the disk and after that, you can load it in RAM. On the present flowchart, I immediately wanted to load ROOT into RAM implicitly meaning two things: first, it is FAT16 and second, the file is located in ROOT directory. We talked about these elements already in the design implementations paragraph. Please observe also that this logic would also work fine with the long file name, the only thing is that the code inside the block "Compare elements:" in Fig. C has to be coded differently in the two cases. As I said, space consideration about the length of the code and the comparison string pushed me to decide for 8.3 format. Another thing that I want you to observe is the original structure I thought for the messages. The Idea was that every error messages would then land to a standard exit procedure whose task was to prompt the message "press any key to reboot...", wait for a keypress and finally reboot (this is where the continuation symbols 1, 3, 4, 5 and 6 in Fig. D jump to). For the block "Transfer control" I planned something like "File loaded onto RAM successfully" and one line below "Press any key to transfer control...". At that point, I wasn't sure if I could fit all the stuff in the 512 Bytes but at least I wanted to try. As you can see at this stage, I planned to have two slightly different "Press any key" kind of strings which means they were not inter-exchangeable. Towards the end, I had a working core algorithm that was squeezed as much as I could so I had to cut the length of the strings. In this way "File loaded onto RAM successfully" became just "Loaded!", the "Press any key to transfer control..." was abandoned (as already explained) and the error message "Press any key to reboot..." became just "Press any key...". I realize only now during the writing of the diary that thanks to all these changes the "Press any key..." string is now generally valid and I could have placed it also below the message "Loaded!". If I want to implement it like this now, I have to change the logic (the flow chart) to implement it. For instance, I could append the sting "Press any key..." by each message every time I call the function "show_string", then the module "Wait for keypress and reboot:" doesn't have to print anything on the screen but just wait and reboot (I mean just INT16 and INT19). Well, I won't do this change now. But this is a potential that you can decide to exploit if you feel like to make a bit of training by yourself.

A big problem I had is that assembly is quite a chaotic language with all its positive aspects (chaos is very powerful, life and creation itself come from chaos) and negative ones (it is really easy to make mistakes). And now it gets a little bit philosophic: It is so powerful because it lets the programmer free of doing whatever he/she wants, but the higher the freedom the bigger the responsibility of the action taken. That is exactly the same as in real life and can be applied to everything. Think about it: one can judge the amount of power he/she has by the number of things he/she can do without restriction and consequently if one can do all kind of things, the only thing that prevents he/she from doing the very bad and horrible ones is the sense of responsibility in using that power. In human history, we have learned that it is bad to trust in the sense of responsibility and let one, or few people, all the power to do what they want (absolute monarchy or dictatorship). Better is to limit the power one has, implying by that, limiting the freedom of action (in practice it is achieved by cutting the power into pieces and trying balancing them: government, parliament, the court of justice, etc.). But what all this has to do with assembly? Well, you have the freedom (and the power that comes with that) to do all that you want together with all the correlated risks. Example, I am currently reading about how to program the CRT controller (I am still far from being able to do it), but I read a disclaimer that warned me about the risk that bad setting of parameters can cause the controller to be PHYSICALLY damaged. The same was valid in the past with HDD. There were viruses programmed intentionally to damage the disk. So you see "the dark side of the force". But another, somehow positive, example could be the following: who said that floppy cannot store more than 1,44 MB? Well, it had been verified that this data storage density was a very good compromise between density and data integrity reliability. But if one was ready to take some small risks he/she could have reprogrammed the floppy drive to be capable of more than 1,44MB. High-level programming languages limit the freedom you have, but they are less bug-prone compared to assembly. The big advantage of the high-level programming language is the control of the flow. Such languages have some graphical aids such as the use of parenthesis { } and the programmer can visualize the nesting of the blocks using the tabulation in the text of the code. I have tried to do this attempting to a kind of graphical structure in the assembly that was somehow similar to the one in high-level programming language and it can sometimes really help. But the big problem with assembly is that there is no standard flow structure. If you take your time to analyze the control flow structure of any high-level language you will see that they end in one of the three basic structures of Fig. E.

Standard structure of control flows
Fig. E - Standard structure of control flows

The SELECT-CASE is realized arranging a cascade if IF-THEN-ELSE statements in a special way. FOR-NEXT is realized with a loop structure that performs the check at the beginning of the loop and so on. The high-level program languages limit the freedom and power of the programmer for the good reason that in this way one has only one entry and one exit point, keeping the flow under control. Assembly doesn't offer the user such facilities simply because they don't exist from the perspective of the CPU. All what the CPU understands is to do some algebra with the IP register. In 90% of the cases, it is just adding the number of Bytes used by the current instruction to point the next one, and in other cases, it is about adding or subtracting a constant from the IP thus creating the effect what we perceive from outside (from the perspective of the programmer) of a jump. It is very simple and though a very powerful mechanism. One has the potential to design a loop of instruction to serve a special purpose and then depending on the slightly different needs one could enter the loop at different entry points just in the middle of the loop instead of calling it as a function and entering it from the beginning. At the same time, one can set along the path of the loop some conditional jumps that exit from different places of the loop rather than the main check of the loop. In the end, we have multiple entries combined with multiple exit conditions which make the control of the flow impossible as done by high-level languages. Whether or not there are real cases which this technique can really bring an advantage is not the question here. The point is that it is possible indeed and the written form of the assembly language cannot represent the complexity of the flow accurately. If you take your time and observe the flow chart of Fig. C and Fig. D you will see that they don't obey to the rules of the canonical or standard control flow of high-level languages. If you want to code it in C or Java you have first to reorganize your flow. Of course, it is possible to do that, but the implementation of a flow chart that follows the rigid rules of the high-level languages has a bigger binary code has a result. Once again freedom gives you power but it sets problems on your path if you don't use it carefully. I do insist so much with this concept because I had a lot of trouble trying coding in written form my program. I found no way I could follow such a complex flow using comments and tabulation. Of course, I did it to some extent but it wasn't really representing what was going on in the flow of the code, and I kept doing mistakes in the realization of the software. The real breakthrough came once I started writing my code directly in the PowerPoint flow chart. Only then, I started to make progress again.

The Building phase

Space Shuttle flow chart with code (part 1)
Fig. F - "Space Shuttle" flow chart with code (part 1)


Space Shuttle flow chart with code (part 2)
Fig. G - "Space Shuttle" flow chart with code (part 2)

In Fig. F and Fig. G you see the assembly code (well almost 100% of it) of the "Space Shuttle" already in the flow chart form. If you compare these pictures with the one in Fig. C and Fig. D you will notice that they are a little bit different. Can you tell why?

Space Shuttle code in RAM
Fig. H - "Space Shuttle" code in RAM

Fig. H looks like pretty much chaotic (it is in a way) but this is the best I could come up with for building up this piece of software. As you see, I have all the little segments of code each one doing a small part of the job. I gave them different colours and I marked with a box the points where I had to fix the jumps. Then I kept drug and drop the pieces of code and data until I could fit them all in the 512 Bytes of the Partition Boot Record. You have to imagine it as a kind of TETRIS where I had to match the boundaries of every portion of the code, but I could also change a little bit the shape by changing the code itself thus changing the number of Bytes required for each "brick" of the TETRIS. I hope you can figure out what sort of game I was playing to fit all inside 512 Bytes. Let me try to explain what you see here. The Partition Boot Record is represented as it should appear in memory after the relocation. I did so because I wanted to be sure about the location to point strings, LBA address packet, variables, etc. and read the value immediately rather than doing some algebra. Moreover, since I was keeping shifting things all-around to solve my TETRIS, those locations were constantly changing.
Starting at 0x7C00 you see the Partition Boot Record in its original position as loaded by the BIOS (I marked it with a raster of transversal lines), then I relocated it starting at 0x7A00. The relocation code (marked in light green) is responsible for it. At address 0x7BFE there was the boot signature which has been overwritten with 0x0000 and is now part of one of the two pre-formatted, pre-filled LBA address packets.
In general, I tried to follow this rule to organize my code: going from the higher address to lower (from bottom to top) I piled up the data; after that, I piled up the service routines. From the opposite side, growing from low to high addresses (from the top downwards in the picture), I placed the code. The two areas grow one towards the other and, in the end, they left only two Bytes of space unused at addresses 0x7B5E.
At address 0x7BF0 (at the bottom of the relocated code) you see the first pre-formatted, pre-filled LBA address packet. You know already the format of this packet (see Fig. A of the post "Read the first block of HDD") so you see that the number of blocks (0x0038), the store address in RAM (0x0000:0500) and the beginning of ROOT in HDD (0x0000 0000 0232 7C00) is already hardcoded. So everything is set for the first read operation i.e. transfer the ROOT in RAM. Of course, I can save a lot of Bytes in this way but the drawback is that the code is not portable from FAT volume to FAT volume. Another drawback comes from a strong simplification I did in order to squeeze the code as much as I could: I read the ROOT only once in RAM. I checked at my desk on a piece of paper that the ROOT size, that I have, fits inside 0x38 LBA read blocks all in once (indeed it is smaller than that), so I could avoid reading the ROOT crunch by crunch. In the beginning, I had a code version that read the ROOT by portions of the RAM buffer and was updating it when necessary during the search for the filename. Unfortunately, this version of the code was too long for the purpose so I checked the size of the ROOT on my disk, and likely I could read it all with one operation. So we can calculate now the biggest ROOT size that the "Space Shuttle" can handle (or it is compatible with): 56 (=0x38 LBA block) * 512 Bytes per block / 32 Bytes per ROOT entry = 896 entries. It seems that the "Space Shuttle" can find the file only if it is located in the first 896 entries of the ROOT but, more preciselly, just 512 entries. Why? The number of ROOT entries is stored in the BIOS Parameter Block and it is equal to 512 (=0x200) in this case. I have hardcoded the value instead of reading it from the BIOS Parameter Block. So if we just copy-paste the shuttle to another volume without fixing all the hardcoded parameters it won't find any file stored after the 512th entry in the ROOT. Why I did so? Well, for space limitation reason. MOV CX, imm-16 is one Byte shorter than MOV CX,[from BPB]. One Byte doesn't seem too much but I had to steal Bytes everywhere and I left only with two Bytes free at the end. The same LBA block will be reused for the FAT read operation, I will only change two Bytes at 0x7BF8. The FAT1 and FAT2 starts are pre-calculated from the BIOS Parameter Block and stored at addresses 0x7BDC and 0x7BDE. Again it was Bytes shorter to move the pre-calculated value directly in the pre-formatted LBA rather than reading the raw data from BIOS Parameter Block, calculate these values and transfer them in the LBA address packet.

As soon as it was clear to me that I had to save any Bytes if I wanted to have one chance to make it, I wrote down all the possible LBA that shuttle would need and compare them to see which Bytes changes and which could be standardized. As you can see in Fig. I all the Bytes in green are common so could have been standardized.

LBA packets required for shuttle
Fig. I - LBA packets required for shuttle

The first idea I had was to have only one LBA address packet; I wanted to change the Bytes as required and build the needed address. Then I realized that the Bytes of code that I needed to transfer the information and rebuild the LBA address packets were more than the Bytes of an intermediate solution with two pre-filled LBAs that still needed some manipulation. So, in the end, I defined a second LBA packet starting at address 0x7BE0 with the beginning of cluster #2 (beginning of data area). This value is the offset to add once the cluster number is known in order to find the data on the disk. The current portion of the FAT loaded in memory is stored at address 0x7BD7. Suppose I calculate that the next cluster I want to know is in the FAT portion number 3, in this Byte, I read which portion of FAT is currently in RAM, so I can compare and know if an update of the FAT image is necessary or not. The initial value of 0xFF is a trick to force the code to load the FAT in RAM the very first time (we will see this later in details). The problem here is that each copy of FAT is 0xCF blocks long (as stated in BIOS Parameter Block) and my RAM buffer is just 0x38 blocks so I have to use an algorithm that reads portions of FAT in RAM at each time. And here is a very interesting point that I want to comment with you. Remember that I didn't implement such a kind of algorithm that can handle the ROOT directory by truncating it in crunches of 0x38 blocks, then load onto RAM and managing update of crunch image if necessary, but I had to develop such an algorithm for the FAT. In theory, there is the potential to write a general-purpose algorithm that can be used for both (FAT and ROOT management in RAM) to be called by different points of the code just changing some parameters (i.e. pointing ROOT, FAT1 or FAT2) and serving the purpose. Indeed, I tried to do such a thing but I didn't succeed (at least I didn't within 512 Bytes...). This is an improvement potential you may want to try for your own experiments.

Once I clarified the organization of the Bytes in RAM, I can move and comment on the code as in Fig. F and Fig. G. The general reset and relocation is pretty much standard stuff with initialization of segment registers and copy of code using REP-MOVSW instruction. After that, each block of code in the flow chart get more comment in a dedicated paragraph.

Transfer ROOT in RAM

Transfer ROOT in RAM
Fig. J - Transfer ROOT in RAM

Here you see that I try to read the HDD 3 times before giving up and consider that a read error really occurred. This is a little bit historical. For floppy disk drive, the spin up time of the disk was remarkably longer so it happened that a first read attempt was negative but if followed by some others in a raw it was indeed able to read. With HDD this situation got much better and seems that some HDD controllers even take care about this condition so it is just enough to read once: if it is not ok, it means it is really not ok. With solid-state disks, this problem doesn't even exist. The point is that I collected as much information as possible to see if this procedure was really necessary or not. In the post "Read the first block of HDD" I had a successful read at first shot. But all the information I found on the Internet couldn't give me a sure statement that such a procedure was definitively obsolete and not required. I mean that I would be glad if I could save all those Bytes of code to use them for something else. In the end, I had to take a decision, either discard the multiple read implementation and take some risk that it fails or rely on the positive experience already made in the post "Read the first block of HDD" and keep going this way. I decided for the first for no technical reason, just because I took this challenge as part of the learning experience. So you see that if 3 times a reading problem happens, it exits the loop and falls down into the "ROOT read error" display message routine.

Compare elements

Compare elements
Fig. K - Compare elements

BX holds the pointer to the beginning of each entry of the ROOT image as loaded into RAM. CX holds the number of the entries in the ROOT. As I said before, notice that the code is not reading this value from the BIOS Parameter Block (lost of portability) but it is hardcoded in order to save one Byte. The next line (XCHG AX, CX) is another way of nesting. CX is the counter used for many things. In this case, I had the problem of repeat-string instruction nested inside a loop cycle and I had to use CX for both. In a standard way (as a compiler would do) one could store CX on the stack, but I decided to store it in AX. Next line is tricky. I have loaded CX with the value of 0xC(=12), one more than the string to scan but why? Suppose there are two strings to compare "AAA" and "AAB", one would consider setting the counter to 3 and say that if at the end of the scan the counter is 0 then all characters are equal. This is wrong! To understand the reason for it we have to do it step by step. On Chapter 2 page 41 of the "INTEL – The 8086 family User's manual" you can see the hard coded routine in the circuitry of the CPU for the REP CMPSB instructions. It works like this:

  1. Decrease CX (now CX=2)
  2. Compare Bytes ("A" vs. "A")
  3. Adjust SI and DI
  4. Are they equal? If yes do it again.
  5. Decrease CX (now CX=1)
  6. Compare Bytes ("A" vs. "A")
  7. Adjust SI and DI
  8. Are they equal? If yes do it again.
  9. Decrease CX (now CX=0)
  10. Compare Bytes ("A" vs. "B")
  11. Adjust SI and DI
  12. Are they equal? If yes do it again.
  13. No they are not equal this time so stop

At this point the routine stops not because all characters are equal (CX=0) but because the last two characters were not equal at all. If we now check CX we could think the strings match because CX=0 but it is not. If we repeat the same example starting with CX=4 then the CPU will stop at the comparison between "A" vs. "B" and the CX counter will be still = 1. So if you want to be sure to recognize a perfect match of strings including the last character too then the CX counter must be 1 count more than the length of the string. CPU compare strings by pointing their beginning in DS:[SI] and ES:[DI]. The source string is constant and we can point it just with MOV SI, 0x7BC5 but the destination string changes the position in RAM depending which ROOT entry is considered at each time. You may think that since ROOT entries are at 0x20 (=32) Bytes space from each other you can just add this value to DI at the beginning of each loop. The point is that DI changes depending on the grade of success of the scan operation so you would keep adding 32 Bytes of offset from an index that is variable so it won't work. What really remains constant at the beginning of each ROOT entry is the value pointed by BX, so we increase this value before transferring it into DI as initialization before every scan. We keep looping until the end of ROOT and if the code doesn't find a match it doesn't trigger the exit condition so it has to fall down in the "File not found" error message.

Check size of file

Check size of file
Fig. L - Check size of file

Once the file is found, DI points at the end of the file name plus 2 Bytes. To understand why 2 Bytes, you have to consider that the repeat string loop has already changed DI and SI before checking the CX=0 condition and breaking the execution (you can see the flow chart at page 2-41, Figure 2-33 of the "INTEL – The 8086 family User's manual"). In any case, the relative position of DI within the ROOT entry is known, so we can just add a fixed offset to it and read the size of the file. The size is stored in a 32bit format so double word that doesn't fit in one register so I decided to store it in BX:AX. After that, I subtracted the maximum free space available of 0x09 8000 (= 622 592 see Fig. B). Note that we have to perform subtraction over two registers so we have to start from the low order register and go to the high order register considering the borrow of previous subtraction (if any). So the idea is that if the result is below or equal zero it will fit in the available RAM (so the jump occurs if below or equal JBE) otherwise the code falls down in the "File too big" error message.

Initialize 1st cluster and read data pointed by the current cluster

Initialize 1st cluster and read data pointed by current cluster
Fig. M - Initialize 1st cluster and read data pointed by current cluster

This portion of the code of the "Space Shuttle" reads the "SOFTWARE.BIG" file one cluster at each time. The relative position of DI inside the selected entry in ROOT is known so I just have to add the offset to it and read the value of the starting cluster in AX. This initialization needs to be done only one time, after this, the code enters a loop in which it reads the data from the disk that is stored at the location pointed by the current cluster in AX, then it reads the next cluster to follow (if any) in AX again and it loops in this way until the end of the file (marked with cluster 0xFFFF). At the beginning, I save a copy of the current cluster pointed by AX in BX (note that I could also have used MOV BX, AX but not XCHG AX, BX !!!). I do that because I am going to manipulate the register AX in order to calculate the relative LBA block from the start of the data area. This is easily done with: LBA relative to start of data area = (cluster – 2) * 4. It is easily explained because the cluster count starts at #2, so the relative position needs to be corrected subtracting it and then I multiply by the number of blocks per cluster (again, this should be read from the BIOS Parameter Block to have a portable code on each FAT volume, but instead I hardcoded this value in the code for space reasons). Once I have the result of the calculated offset in the 32bit over 2 registers (DX:AX), I can add this relative offset to the absolute LBA start of the cluster #2 that once again was pre-calculated and hardcoded. Once this last computation is done and the result stored on the LBA address packet I can go on and perform the read attempt from HDD hoping it is successful. After trying 3 times and if still not successful, it falls down to the error message string "DATA read error" signalling that the file was unfortunately stored on some bad sectors. But if one of the read attempts is successful, I jump out of the loop into the next block.

Point next free RAM

Point next free RAM
Fig. N - Point next free RAM

In the LBA address packet, the beginning Byte of the store buffer in RAM is addressed with the segment:offset convention. This has the advantage that there are many segment:offset combinations to describe a Byte that is at the beginning of a paragraph (example: 0020:0100 = 0030:0000 = 0010:0200). Since we read 4 blocks per cluster (once again hardcoded instead of BIOS Parameter Block related) we keep filling the RAM with a paragraph alignment. One could decide to increase the offset part of the address by 0x800 (=2048 = 4*512) but once this value gets bigger then 0xFFFF he/she has to fix the segment part of the address. The algorithm is then a bit complicated. But since we fill the RAM with a paragraph alignment we can just update the segment part directly adding 0x80 and keeping the offset part always equal to 0x0000.

Compare the current portion of FAT in RAM against the required one

LBA disk address packet
Fig. A - LBA disk address packet

Once the portion of the file contained in the current cluster has been loaded onto RAM, I have to use the current cluster value as a pointer inside the FAT table to read the value of the following cluster. The value of the cluster was in AX, but I had to modify it and make some algebra to calculate the real LBA read address from HDD. Luckily I saved a copy of it in BX that I can restore in AX with and XCHG operation (note that I have to move it into AX because the MUL and DIV operations that will follow accept AX ONLY as one of the implicit argument). Since this is FAT16, every cluster is 2 Bytes long so that the offset from the beginning of the FAT of each cluster is just the cluster number multiplied by 2. I decomposed this number using 2 numbers: the first stands for a "big steps" and the second stands for a "small steps". So the question is how many big steps and then small steps one has to count from the beginning of the FAT to reach the desired address? Well the big steps are as big as the RAM buffer = 0x38 LBA blocks (= 56 * 512 = 28672 Bytes = 0x7000). So if I first multiply the cluster number by 2 and then divide by 28672 Bytes I know immediately the FAT portion in AX and the pointer relative to the beginning of the FAT portion in DX. Once this conversion is done, I have to compare the FAT portion in AL4 with the Byte stored at 0xFBDF. If the FAT image in RAM is the same as recorded by this Byte, I can find immediately in RAM the value of the next cluster, otherwise, I have to update the FAT image in RAM before getting this value. This is the general logic but I need to trigger it for the very first time in such a way that an image of FAT overwrites the image of ROOT in the RAM buffer. To do that, I used the conventional value of 0xFF hardcoded for the Byte that stores the FAT image in RAM, in such a way that the CMP always ends with ZF=0 the first time5.

Update FAT image

Update FAT image
Fig. P - Update FAT image

This block does a lot of things with two loops nested. In the beginning, I convert with some algebra the original value in AL (the portion of the FAT that needs to be loaded onto RAM) in an offset address in terms of LBA blocks from the beginning of the FAT itself stored now in AX. Once AX stores the relative LBA from FAT begin, I save a copy of it in BP (I did it with PUSH and POP passing the value through stack but MOV BP, AX could have been also used in an equivalent form). I need to store a copy of it because I am going to enter a loop whose task is to calculate and then read the FAT image from RAM. There are 2 copies of FAT on HDD, so if the first fails I can try with the second one, but it means that I have to recalculate the real LBA again: the offset from the beginning of FAT is the same but the beginning of the FAT is different depending on if one is reading FAT1 or FAT2. Since the loop is not going to preserve the value of AX, I need to have a copy saved somewhere to recall back if I re-enter the loop for the second time attempting the read from FAT2. The instruction XCHG AX, BP at the beginning of FAT_loop_start: seems strange. You may object that AX=BP already so why exchanging them? This instruction is there for an initial reset if the loop goes again for the second time because the FAT1 was found corrupted. In that case, when the loop restarts for the second time, AX is not equal to BP so I need to recreate the good starting condition. The register BX is used to provide indexed addressing, in fact, the start address of FAT1 and FAT2 were calculated offline and hardcoded one after the other at addresses 0x7BDE (FAT1) and 0x7BDC (FAT2). If we still have a read error after 3 attempts the code falls and loops again try treading the second FAT. If it is still not successful than it falls down again and displays the "FAT read error" string.

Get next cluster

Get next cluster
Fig. Q - Get next cluster

There are two ways to enter this block so, I have to consider both of them. At the end of "Compare current portion of FAT in RAM" I had in DX the value of the offset from the beginning of the current FAT portion to read the next cluster from. I saved this value in the stack. I had also the value of the portion of the FAT image to be loaded onto RAM. Then I don't know whether or not the code went thru "Update FAT image", but since the code is here, I am sure that the image of the portion of the FAT in RAM is the right one, so I better save this value now in 0x7BD7 for future comparisons. This value was temporary stored in DI, so I recall it with XCHG.
Another value I need now is the offset that was in DX and I have saved on the stack, so I can recall it with a POP and put in BX in order to use an indexed addressing mode. I had to save this value on the stack because if the code had to go thru "Update FAT image" then all the registers would have been used and there was no more place inside CPU to store temporary this value. Once the next cluster is known and stored in AX, I can compare it with the end of file flag (0xFFFF) and jump to the beginning of the loop again until we finally hit the end of the file and the code falls down into the transfer control block.

Transfer control

What this block does is unfortunately very poor and simple: it shows the string on the screen "Loaded!", it waits for a keypress (INT 16H) and goes with and unconditional jumps to 0x7C00. I said unfortunately because I would like to implement at least checksum verification before passing control to it but I didn't have any Byte free left. Who knows what kind of file this "SOFTWARE.BIG" in reality is? Of course, I can specify that it must be a binary with an entry point at its first Byte which is going to be loaded in RAM at 0x7C00 but this is no guarantee by itself that the specifications will be respected. I can rename a .JPG image file in "SOFTWARE.BIG" and the "Space Shuttle" will load and pass control to it. This is really a risky jump to take!!!

The very last step

Now that I have my "Space Shuttle" in excel as a sequence of ASCII in each and every cell, I have to bring this back in DEBUG.EXE to generate a binary file. I could do it one Byte at a time or, as I discovered with a lot of pleasure, I can just write a small script. To create the script I just open a new and empty TXT-file. I copy and paste from excel the Bytes as they appear in Fig. H and I remember to change manually the last two Bytes for the boot signature. At this point 90% of the job is done because all the Bytes are there in the order I need, I just have to type some DEBUG commands to generate the file. The result looks like the one in the DEBUG script to generate the "Space Shuttle". It is still quite an amount of typing to do but much practical way than typing directly into DEBUG.EXE. Once the script file was ready I invoked it with the syntax:
debug < script.txt

Et voilá, job done!
Looking back now on the years when I was a kid how many things could I have learned and how better could I have used my MS-DOS 3.0 if only I had Internet but, I had no Internet at that time.

DEBUG script to generate the "Space Shuttle"
e 100 eb 3c 90 4d 53 44 4f 53 35 2e 30 00 02 04 02 00 e 110 02 00 02 00 00 f8 cf 00 3f 00 f0 00 60 7a 32 02 e 120 e0 3a 03 00 80 00 29 da 3b 65 40 4e 4f 20 4e 41 e 130 4d 45 20 20 20 20 46 41 54 31 36 20 20 20 fa 33 e 140 db 8e db 8e c3 8e d3 bc 00 78 fc be 00 7c bf 00 e 150 7a b9 ff 00 f3 a5 ea 5b 7a 00 00 89 1e fe 7b fb e 160 b9 03 00 be f0 7b b2 80 b4 42 cd 13 73 18 33 c0 e 170 cd 13 e2 f4 be 80 7b e8 e6 00 be 6f 7b e8 e0 00 e 180 33 c0 cd 16 cd 19 bb 00 05 b9 00 02 91 b9 0c 00 e 190 8b fb be cc 7b f3 a6 e3 0e 83 c3 20 91 e2 ed be e 1a0 8b 7b e8 bb 00 eb d3 8b 45 10 8b 5d 12 2d 00 80 e 1b0 83 db 09 76 08 be 9c 7b e8 a5 00 eb bd 8b 45 0e e 1c0 50 5b 2d 02 00 b9 04 00 f7 e1 03 06 d8 7b 13 16 e 1d0 da 7b a3 e8 7b 89 16 ea 7b b9 03 00 be e0 7b b2 e 1e0 80 b4 42 cd 13 73 0e 33 c0 cd 13 e2 f4 be ad 7b e 1f0 e8 6d 00 eb 85 81 06 e6 7b 80 00 93 bb 02 00 b9 e 200 00 70 f7 e3 f7 f1 50 5f 52 3a 06 d7 7b 75 1c 97 e 210 88 06 d7 7b 5b 8b 87 00 05 3d ff ff 75 a2 be c2 e 220 7b e8 3c 00 33 c0 cd 16 e9 d5 00 b4 38 f6 e4 50 e 230 5d bb 04 00 95 03 87 da 7b 89 06 f8 7b b9 03 00 e 240 be f0 7b b2 80 b4 42 cd 13 73 c4 33 c0 cd 13 e2 e 250 f4 4b 4b 75 df be b8 7b e8 05 00 e9 1c ff 94 95 e 260 b4 0e bb 07 00 ac 3c 00 74 04 cd 10 eb f7 c3 50 e 270 72 65 73 73 20 61 6e 79 20 6b 65 79 2e 2e 2e 00 e 280 42 61 64 20 52 4F 4F 54 0a 0d 00 46 69 6C 65 20 e 290 6E 6F 74 20 66 6F 75 6E 64 0a 0d 00 46 69 6C 65 e 2a0 20 74 6F 6F 20 6C 61 72 67 65 0a 0d 00 42 61 64 e 2b0 20 64 61 74 61 0a 0d 00 42 61 64 20 46 41 54 0a e 2c0 0d 00 4c 6F 61 64 65 64 21 0a 0d 00 53 4f 46 54 e 2d0 57 41 52 45 42 49 47 FF 20 7c 32 02 31 7b 62 7a e 2e0 10 00 04 00 00 00 c0 07 20 7c 32 02 00 00 00 00 e 2f0 10 00 38 00 00 05 00 00 00 7c 32 02 00 00 55 aa n SHUTTLE.BIN rcx 200 w q


  1. I don't want to give you the impression that one has to use a very rigorous, structured top-down approach right from the beginning. The truth is exactly the opposite one. I have used a bottom-up approach doing some experiment with the code immediately to get the feeling of what maybe works and what won't. I experimented quite a while before switching to the rigorous top-down (design, build and test) approach. The point is that one has to develop a feeling about when it is time to move from bottom-up to top-down, but no matter what you have to do this step if you want to have good quality in the code. What I do here is one step further, which is to write a diary to strength and consolidate each step of my learning before moving to the next one. [click back]
  2. In FAT32 the ROOT directory is not statically allocated as in FAT16: it means that you have to search the ROOT directory as if it was any other sub-directory. This is another reason why I preferred to use FAT16 instead of FAT32 for my first learning step. [click back]
  3. For instance, we saw already two different implementation techniques about how to nest FOR-Loops: the fist as a compiler would do in the post "stack_to_string procedure" and the second as a person could do coding directly in assembly in the post "Read the first block of HDD". [click back]
  4. Well I know that the FAT is just 207 blocks long since I read it in the BIOS Parameter Block block at my desk so I am sure that any result will always have AH=0x00. [click back]
  5. Please note that "Zero Flag" ZF=0 means that the flag is cleared so it tells that the two numbers compared are different. [click back]

<PREV.  -  ALL  -  NEXT>

Comments