It is time to look into FAT

In this post, I write about my first deep dive in the FAT16 format and my experiments reading the file allocation table. It was fun to make sense and understand the FAT16 format, even though it was not easy to find good documentation on the internet. In fact, I couldn't have learnt it from the documentation alone without my experiments.

Now that we know all about the first block (the Partition Boot Record) and we know how to read any block from HDD into RAM, it is time to go out of the first block of the HDD and discover the rest of it. On the Internet, you will find a lot of documentation about the FAT16 format but just two of them are good in my opinion:

  1. "A tutorial on the FAT file system" by Bob Eager;
  2. "Microsoft Extensible Firmware Initiative FAT32 File System Specification" version 1.03 from December 6th 2000.

The first one is a good tutorial to break the ice, but (almost1) all details are just described in the second document from Microsoft. After reading the FAT specification, I decided to remain with the 16-bits version, since it is simpler than the 32-bits one. When I set up the test rig I wanted to keep the complexity of the partition "D:\" as low as possible in such a way to increase my chances of learning and managing it. So I stuck to the KISS rule (Keep It Safe and Simple) and I kept it in FAT16.

First of all, I formatted the partition "D:\" again and I put some folders and ".TXT" files there with known content; then I deleted some of them keeping some others. I did so because I planned to do a dump of a lot of blocks of my HDD partition using the Linux command dd, convert it later on in a text file and have a look on how things are organized on that partition. I wanted to read a lot of blocks so I set the count = 2000 this time.

dd if=/dev/sda2 of=/home/mik/Desktop/sda2.bin bs=512 count=2000
hexdump -C -v sda2.bin > sda2.txt


Once again I transferred the result in Excel in such a way to make it easier for me analysing, commenting and preparing the pictures. We have analysed already the BIOS Parameter Block (BPB) in Fig. E of post "Learn read before you write", so let's start to match the information stored there with the dump of the disk here and see if it makes sense.

The two copies of FAT16
Fig. A - The two copies of FAT16

In Fig. A we see that an empty block is immediately following the Partition Boot Record. This fact agrees with the information stored in the word starting at 0x0E from the beginning of the Partition Boot Record inside the BIOS Parameter Block (see Fig. E of post "Learn read before you write"). Even though statistically, in most cases, the number of reserved blocks is 1 being the allowance for the boot block, it doesn't mean that it has to be like this always. It is worth to highlight here one difference in the comments between the two Microsoft's specifications.

BPB_RsvdSecCnt in the specifications
Fig. B - BPB_RsvdSecCnt in the specifications

In Fig. B, you see that "Reserved sector count in BIOS Parameter Block" (BPB_RsvdSecCnt) is used to align the start of the data area to integral multiples of the cluster size with respect to the start of the partition. In my case, I created the partition and I formatted it from Windows XP (a Microsoft OS) which indeed has created more that one reserved sector (or block) in order to align properly the start of the data area. So the number of reserved blocks creates some space between the beginning of the partition and the beginning of the file system. This could be used for our purpose of building up a "space shuttle" that perhaps won't be able to fit within 512 Bytes of Partition Boot Record space. I have read somewhere that old versions of GRUB (GRUB-Legacy) may have used this space to fill it with useful code that can be read during booting when still no knowledge of the file system exists. But let me stretch this concept further and put it to the limit: why not create so much space with the reserved blocks as the addressable memory in 16-Bit real mode? One could load all the necessary software in RAM with INT 13H all at once. When the PC boots there is no knowledge of any file system. As far as the BIOS is concerned, the HDD is organized as a linear sequence of LBA blocks numbered one after the other. For me, this seems simple, robust, reliable and fast. But this is just my opinion based on the facts I know and things I have learned until now. Probably there are facts I am not aware of until now which suggest not to follow this way. Just let us say it is a possibility one may decide to exploit (but this is just my personal opinion about that and, as usual, I invite you to make your own opinion or tell me in the comments).

At the end of the reserved block we find the two copies of the FAT (Fig. A), so I checked the number stored in the BIOS Parameter Block, I did some additions and I verified that the start address from the beginning of the hexadecimal dump corresponded to the partition "D:\" of my drive. We see that each cluster number is a 16-bits number (the reason for calling it FAT 16) and we know that the first two clusters don't exist because the file system driver uses the first two storage locations (FAT[0] and FAT[1]) of the FAT array for different kind of information rather than a cluster number. The first word stored at position FAT[0], contains a copy of the BPB_Media2 byte in its low 8 bits, and all other bits are set to 1. The second word stored at position FAT[1] is used by the file system driver to store information about the volume. The high two bits of the FAT[1] are entry for dirty volume flags (all other bits, are always left set to 1).
For FAT16:

  • ClnShutBitMask = 0x8000;
  • HrdErrBitMask = 0x4000;

All other FAT storage location corresponds to a cluster, for instance, the storage location FAT[2] corresponds to the cluster #2. The numeric value stored in the FAT tells something about the corresponding cluster. In the "Microsoft FAT Specification (SDA Contribution)" from August 30th 2005 (which I cannot find anymore on the Internet) there is a useful table with the FAT entry values comparing FAT12, FAT16 and FAT32. I summarize here in Fig. C the FAT16 part of that table.

FAT16 entry values
Fig. C - FAT16 entry values

So let's do some arithmetic: with a 16-bits quantity, we have 65536 numbers available. Eleven numbers are already reserved with special meanings (0x0000, 0x0001 and from 0xFFF7 to 0xFFFF) so there remain 65536 - 11 = 65525 numbers that we can use to identify the clusters; in conclusion, we can allocate 65525 clusters with FAT16. But how many Bytes are in a cluster? Well, it depends. A cluster is a group of sectors (or logic blocks) and each sector has a size in Bytes. At this point, I suggest you read the Microsoft's specification on page 9 about BPB_BytsPerSec (Bytes per Sectos) and BPB_SecPerClus (Sector per Cluster). It is interesting to read that the size of the sector (the number of Bytes it has) depends on the HDD. 512 Bytes per sector (or per logic block) seems to be the most used size, but it is not always the case3. The number of sectors per cluster can be any within the list of the specification (BPB_SecPerClus on page 9) as long as the total size of the cluster is smaller or equal to 32 KBytes4. On the two extremes considering the smallest block by the smallest cluster and the biggest block with the biggest cluster we have the following range for the data:

  • Minimum HDD addressable data of ~32 MB (with 512 Bytes per block and 1 block per cluster)
  • Maximum HDD addressable data of ~2 GB (with 32 KBytes per cluster)

The smaller the size of the cluster, the better the data storage density that can be achieved, at expense of the maximum HDD size that can be addressed. The opposite is also true. Let me explain it with an example. Suppose that we have to store a file of 1024256 Bytes (~0,98 MB), in the first case, we need 2001 clusters (1 cluster = 512 Bytes) leaving 256 Bytes unused for the file but occupied on the HDD by the file system. In the second case, we need 32 clusters (1 cluster = 32 KBytes = 32*1024 Bytes) leaving 24320 Bytes unused for the file but occupied on the HDD by the file system.
What I found on the partition "D:\" of my HDD was 512 Bytes per block at 4 blocks per cluster which equals 2 KB per cluster.

ROOT directory
Fig. D - ROOT directory

Moving on and scrolling down in the excel file, where I put the dumped image of the HDD, I found the ROOT directory (Fig. D) and EUREKA! I found the volume label in the first entry of the ROOT directory. Do you remember? I was analysing the BIOS parameter block and I found the field designed to store the information of the volume name filled with the string “NONAME ” (Fig. E of post "Learn read before you write"), meanwhile, when opening the volume with the file explorer on Windows-XP it was showing the volume name "TEST". At that time I was wondering how it could be possible. Now I know that the first entry of the ROOT directory is a special entry containing this info5.

Entry for file ROOT.TXT in the ROOT directory
Fig. E - Entry for file "ROOT.TXT" in the ROOT directory

Scrolling down a little bit more I found the entry of the file "ROOT.TXT" in the ROOT directory (Fig. E). Here I read that this file starts at cluster #3, so I scrolled down until cluster #3 and I found the data corresponding to it (Fig. G).

Cluster #3
Fig. G - Cluster #3

According to specification, the FAT16 can accommodate long file name (i.e. more than 8 characters for the name and more than 3 characters for the extension). I wanted to look at this too, so I formatted the test partition "D:\" again and I stored a text file in the ROOT directory whose size is big enough to occupy more than one cluster.

FAT16 long name
Fig. H - FAT16 long name

In Fig. H you see the new dump of the ROOT directory. The first entry in the ROOT is still the volume label characterized by identification byte 0x08 at offset 0x0B and having a starting cluster 0x0000 (we know that real clusters start at #2 so this is one of the conventional meaning of cluster number #0). Soon after this, we have the long file name. So the technique is to write the long name using more than one entry in the ROOT. The entries containing the long name must be in a sequence one after the other starting from the end of the string (at low address) and going towards the beginning of the string (at high address). These entries in the ROOT don't follow the general rule but they use all other fields (for instance the date and time of creation field) also to store portions of the long name. The last entry in the chain (at the highest address of them all) follows the general rule and contains all the standard information (date, time of creation/modification, starting cluster, size and so on) but the file name is now stored with 8.3 format convention with the "~N" truncation. N can be any number in range from "~1" to "~999999" (read page 32 of "Microsoft's specification"). The first byte of every crunch of the long name is a numbering index of the portion of the name plus the special value 0x40 (meaning that bit 6 is set) to indicate the end of the file name. If you do the algebra, with 5 bits remaining one can address 0x3f possible crunches (= 63) considering that crunch 0x00 has the special meaning indicating a "free" directory entry and the "last" entry of current directory, it remains only 62 entries to write the long name. Every entry can accommodate 13 characters so long names can be 806 (=13*62) characters long including the "." and the file extension. But this is not true: the maximum length is limited to 255 characters (read page 30 of "Microsoft's specification"). In my case, the first entry is marked with 0x42 = 0x40 + 0x02 so it means that this is the crunch number 2 and this is also the last crunch. Next crunch is the 0x01 and so on. I have noticed that the characters for long names occupy 2 Bytes and I guess it is UTF-16 but Microsoft doesn't mention UTF-16 directly in the specification rather it writes: "Long names are stored in long directory entries in UNICODE. UNICODE characters are 16-bit characters".

File saved into two clusters
Fig. I - File saved into two clusters

In Fig. I you can see the file going above the first cluster occupying cluster #2 and #3. In the ROOT (Fig. H) we find the start cluster (cluster #2) and in the FAT at position 2 we find the continuing cluster (0x 0003). In Cluster #3 we find the end of the file and in position #3 of the FAT, we find the marker 0x FFFF which indicates the end of the file indeed.

GREAT! I am really happy! It all makes sense and I feel confident I have fully understood the FAT16 file system. It was a bit painful and long but it was definitively a good idea to have all the Bytes in an Excel file; it was then easy for me to build graphically the logical groups and see them. I could give colours to separate, distinguish the different Bytes and this improved a lot my understanding process of it. Another advantage was that I could write comments on the right of each Byte or group of Byte organizing them as a TETRIS cascade and calculate the offsets where each new logical group starts and compare it with the information saved in the BIOS Parameter Block. It was a good learning experience for me.

Comments