tags : Linux, Systems, Filesystems, Object Store (e.g. S3)

Physical Medium

  • HDD
  • SSD
  • These devices have a connector that speaks a certain transfer protocol, usually SCSI or ATA, carried over the corresponding cable.
  • Read more on this later: Data Storage on Unix

Storage Medium

  • Attachment Packet Interface (ATAPI) disk
  • Serial-Attached SCSI (SAS) disk
  • Serial Advanced Technology Attachment (SATA) disk

Storage layers

See Ext4 FS Info

Block layer

See Memory Hierarchy. The block layer divides the storage device into fixed-size “blocks” of contiguous memory, regardless of what the physical medium looks like underneath:

  • HDD: Spinning platters & moving arms
  • Tape: Single contiguous magnetic stripe.
  • SSDs: Arrays of NAND flash
λ stat -f .
  File: "."
    ID: 4817ba2a119ea9a0 Namelen: 255     Type: ext2/ext3
Block size: 4096       Fundamental block size: 4096
Blocks: Total: 113739102  Free: 73843814   Available: 68047808
Inodes: Total: 28966912   Free: 26207344
 
λ sudo tune2fs -l /dev/sda3 | rg -i inode
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype ...
Inode count:              28966912
Free inodes:              26207643
Inodes per group:         8192
Inode blocks per group:   512
First inode:              11
Inode size:               256
Journal inode:            8
First orphan inode:       26246587
Journal backup:           inode blocks

Superblock

  • The superblock is file system metadata (e.g. FS type, size, status, free and allocated blocks, and metadata about the metadata). It is critical to the FS, so redundant copies are stored.
  • The superblock is defined by the FS but physically lives in the partition.
    • If the superblock of the partition backing /var becomes corrupt,
    • the FS in question (/var) cannot be mounted by the OS.
    • In this case you may need to run fsck against a backup copy of the superblock to attempt to recover the FS (sketch after the dumpe2fs output below).
λ sudo dumpe2fs /dev/sda3 | rg superblock
dumpe2fs 1.47.0 (5-Feb-2023)
Primary superblock at 0, Group descriptors at 1-56
Backup superblock at 32768, Group descriptors at 32769-32824
...
Backup superblock at 78675968, Group descriptors at 78675969-78676024
Backup superblock at 102400000, Group descriptors at 102400001-102400056
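A minimal recovery sketch, assuming the partition above (/dev/sda3) and the first backup superblock location reported by dumpe2fs; the FS must be unmounted first:
λ sudo umount /dev/sda3
λ sudo e2fsck -b 32768 /dev/sda3   # -b makes e2fsck read the superblock from the backup at block 32768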

Layout

Layout is specific to the FS. Take the ext4 layout for example: it has one inode table per block group, which tracks only the inodes in that group.
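dumpe2fs can show this per-group layout; a hedged example against the partition above (each group prints its own “Inode table at …” range, and the numbers will differ per machine):
λ sudo dumpe2fs /dev/sda3 | rg -m1 -A 4 "^Group 0"   # -m1 stops after the first match, -A 4 shows that group's details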

File/inode layer

Files often contain data that spans multiple blocks. The FS uses an inode to identify each file; the inode records which blocks hold the file's data.
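A quick illustration with stat (the path is arbitrary and the numbers on your machine will differ):
λ stat -c 'inode=%i  allocated=%b blocks of %B bytes' /etc/os-release
inode=1835172  allocated=8 blocks of 512 bytes
# stat counts 512-byte blocks by default, so 8 of them = one 4096-byte FS block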

Max file size (for ext4)

λ sudo tune2fs -l /dev/sda3 | rg -i "Block|inodes" # trimmed results, in bytes
Inode count:              28966912
Block count:              115838208
Block size:               4096
Inode size:               256
 
# (115838208 * 4096)/(10^9) = 474GB (my SSD size)
  • Max file size and Max file system/volume size are different. See What are the file and file system size limitations
  • ext4 max block count: 2^32 ⇒ 2^32 × 4096 bytes = 2^44 bytes = 16 TiB (max file size at 4 KiB blocksize; shell check after this list)
  • The 64bit feature flag allows block counts greater than 2^32
  • Smallest possible blocksize for ext4 is 1024 bytes.
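A quick sanity check of that arithmetic in the shell (binary units, so divide by 2^40 for TiB):
λ echo $(( (2**32 * 4096) / 2**40 ))   # max file size in TiB at 4 KiB blocksize
16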

inode table size

  • bytes-per-inode is the same as inode_ratio. This is a config option at mkfs time (mke2fs -i, or inode_ratio in /etc/mke2fs.conf), not a property of the partition.
# inode_count = (blocks_count * blocksize) / inode_ratio
# inode_count = (partition_size_in_bytes) / inode_ratio
# inode_table_size = inode_count * inode_size
  • As a general rule of thumb, the inode_ratio should not be smaller than the blocksize, because you cannot (easily) store more than one file in one block (sanity check below).
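Sanity-checking those formulas against the tune2fs output above (integer division, so the ratio comes out just under the default 16384):
λ echo $(( 115838208 * 4096 / 28966912 ))   # blocks × blocksize / inode_count ≈ inode_ratio
16379
λ echo $(( 28966912 * 256 / 2**20 ))        # inode_count × inode_size = inode table size, in MiB
7072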

Directory layer

  • Unix directories are lists of association structures, each of which contains one filename and one inode number (see the ls example after this list).
  • dentry
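A quick way to see those (filename, inode) pairs is ls -i; the paths and numbers here are placeholders:
λ ls -ai /etc | head -3   # -i prints the inode number stored in each directory entry
1835009 .
      2 ..
1835137 os-release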

LVM (Logical Volume Manager)

See LVM - ArchWiki

  • A layer above physical partitions that allows multiple partitions to appear as one.
  • Grow a filesystem by adding physical space
  • A physical volume, to be used with LVM, can be a whole disk or a partition.
  • LVM is useful if you want to use LUKS to encrypt your root, swap and other partitions in one encrypted volume.
  • It is sort of a (less featured) alternative to ZFS and btrfs; in a way it's just a volume manager.

What’s the usual blueprint like?

  • ESP (/boot/efi) and boot (/boot) are separate physical partitions, not placed within LVM, because GRUB has to access them at boot time; the rest of the disk can go into LVM (sketch below).
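A minimal sketch of that blueprint (device names and sizes are assumptions; adjust for your disk):
λ sudo pvcreate /dev/sda3                  # the non-ESP/boot partition becomes a physical volume
λ sudo vgcreate vg0 /dev/sda3              # a volume group on top of the PV
λ sudo lvcreate -L 8G -n swap vg0          # carve logical volumes out of the VG
λ sudo lvcreate -l 100%FREE -n root vg0
λ sudo mkfs.ext4 /dev/vg0/root             # the FS goes on the LV, not on the raw partition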

Random notes

  • Install a Linux distro on an LVM partition
  • Allows you to keep expanding the partition by adding disks on the fly
  • It's similar to RAID in that it gives a unified view of the underlying disks, but it doesn't give you redundancy and the other fancy things RAID does

Uses

Personal use

Practices of using LVM in the Cloud

  • My verdict: don't do LVM in the cloud unless you have a specific use case
    • We can have multiple EBS disks mounted at different mountpoints, and our applications can make use of them. So having multiple disks doesn't automatically mean you need LVM.
  • People have different needs, and people do run ZFS/btrfs/LVM on top of cloud volumes like AWS EBS. See AWS for more info around EBS. But usually, if you have just one disk, or even multiple disks that you're fine keeping separate, you don't need LVM.
  • LVM also has a snapshotting feature, but we already have snapshots at the cloud vendor's volume level, so this becomes redundant.
  • When you're running VMs in the cloud, we can attach volumes, and these volumes can be resized in place. You then need to resize the partition and the filesystem. From what I can tell, if you're using only one volume, the steps are more or less the same with and without LVM (see the grow-in-place sketch under Disk usage patterns).
  • If you're using multiple disks in the cloud for some reason, however, LVM can help (see the sketch after this list). But volumes can grow pretty big, so it's unlikely you'd need to add another disk unless there are specific IO requirements, etc.
  • There are other clever uses of LVM in the cloud
  • lsblk : Attached disks
  • cat /etc/fstab : Mounts on boot
  • df -h : Disk free on mounted stuff
  • file -s /dev/xvdf : Determine file system
  • fdisk -l : List all partitions
  • lvm
    • lvm pvdisplay
    • pvs : Show short details of physical volumes
    • vgs : Show short details of volume groups
    • lvs : Show short details of logical volumes
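A hedged sketch of that multi-disk case: a newly attached EBS volume (assumed to appear as /dev/xvdg, with an existing VG vg0 and data LV as placeholders) folded into the volume group:
λ sudo pvcreate /dev/xvdg                        # the new disk becomes a physical volume
λ sudo vgextend vg0 /dev/xvdg                    # grow the volume group with it
λ sudo lvextend -r -l +100%FREE /dev/vg0/data    # grow the LV; -r also resizes the FS on it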

Disk usage patterns

Disk usage patterns in cloud vm v/s Bare metal

How LVM becomes redundant in cloud VMs

  • Instead of adding one or a few big disks and partitioning them in the VM (as we do on bare metal), we add each potential volume as an individual disk containing at most one partition.
  • When you need to extend space, you grow the virtual disk in the virtualization environment, then grow the partition and FS inside the VM (sketch below). This is effectively your LVM: the idea is executed by the vendor/platform, so we don't need to run it inside our VM.
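A minimal sketch of that grow-in-place flow, assuming an ext4 root on /dev/xvda1 and growpart from cloud-utils (run after enlarging the volume on the vendor side):
λ sudo growpart /dev/xvda 1    # grow partition 1 to fill the enlarged disk
λ sudo resize2fs /dev/xvda1    # grow the ext4 FS to fill the partition (works online)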

What disks do we need in cloud VMs

  • In most cases you will need a root FS and a data FS, e.g. two virtual disks, and chances are the ~30 GB root FS disk will never need to be resized.
  • Databases often suggest using separate storage for write-ahead logs; that requires adding a dedicated virtual disk anyway, so it can be placed onto separate storage in the virtualization environment.

RAID

What

  • With RAID you store data in such a way that if a drive fails, you lose no data (with particular RAID levels; minimal sketch below).
  • RAID is no substitute for backups of your data.
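A minimal mdadm sketch of such a setup, mirroring two spare disks (hypothetical names /dev/sdb and /dev/sdc) as RAID 1:
λ sudo mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
λ sudo mkfs.ext4 /dev/md0      # the array is just another block device; either disk can fail without data loss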

Resources