I was looking for a way to estimate the disk space in use by a file, i.e. a Python version of the du command. Naively, I thought that shutil.disk_usage would give me that, since its docstring says: Return disk usage statistics about the given path. But it turns out that it is instead a Python equivalent of the df command: it shows information about the file system on which the path resides. So, I have two questions:

  • Is there a Python equivalent of the du command? For now I am happy to call du as an external command, but a pure-Python variant might be useful for repeated use.
  • Can the docstring of shutil.disk_usage be improved to make clear that it only shows results about the whole file system, not about the individual path?

    (apparent_size
     ? (usable_st_size (sb) ? MAX (0, sb->st_size) : 0)
     : (uintmax_t) STP_NBLOCKS (sb) * ST_NBLOCKSIZE),
    (time_type == time_mtime ? get_stat_mtime (sb)
     : time_type == time_atime ? get_stat_atime (sb)
     : get_stat_ctime (sb)));

    level = ent->fts_level;
    dui_to_print = dui;
    if (n_alloc == 0)
      n_alloc = level + 10;

       && (st)->st_blksize <= (size_t) -1 / 8 + 1) \
      ? (st)->st_blksize : DEV_BSIZE)
    # if defined hpux || defined __hpux__ || defined __hpux
    /* HP-UX counts st_blocks in 1024-byte units.
       This loses when mixing HP-UX and BSD file systems with NFS. */
    #  define ST_NBLOCKSIZE 1024
    # endif
    #endif
    #ifndef STP_NBLOCKS
    # define STP_NBLOCKS(st) ((st)->st_blocks)
    #endif
    #ifndef ST_NBLOCKSIZE
    # ifdef S_BLKSIZE
    #  define ST_NBLOCKSIZE S_BLKSIZE
    # else
    #  define ST_NBLOCKSIZE 512
    # endif
    #endif

Thanks for the suggestions. I may take a dip into it when I have more time. For now, I will stay with running du in a subprocess, with the following function:

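    import os
    import subprocess

    def disk_usage(paths):
        """Return {path: size in bytes} for each path, as reported by GNU du."""
        # -B1: sizes in bytes; -s: one total per argument;
        # -D: dereference symlinks given on the command line.
        cmd = ['du', '-B1', '-s', '-D'] + [os.path.expanduser(p) for p in paths]
        P = subprocess.run(cmd, capture_output=True, encoding='utf-8')
        out = {}
        for line in P.stdout.split('\n'):
            # Each output line is "<size>\t<path>"; split once so tabs in paths survive.
            s = line.split('\t', 1)
            if len(s) == 2:
                out[s[1]] = int(s[0])
        return out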
                  

    But don’t (physical on-disk) disk block sizes vary over time as drives get bigger? I remember when a HDD block size was 32K, now it might be 512K or more.

    How would one find the block size for an individual drive so this routine would work for 30 years?

    Are block sizes managed differently for HDD and SSD?

    I was looking for a way to estimate the disk space in use by a file,

    Somehow on my Windows system I have du from https://www.sysinternals.com. Hm, but that’s just for a directory size only, not for a single file size on the physical disk.

    Or maybe take a look at the gnu utils du source code. Start somewhere around here: Coreutils - GNU core utilities

    Chuck R.:

    But don’t disk block sizes vary over time as drives get bigger? I remember when a HDD block size was 32K, now it might be 512K or more.

    In these cases, there should be a separate .blksize attribute to tell you the block size.

    I have to say, it’s quite strange not to see an obvious, high-level interface for this in the standard library.

    Chuck R.:

    But don’t disk block sizes vary over time as drives get bigger? I remember when a HDD block size was 32K, now it might be 512K or more.

    How would one find the block size for an individual drive so this routine would work for 30 years?

    Karl Knechtel:

    In these cases, there should be a separate .blksize attribute to tell you the block size.

    To be clear, st_blocks isn’t measured in units of filesystem blocksizes or in units of st_blksize (all three are uncorrelated).

    The units of st_blocks aren’t POSIX-standardized, but it’s often blocks of 512-bytes. Often enough that the Python docs simply document it as “Number of 512-byte blocks allocated for file.”
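As a rough illustration of the point above, here is a minimal sketch that turns st_blocks into a byte count for a single file. It assumes a platform where os.stat exposes st_blocks and where the unit really is 512 bytes (as on Linux); the names allocated_bytes, apparent_bytes and ASSUMED_BLOCK_UNIT are made up for this example.

```python
import os

# Assumption: st_blocks counts 512-byte units. POSIX does not guarantee this.
ASSUMED_BLOCK_UNIT = 512


def allocated_bytes(path, follow_symlinks=True):
    """Space allocated on disk for path, roughly what `du` reports."""
    st = os.stat(path, follow_symlinks=follow_symlinks)
    return st.st_blocks * ASSUMED_BLOCK_UNIT


def apparent_bytes(path, follow_symlinks=True):
    """Logical file length, roughly what `du --apparent-size` reports."""
    return os.stat(path, follow_symlinks=follow_symlinks).st_size
```

The two numbers can differ in either direction: a small file usually occupies at least one whole filesystem block, while a sparse file can occupy far less than its apparent size.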

    The actual filesystem blocksize is found in statvfs.f_bsize, which has the Python analog os.statvfs("...").f_bsize.
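To see the three values side by side, something like the following sketch could be used. It relies on os.statvfs, which is only available on Unix, and on st_blksize and st_blocks being present in the stat result; block_size_report is a hypothetical helper name.

```python
import os


def block_size_report(path):
    """Collect the three block-related values discussed above for one path."""
    st = os.stat(path)
    vfs = os.statvfs(path)  # Unix only
    return {
        "st_blksize": st.st_blksize,  # preferred I/O size for this file
        "st_blocks": st.st_blocks,    # allocated blocks, unit not fixed by POSIX
        "f_bsize": vfs.f_bsize,       # block size reported by the filesystem
    }


if __name__ == "__main__":
    for key, value in block_size_report(".").items():
        print(f"{key}: {value}")
```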

    Per sys/stat.h:

  • st_blksize - A file system-specific preferred I/O block size for this object. In some file system types, this may vary from file to file.
  • st_blocks - Number of blocks allocated for this object.
  • The unit for the st_blocks member of the stat structure is not defined within IEEE Std 1003.1-2001. In some implementations it is 512 bytes. It may differ on a file system basis. There is no correlation between values of the st_blocks and st_blksize, and the f_bsize (from <sys/statvfs.h>) structure members.

    The value that st_blocks reports has no direct relation to the underlying filesystem block size. Per the coreutils source code linked in my post above, du assumes that st_blocks is in units of 512 bytes unless you redefine it to some other value with a macro at compile time.
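Putting this together, a pure-Python approximation of `du -s` might look like the sketch below. It is not a port of the coreutils code: it makes the same 512-byte assumption via a block_unit parameter, counts each hard-linked inode once, does not follow symlinks, and silently skips entries it cannot stat. The function name du and its parameters are my own choices.

```python
import os


def du(root, block_unit=512):
    """Approximate `du -s root` in bytes.

    Assumes st_blocks is expressed in block_unit-byte units (512 on Linux).
    """
    seen = set()  # (device, inode) pairs of multiply-linked entries already counted
    total = 0

    def add(st):
        nonlocal total
        if st.st_nlink > 1:
            key = (st.st_dev, st.st_ino)
            if key in seen:
                return  # another hard link to something already counted
            seen.add(key)
        total += st.st_blocks * block_unit

    add(os.lstat(root))
    if os.path.isdir(root) and not os.path.islink(root):
        # os.walk does not descend into symlinked directories by default.
        for dirpath, dirnames, filenames in os.walk(root):
            for name in dirnames + filenames:
                try:
                    add(os.lstat(os.path.join(dirpath, name)))
                except OSError:
                    continue  # entry vanished or is unreadable
    return total
```

Unlike the subprocess version, this stays within one process, which may matter when it is called many times. The totals should be close to what du prints, but they are not guaranteed to match on every filesystem.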