File system

When working with large datasets, it is important to know how to interact with the file system: for example, you may need to search for particular files or create and delete directories.

os and os.walk

The os module in Python provides functions for interacting with the operating system. It is part of Python’s standard library and offers a portable way of using operating-system-dependent functionality. Below are the main functions that we use this week, but the os module offers many more. We load the os module (and any other module) with an import statement \(\rightarrow\) this is comparable to the library() function in \(\textbf{R}\):

import os
# Check out all available functions that the os module offers
print(dir(os))
['CLD_CONTINUED', 'CLD_DUMPED', 'CLD_EXITED', 'CLD_KILLED', 'CLD_STOPPED', 'CLD_TRAPPED', 'DirEntry', 'EX_CANTCREAT', 'EX_CONFIG', 'EX_DATAERR', 'EX_IOERR', 'EX_NOHOST', 'EX_NOINPUT', 'EX_NOPERM', 'EX_NOUSER', 'EX_OK', 'EX_OSERR', 'EX_OSFILE', 'EX_PROTOCOL', 'EX_SOFTWARE', 'EX_TEMPFAIL', 'EX_UNAVAILABLE', 'EX_USAGE', 'F_LOCK', 'F_OK', 'F_TEST', 'F_TLOCK', 'F_ULOCK', 'GenericAlias', 'Mapping', 'MutableMapping', 'NGROUPS_MAX', 'O_ACCMODE', 'O_APPEND', 'O_ASYNC', 'O_CLOEXEC', 'O_CREAT', 'O_DIRECTORY', 'O_DSYNC', 'O_EVTONLY', 'O_EXCL', 'O_EXLOCK', 'O_FSYNC', 'O_NDELAY', 'O_NOCTTY', 'O_NOFOLLOW', 'O_NOFOLLOW_ANY', 'O_NONBLOCK', 'O_RDONLY', 'O_RDWR', 'O_SHLOCK', 'O_SYMLINK', 'O_SYNC', 'O_TRUNC', 'O_WRONLY', 'POSIX_SPAWN_CLOSE', 'POSIX_SPAWN_DUP2', 'POSIX_SPAWN_OPEN', 'PRIO_DARWIN_BG', 'PRIO_DARWIN_NONUI', 'PRIO_DARWIN_PROCESS', 'PRIO_DARWIN_THREAD', 'PRIO_PGRP', 'PRIO_PROCESS', 'PRIO_USER', 'P_ALL', 'P_NOWAIT', 'P_NOWAITO', 'P_PGID', 'P_PID', 'P_WAIT', 'PathLike', 'RTLD_GLOBAL', 'RTLD_LAZY', 'RTLD_LOCAL', 'RTLD_NODELETE', 'RTLD_NOLOAD', 'RTLD_NOW', 'R_OK', 'SCHED_FIFO', 'SCHED_OTHER', 'SCHED_RR', 'SEEK_CUR', 'SEEK_DATA', 'SEEK_END', 'SEEK_HOLE', 'SEEK_SET', 'ST_NOSUID', 'ST_RDONLY', 'TMP_MAX', 'WCONTINUED', 'WCOREDUMP', 'WEXITED', 'WEXITSTATUS', 'WIFCONTINUED', 'WIFEXITED', 'WIFSIGNALED', 'WIFSTOPPED', 'WNOHANG', 'WNOWAIT', 'WSTOPPED', 'WSTOPSIG', 'WTERMSIG', 'WUNTRACED', 'W_OK', 'X_OK', '_Environ', '__all__', '__builtins__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', '_check_methods', '_execvpe', '_exists', '_exit', '_fspath', '_fwalk', '_get_exports_list', '_spawnvef', '_wrap_close', 'abc', 'abort', 'access', 'altsep', 'chdir', 'chflags', 'chmod', 'chown', 'chroot', 'close', 'closerange', 'confstr', 'confstr_names', 'cpu_count', 'ctermid', 'curdir', 'defpath', 'device_encoding', 'devnull', 'dup', 'dup2', 'environ', 'environb', 'error', 'execl', 'execle', 'execlp', 'execlpe', 'execv', 'execve', 'execvp', 'execvpe', 'extsep', 'fchdir', 
'fchmod', 'fchown', 'fdopen', 'fork', 'forkpty', 'fpathconf', 'fsdecode', 'fsencode', 'fspath', 'fstat', 'fstatvfs', 'fsync', 'ftruncate', 'fwalk', 'get_blocking', 'get_exec_path', 'get_inheritable', 'get_terminal_size', 'getcwd', 'getcwdb', 'getegid', 'getenv', 'getenvb', 'geteuid', 'getgid', 'getgrouplist', 'getgroups', 'getloadavg', 'getlogin', 'getpgid', 'getpgrp', 'getpid', 'getppid', 'getpriority', 'getsid', 'getuid', 'initgroups', 'isatty', 'kill', 'killpg', 'lchflags', 'lchmod', 'lchown', 'linesep', 'link', 'listdir', 'lockf', 'login_tty', 'lseek', 'lstat', 'major', 'makedev', 'makedirs', 'minor', 'mkdir', 'mkfifo', 'mknod', 'name', 'nice', 'open', 'openpty', 'pardir', 'path', 'pathconf', 'pathconf_names', 'pathsep', 'pipe', 'popen', 'posix_spawn', 'posix_spawnp', 'pread', 'preadv', 'putenv', 'pwrite', 'pwritev', 'read', 'readlink', 'readv', 'register_at_fork', 'remove', 'removedirs', 'rename', 'renames', 'replace', 'rmdir', 'scandir', 'sched_get_priority_max', 'sched_get_priority_min', 'sched_yield', 'sendfile', 'sep', 'set_blocking', 'set_inheritable', 'setegid', 'seteuid', 'setgid', 'setgroups', 'setpgid', 'setpgrp', 'setpriority', 'setregid', 'setreuid', 'setsid', 'setuid', 'spawnl', 'spawnle', 'spawnlp', 'spawnlpe', 'spawnv', 'spawnve', 'spawnvp', 'spawnvpe', 'st', 'stat', 'stat_result', 'statvfs', 'statvfs_result', 'strerror', 'supports_bytes_environ', 'supports_dir_fd', 'supports_effective_ids', 'supports_fd', 'supports_follow_symlinks', 'symlink', 'sync', 'sys', 'sysconf', 'sysconf_names', 'system', 'tcgetpgrp', 'tcsetpgrp', 'terminal_size', 'times', 'times_result', 'truncate', 'ttyname', 'umask', 'uname', 'uname_result', 'unlink', 'unsetenv', 'urandom', 'utime', 'wait', 'wait3', 'wait4', 'waitpid', 'waitstatus_to_exitcode', 'walk', 'write', 'writev']
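The full listing is long, and its content differs between operating systems. As a minimal sketch, the list can be narrowed down to public functions whose name contains a keyword. The helper name find_os_functions is hypothetical and only serves this illustration.

```python
import os

def find_os_functions(keyword):
    """Return the public callables in the os module whose name contains keyword."""
    return [
        name for name in dir(os)
        if keyword in name
        and not name.startswith("_")
        and callable(getattr(os, name))
    ]

# Functions related to directories, e.g. 'listdir', 'mkdir', 'makedirs', 'rmdir'
dir_functions = find_os_functions("dir")
print(dir_functions)

# Print the built-in documentation of a single function
print(os.getcwd.__doc__)
```

Constants such as os.curdir also contain the keyword but are plain strings, so the callable() check filters them out.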

Similar to \(\textbf{R}\), you can query and change the current working directory. Use os.getcwd() (get current working directory) and os.chdir() (change directory) to do so. In the example below, I change the working directory to a local folder on my drive; when you run this code, you need to adjust the path accordingly. Python raises a FileNotFoundError if you pass os.chdir() a path that does not exist.

os.getcwd()
'D:\\OneDrive - Conservation Biogeography Lab\\_TEACHING\\__Classes-Modules_HUB\\M8_Geoprocessing-with-python\\Week_02\\WS_23-24\\Assignment02'
path = "D:/OneDrive - Conservation Biogeography Lab/_TEACHING/__Classes-Modules_HUB/M8_Geoprocessing-with-python/Week_02/WS_23-24/Assignment02/"
os.chdir(path)
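Because the path above only exists on my machine, here is a minimal sketch that runs anywhere. It assumes a temporary folder created with tempfile.mkdtemp() as a stand-in for your own data folder, and it shows how the error for a non-existing path can be caught.

```python
import os
import tempfile

original_dir = os.getcwd()          # remember where we started

# tempfile.mkdtemp() creates a fresh, empty folder that is guaranteed to exist
work_dir = tempfile.mkdtemp()
os.chdir(work_dir)
print(os.getcwd())

# Changing to a folder that does not exist raises FileNotFoundError
missing_dir = os.path.join(work_dir, "does_not_exist")
try:
    os.chdir(missing_dir)
except FileNotFoundError as err:
    print("Could not change directory:", err)

os.chdir(original_dir)              # go back to the starting directory
os.rmdir(work_dir)                  # clean up the empty temporary folder
```

Storing the original directory first makes it easy to return to it once the work in the other folder is done.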

List files

To list all files and folders inside the working directory, use os.listdir(). To list the contents of a folder other than your working directory, pass its path to the function as a string.

os.listdir()
['Assignment03_part01_Files.zip',
 'geopy03_numpy_assignment.ipynb',
 'Submissions_WS2223',
 'tileID_410_y2000.tif',
 'tileID_410_y2000.tif.aux.xml',
 'tileID_410_y2005.tif',
 'tileID_410_y2005.tif.aux.xml',
 'tileID_410_y2010.tif',
 'tileID_410_y2010.tif.aux.xml',
 'tileID_410_y2015.tif',
 'tileID_410_y2015.tif.aux.xml',
 'tileID_410_y2018.tif',
 'tileID_410_y2018.tif.aux.xml',
 '_Old']
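The list returned by os.listdir() contains plain names, so it can be filtered like any other list. The following sketch assumes a small, hypothetical demo folder (created in a temporary location) and shows one way to select only the .tif files and to tell folders apart from files.

```python
import os
import tempfile

# Hypothetical demo folder with two rasters, one sidecar file and one subfolder
demo_dir = tempfile.mkdtemp()
for name in ["tile_a.tif", "tile_b.tif", "tile_a.tif.aux.xml"]:
    open(os.path.join(demo_dir, name), "w").close()
os.mkdir(os.path.join(demo_dir, "_Old"))

# os.listdir() returns the entries in arbitrary order, so we sort them
entries = sorted(os.listdir(demo_dir))

# Keep only entries ending in ".tif" (this excludes "tile_a.tif.aux.xml")
tif_files = [e for e in entries if e.lower().endswith(".tif")]

# os.listdir() returns names only, so join them with the folder before testing
folders = [e for e in entries if os.path.isdir(os.path.join(demo_dir, e))]

print(tif_files)   # ['tile_a.tif', 'tile_b.tif']
print(folders)     # ['_Old']
```

Testing the end of the name with endswith() is deliberate: a simple check such as ".tif" in name would also match the .tif.aux.xml sidecar files.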

As you can see, the directory contains individual files but also folders, which themselves may contain additional files (and folders). Looking at all of them recursively requires some additional thinking and scripting. I took the liberty of wrapping this into a small utility function that prints all files inside a path as an indented tree. The key function here is os.walk(), which recursively goes through all subdirectories and files inside a folder.

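def list_files(startpath):
    """Print the folder tree below startpath, indenting each level by four spaces."""
    # os.walk() yields one (root, dirs, files) tuple per folder it visits
    for root, dirs, files in os.walk(startpath):
        # Nesting depth = number of path separators below startpath
        level = root.replace(startpath, '').count(os.sep)
        indent = ' ' * 4 * level
        print('{}{}/'.format(indent, os.path.basename(root)))
        subindent = ' ' * 4 * (level + 1)
        for f in files:
            print('{}{}'.format(subindent, f))

list_files(path)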
/
    tileID_410_y2018.tif
    tileID_410_y2000.tif
    tileID_410_y2015.tif
    tileID_410_y2005.tif
    tileID_410_y2010.tif
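Printing a tree is useful for inspection, but for processing you usually want the matching file paths in a list. The following is a minimal sketch of that idea with os.walk(); the function name find_files and the demo folder tree are hypothetical.

```python
import os
import tempfile

def find_files(startpath, extension):
    """Return the full paths of all files below startpath that end with extension."""
    matches = []
    for root, dirs, files in os.walk(startpath):
        for f in files:
            if f.lower().endswith(extension):
                # root is the folder currently visited, f is a file name inside it
                matches.append(os.path.join(root, f))
    return sorted(matches)

# Hypothetical folder tree for demonstration: demo_dir/y2000/raw
demo_dir = tempfile.mkdtemp()
os.mkdir(os.path.join(demo_dir, "y2000"))
os.mkdir(os.path.join(demo_dir, "y2000", "raw"))
for rel in ["top.tif",
            os.path.join("y2000", "a.tif"),
            os.path.join("y2000", "notes.txt"),
            os.path.join("y2000", "raw", "b.tif")]:
    open(os.path.join(demo_dir, rel), "w").close()

tifs = find_files(demo_dir, ".tif")
for t in tifs:
    # Show the path relative to the start folder to keep the output short
    print(os.path.relpath(t, demo_dir))
```

The returned paths can be passed directly to whatever function opens the files, regardless of how deeply they are nested.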

Create and delete folders

Often, we want to add or remove files and/or folders automatically. For example, you may want to process a number of files and store the outputs in separate folders whose names indicate which process (or input files) have been used. The os module allows you to do so.

Caution!

The example below creates a subfolder inside our path because we defined path with a / at the end. Have a look at what Python does if you define the variable path without the trailing /.

os.mkdir(path + 'test')
os.listdir(path)
['Assignment03_part01_Files.zip',
 'geopy03_numpy_assignment.ipynb',
 'Submissions_WS2223',
 'test',
 'tileID_410_y2000.tif',
 'tileID_410_y2000.tif.aux.xml',
 'tileID_410_y2005.tif',
 'tileID_410_y2005.tif.aux.xml',
 'tileID_410_y2010.tif',
 'tileID_410_y2010.tif.aux.xml',
 'tileID_410_y2015.tif',
 'tileID_410_y2015.tif.aux.xml',
 'tileID_410_y2018.tif',
 'tileID_410_y2018.tif.aux.xml',
 '_Old']

If we want to delete the same folder, we can use the following command:

os.rmdir(path + 'test')
os.listdir()
['Assignment03_part01_Files.zip',
 'geopy03_numpy_assignment.ipynb',
 'Submissions_WS2223',
 'tileID_410_y2000.tif',
 'tileID_410_y2000.tif.aux.xml',
 'tileID_410_y2005.tif',
 'tileID_410_y2005.tif.aux.xml',
 'tileID_410_y2010.tif',
 'tileID_410_y2010.tif.aux.xml',
 'tileID_410_y2015.tif',
 'tileID_410_y2015.tif.aux.xml',
 'tileID_410_y2018.tif',
 'tileID_410_y2018.tif.aux.xml',
 '_Old']
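The trailing-slash issue raised in the caution above can be avoided by building paths with os.path.join() instead of string concatenation. The sketch below only manipulates strings and does not touch the file system; the example paths are hypothetical.

```python
import os

base_with_slash = "data/Assignment02/"     # hypothetical example paths
base_without_slash = "data/Assignment02"

# Plain string concatenation depends on the trailing slash
concat_a = base_with_slash + "test"
concat_b = base_without_slash + "test"
print(concat_a)    # data/Assignment02/test
print(concat_b)    # data/Assignment02test  (a sibling folder, not a subfolder)

# os.path.join() inserts a separator only where one is missing
joined_a = os.path.join(base_with_slash, "test")
joined_b = os.path.join(base_without_slash, "test")

# os.path.normpath() harmonises the separators before comparing
same_target = os.path.normpath(joined_a) == os.path.normpath(joined_b)
print(same_target)    # True
```

With os.path.join(), it no longer matters whether the base path was defined with or without the trailing separator.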

The function os.mkdir(), however, raises an error (specifically, a FileNotFoundError) when you try to create a nested directory with two (or more) levels whose parent does not exist yet. Try running the following code yourself:

os.mkdir(path + "folder/subfolder")

The reason is that os.mkdir() creates only the last directory in the path: Python looks for an existing directory called folder in which to create subfolder. Since folder does not exist at this point, it raises the error. For cases like this, we need a different function, os.makedirs(), which also creates all missing intermediate directories:

os.makedirs(path + "folder/subfolder")
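The difference between the two functions can be checked safely in a temporary folder. This is a minimal sketch, assuming a demo folder from tempfile.mkdtemp() in place of path; it also shows the exist_ok argument of os.makedirs(), which is handy when a script is run more than once.

```python
import os
import tempfile

base = tempfile.mkdtemp()                       # hypothetical, empty demo folder
nested = os.path.join(base, "folder", "subfolder")

# os.mkdir() creates one level only, so the missing parent "folder" causes an error
try:
    os.mkdir(nested)
except FileNotFoundError as err:
    print("os.mkdir failed:", err)

# os.makedirs() creates all missing intermediate folders
os.makedirs(nested)
print(os.path.isdir(nested))                    # True

# A second plain call would raise FileExistsError; exist_ok=True turns it into a no-op
os.makedirs(nested, exist_ok=True)
```

Using exist_ok=True avoids having to check with os.path.exists() before every call.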

Remove directories

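Removing directories mirrors creating them, with one important restriction: the two os functions below only remove empty directories and raise an OSError otherwise.

  1. To remove a single empty directory, use os.rmdir(yourPath).

  2. To remove an empty directory and then every parent directory that has become empty as a result, use os.removedirs(yourPath). This is the counterpart of os.makedirs().

  3. To delete a directory together with all files and subfolders inside it, use shutil.rmtree(yourPath) from the shutil module. Be careful: the deleted content does not go to the recycle bin.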
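The behaviour of the removal functions can be checked on a temporary folder tree. The sketch below assumes a hypothetical demo folder and illustrates that os.rmdir() and os.removedirs() only remove empty directories, whereas shutil.rmtree() from the standard library's shutil module deletes a folder together with its contents.

```python
import os
import shutil
import tempfile

base = tempfile.mkdtemp()                            # hypothetical demo folder
open(os.path.join(base, "keep.txt"), "w").close()    # keeps base itself non-empty

nested = os.path.join(base, "folder", "subfolder")
os.makedirs(nested)
open(os.path.join(nested, "result.txt"), "w").close()

# os.rmdir() refuses to delete a folder that still contains a file
try:
    os.rmdir(nested)
except OSError as err:
    print("os.rmdir failed:", err)

# shutil.rmtree() deletes the folder together with all files and subfolders
shutil.rmtree(os.path.join(base, "folder"))

# os.removedirs() removes the empty leaf "b", then the now-empty parent "a",
# and stops at base because base still contains keep.txt
os.makedirs(os.path.join(base, "a", "b"))
os.removedirs(os.path.join(base, "a", "b"))
remaining = os.listdir(base)
print(remaining)                                     # ['keep.txt']

shutil.rmtree(base)                                  # clean up the demo folder
```

Since shutil.rmtree() deletes everything below the given path without asking, it is worth printing the path and checking it before running the call on real data.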

More advanced commands

While the os module is great for basic low-level operations, we recommend the shutil module for higher-level operations such as copying data \(\rightarrow\) see https://docs.python.org/3/library/shutil.html. For example, the documentation describes shutil.copy(src, dst) as follows:

  • Copies the file src to the file or directory dst. src and dst should be strings. […]

import shutil
shutil.copy(path + "Assignment03_part01_Files.zip", path + "Assignment03_part01_Files_copy.zip")
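As the description says, dst can be either a file name or a directory. The following minimal sketch shows both cases on a hypothetical demo file in a temporary folder; it relies on the fact that shutil.copy() returns the path of the newly created file.

```python
import os
import shutil
import tempfile

base = tempfile.mkdtemp()                         # hypothetical demo folder
src = os.path.join(base, "input.txt")
with open(src, "w") as f:
    f.write("some data")

# Destination is a file name: the copy gets exactly that name
copy_path = shutil.copy(src, os.path.join(base, "input_copy.txt"))

# Destination is an existing folder: the copy keeps the original file name
backup_dir = os.path.join(base, "backup")
os.mkdir(backup_dir)
backup_path = shutil.copy(src, backup_dir)

print(os.path.basename(copy_path))     # input_copy.txt
print(os.path.basename(backup_path))   # input.txt
```

Copying into a folder is convenient inside loops, because the destination file names do not have to be constructed by hand.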

With these functions, you will get through week 2 of this course, including the lab assignment. That said, we have only scratched the surface of what can be done here. Feel free to explore more functions of the os module, for example by looking up some of the names that print(dir(os)) returns, or by consulting other sources. Below are some additional websites that you can check out:

  1. w3schools.com

  2. Official python documentation