Python's import system

2020-05-02

Introduction

If you’ve written Python for more than two or three days (heck, it didn’t even take me that long), you’ve probably run into import-related errors such as ‘ImportError’ or ‘ModuleNotFoundError’. This post is my attempt to share everything I’ve learned about the import machinery.

What happens when you do ‘import spam’/ ‘from spam import ham’

Python has one of the easiest abstractions for its packages. You can think of a package as a directory on the file system, and of modules as files within that directory. In something like ‘from spam.components import ham’, ‘components’ is a submodule of ‘spam’: there is a directory named spam, and components is a file inside it. In ASCII, it looks like this.

spam/
└── components.py

Any variable, function, or class ‘x’ defined inside components.py can then be accessed with ‘from spam.components import x’.

Before looking at what goes on inside, let’s go over some terminology.
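To make the layout concrete, here is a minimal sketch that builds it in a throwaway directory and imports from it. The temporary directory and the value of ‘x’ are my own assumptions for illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build spam/components.py in a temporary directory (illustrative only).
root = Path(tempfile.mkdtemp())
(root / "spam").mkdir()
(root / "spam" / "components.py").write_text("x = 42\n")

# Make the directory that contains spam/ searchable, then import.
sys.path.insert(0, str(root))
importlib.invalidate_caches()

from spam.components import x

print(x)  # 42
```

Note that it is the parent of ‘spam/’ that goes on the search path, not ‘spam/’ itself.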

sys.modules

The sys module is built-in. For now, treat it as black magic; once you understand what follows, you’ll understand how built-in modules are imported as well.

sys.modules is a dictionary that stores a reference to every module loaded during execution. When a file is executed, that includes every module it imports, directly or ‘recursively’. For example:

If file_main.py has

import lib1

and lib1 internally imports lib2 and lib3, sys.modules stores a single reference to each of these modules. This makes sense: without a single shared reference there would be conflicts when modules are ‘reloaded’ at runtime. (Yes, imported modules can be refreshed at runtime.) A use case for ‘reloading’: if you deal with configuration files that are updated at runtime, you’d want to refresh this reference. Here’s a link where modules are refreshed at runtime (oh, and load_source internally reloads the module if it is already defined).
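The caching and reloading behaviour can be sketched as follows. The module name ‘demo_config’ and its ‘TIMEOUT’ value are hypothetical, and I use ‘importlib.reload’ here rather than load_source.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Avoid a stale bytecode cache when the source changes within the same second.
sys.dont_write_bytecode = True

# A hypothetical config module written to a temporary directory.
root = Path(tempfile.mkdtemp())
config_file = root / "demo_config.py"
config_file.write_text("TIMEOUT = 10\n")
sys.path.insert(0, str(root))
importlib.invalidate_caches()

import demo_config
first = sys.modules["demo_config"]

import demo_config  # not executed again, served from sys.modules
same_object = sys.modules["demo_config"] is first
print(same_object, demo_config.TIMEOUT)  # True 10

# The file changes at runtime; reload re-executes it in the same module object.
config_file.write_text("TIMEOUT = 30\n")
importlib.reload(demo_config)
print(demo_config is first, demo_config.TIMEOUT)  # True 30
```

Because the reload happens in place, every other module holding a reference to ‘demo_config’ sees the new value too.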

$PYTHONPATH is an environment variable holding a list of directories to search for packages; Python adds them to ‘sys.path’ at startup. Imagine the Python runtime searching these directories when you import a package (though a lot more goes on, as explained in a later section).

If your package is stored at, for instance, ‘/home/user/pythonlib1/’, then $PYTHONPATH should contain ‘/home/user/’. When you ‘import pythonlib1’, Python then searches a list of paths (which includes ‘/home/user’) for ‘pythonlib1’.
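One way to see the effect of $PYTHONPATH is to start a child interpreter with the variable set and look at its ‘sys.path’. This is only a sketch; the temporary directory stands in for something like ‘/home/user/’.

```python
import os
import subprocess
import sys
import tempfile

# A stand-in for a directory such as /home/user/ (illustrative only).
extra_dir = tempfile.mkdtemp()

# Run a child interpreter with PYTHONPATH set and print its sys.path.
env = dict(os.environ, PYTHONPATH=extra_dir)
result = subprocess.run(
    [sys.executable, "-c", "import sys; print('\\n'.join(sys.path))"],
    env=env,
    capture_output=True,
    text=True,
    check=True,
)
child_sys_path = result.stdout.splitlines()

print(extra_dir in child_sys_path)  # True
```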

Regular and namespace packages

Regular packages are the ones we generally work with, with a typical directory structure as follows.

app/
├── __init__.py
├── eggs
│ ├── __init__.py
│ └── egg_func.py
└── ham.py

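Since Python 3.3 (PEP 420), ‘__init__.py’ is no longer required for a directory to be importable: a directory without one is treated as a namespace package. The file is not deprecated, though. It is what makes a directory a regular package, and it is executed when the package is first imported. Older layouts keep working unchanged, and the next part shows why the import system can get by without it.

Note that a regular package ‘app’ can be loaded from only ‘one’ path. E.g. if you have two ‘app’s on the search path, the module finder stops at the first ‘app’ it encounters.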
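The ‘first match wins’ rule can be checked without importing anything, by asking the path-based finder directly. In this sketch the directory names ‘first’ and ‘second’ and the ‘WHERE’ variable are hypothetical.

```python
import tempfile
from importlib.machinery import PathFinder
from pathlib import Path


def make_regular_package(parent: Path, marker: str) -> None:
    """Create parent/app/__init__.py so that 'app' is a regular package."""
    pkg = parent / "app"
    pkg.mkdir(parents=True)
    (pkg / "__init__.py").write_text(f"WHERE = {marker!r}\n")


base = Path(tempfile.mkdtemp())
first, second = base / "first", base / "second"
make_regular_package(first, "first")
make_regular_package(second, "second")

# Search both directories, in order, for a package named 'app'.
spec = PathFinder.find_spec("app", [str(first), str(second)])

# The spec points at the copy in the first directory; the second is never used.
print(spec.origin)
```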

Namespace packages are defined in the docs as follows

“A namespace package is a composite of various portions, where each portion contributes a subpackage to the parent package.”

Namespace packages can support the following directory structure

/location1/app/
└── ham.py
/location2/app/
└── eggs
    └── egg_func.py

Let’s look into what happens internally when this layout is imported.

If location1 and location2 are both in $PYTHONPATH, the finders register both directories as portions of ‘app’, and one can access both ham and eggs using ‘from app import ham’ and ‘from app import eggs’.
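Here is a small sketch of that layout in action. The two locations live in a temporary directory, are added to ‘sys.path’ directly instead of through $PYTHONPATH, and the file contents are made up for the demo.

```python
import importlib
import sys
import tempfile
from pathlib import Path

base = Path(tempfile.mkdtemp())
loc1, loc2 = base / "location1", base / "location2"

# Two portions of 'app', neither of which has an __init__.py.
(loc1 / "app").mkdir(parents=True)
(loc1 / "app" / "ham.py").write_text("NAME = 'ham'\n")
(loc2 / "app" / "eggs").mkdir(parents=True)
(loc2 / "app" / "eggs" / "egg_func.py").write_text(
    "def egg_func():\n    return 'eggs'\n"
)

sys.path[:0] = [str(loc1), str(loc2)]
importlib.invalidate_caches()

import app
from app import ham
from app.eggs import egg_func  # egg_func is the module app.eggs.egg_func

# The namespace package's search path lists both portions.
print(list(app.__path__))
print(ham.NAME, egg_func.egg_func())  # ham eggs
```

This only works as long as no regular package named ‘app’ is found anywhere on the search path, since a regular package takes precedence over namespace portions.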

Now that that’s settled, let’s look into what happens when you import a package.

To find and load a package, the following steps are followed.

Once the full name of the package is known (say ‘spam.ham’ is imported), sys.modules is checked for this name. If there is a valid (not None) entry, the search completes and the cached module is returned. If sys.modules[“spam.ham”] is None, ModuleNotFoundError is raised.
If there is no record of “spam.ham” in sys.modules, Python’s “import protocol” is fired.
The import protocol has two main components.
- Finders are responsible for locating a package and its subpackages. In simple words, a finder tells whether it is capable of handling a particular import. For instance, there is a finder for built-in modules (like sys), one for frozen bytecode modules, and a default path-based one that searches the directories in ‘sys.path’ (which include those listed in $PYTHONPATH).
- Loaders are responsible for loading found packages.
The import protocol starts by iterating over ‘sys.meta_path’, which is a list of meta path finder objects. Meta path finders are objects with a ‘find_spec’ method. Each meta path finder is tried in turn, and its ‘find_spec’ method is called with the parameters ‘name’, ‘path’ (the parent package’s search path) and ‘target’. ‘import spam.ham’ results in find_spec(‘spam’, None, None) and then find_spec(‘spam.ham’, spam.__path__, None), and so on until the final submodule is reached.
If a finder can’t locate the requested module, it returns None and the next meta path finder is queried. If every finder returns None, ModuleNotFoundError is raised.
If the module is ‘found’ by meta_path finders, a spec is returned. A spec is an object with a loader attribute.
Loaders execute the module’s code in the module’s own namespace; a loader raises ImportError if it cannot load the module, and any other exception raised by the module’s code propagates to the caller. Note that before the loader executes the module, sys.modules[“new_module”] already references an initialized but unexecuted module object. This lets Python guard against recursive (circular) imports turning into infinite loops.
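The first steps above can be poked at from the interpreter. This is a rough sketch, not the real implementation: ‘blocked_demo’ is a made-up module name and ‘first_finder_for’ is a helper I wrote for the demo.

```python
import sys

# Step 1: the cache. A None entry in sys.modules halts the import outright.
sys.modules["blocked_demo"] = None  # hypothetical module name
try:
    import blocked_demo
except ModuleNotFoundError as exc:
    blocked_error = exc
finally:
    del sys.modules["blocked_demo"]
print(type(blocked_error).__name__)  # ModuleNotFoundError


# Step 2: ask each meta path finder in order until one returns a spec.
def first_finder_for(name, path=None):
    for finder in sys.meta_path:
        find_spec = getattr(finder, "find_spec", None)
        if find_spec is None:
            continue
        spec = find_spec(name, path, None)
        if spec is not None:
            return finder, spec
    return None, None


# A top-level built-in module: path is None.
sys_finder, sys_spec = first_finder_for("sys")
print(sys_finder, sys_spec.origin)

# A submodule: path is the parent package's __path__.
import email
mime_finder, mime_spec = first_finder_for("email.mime", email.__path__)
print(mime_finder, mime_spec.loader)
```

Calling ‘find_spec’ by hand only locates the module; nothing is executed until a loader runs it.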

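To make Python go through all these steps for a non-default use case, we need to add a custom finder to sys.meta_path. The demo code below warns when a specific module is imported, and should be largely self-explanatory after this discussion. The sample is taken unabridged from https://xion.org.pl/2012/05/06/hacking-python-imports/ and uses the legacy ‘find_module’/‘load_module’ hooks and the ‘imp’ module, so it only runs on older Python versions (‘imp’ was removed in Python 3.12). It also replaces sys.meta_path wholesale, which is harmless on Python 2, where the list is empty by default, but on Python 3 it would discard the standard finders; use ‘sys.meta_path.insert(0, ...)’ there instead.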

import logging
import imp
import sys

class WarnOnImport(object):
    def __init__(self, *args):
        self.module_names = args

    def find_module(self, fullname, path=None):
        if fullname in self.module_names:
            self.path = path
            return self
        return None

    def load_module(self, name):
        if name in sys.modules:
            return sys.modules[name]
        module_info = imp.find_module(name, self.path)
        module = imp.load_module(name, *module_info)
        sys.modules[name] = module

        logging.warning("Imported deprecated module %s", name)
        return module

sys.meta_path = [WarnOnImport('getopt', 'optparse', # etc.
                             )]

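‘find_spec’ and ‘exec_module’ (with the optional ‘create_module’) are the 3.4+ replacements for ‘find_module’ and ‘load_module’. For a long time the docs promised a fallback: if find_spec was undefined, find_module was used, with a similar story for load_module. The find_module fallback was removed in Python 3.12, so new finders should implement find_spec.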
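For comparison, here is how the same idea might look with ‘find_spec’. This is my own sketch, not the code from the linked post: the finder only logs and returns None, so the standard finders still do the actual loading, and the ‘seen’ list is an addition of mine to make the behaviour easy to inspect.

```python
import logging
import sys
from importlib.abc import MetaPathFinder


class WarnOnImport(MetaPathFinder):
    """Log a warning when one of the given modules is imported."""

    def __init__(self, *module_names):
        self.module_names = set(module_names)
        self.seen = []  # names we warned about (added for this demo)

    def find_spec(self, fullname, path=None, target=None):
        if fullname in self.module_names:
            self.seen.append(fullname)
            logging.warning("Imported deprecated module %s", fullname)
        # Never claim the module; let the regular finders locate it.
        return None


finder = WarnOnImport("getopt", "optparse")
sys.meta_path.insert(0, finder)  # keep the standard finders in place

# Drop any cached copy so this import actually reaches the finders.
sys.modules.pop("getopt", None)
import getopt
```

Like the original, this warns only when the module is not already cached in sys.modules, since cached imports never reach the finders.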

PEP 302

The above method of injecting code into the import system is old: it dates back to 2002, started with PEP 302, and is called ‘import hooks’. Why did I decide to write about an old detail like that? Because even today, I see friends and acquaintances make mistakes or spend time debugging path-related errors.

Prospects

There are crazy things one can do with import hooks. Think about it: if you want to do ‘anything’ after someone runs ‘import x’, you decide what to load sys.modules[“x”] with. A parsed CSV, a totally different module, files fetched from a server after authentication, or files compressed in formats other than .gz.
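As a taste of what that looks like, here is a minimal sketch of a combined finder and loader that serves modules from a dictionary of source strings instead of from disk. ‘DictImporter’ and ‘virtual_ham’ are hypothetical names; a real hook would fetch or build the source some other way.

```python
import importlib.abc
import importlib.util
import sys


class DictImporter(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Hypothetical finder and loader backed by an in-memory dict of sources."""

    def __init__(self, sources):
        self.sources = sources

    def find_spec(self, fullname, path=None, target=None):
        if fullname not in self.sources:
            return None  # not ours, let the next finder try
        return importlib.util.spec_from_loader(fullname, self, origin="in-memory")

    def create_module(self, spec):
        return None  # fall back to the default module creation

    def exec_module(self, module):
        # Run the stored source inside the new module's namespace.
        source = self.sources[module.__name__]
        code = compile(source, f"<in-memory {module.__name__}>", "exec")
        exec(code, module.__dict__)


sources = {"virtual_ham": "def greet():\n    return 'hello from memory'\n"}
sys.meta_path.append(DictImporter(sources))

import virtual_ham

print(virtual_ham.greet())  # hello from memory
```

Appending to the end of sys.meta_path means the hook is consulted only after the standard finders have failed to locate the name.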

References

I hope you now have a better understanding of the import system. My next attempt will be to create a finder that imports a Python package via HTTP behind an authwall, for a use case where the source code is private and can only be accessed via tokens in .env at runtime.

Update: Please refer to this repository to find source code of finders and loaders which load a package over HTTP.