Error handling

To make GROMACS behave like a proper library, we need to handle errors in a consistent and predictable way. In this section, “user” refers to the end user of GROMACS, whether via a command-line tool, a workflow, or a call to a public API. There are different types of errors, and the handling reflects this. This section is a work in progress, particularly as the broader C++ community is a long way from consensus in these areas.

Brief summary on which method to use

More detailed rules and rationale are written below, but in short, when a reason exists that code is unable to do its job:

  • If the reason can be checked at compile-time, then use static_assert.

  • If the reason is normal in context, then express that in the types used (e.g. return std::optional) and document that this is normal.

  • If the reason is that an internal invariant or pre-condition is violated (e.g. unexpected null pointer passed) on a hot code path, then use GMX_ASSERT.

  • Otherwise, if the reason is that an internal invariant or pre-condition is violated, then use GMX_RELEASE_ASSERT.

  • Otherwise (typically an error returned from a system call or GPU SDK, or bad user input), use GMX_THROW.
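
The choices above can be sketched in a small, self-contained example. This is only an illustration under stated assumptions: the names Box, findAtom, firstCharge and parseAtomCount are hypothetical, and because the sketch does not include GROMACS headers, plain assert and std::invalid_argument stand in for GMX_ASSERT / GMX_RELEASE_ASSERT and GMX_THROW.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Reason can be checked at compile time: use static_assert.
// (Box is a hypothetical type, used only for illustration.)
template<int numDimensions>
struct Box
{
    static_assert(numDimensions == 2 || numDimensions == 3,
                  "Box supports only 2 or 3 dimensions");
    double lengths[numDimensions] = {};
};

// Reason is normal in context: express it in the return type and document it.
// Returns the index of the first atom called `name`, or std::nullopt when
// no such atom exists (which is not an error).
std::optional<std::size_t> findAtom(const std::vector<std::string>& names,
                                    const std::string&              name)
{
    for (std::size_t i = 0; i < names.size(); ++i)
    {
        if (names[i] == name)
        {
            return i;
        }
    }
    return std::nullopt;
}

// Violated pre-condition: assert. Real GROMACS code would use GMX_ASSERT on a
// hot path and GMX_RELEASE_ASSERT elsewhere; plain assert is a stand-in here.
double firstCharge(const std::vector<double>& charges)
{
    assert(!charges.empty() && "Caller must pass at least one charge");
    return charges.front();
}

// Bad user input: throw. Real GROMACS code would use GMX_THROW with a type
// from exceptions.h; std::invalid_argument is a stand-in here.
int parseAtomCount(const std::string& text)
{
    std::size_t numCharsParsed = 0;
    int         value          = 0;
    try
    {
        value = std::stoi(text, &numCharsParsed);
    }
    catch (const std::logic_error&)
    {
        // std::stoi reports both unparsable and out-of-range input this way
        throw std::invalid_argument("Atom count is not a valid integer: '" + text + "'");
    }
    if (numCharsParsed != text.size() || value < 0)
    {
        throw std::invalid_argument("Atom count must be a non-negative integer: '" + text + "'");
    }
    return value;
}
```

A caller of findAtom handles the empty result through ordinary control flow, while a caller of parseAtomCount is expected to catch the exception at a level where a useful diagnostic can be produced.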

Guiding principles

  • GROMACS should adopt approaches that have achieved consensus elsewhere, e.g. in the C++ Core Guidelines. In particular, be guided by its section on error handling.

  • The library should not print out anything to stdio/stderr unless it is part of the API specification, and even then, there should be a way for the user to suppress or redirect the output.

  • The library should normally not terminate the program without the user having control over this.

  • Design interfaces of functions, classes, modules, and libraries so that values passed at run time are valid. Pass const references or not_null pointers rather than raw pointers. Return objects where possible. Use e.g. enum classes for the type of passed values. Consider using such enums as template parameters, rather than passing run-time values. Refactor existing interfaces to improve such aspects when starting new work in an area.

  • Check user input at API boundaries and establish invariants as soon as possible, e.g. by expressing the user’s choice in the type system. These form the pre-conditions that error handling will rely on.

  • Use assertions to validate invariants and pre-conditions. There is value in using a different technique for checking such violations in order to make the reason for the check clear to the maintainer.
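
As a minimal sketch of these principles, the following example validates user input once at a boundary, records the result in the type system, and lets internal code rely on it. All names (Integrator, parseIntegrator, TimeStep, simulatedTime, integratorName) are hypothetical and chosen for illustration; std::invalid_argument and plain assert stand in for GMX_THROW and the GROMACS assertion macros.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// The user's choice expressed as a type, so that internal code cannot
// receive an unknown value.
enum class Integrator
{
    LeapFrog,
    VelocityVerlet
};

// API boundary: the assigned task of this function is validation, so it
// expresses failure by throwing.
Integrator parseIntegrator(const std::string& userValue)
{
    if (userValue == "md")
    {
        return Integrator::LeapFrog;
    }
    if (userValue == "md-vv")
    {
        return Integrator::VelocityVerlet;
    }
    throw std::invalid_argument("Unknown integrator: '" + userValue + "'");
}

// A type whose constructor establishes an invariant (a positive time step).
// Any TimeStep object that exists is therefore known to be valid.
class TimeStep
{
public:
    explicit TimeStep(double picoseconds) : value_(picoseconds)
    {
        // Written as !(x > 0) so that NaN is rejected as well
        if (!(picoseconds > 0.0))
        {
            throw std::invalid_argument("Time step must be positive");
        }
    }
    double value() const { return value_; }

private:
    double value_;
};

// Internal code takes a const reference to an already-valid object, so it
// needs no null check and no range check on the time step. The remaining
// pre-condition is the caller's responsibility and is asserted.
double simulatedTime(const TimeStep& timeStep, int numSteps)
{
    assert(numSteps >= 0 && "Caller must pass a non-negative number of steps");
    return timeStep.value() * numSteps;
}

// The enum used as a template parameter instead of a run-time value.
template<Integrator integrator>
constexpr const char* integratorName()
{
    return integrator == Integrator::LeapFrog ? "leap-frog" : "velocity Verlet";
}
```

Only parseIntegrator and the TimeStep constructor inspect raw user input; everything downstream relies on that validation as a pre-condition rather than repeating it.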

Specific rules

  • Use static_assert wherever possible to detect errors at compile time.

  • Throw exceptions to indicate that a function cannot do its assigned task, per the C++ Core Guidelines E.2. In particular, constructors should throw when they cannot construct a valid object, per C++ Core Guidelines C.42. However, recognize that in some cases the underlying reason is that some other component has not set up the correct pre-condition, and such cases should be handled with assertions (see below).

  • At API boundaries, the assigned task of some code will be to validate the input, and that code should express failure to validate by throwing.

  • Many programming errors violate pre-conditions of other functions. Until there is language support for contracts, the best that can be done is to check these with assertions. Note that only one component should have the responsibility for validating any particular input from the user, and other components should rely upon that validation in their pre-conditions.

  • When asserting, use GMX_RELEASE_ASSERT by default. This macro will run its check in all build configurations, including Release.

  • When asserting in cases where the code is called in an inner loop of e.g. the MD step, GMX_ASSERT can be used. This macro will run its check only when NDEBUG is not defined, including the RelWithAssert build configuration (which is the default build type used in CI).

  • It can be appropriate to provide both checked and unchecked interfaces, as std::vector does with at() and operator[], respectively. Note that even the latter is checked if you build e.g. libstdc++ in the right configuration!

  • When calling low-level APIs (including C and C++ standard libraries, GPU SDKs), always check for success/failure. Generally the correct thing to do upon failure will be to throw, perhaps including a descriptive string obtained from an error code with another API call.

  • Do catch exceptions from lower-level components, such as memory or file system I/O errors. As a general guideline, incorrect user input should not produce an untrapped exception that terminates execution and merely tells the user that an exception occurred. Instead, you should catch exceptions in an earlier stack frame, make a suitable decision about diagnostic messages, and then decide whether execution should be terminated (if that is in the scope of the code making the decision) and, if so, how to terminate.

  • There is a global list of possible exceptions in api/legacy/include/gromacs/utility/exceptions.h, and the library should throw one of these when it fails, possibly providing a more detailed description of the reason for the failure. The types of exceptions can be extended, and currently include:

    • Out of memory (e.g. std::bad_alloc)

    • File I/O error (e.g. not found)

    • Invalid user input (could not be understood)

    • Inconsistent user input (parsed correctly, but has internal conflicts)

    • Simulation instability

    • Invalid API call/value/internal error (an assertion might also be used in such cases)

  • In the internals of a module called from code that is not exception safe, you can use exceptions for error handling, but avoid propagating them to caller code.

  • Avoid using exceptions to propagate errors across regions that start or join threads with OpenMP, since OpenMP cannot make guarantees about whether exceptions are caught or if the program will crash. Currently we catch all exceptions before we leave an OpenMP threaded region. If you throw an exception, make sure that it is caught and handled appropriately in the same thread/OpenMP section.

  • Avoid using exceptions to propagate errors within regions where non-blocking API calls (e.g. to MPI or GPU SDKs) have been made, because the possible advantage of catching at a higher level and continuing execution is absent when the partner in the API call may be left blocked.

  • There are also cases where a library routine wants to report a warning or a non-fatal error, but is still able to continue processing. In this case you should try to collect all issues and report them together (similar to what grompp does with notes, warnings and errors) instead of just returning the first error. It is irritating to users if they fix the reported error, but then keep getting a new error message every time they rerun the program.

  • A function should not fail as part of its normal operation. However, doing nothing can be considered normal operation. A function accessing data should typically also be callable when no such data is available, but still return through normal means. If the failure is not normal, it is acceptable to throw an exception instead.

  • Error handling with gmx_fatal, gmx_warning, gmx_incons, gmx_comm etc. is deprecated and should generally be refactored to throw or assert according to the above guidelines.

  • There is currently no attempt made to check for error states on other MPI ranks during the simulation and provide a coordinated recovery. However, setup code should do such checks routinely.

  • We use GMX_RELEASE_ASSERT and GMX_ASSERT rather than assert to ensure that non-immediate strings can be used to describe the problem when the error is reported. This is particularly useful when troubleshooting issues where missing test coverage leads users to uncover such errors.
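
Three of the rules above (checking low-level API calls, catching in an earlier stack frame, and collecting all issues before reporting) can be sketched as follows. This is an illustration under stated assumptions rather than GROMACS code: FileIOError is a hypothetical stand-in for a file I/O exception type from exceptions.h, and openForReading, runTool, RunResult and checkSettings, along with the three settings being checked, are invented for the example.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a file I/O exception type.
class FileIOError : public std::runtime_error
{
public:
    using std::runtime_error::runtime_error;
};

// Closes the file when the owning pointer goes out of scope.
struct FileCloser
{
    void operator()(std::FILE* file) const { std::fclose(file); }
};
using FilePointer = std::unique_ptr<std::FILE, FileCloser>;

// Checks the result of a C library call and, on failure, throws with a
// descriptive string obtained from the error code.
FilePointer openForReading(const std::string& path)
{
    errno            = 0;
    std::FILE* file  = std::fopen(path.c_str(), "r");
    if (file == nullptr)
    {
        const int savedErrno = errno; // save before any other call can change it
        throw FileIOError("Could not open '" + path + "': " + std::strerror(savedErrno));
    }
    return FilePointer(file);
}

struct RunResult
{
    int         exitCode;
    std::string diagnostic;
};

// An earlier stack frame catches the exception, decides on the diagnostic,
// and leaves the decision about termination to its own caller.
RunResult runTool(const std::string& inputPath)
{
    try
    {
        FilePointer input = openForReading(inputPath);
        return { 0, "" };
    }
    catch (const FileIOError& error)
    {
        return { 1, error.what() };
    }
}

// Collects every problem instead of stopping at the first one, so the user
// can fix them all before rerunning.
std::vector<std::string> checkSettings(double timeStep, double temperature, double cutoff)
{
    std::vector<std::string> issues;
    if (!(timeStep > 0.0))
    {
        issues.emplace_back("Time step must be positive");
    }
    if (temperature < 0.0)
    {
        issues.emplace_back("Temperature must not be negative");
    }
    if (!(cutoff > 0.0))
    {
        issues.emplace_back("Cut-off must be positive");
    }
    return issues;
}
```

A caller would typically report every entry returned by checkSettings in one message and only then decide whether to throw or to continue.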

For coding guidelines to make this all work, see Implementing exceptions for error handling.