Useful mdrun features

This section discusses features in gmx mdrun that don’t fit well elsewhere.

Re-running a simulation

The rerun feature allows you to take any trajectory file traj.trr and compute quantities based upon the coordinates in that file using the model physics supplied in the topol.tpr file. It can be used with command lines like mdrun -s topol -rerun traj.trr. That tpr file could be different from the one that generated the trajectory. This can be used to compute the energy or forces for exactly the coordinates supplied as input, or to extract quantities based on subsets of the molecular system (see gmx convert-tpr and gmx trjconv). It is easier to do a correct “single-point” energy evaluation with this feature than with a 0-step simulation.
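
For example, a single-point re-evaluation of energies over an existing trajectory, followed by extraction of the potential energy, might look like the following (the file names are the defaults and purely illustrative):

gmx mdrun -s topol.tpr -rerun traj.trr

gmx energy -f ener.edr -o potential.xvg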

Neighbor searching is performed for every frame in the trajectory independently of the value in nstlist, since gmx mdrun can no longer assume anything about how the structures were generated. Naturally, no update or constraint algorithms are ever used.

The rerun feature cannot, in general, compute many of the quantities reported during full simulations. It takes only positions as input (ignoring any velocities present in the trajectory), and reports only potential energies, volume and density, dH/dl terms, and restraint information. Notably, it does not report kinetic, total, or conserved energies, temperature, virial, or pressure.

Running a simulation in reproducible mode

It is generally difficult to run an efficient parallel MD simulation that is based primarily on floating-point arithmetic and is fully reproducible. By default, gmx mdrun will observe how things are going and vary how the simulation is conducted in order to optimize throughput. However, there is a “reproducible mode” available with mdrun -reprod that will systematically eliminate all sources of variation within that run; repeated invocations on the same input and hardware will be binary identical. Note that running in this mode on different hardware, or with a different compiler, etc. will not be reproducible. This mode should normally only be used when investigating possible problems.
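
For example, a deterministic repeat of a run on the same machine could be launched like this (the tpr file name is illustrative):

gmx mdrun -s topol -reprod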

Halting running simulations

When gmx mdrun receives a TERM or INT signal (e.g. when ctrl+C is pressed), it will stop at the next neighbor search step or at the second global communication step, whichever happens later. When gmx mdrun receives a second TERM or INT signal and reproducibility is not requested, it will stop at the first global communication step. In both cases all the usual output will be written to file and a checkpoint file is written at the last step. When gmx mdrun receives an ABRT signal or the third TERM or INT signal, it will abort directly without writing a new checkpoint file. When running with MPI, a signal to one of the gmx mdrun ranks is sufficient; this signal should not be sent to mpirun or to the gmx mdrun process that is the parent of the others.
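
For a run in the background or under a batch system, an equivalent clean stop can be requested by sending the signal to one gmx mdrun rank by PID (12345 below is a placeholder for an actual rank PID; do not signal mpirun or the parent gmx mdrun process):

kill -TERM 12345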

Running multi-simulations

There are numerous situations where running a related set of simulations within the same invocation of mdrun is necessary or useful. Running a replica-exchange simulation requires it, as do simulations using ensemble-based distance or orientation restraints. Running a related series of lambda points for a free-energy computation is also convenient to do this way, but beware of the potential side-effects related to resource utilization and load balance discussed later.

This feature requires configuring GROMACS with an external MPI library so that the set of simulations can communicate. The n simulations within the set can also use internal MPI parallelism, so that mpirun -np x gmx_mpi mdrun with x a multiple of n will use x/n ranks per simulation.

To launch a multi-simulation, the -multidir option is used. The input and output files of a multi-simulation live in a set of n subdirectories, one for each simulation. Place all the relevant input files in those directories (e.g. named topol.tpr), and launch a multi-simulation with mpirun -np x gmx_mpi mdrun -s topol -multidir <names-of-directories>. If the order of the simulations within the multi-simulation is significant, you are responsible for ordering their names when you provide them to -multidir. Be careful with shells that do filename globbing dictionary-style, e.g. dir1 dir10 dir11 ... dir2 ....
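
If the directories are numbered, one way to avoid that globbing pitfall in bash is to expand the names explicitly in numeric order rather than relying on a wildcard (directory names and rank count here are illustrative):

mpirun -np 48 gmx_mpi mdrun -s topol -multidir dir{1..12}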

Examples running multi-simulations

mpirun -np 32 gmx_mpi mdrun -multidir a b c d

Starts a multi-simulation on 32 ranks with 4 simulations. The input and output files are found in directories a, b, c, and d.

mpirun -np 32 gmx_mpi mdrun -multidir a b c d -gputasks 0000000011111111

Starts the same multi-simulation as before. On a machine with two physical nodes and two GPUs per node, there will be 16 MPI ranks per node, and 8 MPI ranks per simulation. The 16 MPI ranks doing PP work on a node are mapped to the GPUs with IDs 0 and 1, even though they come from more than one simulation. They are mapped in the order indicated, so that the PP ranks from each simulation use a single GPU. However, the order 0101010101010101 could run faster.
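
That alternative mapping, with the same directories and rank count as above, would be requested like this:

mpirun -np 32 gmx_mpi mdrun -multidir a b c d -gputasks 0101010101010101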

Running replica-exchange simulations

When running a multi-simulation, using gmx mdrun -replex n means that a replica exchange is attempted every n steps. The number of replicas is set with the -multidir option, described above. All run input files should use a different value for the coupling parameter (e.g. temperature), which ascends over the set of input files. The random seed for replica exchange is set with -reseed. After every exchange, the velocities are scaled and neighbor searching is performed. See the Reference Manual for more details on how replica exchange functions in GROMACS.
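
For example, a four-replica temperature replica-exchange run, attempting exchanges every 1000 steps, might be launched like this (the exchange interval and random seed are illustrative values):

mpirun -np 32 gmx_mpi mdrun -multidir a b c d -replex 1000 -reseed 7393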

Multi-simulation performance considerations

The frequency of communication across a multi-simulation can have an impact on performance. This is highly algorithm dependent, but in general it is recommended to set up a multi-simulation to do inter-simulation communication as infrequently as possible, but as frequently as necessary.

However, even when the members of a multi-simulation do not communicate frequently (or at all), and the associated communication overhead is therefore small or even negligible, load imbalance can still have a significant impact on performance and resource utilization. Current multi-simulation algorithms use a fixed interval for data exchange (e.g. replica exchange every N steps), so all members of a multi-simulation need to reach this step before the collective communication can happen and any of them can proceed to step N+1. Hence, the slowest member of the multi-simulation determines the performance of the entire ensemble. This load imbalance not only limits performance but also leaves resources idle; e.g. if one of the simulations in an n-way multi-simulation runs at half the performance of the rest, the resources assigned to the n-1 faster-running simulations will be left idle for approximately half of the wall-time of the entire multi-simulation job.

The source of this imbalance can range from inherent workload imbalance across the simulations within a multi-simulation to differences in hardware speed or inter-node network performance variability affecting a subset of ranks, and therefore only some of the simulations. Reducing the amount of resources left idle requires reducing the load imbalance, which may involve splitting up non-communicating multi-simulations, or making sure to request a “compact” allocation on a cluster (if the job scheduler allows). Note that imbalance also applies to non-communicating multi-simulations like FEP calculations, since the resources assigned to earlier-finishing simulations cannot be relinquished until the entire MPI job finishes.

Controlling the length of the simulation

Normally, the length of an MD simulation is best managed through the mdp option nsteps; however, there are situations where more control is useful. gmx mdrun -nsteps 100 overrides the mdp file and executes 100 steps. gmx mdrun -maxh 2.5 will terminate the simulation shortly before 2.5 hours elapse, which can be useful when running under cluster queues (as long as the queuing system never suspends the simulation).
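
For example, a job script running under a queue with a three-hour wall-time limit might combine a time limit with restarting from the checkpoint file, so that successive submissions continue the same run (the file names shown are the defaults and purely illustrative):

gmx mdrun -s topol -cpi state -maxh 2.5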