#33907 05/13/14 05:33 PM
slaw (OP), Forum Member
Joined: Sep 2006, Posts: 146
Dear CHARMM Community,

I've been thinking about this for a while and am curious why it is important/necessary to store high-precision trajectories (beyond 3 decimal places); I am referring to atomic coordinates here, not velocities. From what I understand, CHARMM stores the coordinates for each atom in a binary file as floats (not doubles). However, considering that the starting structure for a simulation is usually a PDB file, any quantity derived from it (e.g., geometric properties such as distances, angles, dihedrals, etc.) should only be as precise as the PDB file itself (3 decimal places). I can see where the energy might be "off" if the precision is truncated to 3 decimal places, but can somebody offer further insight on why else it would be important/necessary?


Last edited by slaw; 05/13/14 05:34 PM.
Forum Member
Joined: Dec 2005, Posts: 1,535
  • This seems more like a question for the "General Chemistry Discussions" forum.
  • The experimental precision of the information in a PDB file is usually far, far lower than those 3 decimal places. The designers of the PDB format played it safe.
  • Conversely, the errors on the coordinates in a minimized structure can technically get arbitrarily low, depending on the convergence criteria. Of course, one can question the real-life implications of that, but it's still valuable for e.g. reproducibility testing.
  • Moderately disordered side chains may occasionally be fit into the density incorrectly altogether, as is quite often the case for drug-like molecules (because the density-fitting software is more specialized/constrained for proteins). Simulating for a while tends to fix that (at least in the case of the side chains). So it is fundamentally incorrect to assume a simulation cannot achieve higher precision than the X-ray structure it started from. Just think of it: the hydrogens are usually not resolved in the PDB, while they are in the simulation...
  • If you have a thermalized structure that by chance happens to have an atom pair far up an L-J repulsive wall, 0.001 Å can make a significant difference in energy and force; for simulation purposes, I'd imagine this level of precision would be bad for conservation of energy (or at the very least "pushing it"). Accordingly, there may be trajectory analysis jobs for which this precision would be too low; I'm thinking along the lines of the perturbative approach we're using to recalculate some bulk-phase properties for a perturbation in the force field parameters without rerunning the simulations. And for the potential energy scans we're using to fit the dihedrals in our force field, using PDB as an intermediary format does introduce some noise into the fitting problem (though it's typically not catastrophic). A small numerical illustration follows this list.
  • Last but not least, it's hard to come up with a binary storage format that is more practical than a 32-bit (= 4 byte) single-precision float. One could use a 16-bit fixed-point representation, but then the coordinates cannot lie outside the -32.768 Å to 32.767 Å range, which is too small for many simulation systems. 32-bit fixed point would cure that, but there's no space saving compared to single precision. Of course, you could propose "odd" widths, like 24-bit fixed point, but that's quite a nightmare for the people who write software that deals with the format, and it's still only a 25% saving...
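A small numerical illustration of the L-J point, in Python. The ε and σ values below are hypothetical, generic values chosen for illustration only, not taken from any particular force field; the point is simply that high on the repulsive wall a 0.001 Å displacement produces a non-trivial change in the pair energy.

```python
# Generic 12-6 Lennard-Jones pair energy; eps/sigma are illustrative values only.
eps, sigma = 0.10, 3.5            # kcal/mol, Angstrom (hypothetical parameters)

def lj(r):
    """Lennard-Jones 12-6 pair energy in kcal/mol at separation r (Angstrom)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * eps * (sr6 * sr6 - sr6)

r = 3.0                           # well inside sigma, i.e. up the repulsive wall
print(lj(r) - lj(r + 0.001))      # energy change caused by a 0.001 A displacement
```

And to make the storage-format trade-off concrete, here is a minimal sketch, using only the Python standard library, of the 16-bit fixed-point encoding discussed in the last bullet next to a 32-bit single-precision float. The 0.001 Å resolution is the implied 3-decimal grid; everything else is just illustration.

```python
import struct

RESOLUTION = 0.001                 # Angstrom per integer step (3 decimal places)

def encode_fixed16(x):
    """Pack a coordinate into 2 bytes as a signed 16-bit fixed-point value."""
    n = round(x / RESOLUTION)
    if not -32768 <= n <= 32767:   # hence the -32.768 .. 32.767 A range limit
        raise OverflowError(f"{x} A does not fit in 16-bit fixed point")
    return struct.pack("<h", n)

def decode_fixed16(raw):
    return struct.unpack("<h", raw)[0] * RESOLUTION

x = 12.3456
print(len(struct.pack("<f", x)))                                  # 4 bytes for a float
print(len(encode_fixed16(x)), decode_fixed16(encode_fixed16(x)))  # 2 bytes, ~12.346
# encode_fixed16(40.0) would raise OverflowError: outside the representable range
```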

Forum Member
Joined: Sep 2003, Posts: 4,872, Likes: 11
Disks are cheap. GROMACS uses a (lossy) compressed format, but it does not save all that much space; typically the reduction in file size is at most 2-3 fold. Reducing the frequency with which you save frames (either directly or in a post-processing step) can do much better than that.


Lennart Nilsson
Karolinska Institutet
Stockholm, Sweden
rmv, Forum Member
Joined: Sep 2003, Posts: 8,637, Likes: 25
Molecular geometry is not the only property of interest from a simulation; evaluating energetics would require the full storage precision. (Internally, CHARMM arrays have coordinates, forces, and velocities in double precision.)

Most of my work over the past few decades has involved simulations which do not start from a file obtained from the PDB.

An aside: Does anyone ever simulate a structure from the PDB in the actual crystal packing environment?


Rick Venable
computational chemist

Forum Member
Joined: Sep 2003, Posts: 4,872, Likes: 11
Disk space: A project like the ABC study of DNA, running 39 systems of ca. 40K atoms for 1 microsecond each, would need ca. 2 TB if coordinates are saved every 10 ps, giving 100,000 stored frames per trajectory (40000*3*4*100000*39 bytes).
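Spelling out that back-of-the-envelope estimate (a trivial check; the numbers are the ones given above):

```python
atoms, frames, systems = 40_000, 100_000, 39
bytes_per_frame = atoms * 3 * 4            # x, y, z stored as 32-bit floats
total_bytes = bytes_per_frame * frames * systems
print(f"{total_bytes / 1e12:.2f} TB")      # ~1.87 TB, i.e. "ca 2 TB"
```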

To the aside:

Case, David A. "All-atom simulations of biomolecular crystals." 246th National Meeting of the American Chemical Society (ACS), Indianapolis, IN, Sep 8-12, 2013. Abstracts of Papers of the American Chemical Society, Vol. 246, Meeting Abstract 49-COMP.


Lennart Nilsson
Karolinska Institutet
Stockholm, Sweden
slaw (OP), Forum Member
Joined: Sep 2006, Posts: 146
Thanks for the insights and the feedback:

  • Please move to "General Chemistry Discussions" (I don't see where I can do this)
  • I don't think I made it clear in my original post, but I was actually referring to archiving/post-processing. I think it is necessary to output the original trajectory coordinates in at least single precision, but after the data have been analyzed and the paper has been published, I was wondering whether there are any issues with archiving the trajectory in a lossy format. Some may even ask, "why even keep the file at all? Just re-run the simulation if you need it."
  • In terms of archiving/post-processing, skipping frames and using a lossy compressed format are not mutually exclusive; one could/should do both (a minimal sketch of the lossy-archiving idea follows this list). And while disks are cheap, we cannot assume that everyone has access to the same resources. Someone (e.g., an assistant professor at a small college) may get access to XSEDE or other supercomputing clusters, but may not have the money to keep buying disks to archive the data afterwards. I'm not really trying to debate/defend this point, though, since I'm sure every lab has its own philosophy on keeping "old" data.
  • I can see a compressed/lossy format being useful in the case of collaborations where you want to share your data. I can't imagine downloading a 2TB file but maybe it's not that bad when compared to the time needed to run the simulation. Also, some journals (e.g., PLoS) are thinking about requiring authors to deposit all of their raw data in some publicly accessible space after their manuscript is accepted and so compressing could become relevant in that case.
  • Aside:
    Ahlstrom, L. S. and Miyashita, O. (2013), Packing interface energetics in different crystal forms of the λ Cro dimer. Proteins. doi: 10.1002/prot.24478
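Regarding the lossy-archiving question above, here is a minimal sketch of one way it could be done, assuming the trajectory is already available as a NumPy array of shape (n_frames, n_atoms, 3) in Å (getting it into that form from a DCD is left to whatever analysis tool you normally use). The 0.001 Å grid and the file name are placeholders, not a recommendation.

```python
import gzip
import numpy as np

def archive_lossy(coords, path, resolution=0.001):
    """Quantize coordinates to a fixed grid (lossy) and gzip the result."""
    scaled = np.round(coords / resolution)
    # int16 only covers about -32.768 .. 32.767 A at 0.001 A resolution
    assert np.abs(scaled).max() < 32768, "coordinates exceed the int16 range"
    with gzip.open(path, "wb") as fh:
        np.save(fh, scaled.astype(np.int16))

def restore(path, resolution=0.001):
    """Recover coordinates at the archived (3-decimal) precision."""
    with gzip.open(path, "rb") as fh:
        return np.load(fh).astype(np.float32) * resolution

# Synthetic data standing in for a real trajectory.
coords = np.random.uniform(-30.0, 30.0, (100, 40_000, 3)).astype(np.float32)
archive_lossy(coords, "traj_archive.npy.gz")
roundtrip = restore("traj_archive.npy.gz")
print(np.abs(roundtrip - coords).max())    # ~0.0005 A quantization error
```

Whether roughly 0.0005 Å of quantization error is acceptable depends on the analyses you might want to rerun later, which is really the question being asked here.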

Last edited by slaw; 05/14/14 12:35 PM.
Forum Member
Joined: Dec 2005, Posts: 1,535
  • I could be wrong, but I interpreted Rick's aside as being aimed at pointing out that if you're not simulating the protein in its packing environment (which is relatively rare), you're actually not trying to reproduce experimental data, so it's not very meaningful to compare errors on the simulation coordinates with errors on the experimental ones. If I interpreted this wrong, then my answer is: I believe at some point, we actually did a modest series of protein crystal simulations when working on the protein FF (more specifically the Drude one IIRC). I can figure out whether we published this (we most probably did) and what the citation is, if you want.
  • Our archiving strategy is along the lines of: use CHARMM's MERGE command to reduce the size of the trajectory by removing solvent (where appropriate; a large reduction in size) and decreasing the sampling rate (a generic sketch of this strip-and-subsample idea follows this list). If someone later wants to run an analysis for which the low sampling rate or lack of solvent is inadequate, they'll have to rerun the simulation on their hardware, which can safely be assumed to be at least as good as what we have now.
  • A 2TB trajectory would generally mean someone went overboard with the sampling rate. I don't know of any practical/useful analysis that would benefit significantly from that much data. Note that I can easily see this statement being invalidated in the future, but we'll have bigger disks and faster connections by then.
  • Assuming people don't go overboard, well, price/TB of raw disk currently hovers around $40. This is of course just a bare hard drive, but when you're talking about long-term archival, this is pretty much a non-issue.
  • As said before, on a logarithmic scale (reflecting the exponential growth in storage capacity), the gains from lossy compression are puny compared to a smart selection of atoms and samples to keep. This especially goes for your supplementary-material scenario, where one can limit the atom selection and sampling rate to "enough to verify the data in the paper".
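A generic illustration of the strip-and-subsample strategy described above. This is the same idea as the CHARMM MERGE workflow mentioned earlier, but expressed here as plain NumPy array operations rather than CHARMM syntax; the solvent mask, array sizes, and keep-every-10th choice are arbitrary examples.

```python
import numpy as np

def strip_and_subsample(coords, solvent_mask, keep_every=10):
    """Drop solvent atoms and keep every Nth frame.

    coords       : (n_frames, n_atoms, 3) array of coordinates in Angstrom
    solvent_mask : boolean array of length n_atoms, True for solvent atoms
    """
    solute_only = coords[:, ~solvent_mask, :]   # atom selection
    return solute_only[::keep_every]            # frame subsampling

# Toy example: 1,000 frames, 4,000 atoms of which the last 3,000 are "water".
coords = np.zeros((1_000, 4_000, 3), dtype=np.float32)
mask = np.zeros(4_000, dtype=bool)
mask[1_000:] = True
small = strip_and_subsample(coords, mask)
print(coords.nbytes / small.nbytes)             # 40x smaller in this toy case
```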

rmv, Forum Member
Joined: Sep 2003, Posts: 8,637, Likes: 25
As noted, I was referring to the fact that most simulations started from a PDB file are not actually run in the crystal packing environment, which limits comparison to that experimental result.

Archiving policies vary from one institution to another, but it's good practice to keep enough primary data (from measurements or simulations) to be able to reproduce published results.


Rick Venable
computational chemist

