Fixing #11

josemduarte · web-flow · commit be1143cb86af · 2016-09-13T16:05:46.000-07:00
diff --git a/structure/bioassembly.md b/structure/bioassembly.md
@@ -99,19 +99,17 @@ Here another example, the bacteriophave GA protein capsid PDB ID [1GAV](http://w
 
 Since biological assemblies can be accessed via the StructureIO interface, in principle there is no need to access the lower-level code in BioJava that allows to re-create biological assemblies. If you are interested in looking at the gory details of this, here a couple of pointers into the code. In principle there are two ways for how to get to a biological assembly:
 
-A) The biological assembly needs to be re-built and the atom coordinates of the asymmetric unit need to be rotated according to the instructions in the files. The information required to re-create the biological assemblies is available in both the PDB an mmCIF/PDBx files. 
+1. The biological assembly needs to be re-built and the atom coordinates of the asymmetric unit need to be rotated according to the instructions in the files. The information required to re-create the biological assemblies is available in both the PDB an mmCIF/PDBx files. In PDB files the relevant transformations are stored in the *REMARK 350* records. For mmCIF/PDBx, the *_pdbx_struct_assembly* and *_pdbx_struct_oper_list* categories store the corresponding rules.
 
-In PDB files the relevant transformations are stored in the *REMARK 350* records. For mmCIF/PDBx, the *_pdbx_struct_assembly* and *_pdbx_struct_oper_list* categories store the corresponding rules.
+2. There is also a pre-computed file available from the PDB that contains an assembled version of a structure. This file can be parsed directly, without having to perform rotation operations on coordinates.
 
-B) There is also a pre-computed file available that contains an assembled version of a structure. This file can be parsed directly, without having to perform rotation operations on coordinates.
+As of version 5.0 BioJava contains utility classes to re-create biological assemblies for both PDB and mmCIF files.
 
-BioJava contains utility classes to re-create biological assemblies for both PDB and mmCIF, as well as to parse the pre-computed file. The [BioUnitDataProvider](http://www.biojava.org/docs/api/org/biojava/nbio/structure/quaternary/io/BioUnitDataProvider.html) interface defines what is required to re-build an assembly. The [BioUnitDataProviderFactory](http://www.biojava.org/docs/api/org/biojava/nbio/structure/quaternary/io/BioUnitDataProviderFactory.html) allows to specify which of the BioUnitDataProviders is getting used.
-
-Take a look at the method getBiologicalAssembly() in [StructureIO](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIO.html)  to see how the BioUnitDataProviders are used by the *BiologicalAssemblyBuilder*.
+Take a look at the method `getBiologicalAssembly()` in [StructureIO](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIO.html)  to see how the underlying *BiologicalAssemblyBuilder* is called.
 
 ## Memory consumption
 
-This example in the next section loads the structure of the PBCV-1 virus capsid (PDB ID [1M4X](http://www.rcsb.org/pdb/explore.do?structureId=1m4x)). It consists of 16 million atoms and has one of the largest, if not the largest biological assembly that is currently available in the PDB. Needless to say it is important to change the maximum heap size parameter, otherwise there is no successfully load this. It requires a minimum of 9GB RAM to load (measured on Java 1.7 on OSX). You can change the heap size by providing the following startup parameter (and assuming you have 10G or more of RAM available on your system)
+This example in the next section loads the structure of the PBCV-1 virus capsid (PDB ID [1M4X](http://www.rcsb.org/pdb/explore.do?structureId=1m4x)). It consists of 16 million atoms and has one of the largest, if not the largest biological assembly that is currently available in the PDB. Needless to say it is important to change the maximum heap size parameter, otherwise you will not be able to load it. It requires a minimum of 9GB RAM to load (measured on Java 1.7 on OSX). You can change the heap size by providing the following startup parameter (and assuming you have 10G or more of RAM available on your system)
 <pre>
     -Xmx10G 
 </pre>
@@ -131,97 +129,27 @@ Note: when loading this structure with 9GB of memory, the Java VM spends a signi
     </tr>
 </table>
 
-## Low level access to parsing pre-assembled biological asssembly files
-
-To load the pre-assembled biological assembly file directly, one can tweak the low-level PDB file parser like this
-
-```java
-
-public static void main(String[] args){
-
-        public static void main(String[] args){
-
-        // This loads the PBCV-1 virus capsid, one of, if not the biggest biological assembly in terms on nr. of atoms.
-        // The 1m4x.pdb1.gz file has 313 MB (compressed)
-        // This Structure requires a minimum of 9 GB of memory to be loaded in memory. 
-
-        String pdbId = "1M4X";
-
-        Structure bigStructure = readStructure(pdbId,1);
-        
-        // let's take a look how much memory this consumes currently
+## Representing symmetry related chains
+Chains are identified by chain identifiers which serve to distinguish the different molecular entities present in the asymmetric unit. Once a biological assembly is built it can be composed of chains from both the asymmetric unit or from chains resulting in applying a symmetry operator (this chains are also called "symmetry mates"). The problem with that is that the symmetry mates will get the same chain identifiers as the untransformed chains. 
 
-        Runtime r = Runtime.getRuntime();
+In order to solve that issue there are 2 solutions:
 
-        // let's try to trigger the Java Garbage collector
-        r.gc();
+1. Assign new chain identifiers. In BioJava the new chain identifiers assigned are of the form `<original chain id>_<symmetry operator id>`.
+2. Place the symmetry partners into different models. This is the solution taken by the pre-computed biounit files available from the PDB. 
 
-        System.out.println("Memory consumption after " + pdbId + 
-                " structure has been loaded into memory:");
-        
-        String mem = String.format("Total %dMB, Used %dMB, Free %dMB, Max %dMB", 
-                r.totalMemory() / 1048576,
-                (r.totalMemory() - r.freeMemory()) / 1048576, 
-                r.freeMemory() / 1048576,
-                r.maxMemory() / 1048576);
+Since version 5.0 BioJava uses approach 1) to store the biounit in a single `Structure` object. Because the chain identifiers are then of more than 1 character, the Structure can only be written out in mmCIF format (PDB format is limited to 1 character chain identifiers).
 
-        System.out.println(mem);
-                
-        System.out.println("# atoms: " + StructureTools.getNrAtoms(bigStructure));
-        
-    }
-    /** Load a specific biological assembly for a PDB entry
-     *  
-     * @param pdbId .. the PDB ID
-     * @param bioAssemblyId .. the first assembly has the bioAssemblyId 1
-     * @return a Structure object or null if something went wrong.
-     */
-    public static Structure  readStructure(String pdbId, int bioAssemblyId) {
-
-        // pre-computed files use lower case PDB IDs
-        pdbId = pdbId.toLowerCase();
-
-        // we need to tweak the FileParsing parameters a bit
-        FileParsingParameters p = new FileParsingParameters();
-
-        // some bio assemblies are large, we want an all atom representation and avoid
-        // switching to a Calpha-only representation for large molecules
-        // note, this requires several GB of memory for some of the largest assemblies, such a 1MX4
-        p.setAtomCaThreshold(Integer.MAX_VALUE);
-
-        // parse remark 350 
-        p.setParseBioAssembly(true);
-
-        // The low level PDB file parser
-        PDBFileReader pdbreader = new PDBFileReader();
-
-        // we just need this to track where to store PDB files
-        // this checks the PDB_DIR property (and uses a tmp location if not set) 
-        AtomCache cache = new AtomCache();
-        pdbreader.setPath(cache.getPath());
-
-        pdbreader.setFileParsingParameters(p);
-
-        // download missing files
-        pdbreader.setAutoFetch(true);
-
-        pdbreader.setBioAssemblyId(bioAssemblyId);
-        pdbreader.setBioAssemblyFallback(false);
-
-        Structure structure = null;
-        try { 
-            structure = pdbreader.getStructureById(pdbId);
-            if ( bioAssemblyId > 0 )
-                structure.setBiologicalAssembly(true);
-            structure.setPDBCode(pdbId);
-        } catch (Exception e){
-            e.printStackTrace();
-            return null;
-        }
-        return structure;
-    }
- ```
+In BioJava one can still produce a biounit using approach 2) by passing a boolean parameter to the `getBiologicalAssembly` method:
+```java
+Structure struct = StructureIO.getBiologicalAssembly(pdbId, true);
+```
+## PDB entries with more than 1 biological assemblies
+Many PDB entries are assigned more than 1 biological assemblies. This is due to many factors: sometimes the authors disagree with the annotators, sometimes the authors are not sure about which biological assembly is the right one, sometimes there are several equivalent biological assemblies present in the asymmetric unit (but with slightly different  conformations) and each of those is annotated as a different biological assembly.
 
+To get all biological assemblies for a given PDB entry one needs to use:
+```java
+List<Structure> bioAssemblies = StructureIO.getBiologicalAssemblies(pdbId);
+```
 
 ## Further Reading