Data handling and processing at BioMAX
This is a guide to the facilities available for macromolecular data processing at MAX IV.
The image files from the Eiger detector are written in the NeXus format, using HDF5 as the container. Each h5 container stores one degree's worth of images (e.g., ten 0.1 degree images). For each data set, a master file is written containing experimental and instrumentation metadata. The files are stored in a directory of the form /data/visitors/biomax/'proposal'/'date'/raw/'yourprotein'/'yourprotein-yourcrystal'. This directory is mounted on all the computers available at MAX IV.
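As a sketch, the layout looks like the following; the proposal number, date, and sample names used here are hypothetical placeholders, and the snippet simply mocks the tree in a temporary directory to illustrate it:

```shell
# Mock the BioMAX raw-data layout in a temporary directory.
# '20180242', '20190101', 'myprotein', and 'xtal1' are hypothetical placeholders.
root=$(mktemp -d)
dir="$root/data/visitors/biomax/20180242/20190101/raw/myprotein/myprotein-xtal1"
mkdir -p "$dir"
# One master file per data set, plus h5 containers each holding one degree of images
touch "$dir/myprotein-xtal1_1_master.h5"
touch "$dir/myprotein-xtal1_1_data_000001.h5"
touch "$dir/myprotein-xtal1_1_data_000002.h5"
ls "$dir"
```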
Important: do not change the file names or paths; doing so can break the internal data links and cause most processing software to fail.
During the experiment, the data are available on the high-speed local storage at MAX IV. Concurrently, the data are replicated to high-capacity offline storage at LUNARC, where they are kept for 6 months. Note that although the data reside in a different location after the transfer to LUNARC is completed, you can still use the same path to access them (/data/visitors/biomax/… etc.). Since the data are not backed up or archived at this time, it is good practice to transfer them to your own institution at your earliest convenience. The data can be accessed for transfer via Globus or sftp, and for further offline processing on the Aurora cluster.
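As a sketch of what an sftp-style transfer could look like, the snippet below builds (but only prints, rather than executes) an rsync-over-ssh command; the username, transfer host, and proposal path are all hypothetical placeholders:

```shell
# Sketch: assemble the rsync command you would run from your home institution.
# 'myuser', '<transfer-host>', and the proposal/date are hypothetical placeholders.
user=myuser
src="/data/visitors/biomax/20180242/20190101/raw"
cmd="rsync -avP ${user}@<transfer-host>:${src}/ ./biomax-backup/"
echo "$cmd"
```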
As soon as a data set has been collected and written to the local disk, MXCuBE3 launches several automated data processing pipelines. fast_dp and EDNA produce very fast results for rapid feedback on data quality, while autoPROC runs significantly slower but performs anisotropy analysis and may provide better data. The output files of the automated processing are stored under the /data/visitors/biomax/'proposal'/'date'/process subdirectory, but you can also inspect them with ISPyB in a remote browser and download them directly to your own computer.
You can see images as they are being collected with the program ALBULA. Click the ALBULA icon on the beamline desktop to launch ALBULA in monitoring mode.
Adxv is a useful tool to inspect images after collection. It can add frames together, a useful feature for visually assessing the quality of the diffraction in thinly phi-sliced data. Adxv is available offline from PReSTO (see the section below). To see the images present in the data directory, change the default extension to h5. Then click on any of the h5 containers (not the master file!). Use the “slabs” box to set the number of images to add together; 5-10 is a good choice for data collected in 0.1 degree frames.
For manual data processing during or after the experiment, we recommend the PReSTO platform, available to general (non-industry) users on the MAX IV HPC clusters. During the beamtime, users can log in to the so-called online HPC cluster from the beamline machines:
- Click on Applications (first item in the top panel of the desktop).
- Hover over the Internet menu and select the ThinLinc Client application.
- Type clu0-fe-1 as the server, enter your username and password, and press Return.
- By default, the ThinLinc desktop will be set to full screen mode. You can use the F8 key to exit this.
- When you finish your work, please log out (click on the System item in the panel and select “Log out”). If you do not log out, your session will keep running indefinitely.
Since the beginning of 2019, users can also log in remotely from outside MAX IV to a second HPC cluster, the “offline” cluster (offline-fe1). To do this, you must download and install the Pulse Secure software to establish a VPN connection to BioMAX, as well as the ThinLinc client. Please see these instructions. The offline cluster has fewer nodes than the online cluster, but otherwise the configuration is similar. The /data disk is mounted on both clusters, as is the home disk. For more help, see IT environment at BioMAX.
Once you are logged in via ThinLinc, the easiest way to launch a PReSTO-ready terminal is through the ThinLinc client's “Applications” menu.
Here are some tips for running some of the supported packages from PReSTO on the MAX IV cluster. To fully explore the capabilities of the software and of job submission on the cluster (important if you plan to go beyond data reduction and run longer jobs), please refer to the individual programs' documentation and the PReSTO help pages. The full list of software available through PReSTO can be found here.
- Open the GUI: Applications → PReSTO → XDSAPP.
- Allocate the maximum number of cores and the maximum time for the job (we suggest 24 cores and 1 hour).
- Go to the Settings tab and select the location of your output directory. By default, a subdirectory will be created in the directory where the input data are.
- Load the master file of the data set to process. This will populate all the fields in the “Settings” tab, used to generate the input XDS.INP file.
- Go back to the Settings tab and specify the number of jobs and CPUs. The product of the two should be equal to or less than twice the total number of cores allocated. For example, if you allocated 24 cores, you can use 6 jobs and 8 CPUs (6 × 8 = 48 = 2 × 24).
- You can choose whether to run the different tasks one by one or run them all as a supertask by clicking “Do all”.
- To view a graphical display of the results of the INTEGRATE and CORRECT steps and the dataset statistics, use the corresponding tabs.
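The jobs × CPUs rule above can be sanity-checked with a few lines of shell arithmetic; 24 cores, 6 jobs, and 8 CPUs are just the example values from the text:

```shell
# Check that jobs * cpus does not exceed twice the allocated cores.
cores=24
jobs=6
cpus=8
limit=$((2 * cores))
product=$((jobs * cpus))
if [ "$product" -le "$limit" ]; then
  echo "OK: ${jobs} jobs x ${cpus} CPUs = ${product} <= ${limit}"
else
  echo "Too many: ${product} > ${limit}"
fi
```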
- Open the GUI: Applications → PReSTO → XDSGUI.
- Allocate the number of cores available for the job and the maximum time (we suggest 24 cores and 1 hour).
- On the Projects tab, select the directory where you want to store the processing results; create it first from a terminal if it does not exist. By default, the result files are written to the same directory as the images.
- On the Frame tab, load the master file of the data set to process and then generate the XDS.INP file. Edit the input if you wish and save it. For faster processing, we recommend defining the number of jobs and number of processors in the input. The product of the two should be equal to or less than twice the total number of cores allocated. For example, if you allocated 24 cores, insert the following two lines:
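The two lines in question would be the standard XDS parallelization keywords; for the 24-core example (6 × 8 = 48 = 2 × 24), a plausible choice, consistent with the autoPROC example later in this guide, is:

```
MAXIMUM_NUMBER_OF_JOBS= 6
MAXIMUM_NUMBER_OF_PROCESSORS= 8
```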
- To make use of the Neggia library to speed up reading of the h5 images, insert this line:
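The line in question would use the XDS LIB= keyword to point at the Neggia plugin; the exact path is installation-specific, so the one below is only a placeholder:

```
LIB= /path/to/dectris-neggia.so
```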
- Click on Run XDS to start the job.
Use an interactive node to run Xia2: open a terminal in the ThinLinc client (do not use the PReSTO window, as there are currently some incompatibilities between the modules required to run Xia2 and other software packages) and then type:
interactive --exclusive -t 01:00:00
module load DIALS
We recommend the DIALS pipeline, run with the command: xia2 pipeline=dials image=/data/visitors/biomax/…/'path_to_master'.h5. You can also use Xia2 to run XDS-based pipelines; however, running XDS directly seems more advantageous in that case.
Use an interactive node to run autoPROC: open the PReSTO terminal and then type:
interactive --exclusive -t 01:00:00
Go to the directory where you wish to store the data and start autoPROC from the command line:
process -h5 /'pathto masterfile_master'.h5 autoPROC_XdsKeyword_MAXIMUM_NUMBER_OF_JOBS=6 autoPROC_XdsKeyword_MAXIMUM_NUMBER_OF_PROCESSORS=8 autoPROC_XdsKeyword_ROTATION_AXIS="0,-1,0"
- See the scripts written by the PReSTO developers for data processing on the MAX IV HPC clusters. Prior to running native.script or anomalous.script, the output directory must be created.
- This is the pipeline that appears to generate the best results to date:
- Process with XDSAPP to 0.3-0.4 Å higher resolution than XDSAPP suggests by default.
- Use the output XDS_ASCII.HKL as input for the STARANISO server. The resulting anisotropic data set will be better than that from a standard autoPROC run.
- Try out DIALS, since this software is undergoing continuous improvement.