# Copyright (c) 2013-2020, SIB - Swiss Institute of Bioinformatics and
#                          Biozentrum - University of Basel
# 
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# 
#   http://www.apache.org/licenses/LICENSE-2.0
# 
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


This folder contains the scripts used to train weights for the scoring
functions employed in loop selection in the pipeline. You need a compiled and
working version of ProMod3 to run these scripts. Scripts are provided as-is!

----------------------
Steps to be performed:
----------------------

1) Set up the test set and DBs by executing these scripts in order within this folder:
$ pm generate_bft_testset.py
$ pm generate_bft_dbs.py
-> by default we get 50000 loops (5000 for each loop length in [3,12])

2) generate big table (BFT) with scores for each loop candidate for each loop
- This takes a lot of time and is best done on a cluster (we assume bc2)
- Test the script on a small example:
$ pm generate_bft_chunks.py range 0 10
-> should generate file "loop_data_range_00000_00010.json", which you can delete
- Adapt "PROMOD3_ROOT" in "run_array_job.sh" to point to PM3 stage. Then submit:
$ qsub run_array_job.sh
-> this executes 500 jobs, each doing 100 loops with generate_bft_chunks.py
-> the scripts automatically abort and dump partial results if memory or
   runtime exceed the given thresholds ("max_time" and "max_mem" in the
   Python script)
   -> adapt those values if you customize the max. memory and runtime limits
      in run_array_job.sh
-> output is stored as "loop_data_range_X_Y.json" for loops in [X,Y[
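As an illustration of how the 500 jobs cover all 50000 loops, here is a
minimal sketch assuming a 1-based SGE task id and 100 loops per job; the
actual arithmetic lives in run_array_job.sh and may differ:

```python
import os

# Hedged sketch: map a 1-based SGE array-job task id to the half-open
# loop range [start, end[ handled by that job. 500 tasks x 100 loops
# cover [0, 50000[. The real run_array_job.sh may compute this differently.
def task_range(task_id, loops_per_job=100):
    start = (task_id - 1) * loops_per_job
    return start, start + loops_per_job

start, end = task_range(int(os.environ.get("SGE_TASK_ID", "1")))
# each task would then run the equivalent of:
print("pm generate_bft_chunks.py range %d %d" % (start, end))
```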
- check that each run_out_X.stderr is empty by executing:
$ cat *.stderr
- check for unfinished chunks (i.e. missing loops) by executing:
$ grep "ABORT" *.stdout
- if loops in [X,Y[ are missing, manually execute (or submit):
$ pm generate_bft_chunks.py range X Y
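To avoid hunting for missing ranges by hand, a small helper along these
lines could list them (a sketch, not one of the provided scripts; it assumes
the zero-padded "loop_data_range_X_Y.json" naming and 100-loop chunks):

```python
import glob
import re

# Hedged sketch: detect missing loop ranges by parsing the names of the
# chunk files that were actually produced. "total" and "chunk" match the
# defaults described above (50000 loops, 100 loops per job).
def missing_ranges(filenames, total=50000, chunk=100):
    done = set()
    for fn in filenames:
        m = re.match(r"loop_data_range_(\d+)_(\d+)\.json$", fn)
        if m:
            done.add((int(m.group(1)), int(m.group(2))))
    expected = [(s, s + chunk) for s in range(0, total, chunk)]
    return [r for r in expected if r not in done]

for x, y in missing_ranges(glob.glob("loop_data_range_*.json")):
    print("pm generate_bft_chunks.py range %d %d" % (x, y))
```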
- combine all "loop_data_range_X_Y.json" by executing:
$ python generate_bft.py
-> this should produce a fairly big numpy array dumped as loop_bft.dat and a
   json file loop_infos.json containing information to access the data
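The combination step can be pictured roughly as follows (a sketch under the
assumption that each chunk file holds a list of float score rows, one per
loop candidate; the real generate_bft.py stores more bookkeeping):

```python
import json

import numpy as np

# Hedged sketch: concatenate the score rows of all chunk files into one
# numpy array, dump it raw, and record the shape/dtype needed to reload it
# with np.fromfile. File contents and names are illustrative assumptions.
def combine_chunks(chunk_files, out_data="loop_bft.dat",
                   out_info="loop_infos.json"):
    rows = []
    for fn in sorted(chunk_files):
        with open(fn) as f:
            rows.extend(json.load(f))
    bft = np.array(rows, dtype=np.float32)
    bft.tofile(out_data)  # raw dump, no header
    with open(out_info, "w") as f:
        json.dump({"shape": list(bft.shape), "dtype": str(bft.dtype)}, f)
    return bft.shape
```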

3) Analyze composition of BFT and effect of single scores
$ python analyze_bft_candidates.py
-> this generates plots showing the distribution of loop candidates
-> this will also display statistics 
$ python analyze_bft_score_correlations.py
-> this generates plots showing correlation of score to ca_rmsd
-> this will also display statistics for all scores and correlation coefficients
$ python analyze_bft_scores.py
-> this generates plots showing prob. to find loop candidate with ca_rmsd <= x
-> this will also store csv files with the AUC table and display the table
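Conceptually, the probability curves and AUC values can be computed like
this (a sketch with an assumed data layout; analyze_bft_scores.py may
organize the data differently):

```python
import numpy as np

# Hedged sketch: for each loop, pick the candidate with the lowest score,
# then compute the probability that its ca_rmsd is <= x over a range of
# thresholds, plus the area under that curve (trapezoidal rule).
def rmsd_curve(scores_per_loop, rmsds_per_loop, thresholds):
    picked = np.array([rmsds[np.argmin(scores)]
                       for scores, rmsds in zip(scores_per_loop,
                                                rmsds_per_loop)])
    probs = np.array([(picked <= x).mean() for x in thresholds])
    auc = float(np.sum((probs[1:] + probs[:-1]) / 2.0
                       * np.diff(thresholds)))
    return probs, auc
```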

4) Optimize weights
- First we need to split the BFT into a training and a test set:
$ python generate_training_bft.py
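The split can be sketched as follows (the 80/20 fraction, the fixed seed
and splitting by whole loops are illustrative assumptions;
generate_training_bft.py defines the actual split):

```python
import numpy as np

# Hedged sketch: split whole loops (not individual candidates) into a
# training and a test set by shuffling loop indices. Ratio and seed are
# assumptions, not taken from generate_training_bft.py.
def split_loops(n_loops, train_fraction=0.8, seed=42):
    rng = np.random.RandomState(seed)
    idx = rng.permutation(n_loops)
    n_train = int(n_loops * train_fraction)
    return idx[:n_train], idx[n_train:]
```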
- We experimented with CMA (Hansen et al.) and scipy's Nelder-Mead method
-> CMA seemed to perform better
-> for testing run the following to optimize BB weights
   (takes around 2 min unless max_time is adapted in the script)
$ python optimize_weights_cma.py BB
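The optimization idea can be sketched with scipy's Nelder-Mead, the second
method mentioned above (the objective shown here, mean ca_rmsd of the
best-scoring candidate per loop, and the data layout are assumptions; the
actual scripts may optimize a different target):

```python
import numpy as np
from scipy.optimize import minimize

# Hedged sketch: combine the individual scores with a weight vector, pick
# the best-scoring candidate per loop and minimize the mean ca_rmsd of the
# picked candidates over the weights.
def objective(weights, scores, rmsds):
    """scores: (n_loops, n_cand, n_scores); rmsds: (n_loops, n_cand)."""
    combined = np.tensordot(scores, weights, axes=([2], [0]))
    best = np.argmin(combined, axis=1)
    return rmsds[np.arange(len(best)), best].mean()

def optimize_weights(scores, rmsds, n_scores):
    res = minimize(objective, np.ones(n_scores), args=(scores, rmsds),
                   method="Nelder-Mead")
    return res.x
```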
- We usually aim to optimize multiple sets of weights
-> hence we provide scripts to submit this as jobs on the bc2 cluster
-> sets trained: "BB", "BB_DB", "BB_DB_AANR", "BB_AANR", "BB_DB_AAR", "BB_AAR"
   -> BB   = Backbone scores
   -> AAR  = AllAtom scores after relaxation
   -> AANR = AllAtom scores before relaxation
   -> DB   = FragDB specific: profile scores and stem RMSD
-> for each set of weights we launch one job with the full BFT and 5 jobs
   which each consider only a pair of loop lengths
-> launch all jobs with:
$ python run_all_optimizations.py
-> when all jobs are done, collect the results with:
$ python collect_optimized_weights.py cma_result cma_weights.json
-> analyze result with:
$ python analyze_weights.py cma_weights.json
=> we will finally have a cma_weights.json with all trained weights and
   a cma_weights.png showing the performance for the weights

5) Get code snippets with preferred sets of weights
$ python get_weights_code.py weights BB cma_weights.json
$ python get_weights_code.py weights_aa BB_AANR cma_weights.json
$ python get_weights_code.py weights_db BB_DB cma_weights.json
$ python get_weights_code.py weights_db_aa BB_DB_AANR cma_weights.json
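What such a snippet generator could do is sketched below (the dict-literal
output and the JSON layout, set name -> {score name: weight}, are
assumptions; the real get_weights_code.py defines the actual format):

```python
import json

# Hedged sketch (not the real get_weights_code.py): read one trained
# weight set from the collected JSON file and render it as an assignable
# Python dict under the requested variable name.
def weights_snippet(json_file, var_name, set_name):
    with open(json_file) as f:
        all_weights = json.load(f)
    lines = ["%s = {" % var_name]
    for score, weight in sorted(all_weights[set_name].items()):
        lines.append('    "%s": %f,' % (score, weight))
    lines.append("}")
    return "\n".join(lines)

# e.g.: print(weights_snippet("cma_weights.json", "weights", "BB"))
```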

See the description in each script for details on which files are generated.
