[PUBLIC] Chemprop v1 to v2 User Transition Guide


1	Status	v2 Milestone	v1 argument	v2 argument	short option	Notes
2			Main Args
3	Done	v2.0		--logfile		--log also works. If not specified, prints logging output to stdout.
4	Done	v2.0	--quiet	--verbose	-v	Quiet (errors only) is the default, -v also includes warning messages, -vv adds info level messages, -vvv adds debug messages (max).
5	Not planned	v2.0	--log_frequency
6			Common Args
7	Done	v2.0	--smiles_columns	--smiles-columns	-s	If unspecified, uses the 0th column in the input csv.
8	Done	v2.0		--reaction-columns	-r	The names of columns in the input CSV containing reactions.
9	Done	v2.0		--no-header-row		Only works when the first column is the smiles and the rest of the columns are targets.
10	Done	v2.0	--num_workers	--num-workers	-n
11	Done	v2.0	--batch_size	--batch-size	-b
12	Done	v2.0	--no_cuda	--accelerator		GPU training is now handled by PyTorch Lightning. These arguments are passed directly to the Trainer.
13	Done	v2.0	--gpu	--devices
14	Done	v2.0	--reaction_mode	--reaction-mode		--rxn-mode also works and is preferred. This was originally in Training Args.
15	Done	v2.0	--explicit_h	--keep-h		Originally in Training Args
16	Done	v2.0	--adding_h	--add-h		Originally in Training Args
17	Done	v2.0	--features_generator	--features-generators		These generate molecular features to be used as descriptors (concatenated to the output of the message passing).
18	Done	v2.0	--features_path	--descriptors-path
19	TODO	v2.2	--phase_features_path	--phase-features-path
20	Done	v2.0	--no_features_scaling	--no-descriptor-scaling
21	Done	v2.0	--no_atom_descriptor_scaling	--no-atom-feature-scaling		Atom descriptors and features now have separate arguments. These were originally in Training Args
22	Done	v2.0	--no_atom_descriptor_scaling	--no-atom-descriptor-scaling
23	Done	v2.0	--no_bond_descriptor_scaling	--no-bond-feature-scaling		Bond descriptors and features now have separate arguments. These were originally in Training Args
24	TODO	v2.2	--no_bond_descriptor_scaling	--no-bond-descriptor-scaling
25	Done	v2.0	--atom_descriptors			Atom descriptors and features now have separate arguments.
26	Done	v2.0	--atom_descriptors_path	--atom-features-path		Specify the flag multiple times if including for multiple components, see documentation for details.
27	Done	v2.0	--atom_descriptors_path	--atom-descriptors-path		Specify the flag multiple times if including for multiple components, see documentation for details.
28	TODO	v2.2	--bond_descriptors			Bond descriptors and features now have separate arguments.
29	Done	v2.0	--bond_descriptors_path	--bond-features-path		Specify the flag multiple times if including for multiple components, see documentation for details.
30	TODO	v2.2	--bond_descriptors_path	--bond-descriptors-path
31	Not planned	v2.0	--no_cache_mol			The rdkit.Chem.Mol objects are always cached. See --no-cache in Train Args for control of caching the MolGraph objects (featurized from rdkit.Chem.Mols)
32	Not planned	v2.0	--empty_cache
33	Not planned	v2.0	--cache_cutoff
34	TODO	v2.2	--constraints_path	--constraints-path
35	Not planned	v2.0	--number_of_molecules			Removed. Supply multiple inputs to --smiles-columns if you have multiple molecules
36	Not planned	v2.0	--max_data_size			Removed. Use a splits file (--splits-file) to specify which datapoints you want to use.
37			Train Args
38	Done	v2.1	--config_path	--config-path
39	Done	v2.0	--data_path	--data-path	-i
40	Done	v2.0	--save_dir	--save-dir	-o	--output-dir also works and is preferred
41	In Progress	v2.2	--checkpoint_dir	--checkpoint		The three checkpoint arguments will be combined. Checkpoint will only be used for resuming training. Use --model-path to specify the model (.ckpt or .pt) for inference.
42		v2.2	--checkpoint_path
43		v2.2	--checkpoint_paths
44	Done	v2.0	--checkpoint_frzn	--model-frzn
45	Done	v2.0	--frzn_ffn_layers	--frzn-ffn-layers
46	TODO	v2.2	--freeze_first_only	--freeze-first-only
47	Done	v2.0	--save_preds			Predictions on the test set are now always saved.
48	TODO	v2.2	--resume_experiment	--resume-experiment
49	Done	v2.0	--ensemble_size	--ensemble-size
50	TODO	v2.2	--is_atom_bond_targets	--is-atom-bond-targets
51	TODO	v2.2	--no_adding_bond_types	--no-adding-bond-types
52	TODO	v2.2	--keeping_atom_map	--keeping-atom-map
53	TODO	v2.2	--no_shared_atom_bond_ffn	--no-shared-atom-bond-ffn
54	TODO	v2.2	--weights_ffn_num_layers	--weights-ffn-num-layers
55	Done	v2.0	--bias	--message-bias
56	Done	v2.0	--hidden_size	--message-hidden-dim
57	Done	v2.0	--depth	--depth
58	Done	v2.0	--undirected	--undirected
59	Done	v2.0	--dropout	--dropout
60	Done	v2.0	--mpn_shared	--mpn-shared
61	Done	v2.0	--activation	--activation
62	Done	v2.0	--aggregation	--aggregation		--agg also works
63	Done	v2.0	--aggregation_norm	--aggregation-norm
64	Done	v2.0	--atom_messages	--atom-messages
65	TODO	v2.2	--bias_solvent	--bias-solvent
66	TODO	v2.2	--hidden_size_solvent	--hidden-size-solvent
67	TODO	v2.2	--depth_solvent	--depth-solvent
68	Done	v2.0	--ffn_hidden_size	--ffn-hidden-dim
69	Done	v2.0	--ffn_num_layers	--ffn-num-layers
70	Done	v2.0		--no-batch-norm		v2 uses batch normalization between the aggregation and FFN by default
71	Done	v2.0	--multiclass_num_classes	--multiclass-num-classes
72	TODO	v2.2	--spectra_activation	--spectral_activation
73	Done	v2.0		--weight-column	-w
74	Done	v2.0	--target_columns	--target-columns
75	Done	v2.0	--ignore_columns	--ignore-columns
76	TODO	v2.2	--spectra_phase_mask_path	--spectra-phase-mask-path
77	Done	v2.0	--dataset_type	--task-type	-t
78	Done	v2.0	--loss_function	--loss-function	-l
79	Done	v2.1	--evidential_regularization	--evidential-regularization		--v-kl also works and is preferred, as it also controls the annealing coefficient in the Dirichlet method.
80	Done	v2.1		--eps		Evidential regularization epsilon
81	Done	v2.1		--alpha		Target error bounds for quantile interval loss
82	TODO	v2.2	--spectra_target_floor	--spectra-target-floor	-T
83	Done	v2.0	--metric	--metric		--metrics also works and is preferred. This was combined with --extra_metrics. Specify as many metrics as desired, the first metric given is used for early stopping.
84	Done	v2.0	--extra_metrics			Combined with --metric
85	Done	v2.1	--show_individual_scores	--show-individual-scores		Currently, predictions from each model in an ensemble are reported in separate CSV files. This may change in v2.1 when we add uncertainty estimation.
86	Done	v2.0	--target_weights	--task-weights
87	Done	v2.0	--warmup_epochs	--warmup-epochs
88	Done	v2.0	--init_lr	--init-lr
89	Done	v2.0	--max_lr	--max-lr
90	Done	v2.0	--final_lr	--final-lr
91	Done	v2.0	--epochs	--epochs
92	Done	v2.0		--patience		Used in early stopping
93	Done	v2.0	--grad_clip	--grad-clip
94	Done	v2.1	--class_balance	--class-balance
95	Done	v2.0	--split_type	--split-type		--split also works and is preferred
96	Done	v2.0	--split_sizes	--split-sizes
97	Done	v2.0	--split_key_molecule	--split-key-molecule
98	Done	v2.0	--num_folds	--num-folds	-k
99	Done	v2.0		--save-smiles-splits		Saves the smiles (as determined by RDKit) for each split to "train_smiles.csv", "val_smiles.csv", and "test_smiles.csv".
100	Done	v2.0	--folds_file	--splits-file		Data is now given as a single CSV and then split internally. The split can be specified via a separate splits file (see documentation for the format) or via a splits column in the data file. The splits apply to the data and to any separate features/descriptors files.
101		v2.0	--val_fold_index
102		v2.0	--test_fold_index
103		v2.0	--crossval_index_dir
104		v2.0	--crossval_index_file
105	Done	v2.0		--splits-column		A column with "train", "val", "test", or "" (blank) for each row in the data file to split the data.
106	Done	v2.0	--seed	--data-seed		Controls splitting and dataloader shuffling, defaults to 0
107	Done	v2.0	--pytorch_seed	--pytorch-seed		Controls model initialization and training steps taken, no default
108	Done	v2.0		--no-cache		The featurized molecules are now cached by default. Use this to turn that off to save memory.
109	Not planned	v2.0	--reaction			Not needed anymore in v2. Use --reaction-columns instead
110	Not planned	v2.0	--reaction_solvent			Not needed in v2. To train a model with reaction+solvent, put the reactions smiles and the solvent smiles in the data file and specify --reaction-columns and --smiles-columns
111	Not planned	v2.0	--data_weights_path			Removed. Use --weight-column to specify individual data weights
112	Not planned		--overwrite_default_bond_features			Use a custom featurizer to change the bond features
113	Not planned	v2.0	--features_only			This can be done by importing Chemprop as a module.
114	Not planned		--overwrite_default_atom_features			Use a custom featurizer to change the atom features
115	Not planned	v2.0	--separate_val_path	--separate-val-path		See the note above about --splits-file and --splits-column
116	Not planned	v2.0	--separate_test_path	--separate-test-path
117	Not planned	v2.0	--separate_val_features_path
118	Not planned	v2.0	--separate_test_features_path
119	Not planned	v2.0	--separate_val_phase_features_path
120	Not planned	v2.0	--separate_test_phase_features_path
121	Not planned	v2.0	--separate_val_atom_descriptors_path
122	Not planned	v2.0	--separate_test_atom_descriptors_path
123	Not planned	v2.0	--separate_val_atom_features_path
124	Not planned	v2.0	--separate_test_atom_features_path
125	Not planned	v2.0	--separate_val_bond_features_path
126	Not planned	v2.0	--separate_test_bond_features_path
127	Not planned	v2.0	--separate_val_constraints_path
128	Not planned	v2.0	--separate_test_constraints_path
129			Predict Args
130	Done	v2.0	--test_path	--test-path	-i
131	Done	v2.0	--preds_path	--preds-path	-o	--output also works and is preferred
132	Done	v2.0	--checkpoint_path	--model-path
133	Done	v2.0		--target-columns		Users can specify names for output columns. Otherwise defaults to "pred_0", "pred_1", etc.
134	Not planned	v2.1	--ensemble_variance
135	Done	v2.1	--individual_ensemble_predictions			Individual model predictions are always returned in a separate file
136	Done	v2.1	--uncertainty_method	--uncertainty-method
137	Done	v2.1	--calibration_method	--calibration-method
138	Done	v2.1	--evaluation_methods	--evaluation-methods
139	Not planned	v2.1	--evaluation_scores_path			Evaluation scores are now printed to the terminal by default
140	Done	v2.1	--uncertainty_dropout_p	--uncertainty-dropout-p
141	Done	v2.1	--dropout_sampling_size	--dropout-sampling-size
142	Done	v2.1	--calibration_interval_percentile	--calibration-interval-percentile
143	Done	v2.1		--conformal-alpha
144	TODO	v2.2	--regression_calibrator_metric	--regression-calibrator-metric
145	Done	v2.1		--cal-path
146	Done	v2.1	--calibration_features_path	--cal-descriptors-path
147	TODO	v2.2	--calibration_phase_features_path	--calibration-phase-features-path
148	Done			--cal-atom-features-path
149	Done	v2.1	--calibration_atom_descriptors_path	--cal-atom-descriptors-path
150	Done	v2.1		--cal-bond-features-path
151	TODO	v2.2	--calibration_bond_descriptors_path	--cal-bond-descriptors-path
152	Not planned	v2.0	--drop_extra_columns			This can be done via post processing on the command line. (e.g. in Linux, something like `awk -F ',' '{print $1 "," $3}' OFS=',' preds.csv > trimmed_preds.csv`)
153			Convert Args
154	Done	v2.0		--input-path	-i
155	Done	v2.0		--output-path	-o
156			Fingerprint Args
157	Done	v2.0	--test_path	--test-path	-i
158	Done	v2.0	--preds_path	--preds-path	-o	--output also works and is preferred
159	Done	v2.0	--checkpoint_path	--model-path
160	Done	v2.0	--fingerprint_type	--ffn-block-index		The index indicates which linear layer returns the encoding in the FFN. An index of 0 denotes the post-aggregation representation through a 0-layer MLP, while an index of 1 represents the output from the first linear layer in the FFN, and so forth.
161	TODO	v2.2	Other extra feature/descriptor options
162			Hyperopt Args
163	Done		--search_parameter_keywords	--search-parameter-keywords		ffn_hidden_size is now ffn_hidden_dim; hidden_size is now message_hidden_dim
164	Done		--log_dir	--hpopt-save-dir		Hyperparameter optimization has been migrated to Ray Tune. See the documentation for usage details.
165	Done		--hyperopt_checkpoint_dir	--hpopt-save-dir
166	Done		--num_iters	--raytune-num-samples
167	Done			--raytune-search-algorithm
168	Done			--raytune-num-workers
169	Done			--raytune-use-gpu
170	Done			--raytune-num-checkpoints-to-keep
171	Done			--raytune-grace-period
172	Done			--raytune-reduction-factor
173	Done			--hyperopt-n-initial-points
174	Done		--hyperopt_seed	--hyperopt-random-state-seed
175			--config_save_path
176			--manual_trial_dirs
177			--startup_random_iters
178			Interpret Args
179	Not planned	v2.1				Example jupyter notebooks show how to use external methods to interpret chemprop models.
180			Sklearn Train Args
181	Not planned		-	-	-	-
182			Sklearn Predict Args
183	Not planned		-	-	-	-
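
The straight renames in the table above lend themselves to a mechanical rewrite of old command lines. The following is a minimal sketch, not part of Chemprop: V1_TO_V2, REMOVED, and translate_args are hypothetical names, and the mapping covers only a handful of rows copied from the table. Arguments whose meaning or value format changed (for example --no_cuda, --gpu, or --folds_file) are deliberately left out and need manual attention.

```python
# Illustrative helper, not part of Chemprop. The mappings are copied from
# a few rows of the table above; extend them as needed.
V1_TO_V2 = {
    "--data_path": "--data-path",
    "--save_dir": "--save-dir",
    "--smiles_columns": "--smiles-columns",
    "--dataset_type": "--task-type",
    "--hidden_size": "--message-hidden-dim",
    "--ffn_hidden_size": "--ffn-hidden-dim",
    "--features_path": "--descriptors-path",
    "--seed": "--data-seed",
    "--target_weights": "--task-weights",
    "--checkpoint_frzn": "--model-frzn",
}

# A v1 flag the table lists as no longer needed (it took no value in v1).
REMOVED = {"--save_preds": "predictions on the test set are now always saved"}


def translate_args(v1_args):
    """Translate v1 command-line tokens to v2 names.

    Returns (v2_args, notes). Tokens that are in neither mapping are
    passed through unchanged; unknown underscore-style flags are noted.
    """
    v2_args, notes = [], []
    for token in v1_args:
        # Support both "--flag value" and "--flag=value" spellings.
        flag, sep, value = token.partition("=")
        if flag in REMOVED:
            notes.append(f"dropped {flag}: {REMOVED[flag]}")
        elif flag in V1_TO_V2:
            v2_args.append(V1_TO_V2[flag] + sep + value)
        else:
            if flag.startswith("--") and "_" in flag:
                notes.append(f"{flag} is not in the mapping; check the table")
            v2_args.append(token)
    return v2_args, notes


if __name__ == "__main__":
    old = ["--data_path", "data.csv", "--dataset_type", "regression",
           "--hidden_size", "300", "--save_preds"]
    new, notes = translate_args(old)
    print(" ".join(new))
    for note in notes:
        print("note:", note)
```

Keeping removed flags and unknown flags in separate notes makes it easier to see which parts of an old command still need a manual decision.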


1		v1	v2	Notes
2	FFN num layers	2	1
3	Epochs	30	50
4	Aggregation	mean	norm
5	Batch size	50	64
6	Batch norm	FALSE	FALSE	In v2.0.x, batch norm was on by default. It occasionally led to poor performance, so it is no longer the default.
7	evidential-regularization	0.2	0	Optimal value is dataset-dependent. 0.2 was the best value found in the original Soleimany paper.
8	HPO search space:			linked_hidden_size was removed as an option, the hidden size for message passing and the feed forward network are now always searched separately.
9	- batch size	range(5, 201, 5)	16, 32, 64, 128, 256
10	- dropout	range(0, 0.41, 0.005)	See note	Dropout is first turned on with a 50% chance; if it is turned on, the dropout value is chosen from the v1 values.
11	- learning rates		See documentation
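
The two-stage dropout draw described in the note above can be sketched in plain Python. This is only an illustration using the standard random module, not the Ray Tune search space that Chemprop defines; sample_dropout and sample_trial are hypothetical names, and the v1 grid is taken literally from range(0, 0.41, 0.005) in the table.

```python
import random

# Batch sizes searched in v2, as listed in the table above.
BATCH_SIZES = [16, 32, 64, 128, 256]

# The v1 dropout grid, reading range(0, 0.41, 0.005) literally:
# 0.0, 0.005, ..., 0.405 (82 values).
V1_DROPOUT_VALUES = [round(0.005 * k, 3) for k in range(82)]


def sample_dropout(rng):
    """Turn dropout on with 50% probability, then draw from the v1 grid."""
    if rng.random() < 0.5:
        return 0.0  # dropout turned off
    return rng.choice(V1_DROPOUT_VALUES)


def sample_trial(rng):
    """Draw one illustrative (batch_size, dropout) configuration."""
    return {
        "batch_size": rng.choice(BATCH_SIZES),
        "dropout": sample_dropout(rng),
    }


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so repeated runs give the same draws
    for _ in range(3):
        print(sample_trial(rng))
```

Because the v1 grid itself contains 0.0, a trial can end up with no dropout through either branch, so zero dropout is sampled slightly more than half the time under this reading.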


1	Will not be reimplemented v2:	web, sklearn
2	Not yet supported in v2 but will be:	interpret
3	Backwards incompatibility:	Splitting will give different results (astartes is now used for most splitting). A model whose training started in v1 cannot finish training in v2.
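
Since v2 splitting does not reproduce v1 splits, one way to keep an existing v1 partition is to write it into the data file and pass it through --splits-column, which the first table describes as holding "train", "val", "test", or blank for each row. The following is a minimal sketch, assuming the SMILES of each v1 split are still available as sets; label_rows and the column name "split" are hypothetical choices, and SMILES are matched as exact strings, so they must be written identically in both places.

```python
import csv
import io
import sys


def label_rows(rows, smiles_column, train, val, test, split_column="split"):
    """Return copies of rows with a split label added.

    rows: list of dicts, e.g. from csv.DictReader.
    train, val, test: sets of SMILES strings taken from the v1 splits.
    Rows whose SMILES is in none of the sets get a blank label.
    """
    labeled = []
    for row in rows:
        smi = row[smiles_column]
        if smi in train:
            label = "train"
        elif smi in val:
            label = "val"
        elif smi in test:
            label = "test"
        else:
            label = ""
        labeled.append({**row, split_column: label})
    return labeled


if __name__ == "__main__":
    # Tiny in-memory stand-in for a data file.
    data = io.StringIO("smiles,y\nCCO,0.1\nCCN,0.2\nCCC,0.3\nc1ccccc1,0.4\n")
    rows = list(csv.DictReader(data))
    out = label_rows(rows, "smiles", train={"CCO"}, val={"CCN"}, test={"CCC"})
    writer = csv.DictWriter(sys.stdout, fieldnames=list(out[0].keys()))
    writer.writeheader()
    writer.writerows(out)
```

The labeled file would then be given to v2 with the name of the added column passed to --splits-column; see the documentation for how blank entries are treated.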