1 of 12

Large Scale Protein Modeling in DeepChem

Alana Xiang, Google Summer of Code 2021

Last Updated Aug 21

2 of 12

The Problem

  • Machine learning can help us gain new insights about proteins!
  • But machine learning is hard, and researchers may find it tedious to chain libraries together to make it work
  • Adding protein modeling support to DeepChem will make machine learning on proteins easier

3 of 12

The Project

Create a well-tested, end-to-end DeepChem workflow for protein modeling.

Roadmap:

  1. Data loading and featurization of protein sequence data (Revamp FASTALoader)
  2. Training on protein sequence data (BERT tokenizer+model wrappers, Protein BERT support)
  3. Add protein datasets to DeepChem’s MoleculeNet
  4. Stretch: Metrics and evaluation for protein data

  * = In Progress

4 of 12

FASTA loader

  • Added ability to use additional featurizers to process FASTA files
  • Added support for sharding to support efficient loading of large files
  • Added tests and documentation

The changes to the FASTA loader will let users featurize a series of large FASTA files and get them ready for training with only one line of code.
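
The idea behind sharded loading can be sketched in plain Python. This is a minimal illustration, not DeepChem's FASTALoader implementation: `read_fasta`, `featurize_in_shards`, and `toy_featurize` are hypothetical names made up for this example. It reads records one at a time and emits featurized shards of a fixed size, so the whole file never has to sit in memory at once.

```python
import io
from typing import Callable, Iterator, List, TextIO


def read_fasta(handle: TextIO) -> Iterator[str]:
    """Yield one sequence string per FASTA record, skipping header lines."""
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if chunks:
                yield "".join(chunks)
            chunks = []
        elif line:
            chunks.append(line)
    if chunks:
        yield "".join(chunks)


def featurize_in_shards(
    handle: TextIO,
    featurize: Callable[[str], List[int]],
    shard_size: int,
) -> Iterator[List[List[int]]]:
    """Featurize sequences lazily and yield them in shards of `shard_size`."""
    shard: List[List[int]] = []
    for sequence in read_fasta(handle):
        shard.append(featurize(sequence))
        if len(shard) == shard_size:
            yield shard
            shard = []
    if shard:
        # Emit the final, possibly smaller, shard.
        yield shard


def toy_featurize(sequence: str) -> List[int]:
    """Hypothetical featurizer: map each residue letter to an integer."""
    return [ord(residue) - ord("A") for residue in sequence]


fasta_text = ">seq1\nMKT\nAY\n>seq2\nGAV\n>seq3\nLLK\n"
shards = list(featurize_in_shards(io.StringIO(fasta_text), toy_featurize, shard_size=2))
print(len(shards))
```

Because each shard is produced by a generator, a caller could write every shard to disk before the next one is read.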

5 of 12

BERT Tokenizer Featurizer

  • Wrote a DeepChem wrapper class for HuggingFace's BertTokenizerFast, to allow:
    • Creation of BertTokenizer-based tokenizers, which will allow for training in DeepChem of:
      • Protein models, based on RostLab’s ProtBert
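
One practical detail when tokenizing for ProtBert: as far as I understand its published usage notes, the model expects residues separated by spaces, with the rare amino acids U, Z, O, and B mapped to X. The sketch below shows that preprocessing step only; `prepare_for_protbert` is a hypothetical helper name, not part of DeepChem or HuggingFace.

```python
import re


def prepare_for_protbert(sequence: str) -> str:
    """Format a raw protein sequence the way ProtBert-style tokenizers expect.

    Assumption: rare residues (U, Z, O, B) are replaced with X and every
    residue is separated by a single space.
    """
    cleaned = re.sub(r"[UZOB]", "X", sequence.upper())
    return " ".join(cleaned)


spaced = prepare_for_protbert("MKTUAY")
print(spaced)
```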

6 of 12

Goals for Final Evaluation

  • Improve FASTA Loader even further
    • My changes to the FASTALoader, which have already been pushed to DeepChem HEAD, make it possible to featurize a series of FASTA files with arbitrary DeepChem featurizers.
    • However, FASTALoader does not currently support sharding. This means that the loader may hang on sufficiently large files.
    • I am currently in the process of testing and refining sharding support on the FASTALoader.
  • Add a BertTokenizer-based BertFeaturizer to DeepChem
  • Add UniProt to MoleculeNet
  • Add metrics for the above models into DeepChem

7 of 12

Adding Bert Featurizer to DeepChem

  • Added Bert Featurizer, modeled after Seyone and Walid’s RobertaFeaturizer
  • Bert Featurizer now passes all unit tests
  • Results match outputs of HuggingFace’s tokenizer
  • Modified `__call__()` to operate DeepChemically by switching from inheritance to a has-a model (the featurizer holds a tokenizer instead of subclassing one).

8 of 12

run_bert.py

It is now possible to replace BertTokenizer with BertFeaturizer in a HuggingFace-style workflow, as demonstrated in run_bert.py.

9 of 12

Ensuring Compatibility

This week:

  • BertFeaturizer was modified to no longer inherit from BertTokenizer. Instead, it is passed a BertTokenizerFast object which it keeps as an attribute.
  • Changes were made to BertFeaturizer so that `__call__()` calls BertFeaturizer.featurize().
  • BertFeaturizer now functions with the existing FASTALoader.
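
The has-a design described in these bullets can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual BertFeaturizer code: `ToyTokenizer` is a hypothetical stand-in for a BertTokenizerFast object, and `SketchFeaturizer` is a made-up class that only mirrors the structure (tokenizer kept as an attribute, `__call__()` delegating to `featurize()`).

```python
from typing import Dict, List, Sequence


class ToyTokenizer:
    """Hypothetical stand-in for a HuggingFace tokenizer object."""

    def __init__(self, vocab: Dict[str, int]) -> None:
        self.vocab = vocab

    def __call__(self, text: str) -> Dict[str, List[int]]:
        # Unknown tokens map to id 0 in this toy vocabulary.
        ids = [self.vocab.get(token, 0) for token in text.split()]
        return {"input_ids": ids, "attention_mask": [1] * len(ids)}


class SketchFeaturizer:
    """Featurizer that owns a tokenizer instead of inheriting from one."""

    def __init__(self, tokenizer: ToyTokenizer) -> None:
        self.tokenizer = tokenizer  # composition: the has-a relationship

    def _featurize(self, datapoint: str) -> List[List[int]]:
        encoded = self.tokenizer(datapoint)
        return [encoded["input_ids"], encoded["attention_mask"]]

    def featurize(self, datapoints: Sequence[str]) -> List[List[List[int]]]:
        return [self._featurize(datapoint) for datapoint in datapoints]

    def __call__(self, datapoints: Sequence[str]) -> List[List[List[int]]]:
        # Calling the object is the same as calling featurize().
        return self.featurize(datapoints)


vocab = {"M": 1, "K": 2, "T": 3}
featurizer = SketchFeaturizer(ToyTokenizer(vocab))
features = featurizer(["M K T", "K Q"])
print(features)
```

Keeping the tokenizer as an attribute means the featurizer exposes only the featurizer interface, which is what lets a loader treat it like any other featurizer.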

Unfortunately, my DeepChem environment broke this week, and the new BertFeaturizer change took longer than expected. But now it is done!

10 of 12

Reviewing Our Todo List

  • Improve FASTA Loader even further (from first evaluation)
    • Fixing sharding support and getting ready to merge
  • Finalize BERT Featurizer
    • Refine documentation
  • Demonstrate HF training pipeline (from FASTA loader to output) using DeepChem BERT Featurizer
  • Add additional HF elements to DeepChem as deemed necessary for a DeepChemic protein training experience
  • Add UniProt to MoleculeNet
  • Add metrics for the above models into DeepChem (stretch)

11 of 12

Next Question

How can we get BertModel to take in a featurized DiskDataset instead of the dictionaries that it was passed in ProtBert’s HuggingFace pipeline?

A: We will wrap BertModel as a DeepChem model
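
One piece such a wrapper needs is a bridge from stored feature rows back to the keyword dictionaries a BERT model consumes. The sketch below is an assumption-heavy illustration, not the wrapper from this project: it assumes each featurized sample is stored as three parallel lists in the order `input_ids`, `token_type_ids`, `attention_mask`, and `iter_model_batches` is a hypothetical helper name.

```python
from typing import Dict, Iterator, List, Sequence

# Standard BERT input names; the storage order is an assumption of this sketch.
FIELD_NAMES = ("input_ids", "token_type_ids", "attention_mask")

Row = Sequence[Sequence[int]]


def iter_model_batches(
    rows: Sequence[Row], batch_size: int
) -> Iterator[Dict[str, List[Sequence[int]]]]:
    """Regroup per-sample feature rows into per-field batch dictionaries."""
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        yield {
            name: [row[index] for row in batch]
            for index, name in enumerate(FIELD_NAMES)
        }


# Toy feature rows standing in for the contents of a featurized dataset.
rows = [
    [[2, 5, 3], [0, 0, 0], [1, 1, 1]],
    [[2, 7, 3], [0, 0, 0], [1, 1, 1]],
    [[2, 9, 3], [0, 0, 0], [1, 1, 1]],
]
batches = list(iter_model_batches(rows, batch_size=2))
print(batches[0]["input_ids"])
```

In a real wrapper, each dictionary would be converted to tensors and unpacked into the model call; that step is left out here because it depends on the deep learning framework in use.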

12 of 12

Final Week!

  • Wrote a wrapper for BertModel that should allow BertModels to be trained natively in DeepChem
  • Improved BertFeaturizer to address comments from Bharath and Seyone
  • Wrote a data loader for FASTA files
  • https://github.com/deepchem/deepchem/pull/2667
  • https://github.com/deepchem/deepchem/pull/2666
  • https://github.com/deepchem/deepchem/pull/2642
  • Progress relative to initial goals:
    • The project deviated significantly from what was planned, but I think we still made a lot of progress!