Large Scale Protein Modeling in DeepChem
Alana Xiang, Google Summer of Code 2021
Last Updated Aug 21
The Problem
The Project
Create a well-tested, end-to-end DeepChem workflow for protein modeling.
Roadmap:
FASTA loader
The changes to the FASTA loader will let users featurize a series of large FASTA files and have them ready for training with a single line of code.
BERT Tokenizer Featurizer
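For the FASTA loader item on this roadmap, here is a rough sketch of the kind of work a loader does before featurization: reading several FASTA files and streaming out (header, sequence) records. It is plain Python written for illustration, not DeepChem's loader; the names parse_fasta and load_fasta_files are hypothetical.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from the lines of one FASTA file."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def load_fasta_files(paths: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Stream records from several FASTA files, one file at a time."""
    for path in paths:
        with open(path) as handle:
            yield from parse_fasta(handle)
```

Because both functions are generators, only one record is held in memory at a time, which is the property that matters for large files.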
Goals for Final Evaluation
Adding BertFeaturizer to DeepChem
run_bert.py
It is now possible to replace BertTokenizer with BertFeaturizer in a HuggingFace-style workflow, as demonstrated in run_bert.py.
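One way to picture the swap is as an adapter: an object that keeps the tokenizer's call behavior while also offering a featurize method. The sketch below uses a toy tokenizer and a hypothetical TokenizerFeaturizer class; it is not the BertFeaturizer code and does not reproduce run_bert.py. The space-separated residue input is an assumption modeled on ProtBert-style preprocessing.

```python
from typing import Callable, Dict, List

# Hypothetical toy vocabulary: special tokens plus a few amino acids.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "M": 4, "K": 5, "T": 6, "V": 7}

Encoding = Dict[str, List[int]]


def toy_tokenizer(sequence: str) -> Encoding:
    """Stand-in for a real tokenizer: space-separated residues to ids."""
    tokens = ["[CLS]"] + sequence.split() + ["[SEP]"]
    ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}


class TokenizerFeaturizer:
    """Hypothetical adapter that exposes a tokenizer as a featurizer."""

    def __init__(self, tokenizer: Callable[[str], Encoding]) -> None:
        self.tokenizer = tokenizer

    def featurize(self, sequences: List[str]) -> List[Encoding]:
        # Featurizer-style entry point: one encoding per input sequence.
        return [self.tokenizer(seq) for seq in sequences]

    def __call__(self, sequence: str) -> Encoding:
        # Same call shape as the tokenizer, so existing call sites keep working.
        return self.tokenizer(sequence)


featurizer = TokenizerFeaturizer(toy_tokenizer)
encoded = featurizer("M K T V")
```

Keeping `__call__` compatible with the tokenizer is what makes the replacement a drop-in change in this sketch.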
Ensuring Compatibility
This week:
Unfortunately, my DeepChem environment broke this week, and the new BertFeaturizer change took longer than expected. But now it is done!
Reviewing Our Todo List
Next Question
How can we get BertModel to take in a featurized DiskDataset instead of the dictionaries that it was passed in ProtBert’s HuggingFace pipeline?
A: We will wrap BertModel as a DeepChem model.
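A minimal sketch of the wrapping idea, under stated assumptions: each featurized sample is taken to hold three rows in the order input_ids, token_type_ids, attention_mask, and the wrapper rebuilds the keyword dictionary the model expects. DatasetModelWrapper, toy_model, and the row layout are all hypothetical; the real wrapper would build on DeepChem's model classes.

```python
from typing import Callable, Dict, List, Sequence

# Assumed order of the rows stored for each featurized sample.
FIELDS = ("input_ids", "token_type_ids", "attention_mask")

Batch = List[List[List[int]]]  # samples x fields x tokens


def toy_model(input_ids: List[List[int]],
              token_type_ids: List[List[int]],
              attention_mask: List[List[int]]) -> List[int]:
    """Stand-in for a keyword-argument model: counts unmasked tokens."""
    return [sum(mask) for mask in attention_mask]


class DatasetModelWrapper:
    """Hypothetical wrapper that feeds array-style batches to a dict-style model."""

    def __init__(self, model: Callable[..., List[int]],
                 fields: Sequence[str] = FIELDS) -> None:
        self.model = model
        self.fields = tuple(fields)

    def to_kwargs(self, X: Batch) -> Dict[str, List[List[int]]]:
        # Regroup per-sample rows into one batch-level list per field.
        return {name: [sample[i] for sample in X]
                for i, name in enumerate(self.fields)}

    def predict_on_batch(self, X: Batch) -> List[int]:
        return self.model(**self.to_kwargs(X))


X = [
    [[2, 4, 3], [0, 0, 0], [1, 1, 1]],  # sample 1: no padding
    [[2, 3, 0], [0, 0, 0], [1, 1, 0]],  # sample 2: one padded position
]
wrapper = DatasetModelWrapper(toy_model)
predictions = wrapper.predict_on_batch(X)
```

The point of the design is that the dataset can stay array-shaped on disk while the conversion to named inputs happens in one place, inside the wrapper.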
Final Week!