1 of 9

Applied Bioinformatics 2025
Week 3 Session 1
Introduction to MSstats

Natalie Turner, PhD

Postdoctoral Fellow – Yates Lab

Department of Molecular Medicine

naturner@scripps.edu

2 of 9

3 of 9

MSstats and MSstatsConvert

MSstats family packages work with label-free, Selected Reaction Monitoring (SRM), and Tandem Mass Tag (TMT) datasets

They take raw MS quantification data as input (compatible with multiple upstream processing software tools)

MSstatsConvert enables reformatting of virtually any MS quantification result into a format required by MSstats

e.g., ‘DIANNtoMSstatsFormat’ takes the results file from DIA-NN and formats/cleans it for further processing in MSstats

4 of 9

Features

Peptide Spectral Match (PSM)- and protein-level filtering,

managing shared peptides,

removing decoy, iRT, and other irrelevant sequences,

removing features or proteins with a low number of measurements,

aggregating duplicated measurements,

handling fractionation by removing overlapped features,

creating balanced statistical design in the presence of missing data.

5 of 9

DIANNtoMSstatsFormat

input	name of the MSstats input report from DIA-NN, which includes feature-level data.
annotation	name of 'annotation.txt' data which includes Condition, BioReplicate, Run. Run should be the same as filename.
global_qvalue_cutoff	global q-value cutoff
qvalue_cutoff	local q-value cutoff for library
pg_qvalue_cutoff	local q-value cutoff for protein groups
useUniquePeptide	logical. If TRUE, peptides assigned to more than one protein are removed, so only unique peptides are used.
removeFewMeasurements	logical. Should proteins with few measurements be removed?
removeOxidationMpeptides	logical. Should peptides with oxidation be removed?
removeProtein_with1Feature	logical. Should proteins with a single feature be removed?
use_log_file	logical. If TRUE, information about data processing will be saved to a file.
append	logical. If TRUE, information about data processing will be added to an existing log file.
verbose	logical. If TRUE, information about data processing will be printed to the console.
log_file_path	character. Path to a file to which information about data processing will be saved. If not provided, such a file will be created automatically. If append = TRUE, has to be a valid path to a file.
MBR	logical. TRUE if the analysis was done with match between runs (MBR).
quantificationColumn	column used for quantified intensities: 'FragmentQuantCorrected' (default) or 'FragmentQuantRaw'.
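
A minimal sketch of how these arguments might be passed in the notebook. The file names are hypothetical placeholders, the argument values are chosen for illustration only, and the call is written as `MSstats::DIANNtoMSstatsFormat` on the assumption that your installed MSstats version exports the converter (otherwise try the MSstatsConvert namespace). The conversion only runs if the package and both files are available.

```r
# Hypothetical file names: replace with your own DIA-NN report and annotation.
report_path     <- "diann_report.tsv"
annotation_path <- "annotation.csv"

# Converter arguments collected in one list so they are easy to vary.
# Values are illustrative, not recommendations.
converter_args <- list(
  global_qvalue_cutoff       = 0.01,
  qvalue_cutoff              = 0.01,
  pg_qvalue_cutoff           = 0.01,
  useUniquePeptide           = TRUE,
  removeFewMeasurements      = TRUE,
  removeOxidationMpeptides   = TRUE,
  removeProtein_with1Feature = TRUE,
  MBR                        = TRUE,
  quantificationColumn       = "FragmentQuantCorrected",
  use_log_file               = FALSE
)

# Only attempt the conversion when MSstats is installed and the files exist.
can_run <- requireNamespace("MSstats", quietly = TRUE) &&
  file.exists(report_path) && file.exists(annotation_path)

if (can_run) {
  report     <- read.delim(report_path, check.names = FALSE)
  annotation <- read.csv(annotation_path)

  msstats_input <- do.call(
    MSstats::DIANNtoMSstatsFormat,
    c(list(input = report, annotation = annotation), converter_args)
  )
  print(head(msstats_input))
}
```

Keeping the arguments in a list makes it simple to change one cutoff at a time and rerun the conversion.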

6 of 9

dataProcess: Clean, normalize and summarize before differential analysis

raw	name of the raw (input) data set.
logTrans	base of logarithm transformation: 2 (default) or 10.
normalization	normalization to remove systematic bias between MS runs. Three normalizations are supported: 'equalizeMedians' (default) performs constant normalization (equalizing the medians) based on reference signals; 'quantile' performs quantile normalization based on reference signals; 'globalStandards' normalizes using global standard proteins. If FALSE, no normalization is performed.
featureSubset	"all" (default) uses all features in the data set. "top3" uses the 3 features with the highest average log-intensity across runs. "topN" uses the N features with the highest average log-intensity across runs; it requires the n_top_feature option. "highQuality" flags uninformative features and outliers.
remove_uninformative_feature_outlier	optional. Only required if featureSubset = "highQuality". TRUE removes 1) noisy features (flagged in the column feature_quality with "Uninformative") and 2) outliers (flagged in the column is_outlier with TRUE) before run-level summarization. FALSE (default) uses all features and intensities for run-level summarization.
min_feature_count	optional. Only required if featureSubset = "highQuality". Defines a minimum number of informative features a protein needs to be considered in the feature selection algorithm.
n_top_feature	optional. Only required if featureSubset = 'topN'. In that case, it specifies the number of top features that will be used. Default is 3, which means the top 3 features are used.
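
To make the idea behind 'equalizeMedians' concrete, here is a small base R sketch on a toy matrix of log2 intensities. It is a simplified illustration of constant (median) normalization, not MSstats' own implementation; the matrix and its names are made up.

```r
# Toy log2 intensities: rows are features, columns are MS runs (made-up values).
log_int <- matrix(
  c(20, 21, 22,
    22, 23, 24,
    24, 25, 26),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("feature", 1:3), paste0("run", 1:3))
)

# Median log-intensity of each run.
run_medians <- apply(log_int, 2, median, na.rm = TRUE)

# Shift every run by a constant so that all run medians match a common target
# (here: the median of the run medians).
target     <- median(run_medians)
normalized <- sweep(log_int, 2, run_medians - target)

# After normalization every run has the same median.
print(apply(normalized, 2, median))
```

Each run is shifted by a single constant, so differences between features within a run are left unchanged.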

7 of 9

dataProcess: Clean, normalize and summarize before differential analysis

summaryMethod	"TMP" (default) means Tukey's median polish, which is a robust estimation method. "linear" uses a linear mixed model.
equalFeatureVar	only for summaryMethod = "linear". Logical variable for whether equal variance is assumed among intensities from different features. TRUE (default) assumes equal variance among intensities from features. FALSE means that equal variance cannot be assumed, so the model accounts for heterogeneous variation among features.
censoredInt	Missing values are censored or at random. 'NA' (default) assumes that all 'NA's in the 'Intensity' column are censored. '0' uses zero intensities as censored intensities; in this case, NA intensities are missing at random. The output from Skyline should use '0'. NULL assumes that all NA intensities are randomly missing.
MBimpute	only for summaryMethod = "TMP" and censoredInt = 'NA' or '0'. TRUE (default) imputes 'NA' or '0' (depending on the censoredInt option) using an accelerated failure time model. FALSE uses the values assigned by cutoffCensored.
remove50missing	only for summaryMethod = "TMP". TRUE removes the proteins where every run has at least 50% missing values for each peptide. FALSE is default.
fix_missing	Optional, same as the 'fix_missing' parameter in MSstatsConvert::MSstatsBalancedDesign function
maxQuantileforCensored	Maximum quantile for deciding censored missing values, default is 0.999
numberOfCores	Number of cores for parallel processing. When > 1, a logfile named 'MSstats_dataProcess_log_progress.log' is created to track progress. Only works for Linux & Mac OS. Default is 1.
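
As an illustration of what the "TMP" summary does, the sketch below runs Tukey's median polish with base R's `stats::medpolish` on toy log2 intensities for a single protein. The data are made up, and taking the overall effect plus the column (run) effect as the run-level summary is a simplified view of the method rather than a copy of MSstats' internals.

```r
# Toy log2 intensities for one protein: rows are features, columns are runs.
# The last value is a deliberate outlier.
protein <- matrix(
  c(10, 11, 12,
    12, 13, 14,
    11, 12, 30),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("feature", 1:3), paste0("run", 1:3))
)

# Tukey's median polish: decomposes the matrix into overall, row (feature),
# and column (run) effects plus residuals.
fit <- medpolish(protein, na.rm = TRUE, trace.iter = FALSE)

# Run-level summary: overall effect plus the effect of each run.
run_summary <- fit$overall + fit$col
print(run_summary)

# For comparison, plain column means are pulled up by the outlier.
print(colMeans(protein))
```

On this toy matrix the median-polish summaries are 11, 12 and 13, while the plain mean for run3 is inflated to about 18.7 by the single outlying value.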

8 of 9

Go to your R Notebook

Work through the data import, cleaning, and processing steps using MSstats and MSstatsConvert
Explore the different arguments in DIANNtoMSstatsFormat and dataProcess; understand the various parameters that can be changed.
Try changing some of the q-value cutoffs, the number of features, or the normalization method, and see what effect this has on the output (ProcessedData$FeatureLevelData$Protein and ProcessedData$FeatureLevelData$Peptide)
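
One possible way to organize the exploration is sketched below: a named list of alternative dataProcess settings is looped over and the size of each result is compared. The settings chosen and the use of the SRMRawData example data set shipped with MSstats are assumptions for illustration; in the notebook you would substitute the output of DIANNtoMSstatsFormat. The processing only runs if MSstats is installed.

```r
# Alternative dataProcess settings to compare (illustrative choices).
settings <- list(
  default  = list(),
  quantile = list(normalization = "quantile"),
  top3     = list(featureSubset = "top3")
)

if (requireNamespace("MSstats", quietly = TRUE)) {
  # Example data assumed to ship with MSstats; replace with your converter output.
  utils::data("SRMRawData", package = "MSstats", envir = environment())

  # Run dataProcess once per setting, keeping everything else fixed.
  results <- lapply(settings, function(extra) {
    do.call(
      MSstats::dataProcess,
      c(list(raw = SRMRawData, use_log_file = FALSE, verbose = FALSE), extra)
    )
  })

  # Compare how many feature-level and protein-level rows each setting produces.
  print(sapply(results, function(res) {
    c(features = nrow(res$FeatureLevelData),
      proteins = nrow(res$ProteinLevelData))
  }))
}
```

Changing one argument per entry keeps it clear which setting is responsible for any difference in the output.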

9 of 9

Next Session

Differential abundance analysis and Volcano plots
Capstone assignment walk-through