Table of Contents

Terms of Use

General User Policy

For academic and non-profit institutions

For commercial and for-profit institutions

Citation Policy

Overview

Input File

Using Binner

Input Tab

Output Options Tab

Data Cleaning Tab

Feature Grouping Tab

Annotation

Annotation File

Annotation Tab

Output File

Summary Tab

Mass Difference Distribution

Bin Statistics

Correlations

Correlations by Cluster

Correlations by Bin

Mass Differences

Mass Differences by Cluster

Mass Differences by Bin

Intensities

Molecular Ions

Additional Columns

Supplemental Information

Cutoff Score

Clustering

Sub-clustering

Annotation Preferences

Isotope Detection


Terms of Use

General User Policy

For academic and non-profit institutions

  • Permission is granted to access, use, and/or download the Tool for internal use only.
  • If the user desires to create derivative works of the Tool, source code or access to data may be made available through request by sending an email to binner-help@umich.edu.
  • Use of the Tool must be acknowledged in resulting publications (see citation policy below).

For commercial and for-profit institutions

  • Permission is granted to access, use, and/or download the Tool for internal use only.
  • To create derivative works of the Tool for commercial purposes, source code or access to data may be permitted through negotiation for a commercial license.  Please send request to binner-help@umich.edu.

Citation Policy

Please cite:

Please note that due to continuous software upgrades the images in this handout may differ from what you see on the screen.


Overview

When metabolites are analyzed by electrospray ionization-mass spectrometry (ESI-MS), they are often detected as multiple ion species due to the presence of isotopologues, adducts, and in-source fragments that cannot be readily identified.  Binner is a Java application that takes a numerical feature table obtained from any preprocessing software (e.g. XCMS, MZmine) as input and outputs a file with clusters of closely eluting, highly correlated metabolite features and their putative annotations, along with their pairwise correlations and mass differences.  

Binner aims to accurately annotate suspected chemical adducts and decipher the underlying metabolite neutral mass.  Below is an overview of the Binner workflow.

This diagram shows the workflow, starting with an input data file, followed by data cleaning, binning, isotope detection, clustering, computing mass differences, annotation, and finally the output.

Input File

Binner accepts .txt and .csv files where each row represents one feature.  The file must include the following columns:

  • A column containing the feature name.  Any name or identifier, including a chemical formula, compound name, or other identifier, is accepted.

Note:  The names need to be machine-readable, as certain special symbols and characters may cause a problem.

  • A column containing mass to charge ratios (m/z).
  • A column containing retention times (RT).  The recommendation is to provide retention time in minutes.  

Note:  If seconds are used, the user must update Binner’s default settings accordingly.

  • A column for each sample containing the intensity for each associated feature.  A minimum of 3 samples is necessary, although 20 or more is recommended.
  • Additional columns can be included in the input data file and then carried over to the output file.  

The order of the above listed columns is flexible, as Binner requires column mapping on the Input Tab.

This is an example of the input file format, showing the various columns that are needed.
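
The requirements above can be checked before launching Binner.  The following is a minimal sketch, not part of Binner itself.  The column names (Feature, mz, RT) and the example rows are hypothetical, and the sketch assumes that every column other than those three is a sample column (a file with additional carry-over columns would need those excluded first).

```python
import csv
import io

# Hypothetical column names; in Binner the user maps columns on the Input Tab.
REQUIRED = ("Feature", "mz", "RT")
MIN_SAMPLES = 3  # minimum stated in this manual; 20 or more is recommended


def check_feature_table(text, delimiter=","):
    """Return a list of problems found in a feature table supplied as text."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    problems = [f"missing column: {name}" for name in REQUIRED if name not in header]
    # Assumption: every non-required column holds sample intensities.
    sample_cols = [name for name in header if name not in REQUIRED]
    if len(sample_cols) < MIN_SAMPLES:
        problems.append(
            f"only {len(sample_cols)} sample columns; need at least {MIN_SAMPLES}"
        )
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} fields, found {len(row)}"
            )
    return problems


# Made-up rows for illustration only: one feature per row.
example = (
    "Feature,mz,RT,S1,S2,S3\n"
    "F001,181.0707,1.52,1200,1350,1180\n"
    "F002,203.0526,1.53,400,455,390\n"
)
print(check_feature_table(example))
```

An empty list means the file passed these basic checks; it does not guarantee that the feature names are free of problematic special characters.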

Using Binner

After launching Binner, the user will see a series of tabs.  The user can step through Binner using the tabs at the top of the window or the Next>> button at the bottom right of the window.

Input Tab

  1. Under “Import Experimental Data,” use the “Import…” button to find and load the input data file.
  2. Under “Specify Columns,” use the dropdown arrows next to each column type to select the specific column from the input file that maps to the column type.
  3. To exclude one or more samples from the analysis, click the “Exclude Samples…” button.
     a. A popup window will appear listing the sample names from the input file (see image below).
     b. Check the boxes next to those samples that are to be excluded.
     c. Click the “OK” button.
  4. Under “Specify Ionization Mode” on the Input tab, select whether the mass spectrometry ionization mode was positive or negative.
  5. To move to the next section, click the “Next>>” button or click on a different tab at the top of the window.

Output Options Tab

  1. If the input file contained additional columns to be included in the output file, click the “Include Additional Columns…” button under “Specify Additional Columns for Output.”
     a. A popup window will appear listing the column headings from the input file (see image below).
     b. Check the box next to a column to add it to the output file.
     c. Click the “OK” button.
  2. Under “Select Location” on the Output Options tab, use the “Browse…” button to choose where to save the output file.
  3. Under “Customize Output Tabs,” check the boxes next to the information to be included in the output file.  Each option will appear as a separate tab in the output file.  To learn more about each tab option, see the Output File section of this document.
     a. Correlations – contains correlations between features.
     b. Mass Differences – contains the mass differences between features.
     c. Intensities – contains the signal intensity (abundance) for each feature.
     d. Features/Molecular Ions – contains (a) only the putative molecular ions annotated by Binner; (b) unannotated features.
     e. Mass Diff Distribution – contains the distribution of the mass differences across the entire data set.

Note: The number and types of tabs selected will impact processing time.

  4. To move to the next section, click the “Next>>” button or click on a different tab at the top of the window.

Data Cleaning Tab

  1. Under “Set Filtering Options,” adjust the filtering options to handle outliers and missing data.
     a. By default, Binner identifies intensities more than 4.0 standard deviations from the feature mean and marks these as missing.  The user can change this filter by entering a different number in the text box.
     b. By default, Binner identifies missing intensity values and removes features that have more than 30% of data missing (including outliers identified above).  The user can change this filter by entering a different number in the text box.
     c. Any intensities marked as missing that still remain in the data set are imputed with the feature median.
  2. If data are not yet normalized, under “Set Normalization Options,” check the box next to “Log-Transform Data (using ln(1+x)).”

Note:  Data normalization is recommended.  If data are already normalized, uncheck this box.

  3. Under “Set Deisotoping Options,” click the check box next to “Deisotope Data” to perform deisotoping as part of the data cleaning process.  The user can also choose to eliminate isotopes from the mass difference distribution using the “Deisotope Mass Difference Distribution” check box.  Select the appropriate parameters to use in the detection and removal of isotopes within a bin.
     a. By default, Binner removes isotopes with a mass tolerance of 0.002 Daltons or lower.  The user can change this parameter by entering a different number in the “Mass Tolerance” text box.
     b. By default, Binner removes isotopes with a correlation of 0.6 or higher.  The user can change this parameter by entering a different number in the “Correlation Cutoff” text box.
     c. By default, Binner removes isotopes that elute within a retention time gap of 0.1 minutes or less.  The user can change this parameter by entering a different number in the “Maximum Retention Time Difference” text box.
  4. To move to the next section, click the “Next>>” button or click on a different tab at the top of the window.
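
The filtering and normalization steps above can be illustrated with a short sketch.  This is not Binner's code: the function names are hypothetical, missing values are represented as None, and the use of the sample standard deviation and the order of operations (outlier marking, feature removal, median imputation, then ln(1+x)) are assumptions based on the description in this section.

```python
import math
import statistics


def mark_outliers(values, sd_cutoff=4.0):
    """Replace intensities more than sd_cutoff SDs from the feature mean with None."""
    observed = [v for v in values if v is not None]
    if len(observed) < 2:
        return list(values)
    mean = statistics.mean(observed)
    sd = statistics.stdev(observed)  # assumption: sample standard deviation
    if sd == 0:
        return list(values)
    return [
        None if v is not None and abs(v - mean) > sd_cutoff * sd else v
        for v in values
    ]


def clean_table(table, sd_cutoff=4.0, max_missing=0.30, log_transform=True):
    """Clean a {feature name: [intensity or None, ...]} table."""
    cleaned = {}
    for name, values in table.items():
        values = mark_outliers(values, sd_cutoff)
        missing = sum(v is None for v in values)
        if missing / len(values) > max_missing:
            continue  # feature removed: too much missing data
        median = statistics.median(v for v in values if v is not None)
        values = [median if v is None else v for v in values]  # median imputation
        if log_transform:
            values = [math.log1p(v) for v in values]  # ln(1 + x)
        cleaned[name] = values
    return cleaned
```

With the defaults, a feature with 1 missing value out of 10 samples is kept and imputed, whereas a feature with 4 missing values out of 10 is removed.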

Feature Grouping Tab

  1. Under “Set Binning Gap,” specify the retention time gap size to be used in determining the bins.  A gap size of 0.03 means that a new bin will be formed whenever there is a gap of 0.03 minutes or more between two consecutive retention times.

Note: The gap size is in minutes (default = 0.03 minutes).

  2. Under “Select Correlation Measure,” select either Pearson’s Correlation or Spearman’s Rank Correlation.
  3. Under “Choose Bins to Cluster,” select the appropriate options for the data.  This parameter helps to avoid unnecessary clustering of very highly correlated bins with closely eluting features.
     a. Bins Below Cutoff Score – will not cluster bins above a specified cutoff score.
        i. The default score is 2, but the user can change this using the Cutoff Value slider.
        ii. See the Cutoff Score section under Supplemental Information for more details about the cutoff score.
     b. Bins Above Cutoff Size – will not cluster bins below a specified cutoff size.
        i. The default size is 5, but the user can change this using the Cutoff Value slider.
     c. All Bins – all bins will be clustered.

Note:  Clusters are determined using Weighted Silhouette.  See the Clustering section under Supplemental Information for more details.

  4. Under “Sub-Divide Clusters,” select whether the clusters will be sub-divided by hierarchical clustering on retention time (“Cluster on RT”) or re-binned based on retention time gaps (“Rebin on RT”).  This parameter helps sub-divide clusters that encompass a wide retention time range and could include multiple molecular ions.
     a. If “Cluster on RT” is selected, set the desired parameters for gap size (see the Sub-clustering section for details).
     b. If “Rebin on RT” is selected, Binner will break on gaps larger than the RT gap.
  5. To move to the next section, click the “Next>>” button or click on a different tab at the top of the window.
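
The binning rule described under “Set Binning Gap” can be sketched as follows.  This is an illustration only; the function name is hypothetical, and the sketch assumes that features are compared in order of increasing retention time.

```python
def bin_by_rt_gap(retention_times, gap=0.03):
    """Group feature indices into bins; a new bin starts at each RT gap >= gap (minutes)."""
    order = sorted(range(len(retention_times)), key=lambda i: retention_times[i])
    bins = []
    for i in order:
        # Compare with the previous feature in retention time order.
        if bins and retention_times[i] - retention_times[bins[-1][-1]] < gap:
            bins[-1].append(i)
        else:
            bins.append([i])
    return bins


# Made-up retention times (minutes) for illustration.
rts = [1.00, 1.01, 1.02, 1.10, 1.11, 2.00]
print(bin_by_rt_gap(rts))
```

Each inner list holds the positions of the features that fall into the same bin, so features that elute in an unbroken run end up together regardless of their order in the input file.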

Annotation

Annotation File

The annotation file represents existing knowledge about common adducts, neutral losses, etc.  For some users, this file will be minimal, while for other users it will be extensive.  The annotation file is a .txt file that includes:

  • A column containing the annotations.
  • Individual columns for mass, mode, and charge for each annotation.
      ◦ If the Mass column contains a negative number, Binner will interpret it as a loss group.
  • A column containing tier information (“1” or “2”).
      ◦ Tier 1 charge carriers:
          ▪ Are allowed as a base of an annotation group, e.g., [M+H] for the most intense feature of the annotation group is allowed only if H is a tier 1 charge carrier.
          ▪ Allow complex annotations to appear even in the absence of simpler ones.  In particular, an annotation of the form [M+CC+NM] is allowed even when [M+CC] does not appear in that annotation group.  Here, M is the putative molecular ion, CC is the tier 1 charge carrier (the same one in both instances), and NM is a putative neutral mass (gain or loss).
      ◦ Tier 2 charge carriers:
          ▪ Are not allowed as a base of an annotation group.
          ▪ Do not allow complex annotations ([M+CC+NM]) to appear in the absence of simpler ones ([M+CC]).
      ◦ Note that in neither case is a charge carrier allowed as a multimer base.  In other words, base annotations such as [2M+H] or [3M+K] are not considered for any annotation group.

The file is split into 2 sections:

  • Charge-carrying adducts that help form the hypothesis for what the underlying adducts are in the data.
  • Neutral additions and losses that complement previously found adducts.
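
The tier rules and the negative-mass convention described above can be summarized as three small predicates.  This is a sketch of the rules as worded in this section, not Binner's implementation, and the function names are hypothetical.

```python
def is_loss_group(mass):
    """A negative value in the Mass column is interpreted as a loss group."""
    return mass < 0


def may_be_group_base(tier):
    """Only tier 1 charge carriers may serve as the base of an annotation group."""
    return tier == 1


def complex_annotation_allowed(tier, simple_annotation_present):
    """Decide whether [M+CC+NM] may appear for a charge carrier CC.

    Tier 1: allowed even when [M+CC] is absent from the annotation group.
    Tier 2: allowed only when [M+CC] is present in the annotation group.
    """
    return tier == 1 or simple_annotation_present
```

For example, a tier 2 charge carrier whose simple [M+CC] annotation is absent yields False from complex_annotation_allowed, so the corresponding [M+CC+NM] annotation would not be reported.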

Annotation Tab

  1. Under “Set Parameter Values,” change the parameters as appropriate for the data.
     a. By default, Binner looks for annotations that have a mass error of 0.002 Daltons or less.  The user can change this parameter by entering a different number in the “Annotation Mass Tolerance” text box.
     b. By default, Binner looks for annotations that have a retention time tolerance of 0.1 minutes.  The user can change this parameter by entering a different number in the “Annotation RT Tolerance” text box.
  2. Under “Select Adduct/Fragment File (must be .txt),” use the “Import…” button to find and load the annotation file.
  3. Under “Specify Columns,” use the dropdown arrows next to each column type to select the specific column from the annotation file that maps to the column type.
  4. Under “Set Annotation Preferences,” select options as appropriate.
     a. If “Use Neutral Masses To Help Determine Best Charge Carrier” is selected, Binner will include neutral masses to determine the carrier with the maximum number of annotation “hits.”  See the Supplemental Annotation Preferences section for more information.
     b. If “Allow Variable Charge Without Isotope Information” is selected, all charge carriers in the annotation file will be included in the candidate set regardless of the presence or absence of evidence for isotopes.  If it is not selected, Binner excludes multiply charged carriers from the candidate set unless evidence for isotopes suggests their inclusion.  See the Supplemental Annotation Preferences section for more information.
  5. Under “Generate Annotated Report,” click the “Run Analysis” button.
  6. A progress bar will appear, keeping the user informed of the progress of the analysis.

Output File

Binner generates an .xlsx file with multiple tabs. This file includes the Summary tab, the Mass Diff Distribution tab, and the Bin Statistics tab, plus tabs specified by the user on the Binner Output Options Tab.  The file will be saved at the location specified by the user on the Binner Output Options Tab.

Summary Tab

The first tab is the Summary tab that lists parameters used during the analysis.  This includes:

  • The version of Binner used during the analysis.
  • Information about the input file.
  • Data cleaning information, such as the number of removed features, whether or not normalization was performed, and parameters used for deisotoping.
  • Information about the Annotation file, including the ionization mode.
  • Annotation preferences.
  • The location of the saved output file.
  • Data analysis information, such as the gap size between bins, the clustering method, and the number of groups and annotations.

Mass Difference Distribution

The Mass Difference Distribution tab (“Mass Diff Distribution”) shows a distribution of the bin-wise mass differences for the entire data set (or the range specified in the Binner Output Options Tab), highlighting the common mass differences.  This allows the user to see which mass differences appear more frequently than would be expected by chance.  If not already included in the annotation file, the user can add them for future use.  Additionally, this allows for assessment of mass accuracy for the input feature table.

  • The Mass Differences column shows the distribution of mass differences.
      ◦ Each dash represents a number of times a specific mass difference appears in the data set.
      ◦ Each color represents a continuous range of mass.
  • The Count column shows the actual number of times the mass difference appears in the data set.
  • The Annotation column provides the annotation, if it exists in the user-provided annotation file; otherwise “no annotation” is used.
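
The idea behind this tab can be sketched in a few lines: compute every pairwise m/z difference within each bin and count how often each value occurs.  This is an illustration only; the function name is hypothetical, and rounding to two decimal places is an assumed way of grouping similar differences, not necessarily the grouping Binner uses.

```python
from collections import Counter
from itertools import combinations


def mass_difference_counts(bins, decimals=2):
    """Count pairwise m/z differences within each bin, rounded to `decimals` places."""
    counts = Counter()
    for mz_values in bins:
        for low, high in combinations(sorted(mz_values), 2):
            counts[round(high - low, decimals)] += 1
    return counts


# Made-up m/z values grouped into two bins, for illustration.
bins = [[181.0707, 203.0526], [255.2324, 277.2143, 256.2358]]
for difference, count in mass_difference_counts(bins).most_common():
    print(difference, count)
```

A difference that recurs across many bins, such as the value near 21.98 in this made-up example, is a candidate for inclusion in the annotation file if it is not already there.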

Bin Statistics

The Bin Statistics tab provides valuable statistics on the Binner results.  This includes:

  • Total number of bins and clusters, plus summary statistics on the features per bin and per cluster.
  • Cluster size distribution.
  • Bin size distribution.
  • Clusters per bin distribution.

Correlations

Correlation analysis results are visualized as a heatmap.  There are several options for the correlations output, each appearing as a separate tab.  Data can be presented by cluster or by bin.  There are 2 color options:  local or absolute.  Absolute color uses the range of correlation coefficients found in the entire dataset, whereas local color uses the range of the actual data values found within a cluster or bin.  Within each Correlations by Bin tab, data can be sorted by retention time or by cluster.

Select the desired output options on the Output Options Tab in Binner:

This is a screenshot of the Correlations section from the Binner Output Options Tab. It has the various options, each with a check box next to it.

Correlations by Cluster

The Correlations by Cluster tab (“Corrs by clust”) should be used for initial review of the annotation results.

  • See the Supplemental Isotope Detection section for information on how isotopes are detected within a bin.
  • In the Isotopes column:
      ◦ Parent ions are highlighted green.
      ◦ Isotopes are highlighted blue.
      ◦ Isotopes are labeled with a lower case “i.”
      ◦ Adducts are labeled with an upper case “M.”
  • In the Annotations and Derivations columns:
      ◦ Putative molecular ions are in cells highlighted green.
      ◦ Likely redundant features are in cells highlighted yellow.
  • In the Correlations columns:
      ◦ The redder the cell, the stronger the correlation.
      ◦ The greener the cell, the weaker the correlation.

Correlations by Bin

The Correlations by Bin tab (“Corrs by bin”) has similar information to the Correlations by Cluster tab, except the information is organized by bin instead of by cluster.  This tab allows users to evaluate feature relationships in a broader context.  Binner provides an option to sort the information in this tab by retention time or cluster.  

  • See the Supplemental Isotope Detection section for information on how isotopes are detected within a bin.
  • In the Isotopes column:
      ◦ Parent ions are highlighted green.
      ◦ Isotopes are highlighted blue.
      ◦ Isotopes are labeled with a lower case “i.”
      ◦ Adducts are labeled with an upper case “M.”
  • In the Adducts/NLs and Derivations columns:
      ◦ Putative molecular ions are in cells highlighted green.
      ◦ Likely redundant features are in cells highlighted yellow.
  • In the Correlations columns:
      ◦ The redder the cell, the stronger the correlation.
      ◦ The greener the cell, the weaker the correlation.
  • When sorted by cluster, the clusters are color-coded to easily determine which features belong to the same cluster (see image below).

Mass Differences

The mass differences output is provided by cluster or by bin.  When provided by bin, the user can sort the information by retention time, mass difference, or cluster.  

Select the desired output options on the Output Options Tab in Binner:

This is a screenshot of the Mass Differences section from the Binner Output Options Tab. It has the various options, each with a check box next to it.

Mass Differences by Cluster

The Mass Differences by Cluster tab (“Mass diffs by clust”) shows the differences in masses between features within the same cluster.  

  • Commonly recognized mass differences are highlighted with different colors, representing different mass relationships.  For example, the mass difference between a sodium adduct and a protonated adduct is 21.98.  In the below image, all mass differences close to 21.98 are the same color.
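
The 21.98 value quoted above follows from the monoisotopic masses of sodium and hydrogen.  The short sketch below works through the arithmetic and shows one way to test whether two features differ by that amount.  The function name is hypothetical, and reusing the 0.002 Dalton annotation mass tolerance as the matching window is an assumption.

```python
MASS_H = 1.00782503    # monoisotopic mass of hydrogen (1H), in Daltons
MASS_NA = 22.98976928  # monoisotopic mass of sodium (23Na), in Daltons

# A sodium adduct and a protonated adduct of the same molecule carry the same
# charge, so the electron mass cancels and the difference is Na minus H.
NA_MINUS_H = MASS_NA - MASS_H  # approximately 21.9819 Daltons


def looks_like_na_vs_h(mz_a, mz_b, tol=0.002):
    """True if two singly charged features differ by the Na/H mass difference."""
    return abs(abs(mz_a - mz_b) - NA_MINUS_H) <= tol


print(round(NA_MINUS_H, 4))
print(looks_like_na_vs_h(181.0707, 203.0526))  # made-up example pair
```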

Mass Differences by Bin

The Mass Differences by Bin tab (“Mass diffs by bin”) has similar information to the Mass Differences by Cluster tab, except it shows the differences in masses between features within the same bin.  In this tab, the user can sort the information by retention time, mass difference, or cluster.

  • Commonly recognized mass differences are highlighted with different colors, representing different mass relationships (see below image).

Intensities

The Intensities tabs (“Unadj intensities” and “Adj intensities”) provide the signal intensity for each feature.  The intensity values in the adjusted tab contain the imputed and log-transformed data.  The intensity values in the unadjusted tab are the original input data.

Molecular Ions

Depending on the user’s Binner Output Options Tab selection, this will include putative molecular ions, unannotated features, or both (each as a separate tab).

Additional Columns

Each tab includes the below additional columns.  To view these columns, it may be necessary to manually adjust the column widths as some are mostly “hidden” by default.  The user will find these collapsed columns between the “Mass Error” and “Bin” columns.

  • Molecular Ion Number – reports the “index” of the corresponding molecular ion.
  • Charge Carrier – shows the charge carrier for the corresponding annotation.
  • Adduct/NL – reports the neutral addition or loss portion of the annotation.
  • Mass Error – provides the difference between the observed m/z value and the theoretical  m/z value.

Supplemental Information

Cutoff Score

If a bin already has highly correlated and closely eluting features, there is no advantage to clustering.  As a result, the Feature Grouping tab has the “Bins Below Cutoff Score” option which tells Binner not to cluster bins above a specific cutoff score.  Binner’s default cutoff score is 2.  The cutoff score is equal to:

The formula for the cutoff score has the average correlation squared in the numerator. The denominator is the log2 of n square root of maximum retention time minus minimum retention time.

Where:

  • Corravg is the average of the off-diagonal correlations of the bin correlation matrix.
  • n is the number of features within the bin.
  • rtmax is the maximum retention time (in minutes) of the bin features.
  • rtmin is the minimum retention time (in minutes) of the bin features.
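
As a rough illustration, the cutoff score can be computed as shown below.  The sketch assumes one reading of the verbal formula above, namely Corravg squared divided by the product of log2(n) and the square root of (rtmax minus rtmin); this reading is an assumption and should be checked against the formula shown in the application documentation.  The function name is hypothetical.

```python
import math


def cutoff_score(corr_matrix, retention_times):
    """Cutoff score for one bin under the ASSUMED reading:

        score = Corravg**2 / (log2(n) * sqrt(rtmax - rtmin))
    """
    n = len(retention_times)
    rt_range = max(retention_times) - min(retention_times)
    if n < 2 or rt_range <= 0:
        raise ValueError("need at least two features spanning a nonzero RT range")
    # Average of the off-diagonal entries of the bin correlation matrix.
    off_diagonal = [corr_matrix[i][j] for i in range(n) for j in range(n) if i != j]
    corr_avg = sum(off_diagonal) / len(off_diagonal)
    return corr_avg ** 2 / (math.log2(n) * math.sqrt(rt_range))


# Made-up bin: three features, all pairwise correlations 0.9, spanning 0.04 minutes.
corr = [[1.0, 0.9, 0.9], [0.9, 1.0, 0.9], [0.9, 0.9, 1.0]]
print(cutoff_score(corr, [1.00, 1.02, 1.04]))
```

Under this reading, tightly eluting and highly correlated bins score high, which is consistent with the “Bins Below Cutoff Score” option leaving such bins unclustered.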

Clustering

After binning and deisotoping, Binner clusters features within each bin using a hierarchical clustering algorithm with average linkage.  Feature-to-feature distances are represented by the Euclidean distance between correlation vectors.  If $r_{kl}$ is the Pearson correlation between feature k and feature l, then the distance between feature i and feature j is calculated as:

$$d_{ij} = \sqrt{\sum_{k} \left(r_{ik} - r_{jk}\right)^{2}}$$

Stopping condition: To determine the optimal number of clusters, Binner utilizes a weighted variant of the classic silhouette diagnostic (1). In the traditional approach, at each clustering step a silhouette value ($s_k$) is calculated for each clustered feature:

$$s_k = \frac{b_k - a_k}{\max(a_k, b_k)}$$

where:

$a_k$ is the average distance between feature k and all other features in the same cluster

$b_k$ is the minimum, over all other clusters, of the average distance between feature k and the features in that cluster

Typically the cluster configuration with the highest average silhouette is chosen as optimal.
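
The classic (unweighted) silhouette calculation described above can be sketched as follows.  This is an illustration rather than Binner's implementation: the function names are hypothetical, the distance sums over every column of the correlation matrix (an assumption), and features in singleton clusters are given a silhouette of 0 by convention.

```python
import math


def correlation_distance(corr, i, j):
    """Euclidean distance between rows i and j of a correlation matrix."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(corr[i], corr[j])))


def silhouette_values(dist, labels):
    """Classic silhouette (b_k - a_k) / max(a_k, b_k) for every feature."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    values = []
    for k, label in enumerate(labels):
        own = [m for m in clusters[label] if m != k]
        if not own or len(clusters) < 2:
            values.append(0.0)  # convention for singletons or a single cluster
            continue
        a_k = sum(dist[k][m] for m in own) / len(own)
        b_k = min(
            sum(dist[k][m] for m in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a_k, b_k)
        values.append((b_k - a_k) / denom if denom > 0 else 0.0)
    return values


# Made-up correlation matrix with two obvious groups: features {0, 1} and {2, 3}.
corr = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
dist = [[correlation_distance(corr, i, j) for j in range(4)] for i in range(4)]
good = silhouette_values(dist, [0, 0, 1, 1])
print(sum(good) / len(good))
```

Comparing the average silhouette across candidate configurations (for example, labels [0, 0, 1, 1] versus [0, 1, 0, 1]) shows why the configuration with the highest average is normally preferred.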

Silhouette weights: One drawback of the $s_k$ silhouette statistic is that it may select a configuration with only 2 clusters when the data naturally falls into a grouping where one cluster is much larger than the rest.  In this case, a variant silhouette method (2) that downweights large clusters can be corrective.

However, for the problem at hand, a disadvantage of the fully weighted diagnostic is that it may overcorrect, splitting the data into many extremely small groups with only 2 or 3 features per cluster.  To avoid both under- and over-splitting, Binner uses a hybrid statistic governed by a weight parameter.

By examining the cluster size distribution as the weight varies from its minimum (corresponding to $s_k$) to 1.0 (corresponding to the fully weighted diagnostic), Binner selects both a weight and a corresponding cluster configuration that best captures the underlying correlation structure.

Sub-clustering

Some correlation-based clusters determined at the clustering step may still span a retention time range not typically observed in co-eluting features.  To account for this, initial clusters are further divided by rebinning (splitting on any within-cluster gaps greater than the specified RT gap) or by sub-clustering on RT. In the latter case, Binner again utilizes its hierarchical clustering algorithm with feature-to-feature distance measure $d_{ij} = |RT_i - RT_j|$, where $RT_j$ is the retention time for feature j.

Gap rule: Binner’s RT sub-clustering is tunable via a “gap rule” option. This rule specifies both a minimum and a maximum gap size; any natural gaps identified by RT clustering that are smaller than the minimum will not be split; conversely, the data will be split on any gap larger than the maximum even if the clustering algorithm does not identify this gap as a natural split point. 
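
The gap rule can be expressed as a small decision function.  The sketch below is illustrative only: the function names are hypothetical, the minimum and maximum gap values in the example are made up, and the split points proposed by the RT clustering are passed in rather than computed.

```python
def should_split(gap, min_gap, max_gap, clustering_suggests_split):
    """Apply the gap rule to one gap between consecutive retention times."""
    if gap < min_gap:
        return False  # too small: never split, even if clustering suggests it
    if gap > max_gap:
        return True   # too large: always split, even if clustering does not
    return clustering_suggests_split


def split_on_rt(retention_times, min_gap, max_gap, suggested_splits=()):
    """Divide retention times into sub-clusters.

    suggested_splits holds positions i (in sorted order) where the RT clustering
    proposes a split between item i and item i + 1.
    """
    rts = sorted(retention_times)
    if not rts:
        return []
    groups = [[rts[0]]]
    for i in range(len(rts) - 1):
        gap = rts[i + 1] - rts[i]
        if should_split(gap, min_gap, max_gap, i in suggested_splits):
            groups.append([rts[i + 1]])
        else:
            groups[-1].append(rts[i + 1])
    return groups


# Made-up example: clustering proposes splits after positions 0 and 1.
print(split_on_rt([1.00, 1.01, 1.05, 1.30], min_gap=0.02, max_gap=0.2,
                  suggested_splits={0, 1}))
```

In this example the first proposed split is rejected because the gap is below the minimum, the second is accepted, and the final gap is split because it exceeds the maximum even though no split was proposed there.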

Annotation Preferences

Binner provides two options for users who wish to customize the global annotation behavior of the program.

The first option controls the use of neutral masses in the annotation process.  If this option is selected, the process of determining the best base adduct of an annotation group will include consideration of secondary annotations of the form [M+CC+NM] in addition to those of the form [M+CC].  Here, M is the putative molecular ion, CC is the putative charge carrier, and NM is a putative neutral mass (gain or loss).  If the option is not selected, these types of secondary annotations will not be considered.

The second option controls the handling of multiple charges in annotation bases.  By default, this option is not selected.  This means that multiple charges (z=2 or z=3) are only allowed for the bases of annotation groups when isotope information specifically indicates that they are present.  See the Isotope Detection section for details.  When a putative annotation base has isotope information indicating a multiple charge, this charge is specifically assigned to the feature.  In the absence of such isotope information, the feature will be assigned a charge of z=1.  However, if the user wishes to loosen this restriction, selecting the second option will allow all charges (z=1, z=2 and z=3) to be considered regardless of any information provided by isotope detection.

Isotope Detection

This is an image of a PowerPoint slide that has information on isotope detection.  The slide is titled "C13-based Isotopic Detection." It includes a graph with relative intensity on the y-axis and m/z on the x-axis. It points out the tallest peak as being the monoisotopic mass.  The image includes the detection rules (also described in the text below the image).  The bottom of the slide says "Approximately 1% of natural carbon is C13 (instead of C12)."

As shown above, the rules used to detect isotopes within a bin are:

  • Elute within a small retention time gap.  By default, this is 0.1 minutes.
  • Meet a correlation threshold.  By default, this is 0.6.
  • Meet the assumption of dwindling intensity, so each subsequent isotope is less intense than the previous one.
  • Determine the charge state of a compound using the difference in m/z between two suspected isotopes.  For example, if the difference is 0.5 Daltons, it is a doubly charged compound, because the C13 to C12 spacing of roughly 1 Dalton is divided by a charge of 2.
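
The rules above can be combined into a short sketch.  It is not Binner's code: the function names and the dictionary keys ('mz', 'rt', 'intensity') are hypothetical, and the 0.002 tolerance on the isotope spacing is borrowed from the default deisotoping mass tolerance as an assumption.

```python
C13_C12_DIFF = 1.003355  # mass difference between C13 and C12, in Daltons


def infer_charge(mz_spacing, max_charge=3, tol=0.002):
    """Return the charge z whose expected isotope spacing (1.003355 / z) matches."""
    for z in range(1, max_charge + 1):
        if abs(mz_spacing - C13_C12_DIFF / z) <= tol:
            return z
    return None


def is_isotope_pair(parent, candidate, correlation, rt_gap=0.1, corr_cutoff=0.6):
    """Check the four detection rules for a parent/candidate pair.

    parent and candidate are dicts with hypothetical keys 'mz', 'rt', 'intensity'.
    """
    close_rt = abs(candidate["rt"] - parent["rt"]) <= rt_gap
    correlated = correlation >= corr_cutoff
    dwindling = candidate["intensity"] < parent["intensity"]
    charge = infer_charge(candidate["mz"] - parent["mz"])
    return close_rt and correlated and dwindling and charge is not None


# Made-up features for illustration.
parent = {"mz": 181.0707, "rt": 1.52, "intensity": 1000.0}
candidate = {"mz": 182.0741, "rt": 1.53, "intensity": 65.0}
print(is_isotope_pair(parent, candidate, correlation=0.9))
print(infer_charge(0.5017))
```

A spacing close to 1.0034 corresponds to a singly charged compound, close to 0.5017 to a doubly charged compound, and close to 0.3345 to a triply charged compound.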

References

(1) Rousseeuw, P. J. Journal of Computational and Applied Mathematics 1987, 20, 53-65.

(2) Starczewski, A.; Krzyżak, A.; Springer International Publishing: Cham, 2016, pp 114-124.
