1 of 11

Speaker Introduction

Rajneesh Tiwari

Co-founder: Bulian AI, CueNex

2 of 11

CueNex

Bulian AI

Low-code, API-first Synthetic Data

Decisioning made easy!

3 of 11

Mars Spectrometry: Gas Chromatography

8th place solution – Rajneesh Tiwari

4 of 11

Did Mars ever have environmental conditions that could have supported life?

  • NASA missions like the Curiosity and Perseverance rovers collect rock and soil samples and take measurements that can be used to determine their chemical makeup.
  • These samples can be analyzed for chemical signatures that indicate the environment's habitability, or potentially even signs of microbial life directly.
  • In this challenge, your goal is to build a model to automatically analyze mass spectrometry data for geological samples of scientific interest in understanding the present and past habitability of Mars.
  • Specifically, the model should detect the presence of certain families of chemical compounds in data collected from performing gas chromatography–mass spectrometry (GCMS) on a set of geological material samples.

5 of 11

  • Analysis begins with the gas chromatograph, where the sample is effectively vaporized into the gas phase and separated into its various components using a capillary column coated with a stationary phase. The compounds are propelled by an inert carrier gas such as helium, hydrogen or nitrogen.
  • Once the components leave the GC column, they are ionized and fragmented by the mass spectrometer using electron or chemical ionization sources. Ionized molecules and fragments are then accelerated through the instrument’s mass analyzer. It is here that ions are separated based on their different mass-to-charge (m/z) ratios.
  • The final steps of the process involve ion detection and analysis, with fragmented ions appearing as a function of their m/z ratios. Peak areas, meanwhile, are proportional to the quantity of the corresponding compound. When a complex sample is separated by GC-MS, it will produce many different peaks in the gas chromatogram and each peak generates a unique mass spectrum used for compound identification.

Data Generation: Gas Chromatography-Mass Spectrometry

6 of 11

  • The output measurements for a typical mass spectrometry experiment are visualized as a mass spectrum—a histogram with relative intensity on the y-axis and m/z on the x-axis. To infer the composition of the sample under analysis, scientists can use domain knowledge of how materials fragment under ionization or compare the mass spectrum to reference spectra measured from known substances.

  • When using gas chromatography, scientists can also use the time at which compounds are eluted (in addition to their mass and intensity) to identify them.

Data Sample

Example of a mass spectrum. This mass spectrum shows a large peak at m/z=147.0 and smaller peaks at 73.0, 233.0 and 40.0. Plotted data is for sample S0002 taken at time=9.9513.

Example visualization for sample S0002 showing intensities over time for all ions by mass, as a chromatogram. Each m/z is plotted as a separate time series, with m/z values of 147, 73, 233, 40, and 44 highlighted. In contrast to the previous mass spectrum, which shows a single time snapshot, we can see that these different masses peak at different times in the analysis run.
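The two views described above (a mass spectrum at one time snapshot, and a chromatogram with one time series per m/z) can be sketched from a long-format table. This is a minimal illustration on synthetic data, not the competition files: the column names `time`, `mass`, and `intensity`, the Gaussian elution peaks, and the helper names are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one sample (hypothetical data and column names).
rng = np.random.default_rng(0)
times = np.round(np.linspace(0.0, 10.0, 41), 4)
masses = np.array([40.0, 44.0, 73.0, 147.0, 233.0])
# Each mass gets a Gaussian elution peak centred at a different time.
centres = {40.0: 2.0, 44.0: 4.0, 73.0: 6.0, 147.0: 8.0, 233.0: 9.0}
rows = [
    (t, m, 100.0 * np.exp(-((t - centres[m]) ** 2) / 0.5) + rng.uniform(0, 1))
    for t in times
    for m in masses
]
sample = pd.DataFrame(rows, columns=["time", "mass", "intensity"])


def mass_spectrum(df, t):
    """Relative intensity per m/z at the scan closest to time t."""
    nearest = df["time"].iloc[(df["time"] - t).abs().argmin()]
    snap = df[df["time"] == nearest].set_index("mass")["intensity"]
    return snap / snap.max()


def chromatogram(df):
    """One intensity time series per m/z (rows: time, columns: mass)."""
    return df.pivot_table(index="time", columns="mass", values="intensity")


spectrum = mass_spectrum(sample, t=8.0)  # snapshot view
chrom = chromatogram(sample)             # time-series view
peak_times = chrom.idxmax()              # time at which each mass peaks
```

In this toy sample the snapshot at t=8.0 is dominated by one mass, while `peak_times` shows each mass peaking at a different time, which is the contrast the two figures illustrate.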

7 of 11

    • 101 X 351
    • 201 X 351
    • 201 X 501

Construct Mean/Max Intensity Spectrogram (scale: Time X Mass)

Feature Creation: Spectrogram like images and tabular features

    • Round mass
    • Drop Helium
    • Drop m/z > 350 (optional)

Clean up

Mean/max intensity by time & rounded mass

Aggregate

Minimum abundance subtracted for all observations

Remove b/g intensity

Min-max scale intensities (0-1)

Min-max Scale

Divide time into 0.25/0.5 sec buckets

Construct time-buckets

Spectrograms
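The spectrogram steps listed on this slide can be sketched roughly as follows. This is a hedged illustration, not the author's code: the function name, the column names (`time`, `mass`, `intensity`), the order of the steps, the use of m/z 4 for helium, the per-mass background subtraction, and the 0 to 50 time range are all assumptions chosen so that the output sizes match the 201 X 351 and 101 X 501 shapes mentioned above.

```python
import numpy as np
import pandas as pd


def build_spectrogram(df, time_step=0.25, max_time=50.0, max_mass=350, agg="max"):
    """Return a (time buckets x mass) array scaled to [0, 1]. Hypothetical helper."""
    df = df.copy()
    # Clean up: round mass, drop helium (assumed m/z 4), cap the mass and time range.
    df["mass"] = df["mass"].round().astype(int)
    keep = (
        (df["mass"] != 4) & (df["mass"] >= 0) & (df["mass"] <= max_mass)
        & (df["time"] >= 0) & (df["time"] <= max_time)
    )
    df = df[keep].copy()
    # Remove background: subtract the minimum abundance (assumed per mass channel).
    df["intensity"] -= df.groupby("mass")["intensity"].transform("min")
    # Min-max scale intensities to [0, 1] over the whole sample.
    low, high = df["intensity"].min(), df["intensity"].max()
    span = high - low
    df["intensity"] = (df["intensity"] - low) / (span if span > 0 else 1.0)
    # Construct time buckets, then aggregate (mean or max) by bucket and mass.
    df["bucket"] = np.floor(df["time"] / time_step).astype(int)
    grid = df.groupby(["bucket", "mass"])["intensity"].agg(agg)
    n_time = int(round(max_time / time_step)) + 1
    image = np.zeros((n_time, max_mass + 1), dtype=np.float32)
    buckets = grid.index.get_level_values("bucket").to_numpy()
    masses = grid.index.get_level_values("mass").to_numpy()
    image[buckets, masses] = grid.to_numpy()
    return image


# Synthetic demo data (not competition data).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "time": rng.uniform(0, 50, 5000),
    "mass": rng.uniform(0, 500, 5000),
    "intensity": rng.uniform(0, 1e4, 5000),
})
img = build_spectrogram(demo)                                    # shape (201, 351)
img_wide = build_spectrogram(demo, time_step=0.5, max_mass=500)  # shape (101, 501)
```

Cells with no observations stay at zero, so the result can be treated as a single-channel image; switching `agg` between `"mean"` and `"max"` gives the two variants named on the slide.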

  • Peaks: Peak identification, Time to peaks, Prominence, Height, Peak mass, etc.
  • Intensity Aggregations: Min, Max, Mean, Median, Skewness, Kurtosis across Time and Mass buckets

Tabular Features
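A possible way to compute the peak and aggregation features for a single m/z time series is sketched below, using `scipy.signal.find_peaks` and `scipy.stats`. The function name, the prominence threshold, the feature names, and the synthetic signal are assumptions for illustration; the slide does not specify how the features were actually computed.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import kurtosis, skew


def channel_features(times, intensities, prominence=0.1):
    """Peak and aggregate features for one m/z time series. Hypothetical helper."""
    peaks, props = find_peaks(intensities, prominence=prominence, height=0.0)
    feats = {
        "n_peaks": int(len(peaks)),
        "min": float(np.min(intensities)),
        "max": float(np.max(intensities)),
        "mean": float(np.mean(intensities)),
        "median": float(np.median(intensities)),
        "skew": float(skew(intensities)),
        "kurtosis": float(kurtosis(intensities)),
    }
    if len(peaks):
        # Describe the tallest peak: when it occurs, its height and prominence.
        top = int(np.argmax(props["peak_heights"]))
        feats["top_peak_time"] = float(times[peaks[top]])
        feats["top_peak_height"] = float(props["peak_heights"][top])
        feats["top_peak_prominence"] = float(props["prominences"][top])
    return feats


# Synthetic series with two elution peaks, at t=12 and t=30 (illustrative only).
t = np.linspace(0, 50, 201)
signal = np.exp(-((t - 12) ** 2) / 2) + 0.5 * np.exp(-((t - 30) ** 2) / 2)
feats = channel_features(t, signal)
```

Repeating this per rounded mass (or per mass bucket) and concatenating the dictionaries would give one feature row per sample.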

8 of 11

Model Pipeline – Vision Models

Spectrograms

K-folds

convnext_base_in22ft1k

convnext_tiny_in22ft1k

maxvit_tiny_rw_224

cait_s24_224

coatnet_1_rw_224

coatnet_0_rw_224

vit_small_r26_s32_224

vit_small_patch32_224

volo_d1_224

nf_regnet_b0

densenet121

eca_nfnet_l0

eca_ecaresnet50t

nf_regnet_b1

resnet50d

seresnext50d

2D Backbones

LSTM

1D LSTM

Concat

Multi-label predictions

2D Backbones

9 of 11

Model Pipeline: Tabular Models

Tabular Features

  • Peaks: Peak identification, Time to peaks, Prominence, Height, Peak mass, etc.
  • Intensity Aggregations: Min, Max, Mean, Median, Skewness, Kurtosis across Time and Mass buckets

K-folds

CatBoost

XGBoost

LogReg

Random Forest

LightGBM

Lasso

Multi-label predictions

Multi-label tabular models
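The K-fold, multi-label setup for the tabular models can be sketched with scikit-learn as below. This is a simplified illustration under several assumptions: synthetic data from `make_multilabel_classification`, one binary model per label, only two of the listed model families, and mean per-label log loss as the score. The helper names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold


def oof_predictions(make_model, X, Y, n_splits=5, seed=0):
    """Fit one binary model per label in each fold; return out-of-fold probabilities."""
    oof = np.zeros(Y.shape, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, valid_idx in folds.split(X):
        for j in range(Y.shape[1]):
            y_train = Y[train_idx, j]
            if y_train.min() == y_train.max():
                # Only one class seen in this fold: predict that constant.
                oof[valid_idx, j] = y_train[0]
                continue
            model = make_model()
            model.fit(X[train_idx], y_train)
            oof[valid_idx, j] = model.predict_proba(X[valid_idx])[:, 1]
    return oof


def mean_log_loss(Y, P):
    """Average of the per-label binary log losses."""
    P = np.clip(P, 1e-6, 1 - 1e-6)
    losses = [log_loss(Y[:, j], P[:, j], labels=[0, 1]) for j in range(Y.shape[1])]
    return float(np.mean(losses))


# Synthetic multi-label data standing in for the tabular features.
X, Y = make_multilabel_classification(
    n_samples=300, n_features=20, n_classes=4, random_state=0
)
models = {
    "logreg": lambda: LogisticRegression(max_iter=1000),
    "random_forest": lambda: RandomForestClassifier(n_estimators=100, random_state=0),
}
oof = {name: oof_predictions(factory, X, Y) for name, factory in models.items()}
scores = {name: mean_log_loss(Y, preds) for name, preds in oof.items()}
```

Keeping the out-of-fold predictions per model is what makes a later weight search over the ensemble possible without touching the test set.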

10 of 11

Greedy Hill-Climbing Ensemble Weight Optimization

CatBoost

XGBoost

LogReg

Random Forest

LightGBM

Lasso

convnext_base_in22ft1k

convnext_tiny_in22ft1k

maxvit_tiny_rw_224

cait_s24_224

coatnet_1_rw_224

coatnet_0_rw_224

vit_small_r26_s32_224

vit_small_patch32_224

volo_d1_224

nf_regnet_b0

densenet121

eca_nfnet_l0

eca_ecaresnet50t

nf_regnet_b1

resnet50d

seresnext50d

Greedy Hill Climbing via Optuna

Optimized Weights

(OOF)

Ensemble Predictions (Preds)

8th place (Pvt LB)

2D Backbone + LSTM

Tabular Models
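The idea of greedy hill-climbing over out-of-fold (OOF) predictions can be sketched in plain NumPy as below. Note that this is not the Optuna-based search named on the slide: it is a simple greedy selection with replacement, where a model's weight is how often it was picked. The function names, the log-loss objective, the number of rounds, and the synthetic predictions are all assumptions.

```python
import numpy as np


def log_loss(y, p, eps=1e-6):
    """Mean binary log loss over all samples and labels."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


def hill_climb_weights(oof_preds, y, n_rounds=50):
    """Greedy selection with replacement; weights are selection frequencies."""
    names = list(oof_preds)
    counts = dict.fromkeys(names, 0)
    blend_sum = np.zeros_like(y, dtype=float)
    best = np.inf
    for step in range(1, n_rounds + 1):
        # Score the blend that would result from adding each candidate once more.
        trial = {n: log_loss(y, (blend_sum + oof_preds[n]) / step) for n in names}
        pick = min(trial, key=trial.get)
        if trial[pick] >= best:
            break  # no candidate improves the current blend
        best = trial[pick]
        counts[pick] += 1
        blend_sum += oof_preds[pick]
    total = sum(counts.values())
    return {n: c / total for n, c in counts.items()}, best


# Synthetic OOF predictions of varying quality (illustrative only).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=(500, 3)).astype(float)


def noisy(scale):
    return np.clip(0.2 + 0.6 * y + rng.normal(0, scale, y.shape), 0.01, 0.99)


oof_preds = {"model_a": noisy(0.2), "model_b": noisy(0.3), "model_c": noisy(0.5)}
weights, blended_loss = hill_climb_weights(oof_preds, y)
blend = sum(w * oof_preds[n] for n, w in weights.items())
```

Because the first round always picks the best single model and later rounds only accept improvements, the blended OOF loss is never worse than the best individual model's; the same weights would then be applied to the test predictions.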

11 of 11

Thanks ☺