1 of 24

ASCOT 3: NONLINEAR PRINCIPAL COMPONENTS ANALYSIS AND UNCERTAINTY QUANTIFICATION IN EARLY LIFECYCLE SPACECRAFT FLIGHT SOFTWARE COST ESTIMATION

FIRST INTERNATIONAL BOEHM FORUM ON COCOMO AND SYSTEMS AND SOFTWARE COST MODELING

NOVEMBER 9-10, 2022

Sam Fleischer, PhD, Samuel.R.Fleischer@jpl.nasa.gov*

Patrick Bjornstad, Patrick.T.Bjornstad@jpl.nasa.gov

Jairus Hihn, PhD, Jairus.M.Hihn@jpl.nasa.gov

NASA Jet Propulsion Laboratory

California Institute of Technology

Pasadena, CA 91109

James Johnson, James.K.Johnson@nasa.gov*

National Aeronautics and Space Administration

Washington, DC 20546

*Corresponding Authors

2 of 24

OVERVIEW

  • Challenges in spacecraft flight software cost estimation
  • Why does ASCoT exist?
  • Bayesian regression and improving our understanding of uncertainty
  • Non-numerical data and Nonlinear Principal Components Analysis
  • k-Nearest-Neighbors and Clustering algorithms

3 of 24

CHALLENGES IN SPACECRAFT FLIGHT SOFTWARE COST ESTIMATION

  • Requirements are not known at early phases of the mission, and architecture trade studies are routine.
  • Software estimation is, to some degree, fundamentally uncertain even under the best conditions.
  • It is difficult to budget under a large amount of uncertainty.
  • Budget ‘bogies’ get set very early in the lifecycle… sometimes based on casual conversation… and project managers will want to hold you to that number.
  • Current proposal and planning processes encourage (even demand) under-estimation.

4 of 24

WHY DOES ASCOT EXIST? (1/2)

  • ASCoT was created to enable estimators to better embrace the uncertainty
  • ASCoT expands the range of cost estimation models to include formal analogic cost estimation, which can be better suited to early project formulation
    • Analogic models can perform much better than parametric models with sparse, noisy data
    • Analogic models represent what is known in the very early lifecycle more accurately than parametric models
  • ASCoT provides models that only require basic system-level inputs that are known in the early lifecycle

5 of 24

WHY DOES ASCOT EXIST? (2/2) – THE DATASAURUS DOZEN

All of these datasets have identical summary statistics (x and y means, x and y standard deviations, and correlation) when rounded to the nearest 100th, yet they look completely different when plotted.

6 of 24

BAYESIAN REGRESSION AND IMPROVING OUR UNDERSTANDING OF UNCERTAINTY

  • When regression is appropriate, ASCoT improves on parametric models by quantifying the full uncertainty in the regression.
    • Epistemic uncertainty is uncertainty in the model parameters or model form.
    • Aleatoric uncertainty is uncertainty inherent to the data-generating process (i.e., the distribution around the mean line).
  • Bayesian statistics lets us set smart priors based on expert opinion before ingesting any data (see the sketch below).
    • “In the absence of data, what is appropriate to assume?”
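As an illustration, here is a minimal brms sketch of encoding expert opinion as priors before any data arrive. The prior values and the data frame d (columns x and y, as constructed on the backup slides) are illustrative assumptions, not ASCoT's actual specification.

library(brms)

# Hypothetical expert-opinion priors (illustrative values only).
expert_priors <- c(
  prior(normal(1, 0.5), class = "b"),         # slope believed to be near 1
  prior(normal(0, 2), class = "Intercept"),   # weakly informative intercept
  prior(exponential(1), class = "sigma")      # positive, modest spread
)

# sample_prior = "only" ignores the likelihood and draws from the priors
# alone, answering: in the absence of data, what do we assume?
m0 <- brm(y ~ x, data = d, prior = expert_priors, sample_prior = "only")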

7 of 24

BAYESIAN CER – POSTERIOR DISTRIBUTION

[Figure: posterior distribution of the CER parameters]

8 of 24

BAYESIAN CER – POSTERIOR PREDICTIVE DISTRIBUTION

[Figure: posterior predictive distribution with credibility intervals]

9 of 24

BAYESIAN CER – POSTERIOR PREDICTIVE DISTRIBUTION

[Figure: posterior predictive credibility intervals]

  • A model with a skew normal error term performs better predictively than a model with a normal error term (a comparison sketch follows this list).
    • It captures low outliers without pulling the median prediction down.
  • Simple regression performs better predictively than regressions that add other perceived software cost drivers, such as number of instruments, destination, or redundancy (in short: the extra drivers invite overfitting).
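A minimal comparison sketch, assuming the same illustrative data frame d as on the backup slides; the actual ASCoT CER specification and scoring are not reproduced here.

library(brms)

# Fit the same regression with a normal and a skew normal error term.
m_norm <- brm(y ~ x, data = d, family = gaussian())
m_skew <- brm(y ~ x, data = d, family = skew_normal())

# Compare out-of-sample predictive performance via approximate
# leave-one-out cross-validation; the top-ranked model predicts better.
loo_compare(loo(m_norm), loo(m_skew))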

10 of 24

K-NEAREST-NEIGHBORS AND CLUSTERING ALGORITHMS – INPUT VARIABLES (1/3)

  • Inheritance (as-is or modified code from a previous mission)
    • Theoretically, this is a number between 0% and 100%.
    • In practice, Project Software Systems Engineers (PSSEs) have only rough estimates. We categorize them into five bins:
      • “Very Low to None,” “Low,” “Medium,” “High,” “Very High”
  • Mission Size (total mission cost, including operations)
    • Theoretically, this is a precise positive number.
    • In practice, we have only rough estimates of what the total cost will be.
    • However, we have a very good idea of the cost target or mission class. The categories are:
      • “Small,” “Medium,” “Large,” “Very Large”

11 of 24

K-NEAREST-NEIGHBORS AND CLUSTERING ALGORITHMS – INPUT VARIABLES (2/3)

  • Mission Type
    • “Orbiter/flyby,” “Observatory,” “Lander,” “Rover”
  • Redundancy
    • “Single String” (no backup computer on board), “Dual String – Cold” (backup on board but nominally off), “Dual String – Warm” (backup maintaining continuous operations)
  • Destination
    • “Earth,” “Inner Planetary,” “Asteroid/Comet,” “Outer Planetary”
  • Number of Instruments (particle detectors, magnetometers, spectrometers, and other scientific instruments)
  • Number of Deployables (solar arrays, booms, arms, etc.)

12 of 24

K-NEAREST-NEIGHBORS AND CLUSTERING ALGORITHMS – INPUT VARIABLES (3/3)

Numerical Variables

  • Number of Instruments
  • Number of Deployables

Nominal and Categorical Variables

  • Inheritance
  • Mission Size
  • Mission Type
  • Redundancy
  • Destination

How do you calculate the “distance” between missions with non-numerical data?

Is the “distance” between 2 instruments and 3 instruments the same as the “distance” between 3 instruments and 4?

13 of 24

HOW DO YOU NUMERICIZE CATEGORICAL DATA?

  • kNN and Clustering algorithms need numbers, so we need to quantify the non-numerical data.

  • We let the data tell us the best way to quantify it.
    • We rely on a Nonlinear Principal Components Analysis (NLPCA) algorithm to learn the optimal weights (a minimal encoding sketch follows below).
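A minimal encoding sketch, assuming indicator (one-hot) coding as the starting point (the slides do not specify ASCoT's exact encoding); the mission values are made up.

# Hypothetical categorical mission data (values are illustrative).
d_cat <- data.frame(
  inheritance = factor(c("Low", "Medium", "High", "Medium", "Low")),
  size        = factor(c("Small", "Large", "Medium", "Small", "Large"))
)

# Full indicator (one-hot) coding: one 0/1 column per category level,
# which the NLPCA can then reweight.
X <- model.matrix(~ . - 1, d_cat,
                  contrasts.arg = lapply(d_cat, contrasts, contrasts = FALSE))
print(X)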


14 of 24

NONLINEAR PRINCIPAL COMPONENTS ANALYSIS – AUTO-ASSOCIATIVE NEURAL NETWORKS (ANN)

[Figure: auto-associative neural network architecture, with a low-dimensional bottleneck layer]

ANN parameters are optimized such that the difference between the output layer and the input layer is minimized.

Goal: the low-dimensional bottleneck layer must adequately retain the information contained in the input layer.

Result: A non-numeric input layer can be projected onto a numeric, low-dimensional space.
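A toy auto-associative network in base R, reusing the indicator matrix X from the sketch on the previous slide: a two-unit tanh bottleneck trained by minimizing reconstruction error with optim(). This is a hedged stand-in under those assumptions; ASCoT's real NLPCA architecture and training are not listed in the slides.

set.seed(1)
p <- ncol(X); k <- 2   # k = bottleneck (low-dimensional) layer size

# Unpack a flat parameter vector into encoder/decoder weight matrices.
unpack <- function(w) list(We = matrix(w[1:(p * k)], p, k),
                           Wd = matrix(w[(p * k + 1):(2 * p * k)], k, p))

# Reconstruction loss: encode through the tanh bottleneck, decode, and
# compare the output layer to the input layer.
loss <- function(w) {
  m <- unpack(w)
  Z <- tanh(X %*% m$We)
  sum((X - Z %*% m$Wd)^2)
}

fit <- optim(runif(2 * p * k, -0.1, 0.1), loss, method = "BFGS",
             control = list(maxit = 500))

# The bottleneck activations are the numeric low-dimensional coordinates.
Z <- tanh(X %*% unpack(fit$par)$We)
print(round(Z, 3))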

15 of 24

KNN ALGORITHM OVERVIEW

[Figure: your project placed among the historical missions in NLPCA space; its nearest neighbors determine the estimate]
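A minimal sketch of the nearest-neighbor step, assuming Euclidean distance and inverse-distance weighting (the deck cites a kNN weighted-average formula without reproducing it); Z and effort are stand-ins for the historical missions' NLPCA coordinates and actuals.

# Estimate a new mission's effort from its k nearest historical neighbors.
knn_estimate <- function(z_new, Z, effort, k = 3) {
  # Euclidean distance in NLPCA space from the new mission to each
  # historical mission (one row of Z per mission).
  dists <- sqrt(rowSums(sweep(Z, 2, z_new)^2))
  nn <- order(dists)[1:k]         # the k nearest neighbors
  w <- 1 / (dists[nn] + 1e-9)     # inverse-distance weights (assumed form)
  sum(w * effort[nn]) / sum(w)    # weighted-average effort estimate
}

# Illustrative usage with stand-in data.
set.seed(1)
Z <- matrix(rnorm(60), ncol = 2)   # 30 missions in a 2-D NLPCA space
effort <- runif(30, 50, 400)       # historical effort in work-months
knn_estimate(c(0.2, -0.5), Z, effort)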

16 of 24

KNN MODEL EXAMPLE OUTPUT

[Figure: cumulative effort distribution (work-months) and each historical mission’s probability of being one of the three nearest neighbors]

Model Input:

  • Medium Inheritance
  • Small Mission Size
  • Earth orbiter
  • Single-string
  • Two instruments
  • Zero deployables

Uncertainty in the NLPCA leads to uncertainty in the kNN result.

17 of 24

CLUSTERING ALGORITHM OVERVIEW

Probabilistic linkage matrices, calculated using the k-Means algorithm in NLPCA space (Cassini, Galileo, and the Rovers and Landers are removed).
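A minimal sketch of the clustering step using stats::kmeans, which the slide names; Z is the stand-in NLPCA matrix from the kNN sketch, and seven centers mirror the effort-model clusters on the next slide.

# Partition the historical missions in NLPCA space into seven clusters.
set.seed(1)
km <- kmeans(Z, centers = 7, nstart = 25)

km$cluster   # cluster membership of each historical mission
km$centers   # cluster centroids in NLPCA space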

18 of 24

Effort Model Clusters

1. Very Large, Old, Outer Planetary: Cassini, Galileo
2. Rovers: MER, MPF, MSL
3. Landers: Insight, Phoenix
4. Large, Complex, Inner-Outer Planetary: Dawn, GRAIL, JUNO, Kepler, LADEE, MAVEN, Messenger, MRO, New Horizons, Parker Solar Probe
5. Large, Complex, Earth-Inner Planetary: Deep Impact, Genesis, GPM Core, LRO, Mars Observer, Mars Odyssey, OSIRIS-REx, SMAP, Stardust, STEREO, TIMED, Van Allen Probe
6. Smaller, Higher Inheritance: DS1, GLORY, NuStar, OCO-1, WISE
7. Large, Earth Observatories and Constellations: GRO, HST, MMS, SDO, Spitzer

SLOC Model Clusters

1. Very Large, Old, Outer Planetary: Cassini, Galileo
2. Rovers: MER, MPF, MSL
3. Landers: Insight, Phoenix
4. Large, Complex, Inner-Outer Planetary: JUNO, Mars Observer, MAVEN, Messenger, MRO, New Horizons, Parker Solar Probe
5. Large, Moderately Complex, Dual String (Cold): Deep Impact, Genesis, GOES-R, LDCM, Mars Odyssey, NPP, OSIRIS-REx, Stardust, Van Allen Probe
6. Smaller or Simple, Earth – Asteroid/Comet: DS1, EO1, GLORY, GPM Core, IRIS, NuStar, OCO-1, SMAP, TIMED, WISE
7. Small-Medium, Single-String Inner-Planetary or Dual String (Cold) Asteroid/Comet: Contour, Dawn, GRAIL, LADEE, LCROSS, LRO
8. Large, Earth Observatories and Constellations: GLAST, GRO, HST, MMS, SDO, Spitzer, STEREO

19 of 24

CLUSTERING ALGORITHM OVERVIEW

  • Once we have our missions in a low-dimensional numeric space, we can calculate the distance from each mission to the “center” of any cluster.
  • Once a project lands in a cluster with k missions, we use the kNN weighted average formula for the estimate (sketched below).
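A minimal sketch of this step, reusing the stand-ins defined in the earlier sketches (km, Z, effort, and knn_estimate); all names are illustrative.

# Hypothetical new mission, already projected into NLPCA space.
z_new <- c(0.2, -0.5)

# Distance from the new mission to each cluster centroid.
cd <- sqrt(rowSums(sweep(km$centers, 2, z_new)^2))
cl <- which.min(cd)   # index of the nearest cluster

# kNN weighted average over the k missions in that cluster.
members <- which(km$cluster == cl)
knn_estimate(z_new, Z[members, , drop = FALSE],
             effort[members], k = length(members))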

[Figure: your project placed in NLPCA space relative to the five cluster centroids]

20 of 24

CLUSTERING MODEL EXAMPLE OUTPUT

Model Input:

  • Medium Inheritance
  • Small Mission Size
  • Earth orbiter
  • Single-string
  • Two instruments
  • Zero deployables

Uncertainty in the NLPCA leads to uncertainty in the cluster result.

[Figure: probability of falling into each cluster, and the resulting cumulative effort distribution (work-months)]

Cluster 6 (Smaller, Higher Inheritance): DS1, GLORY, NuStar, OCO-1, WISE

Uncertainty in the Effort distribution is caused by uncertainty in the NLPCA as well as uncertainty in the cluster.

21 of 24

THANKS!

We love to chat about collecting and cleaning data, statistics and machine learning, and software costing.

  • Thank you! Any questions?

©2023. All rights reserved. Government sponsorship acknowledged. NASA HQ OCFO Strategic Investments Division provides the funding and management for the development of the ASCoT model and the ONSET framework. The cost information contained in this document is of a budgetary and planning nature and is intended for informational purposes only. It does not constitute a commitment on the part of JPL and/or Caltech.

22 of 24

BACKUP

23 of 24

BAYESIAN SIMPLE LINEAR REGRESSION USING THE R PACKAGE BRMS (BAYESIAN REGRESSION MODELS USING STAN) (1/2)

# Set the parameters of the model.
slope <- 1.9
intercept <- 0.4
sigma <- 1.3
N <- 20

# Simulate the process.
xs <- runif(N, min = -3, max = 3)
signal <- slope*xs + intercept
noise <- rnorm(N, mean = 0, sd = sigma)
ys <- signal + noise

# Plot the data.
plot(xs, ys)

Note that even with known parameters, there is noise in the data. This noise is due to the inherent uncertainty in the process. This is called aleatoric uncertainty.

24 of 24

BAYESIAN SIMPLE LINEAR REGRESSION USING THE R PACKAGE BRMS (BAYESIAN REGRESSION MODELS USING STAN) (2/2)

# Load the brms library.
library(brms)

# Define and fit the model.
d <- data.frame(x = xs, y = ys)
model <- brm(y ~ x, data = d)

# See the fitted parameters.
plot(model)

# See how the model looks over the data.
plot(conditional_effects(model, method = 'predict'), points = TRUE)

# Sample from the posterior.
post <- as_draws_df(model)
head(post)

  b_Intercept      b_x     sigma    lprior      lp__
1  -0.1447024 1.711225 1.3742691 -3.865604 -35.77348
2   0.8070629 1.549148 1.1941961 -3.864858 -35.20632
3   0.6598251 1.584357 1.1770281 -3.852215 -34.26019
4   0.1113087 1.498301 1.2273181 -3.841217 -35.30354
5   0.3810465 1.884662 0.8651947 -3.803884 -35.22898
6   0.5099172 1.653497 1.4078682 -3.876983 -34.48557

There is uncertainty in the fitted parameters. This is called epistemic uncertainty and represents a lack of knowledge.