1 of 24

ASCOT 3: NONLINEAR PRINCIPAL COMPONENTS ANALYSIS AND UNCERTAINTY QUANTIFICATION IN EARLY LIFECYCLE SPACECRAFT FLIGHT SOFTWARE COST ESTIMATION

FIRST INTERNATIONAL BOEHM FORUM ON COCOMO AND SYSTEMS AND SOFTWARE COST MODELING

NOVEMBER 9-10, 2022

Sam Fleischer, PhD, Samuel.R.Fleischer@jpl.nasa.gov*

Patrick Bjornstad, Patrick.T.Bjornstad@jpl.nasa.gov

Jairus Hihn, PhD, Jairus.M.Hihn@jpl.nasa.gov

NASA Jet Propulsion Laboratory

California Institute of Technology

Pasadena, CA 91109

James Johnson, James.K.Johnson@nasa.gov*

National Aeronautics and Space Administration

Washington, DC 20546

*Corresponding Authors

2 of 24

OVERVIEW

  • Challenges in spacecraft flight software cost estimation
  • Why does ASCoT exist?
  • Bayesian regression and improving our understanding of uncertainty
  • Non-numerical data and Nonlinear Principal Components Analysis
  • k-Nearest-Neighbors and Clustering algorithms

3 of 24

CHALLENGES IN SPACECRAFT FLIGHT SOFTWARE COST ESTIMATION

  • Requirements are not known at early phases of the mission, and architecture trade studies are routine.
  • Software estimation is, to some degree, fundamentally uncertain even under the best conditions.
  • It is difficult to budget under a large amount of uncertainty.
  • Budget ‘bogies’ get set very early in the lifecycle… sometimes based on casual conversation… and project managers will want to hold you to that number.
  • Current proposal and planning processes encourage (even demand) under-estimation.

4 of 24

WHY DOES ASCOT EXIST? (1/2)

  • ASCoT was created to enable estimators to better embrace the uncertainty
  • ASCoT expands the range of cost estimation models to include formal analogic cost estimation, which can be better suited to early project formulation
    • Analogic models can perform much better than parametric models with sparse, noisy data
    • Analogic models represent what is known in the very early lifecycle more accurately than parametric models
  • ASCoT provides models that only require basic system-level inputs that are known in the early lifecycle

5 of 24

WHY DOES ASCOT EXIST? (2/2) – THE DATASAURUS DOZEN

All of these datasets have identical summary statistics (x and y means, x and y standard deviations, and correlation) when rounded to the nearest 100th, yet they look completely different when plotted.

6 of 24

BAYESIAN REGRESSION AND IMPROVING OUR UNDERSTANDING OF UNCERTAINTY

  • When regression is appropriate, ASCoT improves on parametric models by quantifying the full uncertainty in the regression.
    • Epistemic uncertainty is uncertainty in the model parameters or model form.
    • Aleatoric uncertainty is uncertainty inherent to the data-generating process (i.e., the distribution around the mean line).
  • Bayesian statistics lets us set smart priors based on expert opinion before ingesting any data (see the sketch below).
    • “In the absence of data, what is appropriate to assume?”
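As an illustration, here is a minimal brms sketch of encoding expert opinion as priors before any data arrive. The prior values and the data frame d (columns x and y, as constructed on the backup slides) are illustrative assumptions, not ASCoT's actual specification.

library(brms)

# Hypothetical expert-opinion priors (illustrative values only).
expert_priors <- c(
  prior(normal(1, 0.5), class = "b"),         # slope believed to be near 1
  prior(normal(0, 2), class = "Intercept"),   # weakly informative intercept
  prior(exponential(1), class = "sigma")      # positive, modest spread
)

# sample_prior = "only" ignores the likelihood and draws from the priors
# alone, answering: in the absence of data, what do we assume?
m0 <- brm(y ~ x, data = d, prior = expert_priors, sample_prior = "only")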

7 of 24

BAYESIAN CER – POSTERIOR DISTRIBUTION

[Figure: posterior distribution of the CER parameters]

8 of 24

BAYESIAN CER – POSTERIOR PREDICTIVE DISTRIBUTION

[Figure: posterior predictive distribution with credibility intervals]

9 of 24

BAYESIAN CER – POSTERIOR PREDICTIVE DISTRIBUTION

[Figure: posterior predictive credibility intervals]

  • A model with a skew normal error term performs better predictively than a model with a normal error term (a comparison sketch follows this list).
    • It captures low outliers without pulling the median prediction down.
  • Simple regression performs better predictively than regressions that add other perceived software cost drivers, such as number of instruments, destination, or redundancy (in short: the extra drivers invite overfitting).
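A minimal comparison sketch, assuming the same illustrative data frame d as on the backup slides; the actual ASCoT CER specification and scoring are not reproduced here.

library(brms)

# Fit the same regression with a normal and a skew normal error term.
m_norm <- brm(y ~ x, data = d, family = gaussian())
m_skew <- brm(y ~ x, data = d, family = skew_normal())

# Compare out-of-sample predictive performance via approximate
# leave-one-out cross-validation; the top-ranked model predicts better.
loo_compare(loo(m_norm), loo(m_skew))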

10 of 24

K-NEAREST-NEIGHBORS AND CLUSTERING ALGORITHMS – INPUT VARIABLES (1/3)

  • Inheritance (as-is or modified code from a previous mission)
    • Theoretically, this is a number between 0% and 100%.
    • In practice, Project Software Systems Engineers (PSSEs) have only rough estimates. We categorize them into five bins:
      • “Very Low to None,” “Low,” “Medium,” “High,” “Very High”
  • Mission Size (total mission cost, including operations)
    • Theoretically, this is a precise positive number.
    • In practice, we have only rough estimates of what the total cost will be.
    • However, we have a very good idea of the cost target or mission class. The categories are:
      • “Small,” “Medium,” “Large,” “Very Large”

11 of 24

K-NEAREST-NEIGHBORS AND CLUSTERING ALGORITHMS – INPUT VARIABLES (2/3)

  • Mission Type
    • “Orbiter/flyby,” “Observatory,” “Lander,” “Rover”
  • Redundancy
    • “Single String” (no backup computer on board), “Dual String – Cold” (backup on board but nominally off), “Dual String – Warm” (backup maintaining continuous operations)
  • Destination
    • “Earth,” “Inner Planetary,” “Asteroid/Comet,” “Outer Planetary”
  • Number of Instruments (particle detectors, magnetometers, spectrometers, and other scientific instruments)
  • Number of Deployables (solar arrays, booms, arms, etc.)

12 of 24

K-NEAREST-NEIGHBORS AND CLUSTERING ALGORITHMS – INPUT VARIABLES (3/3)

Numerical Variables

  • Number of Instruments
  • Number of Deployables

Nominal and Categorical Variables

  • Inheritance
  • Mission Size
  • Mission Type
  • Redundancy
  • Destination

How do you calculate the “distance” between missions with non-numerical data?

Is the “distance” between 2 instruments and 3 instruments the same as the “distance” between 3 instruments and 4?

13 of 24

HOW DO YOU NUMERICIZE CATEGORICAL DATA?

  • kNN and Clustering algorithms need numbers, so we need to quantify the non-numerical data.

  • We let the data tell us the best way to quantify it.
    • We rely on a Nonlinear Principal Components Analysis (NLPCA) algorithm to learn the optimal weights (a minimal encoding sketch follows below).
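A minimal encoding sketch, assuming indicator (one-hot) coding as the starting point (the slides do not specify ASCoT's exact encoding); the mission values are made up.

# Hypothetical categorical mission data (values are illustrative).
d_cat <- data.frame(
  inheritance = factor(c("Low", "Medium", "High", "Medium", "Low")),
  size        = factor(c("Small", "Large", "Medium", "Small", "Large"))
)

# Full indicator (one-hot) coding: one 0/1 column per category level,
# which the NLPCA can then reweight.
X <- model.matrix(~ . - 1, d_cat,
                  contrasts.arg = lapply(d_cat, contrasts, contrasts = FALSE))
print(X)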


14 of 24

NONLINEAR PRINCIPAL COMPONENTS ANALYSIS – AUTO-ASSOCIATIVE NEURAL NETWORKS (ANN)

[Figure: auto-associative neural network architecture, with a low-dimensional bottleneck layer]

ANN parameters are optimized such that the difference between the output layer and the input layer is minimized.

Goal: the low-dimensional bottleneck layer must adequately retain the information contained in the input layer.

Result: A non-numeric input layer can be projected onto a numeric, low-dimensional space.
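A toy auto-associative network in base R, reusing the indicator matrix X from the sketch on the previous slide: a two-unit tanh bottleneck trained by minimizing reconstruction error with optim(). This is a hedged stand-in under those assumptions; ASCoT's real NLPCA architecture and training are not listed in the slides.

set.seed(1)
p <- ncol(X); k <- 2   # k = bottleneck (low-dimensional) layer size

# Unpack a flat parameter vector into encoder/decoder weight matrices.
unpack <- function(w) list(We = matrix(w[1:(p * k)], p, k),
                           Wd = matrix(w[(p * k + 1):(2 * p * k)], k, p))

# Reconstruction loss: encode through the tanh bottleneck, decode, and
# compare the output layer to the input layer.
loss <- function(w) {
  m <- unpack(w)
  Z <- tanh(X %*% m$We)
  sum((X - Z %*% m$Wd)^2)
}

fit <- optim(runif(2 * p * k, -0.1, 0.1), loss, method = "BFGS",
             control = list(maxit = 500))

# The bottleneck activations are the numeric low-dimensional coordinates.
Z <- tanh(X %*% unpack(fit$par)$We)
print(round(Z, 3))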

15 of 24

KNN ALGORITHM OVERVIEW

[Figure: your project placed among the historical missions in NLPCA space; its nearest neighbors determine the estimate]
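A minimal sketch of the nearest-neighbor step, assuming Euclidean distance and inverse-distance weighting (the deck cites a kNN weighted-average formula without reproducing it); Z and effort are stand-ins for the historical missions' NLPCA coordinates and actuals.

# Estimate a new mission's effort from its k nearest historical neighbors.
knn_estimate <- function(z_new, Z, effort, k = 3) {
  # Euclidean distance in NLPCA space from the new mission to each
  # historical mission (one row of Z per mission).
  dists <- sqrt(rowSums(sweep(Z, 2, z_new)^2))
  nn <- order(dists)[1:k]         # the k nearest neighbors
  w <- 1 / (dists[nn] + 1e-9)     # inverse-distance weights (assumed form)
  sum(w * effort[nn]) / sum(w)    # weighted-average effort estimate
}

# Illustrative usage with stand-in data.
set.seed(1)
Z <- matrix(rnorm(60), ncol = 2)   # 30 missions in a 2-D NLPCA space
effort <- runif(30, 50, 400)       # historical effort in work-months
knn_estimate(c(0.2, -0.5), Z, effort)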

16 of 24

KNN MODEL EXAMPLE OUTPUT

[Figure: cumulative effort distribution (work-months) and each historical mission’s probability of being one of the three nearest neighbors]

Model Input:

  • Medium Inheritance
  • Small Mission Size
  • Earth orbiter
  • Single-string
  • Two instruments
  • Zero deployables

Uncertainty in the NLPCA leads to uncertainty in the kNN result.

17 of 24

CLUSTERING ALGORITHM OVERVIEW

Probabilistic linkage matrices, calculated using the k-Means algorithm in NLPCA space (Cassini, Galileo, and the Rovers and Landers are removed).
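A minimal sketch of the clustering step using stats::kmeans, which the slide names; Z is the stand-in NLPCA matrix from the kNN sketch, and seven centers mirror the effort-model clusters on the next slide.

# Partition the historical missions in NLPCA space into seven clusters.
set.seed(1)
km <- kmeans(Z, centers = 7, nstart = 25)

km$cluster   # cluster membership of each historical mission
km$centers   # cluster centroids in NLPCA space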

18 of 24

Effort Model Clusters

1. Very Large, Old, Outer Planetary: Cassini, Galileo
2. Rovers: MER, MPF, MSL
3. Landers: Insight, Phoenix
4. Large, Complex, Inner-Outer Planetary: Dawn, GRAIL, JUNO, Kepler, LADEE, MAVEN, Messenger, MRO, New Horizons, Parker Solar Probe
5. Large, Complex, Earth-Inner Planetary: Deep Impact, Genesis, GPM Core, LRO, Mars Observer, Mars Odyssey, OSIRIS-REx, SMAP, Stardust, STEREO, TIMED, Van Allen Probe
6. Smaller, Higher Inheritance: DS1, GLORY, NuStar, OCO-1, WISE
7. Large, Earth Observatories and Constellations: GRO, HST, MMS, SDO, Spitzer

SLOC Model Clusters

1. Very Large, Old, Outer Planetary: Cassini, Galileo
2. Rovers: MER, MPF, MSL
3. Landers: Insight, Phoenix
4. Large, Complex, Inner-Outer Planetary: JUNO, Mars Observer, MAVEN, Messenger, MRO, New Horizons, Parker Solar Probe
5. Large, Moderately Complex, Dual String (Cold): Deep Impact, Genesis, GOES-R, LDCM, Mars Odyssey, NPP, OSIRIS-REx, Stardust, Van Allen Probe
6. Smaller or Simple, Earth – Asteroid/Comet: DS1, EO1, GLORY, GPM Core, IRIS, NuStar, OCO-1, SMAP, TIMED, WISE
7. Small-Medium, Single-String Inner-Planetary or Dual String (Cold) Asteroid/Comet: Contour, Dawn, GRAIL, LADEE, LCROSS, LRO
8. Large, Earth Observatories and Constellations: GLAST, GRO, HST, MMS, SDO, Spitzer, STEREO

19 of 24

CLUSTERING ALGORITHM OVERVIEW

  • Once we have our missions in a low-dimensional numeric space, we can calculate the distance from each mission to the “center” of any cluster.
  • Once a project lands in a cluster with k missions, we use the kNN weighted average formula for the estimate (sketched below).
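A minimal sketch of this step, reusing the stand-ins defined in the earlier sketches (km, Z, effort, and knn_estimate); all names are illustrative.

# Hypothetical new mission, already projected into NLPCA space.
z_new <- c(0.2, -0.5)

# Distance from the new mission to each cluster centroid.
cd <- sqrt(rowSums(sweep(km$centers, 2, z_new)^2))
cl <- which.min(cd)   # index of the nearest cluster

# kNN weighted average over the k missions in that cluster.
members <- which(km$cluster == cl)
knn_estimate(z_new, Z[members, , drop = FALSE],
             effort[members], k = length(members))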

[Figure: your project placed in NLPCA space relative to the five cluster centroids]

20 of 24

CLUSTERING MODEL EXAMPLE OUTPUT

Model Input:

  • Medium Inheritance
  • Small Mission Size
  • Earth orbiter
  • Single-string
  • Two instruments
  • Zero deployables

Uncertainty in the NLPCA leads to uncertainty in the cluster result.

[Figure: probability of falling into each cluster, and the resulting cumulative effort distribution (work-months)]

Cluster 6 (Smaller, Higher Inheritance): DS1, GLORY, NuStar, OCO-1, WISE

Uncertainty in the Effort distribution is caused by uncertainty in the NLPCA as well as uncertainty in the cluster.

21 of 24

THANKS!

We love to chat about collecting and cleaning data, statistics and machine learning, and software costing.

  • Thank you! Any questions?

©2023. All rights reserved. Government sponsorship acknowledged. NASA HQ OCFO Strategic Investments Division provides the funding and management for the development of the ASCoT model and the ONSET framework. The cost information contained in this document is of a budgetary and planning nature and is intended for informational purposes only. It does not constitute a commitment on the part of JPL and/or Caltech.

22 of 24

BACKUP

23 of 24

BAYESIAN SIMPLE LINEAR REGRESSION USING THE R PACKAGE BRMS (BAYESIAN REGRESSION MODELS USING STAN) (1/2)

# Set the parameters of the model.
slope <- 1.9
intercept <- 0.4
sigma <- 1.3
N <- 20

# Simulate the process.
xs <- runif(N, min = -3, max = 3)
signal <- slope*xs + intercept
noise <- rnorm(N, mean = 0, sd = sigma)
ys <- signal + noise

# Plot the data.
plot(xs, ys)

Note that even with known parameters, there is noise in the data. This noise is due to the inherent uncertainty in the process. This is called aleatoric uncertainty.

24 of 24

BAYESIAN SIMPLE LINEAR REGRESSION USING THE R PACKAGE BRMS (BAYESIAN REGRESSION MODELS USING STAN) (2/2)

# Load the brms library.
library(brms)

# Define and fit the model.
d <- data.frame(x = xs, y = ys)
model <- brm(y ~ x, data = d)

# See the fitted parameters.
plot(model)

# See how the model looks over the data.
plot(conditional_effects(model, method = 'predict'), points = TRUE)

# Sample from the posterior.
post <- as_draws_df(model)
head(post)

  b_Intercept      b_x     sigma    lprior      lp__
1  -0.1447024 1.711225 1.3742691 -3.865604 -35.77348
2   0.8070629 1.549148 1.1941961 -3.864858 -35.20632
3   0.6598251 1.584357 1.1770281 -3.852215 -34.26019
4   0.1113087 1.498301 1.2273181 -3.841217 -35.30354
5   0.3810465 1.884662 0.8651947 -3.803884 -35.22898
6   0.5099172 1.653497 1.4078682 -3.876983 -34.48557

There is uncertainty in the fitted parameters. This is called epistemic uncertainty and represents a lack of knowledge.