2 of 23

Your Digital Professional Life in 3 parts:

Organization, retrieval, and backup of information

Can you find something you did when you were an undergraduate? You will be able to if you plan ahead.

Software-based projects

Scientists have techniques (especially for large projects) that make all the difference.

Scientific Computing techniques

Techniques for large numbers, images, matrix calculations

- B. Grossan. Use requires attribution of all sources -

3 of 23

Part I: Your Information

Are computers good at storing information?

They can store lots, you can do ascii searches, you can make complex associative organization�
BUT computers also HIDE information: �with 200 GRB papers in a file, how can you find the right one you? �All include terms “GRB”, “light curve”, and ”redshift” !

How to find a figure? A data file? How do you search for that?

- B. Grossan. Use requires attribution of all sources -

4 of 23

Plan to find everything useful 5 years from now��

Kevin Hurley ASCII NOTES: Detailed ascii text notes on every paper you read:

ascii text won’t be replaced like e.g. “readynote” format
no wasting time looking up a reference, everything is in that one file

Bruce Lab Notebook: Record EVERYTHING in “Lab Notebook”

ideas, conversations, talks, names to remember, papers, projects..
use “meta-URLs”, keyword spam�e.g.: today 251104 Bruce talked about LAST telescope array projects GRBs Transients Weizmann institute. Ernazar had a colleague Ofer Lahav, working on GRB080319b structured jets. Bruce said use Schlegel dust server for dust reddening N_H column ipac.Caltech.edu/dust (dust, reddening, E(B-V), extinction) . Zhanat said use flat [data server://flats/251015/iflat, see analysis [eclwiki://opm/blog] something about standard star deviations …

- B. Grossan. Use requires attribution of all sources -

5 of 23

...find everything useful 5 years from now (cont)��

Technique: “meta-URLs”

URL: [format/instruction://][location] like telnet://128.32.245.16
can be generalized to almost anything: optics_lab://lefthand_cabinet/lense_shelf/gfilter�SCP format: my_computer_name!~bruce/grb/obs/zhanatflats�

Technique: KEYWORD SPAM: �single words never enough for text search on your computer:

“reddening” not enough: write: “reddening curve, extinction, Galactic, extragalactic, H2106-099, color evolution, red to blue, Milky Way, MW, LMC, SMC, schelgel, server, dust, PAH, N_H, hydrogen column” … if you really want to be able to find your talk notes, or e.g. a single plot you kept.

README files –Put in EVERY software directory

In five years, will you really remember what “flat.py” did?

- B. Grossan. Use requires attribution of all sources -

6 of 23

Part II. Software Projects

Do you spend too much time on software?

Spend less and less time by re-using code

How do the smartest scientists (not others who write books and web pages) make their software projects?

Control files, ”name-per-run” allow you to really track what you do

How do you know your result is correct?

If not, you are IN TROUBLE. Build-in ways to test, check!�

- B. Grossan. Use requires attribution of all sources -

7 of 23

Hint 1: Write programs for re-use

Typically, we “just write a little program…”

…then we add to, and add to it, it until it’s a mess.
Instead, think about how you might re-use everything you write

EXAMPLE: photometry: the only things that change are filenames, GRB/standards locations, and FWHM.

Did you ”pull out” (or abstract; see below) those values so you could easily re-use your code?

- B. Grossan. Use requires attribution of all sources -

8 of 23

Hint 2: Go from simple to complex

Think in terms of columns of data in, columns of data out. Build from there.

Do this for 100 steps to make a complicated, big program. Each step is simple, but -repeated can be powerful, -can be checked at EACH step:��

Example: Photometry

This turns out well if you test each step and keep them simple.�

; DATA IN COLUMNS ____ _____ ______

____ _____ ______

Program

; OUTPUT IN COLUMNS ____ _____ ______

____ _____ ______

List of images

____ _____

co-adder

; OUTPUT list of images, star positions

____ _____ ______

sex-tractor

; OUTPUT list of images, photometry

____ _____ ______

- B. Grossan. Use requires attribution of all sources -

9 of 23

Technique: Parameter/”control” files

Use parameter files to control how programs are run:

prog_run14.pars:�; this run uses co-adds by 3 from the data with the old flats for GRB251013C�fileinlists: imglistcoaddby3.coldat imrlistcoaddby3.coldat imilistcoaddby3.coldat�star1x 165 292�star1y 23 342 �…. “keyword” value1 value2 ….� (above I used “;” as comment character)
Simple format: Keyword, then values – easy to read, parse, use.

Leaves a RECORD of what you did each run

records all inputs and if done right, outputs
effective when used with next technique….

- B. Grossan. Use requires attribution of all sources -

10 of 23

Technique: Name-per-run

You run the same program and data many times.

before you fix an error, with different images, different co-adds...
If output file names are always the same, this is a potential DISASTER MESS.

Names allow you to track EVERY output:

short basename, e.g. “phot”
filetype, + “out” for any outputs, e.g. photinlist.col, photout.csv, photoutplotstandsvcolor.png
name-per-run track by parameters good: photoutcoadd3ims.coldat
name-per-run rack by run number for multiple parameters: �photoutrun14b.coldat tells you to look in photrun14b.params for input files, other settings.
Later, e.g.,when writing paper, you know exactly what made photout14b.coldat, from a single glance at directory

WHY BOTHER? �

Two years later, you can still figure out what you did:�

>ls phot/GRB251013C

photrun5.params (starts: ; this run no coadds)

photinlist5.coldat

photoutlist5.coldat

photrun8.params (;coadds ≤ 180s; standard stars 13-17mag; skip cloudy time)

photinlist8.coldat

photoutlist8.coldat

….this way you can know PRECISELY what you did those 2 years ago!

- B. Grossan. Use requires attribution of all sources -

11 of 23

Software is unpredictable: Deep Reasons

Cellular Automata (Von Neuman)

SIMPLE system, numbers in grid, simple rules, e.g. each step cell is replaced with e.g. binary sum of neighbor values, runs in steps

Results CANNOT be predicted
Patterns can emerge on few, hundreds, billion cycles, unpredictable

a general property of �any system with “memory”

CONCLUSION:�TEST SOFTWARE!!! �results are NOT predictable!

- B. Grossan. Use requires attribution of all sources -

12 of 23

Software is unpredictable: TEST THOROUGHLY

“Bugs” are inherent to software. Assuming bug-free software is always wrong.

1. Simplify, simplify, simplify

simplify formulae before you code (avoids mistakes)
design, before you start, for simplicity�

2. TEST OR GET THE WRONG ANSWER.

Always use data with KNOWN ANSWER to validate code before real run

3. Write in small sections; TEST INDEPENDENTLY

Best Practice: print out variables, then STOP @ end of each section, so you can check progress each step
Print intermediate results to log file, e.g. “loop jj=__ gives x= __before calibration applied; after calibration, x= __. “

- B. Grossan. Use requires attribution of all sources -

13 of 23

Hint 4: Document as you go

Use a template header to start every program:

; myprog,filenames,stdev_for_std_rejection,number_of_coadds�; purpose: coadd with standard star rejection for clouds�; inputs: …..�; algorithm: reject image if C_lambda > 2.5 sig from avg.�; special file formats: must have ‘t-t0’ keyword in image FITS headers�; data sources: …..

Don’t forget README files

REAME.txt:�e.g. “This is the directory for GRB251109, NOT an alert, but from GCN4211The images notimg*, notimr*,notimz* are from the NOT telescope, not NUTTela….”

-helps you not forget what you did, why

- B. Grossan. Use requires attribution of all sources -

14 of 23

Technique: abstract paths (for portability)

Simple version:

myprogram: �datafile=‘/home/users/bruce/myprogram/data/datafile.dat’
… will never work on any other computer�

“abstract” – put in separate file, read at run-time

runtime.params:�; edit this file to put in the right file names:�basedir /home/users/yourname/yourdir ; fill in your directory path here�datafile /home/users/yourname/yourdir/data/datafile.dat ; your data file��myprogram: �readparams(./runtime.params)�read,1,datafile�….�[print results to file in basedir/results.txt]

-easily made runs on any computer, setup of data files, directories, etc.

- B. Grossan. Use requires attribution of all sources -

15 of 23

Part III: Scientific Computing Techniques

Burned by overflow

IDL> ii=12000 & help,ii

II INT = 12000

IDL> print,'what is 4*12k?’ & print,4*ii

what is 4*12k?

-17536 integer overflow.

IDL> ff=12000.0 & print,4.*ff, 2.^65.*ff

48000.0 4.42722e+23 Inf <-float overflow

Example from Image Reduction

co-add, 25 images, you will get overflow because images are integers. In python, convert to FLOAT:

import numpy as np

rows = 1025

cols = 1024

float_zeros_array = np.zeros((rows, cols), dtype=np.float64) # or np.float32, etc.

(In IDL, just newimage=float(firstimagearray)

For really big numbers, type double is usually 10^302.

- B. Grossan. Use requires attribution of all sources -

16 of 23

Technique: calculate mantissa, put exp in name

Physics has huge dynamic range of constants:�1. Compute mantissa 2. add exponents

h = 6.63e-27 in cgs; 6.63 is is mantissa, -27 is exponent.
k_b=1.38 e-16; k_b / h ~ 10⁴⁴ <-FLOAT OVERFLOW

2. Exponents in names:

h*nu / (K*T)=Boltzman factor
write exponents in names; “_” = minus kb_16=1.38 & hp_27=6.63 & nu14=4.48 & tk3=5.0�boltzfmantissa=hp_27*nu14/(kb_16*tk3)�boltzfexp=-27.+14. / (-16.+3.) (note:decimal points force float)

ANSWER IS boltzfmantissa X10^boltzfexp

Log (Boltz factor) easy to plot, no risk of over/underflow.

- B. Grossan. Use requires attribution of all sources -

17 of 23

Data Vectors, Vectorization

Computers know NOTHING about images or arrays

images are stored internally as just a 1-d sequence of numbers.
In many languages, calculations are required to get 2-d indices: �IDL> wgt_ten=where(image gt 10.)�IDL> x=wgt_ten mod ncolumns & y=wgtten / ncolumns�

Vector operations much faster than loops

most languages have built-in optimized functions for data vectors �Soldani+2016 says 20X faster
Example: newim=img1/flat is much faster than �for xx=0,1023 do for yy=0,1023 newim[xx,yy]=img1[xx,yy]/flat{xx,yy]

Tutorials on Tools and Methods using High Performance Computing resources for Big Data, Sodani+2016

- B. Grossan. Use requires attribution of all sources -

18 of 23

Matrix Operations

Computers no NOTHING about images or arrays
For standard matrix operations like coordinate transformations, need �SPECIAL OPERATORS for matrix math

Example: X’ = R X where R is a rotation matrix

IDL> xprime=r # X <- Weird “#” operator�

python:

import numpy as np

xprime= R@X or <-Special “@” operator

xprime=np.dot(R,X) <-Special Function np.dot()

�

- B. Grossan. Use requires attribution of all sources -

19 of 23

Computer Image Manipulation

- B. Grossan. Use requires attribution of all sources -

20 of 23

Image Convolution

Smoothing

1-d: for every pixel i smdat[i]=average(data[i-(N/2-1)]…data[i+(N/2-1)] for smoothing parameter N (definition of edge values varies)
2-d: same idea over both dimensions x and y

Convolution, written f*g

�Used in sextractor: Increase signal from PSFs; decrease signal from other:�math notation: detectionim=im*PSF
IDL> detectionim=convolve(im,psf)
python: import cv2�detectionim=cv2.filter2D(src=image,kernel=psf)
Used to smooth artifacts: �left image, im has missing lines�right= im * Gaussian�where Gaussian[0:4,0:4] has sigma=1

taken from IDL Convol manual page

wikipedia

- B. Grossan. Use requires attribution of all sources -

21 of 23

Simulating Stars for Photometry 1

Sky Background

if images have simple background, simulate with background=random noise where sigma(random noise) =sigma(real background)
IDL> noiseimage=sigma_real*randomn(seed,1024,1024)
python: importy numpy as np�noise = np.random.normal(loc=mean, scale=std_dev, size=original_array.shape)�

Simulated stars – use a real PSF

airy disk by Ben Waterman

- B. Grossan. Use requires attribution of all sources -

22 of 23

Simulating Stars for Photometry 2

Simulated stars – use a real PSF

“Gaussian” PSF least real approximation
“Airy Function” PERFECT THEORETICAL response for round aperture
NUTTelA PSF not round, rings washed out

airy disk by Ben Waterman

Airy Disk

NUTTelA PSF

- B. Grossan. Use requires attribution of all sources -

23 of 23

Simulating Stars for Photometry-3

Simulated stars – use a real PSF

DO NOT use Gaussian or Airy function; won’t test real images
Use bright non-saturated star on real image

IDL:

; copy bright star at xbrt,ybrt (; denotes comment)

star = realimage[xbrt-2*FWHM:xbt+28FWHM, ybrt-2*FWHM:ybrt+2*FWHM]

star = star/total(star) ; normalize

; place star on simulated image at xstar, ystar

simulated_im[xstar-2*FHWM:xstar+2*FHWM, ystar 2*FHWM:ystar+2*FHWM]=star*flux

…. ; place all the test stars you want, then add background noise:

noiseim=randomn(seed,xsize,ysize) & simulated_im=simulated_im+noiseim

- B. Grossan. Use requires attribution of all sources -