1 of 23

Basics of Scientific Computing

Bruce Grossan

- B. Grossan. Use requires attribution of all sources -

2 of 23

Your Digital Professional Life in 3 parts:

  • Organization, retrieval, and backup of information
    • Can you find something you did when you were an undergraduate? You will be able to if you plan ahead.

  • Software-based projects
    • Scientists have techniques (especially for large projects) that make all the difference.

  • Scientific Computing techniques
    • Techniques for large numbers, images, matrix calculations

2

- B. Grossan. Use requires attribution of all sources -

3 of 23

Part I: Your Information

  • Are computers good at storing information?

    • They can store lots, you can do ascii searches, you can make complex associative organization�
    • BUT computers also HIDE information: �with 200 GRB papers in a file, how can you find the right one you? �All include terms “GRB”, “light curve”, and ”redshift” !

    • How to find a figure? A data file? How do you search for that?

3

- B. Grossan. Use requires attribution of all sources -

4 of 23

Plan to find everything useful 5 years from now��

  • Kevin Hurley ASCII NOTES: Detailed ascii text notes on every paper you read:
    • ascii text won’t be replaced like e.g. “readynote” format
    • no wasting time looking up a reference, everything is in that one file

  • Bruce Lab Notebook: Record EVERYTHING in “Lab Notebook”
      • ideas, conversations, talks, names to remember, papers, projects..
      • use “meta-URLs”, keyword spam�e.g.: today 251104 Bruce talked about LAST telescope array projects GRBs Transients Weizmann institute. Ernazar had a colleague Ofer Lahav, working on GRB080319b structured jets. Bruce said use Schlegel dust server for dust reddening N_H column ipac.Caltech.edu/dust (dust, reddening, E(B-V), extinction) . Zhanat said use flat [data server://flats/251015/iflat, see analysis [eclwiki://opm/blog] something about standard star deviations …

4

- B. Grossan. Use requires attribution of all sources -

5 of 23

...find everything useful 5 years from now (cont)��

  • Technique: “meta-URLs”
    • URL: [format/instruction://][location] like telnet://128.32.245.16
    • can be generalized to almost anything: optics_lab://lefthand_cabinet/lense_shelf/gfilter�SCP format: my_computer_name!~bruce/grb/obs/zhanatflats�
  • Technique: KEYWORD SPAM:single words never enough for text search on your computer:
    • “reddening” not enough: write: “reddening curve, extinction, Galactic, extragalactic, H2106-099, color evolution, red to blue, Milky Way, MW, LMC, SMC, schelgel, server, dust, PAH, N_H, hydrogen column” … if you really want to be able to find your talk notes, or e.g. a single plot you kept.

  • README files –Put in EVERY software directory
    • In five years, will you really remember what “flat.py” did?

5

- B. Grossan. Use requires attribution of all sources -

6 of 23

Part II. Software Projects

  • Do you spend too much time on software?
    • Spend less and less time by re-using code

  • How do the smartest scientists (not others who write books and web pages) make their software projects?
    • Control files, ”name-per-run” allow you to really track what you do

  • How do you know your result is correct?
    • If not, you are IN TROUBLE. Build-in ways to test, check!�

- B. Grossan. Use requires attribution of all sources -

7 of 23

Hint 1: Write programs for re-use

  • Typically, we “just write a little program…”
    • …then we add to, and add to it, it until it’s a mess.
    • Instead, think about how you might re-use everything you write
  • EXAMPLE: photometry: the only things that change are filenames, GRB/standards locations, and FWHM.
    • Did you ”pull out” (or abstract; see below) those values so you could easily re-use your code?

- B. Grossan. Use requires attribution of all sources -

8 of 23

Hint 2: Go from simple to complex

  • Think in terms of columns of data in, columns of data out. Build from there.
    • Do this for 100 steps to make a complicated, big program. Each step is simple, but -repeated can be powerful, -can be checked at EACH step:�����
  • Example: Photometry

  • This turns out well if you test each step and keep them simple.�

8

; DATA IN COLUMNS ____ _____ ______

____ _____ ______

____ _____ ______

____ _____ ______

____ _____ ______

Program

; OUTPUT IN COLUMNS ____ _____ ______

____ _____ ______

____ _____ ______

____ _____ ______

____ _____ ______

List of images

____ _____

____ _____

____ _____

____ _____

co-adder

; OUTPUT list of images, star positions

____ _____ ______

____ _____ ______

____ _____ ______

sex-tractor

; OUTPUT list of images, photometry

____ _____ ______

____ _____ ______

____ _____ ______

- B. Grossan. Use requires attribution of all sources -

9 of 23

Technique: Parameter/”control” files

  • Use parameter files to control how programs are run:
    • prog_run14.pars:�; this run uses co-adds by 3 from the data with the old flats for GRB251013C�fileinlists: imglistcoaddby3.coldat imrlistcoaddby3.coldat imilistcoaddby3.coldat�star1x 165 292�star1y 23 342 �…. “keyword” value1 value2 ….� (above I used “;” as comment character)
    • Simple format: Keyword, then values – easy to read, parse, use.

  • Leaves a RECORD of what you did each run
    • records all inputs and if done right, outputs
    • effective when used with next technique….

- B. Grossan. Use requires attribution of all sources -

10 of 23

Technique: Name-per-run

  • You run the same program and data many times.
    • before you fix an error, with different images, different co-adds...
    • If output file names are always the same, this is a potential DISASTER MESS.

  • Names allow you to track EVERY output:
    • short basename, e.g. “phot”
    • filetype, + “out” for any outputs, e.g. photinlist.col, photout.csv, photoutplotstandsvcolor.png
    • name-per-run track by parameters good: photoutcoadd3ims.coldat
    • name-per-run rack by run number for multiple parameters: �photoutrun14b.coldat tells you to look in photrun14b.params for input files, other settings.
    • Later, e.g.,when writing paper, you know exactly what made photout14b.coldat, from a single glance at directory

WHY BOTHER? �

Two years later, you can still figure out what you did:�

>ls phot/GRB251013C

photrun5.params (starts: ; this run no coadds)

photinlist5.coldat

photoutlist5.coldat

photrun8.params (;coadds ≤ 180s; standard stars 13-17mag; skip cloudy time)

photinlist8.coldat

photoutlist8.coldat

….this way you can know PRECISELY what you did those 2 years ago!

- B. Grossan. Use requires attribution of all sources -

11 of 23

Software is unpredictable: Deep Reasons

  • Cellular Automata (Von Neuman)
    • SIMPLE system, numbers in grid, simple rules, e.g. each step cell is replaced with e.g. binary sum of neighbor values, runs in steps

  • Results CANNOT be predicted
  • Patterns can emerge on few, hundreds, billion cycles, unpredictable
    • a general property of �any system with “memory”

  • CONCLUSION:�TEST SOFTWARE!!! �results are NOT predictable!

11

- B. Grossan. Use requires attribution of all sources -

12 of 23

Software is unpredictable: TEST THOROUGHLY

  • “Bugs” are inherent to software. Assuming bug-free software is always wrong.

  • 1. Simplify, simplify, simplify
    • simplify formulae before you code (avoids mistakes)
    • design, before you start, for simplicity�
  • 2. TEST OR GET THE WRONG ANSWER.
    • Always use data with KNOWN ANSWER to validate code before real run

  • 3. Write in small sections; TEST INDEPENDENTLY
    • Best Practice: print out variables, then STOP @ end of each section, so you can check progress each step
    • Print intermediate results to log file, e.g. “loop jj=__ gives x= __before calibration applied; after calibration, x= __. “

12

- B. Grossan. Use requires attribution of all sources -

13 of 23

Hint 4: Document as you go

  • Use a template header to start every program:
    • ; myprog,filenames,stdev_for_std_rejection,number_of_coadds; purpose: coadd with standard star rejection for clouds�; inputs: …..�; algorithm: reject image if C_lambda > 2.5 sig from avg.�; special file formats: must have ‘t-t0’ keyword in image FITS headers�; data sources: …..
  • Don’t forget README files
    • REAME.txt:�e.g. “This is the directory for GRB251109, NOT an alert, but from GCN4211The images notimg*, notimr*,notimz* are from the NOT telescope, not NUTTela….”

-helps you not forget what you did, why

13

- B. Grossan. Use requires attribution of all sources -

14 of 23

Technique: abstract paths (for portability)

  • Simple version:
    • myprogram: �datafile=‘/home/users/bruce/myprogram/data/datafile.dat’
    • … will never work on any other computer�
  • “abstract” – put in separate file, read at run-time
    • runtime.params:�; edit this file to put in the right file names:�basedir /home/users/yourname/yourdir ; fill in your directory path here�datafile /home/users/yourname/yourdir/data/datafile.dat ; your data file��myprogram: �readparams(./runtime.params)�read,1,datafile�….�[print results to file in basedir/results.txt]

-easily made runs on any computer, setup of data files, directories, etc.

14

- B. Grossan. Use requires attribution of all sources -

15 of 23

Part III: Scientific Computing Techniques

  • Burned by overflow

15

IDL> ii=12000 & help,ii

II INT = 12000

IDL> print,'what is 4*12k?’ & print,4*ii

what is 4*12k?

-17536 integer overflow.

IDL> ff=12000.0 & print,4.*ff, 2.^65.*ff

48000.0 4.42722e+23 Inf <-float overflow

Example from Image Reduction

co-add, 25 images, you will get overflow because images are integers. In python, convert to FLOAT:

import numpy as np

rows = 1025

cols = 1024

float_zeros_array = np.zeros((rows, cols), dtype=np.float64) # or np.float32, etc.

(In IDL, just newimage=float(firstimagearray)

For really big numbers, type double is usually 10^302.

- B. Grossan. Use requires attribution of all sources -

16 of 23

Technique: calculate mantissa, put exp in name

  • Physics has huge dynamic range of constants:�1. Compute mantissa 2. add exponents
    • h = 6.63e-27 in cgs; 6.63 is is mantissa, -27 is exponent.
    • kb=1.38 e-16; kb / h ~ 1044 <-FLOAT OVERFLOW

  • 2. Exponents in names:
    • h*nu / (K*T)=Boltzman factor
    • write exponents in names; “_” = minus kb_16=1.38 & hp_27=6.63 & nu14=4.48 & tk3=5.0�boltzfmantissa=hp_27*nu14/(kb_16*tk3)�boltzfexp=-27.+14. / (-16.+3.) (note:decimal points force float)
  • ANSWER IS boltzfmantissa X10^boltzfexp
    • Log (Boltz factor) easy to plot, no risk of over/underflow.

16

- B. Grossan. Use requires attribution of all sources -

17 of 23

Data Vectors, Vectorization

  • Computers know NOTHING about images or arrays
    • images are stored internally as just a 1-d sequence of numbers.
    • In many languages, calculations are required to get 2-d indices: �IDL> wgt_ten=where(image gt 10.)�IDL> x=wgt_ten mod ncolumns & y=wgtten / ncolumns�
  • Vector operations much faster than loops
    • most languages have built-in optimized functions for data vectors �Soldani+2016 says 20X faster
    • Example: newim=img1/flat is much faster than �for xx=0,1023 do for yy=0,1023 newim[xx,yy]=img1[xx,yy]/flat{xx,yy]

17

Tutorials on Tools and Methods using High Performance Computing resources for Big Data, Sodani+2016

- B. Grossan. Use requires attribution of all sources -

18 of 23

Matrix Operations

  • Computers no NOTHING about images or arrays
  • For standard matrix operations like coordinate transformations, need �SPECIAL OPERATORS for matrix math

Example: X’ = R X where R is a rotation matrix

IDL> xprime=r # X <- Weird “#” operator

python:

import numpy as np

xprime= R@X or <-Special “@” operator

xprime=np.dot(R,X) <-Special Function np.dot()

18

- B. Grossan. Use requires attribution of all sources -

19 of 23

Computer Image Manipulation

- B. Grossan. Use requires attribution of all sources -

20 of 23

Image Convolution

  • Smoothing
    • 1-d: for every pixel i smdat[i]=average(data[i-(N/2-1)]…data[i+(N/2-1)] for smoothing parameter N (definition of edge values varies)
    • 2-d: same idea over both dimensions x and y
  • Convolution, written f*g
    • �Used in sextractor: Increase signal from PSFs; decrease signal from other:�math notation: detectionim=im*PSF
    • IDL> detectionim=convolve(im,psf)
    • python: import cv2�detectionim=cv2.filter2D(src=image,kernel=psf)
    • Used to smooth artifacts: �left image, im has missing lines�right= im * Gaussian�where Gaussian[0:4,0:4] has sigma=1

20

taken from IDL Convol manual page

wikipedia

- B. Grossan. Use requires attribution of all sources -

21 of 23

Simulating Stars for Photometry 1

  • Sky Background
    • if images have simple background, simulate with background=random noise where sigma(random noise) =sigma(real background)
    • IDL> noiseimage=sigma_real*randomn(seed,1024,1024)
    • python: importy numpy as np�noise = np.random.normal(loc=mean, scale=std_dev, size=original_array.shape)�
  • Simulated stars – use a real PSF

21

airy disk by Ben Waterman

- B. Grossan. Use requires attribution of all sources -

22 of 23

Simulating Stars for Photometry 2

  • Simulated stars – use a real PSF
    • “Gaussian” PSF least real approximation
    • “Airy Function” PERFECT THEORETICAL response for round aperture
    • NUTTelA PSF not round, rings washed out

22

airy disk by Ben Waterman

Airy Disk

NUTTelA PSF

- B. Grossan. Use requires attribution of all sources -

23 of 23

Simulating Stars for Photometry-3

  • Simulated stars – use a real PSF
    • DO NOT use Gaussian or Airy function; won’t test real images
    • Use bright non-saturated star on real image

23

IDL:

; copy bright star at xbrt,ybrt (; denotes comment)

star = realimage[xbrt-2*FWHM:xbt+28FWHM, ybrt-2*FWHM:ybrt+2*FWHM]

star = star/total(star) ; normalize

; place star on simulated image at xstar, ystar

simulated_im[xstar-2*FHWM:xstar+2*FHWM, ystar 2*FHWM:ystar+2*FHWM]=star*flux

…. ; place all the test stars you want, then add background noise:

noiseim=randomn(seed,xsize,ysize) & simulated_im=simulated_im+noiseim

- B. Grossan. Use requires attribution of all sources -