1 of 52

Foundations of Astronomical Data Science

Instructor Onboarding

Jan 2022

Azalee Bostroem

2 of 52

Motivation

Working with astronomical data is challenging.

It requires computational tools and good practices.

3 of 52

Who is this for?

  • Astronomers
  • We assume:
    • At least undergrad astronomy
    • Basic Python
    • Familiarity with the command line

4 of 52

Tools

  • Python
  • Astropy/Astroquery
  • SQL (ADQL)
  • Pandas
  • Matplotlib

5 of 52

Practices

  • Developing and testing SQL and Python
  • Working with remote databases and local storage
  • Validation of data and analysis pipelines
  • Visualization of results

6 of 52

Important files

  • Function file: facilitates imports and exports
  • Isochrone file: episode 7
  • Custom Style file
  • Setup files
  • Data files (last resort)

student_download
  • episode_functions.py
  • gd1_isochrone.hdf5
  • az-paper-twocol.mplstyle
  • environment.yml
  • test_setup.ipynb
  • backup-data
    • gd1_data.csv
    • gd1_data.hdf
    • gd1_results.fits

7 of 52

Provide Scientific Background

Dataset: Stellar Streams: Gaia DR2 + Pan-STARRS

    • Off the Beaten Path: Gaia Reveals GD-1 Stars Outside of Main Stream (Price-Whelan & Bonaca)

ESA

8 of 52

Workshop Overview: Figure 1

9 of 52

Why this dataset?

  • Gaia and Pan-STARRS
  • Too big to be done locally
  • Compelling science and use case
    • Extend GD-1
    • Detailed substructure
    • Progenitor
    • Data release → arXiv < 1 week

10 of 52

Connect Back to the Big Picture Frequently

  • Episode 1: Select a subset of data to prototype on
    • General skill: Querying remote databases with SQL/ADQL
  • Episode 2: Modify results to most useful format and save
    • General skill: Filtering data by coordinate with Astropy, coordinate transformation, writing FITS files
  • Episode 3: Inspect, merge, and save results
    • General skill: Astropy Tables, Pandas DataFrames, functions, Matplotlib
  • Episode 4: Figure out additional filtering to decrease the number of rows selected, using the downloaded data
    • General skill: Refining a filter, creating a mask, writing HDF5 files

11 of 52

Connect Back to the Big Picture Frequently

  • Episode 5: Add new filter to SQL query, search on all of the data
    • General skill: Expanding your prototype, writing multi-extension HDF5 files
  • Episode 6: JOIN to another table to get even more information
    • General skill: Combining information from more than one table with SQL
  • Episode 7: Figure out an even more restrictive filter with new data and apply it
    • General skill: Selecting and filtering data with Pandas
  • Episode 8: Communicate your results
    • General skill: Creating a publication-quality figure with Matplotlib

12 of 52

Advice for Every Episode

  • Connect code to Figure 1 often
  • Connect skills to broader context (how can students use this tool more broadly? what is the generalization?)
  • Talk through the reason for doing something

13 of 52

Day 1

14 of 52

Episode 1: Big Picture

  • Introduce scientific context
  • Databases: the future of astronomy
  • Astroquery: many catalogs (e.g. Vizier, MAST, NED, USNO)
  • Universal SQL command structure
  • Pipeline analysis process: quick, reproducible research

15 of 52

Episode 1: Summary

Notebook intro

Select and download data from the Gaia Database:

  • Make a connection to the Gaia server
  • Explore information about the database and tables it contains
  • Write a query and send it to the server, and finally
  • Download and view the response from the server

SELECT TOP 10
source_id, ra, dec, pmra, pmdec, parallax, parallax_error, radial_velocity
FROM gaiadr2.gaia_source
WHERE parallax < 1                   -- far away
  AND bp_rp > -0.75 AND bp_rp < 2    -- exclude red M dwarfs

16 of 52

Episode 1: Best Practices

  • Use queries to select the data you need
  • Read the metadata and the documentation
  • Develop queries incrementally
  • Use TOP and COUNT to test before you run
  • Capitalize SQL keywords to make code more readable
  • Check the type of the output
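
The incremental approach in the list above can be sketched in Python. This is a minimal sketch, not the lesson's code: the helper `build_query` and its parameters are hypothetical, and it only assembles query strings. The commented lines at the end show where `Gaia.launch_job` from `astroquery.gaia` would send them.

```python
def build_query(columns, table, conditions, top=None, count_only=False):
    """Assemble an ADQL query string (hypothetical helper, for illustration)."""
    if count_only:
        # COUNT tells you how many rows a query would return before you run it
        select = "SELECT COUNT(*) AS n"
    else:
        # TOP limits the result to a few rows while the query is being tested
        limit = f"TOP {top} " if top is not None else ""
        select = f"SELECT {limit}{', '.join(columns)}"
    where = " AND ".join(conditions)
    return f"{select}\nFROM {table}\nWHERE {where}"


columns = ["source_id", "ra", "dec", "pmra", "pmdec", "parallax"]
conditions = ["parallax < 1", "bp_rp > -0.75", "bp_rp < 2"]

test_query = build_query(columns, "gaiadr2.gaia_source", conditions, top=10)
count_query = build_query(columns, "gaiadr2.gaia_source", conditions, count_only=True)
print(test_query)

# With a network connection, the string could then be sent to the server:
# from astroquery.gaia import Gaia
# job = Gaia.launch_job(test_query)
# results = job.get_results()
# type(results)  # check what kind of object came back
```

Keeping the conditions in a list makes it easy to add one filter at a time and rerun the small test query after each change.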

17 of 52

Episode 1: Beyond this workshop/use case: SQL

SELECT <columns>

FROM <table>

WHERE <condition>
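
To show that the same SELECT / FROM / WHERE structure works outside Gaia, here is a small sketch using Python's built-in `sqlite3` module. The `stars` table and its values are made up for illustration. Note that SQLite uses LIMIT rather than ADQL's TOP, so only the three-clause skeleton carries over unchanged.

```python
import sqlite3

# In-memory database with a made-up table, for illustration only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stars (source_id INTEGER, parallax REAL, bp_rp REAL)")
conn.executemany(
    "INSERT INTO stars VALUES (?, ?, ?)",
    [(1, 0.4, 0.9), (2, 2.5, 1.1), (3, 0.7, 2.6), (4, 0.2, -0.1)],
)

# Same skeleton as above: SELECT <columns> FROM <table> WHERE <condition>
rows = conn.execute(
    "SELECT source_id, parallax "
    "FROM stars "
    "WHERE parallax < 1 AND bp_rp < 2 "
    "ORDER BY source_id"
).fetchall()
print(rows)  # [(1, 0.4), (4, 0.2)]
conn.close()
```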

18 of 52

Episode 1: Pitfalls

  • load_table vs load_tables
  • launch_job vs launch_job_async
  • Debugging queries
  • We don’t use the parallax column until a later episode

19 of 52

Episode 2: Summary

  • Define a rectangle region to select in GD-1 frame
  • Transform vertices to ICRS frame
  • Query Gaia to select stars in polygon

20 of 52

Episode 2 Skills

  • Use Quantity objects with units (astropy.units, astropy.coordinates)
  • Transform coordinates between ICRS and GD-1 frames
  • Use ADQL commands POLYGON, CONTAINS, and POINT
  • Use Astropy Tables and FITS files

21 of 52

Episode 2: Best Practices

  • Use Quantity objects with units to catch errors
  • Use the format function to compose complex queries
  • Develop queries incrementally
  • Once you have a query working, save the data
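
Composing a query with the format function can be sketched as below. This is a minimal sketch: the polygon vertices are made-up numbers rather than the transformed GD-1 rectangle, and the names `query_base`, `vertices`, and `sky_point_list` are chosen here for illustration.

```python
# Template with named placeholders; the fixed parts of the query stay readable
query_base = """SELECT
TOP 10
{columns}
FROM gaiadr2.gaia_source
WHERE parallax < 1
  AND bp_rp BETWEEN -0.75 AND 2
  AND 1 = CONTAINS(POINT(ra, dec),
                   POLYGON({sky_point_list}))
"""

# Made-up (ra, dec) vertices in degrees, for illustration only
vertices = [(146.3, 19.6), (135.4, 8.2), (141.6, 2.3), (152.8, 13.3)]

# POLYGON takes a flat, comma-separated list: ra1, dec1, ra2, dec2, ...
sky_point_list = ", ".join(f"{ra}, {dec}" for ra, dec in vertices)
columns = ", ".join(["source_id", "ra", "dec", "pmra", "pmdec"])

query = query_base.format(columns=columns, sky_point_list=sky_point_list)
print(query)
```

Printing the composed query before sending it is a cheap way to catch a missing comma or an unfilled placeholder.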

22 of 52

Episode 2: Pitfalls

  • Confusing CONTAINS syntax:

cone_query = """SELECT
TOP 10
source_id
FROM gaiadr2.gaia_source
WHERE 1=CONTAINS(
  POINT(ra, dec),
  CIRCLE(88.8, 7.4, 0.08333333))
"""

  • Transformations: what and why
  • Emphasize: starting on subset of data to prototype query
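
One way to explain the `1=CONTAINS(...)` syntax: in ADQL, CONTAINS returns 1 when the first shape lies inside the second and 0 otherwise, so the WHERE clause compares that result to 1. The pure-Python sketch below mimics this for a point and a circle. The function names are hypothetical, and this is not how the Gaia server implements the test.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees, using the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(a)))


def contains_circle(ra, dec, center_ra, center_dec, radius_deg):
    """Mimic CONTAINS(POINT, CIRCLE): return 1 if the point is inside, else 0."""
    separation = angular_separation_deg(ra, dec, center_ra, center_dec)
    return 1 if separation <= radius_deg else 0


# Same circle as the cone query: center (88.8, 7.4), radius 0.08333333 deg (5 arcmin)
inside = contains_circle(88.82, 7.41, 88.8, 7.4, 0.08333333)
outside = contains_circle(89.8, 7.4, 88.8, 7.4, 0.08333333)
print(inside, outside)  # 1 0
```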

23 of 52

Episode 3: Summary

  • Transform coordinates back to GD-1 frame
  • Put results in a Pandas DataFrame

[Figure panels: ICRS Reference Frame, GD-1 Reference Frame]

24 of 52

Episode 3: Skills

  • Matplotlib basics
  • Intro to Astropy Table and Pandas DataFrames
  • Converting between Tables and DataFrames
  • When to write your own functions
  • Save to HDF5 files

25 of 52

Episode 3: Best Practices

  • Adjust parameters of scatter plot to avoid overplotting.
  • Astropy Table and a Pandas DataFrame are similar; choose best one based on needs

26 of 52

Episode 3: Pitfalls

  • Distance and radial velocity are set to constants - necessary but not important to understand
  • Annoying specific detail: reflex correction

27 of 52

Episode 4: Summary

  • Select stars very close to GD-1 stream
  • Plot proper motion of selection to identify stars likely to be in GD-1
  • Define region of GD-1 proper motion
  • Select and plot the most likely candidates

[Figure; labels: Space, Motion, GD-1]

28 of 52

Episode 4: Skills and Best Practices

  • Inspecting/Spot checking data in Pandas
  • Filtering in DataFrames
  • Save multiple datasets to HDF5 files
  • Inspect basic statistics on results

29 of 52

Episode 4: Pitfalls

  • Easy to get lost in different DataFrames
  • Putting DataFrames all into the same file - don’t overwrite

30 of 52

Day 2

  • New Notebook
  • Make sure to load functions and data from “Starting from this episode” collapsible text

31 of 52

Episode 5: Summary

  • Identify polygon surrounding GD-1 proper motion in ICRS coordinates
  • Assemble query that selects only the stars likely to be in GD-1 based on proper motion, location, color, and distance
  • Expand physical region these stars are selected from
  • Convert back to GD-1 frame and plot

[Figure panels: Motion in the GD-1 Frame, Motion in the ICRS Frame]

32 of 52

Episode 5: Skills

  • Review previous skills (units, coordinate transformations)
  • Build increasingly complex queries
  • Use Pandas Series

33 of 52

Episode 5: Best Practices

  • Reduce dataset size on server before download
  • Prototype first on small dataset then expand to full dataset
  • Save incremental results

34 of 52

Episode 5: Pitfalls

  • Can feel like a repeat: expanding prototype, moving computation to the data
  • Lesson moves quickly - not many new concepts
  • phi1_min, phi1_max, etc. are different in this episode

35 of 52

Episode 6: Summary

  • Explore Pan-STARRS tables
  • Practice joining tables with a simple query
  • Join Pan-STARRS tables to Gaia to get g and i photometry of our candidates
  • Explore returned data
  • Write to HDF5 file

36 of 52

37 of 52

Episode 6: Skills

  • Understand and build SQL queries that JOIN multiple tables
  • Use Pandas to evaluate basic quality of data
  • More comfortable with HDF5 files

SELECT col1, …, coln
FROM table1
JOIN table2
  ON table1.key1 = table2.key2
WHERE conditions
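
The JOIN template can be tried locally with Python's built-in `sqlite3` module. The two tables below (`candidates` and `photometry`) and their values are made up for illustration and stand in for the Gaia and Pan-STARRS tables. This is a sketch of the pattern, not the lesson's query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE candidates (source_id INTEGER, pmra REAL);
CREATE TABLE photometry (obj_id INTEGER, g_mag REAL, i_mag REAL);
INSERT INTO candidates VALUES (101, -7.3), (102, -6.8), (103, -7.9);
INSERT INTO photometry VALUES (101, 18.2, 17.6), (103, 19.5, 19.1), (999, 20.0, 19.8);
""")

# Rows are matched where the key in one table equals the key in the other;
# rows without a partner (102 and 999 here) drop out of the result.
rows = conn.execute("""
SELECT candidates.source_id, photometry.g_mag, photometry.i_mag
FROM candidates
JOIN photometry
  ON candidates.source_id = photometry.obj_id
WHERE photometry.g_mag < 19
""").fetchall()
print(rows)  # [(101, 18.2, 17.6)]
conn.close()
```

Building the JOIN first and adding the WHERE clause afterwards mirrors the incremental development recommended for this episode.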

38 of 52

Episode 6: Best Practices

  • Develop and test queries incrementally
  • Run basic statistics on output to evaluate quality

39 of 52

Episode 6: Pitfalls

  • Uses a Python head function that we don’t explain
  • Most complex episode
  • Final exercises can feel overwhelming
  • Neighbours is spelled with a “u”

40 of 52

Episode 7: Summary

  • Identify stars around the GD-1 isochrone
  • Create a POLYGON that encompasses these stars
  • Select stars in this POLYGON
  • Make GD-1 photometry and proper motion selection plot

41 of 52

Episode 7: Skills

  • Matplotlib tricks: using contains, making a polygon
  • More practice using Pandas and HDF5
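
If learners ask what a polygon "contains" test actually does, the idea can be sketched in pure Python with ray casting. This is not Matplotlib's implementation and not the lesson's code. The function name, the polygon vertices, and the star values are made up for illustration.

```python
def point_in_polygon(x, y, vertices):
    """Ray-casting test: count how many polygon edges a ray to the right crosses."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # an odd number of crossings means inside
    return inside


# Made-up polygon in (color, magnitude) space and three made-up stars
polygon = [(0.2, 17.0), (0.7, 17.0), (0.9, 21.0), (0.4, 21.0)]
stars = [(0.5, 18.0), (1.2, 19.0), (0.3, 20.5)]

selected = [point_in_polygon(color, mag, polygon) for color, mag in stars]
print(selected)  # [True, False, False]
```

The resulting list of booleans plays the same role as the mask used to select rows of a DataFrame.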

42 of 52

Episode 7: Best Practices

  • Record every element of the data analysis pipeline that would be needed to replicate the results.

43 of 52

Episode 7: Pitfalls

  • Need local Isochrone file
  • Making of isochrone in _extras
  • Our selection around isochrone is a simplified version of paper
  • Story line can get lost in details

44 of 52

Episode 8: Summary

  • Think about what makes a good figure
  • Customize figures with Matplotlib
  • Make Figure 1!

45 of 52

46 of 52

Episode 8: Skills

  • Fill in Matplotlib gaps
    • Annotations, Patches
    • Parameter customization
    • Multiple panels
  • Streamline figure creation process
  • Power of plotting functions

47 of 52

Episode 8: Best Practices

  • Think about the story you want to tell
  • Annotations: minimize work for reader
  • Override defaults to improve figure
  • Create your own style sheet if you are making the same customizations over and over

48 of 52

Episode 8: Pitfalls

  • Lots of links
  • Need local style file

49 of 52

Think about Workshop Mechanics

  • Communicating to learners
    • Links
    • Exercises
  • Notebooks:
    • Same vs new
  • Instructor/helper to instructor/helper communications

50 of 52

This is a lot of cumulative material

  • Share link to lessons with learners before teaching
  • Reassure learners that they can get caught up
  • Feel free to copy and paste
  • Consider a Slack channel with a helper typing commands as you go

51 of 52

If you’re running short on time…

Very little of this material can be skipped

Recommendations for speeding things up:

  • Treat episode 8 as show and tell or cut sections
  • Streamline or cut CSV section of episode 6
  • Cut checking the size of the HDF5 file in episode 4
  • Cut discussion of context managers in episode 4

52 of 52

More advice? Check out the Instructor Notes

https://datacarpentry.org/astronomy-python/guide/index.html