1 of 15

Data Management

Julian Borrill

Interim Data Management L2 Scientist

2 of 15

Range

  • Receive raw data from Data Acquisition
    • The raw data become DM's responsibility when they hit site storage
  • Deliver science-quality intermediate data products to the collaboration
    • “Intermediate data products” are single-frequency maps and transient alerts
    • “Science-quality” includes all documentation and ancillary data products required to analyze the data
  • Deliver science-quality intermediate and final data products, and the software used to generate them, to the community
  • The Data Management construction project must:
    • Support the optimization and validation of the experiment design
    • Be ready to transition to operations at first telescope commissioning


3 of 15

Scope

  • Data registration on receipt from DAQ (see the sketch after this list)
  • Data movement from sites to US and between US data centers
  • Archival storage of raw data and derived data products
  • Production of daily single-frequency maps from all telescopes
  • Identification of transients in daily maps & issuing of science alerts
  • Monitoring of data quality in daily maps & issuing of operational alerts
  • Characterization of the experiment (instrument + observation) from design, laboratory & field data
  • Production of bulk single-frequency maps, including systematics mitigation
  • Characterization of bulk single-frequency maps
  • Production of mock datasets for design validation & data characterization
  • Delivery of science-grade intermediate data to the collaboration
  • Receipt of data quality/sufficiency feedback from the collaboration
  • Delivery of science-grade intermediate and final data to the community
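
As an illustration of the first scope item, a minimal registration sketch: checksum each raw file as it lands in site storage and append a record to a manifest. The function name, manifest format, and checksum choice are assumptions for illustration, not the production design.

```python
import hashlib
import json
import time
from pathlib import Path

def register_file(path: Path, manifest: Path) -> dict:
    """Register one raw data file on receipt from DAQ: checksum it and
    append an entry to a line-delimited JSON manifest (illustrative only)."""
    sha = hashlib.sha256()
    with open(path, "rb") as f:
        # Stream the file in 1 MB blocks to keep memory use flat.
        for block in iter(lambda: f.read(1 << 20), b""):
            sha.update(block)
    entry = {
        "file": str(path),
        "size_bytes": path.stat().st_size,
        "sha256": sha.hexdigest(),
        "registered_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    with open(manifest, "a") as m:
        m.write(json.dumps(entry) + "\n")
    return entry
```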


4 of 15

Work Breakdown Structure

  • Intentionally distributed leadership to leverage the full range of Stage 3 expertise and to interface with the collaboration as widely as possible.

[WBS organization chart not shown; * denotes an interim appointment]


5 of 15

Design Drivers - Data Rates

Data rates set site bandwidth and local storage requirements.

Design: sufficient network bandwidth (1.1 Gbps) from Chile; insufficient network bandwidth (0.7 Gbps) from the South Pole.

Design: a 1-month backup (382 TB) in Chile; a 1-year backup (5.4 PB) at the South Pole.


TELESCOPES  DETECTORS  SAMPLING FREQUENCY (Hz)  RAW DATA RATE (samples/s)  COMPRESSED DATA RATE (Gbps)
CHLAT       243,520    400                      9.7 x 10^7                 1.09
SPLAT       114,432    400                      4.6 x 10^7                 0.51
SAT         153,232    100                      1.5 x 10^7                 0.17
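
The raw rates follow directly from detector counts and sampling frequencies; a quick check (the compressed bits-per-sample is back-computed from the table here, roughly 11 bits, an inference rather than a quoted design number):

```python
# Sanity check of the data-rate table: raw rate = detectors x sampling rate.
telescopes = {
    # type: (detectors, sampling rate [Hz], compressed rate [Gbps])
    "CHLAT": (243_520, 400, 1.09),
    "SPLAT": (114_432, 400, 0.51),
    "SAT":   (153_232, 100, 0.17),
}

for name, (ndet, fsamp, gbps) in telescopes.items():
    raw = ndet * fsamp           # raw samples per second
    bits = gbps * 1e9 / raw      # implied compressed bits per sample
    print(f"{name}: {raw:.1e} samples/s, ~{bits:.1f} bits/sample compressed")
```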

6 of 15

Design Drivers - Data Volumes

Data volumes set US data center storage requirements.

Design: 1 year of raw data + 7 years of science data (17 PB) spinning at each data center.

Design: 7 years of raw data (49 PB) archived at each data center.


TELESCOPES  DAILY DATA (TB)  SPINNING DATA (PB)  ARCHIVAL DATA (PB)
CHLAT       11.8             4.3                 30.1
SPLAT       5.5              2.0                 14.2
SAT         1.9              0.7                 4.7
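
The spinning and archival columns follow from the daily volumes as 1 and 7 years of raw data respectively; a quick check reproduces them to within rounding:

```python
# Reproduce the storage columns from the daily volumes: spinning holds
# 1 year of raw data, archival holds 7 years.
daily_tb = {"CHLAT": 11.8, "SPLAT": 5.5, "SAT": 1.9}

for name, tb in daily_tb.items():
    spinning_pb = tb * 365.25 / 1000       # 1 year of raw data, in PB
    archival_pb = 7 * tb * 365.25 / 1000   # 7 years of raw data, in PB
    print(f"{name}: spinning {spinning_pb:.1f} PB, archival {archival_pb:.1f} PB")
```

Summed over telescope types, 1 year of raw data is about 7 PB, so the 17 PB spinning requirement implies roughly 10 PB of derived science data on top of the raw year.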

7 of 15

Design Drivers - Daily Data Processing

Daily data processing drives the fast-access computational requirements.

Design: CHLAT processing in the US; SPLAT+SAT processing at the South Pole.


TELESCOPES  CYCLES (TFLOP)  PEAK MEMORY (TB)  SINGLE DAILY MAP DATA (GB)  TOTAL DAILY MAP DATA (TB)
CHLAT       0.84            11.5              54                          138
SPLAT       0.40            5.7               4.5                         12
SAT         0.13            1.2               0.02                        0.05
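
The last column appears to be the single daily map volume accumulated over a 7-year survey (an inference from the numbers, not a stated definition); a quick consistency check:

```python
# Consistency check: "total daily map data" matches one set of single
# daily maps accumulated over a 7-year survey.
single_map_gb = {"CHLAT": 54, "SPLAT": 4.5, "SAT": 0.02}
survey_days = 7 * 365.25

for name, gb in single_map_gb.items():
    total_tb = gb * survey_days / 1000
    print(f"{name}: {total_tb:.2f} TB of daily maps over the survey")
```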

8 of 15

Design Drivers - Bulk Data Processing

Simulating and reducing the entire dataset requires:

  • Cycles: 21.9 EFLOP
  • Peak memory: 11.7 PB
  • Peak scratch: 2.1 PB

Design: bulk computational resources must be allocated at national computing centers.

Design: balance having enough centers to accommodate down-times and to support diverse approaches (high-performance and high-throughput computing) against the per-center cost of maintaining and optimizing the software stack.
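
For scale, a rough conversion of these totals into node-hours and concurrent nodes, under assumed per-node figures (10 TFLOP/s sustained, 0.5 TB of memory; both illustrative, not benchmarked):

```python
# Rough scale of the bulk requirement under assumed per-node figures.
total_flop  = 21.9e18  # 21.9 EFLOP of cycles
peak_mem_tb = 11.7e3   # 11.7 PB of peak memory

node_tflops = 10.0     # assumed sustained TFLOP/s per node
node_mem_tb = 0.5      # assumed memory per node

node_hours = total_flop / (node_tflops * 1e12) / 3600
nodes_at_peak = peak_mem_tb / node_mem_tb
print(f"~{node_hours:,.0f} node-hours of cycles per full pass")
print(f"~{nodes_at_peak:,.0f} nodes concurrently at peak memory")
```

Under these assumptions the cycle count alone is modest; it is the peak memory footprint, requiring tens of thousands of nodes concurrently, that points to national computing centers.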


9 of 15

PBD Hardware Schematic

Allocated computational resources are planned, not confirmed.


10 of 15

PBD Software Schematic


11 of 15

Interfaces

All interfaces must be documented:

  • Interface Control Documents with other L2 Subsystems
  • Memorandum of Agreement with the Collaboration


12 of 15

Data Challenges

  • Data Challenges are at the core of the DM construction project:
    • Experiment (Instrument + Observation) design validation
    • Data management subsystem validation
      • Including sufficiency of allocated computational resources
    • Analysis pipeline validation
  • Each agency review features an enhanced, more mature design to be validated.
  • Each review is informed by a preceding Data Challenge.
  • Each Data Challenge is a 6-month process.


13 of 15

Data Challenge Schedule/Process


14 of 15

Parallel Session

Presentation by each L3 team

  • More detailed dive into each L3 subsystem
  • Open issues and questions
  • Path to CD-1/PDR

Note also the “Design Validation: Technical to Measurement” theme, which will focus on Data Challenge 1 and validating the Preliminary Baseline Design.


15 of 15

Open Questions

  • Do we have the full set of data product requirements from the collaboration?
  • What is the exact boundary between project and collaboration with respect to transients?
  • What is the optimal map-making approach when computational cost is included? Does it vary with science case?
  • Are there other computational resources we should be looking to use?
    • FABRIC for in-network computing
    • Joint South Pole computational infrastructure with IceCube
  • How best do we deliver data to the collaboration? Can/must we deliver computational resources too?
  • Do we have the right feedback loops with TWGs and AWGs?
