LSSTC enabling science 2020 broker workshop
Minutes and summary document
Website (slides and videos available): http://workshops.alerce.online/lsst-enabling-science-2020-broker-workshop/
Moderators: Gautham Narayan & Paula Sánchez-Sáez
15.00 - 15.10 UTC Introduction by Anais Möller (10 min)
First follow-up after the Seattle broker meeting in June 2019 and the broker session at PCW 2020.
First part: current status and requirements
Second part: Developments triggered by the first workshop and reflections from the different users and collaborations.
15.10 - 15.30 UTC LSST Data Management, Eric Bellm & Rubin Obs. Data Management (20 min)
Main document for alerts: LDM-612
Main place to ask questions on alerts: the forum at community.lsst.org (see especially the Alerts & Brokers topic: https://community.lsst.org/c/alertsbrokers/46 )
Simulated SSO available around November 15th, DIAObject timeseries features in progress.
Pre-operations (interim data facility) will be on a cloud provider. An exercise has begun. It is not clear yet where operations will happen.
Problem of cutout size: still investigating the different options to maximise the science output without impacting the bandwidth too much. Note that data rights holders can retrieve cutouts within 24h of alert emission, independently of the live stream.
Real/bogus: how to detect artifacts efficiently? Likely to use the Zooniverse project to develop training samples.
Data rights holders should also look at the Rubin Science Platform (RSP), which will provide access to the data. ‘Next-to-data’ analysis: Jupyter notebooks, API to access tabular and image data.
Can brokers interact with the RSP?
Information on derived data products: see RDO-013
Q: will the pre-operations cloud provider be announced before Dec 15th? A: We know it is Google. Press announcement soon.
Q: When will broker-RSP integration start? A: DM would be setting up the API during Data Preview 1 (ComCam data). Note that Data Preview 0 will be using DESC DC2 data, and Data Preview 2 will be using LSSTCam data.
Q: Are there any plans to do transient/variable classifications apart from the brokers (i.e. acting on several years' worth of data)? A: the LSST project will only characterize, rather than classify.
15.30 - 15.43 UTC ZTF lessons learned, Matthew Graham (13 min)
To all brokers: contact us to listen to the alert stream!
ZTF is 10% of LSST - scaling is not an issue. In addition, end-users do not request the full stream, but only a small fraction (this will still be true for LSST).
ZTF is moving to a fully automated classification system (without a human in the loop, which brings latency, inefficiency and non-repeatability).
ZTF fields on the order of a million queries/day. It is important to attach history to alerts to reduce the need to retrieve data each time one runs an analysis (or have a strong server!). However, there are too many ways to retrieve the data: need to reduce the number of repos/codes/etc. to a single one that everyone can use.
Concept of the one-stop-shop.
ZTF-II: Oct 2020 to Sep 2023.
Q: in ZTF, which request is more common/useful: 1) requesting specific transient type, or 2) requesting transients with generic features? (e.g., peak mag range, duration above detection, etc …) A: ??
Q: What will be the lookback time for the forced photometry service? A: lookback time will be available through forced photometry. All public data (or proprietary if you have the appropriate access rights)
Q: How can we fetch the forced photometry for interesting alerts?
Q: Will the format of the DR light curves change? A lot of fields available through IRSA/API are missing there now.
Q: Are there differences to be expected between ZTF II and ZTF I alert stream? Does a new process need to be started to receive it?
Q: will ZTF-II still split the streams into public and private ones? A: Yes, but 50% of the observing time is now devoted to public surveys
Q: Looks like the ZTF alert schema has not changed in over a year. Will there be schema changes when ZTF-II starts to include forced photometry? A: Yes, I believe so, but hopefully minimal ones.
15:43 - 15:48 UTC Requested talk: Hopskotch, Mario Juric (5 min)
Focus on multi-messenger astronomy (MMA)
Link to SCIMMA technologies. SCIMMA will provide an MMA archive.
Hopskotch: make reliable streams simple.
Q: Who is deciding on the MMA archive that SCIMMA will provide? Will it be a community discussion, or does it already exist? A: there are mo
15:48 - 15:53 UTC Short break (5 min)
Moderators: Federica Bianco & Nina Hernitschek
15:53 - 16.18 UTC Discussion: alert generation (25 min)
Notes for Breakout room #1
Main Q: “What types of xmatch info should be in the alert from LSST?” but also what the brokers do in terms of xmatch for the user
What types of xmatch are there?
What’s the bottleneck?
Are there issues with xmatch to proprietary data sets?
Alerts-related Q&A: https://community.lsst.org/c/sci/data/34
Alternative sets of notes - Breakout session #1 -- cross-match topic.
Fed Bianco moderator
Roy Williams - See notes above [I, Gizis, missed his comments ]
What is the bottleneck? Is it data access or technical?
Steven - quality of input catalog data is bottleneck. Astrometry, Star/Galaxy separation. Need ALGORITHM for cross-match. Confusion (galaxies), proper motion (stars) are issues.
Example: huge galaxies like M31, where the three nearest sources in PS don’t include the galaxy itself.
Q: Is proprietary data problem solved?
A: For now assume summary information and as much detailed information should be available worldwide. Haven’t thought about it in detail, bulk will be public. Do envision two different types of users.
Q: (Jha): I am a little bit confused by the topic “included in the alerts” — I thought cross-matching would be a job for the broker? Is the idea that some of this would be pushed upstream to the LSST alert, or are we talking about broker cross-matching?
Thought of Project info to the brokers. Then this is a Project issue. If there is something missing that is scientifically interesting then inform the project soon. The Project is only matching its alerts to its own catalogs, like the deep coadd catalog that goes deeper than any one exposure. [and known Solar System objects]. Object IDs are included, data is subject to Data Rights.
Question about forced photometry? Project has not extensively analyzed that capacity but it is in the books.
Smartt: Initial alert will not have forced photometry. Within 24 hours the brokers can request forced photometry for a certain number of objects [perhaps throttled]. The second alert will have forced photometry going back 30 days. Question is how long forced photometry takes. Second alert might be next day, might be 7 days, depending on cadence. Forced photometry should be available within 12-24 hours.
Question to clarify. I thought forced photometry is for all alerts. Clarification: All alerts will have forced photometry in 2nd alert. If you want within 24 hours, the broker needs to request.
Cross-matches to radio, other wavelengths. ALLWISE in IR, X-Ray. Gives useful information. This is the responsibility of the broker.
Redshift of host galaxies. Photo-z of host galaxies is an LSST product. Will LSST include such information in the alert packet if cross-matched? A: Cross-match is object ID, so redshift can be looked up in catalog with data rights, same for host galaxy photometry. DLR information? Melissa Graham is currently working on a proposal to include host-galaxy size information in the cross-matching algorithm. Right now it is just based on on-sky distance.
Gregory: Will provide a few matches, not just one object.
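As an illustration of the cross-match discussed above (association based only on on-sky distance, returning a few matches rather than one), here is a minimal sketch. The catalog layout, the 30 arcsec search radius and the function names are assumptions made for this example; they are not the Project's algorithm or schema.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a))))

def nearest_matches(alert_ra, alert_dec, catalog, k=3, radius_arcsec=30.0):
    """Return up to k (object_id, separation_arcsec) pairs, nearest first.

    catalog is an iterable of (object_id, ra_deg, dec_deg) tuples
    (a hypothetical layout chosen for this sketch).
    """
    matches = []
    for object_id, ra, dec in catalog:
        sep_arcsec = angular_separation_deg(alert_ra, alert_dec, ra, dec) * 3600.0
        if sep_arcsec <= radius_arcsec:
            matches.append((sep_arcsec, object_id))
    matches.sort()
    return [(object_id, sep) for sep, object_id in matches[:k]]

# Hypothetical catalog entries around an alert at (10.0005, -5.0) degrees.
catalog = [
    ("obj_a", 10.0000, -5.0000),
    ("obj_b", 10.0020, -5.0000),
    ("obj_c", 10.0000, -5.0050),
    ("obj_d", 12.0000, -5.0000),
]
result = nearest_matches(10.0005, -5.0, catalog)
```

A purely distance-based ranking like this one is what the proposal to include host-galaxy size information would refine: a large galaxy can be the true host even when its catalog centre is not among the nearest entries.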
Notes for Breakout room #2: Full stream vs lightweight only-metadata stream.
What does metadata mean? What to include? What are the options?
Q: what is the relevance of (relatively large) cutouts? Still at an early stage.
Q: what are the use-cases for having a lightweight stream?
The audience to get full packet content for a single high-demand packet could be a few hundred people, which is unlikely to be a technical hurdle
Lightweight stream in the meantime is breaking a few things (e.g. guarantee to have one-time delivery, ...).
The project (+SAC) should investigate how to feed all brokers, rather than cutting on the total number. Encouraging inter-broker collaboration rather than cutting the stream.
Q: What is the burden of maintaining the ZTF alert system? What is the impact for LSST of having multiple streams? A: Several places for alerts: IPAC -> UW -> brokers. When there is no problem, it is fairly minimal to maintain (~0.5 FTE), but not trivial when problems start.
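To make the full vs. lightweight distinction concrete, here is a minimal sketch of deriving a metadata-only packet from a full one by dropping the image cutouts. The cutout field names follow the ZTF alert packet convention; applying them here, and every value in the example packet, is an assumption of this sketch and not a decision recorded in the session.

```python
# Field names follow the ZTF alert packet convention; treating them as the
# cutout fields of a generic alert is an assumption of this sketch.
CUTOUT_FIELDS = ("cutoutScience", "cutoutTemplate", "cutoutDifference")

def to_lightweight(alert):
    """Return a shallow copy of the alert with the image cutouts dropped."""
    return {key: value for key, value in alert.items() if key not in CUTOUT_FIELDS}

# Hypothetical packet: the byte strings stand in for compressed image stamps.
full_alert = {
    "candid": 42,
    "objectId": "ZTF_example",
    "candidate": {"ra": 10.0, "dec": -5.0, "magpsf": 19.2},
    "cutoutScience": bytes(2048),
    "cutoutTemplate": bytes(2048),
    "cutoutDifference": bytes(2048),
}
light_alert = to_lightweight(full_alert)
```

The full packet is left untouched, so a broker could still serve it on request to the smaller audience that needs the cutouts.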
Science collaboration (SC) use cases I
Notes for Breakout room #3:
How many brokers will be supported for full-stream?
The requirement is a minimum of 5, but the project is looking into supporting more. However, this possibility won't be known before full proposals are due. Maybe for the next workshop?
16:18 - 16:30 UTC AGN, Paolo Coppi (12 min)
LSST will detect hundreds of millions of AGN in the main survey area!
Q: Are these sources you expect to confidently detect and do science with, or do you think they will be in alerts but not confidently classified? What are the timescales at which these might be classified? A:
16:30 - 16:42 UTC Dark Energy, Kara Ponder (12 min)
Using SN Ia for cosmology, but planning to follow-up any type of SNe.
No need for immediate information. Bulk alerts received within 24h, and processed by brokers is what is needed
DESC is running evaluation for brokers and metrics (with some broker teams to define inputs).
Q: What do you mean you want to evaluate CPU usage of brokers? A: it is more a question of reproducibility for testing purposes (example inside a light docker container).
Q: how many teams are thinking about 3rd party reproducibility (i.e., them being able to spin up a version of your broker, or providing such a service)?
A: Mario, this is built into the ANTARES design (provenance tracking, containerized versions, etc.).
A: Fink’s design allows this reproducibility. Our whole framework is open source, under version control, and can be deployed in any cloud.
Follow-up Q: Thanks! Just to focus my question a bit -- will it be (or already is?) possible for someone to download (say) a container of your broker and run it by themselves (say, with a simulated stream to understand the selection function)?
A: Each Ampel version can also be spun up at a later time. Expanding the question a bit, an equally hard or harder question for reproducibility is whether all input streams will be available. All decisions in Ampel are logged, but to fully "rerun" a science season requires all input streams to be accessible, including e.g. follow-up decisions.
A: That's in our development plan. We aren't making containers available yet, but will, for exactly the reasons you bring up.
A: @mjuric to complement @Anais Möller's answer, we have a series of containers (to simulate alerts via Kafka, listen to and process alerts by Fink, and redistribute alerts to users). There is some documentation on how to use them, but it might need a good clean-up at some point if someone other than us starts using it. Just ping me whenever.
Find more information from DESC here: https://docs.google.com/document/d/13Kwxcl1DeUDwH94iWeqqYcEsEm1PsOAVE2R1nxFTMdI/edit?usp=sharing
16:42 - 16:55 UTC Informatics & Statistics, Ashish Mahabal (12 min)
Q: could ISSC help the broker community to design and test benchmarks? A: Yes. The ISSC is very interested in helping to coordinate relevant community efforts, so anything involving statistics and machine learning.
Link to LSST forum: https://community.lsst.org/c/sci/statistics/41
For Slack: #issc-ask-the-issc
Q: Can the sequence of classification updates be used for anomaly detection? Do brokers plan on archiving it and providing it? A: in Fink we will store all anomaly scores with time stamps and they will be accessible to the user.
The ISSC web site archives videos from past webinars. The last one was by Anais Moller on Bayesian neural networks. See the "Videos" link here: https://issc.science.lsst.org/
16.55 - 17.25 UTC Long break (30 min)
Moderators: Emille Ishida & Matthew Graham
Science collaboration (SC) use cases II
17:25 - 17:37 UTC Galaxies, Harry Ferguson (12 min)
More interested in long-term classification than real-time identification. Several epochs needed.
Interest in: Novae, Mira variables, Microlensing, emission-line sources, Supernovae, High-Redshift signposts, ...
Q: Will DCR (differential chromatic refraction) related changes trigger alerts (that will be sent out, not flagged as bogus)? A: Trying to build models. They will be flagged and minimised as the survey goes on.
Q: How will you know they are spurious in the early survey?
E: It depends what you mean by “know”. Dipolar effects are by definition inconsistent with a subtraction that produces a clean point source--hence, a high spuriousness score. It is a separate question whether there is real 5-sigma variation + a DCR effect that produces the dipole--I think predicting that is essentially the same as correcting for the DCR effect in the first place.
MWV: In theory it seems like it should be. I can imagine that in practice it takes more information to build a DCR-informed template than it does to identify that one might have concerns about an object.
E: Yes. In the early survey I expect our ML models to simply be trained to recognize clean point sources.
MJ: We'll have to be careful not to be _too_ good at it & accidentally remove moving object trails.
MWV: But an overall takeaway for brokers might be that the orientation of the dipole moment of a source with the parallactic angle, combined with the dipolar strength itself, might be of particular interest.
E: @mjuric we won’t remove anything, just score it. (... other messages)
Q: Although brokers may not be critical for the Galaxies SC science cases, brokers could benefit from your input. Have you considered setting up a service, e.g. API, that brokers/all the community could query to get detailed galaxy properties? I'm sure brokers could help with that
17:37 - 17:49 UTC Stars, Milky Way & Local Volume, John Gizis (12 min)
Q: will you have criteria for classification, like purity? A: (I missed it!)
Q: have you thought about the taxonomy required? A: (I missed it!)
Q: For microlensing, is there any plan for coordinating LSST alerts with specific networks for quick follow-up?
17:49 - 18:01 UTC Solar System, Mike Kelley (12 min)
(missed the broker wishlist...) See slides!
Q: will larger thumbnails be something useful for the cometary outbursts/impacts? A: yes.
Comment: there's a typo on your minimum requirements slide; the DIASources will be linked to SSObject records by LSST's pipelines -- I think you meant to say the brokers shouldn't drop that information.
18:01 - 18:13 UTC Strong Lensing, Timo Anguita (12 min)
Quasar microlensing variation: weeks to years.
LSST cadence alone is probably not enough to do all strong lensing science.
Q: how much does the inference on microlensing depend on static sky info about the star and would it break in very crowded fields where the diff image should still work well but static photometry would not? A: There's no "star" per se in quasar microlensing but the gravitational potential of many stars (think of 10000 star system galactic microlensing events.... 10000 unresolved micro-images). We see the effect in the uncorrelated brightness fluctuations of the "macro" lensed quasar images. To answer your question, as long as the lens is identified and the astrometry allows modeling the "macro" lens it is enough. In any case very accurate lens models will require high resolution imaging and spectroscopic redshifts. So "static" follow up of lensed quasars is necessary.
18:13 - 18:25 UTC Transients & Variable Stars, Michael Stroh (12 min)
Both Time-Critical and non-Time Critical science cases.
There is a broker task force inside TVS.
Q: What satellite orbits would be stationary in RA, Dec during 30 seconds? A: The point is what fields would be empty of satellites for 30 sec, and what would be empty for 30 min to get 2 exposures. Watch out for survey burnout though. Recommended paper: https://arxiv.org/abs/2003.01992
Q: Could we reuse the survey at some point? A: Yes, but (1) we need to think carefully to whom it should be sent to avoid the survey burnout and (2) If it is re-circulated I’d [EB] like to provide feedback before it is re-sent--many users say that they need data that is already included in the baseline alert packet (e.g., the cutouts, the lightcurve history, timeseries features…)
18:25 - 18:30 UTC Short break (5 min)
18:30 - 18:35 UTC Requested talk: PLAsTiCC 2, Mi Dai (5 min)
18.35 - 19.00 UTC Discussion: scientific use cases (25 min, breakout rooms)
README: You need Zoom 5.3+ to use breakout rooms. There will be a poll during the meeting to define the topics of the breakout sessions. Then move to the breakout rooms available. First action in each breakout room will be to elect a moderator.
Notes for the breakout session #2 room #2 (hierarchical broker)
Notes for the breakout session #2 room #1 (taxonomy)
Notes for the breakout session #2 room #4 (Real time vs. delayed broker outputs)
Attendees: Harry Ferguson, Melissa Graham, Michael Kelley, Alex Kim, Ken Smith, Emille Ishida
MWV Notes (written after the fact):
Session 4. Moderators: Timo Anguita & Guillermo Cabrera-Vives
15:00 - 15:11 UTC SNAPS, David Trilling (11 min)
Real time outliers & feature outliers.
Period finding: GPU is 300 times faster! Below 10s to find it.
Web portal: pre-alpha version on Friday 30 October.
Q: in DES there was a lot of interest in finding Kuiper belt objects and beyond; for LSST, is there a plan to tag such objects in a broker ? A: <missed>
Q: Why does SNAPS need both LSST (or ZTF) & another broker stream (Antares in this case)? A: it is about flexibility. Also about optimizing the number of brokers that can connect directly to LSST stream.
Q: Are you able to keep up with the real-time alert stream considering you have so much Lomb–Scargle periodogram fitting to perform? A: That’s our GPU implementation, which was a big effort to make sure we can work at LSST scale. A paper on this GPU approach has been submitted. Comment from Matthew Graham: We have GPU implementations of Lomb-Scargle, Conditional Entropy, and AoV period finding algorithms that should be released to the public soon.
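For readers unfamiliar with the period search mentioned above, here is a minimal CPU sketch of the classical Lomb-Scargle periodogram evaluated on a grid of trial periods. It is not the SNAPS GPU implementation, which is not described in these notes; the sampling pattern, the noise-free test signal and the period grid are made up for the example.

```python
import math

def lomb_scargle_power(times, values, frequency):
    """Classical (unnormalized) Lomb-Scargle power at one frequency."""
    omega = 2.0 * math.pi * frequency
    mean = sum(values) / len(values)
    # Time offset tau makes the sine and cosine terms orthogonal.
    tau = math.atan2(
        sum(math.sin(2.0 * omega * t) for t in times),
        sum(math.cos(2.0 * omega * t) for t in times),
    ) / (2.0 * omega)
    cos_terms = [math.cos(omega * (t - tau)) for t in times]
    sin_terms = [math.sin(omega * (t - tau)) for t in times]
    yc = sum((y - mean) * c for y, c in zip(values, cos_terms))
    ys = sum((y - mean) * s for y, s in zip(values, sin_terms))
    cc = sum(c * c for c in cos_terms)
    ss = sum(s * s for s in sin_terms)
    power = 0.0
    if cc > 0.0:
        power += yc * yc / cc
    if ss > 0.0:
        power += ys * ys / ss
    return 0.5 * power

def best_period(times, values, trial_periods):
    """Return the trial period with the highest Lomb-Scargle power."""
    return max(trial_periods, key=lambda p: lomb_scargle_power(times, values, 1.0 / p))

# Irregularly sampled, noise-free sinusoid with a 0.5 day period (synthetic).
times = [0.37 * i + 0.1 * math.sin(i) for i in range(60)]
mags = [18.0 + 0.3 * math.sin(2.0 * math.pi * t / 0.5) for t in times]
trial_periods = [k / 100.0 for k in range(30, 101)]  # 0.30 to 1.00 days
period = best_period(times, mags, trial_periods)
```

Each trial frequency is independent of the others, which is why this kind of search parallelizes well on a GPU.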
Q: sorry if I missed this. Will you only monitor known objects? or also find new objects? A: only known objects. The survey has to tell us “this is object xyz” and then we can handle it.
Q: Can you share a URL for beta testers for Friday's pre-alpha release? A: sorry, I’m not ready to share a URL yet. We’re not ready to send to outside users. However, I can invite you when we get to a public alpha/beta version.
Q: what exactly do you use from the ANTARES stream? A: Do you mean what information? Object name, magnitudes in 8 and 18 pixels, filter, time — a few other parameters. Not much, in reality. I have a complete list that I can dig up if you are interested.
15:11 - 15:22 UTC Point of Interest, Nina Hernitschek (11 min)
Q: @Nina Hernitschek Are you planning on doing this closer to the MW or in the entire LSST footprint ? A: Specific regions chosen by users, and closer to MW.
Q: are you considering including other classes of variable stars? A: Yes. We try to include as many classes of variable stars as possible.
Q: are you only focusing on RR Lyrae and Cepheids in PS1, or do you have classification of other types of variables from PS1 3pi data? This would be very useful for pre-LSST catalog A: We are not only looking at RR Lyrae and Cepheids. RR Lyrae and Cepheids were just what brought us to the idea of building such an alert broker. Users can define "points (regions) of interest" and get updates within them of a) anything that varies, b) anything that varies regarding a specific pattern, c) anything that was classified as a specific type of variable stars. Use cases are such as: "I want an updated light curve plus period/phase of the one RR Lyrae star I observe frequently", "I want to see all newly found RR Lyrae within Draco dSph", "give me anything that varies periodically within 1 arcmin of...".
Q: Are you thinking of expanding to non-regular variables? A: non-regular variables are a part of the "superset", so they are included e.g. if the user's choice is to get "everything that varies" at a specific "point (region) of interest".
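The "point (region) of interest" idea described in the answers above can be sketched as a set of user subscriptions, each with a centre, a radius and an optional class filter. The data layout, the class labels and the approximate Draco dSph coordinates are assumptions for illustration only, not the broker's actual interface.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class PointOfInterest:
    name: str
    ra_deg: float
    dec_deg: float
    radius_arcmin: float
    variable_class: Optional[str] = None  # None means "anything that varies"

def unit_vector(ra_deg, dec_deg):
    """Cartesian unit vector for a sky position given in degrees."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra), math.cos(dec) * math.sin(ra), math.sin(dec))

def separation_arcmin(ra1, dec1, ra2, dec2):
    """Angular separation from the chord length between two unit vectors."""
    chord = math.dist(unit_vector(ra1, dec1), unit_vector(ra2, dec2))
    return math.degrees(2.0 * math.asin(min(1.0, chord / 2.0))) * 60.0

def matching_points(alert, points):
    """Names of all points of interest that this alert should be sent to."""
    names = []
    for p in points:
        inside = separation_arcmin(alert["ra"], alert["dec"], p.ra_deg, p.dec_deg) <= p.radius_arcmin
        class_ok = p.variable_class is None or alert.get("class") == p.variable_class
        if inside and class_ok:
            names.append(p.name)
    return names

# Hypothetical subscriptions; the Draco dSph position is approximate.
points = [
    PointOfInterest("draco_rrl", 260.05, 57.92, 30.0, "RRLyrae"),
    PointOfInterest("my_star", 260.10, 57.90, 1.0),
]
alert_rrl = {"ra": 260.10, "dec": 57.90, "class": "RRLyrae"}
alert_cep = {"ra": 260.10, "dec": 57.905, "class": "Cepheid"}
alert_far = {"ra": 10.0, "dec": -5.0, "class": "RRLyrae"}
```

With these inputs, the RR Lyrae alert matches both subscriptions, the Cepheid alert matches only the unfiltered 1 arcmin region, and the distant alert matches nothing.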
15:22 - 15:33 UTC Pitt-Google, Troy Raen (11 min)
Q: can you expand on the semantically compressed stream you mentioned? A: Very similar to the lightweight stream. Open to discussion and input with anyone with interest about what should be in that.
Q: will the redistributed full stream be free? or only the compressed version? A: We expect to have some baseline funding such that most user operations would be free and you would only incur cost if you wanted to do things like move large amounts of data out of the region or do compute heavy things like MCMC.
Q: When you say Pub/Sub stream, does that mean a Kafka stream?
Q: is the user subscription free? Do you have an estimate of how many users you can support?
Q: for the template classifier, is A_V part of the fit, to allow objects that are redder/bluer than the template?
Q: So the users must be on google right ? If a user wants to add a single cloud function will they only be responsible for the compute running that ? A: Users could listen to Pub/Sub streams (similar to Kafka) and/or use the API independently from the Google Cloud Platform (GCP), so users do not necessarily need to be on Google. If you want to run a Cloud Function you would need to be on GCP and yes, you would only be responsible for the compute associated with your Cloud Function.
Q: what is the database architecture you are using for the “Big Query” module. A: Currently the full metadata from each alert simply gets its own row in the database. As we move forward we may consider doing something different, e.g. with the lightcurve history. Q: And what’s the database ? SQL-like or no-SQL like Cassandra ? A: Google proprietary SQL db.
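The "one row per alert" layout described in the answer above can be sketched with Python's built-in sqlite3 module as a local stand-in for the SQL database. The table name, column names and values are hypothetical; nothing here reflects the actual Pitt-Google schema.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per alert: a few queryable columns plus the full metadata as JSON.
conn.execute(
    "CREATE TABLE alerts ("
    "alert_id INTEGER PRIMARY KEY, object_id TEXT, jd REAL, "
    "filter TEXT, mag REAL, metadata_json TEXT)"
)

# Hypothetical alerts: two detections of one object, one of another.
alerts = [
    {"alert_id": 1, "object_id": "obj_a", "jd": 2459150.5, "filter": "g", "mag": 19.4},
    {"alert_id": 2, "object_id": "obj_a", "jd": 2459153.5, "filter": "r", "mag": 18.9},
    {"alert_id": 3, "object_id": "obj_b", "jd": 2459151.5, "filter": "g", "mag": 20.1},
]
conn.executemany(
    "INSERT INTO alerts VALUES (?, ?, ?, ?, ?, ?)",
    [(a["alert_id"], a["object_id"], a["jd"], a["filter"], a["mag"], json.dumps(a))
     for a in alerts],
)

# Per-object summary assembled at query time from the individual alert rows.
summary = conn.execute(
    "SELECT object_id, COUNT(*), MIN(mag) FROM alerts "
    "GROUP BY object_id ORDER BY object_id"
).fetchall()
```

In this layout the lightcurve of an object has to be assembled at query time from its alert rows, which is one reason a separate treatment of the lightcurve history might be considered later.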
15:33 - 15:44 UTC Fink, Anais Möller (11 min)
> 30 million ZTF alerts collected in one year
More people are welcome to join the collaboration!
Q: The portal, does it have classification and in such case, what are those? A: Models for SNe, random forest with SNe and other classifications
Q: What does it mean that the data will remain accessible for the full survey duration? Will we lose access immediately after? What about ZTF etc.? A: It means all the alerts + added values will be kept in a database, and made available to the public for 10y (i.e. you can query it, and download data). This represents more than 3PB.
15:44 - 15:55 UTC Lasair, Ken Smith (11 min)
Watchlist up to 1 million sources
Q: Are you planning to develop your own ML classifiers? A: Possibly, but we also want other people to integrate theirs.
15:55 - 16:00 UTC Short break (5 min)
Session 5. Moderators: Melissa Graham & David Young
16:00 - 16:11 UTC Babamul, Matthew Graham (11 min)
<3 min to process 240k alerts
Q: how would the community use the Google Edge TPUs in times of VRO/LSST? You would need to have access to the data locally? A: We demonstrated the streaming is sufficient (even using 4G). // You can deploy your own broker based on a RPi with a couple of hardware accelerators to do the heavy lifting (classifications, etc.) and then this feeds a cloud deployed Fritz instance to handle all the messy filtering, annotations, followup, etc.
16:11 - 16:22 UTC ANTARES, Tom Matheson (11 min)
Q: you are using Kafka to both collect alerts and send alerts to users. So what is the advantage (or disadvantage) of sending gzipped BSON to downstream users while the upstream alerts to ANTARES (at least from ZTF) are in the Avro format? A: Avro is heavier-weight, and annotation is easier with BSON.
16:22 - 16:33 UTC AMPEL, Jakob Nordin (11 min)
Idea of combining 3 operations (broker/supplement/join) in any order, and possibly many times, to build a system.
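A minimal sketch of what chaining such operations could look like, with each operation modelled as a function from a list of alerts to a list of alerts. The meaning given to each operation here (broker = filter, supplement = add computed fields, join = merge in another data source) is an assumption of this example and is not taken from the AMPEL code base.

```python
from functools import reduce

def broker(predicate):
    """Stage that keeps only the alerts passing a filter."""
    return lambda alerts: [a for a in alerts if predicate(a)]

def supplement(annotate):
    """Stage that adds computed fields to every alert."""
    return lambda alerts: [{**a, **annotate(a)} for a in alerts]

def join(other_by_id):
    """Stage that merges in records from another source, matched on 'id'."""
    return lambda alerts: [
        {**a, **other_by_id[a["id"]]} for a in alerts if a["id"] in other_by_id
    ]

def pipeline(*stages):
    """Apply the stages in the order given; any stage may appear several times."""
    return lambda alerts: reduce(lambda acc, stage: stage(acc), stages, alerts)

# Hypothetical inputs.
alerts = [{"id": 1, "mag": 18.0}, {"id": 2, "mag": 21.5}, {"id": 3, "mag": 19.0}]
followup = {1: {"redshift": 0.05}, 3: {"redshift": 0.2}}

system = pipeline(
    broker(lambda a: a["mag"] < 20.0),
    supplement(lambda a: {"bright": a["mag"] < 18.5}),
    join(followup),
    broker(lambda a: a["redshift"] < 0.1),  # the same operation used a second time
)
selected = system(alerts)
```

Because every stage has the same input and output type, the stages can be reordered or repeated without changing the surrounding code.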
Alert ecosystem terminology needs to be better defined.
16:33 - 16:45 UTC ALeRCE, Francisco Förster (11 min)
Q: out of the 2000+ users, how many are repeat users? A: I don’t know the answer, the number is from Google Analytics (combination of IP address and device ID)
Q: What do you do with annotations? A: we store them in a different database than the main S3 storage.
16.45 - 16:50 UTC Requested talk: AEON, César Briceño (5 min)
Queueable observatory: SOAR (co-located with LSST), Gemini, BLANCO
Q: Is it possible to ask for target-of-opportunity observations? And in that case, how should we ask for it; or is it only through asking for time? A: Yes, through proposals to NOIRLab.
16:50 - 17:20 UTC Long break (30 min)
17.20 - 17.50 UTC Discussion: brokers (30 min)
Notes for breakout room #1 inter-broker collaboration
Notes for breakout room #2 public repo for training sets
Attendees (~13 attendees => add your name!): Franz Bauer, Gautham Narayan, Matthew Graham, Paula Sanchez-Saez, Paolo Coppi, Ashish Mahabal, Eric Bellm, Jakob Nordin, Vincenzo Petrecca, Alejandra Muñoz Arancibia ...
Notes for breakout room #3 which features should be computed for brokers?
Attendees (fill your name if I forget): Kostya Malanchev, Nina Hernitschek, Massimo Dall’Ora, Ilsang Yoon, Ken Smith, Leanne Guy, Anais Moller
Notes for breakout room #4 Interaction with broker teams
Attendees: Melissa Graham, Troy Joseph Raen, Alex Kim
Open questions about whether most users just want queryable databases, or streams; and whether they want access to the individual modules within a broker and, if so, which ones and what type of access.
Databases vs. streams: from a DESC perspective, most interest is in queryable database, but for kilonovae a stream might be needed (or a relatively quickly-updated database); beyond DESC it seems like queryable databases with short-ish latencies of updates are generally more broadly useful than an immediate stream-format access to processed alerts.
Modules: yes it would be nice to be able to go directly to the module outputs -- including information about calculations that might be done but not persisted, if such a thing happens, e.g., stuff that doesn’t become a ‘classification alert’ or a downstream published value-added product.
17:50 - 17:55 UTC Short break (5 min)
17.55 - 18.40 UTC Discussion: summary of discussion sessions (45 min)
The moderators of each breakout session summarize the topics of the discussion. See notes above for a complete review (Days 1 & 2).
18.40 - 19.00 UTC Discussion: preparing for the next workshop (20 min)