Status Report: THREDDS
April 2019 - October 2019
Sean Arms, Ethan Davis, Dennis Heimbigner, Cece Hedrick, Ryan May, Jennifer Oxelson, Howard Van Dam II
Areas for Committee Feedback
We are requesting your feedback on the following topics:
- Do you know of anyone using netCDF-Java to read Vis5D grid files?
- Do you know anyone who just **loves** BUFR and understands it inside and out? Unidata could use your help (knowledge of Java **not** required)!
- With an eye towards guiding our outreach efforts, what THREDDS Data Servers do you use, and are there any that you consider critical to your needs?
Activities Since the Last Status Report
The THREDDS Project
The THREDDS Project encompasses four projects: **netCDF-Java, the THREDDS Data Server (TDS), Rosetta, and Siphon** (the Unidata Python client to interact with remote data services, such as those provided by the TDS). For specific information on Siphon, please see the Python Status Report. An update regarding cloud efforts related to the TDS, including the popular Docker container effort, can be found in the Cloud Computing Activities Status Report.
The various THREDDS projects were featured in a paper regarding Data Interoperability presented at OceanObs’19, a decadal conference series which seeks to improve response to scientific and societal needs of a fit-for-purpose integrated ocean observing system, for better understanding the environment of the Earth, monitoring climate, and informing adaptation strategies as well as the sustainable use of ocean resources.
Data Interoperability Between Elements of the Global Ocean Observing System. Derrick P Snowden, Vardis Tsontos, Nils Olav Handegard, Marcos Zarate, Kevin M. O'Brien, Kenneth S Casey, Neville Smith, Helge Sagen, Kathleen Bailey, Mirtha Lewis, Sean Arms. Review, Front. Mar. Sci. - Ocean Observation, Submitted on: 15 Nov 2018. DOI: 10.3389/fmars.2019.00442
Released netCDF-Java 5 (Stable)
- NetCDF-Java version 5.0.0 was released on 29 July 2019. NetCDF-Java version 5.1.0 was released on 12 September 2019.
- Prior to version 5, the netCDF-Java/CDM library and the THREDDS Data Server (TDS) were built and released together. Starting with version 5, these two packages have been decoupled and now live in separate git repositories, allowing new features and bug fixes to be implemented in each package separately and released independently.
- The codebase of netCDF-Java can be found at https://github.com/unidata/netcdf-java
Released TDS version 4.6.14 (Stable)
- TDS version 4.6.14 was released on 29 July 2019.
- As part of this release, netCDF-Java version 4.6.14 was also released. However, the 4.6.x line of development for netCDF-Java is now in maintenance mode, and will only include security-related fixes. All users of netCDF-Java are strongly encouraged to move to the latest netCDF-Java (as of this report, version 5.1.0).
Released TDS version 5.0.0-beta7
- TDS version 5.0.0-beta7 was released on 29 July 2019.
- We anticipate releasing a stable version of TDS 5.0.0 before the next committee meeting.
- Starting with TDS v5.0.0-beta7, the TDS codebase can be found at https://github.com/unidata/tds
Documentation for netCDF-Java / TDS (Beta) v5
Rosetta
Rosetta continues to progress following a very successful NASA ACCESS grant (the Oceanographic In-situ data Interoperability Project, or **OIIP**), in which Unidata partnered with the PO.DAAC at JPL and UMASS-Boston. A poster was presented at the Fall AGU 2018 meeting with respect to the advances in Rosetta related to OIIP. We continue to work with JPL as part of their user acceptance process, with the ultimate goal being the operational use of Rosetta at the PO.DAAC. We, along with the PI of the OIIP project, participated in an ESIP Marine Data Cluster presentation.
Progress has been made on the following:
- Supporting user selection of an appropriate standard name from the CF Conventions has been particularly challenging. Work has been done to create a mind-map of the various standard names as a design prototype for building a standard name selection widget. While it is easy to stuff all of the available standard names into a dropdown list (>4000 names), it is not particularly user friendly, does not promote discoverability, and is certainly not performant.
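As a rough illustration of why a searchable widget may serve users better than a flat dropdown, the sketch below filters a list of standard names by typed keywords. It is not Rosetta's implementation and is a different technique from the mind-map prototype described above. The function name `search_standard_names` and the six-entry sample list are assumptions made for this example; a real widget would load the full CF standard name table.

```python
# Illustrative sample only: the real CF standard name table has >4000 entries.
SAMPLE_STANDARD_NAMES = [
    "air_pressure",
    "air_temperature",
    "sea_surface_temperature",
    "sea_water_salinity",
    "sea_water_temperature",
    "wind_speed",
]


def search_standard_names(query, names=SAMPLE_STANDARD_NAMES):
    """Return the names that contain every whitespace-separated term in `query`.

    Matching is a case-insensitive substring test, so partial words such as
    "temp" match while the user is still typing.
    """
    terms = query.lower().split()
    if not terms:
        # An empty query returns nothing rather than the whole table.
        return []
    return [name for name in names if all(term in name for term in terms)]


if __name__ == "__main__":
    for hit in search_standard_names("sea temp"):
        print(hit)
```

Returning nothing for an empty query is a deliberate choice in this sketch: it avoids rendering the entire table at once, which is the usability and performance problem noted above.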
General dependencies, challenges, problems, and risks include:
- While all Java-based components in the THREDDS project run under Java 11, all except Rosetta currently build with Java 8. Portions of our build infrastructure need to be reworked to use Java 11. The end of 2020 is generally marked as end-of-life for Java 8, and thus moving to Java 11 is a priority.
- Calling all beta testers! The goal of beta testing TDS 5 is to ensure that the current capabilities of 4.6.x are working in the new version (and if some bugs get fixed in the process, even better!). Beta testing by our users is critical, and so far we have had several community members offer their help (special thanks to Rich Signell, Peter Pokrant, Victor Gensini, the NCAR RDA, etc.!).
- While the list of names on this report seems large, the current staffing level across all three components covered in this report is less than 2.5 FTE (including externally funded efforts). A similar resource landscape can be seen for nearly every other project run by the Unidata Program Center. External funds help (we are currently seeking one opportunity), but rarely provide the ability to bring on and sustain new staff members, which results in taking resources away from other projects and efforts within the Unidata Program Center. Given your position as members of our governing committees, you play a critical role in helping us set priorities. Your feedback is greatly desired and very much appreciated.
Ongoing Activities
We plan to continue the following activities:
- Maintain thredds.ucar.edu and keep up with the addition of new datasets to the IDD.
- Closely monitor the security status of our project dependencies, and provide updated versions of our libraries and server technologies to address issues as needed.
- Clearly define the public API of netCDF-Java.
The following active proposals directly involve THREDDS work:
- Thanks to Rich Signell, we, along with Axiom Data Science, submitted and were **awarded** a NOAA IOOS grant. The proposal was entitled “A Unified Framework for IOOS Model Data Access”, and its goal is to enable support of the UGRID specification within the THREDDS stack, as well as to create a GRID featureType to allow for serving large collections of gridded datasets (including UGRID). This work will fund a Unidata staff member at 0.5 FTE for two years, as well as two co-PIs at Axiom Data Science at a slightly lower level. This work **strategically aligns with the Unidata 2024 focus area “Managing Geoscience Data, Making Geoscience Data Accessible”** by improving the reliability and scalability of the TDS to handle very large collections of gridded datasets, as well as **“Managing Geoscience Data, Enhancing Community Access to Data”** through the addition of UGRID support (example: MPAS output is on a mesh, a.k.a. “unstructured”, grid).
- We partnered with JPL / PO.DAAC on a NASA ROSES Advanced Information Systems Technology (AIST) solicitation. The proposal was titled “MIKITA – Multi-sensor Data Integration Microservices for Knowledge InTensive Applications.” The focus of the work is on extending the Metadata Profile Service (MPS) and creating Semantic Technology based microservices which leverage the MPS. While the bulk of the work is on the JPL side, we are proposing to extend Rosetta to interact directly with the MPS through its web API, and to extend the TDS to provide metadata records in a more semantic-friendly format, such as JSON-LD based on schema.org. While we at Unidata are not experts on the semantic web, we can certainly empower those who are. The semantic technology work **strategically aligns with the Unidata 2024 focus area “Managing Geoscience Data, Making Geoscience Data Discoverable”** by fostering community adoption of standard data discovery services to help users locate and acquire data appropriate for their projects, working to ensure that data available from Unidata-managed services are discoverable through standard data discovery mechanisms, and using Schema.org markup of datasets in support of search engine discovery (e.g. Google Dataset Search), as well as **“Managing Geoscience Data, Making Geoscience Data Usable”** by creating ways that netCDF-Java and the TDS can take advantage of “Linked Data” best practices for exposing, sharing, and connecting datasets across networks.
Our work on Rosetta **strategically aligns with the Unidata 2024 focus area “Providing Useful Tools, Creating Modern Data Management Workflows”** by further enabling long-tail data providers to produce files that conform to widely adopted, community driven data and metadata standards, as well as **“Managing Geoscience Data, Making Geoscience Data Usable”** by promoting the use of data standards like netCDF with the Climate and Forecast (CF) conventions that allow scientists to quickly understand the shape and provenance of datasets. In general, this work **strategically aligns with the Unidata 2024 focus area “Providing Useful Tools, Creating Modern Data Management Workflows”** through the creation of cloud-native data access services, based around common cloud technologies like object store, hosted databases, and “serverless” technologies. **UPDATE** As of 2019-09-30, we have learned that the proposal was not selected by the AIST program.
New Activities
Over the next three months, we plan to organize or take part in the following:
- Take a deep look into current dependencies and reduce them as much as possible.
- Facilitate migration from netCDF-Java 4.6.x to 5.x where we can.
- For the most part, this is a “baking” period for 5.x - focus on bug fixes, non-visible library changes, few new features.
- Exception - enhance support for S3 storage
- Complete command line tool creating a WRF intermediate file from a subsetted GRIB dataset.
- Getting TDS v5.0 to a stable release (release candidate targeted for late 2019 / early 2020).
- Implement option to create WRF intermediate files from GRIB datasets via TDS user interface. Support storage of pre-defined dataset variables for ease of WRF file recreation.
- AGU 2019 presentation covering UGRID work (supported by NOAA/NOS/IOOS COMT award).
Over the next twelve months, we plan to organize or take part in the following:
- Begin to modularize (Java Platform Module System) but maintain Java 8 compatibility.
- Define public API and get 90%+ test coverage of it.
- API breaks likely as we restructure our current artifacts
- Initial support for reading / writing Zarr
- Release TDS version 5 (Stable)
- Create a TDS Registry
Beyond a one-year timeframe, we plan to organize or take part in the following:
- Remove deprecated code
- Fully support Java 11 and the Java Platform Module System (end of Java 8 support)
- Commit to semantic versioning
- Reduce dependency footprint
- Modularize (Java Platform Module System)
- Create a collection level update notification system
- Create a collection level metadata search across TDSs
- Look at re-architecting the TDS to “really” run on the cloud
- TDS as a collection of autoscalable microservices vs monolithic web application. With our current resource levels, this is a stretch.
Relevant Metrics
NetCDF-Java
Recently we’ve found that the method by which we count download statistics for the netCDF-Java library has been excluding a significant source of downloads. Traditionally, users have downloaded a single jar (netcdfAll.jar) to use with their packages. However, the way Java developers have been consuming our libraries has changed significantly over the past several years, as many projects now rely on a build system to pull in just the components of the library they need, and we have not taken that into account.
As an example, if we consider the netCDF-Java version 4.6.13 release, our current method for computing library downloads would show 1,717 downloads over the past six months. If we take the previously unaccounted for downloads into account (for just the core component of the library, cdm.jar, and not all components), the number of downloads for netCDF-Java version 4.6.13 over the past six months jumps to 16,049.
To put this new number in perspective, at our last meeting we reported that the total **yearly** downloads of **all versions** of netCDF-Java from 2018-02 through 2019-03 was 8,389, and our new **six month** download figure for a **single version** of netCDF-Java nearly doubles that. As such, we will be changing the way the download statistics of netCDF-Java are computed to better reflect the community of developers who rely on the library. These changes will be reflected in the download metrics reported at our next meeting (Spring 2020).
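The size of the gap can be checked with simple arithmetic on the three figures quoted above. The short sketch below uses only those figures; the variable names are illustrative.

```python
# Download figures quoted in this report for netCDF-Java.
old_method_4_6_13 = 1717    # v4.6.13, past six months, current counting method
new_method_4_6_13 = 16049   # v4.6.13, past six months, including cdm.jar pulled by build systems
all_versions_prior = 8389   # all versions, 2018-02 through 2019-03, reported at the last meeting

# How much the current method undercounts a single release.
undercount_factor = new_method_4_6_13 / old_method_4_6_13

# How the new six-month, single-version figure compares with the previous all-version total.
ratio_to_prior_total = new_method_4_6_13 / all_versions_prior

print(f"new / old count for 4.6.13: {undercount_factor:.1f}x")                       # 9.3x
print(f"new 4.6.13 count / prior all-version total: {ratio_to_prior_total:.2f}x")    # 1.91x
```

In other words, counting build-system downloads of cdm.jar raises the figure for this one release by roughly a factor of nine, and that figure alone is about 1.9 times the previously reported total for all versions.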
THREDDS Data Server
We see that **10,436** unique IPs started up the TDS from March 2019 through August 2019, **106** of which are publicly accessible servers. “Publicly accessible” means we could find them using common URL patterns. For this plot, the version includes betas and snapshots, not just the official release of that version, for presentation simplicity.
This information is only known for servers running v4.5.3 and above. There are many reasons why these numbers are so different. The differences could be due to:
- People testing the TDS on their local machine, but not actually running a server (most likely the cause for the majority of the difference)
- A TDS running behind a proxy server may not be “seen” in this analysis as publicly reachable at the tested url pattern (e.g. <server>/thredds/catalog.xml). For example, a TDS running behind a proxy might be configured to respond to mytds.<server>/catalog.xml, and so our check for mytds.<server>/thredds/catalog.xml would not work. This can also happen if the TDS has been configured to use a different context without the use of a proxy server.
- The TDS server may be running behind a firewall that does not allow public access.
- A TDS running in the past is no longer running today.
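The URL-pattern check described above can be sketched as follows. This is a minimal illustration and not the script used for this analysis. The function names, the `https` scheme, the timeout, and the rule that only an HTTP 200 response counts as "reachable" are all assumptions; only the `<server>/thredds/catalog.xml` pattern comes from the text.

```python
from urllib.request import urlopen


def catalog_url(server, context="thredds", scheme="https"):
    """Build the top-level catalog URL, assuming the default TDS context."""
    return f"{scheme}://{server}/{context}/catalog.xml"


def is_publicly_reachable(server, fetch=urlopen, timeout=10):
    """Return True if the default catalog URL answers with HTTP 200.

    `fetch` is injectable so the check can be exercised without network access.
    """
    try:
        with fetch(catalog_url(server), timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError and HTTPError are subclasses of OSError: covers DNS failures,
        # refused connections, timeouts, and non-2xx responses.
        return False


if __name__ == "__main__":
    # Hypothetical host name, used only to show the URL that would be probed.
    print(catalog_url("tds.example.edu"))
```

A server proxied at a different path, or configured with a non-default context, would fail this check even though it is running. That is the second cause in the list above.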
Note 1: the vast majority of the publicly accessible servers are running v4.6.3 or above (v4.6.14 was the most current release during this period, and was released on 26 July 2019 ).
Note 2: there are some odd looking versions of the TDS being reported in the log files, such as TDS_4.39.x. It is likely these version numbers are actually generated by software that is being built on top of the TDS or applications that bundle the TDS as part of a deployment package.
Furthermore, of the **106** publicly accessible servers, **64** have updated the name of their server in their server configuration file (taken as a sign that they are maybe possibly intended to be used by others...maybe…).
In the next six months, we will be working towards enabling TDSs, on an opt-in basis, to officially advertise their availability to the community through a centralized resource.
Strategic Focus Areas
The THREDDS projects covered in this report support the following goals described in the Unidata Strategic Plan:
- Managing Geoscience Data
The component software projects of the THREDDS project work to facilitate the management of geoscience data from four points of view: __Making Geoscience Data Accessible, Making Geoscience Data Discoverable, Making Geoscience Data Usable, and Enhancing Community Access to Data__. As a client-side library, **netCDF-Java** enables end users to read a variety of data formats both locally and across numerous remote technologies. Less user-friendly formats, such as GRIB, are augmented with metadata from community-driven metadata standards (e.g. Climate and Forecast metadata standards), and viewed through the more user-friendly Common Data Model (very similar to the netCDF Data Model), providing a single set of Java APIs for interacting with a multitude of formats and standards. The **THREDDS Data Server** exposes the power of the netCDF-Java library outside of the Java ecosystem with the addition of remote data services, such as __OPeNDAP__, __cdmremote__, __OGC WCS__ and __WMS__, __HTTP direct download__, and other remote data access and subsetting protocols. The TDS also exposes metadata in standard ways (e.g. ISO 19115 metadata records, JSON-LD metadata following schema.org), which are used to drive search technologies. **Rosetta** facilitates the process of translating ASCII-based observational data into standards-compliant, archive-ready files. These files are easily read into netCDF-Java and can be served to a broader community using the TDS.
- Providing Useful Tools
Through Rosetta, the THREDDS project seeks to intercede in the in-situ based observational data management lifecycle as early as possible. This is done by enabling those who produce the data to create archive-ready datasets as soon as data are collected from a sensor or platform, without the need to write code or intimately understand metadata standards. NetCDF-Java and the TDS continue to support legacy workflows by maintaining support for legacy data formats and decades-old data access services, while promoting 21st century scientific workflows through the creation of new capabilities (such as adding Zarr support) and services.
- Supporting People
Outside of writing code, the THREDDS project seeks to support the community by __providing technical support, working to build capacity through Open Source Software development, and by building community cyber-literacy__. The team provides expert assistance on software, data, and technical issues through numerous avenues, including participation in community mailing lists, providing developer guidance on our GitHub repositories, and leading and participating in workshops across the community. The team also actively participates in “upstream” open source projects in an effort to help sustain the efforts on which we rely and build. We have mentored students as part of the Unidata Summer Internship Program, and worked across organizations and disciplines in support of their internship efforts.
Prepared September 2019