Status Report: netCDF
October 2020 - April 2021
Ward Fisher, Dennis Heimbigner
Areas for Committee Feedback
We are requesting your feedback on the following topics:
- Are there other cloud-based block storage formats/locations (TileDB, Azure, etc.) that are actively in use? What should we investigate next once our Zarr support is in place?
- How can we encourage more user testing of the release candidates we provide?
Activities Since the Last Status Report
We are using GitHub tools for the C, Fortran and C++ interfaces to provide transparent feature development, handle performance issues, fix bugs, deploy new releases, and collaborate with other developers. Additionally, we are using Docker to run netCDF-C, Fortran and C++ regression and continuous integration tests. We currently have 164 open issues for netCDF-C, 68 open issues for netCDF-Fortran, and 40 open issues for netCDF-C++. The netCDF Java interface is maintained by the Unidata CDM/TDS group, and we collaborate with external developers to maintain the netCDF Python interface.
In the netCDF group, progress has been made in the following areas since the last status report:
- ncZarr (netCDF with native Zarr support) has been released as of netCDF-C version 4.8.0.
- Migrated the NetCDF User’s Guide to a new, separate repository. This repository will contain the concise, language-agnostic summary of the netCDF data model. Language-specific documentation (primarily used by developers) will remain associated with the individual code repositories.
- Further enhanced the netCDF-C documentation and modernized the netCDF-Fortran and netCDF-C++ documentation.
- We continue to see a high volume of contributions to the netCDF code base(s) from our community. While these contributions require careful review and consideration, it is encouraging to see this model of development (enabled by our move to GitHub) being more fully embraced by our community.
- We have presented the netCDF-Zarr roadmap and plans multiple times to working groups at both NASA and NOAA.
Dependencies, challenges, problems and risks include:
- The small group of netCDF developers is under a lot of pressure to provide project management as well as implement new features, fix bugs, provide eSupport, etc. With 1.5 FTE assigned to the project, the workload is significant.
- Rapid evolution of the Zarr standard is very useful, but also makes it a bit of a moving target.
- Increase in external contributions has greatly increased the project management overhead for netCDF-C/C++/Fortran.
- Advances in compilers (GCC 10.x) and newer architectures (such as Apple’s ARM M1 architecture) are requiring additional overhead to ensure compatibility.
Ongoing Activities
We plan to continue the following activities:
- Continue work towards adoption of additional storage options, separating out the data model from the data storage format (as much as possible).
- Provide support to a large worldwide community of netCDF developers and users.
- Continue development, maintenance, and testing of source code for multiple language libraries and generic netCDF utility programs.
- Continue modernizing the documentation for netCDF-C, Fortran and C++ libraries.
- Extend collaboration, as opportunities arise, to increase the efficiency of parallel netCDF-3 and netCDF-4.
New Activities
NetCDF/Zarr Integration
The netCDF team has released the first public version of netCDF-C which provides Zarr I/O compatibility, dubbed ‘ncZarr’. This work has been highly anticipated, and well received, by the broader netCDF community.
- Release of XArray compatibility.
Over the next three months, we plan to organize or take part in the following:
- Release the first version of netCDF with Zarr+Xarray support (ncZarr).
- Release subsequent versions of netCDF-C, netCDF-Fortran, netCDF-C++.
- Continue modernizing/editing the netCDF documentation to provide easy access to documentation for older versions of netCDF.
Over the next twelve months, we plan to organize or take part in the following:
- Release an official Windows port of the netCDF-Fortran and netCDF-C++ interfaces.
- Continue to encourage and support the use of netCDF-4's enhanced data model by third-party developers.
- Expand support for native object storage in the netCDF C library.
- Continue to represent the Unidata community in the HDF Technical Advisory Board process.
- Continue to represent the Unidata community in the Zarr/n5 collaboration conference calls.
Beyond a one-year timeframe, we plan to organize or take part in the following:
- Improve scalability to handle huge datasets and collections.
- Improve the efficiency of parallel netCDF-3 and parallel netCDF-4.
- Continue to add support for both file-storage and object-storage options.
Relevant Metrics
Static Analysis Metrics
There are currently about 226,892 lines of code (up from 202,428 lines of code) in the netCDF C library source. The Coverity estimate for defect density (the number of defects per thousand lines of code) in the netCDF C library source is 0.68; the figure reported six months ago was also 0.68. According to Coverity static analysis of over 250 million lines of code in open source projects that use their analysis tools, the average defect density for projects with 100,000 to 500,000 lines of code is 0.50.
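To make the arithmetic behind these figures explicit, here is a minimal sketch in Python. It uses only the numbers quoted above; the function names are hypothetical, and the implied defect count is a back-of-the-envelope derivation from the density definition, not a number reported by Coverity.

```python
def implied_defects(density_per_kloc: float, lines_of_code: int) -> float:
    """Defect count implied by a density given in defects per thousand lines."""
    return density_per_kloc * lines_of_code / 1000.0


def percent_growth(previous: int, current: int) -> float:
    """Relative growth from previous to current, as a percentage."""
    return (current - previous) / previous * 100.0


# Figures quoted in this report (hypothetical variable names).
current_loc = 226_892
previous_loc = 202_428
netcdf_density = 0.68   # defects per thousand lines, netCDF C library
average_density = 0.50  # Coverity average for 100,000 to 500,000 line projects

print(f"Implied defects at netCDF density:  {implied_defects(netcdf_density, current_loc):.0f}")
print(f"Implied defects at average density: {implied_defects(average_density, current_loc):.0f}")
print(f"Code growth since last report:      {percent_growth(previous_loc, current_loc):.1f}%")
```

Under these assumptions, a density of 0.68 over roughly 227 thousand lines corresponds to about 154 flagged defects, versus about 113 at the quoted average density, while the code base grew by about 12%.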
Google Metrics
Google hits reported when searching for a term such as netCDF-4 don't seem very useful over the long term, as the algorithms for quickly estimating the number of web pages containing a specified term or phrase are proprietary and seem to change frequently. However, this metric may be useful at any particular time for comparing popularity among a set of related terms.
Currently, Google hits, for comparison, are:
- 951,000 for netCDF-3
- 884,000 for netCDF-4
- 1,130,000 for HDF5
- 116,000 for GRIB2
Google Scholar hits, which supposedly count appearances in peer-reviewed scholarly publications, are:
- 440 for netCDF-3
- 972 for netCDF-4
- 17,500 for HDF5
- 1,380 for GRIB2
Strategic Focus Areas
We support the following goals described in the Unidata Strategic Plan:
- Managing Geoscience Data
by supporting the use of netCDF and related technologies for analyzing, integrating, and visualizing multidimensional geoscience data; enabling effective use of very large data sets; and accessing, managing, and sharing collections of heterogeneous data from diverse sources.
- Providing Useful Tools
by developing netCDF and related software, and creating regular software releases of the C, C++ and Fortran interfaces; providing long-term support for these tools through the various avenues available to the Unidata staff (GitHub, eSupport, Stack Overflow, etc.).
- Supporting People
by providing expertise in implementing effective data management, conducting training workshops, responding to support questions, maintaining comprehensive documentation, maintaining example programs and files, and keeping online FAQs, best practices, and web site up to date; fostering interactions between community members; and advocating community perspectives at scientific meetings, conferences, and other venues.
Prepared April 2021