Status Report: netCDF
October 2019 - April 2020
Ward Fisher, Dennis Heimbigner
Areas for Committee Feedback
We are requesting your feedback on the following topics:
- Are there other cloud-based block storage formats/locations (TileDB, Azure, etc.) that are actively in use? What is the next venue for investigation once we have our Zarr support in place?
- Are there any emergent avenues (Stack Overflow, etc.) for user support which the netCDF team should investigate?
- How can we encourage more user testing of the release candidates we provide?
Activities Since the Last Status Report
We are using GitHub tools for the C, Fortran and C++ interfaces to provide transparent feature development, handle performance issues, fix bugs, deploy new releases and collaborate with other developers. Additionally, we are using Docker technology to run netCDF-C, Fortran and C++ regression and continuous integration tests. We currently have 101 open issues for netCDF-C, 48 open issues for netCDF-Fortran, and 31 open issues for netCDF-C++. The netCDF Java interface is maintained by the Unidata CDM/TDS group, and we collaborate with external developers to maintain the netCDF Python interface.
In the netCDF group, progress has been made in the following areas since the last status report:
- Multiple releases of the core C library, as well as the Fortran and C++ interfaces.
- Refinement of user-defined compression filters.
- Continuing work towards enhanced parallel I/O.
- Further enhancements to the netCDF-C documentation and modernization of the netCDF-Fortran documentation.
- Progress moving the netCDF Users Guide (NUG) into its own repository.
- Extended continuous integration platforms have been adopted.
- An architecture roadmap is available describing how the netcdf-c library will support thread-safe operation in *nix and Windows environments. The draft proposal is available as netcdf-c GitHub issue #382.
- We have seen a high volume of contributions to the netCDF code base(s) from our community. While these contributions require careful review and consideration, it is encouraging to see this model of development (enabled by our move to GitHub) being more fully embraced by our community.
Dependencies, challenges, problems and risks include:
- A small group of developers supports a large project.
- Dependency on HDF5, which is controlled by an external group.
- The rapid evolution of the Zarr standard is very useful, but it also makes Zarr a moving target.
- The increase in external contributions has greatly increased the project management overhead for netCDF-C/C++/Fortran.
Ongoing Activities
We plan to continue the following activities:
- Continue work towards adoption of additional storage options, separating out the data model from the data storage format (as much as possible).
- Provide support to a large worldwide community of netCDF developers and users.
- Continue development, maintenance, and testing of source code for multiple language libraries and generic netCDF utility programs.
- Continue organization of Doxygen-generated documentation for netCDF-C, Fortran and C++ libraries.
- Extend collaboration, as opportunities arise, to increase the efficiency of parallel netcdf-3 and netcdf-4.
New Activities
NetCDF/Zarr Integration
The netCDF team has begun the technical work of adopting Zarr functionality in the core C library. This will allow for object-based storage (Amazon S3, etc.). We expect to have the initial release in the next 1-2 months.
Status
- Metadata is being properly written and read.
- The content data of variables is not yet supported.
- ncgen and ncdump -h work with some limitations (as shown).
- Much testing:
  - move and modify tests from the ncdump/ and the nc_test4/ directories to zarr_test/.
Support for pure zarr
- Requires producing simulated data when _ncz... is missing:
  - simulated _nczarr: get the zarr version from the root group, use library built-in value for nczarr_version.
  - simulated _nczcontent: Assume we have a group whose key is e.g. /y/z/g.
    - dims: for every variable whose shape is, say, (m,n), create dimensions in the root group of the form dim_m=m and dim_n=n.
    - vars: collect values of X for all keys of form "/y/z/g/X/.zarray".
    - grps: collect values of X for all keys of form "/y/z/g/X/.zgroup".
  - simulated _nczvar: contains netcdf-4 specific information for a variable.
    - dimrefs: use the shape of the variable to figure out the dim names.
    - contiguous is always false.
  - simulated _nczattr: infer attribute type from the values of the attribute.
    - (process is somewhat complex and is similar to algorithm in ncgen)
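The simulated _nczcontent and dimrefs steps listed above can be illustrated with a small sketch. This is not the netCDF-C implementation: it is a Python illustration that assumes the store is visible as a flat list of keys with a leading "/", and the function names (simulate_content, simulate_dims, simulate_dimrefs) and the sample keys are hypothetical, chosen only for this example.

```python
def simulate_content(keys, group):
    """Infer the vars and grps of `group` from the keys of a pure Zarr store.

    A direct child X is a variable if "<group>/X/.zarray" exists and a
    subgroup if "<group>/X/.zgroup" exists.
    """
    prefix = group.rstrip("/") + "/"
    variables, groups = [], []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        parts = key[len(prefix):].split("/")
        # Keep only direct children, i.e. exactly "X/.zarray" or "X/.zgroup".
        if len(parts) != 2:
            continue
        name, marker = parts
        if marker == ".zarray":
            variables.append(name)
        elif marker == ".zgroup":
            groups.append(name)
    return variables, groups


def simulate_dims(shapes):
    """Create root-group dimensions dim_<size>=<size> for every shape entry.

    `shapes` maps a variable name to its shape tuple (hypothetical input;
    in a real store the shape would be read from each .zarray object).
    """
    dims = {}
    for shape in shapes.values():
        for size in shape:
            dims[f"dim_{size}"] = size
    return dims


def simulate_dimrefs(shape):
    """Derive the dimension names of one variable from its shape alone."""
    return [f"dim_{size}" for size in shape]


# Hypothetical store contents for the group /y/z/g.
example_keys = [
    "/.zgroup",
    "/y/.zgroup",
    "/y/z/.zgroup",
    "/y/z/g/.zgroup",
    "/y/z/g/temp/.zarray",
    "/y/z/g/temp/0.0",
    "/y/z/g/sub/.zgroup",
    "/y/z/g/sub/p/.zarray",
]
example_shapes = {"temp": (4, 6), "p": (6,)}

content = simulate_content(example_keys, "/y/z/g")  # (["temp"], ["sub"])
dims = simulate_dims(example_shapes)                # {"dim_4": 4, "dim_6": 6}
dimrefs = simulate_dimrefs(example_shapes["temp"])  # ["dim_4", "dim_6"]
```

Because dimension names are derived only from sizes, two unrelated variables with the same extent end up sharing a dimension in this sketch; that is a consequence of the naming scheme described above, not an additional design decision.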
Over the next three months, we plan to organize or take part in the following:
- Work on reducing the defects reported by static analysis.
- Release the first version of netCDF with Zarr support.
- Release the next versions of netCDF-C, netCDF-Fortran, netCDF-C++.
- Continue modernizing the netCDF documentation to provide easy access to documentation for older versions of netCDF.
Over the next twelve months, we plan to organize or take part in the following:
- Release an official Windows port of the netCDF-Fortran and netCDF-C++ interfaces.
- Continue to encourage and support the use of netCDF-4's enhanced data model by third-party developers.
- Expand support for native object storage in the netCDF C library.
- Enhance thread-safety for the netCDF C library.
- Continue to represent the Unidata community in the HDF Technical Advisory Board process.
- Continue to represent the Unidata community in the Zarr/n5 collaboration conference calls.
Beyond a one-year timeframe, we plan to organize or take part in the following:
- Improve scalability to handle huge datasets and collections.
- Improve the efficiency of parallel netcdf3 and parallel netcdf4.
- Continue to add support for both file-storage and object-storage options.
Relevant Metrics
Static Analysis Metrics
There are currently about 226,892 lines of code (up from 202,428) in the netCDF C library source. The Coverity estimate for defect density (the number of defects per thousand lines of code) in the netCDF C library source is 0.68, the same value reported six months ago. According to Coverity static analysis of over 250 million lines of code in open source projects that use their analysis tools, the average defect density for projects with 100,000 to 500,000 lines of code is 0.50.
Google Metrics
Google hits reported when searching for a term such as netCDF-4 don't seem very useful over the long term, as the algorithms for quickly estimating the number of web pages containing a specified term or phrase are proprietary and seem to change frequently. However, this metric may be useful at any particular time for comparing popularity among a set of related terms.
Currently, Google hits, for comparison, are:
- 900,000 for netCDF-3
- 861,000 for netCDF-4
- 924,000 for HDF5
- 106,000 for GRIB2
Google Scholar hits, which supposedly count appearances in peer-reviewed scholarly publications, are:
- 407 for netCDF-3
- 853 for netCDF-4
- 14,900 for HDF5
- 1,240 for GRIB2
Strategic Focus Areas
We support the following goals described in the Unidata Strategic Plan:
- Managing Geoscience Data
by supporting the use of netCDF and related technologies for analyzing, integrating, and visualizing multidimensional geoscience data; enabling effective use of very large data sets; and accessing, managing, and sharing collections of heterogeneous data from diverse sources.
- Providing Useful Tools
by developing netCDF and related software, and creating regular software releases of the C, C++ and Fortran interfaces; providing long-term support for these tools through the various avenues available to the Unidata staff (GitHub, eSupport, Stack Overflow, etc.).
- Supporting People
by providing expertise in implementing effective data management, conducting training workshops, responding to support questions, maintaining comprehensive documentation, maintaining example programs and files, and keeping online FAQs, best practices, and web site up to date; fostering interactions between community members; and advocating community perspectives at scientific meetings, conferences, and other venues.
Prepared March 2020