Status Report: Python
April - September 2015
Ryan May, Sean Arms, Julien Chastang, Ward Fisher, Russ Rew
Strategic Focus Areas
We support the following goals described in the Unidata Strategic Plan:
- Enable widespread, efficient access to geoscience data
Python can facilitate data-proximate computations and analyses through Jupyter Notebook technology. In particular, Jupyter Notebook web servers can be co-located with the data source for analysis and visualization through web browsers. This capability, in turn, reduces the amount of data that must travel across computing networks.
- Develop and provide open-source tools for effective use of geoscience data
Our current and forthcoming efforts in the Python arena will facilitate analysis of geoscience data. This goal will be achieved by continuing to develop Python APIs tailored to Unidata technologies. Starting with the summer 2013 Unidata training workshop, we developed an API to facilitate data access from a THREDDS data server. This effort now lives in the new Siphon project, which is an API for communicating with a THREDDS server. Moreover, Python coupled with HTML5 Jupyter Notebook technology has the potential to address "very large datasets" problems. In particular, a Jupyter Notebook can, in principle, be co-located with the data source and accessed via a web browser, thereby allowing geoscience professionals to analyze data where the data reside without having to move large amounts of information across networks. This concept fits nicely with the "Unidata in the cloud" vision. Lastly, as a general-purpose programming language, Python can analyze and visualize diverse data in one environment through numerous well-maintained open-source APIs.
- Provide cyberinfrastructure leadership in data discovery, access, and use
The TDS catalog crawling capabilities found in Siphon will facilitate access to data remotely served by the Unidata TDS, as well as other TDS instances around the world. The goal of pyCDM is to construct a geoscience-focused data model in Python, based heavily on the netCDF-Java implementation of the Common Data Model (CDM). pyCDM is anticipated to provide a simple, Pythonic API to the higher-level functionality of the FeatureType layer of the CDM.
- Build, support, and advocate for the diverse geoscience community
Based on interest from the geoscience community, Unidata, as part of its annual training workshop, hosted a three-day session to explore “Python with Unidata technology”. Also, to encourage the use of netCDF in Python, Unidata has promoted Jeff Whitaker’s netCDF4-python project, including hosting its repository under Unidata’s GitHub account. Unidata is also fostering community development of meteorology-specific tools under the MetPy grassroots project.
Activities Since the Last Status Report
Training Workshop
The Python with Unidata Technologies workshop had 12 attendees, and was again the best attended of all the training workshops. This year, we expanded the workshop to 3 days, mostly to improve the pacing of the material; this seemed to work out well. It is interesting to note how many users new to Python were in attendance. It is also interesting to note that attendance was dominated by IT staff and oceanographers; there were not many meteorologists in attendance.
In conjunction with the workshop we also developed a Unidata Python Docker image. It contains a minimal conda distribution along with packages related to Unidata technology and Python.
Siphon
Siphon represents a rebranding of PyUDL, as we try to elevate our Python support in TDS to a higher status. We anticipate developing Siphon to ensure that it is as easy as possible to download data from a TDS in Python, keeping pace with new features added on the Java side.
Progress has been made on the following:
- Cleaned-up catalog parsing
- Complete implementation of CDM Remote protocol
- Implemented clients for speaking to TDS REST protocols: NCSS and Radar Query Service
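As a rough illustration of what catalog parsing and an NCSS request involve, the sketch below uses only the Python standard library. It is not Siphon's implementation or API: the sample catalog, the server address, the dataset paths, the variable name, and the helper names `list_datasets` and `ncss_point_url` are all made up for the example. The query parameter names (`var`, `latitude`, `longitude`, `time`, `accept`) follow the NCSS point-request convention.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespace used by THREDDS InvCatalog 1.0 documents.
CATALOG_NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"

# Hypothetical, trimmed-down catalog used only for this example.
SAMPLE_CATALOG = """\
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
         name="Example catalog">
  <dataset name="Example model output">
    <dataset name="Best Time Series" urlPath="grib/EXAMPLE/Best"/>
    <dataset name="Latest Run" urlPath="grib/EXAMPLE/Latest"/>
  </dataset>
</catalog>
"""


def list_datasets(catalog_xml):
    """Map dataset name -> urlPath for every dataset that has a urlPath."""
    root = ET.fromstring(catalog_xml)
    return {
        ds.get("name"): ds.get("urlPath")
        for ds in root.iter("{%s}dataset" % CATALOG_NS)
        if ds.get("urlPath") is not None  # container datasets have none
    }


def ncss_point_url(server, url_path, variable, lat, lon, time):
    """Build (but do not send) an NCSS point request for one variable."""
    query = urlencode({
        "var": variable,
        "latitude": lat,
        "longitude": lon,
        "time": time,
        "accept": "csv",
    })
    return "%s/ncss/%s?%s" % (server.rstrip("/"), url_path, query)
```

A caller would pick a `urlPath` from `list_datasets(...)` and pass it to `ncss_point_url(...)` together with the server base URL; the returned string can then be fetched with any HTTP client.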
JupyterHub
JupyterHub, part of Project Jupyter, is a multi-user Jupyter Notebook server with a highly pluggable design. In support of several cloud efforts (NOAA big data, server-side processing), Ryan May has developed a set of Docker images for running a Unidata JupyterHub instance on Amazon EC2. Authentication of users is managed using GitHub (against a simple whitelist of allowed users). Users are sandboxed from each other (and the master system) through Docker, which allows spawning individual containers on a per-user basis. Facilities provided through the Jupyter Notebook interface include uploading files (both notebooks and data), terminal access (for installing packages, including using git), and of course execution of Python 2 and 3 code (or potentially other kernels). The interface also works on tablets, which makes it a convenient way to do Python analysis from a tablet.
We would like to start extending the testing of this server outside Unidata, to see how this capability solves issues of working with large remote datasets, as well as providing managed Python environments.
Progress has been made on the following:
- Implementing GitHub OAuth setup
- Installing a valid SSL certificate, which is needed for the server to work correctly on iOS
- Learning how to run Docker containers from within Docker
- Spinning up machines on Amazon EC2 using Docker-machine
Dependencies, challenges, problems, and risks include:
- Management of user storage space is tricky, though Docker 1.9 includes actual volume support which should ease this
- This relies upon having the resources to run a sufficiently powerful machine in EC2 as a service to the community
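The GitHub authentication, user whitelist, and per-user Docker containers described above map onto a handful of JupyterHub settings. The fragment below is a minimal sketch of such a `jupyterhub_config.py`, assuming the `oauthenticator` and `dockerspawner` packages are installed; it is not the configuration actually deployed. The usernames, environment variable names, and certificate paths are placeholders, and more recent JupyterHub releases spell the whitelist option `allowed_users`.

```python
# jupyterhub_config.py (sketch; loaded by JupyterHub, not run on its own)
import os

c = get_config()  # `get_config` is injected by JupyterHub when it loads this file

# Authenticate users against GitHub via OAuth (oauthenticator package).
c.JupyterHub.authenticator_class = "oauthenticator.GitHubOAuthenticator"
c.GitHubOAuthenticator.oauth_callback_url = os.environ["OAUTH_CALLBACK_URL"]
c.GitHubOAuthenticator.client_id = os.environ["GITHUB_CLIENT_ID"]
c.GitHubOAuthenticator.client_secret = os.environ["GITHUB_CLIENT_SECRET"]

# Only these (hypothetical) GitHub usernames may log in.
c.Authenticator.whitelist = {"example-user-1", "example-user-2"}

# Start each user's notebook server in its own Docker container.
c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"

# Serve over HTTPS; both paths are placeholders.
c.JupyterHub.ssl_cert = "/path/to/fullchain.pem"
c.JupyterHub.ssl_key = "/path/to/privkey.pem"
```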
MetPy
After feedback from the last users’ committee meeting, a push was made to bring MetPy forward as a place for community collaboration on meteorology tools that fit within the rest of the scientific Python ecosystem (a.k.a. PyGempak). This project was announced for collaboration in late May with a blog post, and followed up with a presentation at the triennial workshop in June. Feedback has been quite positive, even beyond those who have participated on GitHub.
A presentation for the 2016 AMS Python symposium has been submitted, which will hopefully drive even further community interest in this project. Ideas for further development are outlined on GitHub.
Progress has been made on the following:
- Project infrastructure created, including automated testing, documentation builds, and conda package generation.
- Skew-T plots, NEXRAD data reading, and unit-aware Python calculations are all present and well-documented
- Community awareness and involvement progressing well given the early status of the project
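To make "unit-aware calculations" concrete, here is a small standard-library sketch of the idea: quantities carry their units, and a calculation converts or rejects its inputs instead of silently assuming a unit. This is not MetPy's API or implementation; the `Quantity` class and the helper names are hypothetical, and the formula is the standard Poisson equation for potential temperature with an approximate value of kappa.

```python
from dataclasses import dataclass

# Conversion factors from supported pressure units to pascals.
_PRESSURE_TO_PA = {"Pa": 1.0, "hPa": 100.0, "mbar": 100.0}

KAPPA = 0.2857  # R_d / c_p for dry air (approximate)
P0_PA = 100000.0  # reference pressure, 1000 hPa


@dataclass(frozen=True)
class Quantity:
    """A number tagged with the name of its unit."""
    value: float
    unit: str


def to_pascals(q):
    """Convert a pressure to Pa, rejecting non-pressure units."""
    if q.unit not in _PRESSURE_TO_PA:
        raise ValueError("not a supported pressure unit: %r" % q.unit)
    return q.value * _PRESSURE_TO_PA[q.unit]


def to_kelvin(q):
    """Convert a temperature to K, rejecting non-temperature units."""
    if q.unit == "K":
        return q.value
    if q.unit == "degC":
        return q.value + 273.15
    raise ValueError("not a supported temperature unit: %r" % q.unit)


def potential_temperature(pressure, temperature):
    """Poisson's equation: theta = T * (p0 / p) ** kappa."""
    p = to_pascals(pressure)
    t = to_kelvin(temperature)
    return Quantity(t * (P0_PA / p) ** KAPPA, "K")
```

With this sketch, `potential_temperature(Quantity(500.0, "hPa"), Quantity(250.0, "K"))` and the same call with the pressure given as `Quantity(50000.0, "Pa")` return the same result, while swapping the two arguments raises a `ValueError` rather than producing a wrong number.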
Dependencies, challenges, problems, and risks include:
- As a grassroots project (without dedicated staff time), it’s difficult to consistently keep the project moving forward
External Participation
The Python team attends conferences and participates in other projects within the scientific Python ecosystem. This allows us to stay informed and to advocate for our community, as well as keep our community updated on developments. Ryan May attended the 2015 SciPy conference in Austin; major takeaways:
- xray is a new package from Stephan Hoyer of the Climate Corporation; it serves as a layer on top of netCDF data to provide simple query capabilities (think pandas for multidimensional arrays). There was a lot of excitement from the SciPy community regarding this package.
- Project Jupyter announced another major round of funding, which ensures that the Jupyter Notebook will continue its steady development
- There was a distinct lack of meteorologists at the conference, in comparison with the size of the oceanography community present. We would encourage anyone with interest in Python to consider attending the conference; it’s a fun and really informative week.
Ryan May has also continued to be an active participant in the matplotlib community, reviewing some pull requests and contributing several others. We also continue to host Jeff Whitaker’s netCDF4-python project repository; Jeff continues to be the active maintainer of the project.
Progress has been made on the following:
- Fixed unit support in matplotlib; this facilitates improved unit support for MetPy’s Skew-T plotting
- Contributed support in matplotlib for embedding animations (as HTML5 video) within the Jupyter Notebook
- Contributed a pull request to merge in the community-developed JSAnimation package, which embeds animations in the notebook as JavaScript animations
Dependencies, challenges, problems, and risks include:
- Because little staff time is dedicated to these activities, we cannot guarantee that we will keep up with them
Ongoing Activities
We plan to continue the following activities:
- “Python with Unidata Technologies” training workshop
- Maintaining Siphon as an official Python API for working with TDS
- Continued participation in the scientific Python community
- Relevant matplotlib support and fixes
- Working with JupyterHub as a way to facilitate data-proximate analysis
- Growing and developing MetPy as a community resource for Python in meteorology
New Activities
Over the next three months, we plan to organize or take part in the following:
- Using supplemental funds from NSF, develop asynchronous training materials for Python in meteorology. We are investigating the use of a cloud server hosting executable Jupyter Notebooks (based on our training workshop) as the core of the training materials, using either the tmpnb or jupyterhub packages from Project Jupyter.
Beyond a one-year timeframe, we plan to organize or take part in the following:
- Evaluate the possibility of extending siphon functionality to interface with the AWIPS-II EDEX server
Areas for Committee Feedback
We are requesting your feedback on the following topics:
- What are the biggest obstacles that you see to the use of Python with other Unidata technologies, or for use in meteorology in general?
- How valuable do you find an effort like MetPy to the Python meteorology community? Are there additional barriers we could remove through this project? Are there other efforts over which this should take priority?
Relevant Metrics
Siphon (since April):
- 94% test coverage
- 544 downloads/month from the Python package index (no metrics for anaconda.org)
- 3 externally contributed issues, 1 external pull request
MetPy (since April):
- 95% test coverage
- 847 downloads/month from the Python package index (no metrics for anaconda.org)
- 5 externally contributed issues, 3 external pull requests
Prepared September 2015