“dependency heaviness” of R packages
Zuguang Gu / @jokergoo_gu
German Cancer Research Center (DKFZ)
Heidelberg, Germany
https://github.com/jokergoo/pkgndep
What is a package dependency?
Risks of a large number of dependencies for a package P
1. Users have to install many additional packages when installing P, and an installation failure of any upstream package stops the installation of P;
2. The number of packages loaded into the R session after loading P will be huge, which makes it harder to reproduce an identical working environment on other computers;
3. The dependencies of P propagate to all of its child packages;
4. On continuous integration platforms such as GitHub Actions or Travis CI, automatic validation of P can easily fail because of failures in its upstream packages.
Some repositories also report package dependencies
Question
Package P inherits all of its dependencies from its parent packages. Thus, to reduce the complexity of P’s dependencies, we need to answer the key question: “Which parents contribute the most dependencies to their child package?”, or in other words, “Which are the heaviest parents?”
We designed a new metric, “dependency heaviness”, to measure this effect quantitatively.
Dependency categories of P
Strong parent packages: the packages listed in “Depends”, “Imports” and “LinkingTo”. They must be installed when installing P.
Weak parent packages: the packages listed in “Suggests” and “Enhances”. Installing them is optional when installing P.
Strong dependency packages: all packages found by recursively looking up the strong parent packages upstream of P. They must be installed when installing P.
Heaviness of a strong parent A on P
n1 is the number of strong dependencies of P;
n2 is the number of strong dependencies of P after changing A from a strong parent to a weak parent, i.e., moving A to “Suggests” of P;
The heaviness of A on P is defined as h = n1 - n2.
From the aspect of dependency network
The network contains only strong dependency relations.
n1: number of upstream nodes of P;
n2: number of upstream nodes of P after removing the edge A -> P.
So the heaviness measures the number of additional dependencies that are brought in by A alone and not by any other parent package.
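The network view above can be made concrete with a small toy model. The following is only a sketch in Python, not the pkgndep implementation: the graph `toy`, the package names, and the helper functions `upstream()` and `strong_parent_heaviness()` are all hypothetical and exist purely for illustration.

```python
def upstream(graph, pkg):
    """Return all packages reachable from pkg via strong-parent edges."""
    seen = set()
    stack = list(graph.get(pkg, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    seen.discard(pkg)  # a package is not its own dependency
    return seen


def strong_parent_heaviness(graph, pkg, parent):
    """h = n1 - n2, where n2 is counted after removing the edge parent -> pkg."""
    n1 = len(upstream(graph, pkg))
    reduced = dict(graph)
    reduced[pkg] = [p for p in graph[pkg] if p != parent]
    n2 = len(upstream(reduced, pkg))
    return n1 - n2


# Hypothetical toy network: each key maps to its strong parents.
toy = {
    "P": ["A", "B"],
    "A": ["C", "D"],
    "B": ["D"],
    "C": [],
    "D": [],
}

print(strong_parent_heaviness(toy, "P", "A"))  # 2: A itself and C
print(strong_parent_heaviness(toy, "P", "B"))  # 1: only B itself, D also comes via A
```

In this toy network D is shared by A and B, so it does not count toward the heaviness of either parent.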
Heaviness of a weak parent B on P
n1 is the number of strong dependencies of P;
n2 is the number of strong dependencies of P after changing B from a weak parent to a strong parent, i.e., moving B to “Imports” of P.
The heaviness of B on P is defined as h = n2 - n1.
From the aspect of dependency network
The network contains only strong dependency relations.
n1: number of upstream nodes of P;
n2: number of upstream nodes of P after adding the edge B -> P.
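The weak-parent case can be sketched the same way. Again this is a toy model in Python under stated assumptions, not the pkgndep implementation: the graph `toy_weak`, the package names, and the helpers `upstream()` and `weak_parent_heaviness()` are hypothetical.

```python
def upstream(graph, pkg):
    """Return all packages reachable from pkg via strong-parent edges."""
    seen = set()
    stack = list(graph.get(pkg, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    seen.discard(pkg)  # a package is not its own dependency
    return seen


def weak_parent_heaviness(graph, pkg, weak_parent):
    """h = n2 - n1, where n2 is counted after adding the edge weak_parent -> pkg."""
    n1 = len(upstream(graph, pkg))
    extended = dict(graph)
    extended[pkg] = list(graph.get(pkg, ())) + [weak_parent]
    n2 = len(upstream(extended, pkg))
    return n2 - n1


# Hypothetical toy network of strong relations only.
# B and W are weak parents of P, so they are not listed under "P".
toy_weak = {
    "P": ["A"],
    "A": ["C"],
    "B": ["C", "E"],
    "W": ["C"],
    "C": [],
    "E": [],
}

print(weak_parent_heaviness(toy_weak, "P", "B"))  # 2: B itself and E
print(weak_parent_heaviness(toy_weak, "P", "W"))  # 1: only W itself, C is already upstream
```

A weak parent whose own dependencies are already upstream of P adds little when promoted to a strong parent, which is why W is lighter than B here.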
Dependency heatmap: visualizing the dependencies
Usage
Usefulness of the dependency heaviness analysis
Provides hints for reducing the complexity of strong package dependencies.
Heavy strong parents -> weak parents
Of course, how to optimize the dependencies depends on how the parent packages are actually used in the package in question.
But here we can give three general examples of how to reduce the dependency complexity based on heaviness analysis.
Example 1. the mapStats package
Hmisc is an extremely heavy parent, with a heaviness of 49.
The 49 additional dependencies imported from Hmisc can be avoided simply by reimplementing the function capitalize() in the developer’s own code.
Example 2. the ComplexHeatmap package
We moved some heavy parents, which only provide enhanced functionalities for ComplexHeatmap but are not expected to be frequently used, to “Suggests”, such as
These two packages are only required when the corresponding functionalities are used by users.
Example 3. the cola package
A typical example of a bioinformatics R package that integrates many analyses from many other packages.
cola performs consensus clustering as its core analysis, which users are expected to run very frequently.
It also provides downstream/secondary analyses, such as functional enrichment analysis, which are expected to be used less often.
Thus, some heavy packages needed only for secondary analyses were moved to “Suggests”. For example,
Other ways to reduce dependency complexity
To reduce dependency from weak parents
Future plans
We are planning to perform a systematic analysis of the CRAN/Bioconductor package ecosystems:
Partially done with the function pkgndep:::dependency_website().
Thank you and questions!