1 of 22

“dependency heaviness” of R packages

Zuguang Gu / @jokergoo_gu

German Cancer Research Center (DKFZ)

Heidelberg, Germany

https://github.com/jokergoo/pkgndep

2 of 22

What is a package dependency?

3 of 22

Risks of a large number of dependencies for a package P

1. Users have to install many additional packages when installing P, so an installation failure in any upstream package stops the installation of P;

2. Loading P brings a large number of packages into the R session, which makes it harder to reproduce an identical working environment on other computers;

3. The dependencies of P propagate to all of its child packages;

4. On continuous integration platforms such as GitHub Actions or Travis CI, automatic validation of P can easily fail because of failures in its upstream packages.

4 of 22

Some repositories also report package dependencies

5 of 22

Question

Package P inherits all of its dependencies from its parent packages. To reduce the complexity of P’s dependencies, we therefore need to answer the key question: “which parents contribute the most dependencies to their child package?”, or in other words, “which are the heaviest parents?”

We designed a new metric, “dependency heaviness”, to measure this effect quantitatively.

6 of 22

Dependency categories of P

Strong parent packages: the packages listed in “Depends”, “Imports” and “LinkingTo”. They must be installed when installing P.

Weak parent packages: the packages listed in “Suggests” and “Enhances”. They are optional when installing P.

Strong dependency packages: all packages found by recursively following strong parent relations upstream from P. They must be installed when installing P.
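The recursive definition of strong dependencies can be illustrated on a toy example. This is a minimal sketch in Python, not the pkgndep implementation: the package names, the `STRONG_PARENTS` table and the function `strong_dependencies()` are all hypothetical and stand in for the “Depends”/“Imports”/“LinkingTo” fields of real packages.

```python
# Toy data (hypothetical): package -> its strong parents,
# i.e. what would be listed in Depends/Imports/LinkingTo.
STRONG_PARENTS = {
    "P": ["A", "B"],
    "A": ["C", "D"],
    "B": ["D"],
    "C": [],
    "D": ["E"],
    "E": [],
}


def strong_dependencies(pkg, strong_parents):
    """Return every package reachable from pkg by following strong-parent edges."""
    seen = set()
    stack = list(strong_parents.get(pkg, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited through another parent
        seen.add(current)
        stack.extend(strong_parents.get(current, []))
    return seen


print(sorted(strong_dependencies("P", STRONG_PARENTS)))  # ['A', 'B', 'C', 'D', 'E']
```

Here P has only two strong parents, but five packages must be installed, because the parents bring their own strong parents with them.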

7 of 22

Heaviness of a strong parent A on P

n1 is the number of strong dependencies of P;

n2 is the number of strong dependencies of P after changing A from a strong parent to a weak parent, i.e., moving A to “Suggests” of P;

The heaviness of A on P is denoted as h = n1 - n2.

8 of 22

From the perspective of the dependency network

The network only contains strong dependency relations.

n1: number of upstream nodes of P,

n2: number of upstream nodes of P after removing the edge A -> P.

So the heaviness measures the number of unique dependencies that A brings in and that no other parent package brings in.
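The network view of h = n1 - n2 can be worked through on a small graph. The following is a sketch under stated assumptions, not pkgndep’s code: the network, the package names and the helpers `upstream_nodes()` and `strong_parent_heaviness()` are hypothetical.

```python
# Toy strong-dependency network (hypothetical): package -> strong parents.
STRONG_PARENTS = {
    "P": ["A", "B"],
    "A": ["C", "D"],
    "B": ["D"],
    "C": [],
    "D": ["E"],
    "E": [],
}


def upstream_nodes(pkg, parents):
    """All nodes reachable upstream of pkg via strong-parent edges."""
    seen, stack = set(), list(parents.get(pkg, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def strong_parent_heaviness(pkg, parent, parents):
    """h = n1 - n2, where n2 is counted after removing the edge parent -> pkg."""
    n1 = len(upstream_nodes(pkg, parents))
    reduced = dict(parents)
    reduced[pkg] = [p for p in parents[pkg] if p != parent]
    n2 = len(upstream_nodes(pkg, reduced))
    return n1 - n2


print(strong_parent_heaviness("P", "A", STRONG_PARENTS))  # 2 (A and C)
print(strong_parent_heaviness("P", "B", STRONG_PARENTS))  # 1 (only B itself)
```

D and E are reachable through both A and B, so they count toward neither parent’s heaviness; only the dependencies unique to one parent do.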

9 of 22

Heaviness of a weak parent B on P

n1 is the number of strong dependencies of P;

n2 is the number of strong dependencies of P after changing B from a weak parent to a strong parent, i.e., moving B to “Imports” of P.

The heaviness of B on P is denoted as h = n2 - n1.

10 of 22

From the perspective of the dependency network

The network only contains strong dependency relations.

n1: number of upstream nodes of P,

n2: number of upstream nodes of P after adding the edge B -> P.
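For a weak parent the edge is added rather than removed, which the following sketch illustrates. As before, this is an assumption-laden toy example rather than pkgndep’s implementation: the network, the package names and the helpers `upstream_nodes()` and `weak_parent_heaviness()` are hypothetical.

```python
# Toy strong-dependency network (hypothetical): package -> strong parents.
# B is only a weak parent of P, so there is no edge B -> P here.
STRONG_PARENTS = {
    "P": ["A"],
    "A": ["C"],
    "B": ["C", "F"],
    "C": [],
    "F": [],
}


def upstream_nodes(pkg, parents):
    """All nodes reachable upstream of pkg via strong-parent edges."""
    seen, stack = set(), list(parents.get(pkg, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def weak_parent_heaviness(pkg, weak_parent, parents):
    """h = n2 - n1, where n2 is counted after adding the edge weak_parent -> pkg."""
    n1 = len(upstream_nodes(pkg, parents))
    extended = dict(parents)
    extended[pkg] = list(parents.get(pkg, [])) + [weak_parent]
    n2 = len(upstream_nodes(pkg, extended))
    return n2 - n1


print(weak_parent_heaviness("P", "B", STRONG_PARENTS))  # 2 (B and F)
```

C is already a strong dependency of P through A, so promoting B would add only B itself and F.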

11 of 22

Dependency heatmap: visualizing the dependencies

12 of 22

Usage

13 of 22

Usefulness of the dependency heaviness analysis

Provides hints for reducing the complexity of strong package dependencies.

Heavy strong parents -> weak parents

Of course, how to optimize the dependencies depends on how each parent package is actually used in the package.

Here we give three general examples of how to reduce dependency complexity based on heaviness analysis.

14 of 22

Example 1. the mapStats package

Hmisc is an extremely heavy parent, with a heaviness of 49.

15 of 22

Example 1. the mapStats package

The 49 additional dependencies imported from Hmisc can be avoided if the developer simply reimplements the function capitalize() themselves.

16 of 22

Example 2. the ComplexHeatmap package

17 of 22

Example 2. the ComplexHeatmap package

We moved some heavy parents, which only provide enhanced functionalities for ComplexHeatmap but are not expected to be frequently used, to “Suggests”, such as

  1. the package dendextend (contributing a heaviness of 32) which is only used for coloring dendrogram branches;
  2. the package gridtext (contributing a heaviness of 14) which is only for customizing text formats in heatmaps.

These two packages are only required when users actually use the corresponding functionality.

18 of 22

Example 3. the cola package

A typical example of a bioinformatics R package that integrates many analyses from many other packages.

19 of 22

Example 3. the cola package

cola performs consensus clustering as its core analysis, which is expected to be used very frequently.

It also provides other downstream/secondary analyses, such as functional enrichment analysis, which are expected to be used less often.

Thus, some heavy packages needed only for secondary analyses were moved to “Suggests”, for example:

  • package clusterProfiler, which is for functional enrichment analysis, contributes a heaviness of 91;
  • package ReactomePA, which provides Reactome pathways for enrichment analysis, contributes a heaviness of 94.

20 of 22

Other ways to reduce dependency complexity

  1. Directly copy code from heavy parents. This approach is of course NOT recommended from a software engineering perspective, but it is in fact widely used in CRAN packages. (https://doi.org/10.1109/IWSC.2015.7069885)
  2. Try to split a large package into several smaller packages that each focus on a more specific task.

To reduce dependencies from weak parents

  1. Make the corresponding code non-runnable and move the weak parents to “Enhances”.
  2. More complex vignettes can be hosted separately instead of being shipped with the package.

21 of 22

Future plans

We are planning to perform a systematic analysis of the CRAN/Bioconductor package ecosystems:

  1. How does dependency heaviness spread from parents to child packages, and what are the heaviest parent packages?
  2. How does dependency heaviness spread from long-range upstream to downstream packages, and what are the “core paths” that spread it?

Partially done with the function pkgndep:::dependency_website().

22 of 22

Thank you! Questions?