How Important is Exploration and Prioritization?
Written: May 2022. Published: Apr 2025.
In this document I demonstrate that under many circumstances, one should spend lots of effort (and likely more than commonly assumed[1]) on exploration (i.e. looking for new projects) and prioritization (i.e. comparing known projects), rather than exploitation (i.e. directly working on a project). Furthermore, I provide heuristics that help one decide how much effort to spend on exploration and prioritization.
I assume that the goal is to maximize expected altruistic impact, where by “expected” I’m referring to the mathematical expectation. The actor here can be an individual or a community, and a project can be anything ranging from a 1-week personal project to an entire cause area.
Section 1 will deal with exploration, section 2 will deal with prioritization, and section 3 will test and apply the theory in the real world. Mathematical details can be found in the appendix.
Key Findings
This is an exploratory study, so the results are generally of low confidence, and everything below should be seen as hypotheses. The main goal is to suggest paths for future research, rather than to provide definitive answers.
Theoretical Results
- Claim 3 (Importance of exploration): We, as a community and as individuals, should spend more than half of our effort on searching for potential new projects, rather than working on known projects.
- Confidence: 45% / 20% [2]
- Claim 6a (Importance of prioritization, uncorrelated case): When faced with $N$ ($N$ isn’t too small) equally-good longtermist projects that aren’t correlated, we should act as if the $N$ tasks of evaluating every project are each as important as working on the best project that we finally identify, and allocate our effort evenly across the $N+1$ tasks, as long as we are able to identify the ex-post best project[3] by working on the $N$ evaluation tasks.
- Claim 6b (Importance of prioritization, correlated case): When the $N$ projects are strongly correlated, prioritization becomes much less important than in the no-correlation case. However, one should still be willing to spend only a small portion of one’s effort on direct work (note that this portion is already much higher than the roughly $\frac{1}{N}$ in Claim 6a), and to spend the remaining portion on prioritization.
Practical Results
- Claim 7 (Applicability in the real world): In the real world, the most important factors determining the applicability of our model are the difficulty of reducing uncertainty[4] via exploration and prioritization (E&P), the comparative scalability of E&P vs exploitation, the heavy-tailedness of opportunities, the existence of a fixed total budget, and a utility-maximization objective.
- Confidence: Moderate (hard to quantify)
- Claim 8 (Applying the framework to different parts of EA): In EA, the top 3 areas where E&P deserves the largest portions of resources (relative to the total resources allocated to that area) are
- #1: identifying promising individuals[5] (relative to the budget of talent cultivation),
- #2: cause prioritization (relative to the budget of all EA research and direct work),
- #3: within-cause prioritization (relative to the budget of that cause).
- Confidence: Low (hard to quantify)
Instrumental Results
Note that the following claims are results (based on mostly qualitative arguments) rather than assumptions.
- From section 1.2: The distribution of TOC impact (impact stemming from a project’s TOC (theory of change), as opposed to e.g. flow-through effects) across different projects is strictly more heavy-tailed than log-normal.
- Confidence: 70% / 55% [6]
- Claim 2 (stronger version of previous claim): The distribution of TOC impact across different projects resembles the Pareto distribution in terms of heavy-tailedness.
- Claim 4: For any particular project, the distribution of its TOC impact across all potential scenarios is (roughly) at least as heavy-tailed as the Pareto distribution.[7]
Important Limitations
- Negative impact is ignored.
- Pascal’s wager emerges in the model of prioritization, but we don’t have a very good way to handle this. (relevant)
- The “log-normal world” is ignored, and the focus is primarily on the “power law world”. This exaggerates the importance of E&P (exploration and prioritization).
- “Log-normal world” is the possibility that the distribution of impact is log-normal, and “power law world” is the possibility that the distribution obeys a power law (cf. claim 2, claim 4).
- I think the power law world is somewhat more likely to be the realistic model for our purpose than the log-normal world is. See section 1.1, 1.2 and 2.1 on this.
- Non-TOC impact of projects (e.g. flow-through effects) is ignored.
- This is justifiable in some cases but not in all. (see the last parts of section 1.1)
- Individual projects are assumed to consist purely of direct work (rather than a mixture of E&P and direct work, which is often the case in reality), and all parts of an individual project are homogeneous. This is especially problematic when we define projects to be larger in scope, e.g. when projects are entire cause areas.[8] Moreover, it’s sometimes hard to distinguish direct work and E&P, e.g. many kinds of direct work also provide insights on prioritization.
- Despite this, the findings in this document still tell us to optimize more for information value (E&P) when a project produces both information value and direct impact.
- Differences in scalability (diminishing returns) are mostly ignored.
- Externalities (and, more generally, all social interactions) are ignored.
- For example, if your own E&P also provides information value to others, then your E&P should be more important than what’s suggested in this document. Also it will be especially important for you to make your attempts & conclusions publicly known (e.g. writing about my job). On the other hand, you also benefit from other people’s E&P, which reduces the importance of doing E&P yourself.
1 Exploration: Discovering New Projects
1.1 Impact of All Potential Projects is Heavy-Tailed
First we need to figure out the impact distribution of all potential projects. As an example, we can take a look at the cost-effectiveness distribution of health interventions: (source)

[Figure: distribution of cost-effectiveness across health interventions.]
This is a heavy-tailed distribution, which means roughly that a non-negligible portion of all interventions manages to reach a cost-effectiveness that is orders of magnitude higher than the median one. As another piece of evidence, the success of for-profit startups also follows a heavy-tailed distribution (the for-profit startup space is arguably analogous to the philanthropy space).
This heavy-tail phenomenon occurs in the longtermist space as well. For example, the classical argument for longtermism holds that reducing existential risk has an EV orders of magnitude higher than most other altruistic interventions. Carl Shulman’s comments on this GiveWell blog post also fleshed out a case for a heavy-tailed distribution of intervention effectiveness when taking longtermist interventions into account.
On the other hand, Brian Tomasik has argued that charities usually don't differ astronomically in cost-effectiveness, based on several considerations including, most notably, flow-through effects. However:
- The ~10x or ~100x difference in effectiveness estimated in Brian’s essay is enough to allow for a heavy-tailed (in an informal sense) distribution, regardless of whether this 100x upper bound is really reasonable.
- The previous arguments all focus on “TOC impact” (impact stemming from the project’s intended theory of change, or other pathways highly similar to the TOC), while Brian’s essay takes “non-TOC impact” (which, arguably, consists mostly of flow-through effects) into account.
- When can we be justified in ignoring non-TOC impact?
- i. When non-TOC impact doesn’t matter to our specific purpose of investigation, or
- ii. When we’re (almost) completely clueless about the sign and size of non-TOC impact (even after investigating it), or
- iii. When we only care about TOC impact and not the spooky non-TOC impact, or
- iv. When non-TOC impact is strongly and positively correlated with TOC impact, or
- v. When non-TOC impact isn’t large (compared to TOC impact) for top projects. More concretely, we need the following statements to hold:
- For any project X, among all its pathways of impact, one pathway (or a cluster of similar pathways), which we call the main pathway, accounts for >50% of impact, and
- When X’s TOC isn’t its main pathway, we can always find another project Y whose TOC is exactly X’s main pathway, and whose impact surpasses X’s.
- I think ii. (a factual claim) is <50% likely to be true, and iii. (a normative claim) is <50% likely to be justifiable.
- I’m uncertain regarding iv. and v., leaning in neither direction.
- I think i. does apply (to some extent) to the investigations in this document, for reasons I’ll explain later.
- For this reason, and also for simplicity, in the rest of this document I will mostly focus on TOC impact. This should be seen as a compromise, and I look forward to seeing future research that handles non-TOC impact well.
From now on I’ll use “total impact” to refer to the sum of TOC and non-TOC impact.
To sum up,
Claim 1: TOC impact of different projects follow a heavy-tailed distribution.
- Confidence: ≥80% / ≥80% [9]
1.2 TOC Impact of All Potential Projects Follows a Power Law
When talking about heavy-tailed distributions, two distributions come to mind: the log-normal distribution, and the Pareto distribution (aka power law). These two distributions are perhaps the two most common heavy-tailed distributions in the real world, with the latter being much more heavy-tailed than the former.
For the sake of convenience, we shall adopt one of these two distributions as an approximation of the TOC impact distribution. Here I argue that Pareto distribution (power law) is a better approximation than log-normal distribution. Note that these two distributions are notoriously hard to distinguish empirically, which is part of the reason why I have relatively low confidence in this conclusion.
A commonly seen argument for the log-normalness of TOC impact is that:
- When estimating TOC impact, we usually multiply many factors together to get a cost-effectiveness value (with the unit being DALY/$ or something like that).
- By the central limit theorem (applied to the logarithm of the product), multiplying many independent random factors together yields an approximately log-normal product.
- Therefore, cost-effectiveness values of different projects follow a log-normal distribution.
This argument has a big flaw. The central limit theorem assumes that there is a fixed number of random factors, which is usually not the case for EA prioritization. For example, it’s hard to believe that evaluating global poverty interventions involves the same set of factors as evaluating X-risk reduction interventions. Indeed, it’s exactly the factors related to the long-term impact of human civilization (e.g. the size of the galaxy), which are absent in the evaluation of global poverty interventions, that make X-risk reduction look vastly more appealing to longtermists. What’s more, the independence assumption in step 2 seems dubious too.
After accounting for the difference in the sets of factors involved, and the (likely positive) correlation between factors, we should expect to get a distribution more skewed than a log-normal. Therefore, we can conclude that the TOC impact distribution is likely more heavy-tailed than the log-normal distribution. (Note that this does not imply that the TOC impact distribution is closer to Pareto than to log-normal.)
Next, let’s examine how heavy-tailed the two distributions exactly are.
For the Pareto distribution:
- The tail distribution is $\Pr(X \ge x) \propto x^{-\alpha}$ (it means that roughly an $x^{-\alpha}$ fraction of all projects have an impact of at least $x$).
- ($\alpha$ can be understood as the “heavy-tailedness” parameter. A smaller $\alpha$ means a heavier-tailed distribution.)
- which implies that it’s always $2^{\alpha}$ (a constant value) times harder to find a project with impact $\ge 2x$ than to find a project with impact $\ge x$, regardless of the exact value of $x$.
For the log-normal distribution:
- The tail distribution of the log-normal is roughly and asymptotically $\Pr(X \ge x) \approx e^{-(\ln x)^2/(2\sigma^2)}$,
- (it means that roughly an $e^{-(\ln x)^2/(2\sigma^2)}$ fraction of all projects have an impact of at least $x$)
- which implies that it’s roughly $x^{c}$ (where $c$ is a constant) times harder to find a project with impact $\ge 2x$ than to find a project with impact $\ge x$.
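As a quick numerical illustration of the contrast (a minimal sketch; the parameter choices $\alpha = 1.16$ and $\sigma = 2$ below are mine, purely for illustration):

```python
import numpy as np
from scipy.stats import pareto, lognorm

# How many times harder is it to find a project with impact >= 2x than >= x?
# For a Pareto tail this ratio is the constant 2**alpha; for a log-normal tail
# it keeps growing, roughly like a power of x.
alpha, sigma = 1.16, 2.0
for x in [10, 100, 1_000, 10_000]:
    r_pareto = pareto.sf(x, alpha) / pareto.sf(2 * x, alpha)
    r_lognormal = lognorm.sf(x, sigma) / lognorm.sf(2 * x, sigma)
    print(f"x = {x:>6}: Pareto ratio {r_pareto:6.2f}, log-normal ratio {r_lognormal:6.2f}")
```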
I find that the implication of the Pareto distribution (a constant ratio between “impact $\ge 2x$” projects and “impact $\ge x$” projects) fits my impression of EA’s search for global priorities. On the other hand, the implication of the log-normal distribution (a ratio that increases rapidly with $x$) is rather unintuitive.
The “constant ratio” implication of Pareto distribution also fits well with the exponential progress of technology[10], and searching for high-efficiency designs in the design space is arguably somewhat analogous to searching for high-impact projects in the altruistic project space. One may worry that the altruistic project space is much more complex than the design space of, say, integrated circuits. However, more relevant than the progress of any single piece of technology, may be the continued spring of new ideas and inventions in all sectors of society, which sustains the exponential growth of the economy. If we see society as a big piece of “technology”, then its design space seems at least as complex as the altruistic landscape, and yet we managed to improve this “technology” at an exponential rate. A notable counterargument to this line of reasoning might be Holden Karnofsky’s post, which claims that “the effective altruism community’s top causes and ‘crucial considerations’ seem to have (exclusively?) been identified very early”, and “there are some reasons to think that future ‘revolutionary crucial considerations’ will be much harder to find, if they exist.”
Carl Shulman, in his comments on this GiveWell blog post, also argued that the log-normal distribution is too thin-tailed, which echoes my point here. Note that his argument applies only to TOC impact and not to total impact.
Conclusion: I consider the Pareto distribution to better reflect the degree of heavy-tailedness that the TOC impact distribution displays. Note that I don’t consider the arguments above to be very strong, so I have rather low confidence in this conclusion. I do think the distribution is more likely to be (close to) Pareto than to be (close to) log-normal, but the space of all possible heavy-tailed distributions is so vast that it’s hard to have too much confidence in any single family of distributions.
To sum up,
Claim 2: The distribution of TOC impact across different projects resembles the Pareto distribution in terms of heavy-tailedness.
I expect the distribution of total impact to be lighter-tailed than that of TOC impact, because:
- Adding non-TOC impact introduces additive terms, as opposed to only multiplicative ones.
- “Astronomical stakes”-style arguments (e.g. the one supporting X-risk reduction) apply to most, if not all, interventions, as long as we take into account non-TOC impact.
- More generally, if we take into account non-TOC impact, any argument of the form "consider type of consequence X, which is larger than consequences you had previously considered, as it applies to option A" calls for application of X to analyzing other options (quoted from here).
- Other arguments by Brian Tomasik (which apply to total impact and not to TOC impact).
Overall I think the distribution of total impact is more likely to be close to (or lighter-tailed than) log-normal than to be close to Pareto, in terms of heavy-tailedness. Note that I haven’t spent very much time thinking about this. Also I’m unsure whether log-normal distributions or other, lighter-tailed distributions are more likely.
1.3 Lots of Effort Should Go Into Exploration
Assuming that TOC impact follows a Pareto distribution, let’s first determine the shape parameter $\alpha$. While finding the exact value of $\alpha$ is hard, we may instead try to find an upper bound.
According to this report by the Center for Global Development, the cost-effectiveness of global health interventions displays a phenomenon where “if we funded all of these interventions equally, 80 percent of the benefits would be produced by the top 20 percent of the interventions”. This implies a shape parameter of $\alpha \approx 1.16$.
After taking into account longtermist interventions, I expect the TOC impact distribution to be significantly more skewed than when only considering global health, because longtermist TOCs explicitly take long-term effects into account, where there’s more uncertainty (thus higher variance) involved compared to the near term; and because longtermist interventions often focus more on research, politics, etc., which are apparently more heavy-tailed than distributing bednets. This leads to a significantly smaller $\alpha$. Nevertheless, $\alpha \approx 1.16$ can still serve as an upper bound.
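For reference, here is one way to back out that figure, using a standard property of the Pareto distribution: for shape $\alpha > 1$, the top $q$ fraction of interventions accounts for a $q^{1-1/\alpha}$ share of the total benefit, so the 80/20 observation gives

$$0.2^{\,1-1/\alpha} = 0.8 \;\Longrightarrow\; 1-\frac{1}{\alpha} = \frac{\ln 0.8}{\ln 0.2} \;\Longrightarrow\; \alpha = \frac{\ln 5}{\ln 4} \approx 1.16.$$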
Next, let’s try to derive the optimal amount of effort that should go into the search for new projects.
I model the exploration-exploitation process as follows:
- We have 1 unit of time in total. In the first $t$ units of time ($0 \le t \le 1$), we repeatedly draw random samples from the TOC impact distribution.
- (The number of samples we draw is proportional to the duration $t$; the actual number doesn’t matter.)
- In the remaining $1-t$ units of time, we work on the one project drawn in step 1 that has the highest impact. Our eventual impact is the product of the time spent and the project’s quality.
The expected final impact is approximately $t^{1/\alpha} \cdot (1-t)$ units, where $t^{1/\alpha}$ denotes (approximately) the quality of the project we find, and $1-t$ denotes the time we spend working on it. Solving for the maximum, the best $t$ turns out to be $t^{*} = \frac{1}{1+\alpha}$, which equals 46.3% when $\alpha = 1.16$. In other words, we should spend 46.3% of all effort on the exploration phase (step 1).
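A one-line check of that optimum, differentiating the approximate objective:

$$\frac{d}{dt}\Big[t^{1/\alpha}(1-t)\Big] = t^{1/\alpha - 1}\Big(\frac{1-t}{\alpha} - t\Big) = 0 \;\Longrightarrow\; t^{*} = \frac{1}{1+\alpha} = \frac{1}{2.16} \approx 46.3\% \text{ at } \alpha = 1.16.$$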
Then let’s try to account for diminishing returns. I model the process as follows:
- We have 1 unit of time in total. In the first $t$ units of time ($0 \le t \le 1$), we repeatedly draw random samples from the TOC impact distribution.
- In the remaining $1-t$ units of time, we work on the top $k$ projects drawn in step 1, optimally choosing $k$ and allocating the time in order to maximize total utility. Here I use an isoelastic utility function (with risk-aversion parameter $\eta$), shifted to the left by 1 unit to keep it non-negative.
- The isoelastic utility function is adopted to model diminishing returns. It has the property that the “degree of bending” at every point on its curve stays constant, which fits intuitively with how diminishing returns should work. This “degree of bending” is specified by the parameter $\eta$; the higher $\eta$ is, the stronger the diminishing returns. Below are plots of the isoelastic utility function for two values of $\eta$, both shifted to the left by 1 unit.

[Plots of the shifted isoelastic utility function omitted.]

A numerical optimization (with some approximation) suggests that $t^{*} \approx 65.6\%$ (assuming $\alpha = 1.16$), meaning we should spend 65.6% of all effort on the exploration phase.
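For concreteness, here is a minimal Monte Carlo sketch of how such an optimization could be set up. The specific choices below (the value of $\eta$, the number of draws per unit of exploration time, and treating total utility as the sum of each chosen project’s utility) are my own assumptions for illustration, not necessarily those behind the 65.6% figure.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
alpha, eta = 1.16, 0.5            # eta = 0.5 is an illustrative choice
draws_per_unit_time, top_m, trials = 1_000, 20, 100

def u(x):
    # Isoelastic utility, shifted left by 1 unit so that u(0) = 0.
    return ((x + 1.0) ** (1.0 - eta) - 1.0) / (1.0 - eta)

def exploit(impacts, time_left):
    # Split `time_left` across the best drawn projects to maximize total utility.
    # Giving a project zero time is the same as not choosing it, so this also
    # implicitly optimizes over k (how many projects to work on).
    top = np.sort(impacts)[::-1][:top_m]
    res = minimize(lambda x: -u(x * top).sum(),
                   np.full(len(top), time_left / len(top)),
                   bounds=[(0.0, time_left)] * len(top),
                   constraints={"type": "eq", "fun": lambda x: x.sum() - time_left})
    return -res.fun

def expected_utility(t):
    n = max(int(draws_per_unit_time * t), 1)
    samples = (1 - rng.random((trials, n))) ** (-1 / alpha)   # Pareto(alpha) draws
    return np.mean([exploit(row, 1 - t) for row in samples])

ts = np.linspace(0.1, 0.9, 17)
print("best exploration share t:", max(ts, key=expected_utility))
```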
We can also choose a smaller $\eta$ to get weaker diminishing returns ($\eta = 0$ means no diminishing returns), or a larger $\eta$ to get stronger diminishing returns. Below is a plot of the optimal $t^{*}$ as a function of $\eta$, assuming $\alpha = 1.16$.

[Plot of $t^{*}$ over $\eta$ omitted.]

Recall that $\alpha \approx 1.16$ is only an upper bound on $\alpha$, but decreasing $\alpha$ further turns out not to change $t^{*}$ significantly, so we can take the 65.6% result as an approximate best guess.
Finally: why are we using TOC impact when doing the modeling, rather than total impact?
- Why use TOC impact in exploration phase: Because, when we generate new project ideas, we’re really generating new TOCs. We’re randomly drawing TOCs from the pool of all TOCs, instead of drawing projects from the pool of projects.
- Why use TOC impact in exploitation phase: For simplicity, and for reasons ii.~v. in section 1.1. Whether the simplification is realistic depends on whether reasons ii.~v. hold. (This is one of the limitations of this study.)
To sum up,
Claim 3: We, as a community and as individuals, should spend more than half of our effort on searching for potential new projects, rather than working on known projects.
In contrast:
- In EA Survey 2018, 2019 and 2020, “cause prioritization” was selected by approximately 12% of respondents as the top cause. (I assume that “exploring unknown causes” is also part of “cause prioritization”)
- The survey at EA Leaders Forum gives 9.2% as the optimal percentage of resources spent on cause prioritization.
- It’s estimated based on EA Survey 2019 data that approximately 10% of EA human resources are spent on cause prioritization.
Overall, the EA community as a whole currently seems to allocate approximately 9~12% of its resources to cause prioritization.
Note that donation and grantmaking work (e.g. pouring money into GiveDirectly) is often more scalable than research, community building, management, and other types of work, and this may be part of the reason why the portion of resources spent on cause prioritization is so low. However, the 12% figure and the 10% figure only counted people and not money, so the low scalability of human resources (compared to money) shouldn’t be a big problem there.
Taking one step further, one may argue that cause prioritization research is especially unscalable, and much less scalable than cause-specific research, community building, etc. I don’t have much to say about whether this claim is true, and whether it is enough to justify the 9~12% figure. I’d love to receive input on this.
2 Prioritization: Comparing Known Projects
Unless otherwise specified, all “impact” in section 2 refers to TOC impact rather than total impact.
2.1 Potential TOC Impact of Any Particular Project Follows a Power Law
From now on let’s turn our attention to evaluating known projects. We no longer care about the impact distribution of all potential projects; instead, we focus on how the impact of any particular project is distributed across all potential scenarios.
More rigorously speaking, the distribution can be defined as follows:
- Denote the impact of the project by the random variable $X$, where the randomness comes from the uncertainty over “which future will be actualized”, among all possible futures that could come about. (Our project would have different sizes of impact in different futures.)
- Then what is the probability distribution of $X$?
Here I argue that such a distribution is roughly as heavy-tailed as the Pareto distribution (aka power law), if not more. I present two arguments here:
- A Pareto distribution predicts that, conditional on our project having an impact of at least $x$ units, there’s a constant probability that the project reaches an impact of at least $2x$ (in contrast to this probability decreasing rapidly with increased $x$), which fits my intuition.
- For long-termist projects, the expected impact is usually dominated by a tiny slice of possibility. In fact, the expected impact seems to be infinite. This indicates that the tail is at least as heavy as Pareto.
- One may (very reasonably) object to Pascal’s wager and thus modify the expected value decision framework to disallow wagers. Often, (variants of) argument 2 still holds under the modified framework.
Argument 1:
- The tail distribution of Pareto is asymptotically $\Pr(X \ge x) \propto x^{-\alpha}$ (where $\alpha$ is the shape parameter), meaning that $\Pr(X \ge 2x \mid X \ge x)$ is always equal to the constant value $2^{-\alpha}$, regardless of $x$.
- This prediction fits my intuition. I would find it rather unintuitive if the conditional probability above turned out to decrease rapidly with increased $x$ (decaying roughly like a power of $x$, as the log-normal distribution predicts).
Argument 1 is isomorphic to the argument in section 1.2 about tail distributions. One may wonder whether there’re some connections that we can draw between the two arguments, and between the two distributions (the impact distribution across projects, and the impact distribution across scenarios).
I think the two distributions (and therefore the two arguments) are orthogonal, in the sense that any conceivable impact-distribution-across-projects can, in principle, be combined with any conceivable impact-distribution-across-scenarios. As a result, the two arguments ought to be evaluated separately. More details on this in the appendix.
Argument 2:
- Here for the sake of the argument, I shall assume hedonistic utilitarianism and a plain expected value approach. The former isn’t central to the argument, and in argument 2a I’ll try to weaken the latter.
- For X-risk reduction, a very likely scenario may be that humans go extinct after at most a few thousand years, despite our best efforts; or that our bet in AI doesn’t eventually pay off. In such a scenario, we at best manage to extend the life of our civilization by a few centuries, which is almost negligible compared to what we initially hoped for.
- There is another possibility that humans manage to sustain their existence (i.e. keeping X-risk extremely small), but fail to (or aren’t willing to) expand to the whole galaxy and fill it with computational hedonium, contrary to what we had hoped.
- One more possibility is that humans do fill the galaxy with hedonium, and we can finally be happy.
- No, not yet. Humans could also expand to the infinite multiverse, or find some way for hypercomputation, or manage to improve the welfare of mathematical concepts (the population size of which is the largest infinity, whatever that means), or …
- Every scenario above is less likely (and perhaps way less likely) than the previous one, yet seems to dominate the previous scenario with its astronomically larger stake. Such cascading dominance also suggests that the distribution has an infinite mean (i.e. expected value). Both traits point to the distribution being at least as heavy-tailed as Pareto.
- Now, this only shows that the impact distribution of X-risk reduction is at least as heavy-tailed as Pareto, rather than that of every intervention. However, in real life, almost every time I try to do back-of-the-envelope calculations for an action’s impact, I eventually find myself in a situation like the one described above. This makes me suspect that situations like this (aka Pascal’s wager) are very common and probably ubiquitous in impact evaluations.
- To quote this post, a compactly specified wager can grow in size much faster than it grows in complexity. This is the essence of the problem here. And this characterization is general enough, that I find it likely that we can find such wagers everywhere, in the space of all possible events.
- Note that I have used a longtermist intervention as example. I am less certain about short-termist interventions and whether the argument applies in this case. It’s worth pointing out that neither of the two points here (being dominated by tail, and having infinite mean) is necessary for a distribution to be Pareto, so even if this argument doesn’t hold for short-termist interventions, it does not directly reject the Pareto claim.
Argument 2a: Do anti-wager modifications undermine argument 2?
- In argument 2 I assumed a plain expected value approach, which allowed for Pascal’s wager; in fact, argument 2 itself relies on Pascal’s wager.
- Many people consider Pascal’s wager bad, and thus modify the expected value approach in certain ways to disallow wagers. Below I consider some of these modifications, and examine whether (variants of) argument 2 still takes effect under these modified frameworks.
- Spoiler alert: yes and no, depending on the exact modification.
- I don’t intend to argue for any of the approaches here. Instead, my reasoning looks like:
- 1. Most EAs adopt an expected value framework for decision theory.
- 2. Most EAs refuse to be pascal-mugged.
- 3. Therefore, most EAs must be using some kind of anti-mugging modification to the EV framework.
- 4. All the reasonable modifications I can think of are listed here. If I haven't missed anything, then most EAs must be using one of these approaches here.
- 5. Some of the modifications listed here undermine argument 2; others don't. If the one you’re using falls into the latter category, then argument 2 should still work for you; otherwise, ignore argument 2.
- Risk aversion from diminishing returns: In economics and finance, risk aversion is often modeled using a utility function that displays diminishing returns, while still keeping the expected value framework.[11] Again I adopt isoelastic utility functions here.[12]
- When our isoelastic utility function has strictly weaker diminishing returns than a logarithmic utility function does (i.e. $\eta < 1$), the Pareto-ness of the impact distribution isn’t affected. As long as the original distribution (assuming no diminishing returns) is Pareto, our new distribution (assuming diminishing returns) is also approximately Pareto. (A short derivation is sketched after this list.)
- When our isoelastic utility function has equal or stronger diminishing returns than a logarithmic utility function does (i.e. $\eta \ge 1$), the Pareto-ness is broken.
- Risk aversion by discounting extreme outcomes: Alternatively, we can also model risk aversion using RDEU models. Here possible outcomes are ranked by utility, and are each assigned a weight based on the rank. We usually assign smaller weights to the best outcomes than to the near-median outcomes, in order to discount extreme cases.
- Whether Pareto-ness is preserved, depends on the strength of discounting. Weak discounting preserves Pareto-ness, while strong discounting breaks Pareto-ness.
- In particular, the Pareto-ness is preserved when outcomes are discounted in proportion to their rank, i.e. the 10% best outcome is discounted 10x, 1% best outcome is discounted 100x, 0.1% best outcome is discounted 1000x, and so on.
- Fundamental assumptions set in stone: Sometimes we avoid wagers by setting certain assumptions in stone (making them “axioms” in our decision framework), and thus give their opposite positions zero credence. Examples of such assumptions include “everything is finite”, “only biological beings can be sentient”, and “supernatural things can’t be true”.
- We can roughly think of their effect as “cutting off any outcome above a certain utility threshold”, if the assumption that we make toggles a dominant factor of utility. For example, the “only biological beings can be sentient” assumption cuts off any possibility of non-biological sentience, yet non-biological sentience seems necessary for almost all of the best- and worst-case scenarios to a hedonistic utilitarian; therefore, by making this assumption we cut off the tail (starting from some threshold) of the impact distribution.
- The heavy-tailedness of the resulting distribution depends on where the utility threshold (at which we perform the cut) lies. The higher the threshold, the closer our resulting distribution is to Pareto; though this is hard to quantify.
- So impact follows a Pareto distribution except in extreme scenarios (e.g. something like “the 0.1% best-case scenarios”). Note that the Pareto distribution here will have infinite mean (as argued in argument 2), but the impact distribution itself won’t.
- Sense check with intuition: We could also simply ignore anything that doesn’t make intuitive sense. This works in a similar way as making fundamental assumptions.
- Heuristics about ignoring wagers: One may argue that “ignore wagers” is a decent heuristic, because priorities look quite similar with and without wagers. Such arguments make no claim about our decision theory or about the distribution of impact, and so are completely compatible with argument 2. Therefore, according to these arguments, we get rid of wagers without banning impact distributions that have infinite mean.
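Here is the short calculation behind the isoelastic case above (the $\eta < 1$ vs. $\eta \ge 1$ split). If impact $X$ has a Pareto tail $\Pr(X \ge x) \propto x^{-\alpha}$ and we measure utility as $U = \frac{X^{1-\eta}}{1-\eta}$ with $\eta < 1$, then

$$\Pr(U \ge u) = \Pr\!\Big(X \ge \big((1-\eta)u\big)^{\frac{1}{1-\eta}}\Big) \propto u^{-\frac{\alpha}{1-\eta}},$$

so utility is again Pareto-tailed, just with the larger shape parameter $\frac{\alpha}{1-\eta}$. For $\eta = 1$ (logarithmic utility), $U = \ln X$ has an exponential rather than Pareto tail, and for $\eta > 1$ utility is bounded above, so the Pareto-ness is broken.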
As we have seen, among all the possible modifications above, some break Pareto-ness (call these “type I” modifications), some preserve Pareto-ness but disallow infinite mean (type II), and some preserve both Pareto-ness and infinite mean (type III). Alternatively, we can choose not to make any modification and use the plain expected value framework, which results in a Pareto distribution with infinite mean (type IV).
In the rest of section 2, we will accommodate types II, III, and IV. More specifically:
- Claim 4 below will focus on types II, III, and IV.
- Section 2.2 (claim 5) will focus on types II, III, and IV.
- Section 2.3 (claims 6a, 6b) will focus on types III and IV, but the results should approximately apply to type II, albeit with a likely significant error.
To sum up,
Claim 4: For any particular project, the distribution of its TOC impact across all potential scenarios is (roughly) at least as heavy-tailed as the Pareto distribution.
I expect the distribution of total impact to be lighter-tailed than that of TOC impact, because:
- Adding non-TOC impact introduces additive terms, as opposed to only multiplicative ones.
- It seems prima facie plausible that “how long a project’s impact lasts, absent lock-in events” follows an exponential distribution (which means the dilution rate is constant in each year), which is lighter-tailed than Pareto (and even log-normal) distribution.
- However, many high-impact projects aim to have impact through lock-in events, which undermines the “absent lock-in events” condition.
However, I don’t have much to say about exactly how heavy-tailed it is.
Sidenote on the seriousness of the problem posed by wagers:
- Wagers lead to infinite expected values, which is bad enough.
- But things get even worse when we take into account negative impact: we have a positive tail with an expected value of infinity, and we have a negative tail with an expected value of negative infinity. What will result if we add up the two of them? Well, the result is certainly undefined.
- Even if you define the result using something like the Cauchy principal value (i.e. cut off the positive tail at some probability threshold, and cut off the negative tail at the same probability threshold), it’s still extremely hard to calculate in practice, because even a small difference between the two thresholds can alter the sign of the result (a curse of the heavy tails). Plus, it’s unclear whether the Cauchy principal value exists at all (i.e. whether setting the probability threshold to 0.0001 and to 0.000001 will give consistent results), and, if it does exist, whether it complies with the four axioms of VNM-rationality.
- This is discussed in more detail here. Currently I don’t know of any satisfying solution to this problem.
2.2 Prioritization Works Like Bilocation
An implication of the impact distribution (of any particular project) being heavy-tailed, is that a large part of its impact (say, 90%) comes from the tail (occupying, say, 10% probability).
Now suppose that there’re two projects, A and B, in front of us, and we’ll work on one of them. The impact distributions of the two projects are identical but independent. We can directly start working on A (or B if you want), but alternatively we can carry out the following procedure:
- Step 1: Examine how impactful project A really is. In other words, we examine “which part of A’s impact distribution will be actualized”.
- Step 2a: If we find that the 10% tail will be actualized, work on A.
- Step 2b: If we find that the other 90% will be actualized, work on B.
The good thing about this procedure is that:
- We’ll probably not work on A (we avoid it with 90% probability), and thus save our time for B.
- Yet, almost all of A’s impact (the 90% that lies in the 10% tail) will be captured by us.
We will work only on A or B, but capture most of the expected impact from both A and B. By prioritizing between A and B, we manage to achieve something similar to bilocation.
This is a feature of heavy-tailed distributions. Had the impact followed a normal distribution, there wouldn’t be much to gain from prioritizing between equally good projects; we can at best expect to get one or two standard deviations of additional impact.
Note that this feature relies on the independence assumption. Because A and B are independent, their tails are largely disjoint, and so we can expect to capture both of their tails by prioritizing.
Now let’s quantify this. Suppose that the impact distributions of both projects follow a Pareto distribution (with parameters $\alpha$ and $x_m$, the latter of which doesn’t matter much to us). Then $G$, the extra gain from prioritizing (compared to working only on an arbitrary project), equals $\frac{(\alpha-1)^{\alpha-1}}{\alpha^{\alpha}}$. Below is a plot of $G$ over $\alpha$.

[Plot of $G$ over $\alpha$ omitted.]

$G$ is undefined for $\alpha \le 1$, because the expected impact is infinite in this case. Note that infinite expected value doesn’t necessarily lead to being Pascal-mugged, as shown by the last three approaches in argument 2a of section 2.1.[13]

For long-termist projects, I would guess that $\alpha \le 1$ holds (that is, the impact distribution has infinite mean), the reason for which is presented in argument 2 of section 2.1. To avoid dealing with infinities, I shall assume that for long-termist projects, $\alpha$ is larger than but sufficiently close to 1, as a nearest approximation. This gives $G \approx 1$, meaning we can double our impact by prioritizing.

The choice of $\alpha$ is heavily influenced by subjective factors like one’s policy towards wagers, so very reasonable people may still disagree with me on whether $\alpha \le 1$. However, even conservatively choosing $\alpha = 1.16$ (meaning that 80% of the impact comes from the top 20% of scenarios) gives $G \approx 0.63$ (meaning we can gain 63% more impact from prioritizing), so the claim $\alpha \le 1$ isn’t essential here.
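A minimal simulation sketch of the procedure above, under my assumption that the evaluation threshold on A is chosen optimally (the closed-form expression for $G$ used here is my own derivation under that assumption; with $\alpha = 1.16$ it gives about 0.63, matching the 63% figure):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n = 1.16, 5_000_000            # heavy tails make the estimate fairly noisy

# Impacts of A and B in each sampled future scenario: iid Pareto(alpha), minimum 1.
a = (1 - rng.random(n)) ** (-1 / alpha)
b = (1 - rng.random(n)) ** (-1 / alpha)

p = ((alpha - 1) / alpha) ** alpha    # tail probability of the (optimal) threshold on A
threshold = p ** (-1 / alpha)
value = np.where(a > threshold, a, b) # work on A only when its tail is actualized

print("simulated gain G:", value.mean() / a.mean() - 1)
print("closed form (alpha-1)^(alpha-1) / alpha^alpha:",
      (alpha - 1) ** (alpha - 1) / alpha ** alpha)
```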
In the above model, we only evaluated A instead of both A and B. If we evaluate both A and B, and work on the one with higher impact, then we should expect an approximately $2^{1/\alpha} - 1$ boost in impact, which is 100% as $\alpha \to 1$ and 82% for $\alpha = 1.16$.

Below is a plot of this boost over $\alpha$.

[Plot omitted.]

If prioritization increases our cost-effectiveness by a factor of $1+G$ (i.e. a gain of $G$), then we should be willing to do such prioritization as long as it takes less than $\frac{G}{1+G}$ of our time, assuming the gain from direct work is linear in the amount of time spent. For $G = 100\%$ this is 50%, for $G = 63\%$ this is 39%, and for $G = 82\%$ this is 45%.
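The $\frac{G}{1+G}$ break-even comes from a simple comparison: if prioritization takes a fraction $f$ of our time and multiplies the cost-effectiveness of the remaining direct work by $1+G$, then prioritizing beats not prioritizing exactly when

$$(1+G)(1-f) \ge 1 \;\Longleftrightarrow\; f \le \frac{G}{1+G}.$$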
To sum up,
Claim 5: When prioritizing between two equally good longtermist projects that aren’t strongly correlated, the prioritization is worth doing as long as it takes no more than roughly 40%~50% of total effort to find out the actual impact of at least one project, or to compare which one of the two projects will eventually be more impactful.
One important thing to note is how unrealistic “finding out the actual impact of one project” or “comparing which project will eventually be more impactful” is, as doing so implies reducing uncertainty to zero, which is clearly impossible. In reality what we can achieve through prioritization will be more modest, and therefore the portion of effort we should spend on prioritization will likewise be more modest than the 40%~50% indicated here. This limitation also applies to the findings in section 2.3.
For short-termist projects I’m unclear about what value of $\alpha$ is appropriate, and therefore prefer to leave them out. Unless otherwise specified, I’ll be focusing on longtermist projects in the remaining parts of section 2, though the conclusions will likely apply to some extent to short-termist projects.
Before we proceed, there’s still one question to ask: why are we using TOC impact (which justifies the use of a Pareto impact distribution), rather than total impact, when modeling prioritization?
- Because prioritization is based on evaluation of projects, and TOC impact is much easier to evaluate than non-TOC impact.
- I consider this a weak argument, as it seems possible to give a rough guesstimate for the size of non-TOC impact.
- Because, as a matter of fact, we often evaluate only TOC impact (or a few most salient pathways of impact) when we evaluate projects. This might be suboptimal (or not), but currently this is how most things are done.
- I’m uncertain about the truth of this claim. Please correct me if I’m wrong.
- This argument takes effect as long as we limit ourselves to studying “how important prioritization is, in the way it’s currently done”, rather than “in the way it’s ideally done”.
- For reasons ii.~v. in section 1.1, which (as previously mentioned) I have mixed attitudes on.
- For simplicity.
- This isn’t a good reason, and should be seen as a compromise.
2.3 A Multi-Project Model of Prioritization
The reasoning in the two-project case naturally generalizes to the multi-project case. Instead of two projects, we now have $N$ projects, the impacts of which are independent and identically distributed.
Instead of simply picking one arbitrary project and working on it, we can identify the project whose actual impact is largest, and devote all the remaining effort to it. The expected impact of the best project is approximately $N^{1/\alpha}$ times that of an arbitrary project, so here the improvement ratio $G$ equals $N^{1/\alpha} - 1$.[14]
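A quick check of the $N^{1/\alpha}$ scaling, using the standard order-statistics formula for the maximum of $N$ iid Pareto draws (the particular $\alpha$ and $N$ values are illustrative choices of mine):

```python
import numpy as np
from scipy.special import gammaln

def ratio(alpha, N):
    # E[max of N iid Pareto(alpha)] / E[single Pareto(alpha) draw]
    # = Gamma(N+1) * Gamma(2 - 1/alpha) / Gamma(N + 1 - 1/alpha)
    return np.exp(gammaln(N + 1) + gammaln(2 - 1 / alpha) - gammaln(N + 1 - 1 / alpha))

for alpha in (1.01, 1.16, 1.5):
    for N in (10, 100, 1000):
        print(f"alpha={alpha:4}, N={N:4}: exact ratio {ratio(alpha, N):8.1f}, "
              f"N**(1/alpha) = {N ** (1 / alpha):8.1f}")
```

The approximation is good up to a modest constant factor, and the exact ratio approaches $N$ as $\alpha \to 1$, which is what the $G = N - 1$ step below relies on.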
For longtermist projects I will assume $\alpha = 1$ (that is, the impact distribution has infinite mean). If you disagree with this and adopt an $\alpha$ that’s slightly higher than 1, the conclusions in this section should still (approximately) hold because they’re already approximate results, but the approximation error will likely be much larger.
Assuming $\alpha = 1$, we get $G = N - 1$ (which means we can capture the expected impact of all $N$ projects just by working on the best one!), and so the acceptable ratio of effort spent on prioritization is $\frac{G}{1+G} = \frac{N-1}{N}$, which is close to 1 if $N$ isn’t very small. This means we should be willing to spend as little as $\frac{1}{N}$ of our effort on direct work, as long as the remaining effort spent on prioritization successfully leads us to the ex-post best project. Also, we can assume that the $\frac{N-1}{N}$ units of effort spent on prioritization are evenly allocated to each of the $N$ projects, which means roughly $\frac{1}{N}$ unit for each.
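As a concrete illustration (the choice $N = 10$ is arbitrary, just to make the numbers tangible):

$$G = N - 1 = 9,\qquad \frac{G}{1+G} = \frac{9}{10} \text{ to prioritization},\qquad \frac{1}{10} \text{ to direct work},\qquad \frac{9/10}{10} \approx \frac{1}{10} \text{ to evaluating each project}.$$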
To state the finding in another way,
Claim 6a: When faced with $N$ ($N$ isn’t too small) equally-good longtermist projects that aren’t correlated, we should act as if the $N$ tasks of evaluating every project are each as important as working on the best project that we finally identify, and allocate our effort evenly across the $N+1$ tasks, as long as we are able to identify the ex-post best project by working on the $N$ evaluation tasks.
Longtermist projects are strongly and positively correlated, as most of them rely on the assumption that the future will be large and/or positive (there may also be other, more subtle ways of correlation). This seriously violates the independence assumption that we previously made. Thus a natural next step is to relax the independence assumption.
It’s not an easy job to determine the dependence structure among different projects, and I don’t have a very satisfying way of doing that. Here I present two approaches, a white-box approach and a black-box approach, the results of which are only meant as ballpark estimates and are to be compared with each other.
The white-box approach: clarify where the correlations come from, and model the correlation structure accordingly.
- Suppose that our $N$ projects have impacts $X_1 = S \cdot T_1, \dots, X_N = S \cdot T_N$ respectively, where $S$ denotes the “shared” factors that take the same value across all these projects (e.g. how large the future will be), and $T_i$ denotes the other factors that take different values for different projects (e.g. whether our theory of change for this project works, how efficiently we’re executing this project).
- Here we assume that:
- $S, T_1, \dots, T_N$ are mutually independent (so $S$ accounts for all the correlations).
- $S$ follows a Pareto distribution with parameter $\alpha_1$; $T_1, \dots, T_N$ each follows a Pareto distribution with parameter $\alpha_2$.
- Factors in $T_i$ seem analogous to the factors that determine the short-term performance of a for-profit startup. According to this paper, the revenue distribution of for-profit startups is approximately Pareto,[15] and we will set $\alpha_2$ to the shape parameter estimated there.
- This analogy has (potentially big) flaws. For example, the altruistic space isn’t as competitive as the startup space, and the startups’ “theories of profit” usually aren’t as speculative as the longtermists’ theories of change.
- Note that the paper examines the distribution of actualized revenue across startups, not a particular startup’s distribution of revenue across scenarios. However there are some reasons (albeit not very strong) to think that the two distributions are approximately equal. Details in the appendix.
- Overall I have low confidence in this choice of $\alpha_2$, and I think this is the weakest part of this approach.
- The expected impact of the best project is approximately $N^{1/\alpha_2}$ times that of an arbitrary project, so here the improvement ratio $G$ equals roughly $N^{1/\alpha_2} - 1$.
The black-box approach: use a model that doesn’t necessarily correspond to the underlying mechanism, but that displays traits we expect to see in this problem, and then fill in best-guess estimates of parameters.
- We already knew (or already assumed) that the impact distribution of each individual project is Pareto, but the dependence structure (i.e. how the impact of these projects correlates) is left to be determined.
- Most longtermist projects rely on the future being large and/or positive (and it’s likely that there’re also other ways of correlation); this implies a positive correlation between different longtermist projects, and the correlation is much stronger in the positive tail than in the bulk (in other words, these projects succeed in similar ways, but fail in all different ways). Therefore I use a reversed Clayton copula to model the dependence structure.[16]
- Imagine two longtermist projects, A and B, that are equally popular in the community, but you think A is more impactful than B in expectation, because you have different ideas about how the future will go. Imagine that in the next five years, you’ll have a bunch of fundamental discoveries that reshape your idea about the future, to the highest extent possible (that is, your previous credences are completely erased). Then what’s the probability that B will turn out to be better than A, after these five years?
- My personal best guess for this probability $p$ lies somewhere between two round values, and closer to the latter of the two.[17]
- I interpret this answer as follows: given two projects A and B with equal expected value, if we randomly sample two future scenarios, then there’s a probability of $p$ that the “better project between A and B” is different in the two scenarios.
- Below I’ll assume $p$ equals the latter of those two values, for two reasons:[18]
- 1. That value (and probably also the former) is mathematically tractable in later derivations, whereas my exact best guess likely isn’t.
- 2. My best guess for $p$ is closer to it than to the former.
- The value of $p$ tells us the strength of correlation between the impacts of two individual projects. Based on that, we can indirectly determine the parameter of our multivariate impact distribution.
- As a demonstration, the scatter plot below depicts the bivariate impact distribution (that is, when there’re two projects) that we get. Each dot represents a sampled future scenario. The x-axis stands for the logarithm of project A’s impact, and the y-axis stands for the logarithm of project B’s impact. The minimum possible impact is assumed to be 1.
- Now we consider the $N$-dimensional case (that is, there are $N$ projects). Denote the impacts of our $N$ projects by $X_1, \dots, X_N$ (so $X_1, \dots, X_N$ are identically distributed but mutually dependent). Under this dependence structure, $\mathbb{E}[\max_i X_i]$ turns out to be only a modest multiple of $\mathbb{E}[X_1]$, so by choosing the best project to work on, we can have an impact only that modest multiple times that of each individual project, and the improvement ratio $G$ is correspondingly modest.[19] (A small sampling sketch in this spirit is given after this list.)
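For illustration, here is a minimal sampling sketch of this kind of model: a survival (“reversed”) Clayton copula with Pareto marginals, so that projects are strongly correlated in their success tail but not in their failures. All parameter values ($\theta$, $\alpha$, $N$) are my own illustrative choices, not the ones behind the estimates in this section.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, alpha, N, trials = 1.0, 1.2, 30, 200_000

# Marshall-Olkin sampling of a Clayton copula: U has uniform marginals and strong
# dependence in its lower corner; mapping X = U**(-1/alpha) turns that corner into
# jointly huge Pareto impacts, i.e. the projects "succeed in similar ways".
v = rng.gamma(1.0 / theta, 1.0, size=(trials, 1))
e = rng.exponential(1.0, size=(trials, N))
u = (1.0 + e / v) ** (-1.0 / theta)
x = u ** (-1.0 / alpha)               # Pareto(alpha) impacts of the N projects

print("E[best of N] / E[single project]:", x.max(axis=1).mean() / x.mean())
print("independent-case benchmark N**(1/alpha):", N ** (1 / alpha))
```

Because the projects’ successes co-occur, the first number comes out well below the independent-case benchmark, which is the qualitative point behind Claim 6b.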
The two estimates of $G$ that we get here, from the white-box and the black-box approach respectively, imply a much smaller boost from prioritization than in the independent case.
Recall that “we should be willing to do such prioritization as long as it takes less than $\frac{G}{1+G}$ of our time”, so the fraction of time to spend on direct work should be $\frac{1}{1+G}$, evaluated at each of the two estimates of $G$ obtained here.
To sum up,
Claim 6b: When the $N$ projects are strongly correlated, prioritization becomes much less important than in the no-correlation case. However, one should still be willing to spend only a small portion of one’s effort on direct work (note that this portion is already much higher than the roughly $\frac{1}{N}$ in Claim 6a), and to spend the remaining portion on prioritization.
3 Examples and Applications
In this final part of the document, I try to bring the theory into the real world.
Section 3.1 explains what kind of real-world cases can be meaningfully compared with our model (“desiderata”), and what factors influence the conclusion about the importance of E&P (“factors”).
Section 3.2 examines a few real-world cases - not necessarily related to EA - in order to verify the claims in section 3.1 .
Section 3.3 tries to apply the insights that we gained, in an EA context.
3.1 Desiderata to Check For, and Factors to Watch
In the real world, it’s often hard to tell apart exploration and prioritization, so in section 3 I don’t distinguish between exploration and prioritization, and will use E&P to refer to them collectively. That is, we will only distinguish between two types of work: E&P and exploitation.
In order to draw a line between those cases that can be meaningfully compared with our model and those that can’t, here I give two desiderata of the real-world cases that we will examine:
- The goal can be roughly described as maximizing some meaningful kind of (expected) utility.
- Positive example: Angel investment satisfies this desideratum, as it maximizes expected returns.
- Negative example: Most donations don’t satisfy this desideratum, as they don’t seem to maximize any meaningful kind of utility (in particular, impact).
- Sometimes, even if the goal is to maximize expected utility, people and institutions still take actions that are far from the optimum because of irrationality or inefficiency. Due to the subjectiveness of identifying irrationality and inefficiency, I will ignore them here and simply assume that people are always rational and efficient.
- There is a fixed total budget (monetary or not) for E&P + exploitation.
- Positive example: Philanthropic grantmaking satisfies this desideratum, as the grant money and the salary of reviewers come from the same fixed pool of money (which is the worth of the fund).
- Negative example: The development (-> E&P) and production (-> exploitation) of a commercial product don’t satisfy this desideratum, as the latter doesn’t consume money but brings in money.
Different examples comply (or don’t comply) with our previous claims (that lots of effort should be spent on E&P) to different degrees. I think the differences are mainly due to three factors:
- Difficulty of reducing uncertainty by E&P: In some cases you can reduce uncertainty to zero if you work hard enough on E&P, but in some other cases E&P can’t help you much.
- Comparative scalability of E&P vs exploitation: E&P can be scalable (i.e. weak or no diminishing returns) or not, and exploitation can be scalable or not. Holding other factors constant, the more scalable you are, the more resources you deserve.
- An extreme case of low scalability (i.e. strong diminishing returns) is a utility function capped at a low threshold. Once you reach that threshold, adding more resources will bring you zero benefit. This is the case for tasks where you only need to “satisfice” rather than “optimize”.
- Heavy-tailedness of opportunities: There’s no point doing E&P, if all opportunities are equally valuable. The more heavy-tailed the distribution is, the more value E&P will bring.
To sum up,
Claim 7: In the real world, the most important factors determining the applicability of our model are the difficulty of reducing uncertainty by E&P, the comparative scalability of E&P vs exploitation, the heavy-tailedness of opportunities, the existence of a fixed total budget, and a utility-maximization objective.
- Confidence: Moderate (hard to quantify)
We will test this claim in section 3.2 .
3.2 Real-World Examples
In this section we look at real-world examples, in order to test our theory. All examples satisfy the two desiderata in section 3.1 . Each example will be presented in the following format:
Name of the example (what E&P and exploitation respectively mean in this example)
(the number after the name represents the resource share of E&P in this example, from highest (5) to lowest (0))
- Difficulty of reducing uncertainty by E&P [scored from lowest (5) to highest (0)][20]
- Comparative scalability of E&P vs exploitation [scored from E&P much more scalable (5) to exploitation much more scalable (0)]
- Heavy-tailedness of opportunities [scored from highest (5) to lowest (0)]
For the three factors listed, higher scores roughly mean “positively contributing (to the importance of E&P)”, while lower scores mean “negatively contributing”. Every entry is followed by a number from 0-5; these numbers can be seen as subjective scores assigned to each entry.
Caveat: I have very limited domain knowledge for many examples below, so it’s very possible that I make false claims about specific examples (and please correct me when I make them!).
Example(s) where E&P receive >50% resources
Catching an escaped criminal (5) (E&P: searching for the hiding place; exploitation: arresting)
- Most of the time and effort are spent on searching rather than arresting.
- Difficulty of reducing uncertainty by E&P: Low (4)
- You can usually reduce the uncertainty to zero (i.e. find out whether an area contains the true hiding place) with hard work and some luck.
- Comparative scalability of E&P vs exploitation: E&P moderately more scalable (4)
- During E&P, there’s a wide range of places & information sources that you need to search for clues in.
- During exploitation, just a few policemen will be enough to bring the criminal back in a few hours.
- Heavy-tailedness of opportunities: Extremely heavy-tailed (5)
- Only finding the true hiding place will do.
Building small deep learning models (5) (E&P: experimenting to find the best design;
exploitation: the final training run)
- Most (almost all?) of the time and compute is spent on experiments, as opposed to the final training run.
- Difficulty of reducing uncertainty by E&P: Very low (5)
- You can reduce the uncertainty to zero (i.e. find out how well a network works) by training and testing it.
- Comparative scalability of E&P vs exploitation: E&P much more scalable (5)
- During E&P, you can do many training runs.
- During exploitation, you do only one.
- Heavy-tailedness of opportunities: Rather heavy-tailed (3)
- Network designs that break SOTA will bring you lots of benefits. Network designs that don’t break SOTA (which are the majority of all reasonable designs) will bring you little benefit.
Example(s) where E&P receive 10%-50% resources
Building large language models (2) (E&P: experimenting to find the best design;
exploitation: the final training run)
- A significant minority of compute is spent on experiments.[21]
- Difficulty of reducing uncertainty by E&P: Very low (5)
- You can reduce the uncertainty to zero (i.e. find out how well a network works) by training and testing it.
- Comparative scalability of E&P vs exploitation: Exploitation much more scalable (0)
- During E&P, you only need to experiment with small models; there isn’t much information to gain from experimenting with larger models.
- During exploitation, performance reliably (and significantly) increases with increased model size.
- Heavy-tailedness of opportunities: Rather light-tailed (2)
- In the case of large language models, scaling reliably increases performance, so non-significant differences in performance due to different network designs become much less important.
Example(s) where E&P receive <10% resources
Hiring a lawyer (1) (E&P: finding the lawyer;
exploitation: pay the attorney fees and go through the litigation process)
- (Claims below are pure speculations. Please correct me if I make false claims.)
- Almost all the money is spent on exploitation rather than E&P.[22] Most effort is spent on exploitation (working with the lawyer during the litigation process) rather than E&P.
- Difficulty of reducing uncertainty by E&P: High (1)
- Most people don’t have good sources on what lawyers there are and who are competent.
- Comparative scalability of E&P vs exploitation: Exploitation somewhat more scalable (2)
- Spending 2x time asking around is less effective than being willing to pay twice as much money.
- Heavy-tailedness of opportunities: Rather light-tailed (2)
- For any given case, I’m not sure how much a top lawyer’s success rate will differ from that of a mediocre lawyer. But I suspect that when we take into account the cost (attorney fees), this difference will be reduced, due to the difference in fees.
- In general, I guess in a market equilibrium it’s usually hard to find economic activities that are heavy-tailed, due to the definition of equilibrium.
Grantmaking of Open Philanthropy (0) (E&P: evaluating applications;
exploitation: sending the grant money)
- I estimate that 0.4%-3%[23] of Open Phil’s annual cost (salary + grant money) is staff salary.
- Difficulty of reducing uncertainty by E&P: Very high (0)
- It’s impossible to reduce uncertainty to zero (i.e. know with certainty which application will be most impactful), and very hard to reduce uncertainty to a significant extent. This is part of the reason why Open Phil adopted the hits-based giving approach.
- Comparative scalability of E&P vs exploitation: Exploitation much more scalable (0)
- For E&P, an in-depth analysis of an application takes many times more effort than a ballpark estimate does, but isn’t many times more helpful. E&P displays strong diminishing returns.
- For exploitation there are also diminishing returns, but they are significantly mitigated by the hits-based giving approach. I think the diminishing returns in exploitation are weaker than those in E&P.
- Heavy-tailedness of opportunities: Highly heavy-tailed (4)
- See section 1.1 and 1.2 on this.
- Unlike most economic activities, the philanthropy and altruism space is extremely far from equilibrium, so you may expect to see more heavy-tailed things here.
Venture Capital (0) (E&P: evaluating startups; exploitation: investing in chosen startups)
- Almost all of the money is spent on exploitation rather than E&P.
- (Reasons for the three claims below are similar to those for Open Phil’s.)
- Difficulty of reducing uncertainty by E&P: Very high (0)
- Comparative scalability of E&P vs exploitation: Exploitation much more scalable (0)
- Heavy-tailedness of opportunities: Highly heavy-tailed (4)
For these examples, it’s fair to say that the three factors predict E&P’s resource share quite well. In particular, the total score for the three factors is perfectly monotone with respect to the score for E&P’s resource share. However, this result should be taken with a grain of salt, since:
- I have been thinking about some of the examples when formulating the list of desiderata and factors, so there may be some overfitting going on here, and
- I may be biased (towards scores that are consistent with each other) when giving the subjective scores.
3.3 Applications in the EA Community
In this section we apply the framework to examples in the EA community. We will continue to use the format used in section 3.2:
Name of the example (What E&P and exploitation respectively means in this example)
- Difficulty of reducing uncertainty by E&P [lowest, highest]
- Comparative scalability of E&P vs exploitation [E&P, exploitation]
- Heavy-tailedness of opportunities [highest, lowest]
- Total score of three factors
And here is what I get. The subjective scores are my own (very crude) impression; feel free to fill in your estimates.
(Entries are sorted in decreasing order of total score.)
Cultivating talent (E&P: identifying promising individuals[24];
exploitation: supporting selected individuals[25] in their development)
- Difficulty of reducing uncertainty by E&P: Moderate (2.5)
- Comparative scalability of E&P vs exploitation: Roughly equal (2.5)
- Heavy-tailedness of opportunities: Rather heavy-tailed (3)
- 8
Cause prioritization / Cause-specific work (E&P: cause prioritization;
exploitation: cause-specific work)
- Difficulty of reducing uncertainty by E&P: High (1)
- Comparative scalability of E&P vs exploitation: Exploitation moderately more scalable (1)
- Heavy-tailedness of opportunities: Extremely heavy-tailed (5)
- 7
Within-cause prioritization / Direct work (E&P: within-cause prioritization;
exploitation: direct work)
- Difficulty of reducing uncertainty by E&P: High (1)
- Comparative scalability of E&P vs exploitation: Exploitation moderately more scalable (1)
- Heavy-tailedness of opportunities: Highly heavy-tailed (4)
- 6
Career choice (E&P: choosing a career path; exploitation: progressing on the chosen path)
- Difficulty of reducing uncertainty by E&P: High (1)
- This includes both the uncertainty in personal fit and the uncertainty in prioritization.
- Comparative scalability of E&P vs exploitation: Exploitation much more scalable (0)
- Heavy-tailedness of opportunities: Extremely heavy-tailed (5)
- This includes both the variation in personal fit and the variation in prioritization.
- 6
Employee recruitment (E&P: trying to identify the best applicant;
exploitation: hiring the chosen applicant and paying them a salary)
- Difficulty of reducing uncertainty by E&P: Somewhat low (3)
- Comparative scalability of E&P vs exploitation: Exploitation much more scalable (0)
- Heavy-tailedness of opportunities: Rather light-tailed (2)
- 5
Grantmaking (E&P: evaluating applications; exploitation: sending the grant money)
- Difficulty of reducing uncertainty by E&P: Very high (0)
- Comparative scalability of E&P vs exploitation: Exploitation much more scalable (0)
- Heavy-tailedness of opportunities: Highly heavy-tailed (4)
- 4
To sum up,
Claim 8: In EA, the three areas where E&P deserves the largest portions of resources (relative to the total resources allocated to that area) are
- #1: identifying promising individuals[26] (relative to the budget of talent cultivation),
- #2: cause prioritization (relative to the budget of all EA research and direct work),
- #3: within-cause prioritization (relative to the budget of that cause).
- Confidence: Low (hard to quantify)
Appendix: Mathematical Details
Appendix section Ax.y contains the mathematical details omitted from section x.y.
A1.2
Heavy-tailedness of Pareto and its implication: (we’ll follow the notation here)
- The CDF of Pareto is $F(x) = 1 - (x_m/x)^{\alpha}$ for $x \ge x_m$, so the tail distribution (1 - CDF) is $(x_m/x)^{\alpha}$.
- A $p$ portion of all projects has impact $\ge x_m\,p^{-1/\alpha}$, while a $p/c$ portion of all projects has impact $\ge x_m\,(p/c)^{-1/\alpha}$. Dividing the latter by the former gives $c^{1/\alpha}$: moving a fixed factor further into the tail always multiplies the impact threshold by the same factor, no matter how deep into the tail we already are.
Heavy-tailedness of log-normal and its implication:
- The CDF of a log-normal with parameters $\mu$ and $\sigma$ is $\Phi\!\left(\frac{\ln x - \mu}{\sigma}\right)$. Therefore the tail distribution of log-normal is roughly and asymptotically $\exp\!\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right)$, and $\approx \exp\!\left(-(\ln x)^2\right)$ if we ignore constant coefficients in the exponent.
- A $p$ portion of all projects has impact $\ge \exp\!\big(\mu + \sigma\sqrt{2\ln(1/p)}\big)$ (roughly), while a $p/c$ portion of all projects has impact $\ge \exp\!\big(\mu + \sigma\sqrt{2\ln(c/p)}\big)$. Dividing the latter by the former gives $\exp\!\big(\sigma\big(\sqrt{2\ln(c/p)} - \sqrt{2\ln(1/p)}\big)\big)$, which tends to 1 as $p \to 0$: the deeper into the tail we already are, the smaller the factor by which the impact threshold grows when we move a fixed factor further in. This is the sense in which the Pareto distribution is strictly more heavy-tailed than the log-normal.
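To make the comparison concrete, here is a small numerical sketch (not part of the original text); the parameter values $\alpha = 1.5$ and $\sigma = 2$ are assumed purely for illustration. It prints, for both distributions, the factor by which the impact threshold grows when we restrict attention to a 10x smaller top fraction of projects.
```python
# Compare how the impact threshold grows as we look at a 10x smaller top fraction,
# for a Pareto distribution vs a log-normal distribution.
# Parameter values (alpha = 1.5, sigma = 2) are assumed for illustration only.
from scipy.stats import pareto, lognorm

alpha, sigma = 1.5, 2.0

for p in [1e-2, 1e-3, 1e-4, 1e-5]:
    # isf(p) is the impact threshold that exactly a p portion of projects exceed
    pareto_ratio = pareto.isf(p / 10, b=alpha) / pareto.isf(p, b=alpha)
    lognorm_ratio = lognorm.isf(p / 10, s=sigma) / lognorm.isf(p, s=sigma)
    print(f"top {p:.0e} -> top {p/10:.0e}:  Pareto x{pareto_ratio:.2f},  log-normal x{lognorm_ratio:.2f}")
```
The Pareto ratio stays constant across rows, while the log-normal ratio keeps shrinking, matching the two bullets above.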
A1.3
Proposition 1: If we independently draw $n$ samples from a Pareto distribution with parameters $x_m$ and $\alpha$, the expected maximum can be approximated by $x_m\,\Gamma(1 - 1/\alpha)\,n^{1/\alpha}$, which in particular is proportional to $n^{1/\alpha}$.
Proposition 2: If we independently draw $n$ samples from a Pareto distribution with parameters $x_m$ and $\alpha$, the expected $k$-th largest value can be approximated by $x_m\,\frac{\Gamma(k - 1/\alpha)}{\Gamma(k)}\,n^{1/\alpha}$, which for moderately large $k$ is roughly $x_m\,(n/k)^{1/\alpha}$.
Proposition 3: If we independently draw $n$ samples from a Pareto distribution with parameters $x_m$ and $\alpha$, the expected maximum is precisely $x_m\Big/\binom{n - 1/\alpha}{n}$, where $\binom{n - 1/\alpha}{n}$ is the generalized binomial coefficient.
Proposition 4: If we independently draw $n$ samples from a Pareto distribution with parameters $x_m$ and $\alpha$, the expected $k$-th largest value is precisely $x_m\,\binom{k - 1 - 1/\alpha}{k - 1}\Big/\binom{n - 1/\alpha}{n}$, which equals $x_m\,\frac{\Gamma(k - 1/\alpha)\,\Gamma(n + 1)}{\Gamma(k)\,\Gamma(n + 1 - 1/\alpha)}$.
Proof of propositions 1-4:
- When $n$ is large and $c$ stays bounded, we have $\Gamma(n + c)\,/\,\Gamma(n) \approx n^{c}$.
- Therefore $x_m\Big/\binom{n - 1/\alpha}{n} = x_m\,\frac{\Gamma(n + 1)\,\Gamma(1 - 1/\alpha)}{\Gamma(n + 1 - 1/\alpha)} \approx x_m\,\Gamma(1 - 1/\alpha)\,n^{1/\alpha}$, thus proving proposition 1 from proposition 3.
- Similarly, $x_m\,\frac{\Gamma(k - 1/\alpha)\,\Gamma(n + 1)}{\Gamma(k)\,\Gamma(n + 1 - 1/\alpha)} \approx x_m\,\frac{\Gamma(k - 1/\alpha)}{\Gamma(k)}\,n^{1/\alpha}$, proving proposition 2 from proposition 4.
- (Propositions 3 and 4 themselves follow from writing each sample as $x_m U^{-1/\alpha}$ with $U$ uniform on $[0, 1]$, and using the fact that the order statistics of the uniform distribution are Beta-distributed.)
- Also, here’s a numerical verification (in Python) for propositions 1 & 2:
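(A minimal sketch; the parameter values $x_m = 1$, $\alpha = 3$, $n = 1000$ and $k = 5$ are assumed for illustration.)
```python
# Monte Carlo check of the approximations in propositions 1 and 2.
# x_m, alpha, n, k are assumed values for illustration.
import numpy as np
from math import gamma

rng = np.random.default_rng(0)
x_m, alpha, n, k, trials = 1.0, 3.0, 1000, 5, 10_000

# Pareto samples via the inverse CDF: X = x_m * U**(-1/alpha), U ~ Uniform(0, 1)
samples = x_m * rng.uniform(size=(trials, n)) ** (-1.0 / alpha)
samples.sort(axis=1)

emp_max = samples[:, -1].mean()
emp_kth = samples[:, -k].mean()

approx_max = x_m * gamma(1 - 1 / alpha) * n ** (1 / alpha)             # proposition 1
approx_kth = x_m * gamma(k - 1 / alpha) / gamma(k) * n ** (1 / alpha)  # proposition 2

print(f"maximum:       empirical {emp_max:.2f}  vs approximation {approx_max:.2f}")
print(f"{k}-th largest: empirical {emp_kth:.2f}  vs approximation {approx_kth:.2f}")
```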
The exploration-exploitation process without diminishing returns:
- From proposition 1, we know that the expected impact of the best project is approximately proportional to $n^{1/\alpha}$, and is therefore approximately proportional to $t^{1/\alpha}$, where $t$ is the fraction of total time spent on exploration (so the exact number of samples we can draw per unit of time doesn’t matter).
- The total utility is then proportional to $t^{1/\alpha}(1 - t)$. The derivative of $t^{1/\alpha}(1 - t)$ is $\frac{1}{\alpha}t^{1/\alpha - 1}(1 - t) - t^{1/\alpha}$, the zero of which is $t = \frac{1}{1 + \alpha}$, and we can verify that $t = \frac{1}{1 + \alpha}$ is indeed a maximum.
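A quick numerical check of this optimum (not from the original text; $\alpha = 0.8$ is an assumed value, and any $\alpha < 1$ gives an optimal exploration share above one half):
```python
# Numerical check that t = 1/(1 + alpha) maximizes t**(1/alpha) * (1 - t) on (0, 1).
# alpha = 0.8 is an assumed value for illustration.
import numpy as np

alpha = 0.8
t = np.linspace(1e-6, 1 - 1e-6, 1_000_000)
utility = t ** (1 / alpha) * (1 - t)
print(t[np.argmax(utility)], 1 / (1 + alpha))   # both are approximately 0.5556
```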
The exploration-exploitation process with diminishing returns:
- From proposition 2, we know that the expected impact of the $k$-th best project is approximately proportional to $(n/k)^{1/\alpha}$, and is therefore approximately proportional to $(t/k)^{1/\alpha}$ (so, again, the exact number of samples doesn’t matter).
- Note that in theory we should consider the joint probability distribution of all the $k$ top projects, instead of the EV of each individual top project. But nevertheless individual EVs serve as an acceptable approximation.
- To find the optimal allocation of time, we only need to make sure that the marginal utility from each project is equal. We can, for example, do a binary search on this common marginal utility, from which we can determine the optimal allocation.
- Note that choosing a larger $k$ is never worse than choosing a smaller $k$, as we can allocate zero time to some of the projects if that’s desirable.
- Python code for the optimization and for plotting the resulting utility over $k$ (a sketch is given below):
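A minimal sketch of such code. The diminishing-returns form $f(t) = t^{\beta}$ with $\beta = 0.5$, and the values of $\alpha$, $n$ and the exploitation budget, are assumptions chosen for illustration rather than values taken from the text.
```python
# Sketch of the allocation optimization described above.
# Assumed (not from the text): project quality v_i proportional to (n/i)**(1/alpha)
# for the i-th best project, and diminishing returns f(t) = t**beta with beta = 0.5.
import numpy as np
import matplotlib.pyplot as plt

alpha, beta = 0.8, 0.5   # Pareto shape and diminishing-returns exponent (assumed)
n, T = 1000, 1.0         # number of explored projects, total exploitation time (assumed)

def optimal_allocation(v, budget, iters=200):
    """Equalize marginal utility v_i * f'(t_i) across projects by binary-searching
    the common marginal utility lam, subject to sum(t_i) = budget."""
    lo, hi = 1e-12, 1e12
    for _ in range(iters):
        lam = np.sqrt(lo * hi)
        # f(t) = t**beta  =>  v_i * beta * t_i**(beta - 1) = lam
        t = (lam / (beta * v)) ** (1.0 / (beta - 1.0))
        if t.sum() > budget:
            lo = lam     # allocations too large, so demand a higher marginal utility
        else:
            hi = lam
    return t

ks = np.arange(1, 51)
utilities = []
for k in ks:
    v = (n / np.arange(1, k + 1)) ** (1.0 / alpha)  # EVs of the k best projects (proposition 2)
    t = optimal_allocation(v, T)
    utilities.append(np.sum(v * t ** beta))

plt.plot(ks, utilities)
plt.xlabel("k (number of projects worked on)")
plt.ylabel("total expected utility (arbitrary units)")
plt.show()
```
As the bullet above notes, the resulting curve is non-decreasing in $k$, since unpromising projects simply receive (near-)zero time.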
A2.1
For the mathematical details behind argument 1, see A1.1.2.
Are the impact-distribution-across-projects (abbr. IDAP) and the impact-distribution-across-scenarios (abbr. IDAS) orthogonal?
- (This part isn’t that maths-heavy, but I put it in the appendix anyway.)
- Imagine that someone tells you an idea for a project $P$. Then:
- Initially (before doing any evaluation), you have no concrete idea about how impactful $P$ is. At this stage, even if a prophet tells you what the future will be like, you still don’t know how impactful $P$ will be in that scenario. All you have is a generic prior distribution over a project’s impact, which is the same for any project. Let’s call this distribution $D_0$.
- After a simple back-of-the-envelope calculation you get a sense of the size of $P$’s impact. Now, given a specific future scenario, you can in principle tell how impactful $P$ will be in that scenario. However, you don’t know which scenario will actually come about. Let’s call this version of $P$’s impact distribution $D_1$.
- Later you somehow learn how to predict the future, and you can finally say with certainty how much impact $P$ will have. Let’s call this scalar value $v$.
- There is a continuum between $D_0$ (full uncertainty) and $v$ (zero uncertainty), and $D_1$ lies somewhere on this continuum. Note that the definition of $D_1$ is rather vague, so “where $D_1$ lies” isn’t clear-cut.
- IDAP considers the distribution, across all possible projects $P$, of the value $\mathbb{E}[D_1]$ (the expected impact of $P$ after a rough evaluation).
- Therefore, IDAP focuses on the journey from $D_0$ to $D_1$, and ignores the journey from $D_1$ to $v$.
- IDAS considers $D_1$ itself, or in other words, the distribution, across all scenarios, of the value $v$.
- Therefore, IDAS focuses on the journey from $D_1$ to $v$, and ignores the journey from $D_0$ to $D_1$.
- From this we can see that IDAP and IDAS cover disjoint domains, and therefore they are orthogonal.
- Note that this relies on the assumption that IDAP and IDAS use the same $D_1$ - after all, there can be different choices of where to put $D_1$ on the continuum, and it’s possible that IDAP and IDAS choose different locations.
Do anti-wager modifications undermine argument 2?
- Risk aversion from diminishing returns:
- We consider the inverse of the tail distribution of Pareto, which is $Q(p) = x_m\,p^{-1/\alpha}$ (see A1.2.2). Let $U_\eta$ denote our isoelastic utility function, with relative risk aversion $\eta$.
- When $\eta < 1$, $U_\eta(Q(p)) = \frac{(x_m\,p^{-1/\alpha})^{1 - \eta} - 1}{1 - \eta}$ (a power of $1/p$), so we get a distribution which is Pareto in the tail, with parameter $\frac{\alpha}{1 - \eta}$.
- When $\eta = 1$ ($U_\eta(x) = \ln x$), the transformed tail decays exponentially, which apparently is far from Pareto.
- When $\eta > 1$, $U_\eta$ has an upper bound $\frac{1}{\eta - 1}$, and so is far from Pareto (which should be unbounded).
- Risk aversion by discounting extreme outcomes:
- Here we adopt $w(p) = p^{\gamma}$ (with $\gamma \ge 1$) as the risk function in RDEU.
- Again we consider $Q(p) = x_m\,p^{-1/\alpha}$.
- Then the “weighted” version of $Q$ equals $Q(p)\,w'(p) = \gamma\,x_m\,p^{\gamma - 1 - 1/\alpha}$. For this to be a valid “inverse tail distribution” of some Pareto distribution, $\gamma - 1 - 1/\alpha < 0$ has to hold, and so $\gamma < 1 + 1/\alpha$ is the condition for Pareto-ness being preserved.
- Infinite mean of the original distribution implies $\alpha < 1$ (actually it should be $\alpha \le 1$, but the corner case $\alpha = 1$ is unlikely so we unrigorously ignore it), and so $1 + 1/\alpha > 2$. Therefore when $\gamma \le 2$ (that is, when outcomes are discounted at most in proportion to their rank), Pareto-ness is preserved.
A2.2
The expression for the improvement ratio $u$, assuming we had only evaluated project A:
- The impact distributions of both projects have identical parameters $x_m$ and $\alpha$.
- For convenience, we don’t directly consider the PDF of a distribution, but consider the inverse of its tail distribution. That is, we consider the function $Q(p)$, meaning the value of the distribution’s $(1-p)$-quantile.
- If we plot $Q(p)$ for $p \in [0, 1]$, then the $[0, 1]$ interval of the horizontal axis stands for the spectrum of all possible scenarios; every point in this interval stands for a scenario, and the height of the plotted curve above that point stands for the impact of our project in that scenario.
- Recall that the tail distribution of Pareto is $(x_m/x)^{\alpha}$, and so the function $Q(p) = x_m\,p^{-1/\alpha}$ is its inverse.
- The EV of each project is $\int_0^1 Q(p)\,\mathrm{d}p$, which equals $\frac{\alpha}{\alpha - 1}x_m$ when $\alpha > 1$, and $+\infty$ when $\alpha \le 1$.
- Here we define $\bar{x} := \frac{\alpha}{\alpha - 1}x_m$, the expected impact of a single project (assuming $\alpha > 1$).
- We work on project A only if its actual impact is higher than B’s expected impact, and so in every scenario our eventual impact equals max{A’s actual impact, B’s expected impact}, the latter being $\bar{x}$ when $\alpha > 1$.
- So our expected impact is $\int_0^1 \max\{Q(p), \bar{x}\}\,\mathrm{d}p = \bar{x}\left[\left(\frac{\alpha - 1}{\alpha}\right)^{\alpha - 1} + 1 - \left(\frac{\alpha - 1}{\alpha}\right)^{\alpha}\right]$.
- Therefore $u = \left(\frac{\alpha - 1}{\alpha}\right)^{\alpha - 1} - \left(\frac{\alpha - 1}{\alpha}\right)^{\alpha} = \frac{1}{\alpha}\left(\frac{\alpha - 1}{\alpha}\right)^{\alpha - 1}$.
The expression for the improvement ratio $u$, assuming we had evaluated both projects:
- In every scenario our eventual impact equals max{A’s actual impact, B’s actual impact}.
- By proposition 1, the expected value of max{A, B} approximately equals $2^{1/\alpha}$ times the EV of a single project. Therefore $u \approx 2^{1/\alpha} - 1$.
Why is it that “we should be willing to do such prioritization as long as it takes less than $\frac{u}{1+u}$ of our time”?
- Just verify that $t \cdot (1 + u)\,q = 1 \cdot q$ when $t = 1 - \frac{u}{1+u} = \frac{1}{1+u}$ (here $t$ is the time left for direct work, and $q$ is the quality of the project that we directly work on), which means “do zero prioritization” and “spend $\frac{u}{1+u}$ of our time on prioritization” are equally good in terms of utility.
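A quick Monte Carlo sanity check of the two expressions for $u$ above (not part of the original text; $x_m = 1$ and $\alpha = 3$ are assumed values, with $\alpha > 1$ so that expected values are finite):
```python
# Monte Carlo sanity check of the improvement ratios derived above.
# x_m and alpha are assumed values (alpha > 1 so that expected values are finite).
import numpy as np

rng = np.random.default_rng(0)
x_m, alpha, trials = 1.0, 3.0, 2_000_000

a = x_m * rng.uniform(size=trials) ** (-1.0 / alpha)   # A's actual impact across scenarios
b = x_m * rng.uniform(size=trials) ** (-1.0 / alpha)   # B's actual impact across scenarios
ev = alpha * x_m / (alpha - 1.0)                        # expected impact of a single project

u_one = np.maximum(a, ev).mean() / ev - 1.0             # only project A evaluated
u_both = np.maximum(a, b).mean() / ev - 1.0             # both projects evaluated

closed_form = ((alpha - 1.0) / alpha) ** (alpha - 1.0) / alpha
print(f"u (A only):  empirical {u_one:.3f}   closed form {closed_form:.3f}")
# Note: proposition 1 is a large-n approximation, so for n = 2 the match is only rough.
print(f"u (A and B): empirical {u_both:.3f}   approx 2**(1/alpha) - 1 = {2 ** (1 / alpha) - 1:.3f}")
```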
A2.3
First, the independent case.
The expression for the improvement ratio $u$, assuming we had evaluated all $k$ projects:
- In every scenario our eventual impact equals $\max\{X_1, \ldots, X_k\}$, the best of the $k$ projects’ actual impacts.
- By proposition 1, the expected value of $\max\{X_1, \ldots, X_k\}$ approximately equals $k^{1/\alpha}$ times the EV of a single project. Therefore $u \approx k^{1/\alpha} - 1$.
Then, the inter-dependent case.
The white-box approach:
- Note that the paper examines the distribution of actualized revenue across startups, not a particular startup’s distribution of revenue across scenarios. However, I think these distributions will look similar, given two conditions:
- the revenue-distribution-across-scenarios is roughly the same for each startup, and
- the actualized-revenue-distribution-across-startups is roughly the same in each scenario.
- The reason is this: imagine an $n \times m$ matrix where each row represents a startup and each column represents a scenario, and the $(i, j)$ entry in this matrix is the revenue of startup $i$ under scenario $j$. Now the two conditions guarantee that different rows contain the same bag $R$ of values, and different columns contain the same bag $C$ of values. Aggregating the bags of different rows should result in the same bag (the bag containing all values in the matrix) as aggregating the bags of different columns. So we have $n \otimes R = m \otimes C$ (where $t \otimes B$ denotes aggregating $t$ copies of the bag $B$), and therefore $R$ and $C$ are “proportional”.
- Intuitively, you can think of this as a “veil of ignorance”: everyone has equal chances of success, so you have no idea which part of the revenue distribution you’ll be in, before the “veil of ignorance” rises; therefore, your startup’s revenue distribution across scenarios will look similar to the revenue distribution across all startups.
- Do the two conditions actually hold?
- I can see some reason for the first condition to hold (namely, if startup idea A has a better ex ante revenue-distribution-across-scenarios than idea B, then idea B won’t be pursued by startup founders), but I’m rather skeptical about this.
- The second condition seems largely true, as random factors at most change how prosperous an industry is, but are unlikely to decisively change the performance of the whole economy[27].
The black-box approach:
- Reversed Clayton copula: $P(U_1 \ge u_1, \ldots, U_k \ge u_k) = \big((1 - u_1)^{-\theta} + \cdots + (1 - u_k)^{-\theta} - (k - 1)\big)^{-1/\theta}$, where $U_1, \ldots, U_k$ are “normalized” random variables following a uniform distribution on $[0, 1]$, and “$\ge$” means “greater or equal to, in every dimension”.
- Estimation of the copula parameter $\theta$:
- Combining the reversed Clayton copula with the Pareto CDF (assuming $x_m = 1$, and writing $\theta$ for the copula parameter), we get the reversed joint CDF of our $k$ projects: $P(X_1 \ge x_1, \ldots, X_k \ge x_k) = \big(x_1^{\alpha\theta} + \cdots + x_k^{\alpha\theta} - (k - 1)\big)^{-1/\theta}$, where $X_1, \ldots, X_k$ denote the impact of each project.
- R Code for the scatter plot (a Python sketch of a similar plot is given after this list):
- To get the expected maximum $E[\max_i X_i]$ from the reversed joint CDF, we need to apply the inclusion-exclusion principle:
- $P(\max_i X_i > x) = \sum_{j=1}^{k} (-1)^{j+1}\binom{k}{j}\,P(X_1 > x, \ldots, X_j > x)$, which we integrate over $x$ to get the expected maximum; the resulting alternating sum can be put into closed form (from lemma 1, where $\binom{\cdot}{\cdot}$ is a generalized binomial coefficient)
- (source)
- We assume the impacts $x_1, \ldots, x_k$ are sufficiently large, because in Pareto distributions we focus on the tail.
- Note that this integral would result in $+\infty$, so this isn’t rigorous. The rigorous way should be to let the upper end of the integration range (in quantile terms) approach 1 from below, instead of just setting it to 1.
- On the other hand, if $X_1, \ldots, X_k$ are mutually independent, the same calculation goes through with $P(X_1 > x, \ldots, X_j > x) = \prod_{i=1}^{j} P(X_i > x)$.
- Therefore we can compare the expected maximum under dependence with the expected maximum under independence. Recall that when projects are independent, $u \approx k^{1/\alpha} - 1$; when projects are strongly dependent, the boost from prioritization comes out much smaller (cf. Claim 6b).
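Here is the scatter-plot sketch referenced in the list above, written in Python rather than R, with assumed parameter values $\theta = 1$, $\alpha = 1$, $x_m = 1$. It samples pairs of project impacts whose marginals are Pareto and whose dependence follows a reversed Clayton copula (sampled via the Marshall-Olkin method), then plots them on log-log axes; large impacts of the two projects tend to occur together, reflecting the upper-tail dependence.
```python
# Scatter plot of two project impacts with Pareto marginals coupled by a reversed
# Clayton copula. theta, alpha, x_m are assumed values for illustration.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
theta, alpha, x_m, n_points = 1.0, 1.0, 1.0, 2000

# Sample a 2-dimensional Clayton copula via the Marshall-Olkin method:
# V ~ Gamma(1/theta), E_i ~ Exp(1), U_i = (1 + E_i/V)**(-1/theta).
v = rng.gamma(shape=1.0 / theta, scale=1.0, size=(n_points, 1))
e = rng.exponential(scale=1.0, size=(n_points, 2))
u = (1.0 + e / v) ** (-1.0 / theta)      # strongly dependent where the u's are small

# Small u corresponds to the upper tail of X, so (X1, X2) has the reversed
# (upper-tail-dependent) Clayton structure with Pareto(x_m, alpha) marginals.
x = x_m * u ** (-1.0 / alpha)

plt.scatter(x[:, 0], x[:, 1], s=4)
plt.xscale("log"); plt.yscale("log")
plt.xlabel("impact of project 1"); plt.ylabel("impact of project 2")
plt.show()
```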
Lemma 1: $\sum_{j=0}^{n} (-1)^{j}\binom{n}{j}\frac{x}{x - j} = 1\Big/\binom{n - x}{n}$, where $x$ is any real number (for which both sides are defined).
Proof:
- The key transformation is the “subset-of-a-subset identity” $\binom{x}{j}\binom{j}{i} = \binom{x}{i}\binom{x - i}{j - i}$. When $x$ is a non-negative integer, this transformation has an easy combinatorial proof. To extend this to the real case, we only need to observe that $\binom{x}{j}\binom{j}{i}$ and $\binom{x}{i}\binom{x - i}{j - i}$ are both polynomials of $x$ (of a fixed degree $j$), and the fact that they are equal for non-negative integer values of $x$ automatically implies that they are, in fact, identical polynomials.
- In the proof we substitute $\binom{x}{j}\binom{j}{i}$ with $\binom{x}{i}\binom{x - i}{j - i}$; the remaining sums are then evaluated with the generalized binomial theorem (applied twice).
- QED
[1] In the current situation, the EA community as a whole seems to allocate approximately 9~12% of its resources to cause prioritization. But there’s some nuance to this - see the last parts of section 1.3 .
[2] Meaning I assign 0.45 probability to [the statement being true (in the real world)], and 0.2 probability to [the statement being true (in the real world) and my model being mostly right about the reason].
[3] Which, by the way, is quite an unrealistic assumption. This assumption is also shared by claim 6b.
[4] By taking this factor into account, we’ve dealt with the unrealistic assumption (“we are able to identify the ex-post best project”) in claim 6a and 6b.
[5] Including, for example, providing opportunities for individuals to test fit.
[6] Conditional on the heavy-tailedness comparison here being meaningful, which isn’t obvious. Same for similar comparisons elsewhere in this document.
[7] Subject to caveats about Pascal’s wager. See section 2.1 .
[8] A more realistic model might be a hierarchical one, where projects have sub-projects and sub-sub-projects, etc., and you need to do some amount of prioritization at every level of the hierarchy.
[9] Meaning I assign ≥0.8 probability to [the statement being true], and ≥0.8 probability to [the statement being true and my model being mostly right about the reason].
[10] For any technology X, assume that a constant amount of resources are spent each year on developing X. If we observe an exponential increase in X’s efficiency, we can infer that it always takes a constant amount of resources to double X’s efficiency, regardless of its current efficiency - which points to a Pareto distribution. This “amount of work needed to double the efficiency” may have been slowly increasing in the case of integrated circuits, but far slower than what a log-normal distribution would predict.
[11] Here the diminishing returns mean “saving 10^8 lives is less than 10^5 times as good as saving 10^3 lives, not because we’re scope-insensitive but because we’re risk averse and 10^8 is usually much more speculative and thus riskier than 10^3.”
[12] Isoelastic utility functions are good representatives of the broader class of HARA utility functions, which, according to Wikipedia, is “the most general class of utility functions that are usually used in practice”.
[13] Under the “fundamental assumptions” or “sense check with intuition” approach, the true distribution has finite mean, but the Pareto distribution used for approximating the true distribution has infinite mean. Under the “heuristics” approach, the true distribution itself has infinite mean.
[14] u is defined in section 2.2; it stands for the extra gain in “how good is the project that we work on” resulting from prioritization, compared to working only on an arbitrary project. For example, if prioritization increases the project quality from 1 DALY/$ to 2 DALY/$, then u=100%=1.0 .
[15] See the “α” column of the “Revenue” rows in table 1 of the paper.
[16] Copulas are used for modeling the dependence structure between multiple random variables. A reversed Clayton copula is a copula that shows stronger correlation when the variables take larger values. Mathematical knowledge about the (reversed) Clayton copula (and about copulas in general) isn’t needed for reading this section.
[17] This is a very crude guess, and my 90% confidence interval will likely be very, very wide.
[18] Note that by choosing r=⅓ I’m underestimating (to a rather small extent) the strength of correlation. I’ll briefly revisit this in a later footnote.
[19] Recall that we underestimated the strength of correlation by choosing r=⅓, so here u=(log k)-1 is an overestimation of the boost from prioritization, though I think the extent of overestimation is rather small.
[20] Reversed because it’s negatively correlated with the importance of E&P.
[21] For GPT-3, 12% of compute is spent on training smaller models than the final 175B-parameter one, according to table D.1 of the GPT-3 paper, though it’s unclear whether that 12% is used for exploration/comparison, or simply checking for potential problems. Google’s T5 adopted a similar approach of experimenting on smaller models, and they made it clear that those experiments were used to explore and compare model designs, including network architectures. Based on the data in the paper I estimate that 10%-30% of total compute is spent on those experiments, with high uncertainty.
[22] I’m not counting referral fees into E&P, since they’re usually charged on the lawyer’s side while I’m mainly examining the client’s willingness to pay. Plus, it’s unclear what portion of clients use referral services, and how much referral services help improve the competence of the lawyer that you find.
[23] This is based on a simple ballpark estimate, and so I don’t provide details here.
[24] Including, for example, providing opportunities for individuals to test fit.
[25] The extent to which to prioritize promising individuals, is often discussed under the title of “elitism”. Also here’s some related research.
[26] Including, for example, providing opportunities for individuals to test fit.
[27] I know little about macroeconomics, and this claim is of rather low confidence.