First Version: January 2014.
Current Version: April 2014
Impact Data for AFA Members
This document describes a protocol for creating and updating quantitative, objective information about leading researchers in the AFA. We want to create a basic data set that is reasonably representative and that we can update periodically with reasonable effort. The data will not be perfect: it will be better than nothing and better than anecdotal impressions, but it will not be perfect.
The intent of this data-collection protocol is to make it straightforward for anyone to adopt it and refill or update the spreadsheet in future years. The protocol should also make it easy for an AFA committee to add or correct entries in a way that keeps individuals comparable.
The spreadsheet is open for maximum transparency. However, it is important for the reader to understand its shortcomings before looking at the spreadsheet itself. Therefore, the URL is not given prominently but is embedded later in this document.
The spreadsheet should provide data that aids AFA committees in selecting members for committees, awards, honors, etc. There is no mandate that it be used for this purpose.
For AFA committees using this spreadsheet, it is important to emphasize that it is an auxiliary document: it is intended to help inform decisions, not to make them.
For example, citations (much less citation counts) are not all that any organization should use to determine awards, etc. However, having objective information at hand when making choices is not only good practice, but also a defense against nepotism and against accusations that the AFA is becoming a non-meritocratic club in which it is more important to have a nice pedigree and know certain people than to be a great scholar.
The starting point is a spreadsheet in which the AFA member number is the primary unique key. Because we cannot publish this key but need a reference id for communicating with the public, it is supplemented with a secondary unique key based on the last name of the author.
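As an illustration only (the tie-breaking rule and function name are our reading of the scheme described above, not part of the protocol), the secondary key could be generated like this:

```python
from collections import defaultdict

def assign_uids(members):
    """Assign a public UID to each member: the lowercased last name,
    with a numeric suffix when several members share that name.
    `members` is a list of dicts with a private 'member_id' and a 'last_name'."""
    counts = defaultdict(int)
    uids = {}
    for m in members:
        base = m["last_name"].lower()
        counts[base] += 1
        # The first holder of a name gets the bare name; later ones get a suffix.
        uids[m["member_id"]] = base if counts[base] == 1 else f"{base}{counts[base]}"
    return uids

members = [
    {"member_id": 101, "last_name": "Chen"},
    {"member_id": 102, "last_name": "Harris"},
    {"member_id": 103, "last_name": "Chen"},
]
print(assign_uids(members))  # {101: 'chen', 102: 'harris', 103: 'chen2'}
```

The private member number never appears in the public key, which is the point of the two-key design.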
Inclusion in these rankings (and thus a greater likelihood of appointment to AFA roles) is a benefit of AFA membership. We may, at our discretion, in rare cases, supplement this list with authors who are highly ranked elsewhere or who are suggested to us.
We then collect the information for each of the several thousand persons on this list according to our protocol.
We are constrained in our ability to construct good ratings. We essentially employ a two-pass approach. In the first pass, we use Google Scholar to select the top-impact 200 to 500 members of the AFA. This is reasonably quick. Our primary variable is a broad cut on Gscholar.top5, which contains the sum of the cites for the top 5 papers. In the second pass, we use the ISI Web of Science data to obtain more detailed citation counts, but only for the top-impact scholars plus some selected others. Google Scholar tends to be more accurate for young papers, while ISI tends to be more accurate for old papers.
Our quasi-third pass is error correction.
We expect the database to become better with each iteration. Though unreliable in 2014, we expect it to become quite reliable in its second or third implementation, around 2016 or 2018.
UID: the user id; usually the last name of the scholar, followed by a number if another AFA member has the same name.
Notes: issues and corrections.
URL: the Google Scholar URL.
entered: the date on which we entered the data.
Gscholar: the total number of Google Scholar search results.
Gscholar5yr: the same, but only for the last 5 years.
Gscholar.top5: the sum of the cites of the 5 articles with the highest cite counts. This is more likely to be accurate, because we can look at the titles.
ISI.date: the date on which the ISI information was entered.
ISI.initial: the initials of the researcher used in the ISI search.
ISI.numpprs: the number of papers that ISI reported for the author.
ISI.cites: ISI citations from the citation report, including self-cites.
ISI.noself: ISI citations from the citation report, excluding self-cites.
JF: the number of JF publications, from a spreadsheet.
Phd: the year of the PhD.
Other fields are internal to the AFA. All of the above fields are quasi-public, in that they can be hand-collected by anyone with an interest in the subject.
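For concreteness, here is a sketch of one spreadsheet row using the quasi-public field names above; the column order and the sample values are illustrative assumptions, not prescribed by the protocol:

```python
import csv
import io

# The quasi-public fields described above, as a CSV header plus one sample row.
FIELDS = ["UID", "Notes", "URL", "entered", "Gscholar", "Gscholar5yr",
          "Gscholar.top5", "ISI.date", "ISI.initial", "ISI.numpprs",
          "ISI.cites", "ISI.noself", "JF", "Phd"]

# A partially filled row: fields not yet collected are simply left blank.
row = {"UID": "welch", "entered": "2014-01-10", "Gscholar.top5": 10500}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(row)
print(buf.getvalue())
```

Leaving uncollected fields blank (rather than zero) keeps "not yet looked up" distinguishable from "looked up and found no cites."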
The intent of the first pass is to narrow down the set of members for whom we try to produce more accurate statistics (from other sources). The target fraction for detailed collection is 10%. It is essential that the first pass be quick and easy.
The principal first-pass source is Google Scholar, http://scholar.google.com/ .
1. If the author has a dedicated Scholar page, this is best and easiest.
An example is that of Ivo Welch.
Enter today's date when entering the information (in my example, January 10, 2014).
From this page, take the number of citations (both all-time and since 2009).
Then add the cite counts from the top 5 papers, rounded to the nearest hundred, into Gscholar.top5. In Ivo Welch's case, this is about 10,500. These are specific papers: look at the titles to make it more likely that these are finance and economics papers that the person has published.
2. If the author does not have a dedicated Scholar page, this is more problematic.
Start with the Google Scholar page. Enter the name of the person, the word “finance”, and the institution where the person is working now. For example, I am going to use “Hui Chen, finance, MIT”. (This maps into the URL http://scholar.google.com/scholar?hl=en&q=hui+chen+finance+MIT&btnG=&as_sdt=1%2C5 .)
We deliberately exclude patents on the Scholar home page. Also change the settings from 10 to 20 results.
We believe that we can only obtain Gscholar.top5 with reasonable accuracy. So now we look on the first page for the 5 articles that are obviously finance-related. For example, there is an article on molecularly engineered microspheres for a Hui Chen; this is obviously not the person we are looking for. In my example, I see one paper with a count of 120, one with 44, one with 26, and one with 33. The numbers are now so low that, at the rounding level of 100s, we are done. This comes to 200.
When individuals obviously have fewer than 100 cites, do not bother further; just record 0.
In both cases, single-digit accuracy is not important. We are collecting the information over a span of a few weeks, and some individuals may not have any cites. It will not be perfect; it cannot be perfect. We can only try to do a good job and fix things as we notice errors.
The main goal of this pass is to have a Gscholar.top5 value available for everyone.
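The rounding rule used in the two cases above (sum the top-5 cite counts, round to the nearest hundred, and record anything under 100 cites as 0) can be sketched as follows; the function name is ours:

```python
def gscholar_top5(cite_counts):
    """Sum the cite counts of the (up to) five most-cited papers
    and round to the nearest hundred; below 100 total, just record 0."""
    top5 = sorted(cite_counts, reverse=True)[:5]
    total = sum(top5)
    if total < 100:
        return 0
    return round(total / 100) * 100

# The Hui Chen example from the text: 120 + 44 + 26 + 33 = 223, rounded to 200.
print(gscholar_top5([120, 44, 26, 33]))  # 200
print(gscholar_top5([40, 30, 20]))       # 0
```

At this rounding level, missing a sixth paper with a handful of cites cannot change the recorded number, which is why single-digit accuracy does not matter.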
Our second pass focuses on the 200-500 most highly ranked individuals, with some extra bias towards including younger PhDs (whose PhD was awarded between 5 and 20 years ago), minorities/women, and unusual individuals (e.g., by request). Note that we are collecting considerably more individuals than are likely to be selected for any AFA committees. Thus, an omission at this low a level is unlikely to have much impact on AFA decisions.
The best (though still mediocre) source for citation counts is still the ISI Thomson Web of Knowledge. We collect it only for authors who ranked highly in the first pass.
First look for the tab “Web of Science.”
Choose the Web of Science database in the database tab.
Change the years to start in 1970. We are interested only in the Social Sciences Citation Index, 1900-present. If you can, save this setting permanently. Then start with “Author Search”, just below the middle field. The direct link seems to be
Type in the name of the person: last name exact, first name with a star if you do not know the middle initial (some researchers include it and some exclude it).
Note: if the scholar has a common last name, say Larry Harris, we will need the middle initial to get an accurate citation report. If you are not sure whether the last name is common, search the last name and first initial first and check for noise: are most of the papers in business/economics categories? Are there papers by an author who is obviously not the person we are looking for?
Take Larry Harris as an example. I first searched “Harris, L”, but there was obviously too much noise in the result list. In the “Web of Science categories” panel on the left, the top categories were psychology rather than business or economics (ideally, business and/or economics would be the only categories in this column), which means that many scholars in other fields are also named “Harris, L” but are not who we are looking for. Moreover, the list contains author names like “Harris, Lloyd” and “Harris, Lindsay N”, who are not Larry Harris. In this case, we definitely need a middle initial to get an accurate number.
So I searched for Larry Harris on Google to get his full name, “Lawrence E. Harris”, and then searched “Harris, LE” on Web of Science. This time there was much less noise, and, as instructed, I refined the results to the business finance and economics categories. Now all the results on the list are Lawrence E. Harris, and we can generate the citation report.
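The query strings in the examples above ("Harris, L", "Harris, LE", "Welch, I*") follow a simple pattern: last name, a comma, the first initial, and either the middle initial when known or a wildcard star when not. A small helper, purely for illustration (the exact query syntax accepted by Web of Science may differ), could look like:

```python
def isi_author_query(last_name, first_initial, middle_initial=None):
    """Build an author-search string in the style of the examples above:
    'Harris,LE' when the middle initial is known, 'Welch,I*' when it is not."""
    initials = first_initial.upper()
    if middle_initial:
        initials += middle_initial.upper()
    else:
        initials += "*"  # wildcard: some researchers include the middle initial, some do not
    return f"{last_name},{initials}"

print(isi_author_query("Harris", "L", "E"))  # Harris,LE
print(isi_author_query("Welch", "I"))        # Welch,I*
```

Recording the exact initials used (the ISI.initial field) lets the next collector reproduce the same search.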
The first example is again Ivo Welch. In my example, I use “Welch” as the last name and “I*” as the first name. Click continue to the research domain. Now pick only “Social Sciences - Business Economics”. Click Finish Search. (You do not know all the institutions where this person was before.) Look at the first 10 or so papers. Is there anything that is obviously not the same person? If so, use the checkbox to remove that specific article. If we have the top-10 list of articles right, we typically have a pretty good count on the mid right. So now we can record the number of results (here 34, into ISI.numpprs) and the number of citations without self-citations (here 4309, into ISI.noself).
The second example is Hui Chen. (This is a good example, because Chen is such a common name. We would probably not need to do this, because Hui Chen may not appear in the top-500 individuals, or in the top-100 individuals with PhDs 5-15 years out on the basis of citations.) In my example, I use “Chen” as the last name and “H*” as the first name. Click continue to the research domain. The categories have been changing; we want to include every category that looks sort of like “Business,” “Economics,” or “Finance.” Click Finish Search. (You do not know all the institutions where this person was before.)
Click “Create Citation Report” which will appear on the right a few lines down from the top within a few seconds.
Now we need some intelligence. There are obviously many Chens who are not Hui Chen from MIT. Unfortunately, this database does not give full first names. So go back to the Google author page and recall the top 5 papers. The top one in this case is “A unified theory of Tobin's q, corporate investment, financing, and risk management”. Does it look like you can create a reasonable list here? If not, note it.
In this particular case, we need a third pass: going to Hui Chen's list of publications on his own web page and doing this in detail. Obviously, this will not take the usual 5 minutes per author, but more like 30 minutes. The citations are obviously below 100 here, so we can just note this, too. The only aspect we would really collect for Hui Chen would then be the number of JF publications.
An HTML file is created for the top 50 AFA members on our spreadsheet. The HTML link's purpose is to detail the statistics shown on the spreadsheet.
Create small dossiers for each author. The dossiers explain where and how we got the numbers shown on the spreadsheet.
Repeat the first pass and
take a screenshot of the author's Google Scholar page: either the author's dedicated Scholar page or, if the author does not have one, a search for the author's name along with “finance” and/or the institution where the person is currently working.
Repeat the second pass and
take a screenshot of the author's Web of Science citation report page.
Note that it is possible to use very different methods to collect ISI data. We do a simple “basic author search.” An example of a different search is a “cited reference search” (a small arrow), which deals better with author misspellings, cites to working papers, etc. (The cited-ref-search number is typically about ⅓ higher than the basic-author-search number we report.) However, the cited reference search is modestly more complex, so we stick to the basic author search. Importantly, one must not mix methods.
Researchers with more than 1 name (e.g., name change) are problematic.
Researchers without a Google Scholar page are more problematic in the first pass and in the ISI search.
Editors (like Cam Harvey) can have editorials etc. that are overcounted.
Common names are a problem. Some researchers even share the same names, down to the initials (e.g., Kevin J Murphy). We need to whittle out errors, and make correction notes when brought to our attention.
Our goal is to avoid gross errors. A good way to think of these cites is that they are most useful when one relies on a rounded natural log of the cite counts. Any metric that attempts to read more into these counts is somewhat silly.
We rely on authors to help point out errors. Note that we require such corrections to acknowledge that the author has read this document first. (We do not want to receive dozens of emails pointing out the differences between cited-ref and basic-author ISI searches.)
The data will become more accurate with help. The first versions are expected to be buggier than later versions.
We expect authors to suggest corrections primarily for themselves, though we also appreciate suggestions for changes regarding other authors. It is particularly helpful if a correction comes with the suggested replacement number. Note that as the spreadsheet ages, corrections need to be (at least grossly) adjusted for the elapsed time.
We collect these data over the duration of a few weeks, and they are updated only every other year. We want to keep numbers comparable. It makes no sense to update a few authors every few months, and not to update the others.
We do not correct quoted statistics upward when the difference between our number and the current number is less than 10%. The spreadsheet does not have that level of accuracy anyway.
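Combining this 10% threshold with the overcount rule below it (downward corrections are always accepted, as in the Cam Harvey example), the decision can be sketched as follows; the function is our illustration, not official policy:

```python
def should_correct(recorded, suggested):
    """Apply a suggested correction only if it lowers the recorded number
    (overcounts are always fixed), or raises it by at least 10%."""
    if suggested < recorded:
        return True  # overcounted statistics are corrected regardless of size
    if recorded == 0:
        return suggested > 0  # any cites at all beat a recorded zero
    return (suggested - recorded) / recorded >= 0.10

print(should_correct(4300, 4500))  # +4.6%: below threshold -> False
print(should_correct(4300, 5000))  # +16%: above threshold -> True
print(should_correct(4300, 4000))  # overcount -> True
```

The asymmetry is deliberate: inflated numbers damage credibility, while small undercounts are within the spreadsheet's noise.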
We can correct when an author points out that his/her statistic was overcounted. For example, Cam Harvey pointed out that we should not count his editorials as publications.
We promise to appreciate suggestions, acknowledge receipt, and look at suggested corrections (eventually). We do not have the time to follow up on every suggestion and/or to provide feedback to requests for corrections. We simply try to do our reasonable best with the limited resources we have. (Volunteers?)
When we make corrections, we need to log the additional changes in the spreadsheet and keep a second file (corrections) containing notes on what to look for in the next revision. We should not have to rely on members to point out the same error every 2-3 years.
PS: We need one or two large monitors (possibly also computers), so that the spreadsheet is visible at the same time that the collectors look at the data. This will save a lot of time.
We should be able to do the first pass for our 4,000 members in about 100 hours, i.e., about 1.5 minutes per author on average. Most of this has to go very quickly, because most people will not have (m)any cites. If after a few hours we find that we cannot do this quickly, then we may need to limit ourselves to authors at recognizable US universities. (We would then have to rely more on including, after the first collection, individuals whom our committees remember may have been omitted and/or who were on the SSRN top finance researcher list.)
When pass 1 is done, we can move on to our list of the top 200 or top 500 authors for whom we want to get more information in pass 2.
The first time, we need to fill in PhD years. I hope we can do this for the top 500, using the OSU directory (with web search as a fallback), within about 10 hours.
Time estimates for the second pass vary by author, ranging from 2 to 15 minutes each. If the author has a common name, it takes longer to 1) search for his or her middle initial on Google and 2) distinguish his or her articles from other articles with similar author names.
In the future, since some of the necessary author initials will already have been collected, the time estimates should be shorter.
We could keep more detailed statistics for a smaller number of individuals that are eligible for and in consideration of particular roles, e.g., membership in the nomination committee.
The spreadsheet is available at ….
The first time we do this, we must collect additional information, so start with the top 200. We need to find the year of graduation from a PhD program where we can. A good starting source is http://fisher.osu.edu/fin/findir/ . Go through this list and fill in the year for all individuals listed there. For individuals still left empty who are in your top-200 list, see if you can go to their website and find the year of their PhD. Fortunately, this will only have to be done once.