The case for Open Citation Data, and why it is needed now more than ever, becomes clear when we examine why researchers began to publish in journals and how the current trend of applied metrics has led to a ‘publish or perish’ mentality. How citation or reference lists are used to attribute preceding work in these journal publications is key. It is important to note the lack of sentiment in those attributions, and why the currently accessible form of citation data is inadequate for judging the worth of a given research paper, especially amid the glut of published material, even for an expert reader.
I shall pull my examples and context primarily from scientific publishing, but many of the arguments apply to what would normally be termed humanities subjects (although to group them as such is to do them a disservice).* One general difference, however, is that context and terminology are often the driving force behind references in humanities subjects, providing the reviewer or reader with the necessary conceptual environment in which to understand a work.
* One area that stands out is Law – the referral to authoritative documents and cases, with footnotes and citations, within legal articles uses more subject-specific conventions than many other areas. As legal references often include a lengthy footnote by the author, the sentiment and reason behind the reference are far less opaque than in other subjects.
Beginnings of (Scientific) research journals
Whilst the citing of prior works is an old and established convention, with recorded examples of it in practice from the time of Euclid and earlier, publishing the outputs of research publicly, early and regularly, is a relatively recent idea. The idea of a research journal can be traced back to the year 1665, when the “Philosophical Transactions of the Royal Society” was first published.
From Wikipedia:
“Among the earliest research journals was the Philosophical Transactions of the Royal Society in the 17th century. At that time, the act of publishing academic inquiry was controversial, and widely ridiculed. It was not at all unusual for a new discovery to be announced as an anagram, reserving priority for the discoverer, but indecipherable for anyone not in on the secret: both Isaac Newton and Leibniz used this approach.” [Ben: emphasis my own]
Even with this ‘public’ declaration of research, there were (and are) still disputes as to who came up with an idea or finding first – who ‘owns’ a given discovery. A good overview of the early handling of disputes (e.g. a “disputatio”) can be found in “disputatio, Defense of Tycho, The Sociology of Science, Philosophers at War” by Giuliano Pancaldi. This paper also has some fascinating anecdotes about the lengths to which early scientists went both to hide their research from peers and to retain ‘first claim’ on their breakthroughs.
The following section quoted from the above paper conveys how the process of “disputatio” was carried out:
“The rituals of the disputatio varied considerably, but the issue of the timing and the certification of results occasionally entered into the rules. For example, in the late fifteenth century the University of Bologna decided that theses submitted for public discussion among the lecturers (an event held at least twice yearly, the rector present) had to be deposited eight days in advance. Within fifteen days of the disputatio written conclusions would be made public and the reputation of the authors enlarged or diminished as a result.”
The participants had to offer arguments in a manner that made their ideas understood by their audience, noting that the public and academic reputation of those who failed was ‘diminished’. This description parallels the drive to publish, with peer-review and editing in place to maintain a level of quality in the arguments put forth. The sociologist Robert K. Merton outlines how the comparatively rapid pace of publishing in journals, relative to books, led to a reduction in the number of publicly debated disputes during the period when journals established themselves. The vast majority of disputes were over the right to claim ownership of the discoveries and inventions made. [See “The Sociology of Science” (University of Chicago Press, 1979) [wikipedia on the subject area] for more information about his sociological analysis of scientific publishing. Whilst I do not agree with the overall manner and findings of this work, it does have some illuminating quotes within it.]
This notion of ‘first-claim’ on ideas is still with us, although the conventions and etiquette about researching in certain areas have changed somewhat. James Watson, in his autobiographical book “The Double Helix” (Harvard Press, 1968) [wikip link], describes his impressions, the ‘human story’ from his point of view, of the events leading up to the discovery of the nature of DNA. He talks about the race for priority, about the need to be first, the ever-present rivalry and competition between Cavendish and Caltech and the difficulty of getting data from somewhat reluctant peers. It may also be notable in the context of the psychology of researchers that his account of the contributions made by Rosalind Franklin has been criticized, in that he did not give her adequate credit for her work. He also talks about the erosion of an ‘English’ sense of private domains for scientific research, again echoing the idea of claiming a research area as if it were territory.
Merton also quotes in his “Sociology of Science” from another researcher, Hans Gaffron, whose words I include here (as I cannot find an example to link to online):
“The student now, in 1970, finds it difficult to believe that, at least with many of us in the 1920s, there was never the thought of having to hurry, or of having to publish results prematurely and more than once lest they be overlooked or taken over in their entirety by somebody else. Even important discoveries were left for a year or two in the hands of a man with whom they originated so that he could develop them according to his means and abilities. We used to say: “An apple already bitten into is not very attractive.” The man who had the first bite was expected to keep and eat his apple. But then more and more people appeared on the scene who felt no compunction to bite into every apple within reach and then often drop it just as quickly. It was considered very bad manners, but they were the men of the future …”
“The phenomenal increase in the number of people whose work brings them in contact with scientific investigations has changed not only the image of the average scientist but also his motives and relationship with his colleagues. The latter are not fellows working in neighbouring fields – their fields – but all too often are direct competitors engaged in simultaneous, absolutely identical, experiments. Not only has the ruthlessness of accomplished business techniques invaded the areas where industrial exploitation overlaps research, but this kind of behaviour is no longer considered alien to science.”
from “The Sociology of Science”, Robert K. Merton, p327, and again, emphasis is my own.
I include these historical views to underline this desire, this drive to both stake out new areas of research for the researcher and also to be the first, to be seen to take the first bite, even if they were not the first to have the idea. This is one of the factors that have led us to this “Publish or Perish” motif of modern scientific endeavor, the pressure to publish quickly and often, or to sink below the outputs of your competitors.
Using published articles as the main tool to stake out a claim on a research area is still with us forty years after Hans Gaffron’s sentiments, and the ‘bad mannered’ behaviour is now commonplace and expected. The colloquial phrase ‘Publish or Perish’ mentioned above is well understood by academics, with the implication that the continual production of research papers in peer-reviewed journals is the main metric (if not the sole metric) by which their performance, reputation and quality are judged. The advent of electronic publishing has provided new ways for ideas and research to be disseminated, and this leads naturally to new ways for reputation and quality to be judged, although this is an area that has yet to mature. The key point is still that only the published papers in peer-reviewed journals are used by university staff, funders and fellow researchers to ascertain value. Alternate forms of publishing, such as publicly licensing and releasing data in a machine-readable form, blogging, participating in online discourse and so on, are still not used as a metric of reputation within institutions and peer groups.
One of the standard metrics by which weight and value are given to a paper is the simple (simplistic even) notion of ‘Cited By’ – the number of other peer-reviewed papers that cite this paper in their references. Unfortunately, this idea has become somewhat conflated with the idea that a citation is a mark of esteem, that the author or authors of a citing paper are endorsing the quality of the paper it cites. Whilst it is out of scope to suggest why quality is often perceived to vary linearly with the number of citations a paper has, some illumination may be gained by understanding the psychological drive to attain high scores in games. Many of the early computer games produced between the late 1970s and the 1990s – the ‘arcade’ years – had no other way to mark progression. You either beat your last score or you did not; there was no ‘win’ condition, no story to finish, no master plan to complete. A light-hearted but very illuminating talk was given at Playful 08 by Iain Tait on this emotional drive and it is well worth watching for a different perspective on it.
Attribution as the currency of research
Whilst the result of research journal publication was to help cut down on the number of disputes, the driving force was to let researchers claim ownership of ideas sooner, not to claim that they are the only ones capable of a discovery or idea, but that they were the first.
Attribution or citation from within a journal article is often the complementary reaction to the above idea – it is the author’s method of explicitly stating that the idea, data or discovery that they are borrowing is not their own and that they accept another author has claim on it. It is hoped that this generalisation – why an attribution is made – irks and provokes a response in the reader. I, too, share this discomfort and will elaborate on this later on.
However, the status quo is such that an author’s citation of a research paper is seen to imply that said author has read the cited work and to an extent relies on its findings in their own work and may even agree with the findings. I will attempt to illustrate further why I think this is a somewhat flawed state of affairs, even if we accept the common statistical tools that are often applied to this data to refine it, such as the h-index.
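For readers unfamiliar with it, the h-index is the largest number h such that an author has h papers with at least h citations each. A minimal sketch in Python follows; the citation counts are made up and the function name is my own, not taken from any citation service:

```python
def h_index(citation_counts):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical citation counts for two researchers (illustrative only).
steady = [12, 11, 10, 9, 8]
one_hit = [1000, 2, 1, 0, 0]

print(h_index(steady))   # 5
print(h_index(one_hit))  # 2
```

Note that the calculation only ever sees counts. Nothing in it can tell whether the thousand citations in the second list are endorsements or rebuttals.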
Many dissertations’ worth of research can be, and has been, done on the subject of why one work might be cited versus another, subjectively similar work. I wish to draw attention to one aspect that is only set to become more important as the rate of publishing and the size of the corpus increase – the use of a peer group in monitoring research trends.
Simply put, due to the wealth of papers published, the aforementioned publish-or-perish mentality and the sheer numbers of active researchers, personal interactions with a given peer group provide a key means to identify which papers to read. Personal trust in a peer often leads to professional trust and vice-versa. A researcher is more likely to read works by those they know or respect than by those they don’t. Likewise, works cited by their peers will likely figure highly in any list of works to check and analyse further – by word of mouth, so to speak.
So, why is this a key way in which new research is found and read? Why can’t a researcher just read all the suitable papers that are published in their field? Concisely, the field would have to be very niche for this to be possible. Far more often, there is a consensus around the following sentiment:
“Too much research is published in my field to humanly read”
What is meant here by the phrase “Too much”? There is more published of interest to a given researcher than is readable by a human being, especially if the reading is done in a thorough manner. There is never enough time in a day! As the following analysis will show, this is especially true now compared to the past and it is only set to get worse:
For details on the methods and precise analysis of growth rates of scientific publishing in the past century quoted below, please see “The rate of growth in scientific publication and the decline in coverage provided by Science Citation Index” Peder Olesen Larsen and Markus von Ins, Scientometrics. 2010 September; 84(3): 575–603.
- The rate of growth for scientific and social science publication has not slowed in the past 50 years.
- Publication is still increasing exponentially, at a rate that varies by subject.
- The rate of increase between 1907 and 2007 is on average 3.9% per year, with the corpus doubling in size roughly every 18 years. (NB: this only counts published articles, patents and proceedings – the recent trend for data production (genome, PDB, etc.) is not included.)
- The growth of the common citation datasets (SCI in this case) is at a much lower rate than the rate at which publications are being made; the SCI dataset’s proportional coverage is getting worse, year by year.
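The doubling time quoted above follows directly from the average growth rate, if we assume steady compound growth (a simplification, since the rate varies by subject). A quick check in Python, where the function names are my own:

```python
import math


def doubling_time(annual_growth_rate):
    """Years needed for a corpus to double under steady compound growth."""
    return math.log(2) / math.log(1 + annual_growth_rate)


def growth_factor(annual_growth_rate, years):
    """How many times larger the corpus is after the given number of years."""
    return (1 + annual_growth_rate) ** years


rate = 0.039  # the 3.9% average annual growth quoted for 1907 to 2007

print(round(doubling_time(rate), 1))    # about 18.1 years
print(round(growth_factor(rate, 100)))  # size after a century, relative to the start
```

Under the same assumption, a century of growth at this rate leaves the corpus several dozen times larger than it was at the start, which gives some sense of what a researcher today faces compared with one in 1907.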
It is left to the judgment of the reader, but it is hoped that these figures give some substance to the assertion that most if not all researchers are being overwhelmed by the rate at which research is currently being published. Consider the workload involved in reading every current peer-reviewed item which may advance or influence the direction of your research in a given field. Note that research is often advanced by unexpected means, by the collision of ideas and concepts that are uncommonly found together, so it might be foolish to try to prescribe the sources from which the next ground-breaking idea will come. A new discovery, observation or invention is a creative act, regardless of the subject in which it occurs.
Discovering Research: The visibility of modern research is more important than its overall quality.
To clarify, this is not saying that the quality of a piece of work and how visible it is are mutually exclusive. A highly visible work (such as an article published in Nature or a controversial or provocatively titled item) can be of the highest quality or of the lowest. It is however hoped that when a work is made highly visible by being published in a popular journal, the peer-review and editing process will ensure that the work is of the highest possible quality.
However, it should be apparent without further explanation that there are other ways for an article to get attention that is disproportionate to its worth. It is simply that the loudest work is likely to attract the most attention, as the normal channels by which we discover research are being overwhelmed. Assessing written citation lists of both citing and cited works, which could at one time be done by hand, is now too time-consuming and intractable a problem to handle with human eyes. We need to develop and test automated tools to aid the researcher, both to check the veracity of the citations and data used by a work and to do this in as widely sharable a way as possible. This type of checking is time-consuming and can only start to be tackled if the assessments, comments and views are shared rather than being created afresh by each researcher.
“Altmetrics” Manifesto: http://altmetrics.org/manifesto/
There is a drive for alternate means to find, filter, assess and value published work, and to introduce and place value on differing types of publications, attempting to rediscover what research publication can be in the 21st century. The manifesto of the altmetrics drive mirrors a number of the points I have already raised and raises many more of its own but these key statements are illuminating:
“No one can read everything. We rely on filters to make sense of the scholarly literature, but the narrow, traditional filters are being swamped.”
Is Citation broken?
Fundamentally, the point has been made that how we find, filter and assess work needs to improve and also that we need better means to gauge the accuracy of the data and the comments which have been cited. It has become difficult even to pinpoint what a field of research is outside of the context of a researcher’s aims – the notion of a boundary between subjects is much less apparent than it once was and so the broad vista of potentially related works becomes even broader.
We need computational aids to help ascertain which articles contain useful findings, useful data and useful methods and ideas, and also to validate which papers merit the increasingly precious time to read and thoughtfully analyse by hand.
Once works have been found and selected to analyse, there still remains the problem of checking whether the paper is solid and whether the facts or findings which are not generally established, or not made within the work itself, are cited and referenced properly. The logical nature of science means that a work often relies on the accuracy and ‘truth’ of a large catalogue of previous work which has not been undertaken by the authors themselves, but is cited within the work. It is the attempt to provide solutions for the above issues that leads to the requirement for Open Citation data. How to get it though?
“We should just scrape all the citation lists out and compute on them”…?
This sounds like an attractive solution, but even if we overlook the untested legal situation and notable financial issues – a scraper would require access to a form of the work in which the text can be copied, which may not be free – there remain some very serious and difficult problems in automatically extracting meaningful data from this.
Historically, citations have never been formatted to be easy for a machine to understand – the target seems to be the informed reader, one who has a great deal of knowledge about the field in question. Many journals have a house style, a rendering of the information by certain formatting rules. While the house style may derive from one of three or four main ‘schools’ of formatting, such as Harvard, Chicago and so on, each will have its own quirks and nuances which are difficult to code around; even recognising automatically which style a reference has been rendered in is hard. Not impossible, but the point is that it requires time and effort to handle each edge-case and it will never be a lossless process. It is also worth pointing out that people will often use a computer to format and create these reference lists, but will rarely check to see if each reference is valid and precisely within the house style once it has gone through the publishing, editing and reproofing process – frankly, no one has the time. So what is the scope of the problem? How many styles would have to be accommodated by an algorithm designed to extract this information?
A single figure is difficult to arrive at, but a court-case of a few years back may provide a clue – Thomson-Reuters suing George Mason University over the unlicensed use of proprietary style files in the Zotero application - the key violation claimed being that “Thomson’s 3,500 plus proprietary .ens style files within the EndNote Software [were turned] into free, open source, easily distributable Zotero .csl files”
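To make the house-style problem concrete, here is a rough sketch in Python. The two patterns describe simplified, hypothetical renderings that I have made up for illustration (loosely ‘Harvard-like’ and ‘Vancouver-like’); they are not real style definitions, and a real extractor would need far more than regular expressions. The reference itself is the Larsen and von Ins paper cited earlier.

```python
import re

# Two simplified, hypothetical house styles (assumptions for illustration only).
HARVARD_LIKE = re.compile(
    r"^(?P<authors>.+?) \((?P<year>\d{4})\) (?P<title>.+?)\. "
    r"(?P<journal>[^,]+), (?P<volume>\d+)\((?P<issue>\d+)\), "
    r"pp\. (?P<first>\d+)-(?P<last>\d+)\.$"
)
VANCOUVER_LIKE = re.compile(
    r"^(?P<authors>[^.]+)\. (?P<title>[^.]+)\. (?P<journal>[^.]+)\. "
    r"(?P<year>\d{4});(?P<volume>\d+)\((?P<issue>\d+)\):"
    r"(?P<first>\d+)-(?P<last>\d+)\.$"
)
PATTERNS = {"harvard-like": HARVARD_LIKE, "vancouver-like": VANCOUVER_LIKE}


def parse_reference(text):
    """Try each known pattern in turn; return the fields found, or None."""
    for style, pattern in PATTERNS.items():
        match = pattern.match(text.strip())
        if match:
            return {"style": style, **match.groupdict()}
    return None


TITLE = ("The rate of growth in scientific publication and the decline "
         "in coverage provided by Science Citation Index")
references = [
    # The same paper, rendered three ways.
    f"Larsen, P.O. and von Ins, M. (2010) {TITLE}. Scientometrics, 84(3), pp. 575-603.",
    f"Larsen PO, von Ins M. {TITLE}. Scientometrics. 2010;84(3):575-603.",
    # As written by hand earlier in this very article.
    f'"{TITLE}" Peder Olesen Larsen and Markus von Ins, '
    "Scientometrics. 2010 September; 84(3): 575–603.",
]

for ref in references:
    parsed = parse_reference(ref)
    print(parsed["style"] if parsed else "unrecognised", "|", ref[:40])
```

The first two renderings are recognised; the third, which is simply how I typed the reference above, matches neither pattern. Each new quirk needs another rule, and the figure quoted in the court case suggests how many such rules there might be.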
Whilst huge leaps have been made in natural language processing, information extraction techniques from semi-structured data, and so on, it is still very much a challenging problem. Most data- and text-mining techniques have been adapted to suit multi-processor and multi-machine arrays, as the typical route to understanding and, more importantly, matching citations between papers is computationally very expensive.
As an aside, is it not absurd that it’s commonplace to use information retrieval systems like the internet, which are capable of uniquely specifying trillions of individual ‘documents’, but when it comes to citations, we find ourselves reduced by necessity to discuss artificial intelligence techniques, high performance computing and imprecise matching routines when attempting to trace back the genesis of a paper’s theories down to their foundations? That asking to see at what point an idea turned from a hypothesis into an evidence-based theory is still a difficult question? That checking to see whether or not a finding quoted within one paper was actually nothing more than a suggestion or supposition in the cited paper?
Carrying along this thread of thought, it is easy to stumble across other problems – there is no ‘why’ to a citation. In its current form, the act of referencing one paper from another tells us little about why one paper cited another. It is often mistaken for an act of trust, that one paper builds upon the work of another: any assay of the metrics and indexes used to indicate the reputation of the researcher serves to show that the quality of the output is conflated with the controversy – good or bad – that the research brings. Many solid research papers are cited fewer times from papers within renowned journals than allegedly fraudulent and poor-quality work.
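A toy illustration of why matching is both imprecise and expensive, sketched in Python. The similarity measure here (word overlap, sometimes called Jaccard similarity) is a deliberately crude stand-in that I have chosen for brevity, not what any real citation index is known to use, and the renderings of each reference are hypothetical:

```python
import re


def tokens(reference):
    """Lower-case the text and keep only runs of letters and digits."""
    return set(re.findall(r"[a-z0-9]+", reference.lower()))


def similarity(ref_a, ref_b):
    """Share of distinct words the two references have in common (0.0 to 1.0)."""
    a, b = tokens(ref_a), tokens(ref_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def pair_count(n):
    """Number of comparisons needed to check every reference against every other."""
    return n * (n - 1) // 2


TITLE = ("The rate of growth in scientific publication and the decline "
         "in coverage provided by Science Citation Index")
larsen_a = f"Larsen, P.O. and von Ins, M. (2010) {TITLE}. Scientometrics, 84(3), pp. 575-603."
larsen_b = f"Larsen PO, von Ins M. {TITLE}. Scientometrics. 2010;84(3):575-603."
wakefield = ("Wakefield AJ, et al. Ileal-lymphoid-nodular hyperplasia, non-specific "
             "colitis, and pervasive developmental disorder in children. "
             "Lancet. 1998;351(9103):637-641.")

print(round(similarity(larsen_a, larsen_b), 2))   # high, but not 1.0
print(round(similarity(larsen_a, wakefield), 2))  # low
print(pair_count(1_000_000))                      # comparisons for a million references
```

Two renderings of the same paper never quite reach a perfect score, so a threshold has to be chosen and some matches will be wrong. Comparing everything with everything grows with the square of the number of references, which is why this work ends up on clusters of machines.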
Consider the (retracted) paper by AJ Wakefield et al “Ileal-lymphoid-nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children” (The Lancet, Volume 351, Issue 9103, Pages 637 – 641, 28 February 1998 - non-affiliated copy) – the basis of what became the “MMR jab” debacle. This work has been cited, by one count, over 1,000 times in academic papers in peer-reviewed journals. It is worth noting that the citation indexes are all in dispute about precisely how many times this paper (and almost *any* other paper) is cited. A shocking issue in its own right.
This is indeed a very public example and a very extreme one, but it hopefully points out that the act of citation shouldn’t be confused with solid research. Again, it is hoped that a cited work is solid, but it cannot be assumed that this is the case generally.
This shapes the sociology of citation – people will more likely avoid citing or even discussing work that they do not feel adds to their argument. It is unlikely that a researcher will cite data from another paper, where they feel the method is sound but the analysis is flawed, without that assertion being the basis of the paper. There is no mechanism to allow qualified citations to be made: to say that a paper is reliant only on a graph, some data, the method, a statement or even just a quote, or that a paper vehemently agrees, disagrees or just cannot replicate the cited work’s findings.
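As a thought experiment, a qualified citation need not be complicated. The sketch below, in Python, attaches a reason to each citation; the vocabulary of reasons, the class names and the paper identifiers are all hypothetical, chosen only to mirror the cases listed above:

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    """A made-up vocabulary of reasons for citing (illustrative only)."""
    USES_DATA = "uses data from"
    USES_METHOD = "uses method in"
    QUOTES = "quotes"
    AGREES = "agrees with"
    DISAGREES = "disagrees with"
    CANNOT_REPLICATE = "cannot replicate"


@dataclass(frozen=True)
class QualifiedCitation:
    citing: str          # identifier of the citing paper
    cited: str           # identifier of the cited paper
    relation: Relation   # why the citation was made
    note: str = ""       # optional free-text qualification


def summarise(citations, cited):
    """Count the citations of one paper, grouped by the reason given."""
    return Counter(c.relation for c in citations if c.cited == cited)


citations = [
    QualifiedCitation("paper-A", "paper-X", Relation.USES_METHOD),
    QualifiedCitation("paper-B", "paper-X", Relation.DISAGREES, "analysis of the data"),
    QualifiedCitation("paper-C", "paper-X", Relation.CANNOT_REPLICATE),
    QualifiedCitation("paper-D", "paper-X", Relation.DISAGREES),
]

summary = summarise(citations, "paper-X")
print(sum(summary.values()))  # the raw 'Cited By' count: 4
for relation, count in summary.most_common():
    print(relation.value, count)
```

The raw count for the imaginary paper-X is four, which looks respectable. The breakdown shows that three of the four citing papers dispute it or cannot reproduce it, which is precisely the information that a bare reference list throws away.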
Even the flippant notion of letting a researcher add a ‘Like’ button or a ‘+1’ next to each use of a reference is not one to be dismissed quickly. It bears more merit than the lightheartedness of the solution may suggest.
In summary:
- Citations are key marks of attribution – they avoid disputes and provide a foundation from which research can build without retreading old ground,
- Citations are formatted for well-informed people, aware of the progress of their field,
- More and more research is published yearly, making the aforementioned awareness more difficult to maintain (if it is not too difficult already),
- Traditional mechanisms for building up this awareness are not suitable to the scale of the research produced,
- Indices of citation information are growing at a slower rate than the rate of publication,
- Turning citation text into information that can be computed is difficult and should not be replicated by each individual,
- The nature of citation has yet to be realized – we can and must do better to qualify our citations,
- Analysing and fixing the situation will require a great deal of effort and this effort must be shared as widely as possible, such that it becomes attainable.
We need Open Citation Data because there is a lot of work to do – we need the ability to form a wide consensus, we need to avoid replicating work and so need to be able to freely share data, and we need raw access to the reference lists themselves to begin to extract the information from the text, by hand and by machine.
Commercial concerns such as Elsevier, Google, Mendeley, and so on are amassing data which they can use as they wish, but the general public are not given a clear message as to what can be done with this data (as yet), although some (such as Mendeley) are making efforts in this area. However, they are commercial enterprises and this cannot be forgotten. It will affect any decision they have to make when financial matters come into it, and they would be a poorer business if it did not.
The foundation of research is within the act of citation – we cannot let its assessment remain boiled down to a few simple and obtuse numbers. We need the Open Citation Data with which the communities can devise modern, useful and beneficial means of finding exciting research and assessing value self-consistently. Otherwise, we risk a future of hunting in the dark, of using commercial search engines and hearsay to discover research.
Bill Anderson
July 8, 2011
Ben, thanks for the survey. Regarding the problem about qualifying citations, do you know of David Shotton’s 2010 Journal of Biomedical Semantics article on CiTO, the Citation Typing Ontology (http://www.jbiomedsem.com/content/1/S1/S6)? Shotton presents a set of citation types, including negative rhetorical relationships such as “refutes”, that seems like a good start.
Martin Fenner also has a blog post on how to use CiTO on blog posts: http://blogs.plos.org/mfenner/2011/02/14/how-to-use-citation-typing-ontology-cito-in-your-blog-posts/
(I have not had time to follow up on Fenner’s post in practice, but I did use the CiTO ontology on a data informatics course papers.)