What the ephemerality of the web means for your hyperlinks
Hyperlinks are a powerful tool for journalists and their readers. Diving deep into the context of an article is just a click away. But hyperlinks are a double-edged sword: for all the apparent infinity of the internet, what is on the web can also be changed, moved, or made to vanish entirely.
The fragility of the web poses a problem for any area of work or interest that relies on written records. Loss of reference material, negative SEO impacts, and malicious hijacking of valuable outbound links are among the harmful effects of a broken URL. More fundamentally, it leaves articles from decades past as shells of themselves, cut off from their original sources and context. And the problem goes beyond journalism. In a 2014 study, for example, researchers (including some members of this team) found that almost half of all hyperlinks in Supreme Court opinions led to content that had either changed since its original publication or disappeared from the internet.
Hosts control URLs. When they delete content from a URL, intentionally or not, readers find the page inaccessible. This often irreversible decay of web content is commonly known as linkrot. It is similar to the related problem of content drift: typically unannounced changes (retractions, additions, replacements) to the content at a particular URL.
Our team of researchers at Harvard Law School undertook a project to better understand the extent and characteristics of journalistic linkrot and content drift. We examined the hyperlinks in New York Times articles, from the launch of the Times website in 1996 through mid-2019, working from a dataset provided to us by the Times. The substantial linkrot and content drift we found reflects the inherent difficulty of long-term linking to pieces of a volatile web. The Times in particular is a well-resourced standard-bearer for digital journalism, with a strong institutional archiving structure. That even the Times is interested in confronting linkrot indicates that the problem has yet to be understood or comprehensively addressed across the field.
The dataset of links on which we built our analysis was assembled by Times software engineers, who extracted the URLs embedded in archival articles and packaged them with basic article metadata such as section and publication date. We measured linkrot by writing a script to visit each of the unique “deep” URLs in the dataset and record HTTP response codes, redirects, and server timeouts. Based on this analysis, we tagged each link as either “rotten” (deleted or unreachable) or “intact” (returning a valid page).
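A link check of this kind can be sketched in a few lines of Python. What follows is a minimal illustration rather than the script used in the study: the function names, the 10-second timeout, and the rule that any final status below 400 counts as intact are all assumptions made for the example.

```python
import http.client
import urllib.error
import urllib.request
from typing import Optional


def classify(status: Optional[int]) -> str:
    """Label a link from its final HTTP status (None means no response at all)."""
    # Assumption: 2xx/3xx responses count as intact; errors and timeouts as rotten.
    if status is not None and 200 <= status < 400:
        return "intact"
    return "rotten"


def fetch_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the final HTTP status for url, or None if no valid response arrives."""
    try:
        request = urllib.request.Request(url, headers={"User-Agent": "linkrot-sketch"})
        # urlopen follows redirects, so this is the status of the final page.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered with an error such as 404 or 410
    except (OSError, http.client.HTTPException, ValueError):
        # DNS failure, refused connection, timeout, broken response, malformed URL
        return None


def check_link(url: str) -> str:
    """Visit one URL and tag it as 'intact' or 'rotten'."""
    return classify(fetch_status(url))
```

Keeping the labeling rule in its own function means the threshold can be changed, or tested, without touching the network code.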
Two million hyperlinks
We found that the 553,693 articles within the scope of our study (that is, those with URLs on nytimes.com) contained a total of 2,283,445 hyperlinks pointing to content outside of nytimes.com. Seventy-two percent of these were “deep links” with a path to a specific page, such as example.com/article, and these are the links on which we focused our analysis (as opposed to links to just example.com, which made up the rest of the dataset).
Of these deep links, 25 percent were completely inaccessible. Linkrot became more common the older a link was: 6 percent of links from 2018 had rotted, compared with 43 percent of links from 2008 and 72 percent of links from 1998. Fifty-three percent of all articles containing deep links had at least one rotten link.
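The distinction between a deep link and a bare-domain link can be made concrete with a short Python sketch. This is an illustration only: the function names are hypothetical, and the rule used here (a deep link is any URL whose path goes beyond the site root) is an assumption about how the distinction might be drawn, not a description of how the dataset was built.

```python
from urllib.parse import urlsplit


def _split(url: str):
    # Accept scheme-less strings such as "example.com/article" as well.
    return urlsplit(url if "://" in url else "//" + url)


def is_deep_link(url: str) -> bool:
    """True if the URL points past the site root, e.g. example.com/article."""
    # Assumption: only the path is considered; a bare "/" counts as the homepage.
    return _split(url).path.strip("/") != ""


def is_external(url: str, site: str = "nytimes.com") -> bool:
    """True if the URL's host is neither the site itself nor one of its subdomains."""
    host = (_split(url).hostname or "").lower()
    return not (host == site or host.endswith("." + site))
```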
Rot across sections
Some sections of the Times showed much higher rates of rotten URLs than others. Links in the Sports section, for example, show a rot rate of about 36 percent, compared with 13 percent for The Upshot. This difference is largely a matter of time: the average age of a link in The Upshot is 1,450 days, compared with 3,196 days in the Sports section.
To gauge how far these chronological differences alone explain the variation in rot rate between sections, we developed a metric, the relative rot rate (RRR). It shows whether a section has suffered proportionally more or less linkrot than the Times as a whole. Of the fifteen sections with the most articles, the Health section had the lowest RRR, falling about 17 percent below the baseline linkrot frequency. The Travel section had the highest rot rate, with more than 17 percent of the links in its articles having rotted.
A section that reports heavily on government affairs or education might be at a disadvantage, because deep links to domains such as .gov or .edu show higher rot rates. These URLs are volatile by design: whitehouse.gov will always have the same URL but will fundamentally change in both content and structure with each new administration. It is precisely because their domains are fixed that their deep links are fragile.
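The formula for the metric is not reproduced here. As a rough illustration only, one way to build an age-adjusted comparison of this kind is to set a section’s observed count of rotten links against the count that would be expected if each of its links had rotted at the Times-wide rate for its publication year. The Python below follows that reading; the function name, the input layout, and the idea of reporting the result as a fraction above or below the baseline are all assumptions, not the study’s definition.

```python
from typing import Dict, Tuple

# Maps a publication year to (total links, rotten links) for that year.
Counts = Dict[int, Tuple[int, int]]


def relative_rot_rate(section: Counts, overall: Counts) -> float:
    """Observed rot in a section relative to an age-adjusted expectation.

    Returns 0.0 when the section matches the site-wide baseline, a negative
    value when it has rotted less than expected, and a positive value when
    it has rotted more.
    """
    observed = 0
    expected = 0.0
    for year, (links, rotten) in section.items():
        total_links, total_rotten = overall[year]
        observed += rotten
        # Rot expected if this section's links decayed at the site-wide rate for that year.
        expected += links * (total_rotten / total_links)
    return observed / expected - 1.0
```

Under this reading, a section whose links are simply older is not penalized for its age, because its expected rot rises along with its observed rot.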
Of course, returning a valid page is not the same as returning the page as seen by the author who originally included the link in an article. Linkrot’s partner, content drift, can leave the content at the end of a URL misleading or dramatically different from what the original linker intended. For example, a 2008 article about a congressional race refers to a member of the New York City Council and links to what was then his page on the council’s website. Today, clicking the same link takes you to the page of the district’s current council member.
To identify the prevalence of content drift, we conducted a human review of 4,500 URLs randomly sampled from the URLs that our script had tagged as intact. For the purposes of this review, we defined a content-drifted link as a URL used in a Times article that no longer pointed to the relevant information the original article was referring to when it was published. Based on this analysis, the reviewers marked each URL in the sample as either “intact” or “drifted.”
Thirteen percent of the intact links in this sample of 4,500 had drifted significantly since the Times published them. Four percent of accessible links published in 2019 articles had drifted, compared with 25 percent of accessible links from 2009 articles.
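The sampling and tallying steps described above can be sketched in Python as follows. This is a simplified illustration under assumptions: the function names, the fixed random seed, and the use of the plain strings “intact” and “drifted” as reviewer labels are all hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List


def draw_review_sample(intact_urls: List[str], size: int, seed: int = 0) -> List[str]:
    """Pick `size` distinct URLs uniformly at random; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(intact_urls, size)


def drift_share(labels: Dict[str, str]) -> float:
    """Fraction of reviewed URLs that a human reviewer marked as 'drifted'."""
    counts = Counter(labels.values())
    return counts["drifted"] / len(labels)
```

Sampling without replacement ensures that no URL is reviewed twice, and seeding the generator lets a second reviewer reproduce the same sample.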
The path forward
Linkrot and content drift at this scale across the New York Times are not a sign of neglect but a reflection of the state of modern online citation. The rapid sharing of information through links enhances the field of journalism. That it is being compromised by the fundamental volatility of the web points to the need for new practices, workflows, and technologies.
Retroactive, or mitigating, options are limited but still important to consider. The Internet Archive hosts an impressive, if far from complete, assortment of website snapshots. It is best understood as a way of patching instances of linkrot and content drift. Publications could help raise the visibility of the Internet Archive and similar services as a tool for readers, or even automatically replace broken links with links to archived copies, as the Wikipedia community has done.
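As a rough sketch of how such a replacement could be automated, the Python below asks the Internet Archive’s Wayback Machine availability endpoint for the snapshot closest to an article’s publication date. It is an illustration under assumptions, not a description of what Wikipedia or any newsroom actually runs: the function names are hypothetical, and the response layout is parsed as the endpoint is understood to return it, which may change.

```python
import json
import urllib.request
from typing import Optional
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url: str, published: str) -> str:
    """Build the lookup URL; `published` is a YYYYMMDD date string."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": url, "timestamp": published})


def closest_snapshot(payload: dict) -> Optional[str]:
    """Extract the snapshot URL from a decoded response, or None if there is none."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def find_archived_copy(url: str, published: str, timeout: float = 10.0) -> Optional[str]:
    """Query the endpoint over the network and return the closest snapshot URL, if any."""
    with urllib.request.urlopen(availability_query(url, published), timeout=timeout) as response:
        return closest_snapshot(json.load(response))
```

Asking for the snapshot nearest the publication date, rather than the most recent one, also guards against content drift: the reader is shown something close to what the author saw.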
Yet more fundamental measures are needed. Journalists have adopted some proactive solutions, such as taking screenshots and storing static images of websites. But that does not solve the problem for the reader who stumbles on an unreachable link.
New frameworks for considering the purpose of a given link will help strengthen the intertwined processes of journalism and research. Before linking, for example, journalists should decide whether they want a dynamic link to the volatile web (risking rot or content drift, but allowing deeper exploration of a topic) or a frozen piece of archival material, fixed to represent exactly what the author would have seen at the time of writing. Newsrooms, and the people who support them, should build technical tools to streamline this more sophisticated linking process, giving editors maximum control over how their articles interact with other web content.
Newsrooms should consider adopting the right tools for their workflows and making link preservation an integral part of the journalistic process. Partnerships between library and information professionals and digital newsrooms would be fruitful in creating these strategies. Previously, such partnerships have produced solutions tailored to the field, such as those offered to the legal field by the Harvard Law School Library’s Perma.cc project (on which the authors of this report are working or have worked).
The skills of information professionals must be combined with the specific concerns of digital journalism to bring out specific needs and areas of development. For example, explorations into more automated linkrot and content drift detection would open doors for newsrooms to balance the need for external links with archival considerations while maintaining their large-scale publishing needs.
Digital journalism has grown tremendously over the past decade, taking a vital place in the historical record. Linkrot is already eroding that record, and it won’t go away on its own.
For an extended version of this research, plus more information on the methodology and the dataset, visit https://cyber.harvard.edu/publication/2021/paper-record-meets-ephemeral-web.
Below is a list of the archived citations included in this article:
https://www.cjr.org/tow_center_reports/the-dire-state-of-news-archiving-in-the-digital-age.php archived at https://perma.cc/FEW8-EBPH
https://www.searchenginejournal.com/404-errors-google-crawling-indexing-ranking/261541/#close archived at https://perma.cc/H23H-28CJ
https://www.buzzfeednews.com/article/deansterlingjones/links-for-sale-on-major-news-wesbites archived at https://perma.cc/D6B3-2Z2A
https://cityroom.blogs.nytimes.com/2008/06/12/democrats-rally-around-mcmahon-in-si-race/ archived at https://perma.cc/W95N-U4S9
https://council.nyc.gov/district-49/ archived at https://perma.cc/QNZ6-SEXP
John Bowers, Clare Stanton and Jonathan Zittrain are the authors. John Bowers is a JD candidate at Yale Law School and an affiliate of the Berkman Klein Center for Internet & Society at Harvard University, where he worked as a senior research coordinator before coming to Yale. Clare Stanton is the Outreach and Communications Manager for the Perma.cc project at the Library Innovation Lab at Harvard Law School. Jonathan Zittrain is a professor of law and computer science at Harvard, where he co-founded the Berkman Klein Center for Internet & Society.