STEM Abstracting and Indexing (A&I) Tool Overlap Analysis in 2020: An Open Science Informed Approach Amid Pandemic Budgets

Full-Length Paper
Joshua Borycz, Alexander J. Carroll, and Honora N. Eskridge
Vanderbilt University, Nashville, TN, USA

Objectives: Compare journal coverage of abstracting and indexing tools commonly used within academic science and engineering research.
Methods: Title lists of Compendex, Inspec, Reaxys, SciFinder, and Web of Science were provided by their respective publishers. These lists were imported into Excel and the overlap of the ISSN/EISSNs and journal titles was determined using the VLOOKUP command, which determines if the value in one cell can be found in a column of other cells.

Correspondence: Joshua Borycz: joshua.borycz@vanderbilt.edu
Received: August 24, 2020
Accepted: December 18, 2020
Published: March 1, 2021
Copyright: © 2021 Borycz, Carroll & Eskridge. This is an open access article licensed under the terms of the Creative Commons Attribution License.
Data Availability: The data are available on the Open Science Framework (https://osf.io/eu573).
Disclosures: The authors report no conflict of interest.


Introduction
Abstracting and indexing (A&I) tools, which facilitate information seeking by systematically organizing literature outputs from thousands of unique journal titles, represent a significant investment for many research libraries. While A&I tools have long represented the gold standard means of discovering literature across many disciplines within the academy, Google Scholar has emerged as the preferred literature searching tool for many user communities. These changes to user behaviors, combined with longstanding as well as emerging budgetary pressures on research libraries, have increased the need to reassess the value of A&I tools within the contemporary research environment.
In this paper, the authors examine the title overlap of several prominently used A&I tools in the basic and applied sciences (listed alphabetically): Compendex, Inspec, Reaxys, SciFinder, and Web of Science. This analysis finds substantial coverage overlaps between the titles indexed within Web of Science's Science Citation Index Expanded and Emerging Sources Citation Index when compared against several discipline specific databases (Compendex, Inspec, Reaxys, and SciFinder). Based on these findings, the authors suggest that in order to maintain relevance and to continue creating value for research libraries, A&I tools must offer additional features beyond facilitating keyword searching of titles and abstracts. Finally, this paper also presents an open science informed methodological approach for conducting these types of overlap analyses, which the authors hope will facilitate future work in this area of librarianship.

Emerging platforms and changing user behavior
Two factors drive the value proposition of subscription A&I tools: 1) the unique title coverage offered over freely available alternatives; and 2) the preferences of users when conducting literature searches. As a result, information scientists increasingly measure the value proposition of many A&I tools against Google Scholar. While somewhat limited in scope when launched (Jacsó 2005), the coverage of journal articles indexed within Google Scholar has since expanded considerably. By 2014, de Winter et al. found that the majority of recent works indexed within Web of Science, often considered the gold standard multi-disciplinary A&I tool, were retrievable through Google Scholar (de Winter, Zadpoor, and Dodou 2014). By 2018, Gusenbauer had concluded that Google Scholar had become "the most comprehensive academic search engine" (Gusenbauer 2019). Previous researchers have tracked changes in A&I tool subscriptions over this period, finding that while large indexes like SciFinder and Web of Science have seen steady subscriptions, narrower subject specific tools like Compendex and Inspec have seen losses of subscribers across Association of Research Libraries (ARL) and Oberlin Group Libraries member institutions (Klassen 2020). There is also evidence indicating that traditional A&I technologies have not adapted quickly enough to the rapid increase in scientific output; scientists increasingly report that they find relevant primary literature too late in the research cycle to be useful (Lercher 2010). In response, providers of some of these A&I tools have invested in development projects to improve the value of these tools, in some cases designing entirely new search platforms (American Chemical Society 2020; The Institution of Engineering and Technology (IET) 2020).
The increased coverage of items within Google Scholar has been accompanied by the tool's growing popularity among many user communities. Google Scholar is often perceived as a tool preferred by more novice researchers who are attracted to the familiar Google interface and brand name (Rempel, Buck, and Deitering 2013; Komissarov and Murray 2016); however, several recent studies by Ithaka S+R on the research practices of scholars in fields such as agriculture, public health, and civil engineering found that faculty researchers identify Google Scholar as their first choice for literature searching (Cooper, Bankston, et al. 2017; Cooper, Daniel, et al. 2017; Cooper et al. 2019). The perceived ease and speed of Google Scholar have won over many researchers, including more senior researchers who have nostalgic feelings about using subject specific A&I tools earlier in their academic careers: "what I use less and less is services like [AGRICOLA] and stuff. For whatever reason Google Scholar tends to give me what I want faster, which is really sad" (Cooper, Bankston, et al. 2017). Moreover, Google Scholar also provides applied science researchers with easier access to grey literature and datasets that are rarely indexed within subject specific A&I tools (Haddaway et al. 2015).
Yet Google Scholar has some noteworthy limitations when compared to A&I tools, particularly when used for evidence syntheses. While a handful of studies have suggested that the coverage of Google Scholar is so expansive that it could be the primary literature retrieval tool used for evidence syntheses (Gehanno, Rollin, and Darmoni 2013), the broader research community has identified several issues with using Google Scholar for systematic literature retrieval (Giustini and Boulos 2013; Bramer et al. 2013). Google Scholar does not offer reliable or stable search results over time and place, does not allow large search results to be exported in other data formats, does not allow for thorough search strategy documentation, and only includes items that have been indexed online (Boeker, Vach, and Motschall 2013). As a result, Google Scholar cannot be relied upon to create reproducible search results over time and cannot find older items that have not been digitized, which may limit its usefulness for conducting reproducible evidence syntheses. However, many of these same limitations have begun to surface within more established A&I tools as well. A recent longitudinal query analysis of searches performed in MEDLINE discovered similar issues related to the reproducibility of search results: the results of MEDLINE searches can vary based on the platform used (e.g., EbscoHOST, Web of Science, OVID, PubMed, etc.) as well as when the search was performed (Burns et al. 2020).

Long-standing budget challenges for research libraries
Research libraries face several long-standing budget challenges that create an urgency to engage in continuous review of the value of their licensed A&I tools. Libraries continue to grapple with the ongoing "serials crisis," a now several decades long trend of annual increases in the costs of science and technology journals combined with only marginal increases in collection budgets (Mobley 1998; Schmidle and Via 2004; Baveye 2010). The contemporary serials crisis is often tied to the creation of "big deal" publication packages by major commercial publishers, in which subscription licenses to several individual journal titles from a single publisher are bundled into one package that is offered at a lower price (Hinchliffe 2020). In "big deal" license arrangements, libraries often have little room to negotiate the cost of individual titles, and local user communities may resist efforts by librarians to downsize journal subscriptions (Boissy et al. 2012).
After several decades of business as usual, these arrangements have in recent years begun to unravel, with several major research libraries in the United States severing their big deals with Elsevier in 2019 and 2020, including the University of California System, Temple University, Louisiana State University, Florida State University, the University of North Carolina at Chapel Hill, and the Massachusetts Institute of Technology (SPARC 2020). Meanwhile, other research libraries and publishers have collaborated to create "transformative agreements" that attempt to shift the fundamental business model of scholarly communication (Hinchliffe 2019). However, the actual price savings created by both "big deal" cancellations and "transformative agreements" remain to be seen (Anderson 2020), suggesting that concerns over the value of A&I tools will remain in the near term.

Emerging budget challenges for research libraries
In addition to these existing challenges, research libraries in 2021 must contend with the economic uncertainty created by the ongoing COVID-19 pandemic. Previous economic downturns like the 2008 Global Financial Crisis (GFC) created budget shortfalls for higher education, which resulted in pressures on research libraries' collection budgets. However, unlike the GFC, which was at least partially mitigated by counter-cyclical surges in student enrollment (Barr and Turner 2013), COVID-19 negatively affected all of the main revenue streams for institutions of higher learning: student enrollment, research productivity, charitable giving, and expected yield from endowment funds (Banes, Schwartz, and Pisacreta 2020). Faced with falling revenues and unexpected new expenses in the form of increased online instruction infrastructure and extensive new health and sanitation services expenditures, many colleges and universities instituted hiring freezes, furloughs, buyouts, and layoffs to control costs (Chronicle Staff 2020).
STEM Abstracting and Indexing (A&I) Tool Overlap Analysis JeSLIB 2021; 10(2): e1192 https://doi.org/10.7191/jeslib.2021.1192
With institutions pivoting towards online and hybrid instruction, demand has grown amongst stakeholders across higher education for access to a wider variety of electronic resources that can support student learning in both synchronous and asynchronous online environments (Blankstein, Frederick, and Wolff-Eisenberg 2020). Research libraries in many ways were well-prepared for a pivot to online and hybrid learning. Broader community investments in shared resources like HathiTrust created shared open access infrastructure that can be accessed remotely by researchers and students temporarily barred from entering physical library spaces (Schonfeld 2020). Yet despite decades of intentional investment and development in digital infrastructures and digital resources (Evans and Schonfeld 2020), the rapid closures of physical facilities still created unexpected challenges. Limited access to print collections led many research libraries to purchase electronic duplicates of items already held within physical collections (Hinchliffe and Wolff-Eisenberg 2020), placing new burdens on already dwindling discretionary budgets (Daniel, Esposito, and Schonfeld 2019). Amid this push for more electronic resources to support online instruction, and while simultaneously facing significant budget shortfalls, many colleges and universities have asked their libraries to prepare for budget cuts of upwards of twenty percent in upcoming fiscal years (Lutz and Schonfeld 2020). While publishers responded to this crisis by offering temporary free access to some of their platforms, many of these extended access programs expired in June 2020 (Association of American Publishers 2020).
Given this budget climate, many college and university libraries will be hard-pressed to meet these conflicting demands, which places potentially duplicative A&I tools under an even brighter spotlight.

Previous A&I overlap work
Previous evaluations of A&I tools' title overlaps have largely adopted two approaches. One approach is to use specific computational tools that have been custom-built for this kind of analysis: for example, the Serials Solutions Overlap Analysis tool, the Academic Database Assessment Tool (ADAT), and the CUFTS Resource Comparison Tool (Duong, Perruso, and Ramachandran 2013; Harker and Kizhakkethil 2015). However, this approach has several significant limitations related to the reproducibility of the generated results. Tools like the Serials Solutions Overlap Analysis tool are proprietary, and as such are only available to research libraries that have a license agreement with Serials Solutions. On the other hand, open source tools like the ADAT and the CUFTS Resource Comparison Tool only retain their utility if they are maintained, which requires constant investment for continued development; as of 2020, neither of these tools was still being actively updated.
The other primary method described in the literature is downloading full-title lists for individual A&Is, and then comparing title lists against one another by matching titles on ISSN (Gavel and Iselid 2008; Kimball 2016). These analyses are often performed using a tool like Microsoft Excel, in which the VLOOKUP function is used to merge title lists from two separate data sheets into a single data sheet by finding common ISSNs (Kimball 2018). Previous studies have suggested that limitations exist with this method, as well. Primarily, substantial data cleaning is often involved, as the data quality of the title lists provided by vendors can vary substantially. Issues created by poor data quality may include missing ISSN data, incongruence between print ISSNs and electronic ISSNs, as well as duplicate records within a single title list. Previous studies have also suggested that this approach only allows for overlap analysis to be conducted via paired comparisons rather than multilevel analyses (Harker and Kizhakkethil 2015).

Methods
Requests for the journal title lists of each of the databases used in this work were made to their respective providers (Elsevier, Chemical Abstracts Service (CAS), and Clarivate Analytics). The journal title lists for Compendex, Reaxys, and Web of Science were provided as CSV files containing ISSNs, EISSNs, and titles. For the purposes of this overlap analysis, only journals indexed within the Science Citation Index Expanded and the Emerging Sources Citation Index were included from the Web of Science (WOS) Core Collection, which reduced the total WOS titles included from 21,226 to 17,014. The title list for Inspec contained only ISSNs and titles. SciFinder titles with ISSNs and EISSNs were provided in PDF format and had to be manually transferred to an Excel spreadsheet. For the purposes of this overlap analysis, only journals indexed within the CAplus database were included for SciFinder; titles searchable within MEDLINE but not included in CAplus were excluded from this analysis. The unprocessed title list data for each A&I tool are available via OSF (Borycz, Carroll, and Eskridge 2020).
After importing the titles and ISSN/EISSNs into Excel, extra white space was removed, dashes were added to all ISSN/EISSNs for consistency, and duplicate titles were removed from all journal lists. The VLOOKUP command takes a string in a single cell and compares it to the strings in all of the cells in another column. VLOOKUP was used for journal titles, ISSNs, and EISSNs separately. If the ISSN and EISSN of a title were listed as the same number, the EISSN was removed to prevent double counting. A README file that documents the data cleaning and data analysis processes is available via OSF (Borycz, Carroll, and Eskridge 2020).
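The cleaning steps described above were carried out in Excel; the following Python sketch only illustrates the same ideas and is not the authors' workflow. The record keys (`title`, `issn`, `eissn`), the function names, and the choice to treat titles as duplicates regardless of letter case are all assumptions made for illustration.

```python
import re


def normalize_issn(raw):
    """Return an ISSN/EISSN as 'NNNN-NNNC', or None if it cannot be normalized."""
    if raw is None:
        return None
    # Drop white space and any existing dash, and upper-case the check character.
    compact = re.sub(r"[\s\-]", "", str(raw)).upper()
    if not re.fullmatch(r"\d{7}[\dX]", compact):
        return None
    return compact[:4] + "-" + compact[4:]


def clean_title_list(records):
    """Clean one title list given as dicts with hypothetical keys
    'title', 'issn', and 'eissn'."""
    seen_titles = set()
    cleaned = []
    for rec in records:
        # Collapse runs of white space and trim the ends of the title.
        title = " ".join(str(rec.get("title") or "").split())
        issn = normalize_issn(rec.get("issn"))
        eissn = normalize_issn(rec.get("eissn"))
        if eissn == issn:
            eissn = None  # same number listed twice: keep it only as the ISSN
        key = title.casefold()  # assumption: duplicates are compared case-insensitively
        if key in seen_titles:
            continue  # duplicate title within the same list
        seen_titles.add(key)
        cleaned.append({"title": title, "issn": issn, "eissn": eissn})
    return cleaned
```

Normalizing every identifier to one form before any comparison is what keeps an exact-match lookup from missing a journal merely because one vendor omitted the dash.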
If any one of the journal title, ISSN, or EISSN matched between the databases, this was considered an overlapping title. The overlap was computed this way because the ISSNs and EISSNs were not assigned consistently when comparing the title lists provided by the publishers and, in a few cases, there was overlap for the journal title but not for the EISSN or ISSN. This method reduces the number of false negatives reported in the data. While this approach may have increased the number of false positives detected in the data, the total number of matches on title was relatively small (n = 8-92) and contributed only a small percentage to the total overlaps (0.2-4.2%). Table 1 shows the total number of titles present in each of the databases.
Table 1: Name, scope, and extent of journals analyzed.
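The matching rule above (a title overlaps if its ISSN, its EISSN, or its journal title is found in the reference list) can be sketched in Python as follows. This is an illustrative sketch, not the authors' Excel workbook: the record keys and function names are hypothetical, each field is compared only against the same field in the reference list (mirroring the separate VLOOKUP passes), and the percentage is assumed to be the overlap count divided by the size of the analyzed list.

```python
FIELDS = ("issn", "eissn", "title")


def build_index(reference):
    """Collect the reference list's values into one lookup set per field."""
    index = {field: set() for field in FIELDS}
    for rec in reference:
        for field in FIELDS:
            value = rec.get(field)
            if value:
                index[field].add(value.casefold())
    return index


def is_overlap(rec, index):
    """True if any one of ISSN, EISSN, or title is found in the reference index."""
    return any(rec.get(f) and rec[f].casefold() in index[f] for f in FIELDS)


def overlap_summary(analyzed, reference):
    """Count analyzed titles that overlap with the reference list."""
    index = build_index(reference)
    matched = sum(1 for rec in analyzed if is_overlap(rec, index))
    total = len(analyzed)
    # Assumption: percentage is taken over the analyzed list's total.
    percent = round(100 * matched / total, 2) if total else 0.0
    return {"total": total, "overlap": matched, "percent": percent}
```

Using sets makes each lookup a constant-time membership test, the same question VLOOKUP answers when it checks whether a cell's value appears in another column.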

Results
The results of the overlap analysis performed in this work are provided in Table 2. The combined Science Citation Index Expanded and the Emerging Sources Citation Index from the Web of Science Core Collection was used as the reference database for most cases. The entire WOS Core Collection (Science Citation Index Expanded, Social Sciences Citation Index, Arts and Humanities Citation Index, and Emerging Sources Citation Index) was also used for comparison but did not account for substantial changes in the overlap between databases (Table S1) (Clarivate 2020). SciFinder and Reaxys were compared because they are both primarily chemistry databases. Compendex and Inspec were compared because they are engineering databases that are often used in concert through Elsevier's Engineering Village search platform. WOS titles were combined with these unique cases for comparison as well.
The results in Table 2 show that WOS is the largest database included in this analysis by far (17,014). This is because it is a comprehensive database meant to cover a wide range of topics. Reaxys is primarily designed for chemistry and chemical engineering and contains the second largest set of journal titles (14,863). By comparison, SciFinder, which is another popular chemistry database, has the fewest titles (2,180). WOS contains many of the titles present within the other four databases. Compendex has the smallest overlap at 63.60% and SciFinder has the largest at 75.83%. Combining WOS with Compendex substantially increased the overlap percentage with Inspec from 54.48% to 77.55%. Combining WOS with Inspec increased the overlap with Compendex from 51.68% to 69.70%. Compendex has the highest proportion of unique titles within this subset of databases based on these comparisons. Combining WOS with Reaxys did not change the overlap with SciFinder very much (75.83% to 80.78%). Full processed data are available on OSF (Borycz, Carroll, and Eskridge 2020).
Table 2: Summary of overlap comparisons. The numerator represents the database being analyzed and the denominator represents the reference database. TOTAL shows the number of titles in the analyzed database, # OVERLAP shows the number of overlapping titles, and % OVERLAP shows the percentage of the analyzed and referenced databases that overlap. A plus sign indicates that the reference combines the unique titles from two databases.
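As a reading aid, the relationship between the Table 2 columns can be written out as below. This assumes that % OVERLAP is the number of overlapping titles divided by the TOTAL of the analyzed database; the caption does not spell out the denominator, so that reading is an assumption. The share of unique titles is then the complement, illustrated with the two WOS comparisons reported above.

```latex
\[
\%\,\text{OVERLAP} = 100 \times \frac{\#\,\text{OVERLAP}}{\text{TOTAL}},
\qquad
\%\,\text{UNIQUE} = 100 - \%\,\text{OVERLAP}
\]
\[
\text{Compendex vs. WOS: } 100 - 63.60 = 36.40,
\qquad
\text{SciFinder vs. WOS: } 100 - 75.83 = 24.17
\]
```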

Discussion
Given the popularity of Google Scholar, licensed A&Is must provide demonstrated utility to researchers in the form of unique title indexing or advanced search features in order to continue to have value in the contemporary research environment (Little 2011; Oh and Colón-Aguirre 2019). The results of this analysis indicate that some prominent A&I databases designed to serve researchers working within the same fields (Compendex and Inspec, SciFinder and Reaxys) have content that overlaps substantially when combined with the WOS Core Collection (Table 2: 67.03-75.83%). For institutions where user communities' preferences may have moved away from usage of these tools, the relatively few unique titles offered by SciFinder and Inspec may provide justifications for research libraries reallocating their collection budgets towards other resources. While SciFinder supplements these relatively few journals with reference information on chemical structures, properties, and reactions drawn from the CAplus database, which many chemists find indispensable, combined with the ability to simultaneously search MEDLINE (Gabrielson 2018) [...] tools over alternatives like Compendex, yet it also offers fewer unique titles (29.01%) when measured against WOS than Compendex (36.40%), and has a 77.55% overlap with WOS+Compendex (Table 2).
In addition to these specific findings, this paper offers several unique contributions when compared to previously published journal overlap analyses. The method of overlap analysis used improves upon basic ISSN:ISSN comparisons using VLOOKUP by adding checks on EISSN and title matches in order to identify additional overlapping journal coverage, reducing the number of false negatives. The thorough data cleaning processes used also catch duplicate records that would otherwise have been incorrectly deemed "overlaps." Given that the data provided by vendors used in this paper often included duplicate records, this process helped limit the number of false positives detected in the analyses. Furthermore, while earlier studies have suggested that ISSN-based analyses using Excel could only perform 1:1 database comparisons, by combining unique title lists from multiple databases into a single data sheet, this study included N:1 database overlap comparisons. While this study's N:1 comparisons were limited to combinations of a single subject-specific A&I tool with Web of Science, this same method could be utilized by future investigators to perform additional N:1 analyses.
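The N:1 idea described above (merging the unique titles of several databases into one reference list before comparing) might look like the following in Python. This is a hedged sketch rather than the authors' spreadsheet procedure; the record keys, the function names, and the use of (field, value) pairs so that an ISSN is only ever compared with an ISSN are assumptions made for illustration.

```python
FIELDS = ("issn", "eissn", "title")


def keys_of(rec):
    """Return the (field, value) pairs that identify one journal record."""
    return {(f, rec[f].casefold()) for f in FIELDS if rec.get(f)}


def combine_references(*title_lists):
    """Merge several title lists, keeping one record per distinct journal."""
    seen, combined = set(), []
    for title_list in title_lists:
        for rec in title_list:
            keys = keys_of(rec)
            if keys & seen:
                continue  # already present via ISSN, EISSN, or title
            seen |= keys
            combined.append(rec)
    return combined


def count_overlap(analyzed, reference):
    """Count analyzed records sharing an ISSN, EISSN, or title with the reference."""
    ref_keys = set()
    for rec in reference:
        ref_keys |= keys_of(rec)
    return sum(1 for rec in analyzed if keys_of(rec) & ref_keys)
```

A comparison such as Inspec against WOS+Compendex would then be `count_overlap(inspec, combine_references(wos, compendex))`, where the three list names are placeholders for cleaned title lists. Because `combine_references` accepts any number of lists, the same call pattern extends beyond two reference databases.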
All data associated with this paper are deposited online and can be reviewed by the broader library and information science community both now and in the future. Included in these data are the unprocessed data files the authors received either from vendors upon request or directly from the Internet, the cleaned data files that were prepared prior to data analysis, the processed data files used to generate the results reflected in this paper, as well as README files that outline the steps taken to clean the data and generate the processed data. Committing to open data practices and open science processes in this type of research is important for several reasons. The open data associated with this paper enhance the reproducibility of this study, as these data can be used by other investigators in the near term to conduct additional analyses on other subject specific or comprehensive A&I tools that were not addressed by the authors (e.g., MEDLINE, Scopus, MathSciNet, etc.). Additionally, these data may be used by future investigators to examine how database coverages have shifted over time; journal overlap and the value proposition of these tools will no doubt continue to change as these tools are acquired by new vendors (Elsevier 2013), existing tools are merged together into single platforms (University of California 2018), and emerging platforms for journal discovery appear that challenge the popularity of these tools (Himmelstein et al. 2018).

Conclusions
While this study demonstrates substantive overlaps among many prominent STEM A&I tools, for many college and research libraries, overlapping journal titles alone will not provide enough justification for cancelling a licensed A&I tool. For example, uneven date coverage between two tools can complicate title overlap analysis, as one A&I tool's coverage of a given title may be substantively more expansive temporally. An additional limitation of overlap analyses, including this one, concerns the reliability of the underlying data used. While vendors may be expected to provide accurate coverage information to library licensees, journal acquisitions and title changes make these data messy and difficult to wrangle; the use of incomplete or inaccurate title lists could lead to overstatements of the degree of overlap between two databases. Beyond title overlaps, librarians considering cancellations should weigh other factors as well, such as the specific unique titles a tool indexes and whether those titles may be of importance to key academic units on campus. Finally, licensed A&I tools also may provide users with advanced search techniques (e.g., chemical structure), full-text searching, and reference value searching that are not available in other licensed or free discovery platforms.
When considering cancellation, usage statistics and user research may provide additional insights into the utility of a given tool. Many of the tools included in this study use the COUNTER standard for sharing usage data, and will make these data available to individual licensees upon request (COUNTER 2021). However, even among COUNTER-compliant vendors, limitations with these usage data exist. These data are only available upon request, and in some cases a library licensee may need to make several requests in order to receive them. Additionally, many of these data will be shared in formats that require extensive data cleaning in order to be machine readable (e.g., in .pdf format, in a locked .xlsx file, etc.). Furthermore, while usage data provide quantitative insights into the extent of a tool's usage, these data do not provide context on the value of a tool when measured against licensed or free alternative options (Warwick et al. 2009). In order to identify discovery tools that may have specific value to user communities, either due to unique title coverage or to advanced search features, librarians should consider supplementing overlap analyses with user research studies that include surveys, interviews, or focus groups. Overlap analyses such as this, combined with usage data, can help spark these conversations with campus stakeholders on the relevancy and value of these tools. A&I tools that fail to offer a robust list of unique titles and that are not thoroughly integrated into the workflows of researchers may be candidates for cancellation.
The overlap of materials within these licensed tools, combined with the ascendance of Google Scholar, suggests that the developers of these A&I tools may need to invest in designing value-additive search features if they wish to regain relevance among researchers. The reception users give to the new platforms developed for SciFinder and Inspec (SciFindern and Inspec Analytics, respectively) may provide insight into whether these licensed resources can convince researchers to reconsider Google Scholar as their discovery tool of choice at that library. The launch of these new platforms also could provide a valuable opportunity for vendors and research libraries to partner by conducting user research among local stakeholders to gather feedback on these new platforms. In addition to developing new user interfaces, vendors interested in demonstrating the continuing value of A&I tools to library licensees should consider improvements that can be made to better integrate these tools into the workflows of librarians and researchers. For example, by developing more effective processes for sharing usage data, vendors can more transparently communicate to library licensees the utility of these discovery tools; possible strategies include adopting COUNTER reporting standards and enabling licensees to pull their own data rather than requiring mediated data requests. Meanwhile, vendors wishing to regain market share from Google Scholar should consider how they can better integrate A&I tools into researchers' workflows, rather than expecting researchers to work around the idiosyncrasies of their tools.
One possible avenue for A&I tools to create unique value over Google Scholar could be to focus on streamlining the systematic literature searches that are conducted in support of evidence syntheses (e.g., systematic reviews, scoping reviews, etc.). By developing features like automated search protocol documentation or enhanced record exporting, A&I providers could ensure that these tools remain indispensable for researchers interested in conducting evidence syntheses, which increasingly span from the health sciences to the social sciences (McKenzie and Brennan 2017).