Dataset Search: A lightweight, community-built tool to support research data discovery

Objective: Promoting discovery of research data helps archived data realize its potential to advance knowledge. Montana State University (MSU) Dataset Search aims to support discovery and reporting for research datasets created by researchers at institutions.
Methods and Results: The Dataset Search application consists of five core features: a streamlined browse and search interface, a data model based on dataset discovery, a harvesting process for finding and vetting datasets stored in external repositories, an administrative interface for managing the creation, ingest, and maintenance of dataset records, and a dataset visualization interface to demonstrate how data is produced and used by MSU researchers.
Conclusion: The Dataset Search application is designed to be easily customized and implemented by other institutions. Indexes like Dataset Search can improve search and discovery for content archived in data repositories, therefore amplifying the impact and benefits of archived data.

Correspondence: Sara Mannheimer: sara.mannheimer@montana.edu
Received: June 4, 2020 Accepted: September 8, 2020 Published: January 19, 2021
Copyright: © 2021 Mannheimer et al. This is an open access article licensed under the terms of the Creative Commons Attribution License.
Data Availability: Code associated with this paper is available in Zenodo, via GitHub, at: https://doi.org/10.5281/zenodo.4046567. MSU Dataset Search is available at: https://arc.lib.montana.edu/msu-dataset-search.
Disclosures: The authors report no conflict of interest. The substance of this article is based upon a lightning talk at RDAP Summit 2020. Additional information at end of article.

Full-Length Paper
Sara Mannheimer, Jason A. Clark, Kyle Hagerman, Jakob Schultz, and James Espeland
Montana State University, Bozeman, MT, USA


Introduction and Background
Sharing the scientific data that underlie results is increasingly seen as a vital part of scholarly communication (Baker 2017; Boulton et al. 2012). Sharing research data has multiple potential benefits. Shared data can increase time efficiency and cost efficiency by allowing researchers to reuse data rather than collect new data (Pronk 2019); it can support reproducibility and replicability for scientific research (National Academies of Sciences, Engineering, and Medicine 2019); it can produce new discoveries to advance science (Fienberg et al. 1985); it can increase visibility and impact of research (Piwowar and Vision 2013); it can encourage new, mutually beneficial collaborations between researchers (Pasquetto, Borgman, and Wofford 2019); and it can be used in the classroom and during apprenticeships to support the next generation of researchers (Haaker and Morgan-Brett 2017; Kriesberg et al. 2013).
In the United States, research data that result from public funding are further considered to be a public asset that should be shared openly (Holdren 2013). In response to this idea, federal funding agencies now require sharing data with other researchers. The National Science Foundation's policy states, "grantees are expected to encourage and facilitate [data] sharing" (National Science Foundation 2011); and the National Institutes of Health suggest that "data should be made as widely and freely available as possible while safeguarding the privacy of participants, and protecting confidential and proprietary data" (National Institutes of Health 2003). An increasing number of scientific journals also require that researchers share the data underlying their published articles. In 2011, a group of journals in the field of evolution coordinated to implement the Joint Data Archiving Policy requiring authors to publish the data underlying their publications (Dryad Digital Repository 2011), and other scientific journals have followed suit, including PLOS journals (PLOS 2014) and the Committee of Medical Journal Editors (Taichman et al. 2017).
Researchers share their data in multiple ways: as supplementary material to published articles, as downloads on institutional or personal websites, through archiving in data repositories, or by sharing data "upon request," that is, in response to inquiries from other researchers (Kim and Stanton 2016; Tenopir et al. 2015; Wallis, Rolando, and Borgman 2013). The 2016 FAIR Data Principles propose that beyond just being shared, data should be Findable, Accessible, Interoperable, and Reusable (Wilkinson et al. 2016). From a data stewardship perspective, and in order to best support FAIR data, sharing via data repositories allows for the most reliable long-term discovery, access, and preservation for shared data (Witt 2008; Poole 2015; Kim and Zhang 2015). Data repositories also integrate into the scholarly communication ecosystem, supporting data citation practices for data creators (Nature Biotechnology 2009; Fenner et al. 2019). Therefore, data repositories are often the preferred method for data sharing, as stated, for example, in the PLOS data sharing policy (Federer et al. 2018). As of May 2020, the Registry of Research Data Repositories (re3data.org) has indexed 1068 unique repositories in the United States (Registry of Research Data Repositories 2020). These data repositories can be categorized into four key types (Pampel et al. 2013), one of which is disciplinary research data repositories that archive data in specific formats or from specific subjects, e.g. GenBank (Benson et al. 2013).
Data repositories are still a relatively new development in scholarly communication, and their infrastructure and metadata are far less standardized than in scientific journals. For instance, data repositories don't always require that depositors add institutional affiliation, and metadata are also often entered by the depositor, rather than entered in a standardized way by professional catalogers (Marcial and Hemminger 2010).
In 2010, Marcial and Hemminger also identified preservation as an issue; only 62% of the data repositories they surveyed had "a clear mention of a preservation policy or similar" (2038). However, an increasing number of data repositories are now certified under initiatives such as the CoreTrustSeal Trustworthy Data Repositories Requirements, a set of standards for data stewardship certifying that repositories support healthy infrastructure and long-term preservation (CoreTrustSeal 2020). Additionally, the TRUST Principles can help repositories become more trustworthy data stewards, and help researchers select a trustworthy repository for data sharing (Lin et al. 2020).
While data repositories are increasingly focusing on long-term data stewardship, they still have room to grow in terms of promoting discovery for their resources. A 2017 study of natural resources and environmental scientists found that "while institutional repositories were commended by interviewees for providing permanent archiving and long-term preservation, for supporting storage and download, and for ensuring accessibility and credibility… [they were] not particularly valued for searchability and discoverability" (Shen 2017, 120). While efforts have been made to improve discovery for institutional repositories (Arlitsch and O'Brien 2012), Mannheimer, Sterman, and Borda (2016) find that research data are discovered and reused most often if they are: (1) archived in disciplinary research data repositories; and (2) indexed in multiple online locations.
An increasing number of recent projects focus on indexing data in repositories, including the NIH-funded DataMed (Chen et al. 2018), which uses the DATS suite of tags to support automatic indexing of scientific datasets (Sansone et al. 2017); SHARE, which cooperates with institutional repositories to use "a schema-agnostic approach" to metadata aggregation (Hudson-Vitale et al. 2017, sec. 1, para. 6); Elsevier DataSearch, which uses a two-tiered word embedding analysis to match natural language queries and a formal ontology assignment (Scerri et al. 2017); and Google Dataset Search, which uses Schema.org as a unifying metadata schema and which came out of beta in 2020 (Noy 2020). However, dataset indexing projects such as these may not reveal all available research data. Some research data cannot be published openly in data repositories, either because the research is still in progress, or because the data are sensitive in nature. This has motivated the creation of data catalogs that include restricted data. Notable projects are NYU Langone Health Sciences Library's Data Catalog (Lamb and Larson 2016) and its fellow members of the Data Discovery Collaboration (formerly the Data Catalog Collaboration Project) (Read et al. 2018).
The Montana State University (MSU) Library aims to bring together ideas from each of the projects described above, as well as some innovations, to encourage discovery and reuse of datasets from MSU researchers.

The Montana State University Dataset Search
Montana State University (MSU) is a mid-sized university. In the 2019-2020 academic year, the university had 16,766 students (Montana State University 2020) and 56 library employees (MSU Library 2020). In 2019, MSU Library joined Dryad Digital Repository as an institutional member to support trustworthy, long-term preservation for research datasets at our institution. This allows us to focus our local efforts on research data curation and discovery. As part of these efforts, we built a Dataset Search tool to support discovery, access, and reuse for research datasets from our institution. We have also partnered with the Center for American Indian and Rural Health Equity (CAIRHE 2020) to support a pilot effort to manually produce metadata records for restricted datasets that can be accessed by contacting the Center. Indexing these datasets supports research transparency and data discovery and access for the Center's community stakeholders.
MSU Dataset Search complements existing data discovery efforts by indexing and creating metadata records for data in repositories, showcasing the data created at our institution through a visualization dashboard, as well as by creating metadata records for restricted data. MSU Dataset Search also adds three innovative features to these efforts. First, Dataset Search brings an institutional focus to the automated collection of metadata from third-party data repositories; automated metadata collection allows the index to be populated with metadata for local research datasets with less manual effort from library employees and therefore less resource expenditure from the institution. Second, Dataset Search is optimized for commercial web search engines, which supports discovery of MSU datasets on the open web. Third, Dataset Search automatically generates new descriptive metadata for individual datasets using external topic mining of scholarly profile sources like ORCID and Google Scholar Profiles.

Building the Tool
To begin building the Dataset Search tool, the team needed to understand how to identify datasets that had been published by researchers at our institution. Centering this question led us to also think about how we could construct the tool to allow other institutions to apply the software. In moving from our specific use case to a broadly applicable model, five components became core features of the application: a streamlined browse and search interface, a data model based on dataset discovery, a harvesting process for finding and vetting datasets stored in external repositories, an administrative interface for managing the creation, ingest, and maintenance of dataset records, and a dataset visualization interface to demonstrate how data is produced and used by our researchers. These components are discussed in more detail below.

Browse and search interface
The need for an interface to allow for search and retrieval was a primary consideration. The team wanted a clean interface that made it easy for users to search, browse, and access datasets in external repositories. In the section "Lessons learned and continued challenges," we further discuss the particular challenge of designing the interface and our work with a designer to come up with primary actions for the application. These discussions helped us isolate the fundamental user experience; our team focused on helping users identify the purpose of the application, find a particular dataset, and then link from the metadata in our system to the repository where the dataset is stored. These core actions define the primary interface. The visual layout for the Dataset Search landing page can be seen in figure 1. A user can quickly recognize that the purpose of the page is to search for datasets. The search box and list of recent datasets are calls to action that suggest possible next steps and also indicate that the user is at the landing page of the Dataset Search application. The landing page clearly directs the user to search and browse through the system.
Beyond the landing page and search/browse results, a user is led into a view of item metadata that displays a title and description, a permanent identifier for the dataset, and a button linking to the actual dataset in an external data repository. The item page is the link between the local metadata record and the external repository that provides access to the dataset. The metadata on the item page also allows us to catalog MSU researchers and the types of data they produce.

Data model based on dataset discovery
The data model for the datasets was also essential. Our research revealed no shortage of metadata schemes to follow. We ultimately took our cues from the Google Dataset Search metadata, which applies the Schema.org web vocabulary. This structured data vocabulary is a widely adopted standard, and it sets up a series of types and properties to describe the datasets with a goal of indexing for discovery in commercial search engines (Schema.org 2020a). The overarching goal of discovery suited our needs, but there were times where the data model needed some enhancement for administrative and technical metadata. Schema.org prioritizes the "aboutness" of the dataset, which leads to primary properties that help a person understand more about the content within the dataset. Properties like measurementTechnique (Schema.org 2020b) and variableMeasured (Schema.org 2020c) are just two examples of this "aboutness" prioritization within Schema.org. Within our data model, we made additions to support linked data identifiers and we added administrative properties like dataset_urlHash, recordInfo_recordContentSource, and dataset_conditionsOfAccess. An example of our primary entity table, a datasets table, is featured in figure 3. The figure presents the 'datasets' table as a SQL CREATE query, but it also demonstrates where parts of the discovery metadata are not enough. Access restrictions (dataset_conditionsOfAccess), dataset sources (recordInfo_recordContentSource), and methods for deduplicating datasets (dataset_urlHash) were added to supplement and build metadata to support technical and administrative tasks.
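To make the role of the three administrative properties more concrete, the following is a minimal sketch of how a table of this kind might be declared. It is not the project's actual CREATE query: it uses Python's built-in sqlite3 module for convenience, and only dataset_urlHash, recordInfo_recordContentSource, and dataset_conditionsOfAccess come from the text. Every other column name, the column types, and the UNIQUE constraint are assumptions made for illustration.

```python
import sqlite3

# Illustrative only. Columns marked "hypothetical" are not taken from the paper.
CREATE_DATASETS = """
CREATE TABLE datasets (
    dataset_id INTEGER PRIMARY KEY,       -- hypothetical surrogate key
    dataset_name TEXT NOT NULL,           -- hypothetical; discovery metadata
    dataset_description TEXT,             -- hypothetical; discovery metadata
    dataset_url TEXT,                     -- hypothetical; link to the repository
    dataset_urlHash TEXT UNIQUE,          -- deduplication key (named in the text)
    recordInfo_recordContentSource TEXT,  -- dataset source (named in the text)
    dataset_conditionsOfAccess TEXT       -- access restrictions (named in the text)
)
"""


def open_demo_database():
    """Create an in-memory database holding an empty datasets table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(CREATE_DATASETS)
    return conn


def insert_dataset(conn, record):
    """Insert one record; return False if its hash is already stored."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO datasets (dataset_name, dataset_url, dataset_urlHash, "
                "recordInfo_recordContentSource, dataset_conditionsOfAccess) "
                "VALUES (:name, :url, :url_hash, :source, :access)",
                record,
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Declaring the hash column as unique is one way to let the database itself reject a record that has already been harvested; an application-level check would work equally well.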
To enrich our data model, we have chosen to provide as much information about authors as possible. Currently, this means we use a Python script to scrape the keywords that MSU faculty have posted on their Google Scholar profiles (Google 2020a). The script then cross-references each keyword with Wikidata (Wikidata 2020) and retrieves the relevant machine label. Keyword and machine label are then stored side by side in the database. This means authors can be linked via their interests or professional skills, and their published works can be found in a single query of our database.
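The reconciliation step can be sketched as follows. This is not the project's script: it is a minimal example that assumes the public Wikidata wbsearchentities API is used for the lookup and that the first search result is accepted as the match, which a real workflow would likely review by hand. The function names and the User-Agent string are hypothetical, and the profile-scraping step is not shown.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_search_url(keyword, language="en"):
    """Build a wbsearchentities request URL for one keyword."""
    params = {
        "action": "wbsearchentities",
        "search": keyword,
        "language": language,
        "format": "json",
        "limit": 1,
    }
    return WIKIDATA_API + "?" + urllib.parse.urlencode(params)


def pick_top_match(keyword, response):
    """Return (keyword, entity id, entity URL), or Nones when nothing matched."""
    results = response.get("search", [])
    if not results:
        return (keyword, None, None)
    top = results[0]
    return (keyword, top["id"], top.get("concepturi"))


def fetch_json(url):
    """Fetch a URL and decode the JSON body (performs a network request)."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "dataset-search-example/0.1"}  # hypothetical
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def reconcile(keyword, fetch=fetch_json):
    """Look up one keyword; `fetch` can be replaced by a stub for offline use."""
    return pick_top_match(keyword, fetch(build_search_url(keyword)))
```

Storing the returned entity identifier next to the original keyword gives the side-by-side pairing described above, and keeping the network call behind a replaceable `fetch` argument makes the matching logic easy to check without contacting Wikidata.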

Harvesting process for finding and vetting datasets
In building the tool, the team also set requirements around harvesting and vetting datasets for inclusion in the MSU inventory. This was in many ways the central organizing principle for the application. We needed to create a software process to search multiple, external dataset repositories and identify datasets that are affiliated with MSU research or produced by MSU researchers. A number of options from web scraping of search result pages to Application Programming Interface (API) querying were considered. Our team settled on API querying as it allowed an explicit contract between our application and the external dataset repositories as well as a structured data response that we could write a software process to consume.

Figure 4: Example XML mapping for an individual API
Dataset Search JeSLIB 2021; 10(1): e1189 https://doi.org/10.7191/jeslib.2021.1189
Currently, the MSU Dataset Search tool has functionality for storing XML feeds or API responses that are available for consumption from data repositories. When a feed is selected and added to the application, a PHP script breaks down the feed and determines the repeating tag used to store entries. There are no formal guidelines for how these repositories structure their feeds, so there are not any normalized naming schemes we can rely on. However, the repeating tag will always be the tag in a feed with the highest product of the number of times it appears and the number of children it has. With the help of a curator using an HTML form, we can identify the tags in the feed as we have named them in our database and form an XML map of the feed.
Using the extracted XML map, we can traverse any feed according to its structure and auto-populate records to be inserted into the database. Should a feed ever change, we can either update the file containing the XML map, or re-add the feed and the script will find the corresponding tags again. By automating this process, we can handle a variety of different feed structures and tag naming conventions.
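The repeating-tag heuristic and the map-driven traversal can be sketched in a few lines. The project's implementation is a PHP script; the Python below is only an illustration of the same idea. It assumes that "number of children" means the largest number of direct children seen on any instance of a tag, and the sample feed, its tag names, and the field names in XML_MAP are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def find_repeating_tag(root):
    """Return the tag with the highest (occurrences x child count) score."""
    occurrences = Counter()
    most_children = Counter()
    for element in root.iter():
        occurrences[element.tag] += 1
        most_children[element.tag] = max(most_children[element.tag], len(element))
    return max(occurrences, key=lambda tag: occurrences[tag] * most_children[tag])


def harvest(xml_text, xml_map):
    """Turn each repeating entry into a dict keyed by our own field names."""
    root = ET.fromstring(xml_text)
    entry_tag = find_repeating_tag(root)
    records = []
    for entry in root.iter(entry_tag):
        record = {}
        for field, feed_tag in xml_map.items():
            value = entry.findtext(feed_tag)
            if value is not None:
                record[field] = value.strip()
        records.append(record)
    return records


# Hypothetical feed: three <item> entries, each with three children.
SAMPLE_FEED = """<feed>
  <updated>2020-05-01</updated>
  <item><heading>Snowpack depth</heading><href>https://example.org/1</href><author>A. Researcher</author></item>
  <item><heading>Soil cores</heading><href>https://example.org/2</href><author>B. Researcher</author></item>
  <item><heading>Elk counts</heading><href>https://example.org/3</href><author>C. Researcher</author></item>
</feed>"""

# Hypothetical XML map: our field name -> the tag this feed happens to use.
XML_MAP = {"dataset_name": "heading", "dataset_url": "href", "dataset_creator": "author"}

records = harvest(SAMPLE_FEED, XML_MAP)
```

In this sample, `item` appears three times with three children each, so it outscores the root element, which appears once with four children. Feeds that use XML namespaces would need the namespace included in the mapped tag names.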
Beyond the initial querying and harvest of our datasets through the APIs, we needed a way to vet and deduplicate our dataset records. The team settled on a deduplication string that is currently a combination of the dataset title, link, description, creator, pubDate, and uid (if they are set). This is then used to create the dataset_urlHash which is a unique identifier that we can check against to verify if we have already harvested a dataset record. The team is encouraged by the results here as it allows us to automatically check for duplicates and has increased the efficiency of our ingest process.
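A deduplication key of this kind can be sketched as below. The list of fields comes from the text; the separator character, the choice of SHA-256, and the function names are assumptions, since the hashing method is not specified here.

```python
import hashlib

# Fields named in the text; each is used only if it is set on the record.
DEDUP_FIELDS = ("title", "link", "description", "creator", "pubDate", "uid")


def dataset_url_hash(record):
    """Join the fields that are set and hash the resulting string."""
    parts = [str(record[field]).strip() for field in DEDUP_FIELDS if record.get(field)]
    dedup_string = "|".join(parts)  # separator is an assumption
    return hashlib.sha256(dedup_string.encode("utf-8")).hexdigest()


def is_duplicate(record, seen_hashes):
    """Report whether the record was seen before, remembering it if not."""
    digest = dataset_url_hash(record)
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False
```

Because unset fields are skipped rather than hashed as empty values, a record harvested with a missing description produces the same key as one harvested with an empty description.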

Administrative interface
With the data model and harvesting in place, we needed a secondary interface that would allow us to manage the data. We constructed a series of web forms to enable harvesting, adding, updating, and deleting of metadata. The administrative interface also includes our harvesting routine for automatically populating our dataset records from external sources. This view is an editing table that pulls in data from these external sources and then allows a curator to review or accept a dataset as a record for MSU Dataset Search. The view below shows the table as it is being populated. The harvesting view also allows the curator to control the amount of metadata that is visible and helps create a minimum viable metadata record that provides our catalog with an automated routine for data entry.
The administrative interface can also be used to manually create metadata records. Partnering with the Center for American Indian and Rural Health Equity (CAIRHE), our team has created pilot records in the system to promote discovery for restricted datasets. Instead of linking to the dataset in an external repository, the system directs users to contact the Center to request the dataset.

Visualization interface to demonstrate how data is produced and used
Part of our goal in creating the Dataset Search was to showcase research and research data at our institution. Our team considered how public dashboards could help shape different views and understandings of our dataset inventory, providing quick snapshots, trends, and analysis of the datasets in the application. These data dashboards are currently in development. There are a variety of fields that we capture within our database that allow a user to filter metadata. To visually capture this, we have a series of queries that will display current data as infographics using D3.js, a JavaScript visualization library. We are working to prototype dashboard landing pages unique to each field a user may want to filter on, such as author, college, department, affiliation, keyword, creator type, repository, published date, and modified date. Each page will have a different set of queries for each infographic to display relevant information. We are working to create snapshot visualizations that are suited to each type of data. For example, date dashboards will include a line graph over time, and a department-specific dashboard may show intradepartmental and interdepartmental collaborations. As we finalize the work here, we'll consider how these dashboards work best for our users and how we might integrate visualizations into the next software release.

Lessons Learned and Continued Challenges
As has been noted in our review of the literature, the dataset repository landscape is new and dispersed, and the metadata describing datasets in these repositories is limited, especially when looking to identify a dataset creator and their affiliation to an institution. Frequently, our team had to work through researcher disambiguation and understanding the researcher's connection to our university as we turned toward large-scale aggregation and harvesting of datasets. While this initially slowed down parts of our work, we ultimately created some viable solutions to identifying our datasets and the work of our researchers. We arrived at a three-pronged strategy for identifying and enhancing metadata for MSU datasets.
First, we looked to survey metadata records for fields that potentially indicated a connection or loose affiliation with Montana State University research. In most cases, our work involved isolating metadata fields that suggested the sources of the dataset. Most of this work was done through manual searches (i.e., a person running searches) to understand how datasets were described and indexed by external data repositories. This work also allowed us to understand coverage of our MSU research and to find the source dataset repositories with the best representation of our research data for the automated work in our next two steps. Second, we query the source repositories for potential matches using the APIs' keyword and subject searching functions. We do this by querying each API with several different queries, including "Montana State," "Montana State University," and "MSU." Third, because many data repositories do not log the institutional affiliation of authors, the team looked to identify MSU researchers by going to one of the primary sources of institutional data, the MSU Office of Planning and Analysis (OPA). Most universities will have an institutional data and statistical body that collects and records student enrollment data, faculty numbers, research hires, etc. In our case, we met with OPA to describe our use case for the data and reasons behind the Dataset Search application, and they agreed to provide us with an annual list of names for all tenure-track and non-tenure-track faculty. We used this list to query data repository APIs for each individual name. As metadata records were returned, we could use cues from the metadata to attempt to disambiguate the names. For instance, if the researcher was in the Plant Sciences Department at MSU, it was unlikely that they would conduct social science research. Human curators also play an important role in disambiguation.
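The second and third steps amount to generating one search request per institution variant and per faculty name. A minimal sketch of that query generation is shown below. The endpoint URL and the `q` parameter are hypothetical placeholders, since each repository API defines its own search syntax, and wrapping each query in quotation marks for phrase matching is likewise an assumption.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameter name; real repository APIs differ.
SEARCH_ENDPOINT = "https://repository.example.org/api/search"

# Institution-level queries quoted in the text.
INSTITUTION_QUERIES = ["Montana State", "Montana State University", "MSU"]


def build_queries(faculty_names):
    """Institution queries first, then one per faculty name, without repeats."""
    queries = []
    for query in INSTITUTION_QUERIES + [name.strip() for name in faculty_names]:
        if query and query not in queries:
            queries.append(query)
    return queries


def build_search_urls(faculty_names):
    """Build one phrase-search URL per query."""
    return [
        SEARCH_ENDPOINT + "?" + urlencode({"q": '"' + query + '"'})
        for query in build_queries(faculty_names)
    ]
```

The records returned for each URL would still need the metadata cues and human review described above, because a name match alone does not establish an MSU affiliation.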
Even as we started to see success with our strategies for identifying MSU datasets, we also noted a need to build ways to enrich the harvested metadata and to help standardize the metadata. Our API calls were successfully identifying MSU datasets, but the amount and types of metadata returned were sometimes limited and in need of some cleaning up. We saw many of these metadata limitations in the descriptive keywords and subjects for the datasets. We could do much of the manual standardization and cleaning up of records using our administrative interface within MSU Dataset Search. However, we wanted to enhance the subjects and keywords to refine and build out a better level of description. To do this, we harvested keywords from Google Scholar profiles and reconciled those keywords with linked data expressions. In this reconciliation process, we mapped the harvested keywords to Wikidata item entities so that each keyword was associated with a Wikidata URL. We used a Python script to carry out this harvest and reconciliation work; all of our code is openly available in a GitHub repository (Clark et al. 2020). Our working theory was that the keywords and subject terms in Google Scholar profiles were created by the researchers themselves and therefore represent the closest approximation of the type of research they produce and their preferred terms for describing themselves. Adding Wikidata linked data expressions also helps make these enhanced subjects and keywords available to machines to improve indexing via search engines.
Among the other lessons learned and challenges faced, the team needed to understand what a successful index of our datasets looked like. Would 80% of our dataset output provide enough scope and a working inventory of data production at our institution? The completeness of the index was a quantifiable element that we needed to reconcile. We ultimately understand that our index likely won't be a comprehensive list of datasets from MSU researchers. False positives are common when querying data repositories for full names, and we also anticipate that we are not finding all datasets that have been published by MSU researchers. In the future, we can help reduce false positives and increase completeness by integrating ORCID with our tool, and by using CrossRef and DataCite DOI metadata to connect datasets with any associated publications that include institutional affiliation.
Dataset Search should also be findable in external environments like commercial search engines and Google Dataset Search. We noted above how our data model was predicated on metadata fields for discovery settings, like commercial search engines, and how this focus forced us to modify the data model to accommodate technical and administrative metadata. This was one of our first lessons learned, but there were other solutions that became part of this work. The team pursued what we started to call "architectures for findability," which led to particular patterns of markup for our datasets. We wanted to allow for machine processes and intelligent software agents to discover and understand our datasets and we wanted our datasets to be indexed in Google Dataset Search. Working backward from these goals, we adhered to the best practices for dataset markup released and supported by Schema.org (Google Developers 2020b). In its simplest form, we included the dataset markup on individual dataset items as part of the HTML webpages. We also built an XML sitemap (Google Support 2020) that listed and identified our structured data markup for web indexing tools. We continue to monitor the success and return on these markup activities using validation tools like the Rich Results Testing Tool (Google 2020c), and analytics tools like Google Search Console (Google 2020b) to confirm correct markup patterns and understand the coverage and indexing rates of our datasets in search engines. Benchmarking the appearance of our dataset item records in repositories like Google Dataset Search and DataCite will provide additional insight here. This is a work in progress, but we have seen results for these markup activities in other library properties. A similar markup and indexing project for our library databases (Clark and Rossmann 2017) guides our work here. In that research, we saw increased traffic and organic search referrals based on markup and optimized search engine indexing routines. We are following the same model here and expect to see a similar increased visibility for our datasets.
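Dataset markup of the kind described above is commonly embedded in an item page as a JSON-LD script element. The following is a minimal sketch of generating such an element from a metadata record. It is not the project's code: the record keys and function names are hypothetical, and only a handful of Schema.org Dataset properties are shown.

```python
import json


def dataset_jsonld(record):
    """Map a local metadata record onto a Schema.org Dataset description."""
    markup = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": record["name"],
        "description": record["description"],
        "url": record["url"],
    }
    if record.get("identifier"):
        markup["identifier"] = record["identifier"]
    if record.get("creators"):
        markup["creator"] = [
            {"@type": "Person", "name": name} for name in record["creators"]
        ]
    if record.get("keywords"):
        markup["keywords"] = record["keywords"]
    return markup


def jsonld_script_tag(record):
    """Serialize the markup for embedding in an HTML page."""
    body = json.dumps(dataset_jsonld(record), indent=2)
    # Escape "</" so text inside the metadata cannot close the script element.
    body = body.replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + body + "\n</script>"
```

The output of a function like `jsonld_script_tag` can be pasted into a validation tool such as the Rich Results Testing Tool mentioned above to confirm that the markup is recognized as a dataset.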
And finally, our team wanted to find ways to streamline the user interface for dataset retrieval. We worked with user interface designer Lorraine Chuen (Chuen 2020) to create streamlined interfaces that are beautiful and usable. Chuen also helped conduct an expert review of the tool to streamline patterns of use and improve users' navigation through the system. Among the highlights of this work: a clean, simple design using MSU's institutional branding; improved scannability of the page by changing the layout and adding a "Details" panel for metadata; an "Access Dataset" button that clearly guides users away from our search interface to the data repository where they can access the dataset; and removal of administrative screens from public view. Chuen's design and expert review have led to a much improved interface.

Future Directions
We see several future directions for Dataset Search. First, we have not conducted large-scale user testing or other assessment of the tool. Next steps could include continuing to monitor our search engine optimization protocols to ensure that the tool is discoverable on the web; conducting user testing locally and updating the user interface in response to any remaining usability issues; installing Google Analytics to understand user traffic; and adding a contact form to the site to support direct user feedback.
After completing the pilot project providing discovery for restricted data records with the Center for American Indian and Rural Health Equity (described above), we may reach out to other research centers who would benefit from increased transparency by sharing metadata records for restricted data. Dataset Search metadata records could also support discovery of data from in-progress projects that are stored locally at MSU, thus encouraging new collaborations and accelerating scientific discoveries.
With our structured data activities and enhancements, we are also noting some new possibilities around sharing the datasets and reuse of the data. MSU Dataset Search has a default API that is under development, but it is not standardized or documented. The team recognizes that there is some useful work to be completed here and has begun looking at new API formats that could benefit the data community if implemented. A member of our team has been working with the Research Object Crate (RO-Crate) standards group to shape the emerging standard for use with datasets and to pilot a use case of RO-Crate. RO-Crate is a "lightweight approach to packaging research data with their structured metadata, rephrasing the Research Object model as Schema.org annotations to formalize a JSON-LD format that can be used independently of infrastructure" (Carragáin et al. 2019). More specifically, our team is looking to standardize the Dataset Search API using the RO-Crate standard, which would allow us to connect our API implementation to the broader work of the research objects community and help shape documentation and use of our API.
Dataset Search is built with open source code (Clark et al. 2020) and we have outlined a straightforward installation process; the front-end design is also customizable to match the branding of any institution. We therefore hope that Dataset Search will be adopted by other small- and mid-sized institutions that are looking for a lightweight tool to promote discovery and access for their local research data. As a member of the Data Discovery Collaboration (DDC 2020), the Dataset Search project benefits from alignment with other similar projects, and we will continue to pursue connections with the data discovery community and explore how the functionalities of the Dataset Search tool can be integrated with other data catalog infrastructures such as the NYU-developed Data Catalog software (Lamb and Larson 2016). Our automatic harvesting routine could also be integrated with data repository software such as Dataverse (Dataverse 2020).

Conclusion
As research data sharing grows, institutions are increasingly building initiatives that support discovery, access, and reuse for published data. Montana State University's Dataset Search is designed as a lightweight, open-source solution that supports discovery and reporting for research data created by researchers at our institution. The Dataset Search application provides five core features to support dataset discovery: a streamlined browse and search interface, a data model based on dataset discovery, a harvesting process for finding and vetting datasets stored in external repositories, an administrative interface for managing the creation, ingest, and maintenance of dataset records, and a dataset visualization interface to demonstrate how data is produced and used by MSU researchers. Dataset Search is designed to be easily customized and implemented by other institutions to improve search and discovery and therefore amplify the impact and benefits of research data.