Preparing to Accept Research Data: Creating Guidelines for Librarians

Rutgers University Libraries have recognized the need to expand their current research data services into a well-documented and well-supported service available to the Rutgers research community. In 2005, Rutgers University Libraries created RUcore, Rutgers University Community Repository, which has served as the University’s formal repository for institutional scholarship, special collections, and Electronic Theses & Dissertations. With the impetus of the 2010 NSF directive for research data sharing and preservation, RUcore development was extended to accept research data content. Ingest of pilot data projects began in 2010 via a librarian-mediated process. In order to provide a better defined workflow and mission for research data services, in July 2014, the Rutgers University Librarian organized a Task Force to investigate the evaluation process for technical, legal, and confidential issues involved in research data acceptance, and to establish an administrative and evaluation framework for the deposit of research data. After a review of 35 repositories using 34 criteria, the Task Force drafted a plan for research data acceptance which proposes widespread acceptance of mediated data projects, and prepares for future self-deposit in an online interface. This paper will discuss the issues addressed by the Task Force: acknowledging ownership of data through an institutional data policy, preventing exposure of confidential or sensitive data, establishing a reconfigured data team, defining requirements for storage capacity and funding, creating a workflow which includes collaboration with research offices, and offering guidance for both researchers and librarians working with research data.
Correspondence: Laura B. Palumbo: laura.palumbo@rutgers.edu


Introduction
In 2013, the Office of Science and Technology Policy mandated that federal agencies with annual "research and development expenditures" of more than one hundred million dollars provide plans to increase public access to the direct results of the research they fund (Holdren 2013). This mandate followed the 2010 policy change by the National Science Foundation (NSF), which required researchers to submit a data management plan outlining how the data from their funded research would be managed, shared, and preserved (National Science Foundation 2010). As a result, researchers are seeking to comply with these new mandates through effective sharing practices which do not cause an "undue burden" of time or cost (Van den Eynden and Bishop 2014, 25).
Academic libraries have started to fill the demand for digital repositories, allowing their researchers' data to be discoverable, accessible, and preserved for the long term. Rutgers University Libraries have an established institutional repository, known as RUcore. In 2012 a pilot program of data acceptance into RUcore was begun. During the pilot, concerns arose about rights issues and intake practices, and in July 2014 the University Librarian appointed a Task Force on Research Data Implementation to create an administrative and evaluation framework for the deposit of research data in accordance with the Libraries' and the University's Strategic Plans. The Task Force was charged with the completion of 10 items to define the ongoing and efficient acceptance of research data.
These 10 items included an environmental scan of other institutions for administrative structures and evaluation processes for technical, legal, and confidential issues, which might serve as models. Task items also included consultation with the research offices at Rutgers University to determine how to best integrate with their workflows; data service staffing requirements; storage needs; and funding of data repositories. Based on the results of this research, the Task Force made recommendations for data team responsibilities, created evaluation guidelines and workflows for librarians, and developed questionnaires and forms for researcher use in the data deposit process.

Scan of Data Repositories
The Task Force completed a review of 35 repositories to assess their administrative structure and their evaluation processes for technical, legal, and confidential issues, in fulfillment of the first two task items of our charge. The repositories were evaluated based on the Association of Research Libraries Systems and Procedures Exchange Center (ARL SPEC) Research Data Management Services Kit 334, which "…surveys ARL member libraries on their activities related to access, management, and archiving of research data at their institutions" (Fearon et al. 2013). The final list of repositories analyzed which were accepting data at the time of our research is shown in Table 1.
From the SPEC Kit, the Task Force developed a set of 34 review criteria to analyze the Research Data Management (RDM) systems of the reviewed institutions, which were grouped into five categories: Research Data Management Services (RDMS); Data Archiving Services; RDM Service Staffing; Partnerships; and Research Data Policy. These criteria were reviewed based on publicly available information from the repositories' and libraries' websites, and the findings were summarized in an Interim Report by the Task Force, dated October 2014. Additional information was later sought from selected repository managers via phone conversations. Following are some of the summarized findings from our review, which we felt were most relevant to our data service. We discovered that:
• Many of the institutions reviewed provided research data management consulting, typically in data management plan preparations. This is an area to be leveraged to increase library visibility and to establish additional connections with research faculty.
• The majority of the repositories we analyzed were operated by libraries, and many worked in collaboration with outside units and offices such as the Office of the Vice President for Research, and the Office of Information Technology.
• The number of research data management service staff members was dependent on each institution's funding and culture. Staffing numbers ranged in size, from one or two staff to as many as 18 at one institution, and many of these had part-time responsibilities for research data services.
• Most repositories place the responsibility for the evaluation of data on the principal investigator. Only two of the reviewed repositories placed curation responsibilities exclusively with librarians, although others used teams including librarians.
• Most of the repositories we analyzed allowed self-deposit of data or self-deposit and mediated deposit.
• Data deposit agreements were common, and most shared a similar format. Depositors typically needed to agree that they were legally allowed to deposit the data for public access; that the data does not contain any personal or sensitive information; that the depositor holds the institution harmless from any liability incurred as a result of the deposit and public access of the data; and that the repository may enact certain described operations in order to provide for data discovery, maintenance, and preservation.
• Privacy and security issues were often addressed by agreements wherein the depositor stated that the data was free of any confidential or sensitive data; by stripping of identifying information; and in some cases encryption. Responsibility for the protection of confidential data was often placed with the principal investigator or the researcher depositor.
• Information about repository storage capacity was limited. Restrictions to file sizes and file types were more prevalent, with offerings ranging from 10 to 500 GB free of charge; and acceptance of most standard file types associated with open source and widely used proprietary software was common.
• Funding models for storage and preservation of research data have not been established for many repositories, although a few did provide information about costs of services, typically in the form of storage fees.
The above findings guided the completion of our charged tasks, and grounded our recommendations for a Libraries-sponsored data service.

Collaborators and Policies
Not surprisingly, the institutional offices most frequently found to be collaborating with their libraries with regard to research data are the Office of the Vice President for Research and the Office of Information Technology. We consulted with our research offices and Institutional Review Boards (IRB), to see how data deposit with the Libraries would fit with their existing workflows, and found that these offices were receptive to deposit of research data in our institutional repository. Of concern for some at Rutgers University Libraries was the lack of a University-level data policy from the research office, although our scan of repositories revealed that not all institutions have one, at least not one that is publicly visible. However, there were consistent similarities in the policies we found, and based on discussions with our research office we believe that a similar Rutgers University data policy will be forthcoming. We found that the commonalities in the best policies seem to be that the university owns the data; the principal investigator is the steward of that data and is responsible for complying with any restrictions or legal requirements; and that protocols exist in the event the PI leaves the institution.
Of the data policies we found, we considered four to be especially well written. These were from Johns Hopkins University, New York University, Ohio State, and University of Wisconsin-Madison. One of the most thorough data policies is from the Office of Research at Ohio State, and it covers definitions, policy details, ownership, collection and retention, data security, access, transfer in the event the primary investigator leaves, expert control, author disputes, and data access disputes. It is noteworthy that the principal investigator is responsible for the collection, management, and retention of research data for the periods required by the policy, for controlling access to research data, and for selecting the vehicle for publication or presentation of the data.

Security of Data
Security of confidential data was another concern within Rutgers University Libraries. During our review we found that in most cases, the various institutions rely on the IRB to clarify the requirements for the protection of sensitive data, and in several cases reiterate IRB procedures in a repository-related policy. Methods used to ensure protection of confidential or sensitive data were encryption (Universities of Indiana and Iowa), and de-identification. In cases where data are meant to be destroyed, all repositories surveyed that mention the destruction of data urge researchers to follow the protocols and requirements of granting agencies when destroying data. Several institutions highlighted the importance of off-site backup, and ICPSR indicates that off-site backups should be encrypted.
Although access to data may be controlled by the repository, Penn State, as do other institutions, places the responsibility for security of data and confidentiality with the primary investigator: "Typically, when research is funded by federal or nonprofit granting agencies, the data are owned by the institution receiving the grant. The primary researcher or scholar receiving the grant has the responsibility for storage and maintenance of the data, including the protection of confidential or sensitive information… Scholars and researchers have a moral and professional responsibility to ensure that confidential or sensitive data is stored and released in a way that protects research participants" (Office of the Vice President for Research at Penn State 2015).
Preparing to Accept Research Data JeSLIB 2015; 4(2): e1080 doi:10.7191/jeslib.2015.1080

Storage Capacities and Fees
Only two of the reviewed repositories provided information directly related to their storage capacities, and one seems to be an archive only. Stanford University provides information on their repository website about storage capacity, and is currently maintaining 174 terabytes (TB) of items in its holdings. Other universities and institutions did not share total capacity information, but concerns about storage space are evident in deposit policies and storage services.
A common limit for individual datasets is 10 gigabytes (GB); one institution limits deposits to 1 GB per dataset, and another limits project dataset sizes to 100 megabytes (MB). Many institutions offer to allow deposit of datasets larger than specified, but for a yearly storage fee.
Funding models were found at the data-specific, non-institutional repositories (ICPSR, Dryad, Odum); and among university repositories, a few specify limits to the free service offered. Two of the more generous with free space were the University of Edinburgh, which allows each researcher up to 500 GB of space without charge, and the University of Iowa, which offers free research data storage up to 3 TB. At Iowa, additional terabytes are available for purchase at $270/TB per year.
Some repositories were vague about costs, and where fees were mentioned at all, only indicated that fees may be assessed for complex or large projects. In cases where fees were established, some were not significant. Princeton charges $0.006/MB (or $6/GB) as a one-time charge. Berkeley charges $0.14/month for each GB stored. Unlike most, Purdue's funding model is well-defined; central university funding pays for the following free allocations: 10 GB for 3 years for trial projects, 1 GB for 10 years for a small publication, and 100 GB for 10 years for a grant-funded project or publication. Additional space is billed per GB on a yearly or 10-year basis. At the higher end of the spectrum, Johns Hopkins charges $1600 for a small collection, in part because it was designed from the start to become self-funding once initial grants ran out.
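To make the quoted rates easier to compare, the following minimal sketch computes what a dataset would cost over a retention period under three of the fee schedules named above. It is illustrative only: the function names are hypothetical, decimal units (1 GB = 1000 MB) are assumed because the source does not state which convention each institution uses, and the Iowa calculation assumes fractional terabytes are billed proportionally, which the source does not confirm.

```python
from decimal import Decimal

# Assumption: decimal units, consistent with the quoted "$0.006/MB (or $6/GB)".
MB_PER_GB = 1000


def princeton_one_time(size_gb):
    """One-time charge at the quoted rate of $0.006 per MB."""
    return Decimal("0.006") * MB_PER_GB * Decimal(size_gb)


def berkeley_total(size_gb, years):
    """Total charge at the quoted rate of $0.14 per GB per month."""
    return Decimal("0.14") * Decimal(size_gb) * 12 * years


def iowa_total(size_tb, years):
    """Total charge with 3 TB free and $270 per additional TB per year.

    Assumption: storage above the free allocation is billed proportionally.
    """
    billable_tb = max(Decimal(0), Decimal(size_tb) - 3)
    return billable_tb * 270 * years


if __name__ == "__main__":
    # A 500 GB project retained for 10 years.
    print("Princeton (one-time):", princeton_one_time(500))
    print("Berkeley (10 years): ", berkeley_total(500, 10))
    print("Iowa (10 years):     ", iowa_total("0.5", 10))
```

Under these assumptions, a one-time charge can cost less over a long retention period than a modest monthly rate, which is one reason the billing basis matters as much as the headline price.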
The Task Force concluded that data could be accepted by Rutgers University Libraries initially without fees for projects up to 100 GB, but that larger projects would be accepted on a case-by-case basis. Some funding can be achieved by establishing fees for additional storage capacity, which can be passed on to funders by incorporating them into grant proposals. However, because storage is relatively inexpensive, this probably will not be a major source of income. If the Libraries can establish research data acceptance as a core service, funding could be provided through budgeting from departments who would benefit from this service.

Data Service Staffing
In order to address the unique needs of research in various disciplines, Research Data Management Services (RDMS) frequently include staff members from stakeholder groups across the institution. Libraries, research offices, and IT departments are the organizational units most often involved in the provision of RDMS. Institutions with larger numbers of staff are assisting researchers in all phases of the data lifecycle, and those with smaller numbers often are only providing basic guidance for research data management plans. Funding and culture appear to play major roles in the staffing of RDMS. It is important to note that, although we were able to determine the number of staff associated with RDMS at many institutions from a preliminary review of their websites, we could not determine how many staff have RDMS as their primary job responsibility and how many contribute only small portions of their job portfolio to RDMS-related tasks.
Of the repositories we reviewed, nine employed between seven and 10 staff members, and five had as many as 15 to 18 staff members. The remaining institutions reviewed did not provide specific staff information; however, they did provide a centralized contact for researchers to ask questions or to schedule a consultation. The staff positions, where information was available, included but were not limited to data management librarians, subject specialists, business managers, and the staff who create and maintain all technical resources for both library and IT services. Their job duties included data storage and data migration tasks, verifying legal information, conducting financial activities, ensuring data security, creating and assigning metadata, project management, and data preservation.
The Task Force determined that the Rutgers University Libraries Research Data Team should consist of existing Libraries personnel, who are already well qualified for the review and acceptance of research data. The Team would be led by a full-time Data Manager, whose time would be one hundred percent attributable to the activities related to data acceptance in the institutional repository. The team would consist of two parts: a Core Data Team, who will be responsible for preliminary review of data projects and who will also serve as Project Managers when appropriate; and an Expanded Data Team, who will act as Project Managers and oversee data projects to their completion. In addition to the Data Manager, six Libraries' personnel were identified to serve on the Core Team. An additional eight members were recommended for inclusion in the expanded team.
It was anticipated that issues with rights, commercialization, sensitive information, or other legal issues would be best referred to appropriate personnel, who although not part of the Data Team, would work with the Data Team and the researcher to resolve these issues. The Task Force identified the Copyright Librarian, the Repository Collection Librarian, the Office of Technology Commercialization, and the Institutional Review Boards as potential collaborators or consultants.

Guidelines and Workflows
The primary goals of the Task Force were the completion of guidelines for the acceptance of research data, for both researchers and the librarians who would be working with them; and workflows which would chart the path of data deposit. In 2012, Rutgers University Libraries rolled out its research data portal, RUresearch, and began accepting a variety of research data. As a result of that experience, it was determined that a better-defined and structured approach to acceptance of research data would be beneficial. The Task Force worked to explicitly define workflows that the Rutgers University Libraries Research Data Team would use to work with researchers for mediated data deposit, under the guidance of a Project Manager designated for each data project.
It was also determined that the initial implementation of mediated ingest would consist of data projects that do not involve human or animal subjects or commercial interests, and that are typically less than 100 GB in data volume per project. By limiting our initial acceptance of data projects in this way, we hoped to take data without complications due to rights and privacy issues, which would delay acceptance and ingest. Data projects outside the guidelines for the initial implementation, such as those with human subjects, would be considered in the full implementation of research data services, or on a case-by-case basis as a special data project. We envision development of a full implementation of data acceptance that would allow researchers to self-deposit data, in addition to providing mediated deposit when necessary. A time frame for the full implementation of self-deposit services has not been established, but we anticipate that this could occur after one to two years of mediated ingest, provided that Rutgers University Libraries technical resources could be allocated to the creation of an online interface and any necessary infrastructure modifications, and that any legal implications of sharing data would be resolved.
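The screening criteria for the initial implementation can be expressed as a simple check. The sketch below is one hypothetical way to encode them, not the Task Force's actual intake tool: the `DataProject` fields and the function name are invented for illustration, and treating exactly 100 GB as acceptable is an assumption, since the text says only "typically less than 100 GB". Projects that fail the screen are not rejected; they would be referred for case-by-case consideration as special data projects.

```python
from dataclasses import dataclass

# Size ceiling for the initial implementation, taken from the text above.
INITIAL_SIZE_LIMIT_GB = 100


@dataclass
class DataProject:
    """Hypothetical record of a researcher's answers to the intake questions."""
    title: str
    size_gb: float
    human_subjects: bool = False
    animal_subjects: bool = False
    commercial_interests: bool = False


def screen_for_initial_intake(project):
    """Return (eligible, reasons) for the initial mediated-ingest phase.

    An empty list of reasons means the project fits the initial guidelines;
    otherwise the reasons explain why it needs case-by-case review.
    """
    reasons = []
    if project.human_subjects:
        reasons.append("involves human subjects")
    if project.animal_subjects:
        reasons.append("involves animal subjects")
    if project.commercial_interests:
        reasons.append("has commercial interests")
    if project.size_gb > INITIAL_SIZE_LIMIT_GB:
        reasons.append(f"exceeds {INITIAL_SIZE_LIMIT_GB} GB")
    return (not reasons, reasons)


if __name__ == "__main__":
    survey = DataProject("Campus survey", size_gb=250, human_subjects=True)
    print(screen_for_initial_intake(survey))
```

Returning the reasons along with the decision gives the Project Manager something concrete to discuss with the researcher when a project falls outside the initial guidelines.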
We proposed acceptance of research data that was the result of unfunded as well as grant-funded research, to allow for a broad spectrum of research areas to be included; however, projects which require data deposit to comply with funder mandates may be given preference. For grant-funded research, the Principal Investigator is the responsible party for the data from that grant, but in order to include those responsible for non-funded research, the term Responsible Researcher is used to designate the project lead. The Task Force proposed that the Primary Responsible Researcher would be responsible for assuring that the data can be shared publicly in accordance with University policies, Federal and other funders' directives, and is in compliance with any legal restrictions. Through a deposit agreement, they would attest that by sharing the data they will not be in violation of any confidentiality agreements, copyright laws, or other laws, and will hold Rutgers University Libraries harmless from any damages resulting from the sharing or misuse of the data.
The Task Force created research data service guidelines and separate high-level criteria intake questionnaires for the initial acceptance of mediated data projects, and for the full implementation of data acceptance, which also includes self-deposit. The questionnaires for each stage of data acceptance seek to ensure that the requirements of the guidelines are met, and are to be signed by the Responsible Researcher. Once the questionnaire has been completed and it has been determined that the high-level criteria are met, an application form is completed by the Responsible Researcher to establish a minimum amount of metadata. During mediated data deposit, the questions would be asked of a researcher by the appropriate member of the Rutgers University Libraries Research Data Team, and/or a subject liaison. Once the project application is complete, the Responsible Researcher would sign a data deposit agreement, allowing Rutgers University Libraries to accept the data (see Appendices for the Guidelines and Application Form; additional questionnaires were created but are omitted for the sake of space). The data deposit agreements reviewed during our environmental scan of data repositories typically state that the Responsible Researcher is responsible for ensuring that they are legally allowed to deposit the data for public access; that the data does not contain any personal or sensitive information; that the depositor holds the institution harmless from any damages incurred as a result of the deposit and public access of the data; and that the repository may enact certain described operations in order to provide for data discovery, maintenance, and preservation. The Task Force has recommended that a deposit agreement be prepared which will align with a University data policy, when such a policy is adopted, subject to the review of University Counsel.
To reiterate, we found that most data policies assert that the University owns the research data; that the Principal Investigator or Responsible Researcher is the custodian of that data; and that the data would remain with the institution should the Responsible Researcher leave.
During the proposed full implementation of data services, data deposit may be automated as well as mediated. Mediated data deposit will still be an option for researchers needing assistance, and for projects which are very large, complex, or which would require infrastructure modifications, i.e., a special research data project. For self-deposited data, the forms would be online and would require NetID authentication and an electronic signature. Guideline questions would be affirmed by the researcher, preliminary metadata entered, and the deposit agreement accepted by the depositor. This self-deposit process would include a brief waiting period before data would be made visible, during which time the Rutgers University Libraries Research Data Team would perform a cursory review of documents and data files. The Team will check that descriptive documentation is provided in the form of a "README" file or codebook, so that researchers will be able to understand and use the data files; that file names are not nonsensical; that the file types can be accepted into RUcore; that the files can be opened and read in the appropriate application; that sufficient supplementary documentation, such as codebooks or questionnaires, is provided; and that any URLs are persistent. This brief review for completeness should take no more than five working days. After that time the researcher will be notified of the acceptance of the project, and of the name of the Rutgers University Libraries Project Manager who will become the primary contact for questions concerning the data project.
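Parts of this cursory review lend themselves to a scripted first pass. The sketch below covers three of the checks (presence of a README or codebook, acceptable file types, and plausible file names) for a list of deposited file names. Everything specific in it is an assumption made for illustration: the accepted extensions are a hypothetical placeholder rather than RUcore's actual list, and the file-name pattern is a crude heuristic for "not nonsensical". Whether files open correctly and whether URLs are persistent would still need separate checks or human review.

```python
import re
from pathlib import PurePath

# Hypothetical placeholder; the source does not list RUcore's accepted types.
ACCEPTED_EXTENSIONS = {".csv", ".txt", ".pdf", ".xml", ".json"}

# Matches documentation files such as README.txt or survey_codebook.pdf.
DOC_PATTERN = re.compile(r"readme|codebook", re.IGNORECASE)

# Crude heuristic: a name starts with a letter and has at least three
# characters drawn from letters, digits, underscores, and hyphens.
NAME_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_-]{2,}$")


def review_deposit(filenames):
    """Return a list of issues found in a deposit's file names."""
    issues = []
    if not any(DOC_PATTERN.search(name) for name in filenames):
        issues.append("no README or codebook found")
    for name in filenames:
        path = PurePath(name)
        if path.suffix.lower() not in ACCEPTED_EXTENSIONS:
            issues.append(f"{name}: file type not on the accepted list")
        if not NAME_PATTERN.match(path.stem):
            issues.append(f"{name}: file name may not be descriptive")
    return issues


if __name__ == "__main__":
    for issue in review_deposit(["1.xyz", "soil_samples_2014.csv"]):
        print(issue)
```

A script like this would only flag candidates for attention; the decision to accept or return a deposit would remain with the Research Data Team.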
For both mediated and self-deposited data in RUcore, we propose that the responsibility for compliance with any legal restrictions would lie with the Principal Investigator/Responsible Researcher. They would assume responsibility for determining if their data is free from any copyright or intellectual property constraints, sensitive or confidential information, any restrictions on public accessibility, or any other legal and ethical issues which might prevent their depositing and sharing the data publicly. However, in order to allay concerns over the existence of sensitive data, a method of automated scanning for identifying information should be investigated.
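As a starting point for such an investigation, the sketch below shows one simple form that automated scanning could take: pattern matching over the text of a deposited file. It is an assumption-laden illustration rather than a recommended method. The three patterns (email addresses, and number strings shaped like U.S. Social Security and telephone numbers) are examples chosen here, not categories named by the Task Force, and pattern matching of this kind will miss names, addresses, and indirect identifiers while also producing false positives.

```python
import re

# Illustrative patterns only; a real scan would need a much broader set.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn_like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone_like": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def scan_text(text):
    """Return a dict mapping each pattern label to the matches found."""
    hits = {}
    for label, pattern in PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[label] = found
    return hits


if __name__ == "__main__":
    sample = "participant,contact\n17,jane@example.org\n18,555-123-4567\n"
    print(scan_text(sample))
```

Any matches would be referred to the Responsible Researcher, who retains responsibility for confirming that the data are free of sensitive or confidential information.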

Guidance for Librarians
Rutgers University Libraries previously established a training course for subject librarians and other interested personnel titled "Supporting Faculty Research Data Needs" (Womack 2012). This course consisted of classes which covered data models, metadata and ontologies, preservation, copyright, the data lifecycle and project management. Guidelines have been drafted to better enable subject liaisons who have completed the training course to work directly with researchers in assisting with data deposit. If a subject liaison has not been trained, they will work with an appropriate Project Manager from the Rutgers University Libraries Research Data Team until such time as they are able to manage a data project without assistance. Project Managers will provide assistance to researchers with forms, and referrals to other personnel or offices if necessary, and enter metadata into RUcore. These guidelines for project managers discuss information they will need to know about the researcher and their status at Rutgers University and as the responsible custodian of their data; issues concerning copyright, other rights and legal issues, and sensitive or confidential information of which they need to be aware; and storage, access, and other file-related concerns.

Conclusion
There has been a rapid advancement of academic libraries into research data services, in an effort to help researchers fulfill the requirements for public access to federally funded research. In addition to institutional repositories, data-specific repositories such as Dryad and ICPSR continue to grow. Academic libraries with institutional repositories see the opportunity to become part of the research workflow, and are actively promoting their research data services to their communities. Rutgers University Libraries are poised to offer comprehensive research data services to their institutional research community. The RUL Research Data Implementation Task Force sees the need to establish research data services as a core function of the Libraries.
Our review of institutional and data repositories found similarities in the way others are facing the challenge of providing research data services and in their research data policies. The Task Force assumed that a policy similar to those we reviewed would be adopted by Rutgers University, and allowed this to guide our thinking about data acceptance. We also found that most of the institutional and data repositories we reviewed offered self-deposit of data, or both self-deposit and mediated data deposit. The Task Force believes that we should create an efficient method of self-deposit of data as many of our peer and aspirant institutions have done, and as Rutgers University Libraries is already doing with scholarly articles.
In order to create a sustainable service, funding should be sought once the Libraries have begun to accept research data on a regular basis. The most logical source of this funding would be from the research offices, whose goal it is to help researchers obtain grants and comply with funder directives. Some additional funding can be achieved by establishing fees for storage capacity, which can be passed on to funders by incorporation into grant proposals. However, because storage is relatively inexpensive, this probably will not be a major source of income. Continued outreach to the research offices and Institutional Review Boards is needed to establish integration of data acceptance through the Libraries into their workflows, so that researchers are aware of the availability of our institutional repository for sharing and preservation of research data, and of the related services that the Libraries can provide, such as the preparation of data management plans and consulting on data projects.
Accepting research data into our institutional repository will leverage the expertise of the Libraries, and will allow us to establish deeper relationships with our research communities. It could also become a source of funding as a core service to researchers. However, research data services must be easy to use in order to be of value to time-pressed researchers, and to be seen as worthy of financial support. A balance should be sought between library-mediated guidance and staff time for metadata and ingest, which will make researchers' data discoverable and able to be of long-term value; and ease of use for researchers who are interested in efficient compliance with federal requirements. We believe that the establishment of appropriate and well-written research data policies, both at an institutional level and within the Libraries, and the creation of guided workflows and knowledge of the issues concerning rights and protection of sensitive information, will pave the way to a seamless and carefully considered online deposit process for research data in our Libraries' institutional repository.
The work of the Task Force was presented to Rutgers University Libraries' Cabinet in February 2015 and endorsed by this group. In preparation for data ingest, the immediate next steps include establishing a charge for the proposed Rutgers University Libraries' Data Team, and planning and executing outreach activities to researchers regarding the acceptance of research data.

Supplemental Content
Appendices A and B: An online supplement to this article can be found at http://dx.doi.org/10.7191/jeslib.2015.1080 under "Additional Files".

Disclosure
The authors report no conflict of interest.