Article Title
Data Curation in Practice: Extract Tabular Data from PDF Files Using a Data Analytics Tool
Article Type
eScience in Action
Publication Date
2021-08-11
DOI
10.7191/jeslib.2021.1209
Abstract
Data curation is the process of managing data so that it can be preserved and reused in ways that are FAIR (findable, accessible, interoperable, reusable). It is an important part of the research lifecycle, as researchers are often required by funders, or at least encouraged, to preserve their datasets and make them discoverable and reusable. This has become especially important as Open Access (OA) policies are implemented at many institutions across the nation. An efficient data repository and its data curation practices play key roles in facilitating research data discovery and reuse. In this article, we briefly discuss the local institutional repository at Penn State University and the general data curation practices we adopt for deposited files and datasets. We then focus on a data analytics tool that has recently been applied to extract tabular data from PDF files. This enhances our existing data curation practices by adding tabular data to deposits that contain PDF files, in which tables are often embedded and not easily reused.
Keywords
data curation, PDF, data extraction, tabular data, reusable, discoverable, institutional repository
Data Availability
The datasets analyzed during the current study are available at https://doi.org/10.7554/elife.44898.
Acknowledgments
We would like to thank the data curation team at the Penn State University Libraries for their discussions and support of this work; the Data Curation Network (DCN) for its training and shared expertise in research data curation; Ally Laird, Paulina Krys, and Tara Anthony from Research Informatics and Publishing at the Penn State University Libraries for feedback on the manuscript; and Dr. Keith C. Cheng from the Penn State College of Medicine for allowing us to use his research article to demonstrate the data extraction process with the data analytics tool.
Repository Citation
Choi AJ, Xin X. Data Curation in Practice: Extract Tabular Data from PDF Files Using a Data Analytics Tool. Journal of eScience Librarianship 2021;10(3): e1209. https://doi.org/10.7191/jeslib.2021.1209. Retrieved from https://escholarship.umassmed.edu/jeslib/vol10/iss3/10
Rights and Permissions
© 2021 Choi & Xin. This is an open access article licensed under the terms of the Creative Commons Attribution License.
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.