Volume 74, Issue 9 | September 2023 | Pages 1124-1139
Egoitz Laparra, Alex Binford-Walsh, Kirk Emerson, Marc L. Miller, Laura López-Hoffman, Faiz Currim, Steven Bethard
https://doi.org/10.1002/asi.24809
Abstract
Natural language processing techniques can be used to analyze the linguistic content of a document to extract missing pieces of metadata. However, accurate metadata extraction may depend not only on linguistic content but also on structural issues such as extremely large documents, unordered multi-file documents, and inconsistency in manually labeled metadata. In this work, we start from two standard machine learning solutions to extract pieces of metadata from Environmental Impact Statements, environmental policy documents that are regularly produced under the US National Environmental Policy Act of 1969. We present a series of experiments evaluating how these standard approaches are affected by different issues derived from real-world data. We find that metadata extraction can be strongly influenced by nonlinguistic factors such as document length and volume ordering and that standard machine learning solutions often do not scale well to long documents. We demonstrate how such solutions can be better adapted to these scenarios, and conclude with suggestions for other NLP practitioners cataloging large document collections.
1 INTRODUCTION
Natural language processing (NLP) provides a powerful toolbox for data scientists and librarians dealing with digitized text documents. When documents are not accompanied by the metadata required to catalog them, NLP can be used to train models to extract such information automatically. Although some complex metadata extraction problems may require a great deal of expertise and engineering, many simple metadata extraction problems can be addressed with straightforward applications of supervised machine learning. Thanks to the broad ecosystem of machine learning libraries available nowadays, it takes only a few lines of code to train a bag-of-words based linear classifier or fine-tune a pretrained transformer model for text classification. However, the application of NLP techniques in real scenarios has to face structural hurdles that degrade performance. Examples include documents of extremely large size, or documents split into multiple files without a clear order. Since these are not strictly linguistic problems, they have not received much attention from the NLP community, and there are still unresolved questions about how to tackle these difficulties.
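As a rough illustration of the "few lines of code" claim above, the sketch below trains a bag-of-words linear classifier with scikit-learn; it is not the NEPAccess code itself, and the toy documents and labels are hypothetical stand-ins for real EIS texts and metadata labels.

```python
# Sketch of a bag-of-words linear classifier with scikit-learn.
# The toy documents and labels are hypothetical stand-ins for real EIS data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = ["Final environmental impact statement ...",
               "Draft environmental impact statement ..."]
train_labels = ["Final", "Draft"]

classifier = make_pipeline(
    TfidfVectorizer(),   # bag-of-words (TF-IDF) features
    LinearSVC(),         # linear classifier
)
classifier.fit(train_texts, train_labels)
print(classifier.predict(["Draft environmental impact statement ..."]))
```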
In this paper, we present a comprehensive analysis of how such structural challenges negatively impact tasks that one would expect to be able to solve with off-the-shelf supervised learning techniques, that is, approaches that do not require significant NLP expertise. We have carried out our study under the NEPAccess project (https://www.nepaccess.org/), whose objective is to develop a platform to store, search, download, and analyze environmental impact statements (EIS) and other documents created under the US National Environmental Policy Act of 1969 (NEPA; Congress, 1970).
NEPA was passed almost unanimously by the Congress of the United States in 1970 and established a new review process for scientifically driven environmental governance that incorporated public participation. Under NEPA, any agency proposing a federal action must analyze its potential impacts and report them rigorously in an EIS. An EIS generally must include all expected direct, indirect, and cumulative impacts of the proposed action on the environment, and an assessment of possible alternative actions. The draft version of the EIS is published for public review and comment. Agencies must address relevant comments, conducting further analyses and making changes where required. A final version of the EIS is then issued to support a final decision on the action.
Since it was passed, NEPA has generated large amounts of data and documents reflecting analyses of actions in areas such as transportation infrastructure, mineral extraction, and the management of public lands. However, each federal agency is in charge of managing its own data. The result is a wide body of disjoint repositories with nonstandard sets of metadata accompanying the EISs. The lack of consistency and of a central repository has made access to NEPA documents difficult. Yet EISs are a fundamental source of environmental information in the United States and contain scientific analyses that are published nowhere else. Missing metadata makes it difficult to assess how well NEPA works and to identify ways it might be improved.
The NEPAccess platform is designed to overcome these limitations. To make this possible, we perform large-scale web crawling to collect all NEPA documents available online and retrieve as much associated metadata as possible (Bethard et al., 2019). For those cases where pieces of metadata are missing, we develop machine learning models to automatically extract that information from the text. In general, the kind of information we extract from the documents does not require much linguistic engineering and can be tackled as a set of text classification problems with a simple linear bag-of-words model or a fine-tuned transformer.
However, the data is affected by structural hurdles that harm the performance of machine learning models even for the simplest tasks. An EIS (Figure 1) is typically composed of hundreds or thousands of pages, which far exceeds the usual length of the texts on which NLP models are evaluated. It is also common practice for agencies to split each EIS into several individual PDFs, as they may contain images and maps that result in a total size of more than 1 GB. But there is neither a standardized splitting strategy nor a standardized naming convention for the split PDFs, so for tasks where text order is important, the original EIS order must somehow be recovered before NLP can accurately extract metadata.
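To make the ordering problem concrete, the sketch below shows one simple heuristic: a "natural sort" of file names that compares embedded numbers numerically, so that "Volume_2.pdf" precedes "Volume_10.pdf". This is purely illustrative, with hypothetical file names, and is not necessarily how document order is recovered in our pipeline.

```python
# Hypothetical illustration: recover a plausible order for split PDFs by
# "natural sorting" their file names, comparing embedded numbers numerically.
import re

def natural_key(filename: str):
    # Split the name into digit and non-digit runs, e.g.
    # "Volume_10.pdf" -> ["volume_", 10, ".pdf"]
    parts = re.split(r"(\d+)", filename.lower())
    return [int(p) if p.isdigit() else p for p in parts]

files = ["Volume_10.pdf", "Volume_2.pdf", "Appendix_A.pdf", "Volume_1.pdf"]
print(sorted(files, key=natural_key))
# ['Appendix_A.pdf', 'Volume_1.pdf', 'Volume_2.pdf', 'Volume_10.pdf']
```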
Finally, though the U.S. Environmental Protection Agency (EPA) provides a database with some of the needed metadata, there are inconsistencies between the manual labels and the information the documents contain. For example, the text of a document typically contains the date when it was created, but the manual labels often do not match that date, instead recording the date when the document was registered by the EPA. This is challenging for supervised machine learning models that expect the output labels to match the input text.
We present a study where we try to elucidate how such structural issues affect some of the most widely used machine-learning techniques for NLP. Specifically, we train a bag-of-words based support vector machine (BOW-SVM) and fine-tune a RoBERTa (Liu et al., 2019) transformer to extract four types of metadata fields from EIS documents: Document type, Date, Agency, and States. These are all document-level text classification tasks, not named-entity recognition tasks. For example, a single EIS document will typically mention many federal agencies, but we are not interested in extracting all agencies; we are only interested in identifying the single lead agency that wrote the EIS.
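For readers unfamiliar with the second baseline, the following is a minimal sketch, not our exact training configuration, of fine-tuning RoBERTa for one of these document-level classification tasks with the Hugging Face Transformers library; the documents and labels are hypothetical stand-ins for real EIS data.

```python
# Minimal sketch of fine-tuning RoBERTa for document-level classification.
# The documents and labels are hypothetical stand-ins for real EIS data.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

texts = ["Final environmental impact statement ...",
         "Draft environmental impact statement ..."]
label_names = ["Draft", "Final"]
label_ids = [1, 0]

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=len(label_names))

# Tokenize, truncating each document to RoBERTa's 512-token limit.
batch = tokenizer(texts, truncation=True, padding=True,
                  max_length=512, return_tensors="pt")
labels = torch.tensor(label_ids)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
loss = model(**batch, labels=labels).loss  # forward pass returns the loss
loss.backward()
optimizer.step()  # a real run loops over many batches and epochs
```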
We define a number of settings to evaluate the effects of the different problems we identify in the NEPA scenario. The code used for these experiments can be found at https://github.com/EgoLaparra/meta-data. Our main findings are:
- There is no one model that works well off-the-shelf on all text classification tasks: BOW-SVM is best for States, while RoBERTa is best for Document type and Date. No out-of-the-box model achieves >0.9 on any of the tasks.
- Truncating to the first 512 tokens may either help or hurt depending on the task: it hurts for States, while it helps for the other three tasks (a sketch after this list illustrates how a fixed 512-token limit interacts with long documents).
- Accurately ordering the split documents is critical, yielding gains across the board for both BOW-SVM and RoBERTa.
- Properly tuning the structural hyperparameters (truncation, document ordering, etc.) yields performance gains on all tasks.
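As a purely hypothetical illustration of the long-document issue behind the truncation finding, the sketch below shows one generic way to apply a 512-token classifier to a much longer EIS: split the document into fixed-size chunks, classify each chunk, and take a majority vote over the chunk-level predictions. It is not necessarily the adaptation evaluated in this paper, and `tokenizer` and `model` are assumed to be a fine-tuned checkpoint such as the one in the previous sketch.

```python
# Hypothetical sketch: classify a long document by chunking and majority vote.
from collections import Counter
import torch

def classify_long_document(text, tokenizer, model, chunk_size=512):
    # Tokenize the whole document without truncation or special tokens.
    token_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    votes = []
    step = chunk_size - 2  # leave room for the <s> and </s> special tokens
    for start in range(0, len(token_ids), step):
        chunk = token_ids[start:start + step]
        input_ids = tokenizer.build_inputs_with_special_tokens(chunk)
        with torch.no_grad():
            logits = model(torch.tensor([input_ids])).logits
        votes.append(int(logits.argmax(dim=-1)))
    # Return the label predicted for the most chunks.
    return Counter(votes).most_common(1)[0][0]
```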
We believe these results show the importance of considering not just the linguistic information but also the structural setup when training NLP models for metadata extraction. They also underscore the problems created by the lack of standardization in how EISs and their associated metadata are organized. We hope that our study will serve as a stimulus to improve this aspect of the NEPA process in the future.