Browsing by Supervisor "Androutsopoulos, Ion"
Item: Natural language processing for business documents (2025-11-04)
Loukas, Lefteris; Λούκας, Λευτέρης; Paliouras, Georgios; Leledakis, Georgios; Koutsopoulos, Iordanis; Pavlopoulos, Ioannis; Stafylakis, Themos; Kotidis, Yannis; Androutsopoulos, Ion
Natural Language Processing (NLP) for business and finance-related documents (Hahn et al., 2018; Chen et al., 2022) is an expanding research area applying computational techniques to text such as company filings, analyst reports, and economic news. These documents present unique challenges due to specialized vocabulary (El-Haj et al., 2019), the critical role of numerical data, distinct syntactic structures, and domain-specific semantics. These issues are compounded by broader difficulties, including processing large volumes of unstructured public data and deploying language models cost-effectively, especially for resource-limited organizations like small-to-medium enterprises (SMEs). Addressing these challenges is crucial for applications ranging from fraud detection (Goel and Gangolly, 2012), long-form summarization (Cao et al., 2024), and information extraction to financial question answering (Maia et al., 2018).
This thesis aims to advance applied NLP and the use of business documents for real-world tasks by addressing current industry challenges across different layers of artificial intelligence, more specifically across the data, application, and deployment layers, with a consistent focus on resource-constrained environments. We tackle three main research questions: (1) How can open-access, unstructured business documents be effectively leveraged for NLP? (2) How can current NLP methods that utilize deep learning (DL) techniques be adapted and extended to create business value in tasks like automatic document tagging, considering the nuances of financial language, particularly its heavy reliance on numerics? (3) For common industrial text classification tasks, what are the most accurate and cost-efficient approaches in resource-limited settings? We investigate the third question by focusing on a real-world use case of intent recognition from customer dialogues, comparing BERT-based models and Large Language Models (LLMs), and optimizing LLM deployment for cost.
Addressing these questions, we first focus on “democratizing” access to business documents by developing EDGAR-CORPUS, the largest publicly available financial NLP corpus in English, domain-specific word embeddings (EDGAR-W2V) which outperform alternatives, and EDGAR-CRAWLER, an open-source software toolkit for financial data extraction with hundreds of users ranging from academic researchers to FinTech practitioners and web developers. Then, we introduce XBRL tagging, a real-world NLP task, where we compile FiNER-139, the first dataset for this task, and find that LSTM models can outperform BERT due to issues of the transformer architecture (and its standard tokenization technique) with numeric token fragmentation. After benchmarking different methods, we then propose a novel tokenization technique using pseudo-tokens for transformer models that significantly improves performance on numeric-first tasks, leading to the release of new state-of-the-art BERT models (SEC-BERT), specially created for the finance domain. Finally, for cost-efficient intent recognition, we conduct a comprehensive benchmarking study on the Banking77 dataset (Casanueva et al., 2020), showing that smaller BERT-based models can be more effective and economical than LLMs, requiring only slightly more annotated training data than LLMs. We also showcase “Dynamic Few-Shot Prompting”, a Retrieval-Augmented Generation (RAG)-based method that drastically reduces LLM inference costs while maintaining high accuracy, and explore the utility of synthetic data generation.
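As an illustration of the pseudo-token idea mentioned above, the following minimal sketch replaces each numeric token with a single placeholder that encodes its shape, so that a subword tokenizer would no longer split the number into fragments. The abstract does not specify the exact scheme; the shape-based mapping, the regular expression, and the function names `to_pseudo_token` and `normalize_numbers` are assumptions made for this example.

```python
import re

# Assumed pattern: digits with optional thousands separators and a decimal part.
_NUMBER = re.compile(r"^\d[\d,]*(\.\d+)?$")


def to_pseudo_token(token: str) -> str:
    """Map a numeric token to a shape pseudo-token, e.g. '53.2' -> '[XX.X]'.

    Non-numeric tokens are returned unchanged.
    """
    if not _NUMBER.match(token):
        return token
    shape = re.sub(r"\d", "X", token)  # keep separators, mask every digit
    return f"[{shape}]"


def normalize_numbers(text: str) -> list[str]:
    """Whitespace-split the text and replace each numeric token."""
    return [to_pseudo_token(tok) for tok in text.split()]


tokens = normalize_numbers("Revenue rose 24.8 % to 9,323.0 million in 2021")
print(tokens)
```

In practice each pseudo-token would also have to be added to the model vocabulary as a single entry; that step depends on the tokenizer library and is not shown here.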
This thesis makes several key resources publicly available: the EDGAR-CORPUS, EDGAR-W2V embeddings, the EDGAR-CRAWLER software, the FiNER-139 dataset, the family of SEC-BERT models, as well as a curated subset of the Banking77 dataset containing expert-selected examples for each intent class, which we show to be crucial for achieving high performance in few-shot learning scenarios. The work of the thesis constitutes a significant contribution toward advancing industrial NLP on business documents by providing foundational open-source resources, novel methodologies for handling the unique characteristics of financial text, particularly numerical data, and practical, cost-effective strategies for deploying advanced NLP solutions in real-world financial applications, especially benefiting small-to-medium enterprises.
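The retrieval step behind a dynamic few-shot prompt can be sketched as follows: for each incoming message, only the k most similar labeled examples are placed in the prompt, which keeps it short. This is one plausible reading of the abstract, not the thesis implementation. The bag-of-words cosine similarity is a stand-in for whatever retriever is actually used, and the example pool, intent labels, and prompt wording are hypothetical.

```python
import math
from collections import Counter


def bow(text: str) -> Counter:
    """Lowercased bag-of-words vector (a simple stand-in for an embedding)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def select_examples(query: str, pool: list[tuple[str, str]], k: int = 2):
    """Return the k labeled examples most similar to the query."""
    q = bow(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, bow(ex[0])), reverse=True)
    return ranked[:k]


def build_prompt(query: str, pool: list[tuple[str, str]], k: int = 2) -> str:
    """Assemble a few-shot prompt from the retrieved examples only."""
    parts = ["Classify the customer message into an intent."]
    for text, label in select_examples(query, pool, k):
        parts.append(f"Message: {text}\nIntent: {label}")
    parts.append(f"Message: {query}\nIntent:")
    return "\n\n".join(parts)


# Hypothetical example pool; the labels are illustrative only.
POOL = [
    ("my new card has not arrived", "card_arrival"),
    ("how do I top up my account", "top_up"),
    ("I lost my card", "lost_card"),
    ("why was I charged a fee", "fee_charged"),
]

prompt = build_prompt("when will my new card arrive", POOL, k=2)
print(prompt)
```

With a pool covering many intent classes, the prompt grows with k rather than with the number of classes, which is where the saving in input tokens would come from under these assumptions.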
