Description
This work develops an end-to-end, low-code workflow that automatically converts heterogeneous documents (reports, theses and open-data PDFs) into high-quality, machine-readable Life Cycle Assessment (LCA) inventories. Built on the open-source orchestrator n8n, the pipeline:

(i) prunes non-informative pages with custom Python APIs (sketched below);
(ii) extracts text while preserving tables through the free LLMWhisperer API;
(iii) enriches content via DeepSeek LLM nodes for domain-specific entity recognition (sketched below);
(iv) normalises and annotates data with JavaScript routines to yield standards-compliant JSON-LD (sketched below); and
(v) uploads the output to a MySQL-backed Mexican LCA web platform.

A supervised-validation layer, combining rule-based checks and expert review (sketched below), assigns data-quality scores, ensuring transparent provenance before database ingestion. Tested on twenty academic theses, the system cut manual curation time by 80% and produced consistent inventories in under five minutes per document; metadata enrichment improved downstream query performance by roughly 30% compared with hand-curated entries.

By eliminating repetitive tasks, enforcing schema uniformity and providing a direct bridge from source document to live database, the workflow accelerates the population of national and international LCA repositories and supports the rapid creation of efficient datasets and databases. Key challenges include processing poorly scanned files, harmonising domain-specific nomenclature and scaling the validation module; opportunities lie in multilingual expansion, uncertainty quantification and broader integration with circular-economy datasets. Overall, coupling open-source automation, advanced language models and supervised quality assurance offers a replicable blueprint for reliable, rapid LCA data generation that lowers barriers for researchers and strengthens global open-data infrastructure.
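A minimal sketch of the page-pruning step (i), assuming a simple text-density heuristic built on the pypdf library; the 200-character threshold and the heuristic itself are illustrative, not the pipeline's actual pruning rules:

```python
from pypdf import PdfReader, PdfWriter

def prune_pages(src: str, dst: str, min_chars: int = 200) -> None:
    """Drop pages whose extracted text falls below a character threshold.

    The cutoff is a hypothetical heuristic; the abstract does not
    specify how the real pipeline decides a page is non-informative.
    """
    reader = PdfReader(src)
    writer = PdfWriter()
    for page in reader.pages:
        text = page.extract_text() or ""
        if len(text.strip()) >= min_chars:  # keep informative pages only
            writer.add_page(page)
    with open(dst, "wb") as f:
        writer.write(f)
```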
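For the entity-recognition step (iii), DeepSeek exposes an OpenAI-compatible API, so a call might look like the sketch below; the prompt wording and the requested JSON keys (`process_name`, `exchanges`) are assumptions, not the authors' actual node configuration:

```python
from openai import OpenAI

# DeepSeek's OpenAI-compatible endpoint; model name per DeepSeek docs.
client = OpenAI(api_key="YOUR_DEEPSEEK_KEY", base_url="https://api.deepseek.com")

# Hypothetical prompt for LCA-specific entity recognition.
PROMPT = (
    "Extract LCA inventory entities from the text below. "
    "Return JSON with keys: process_name, exchanges "
    "(each exchange with flow, amount, unit).\n\n{text}"
)

def extract_entities(text: str) -> str:
    """Send extracted document text to the LLM and return its JSON reply."""
    resp = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": PROMPT.format(text=text)}],
        temperature=0,  # deterministic output for downstream parsing
    )
    return resp.choices[0].message.content
```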
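The normalisation step (iv) runs as JavaScript routines inside n8n; a rough Python equivalent is sketched here for consistency with the other examples. The `@context` URL, the `Process`/`Exchange` types and all field names are placeholders standing in for whatever standards-compliant LCA schema (e.g. olca-schema) the platform actually targets:

```python
import json

def to_jsonld(record: dict) -> str:
    """Wrap an extracted inventory record as a JSON-LD document.

    Context URL and field names are illustrative; consult the target
    schema for the real vocabulary.
    """
    doc = {
        "@context": "http://greendelta.github.io/olca-schema/context.jsonld",
        "@type": "Process",
        "name": record.get("process_name"),
        "exchanges": [
            {
                "@type": "Exchange",
                "flow": {"@type": "Flow", "name": ex["flow"]},
                "amount": float(ex["amount"]),
                "unit": ex.get("unit", ""),
            }
            for ex in record.get("exchanges", [])
        ],
    }
    return json.dumps(doc, indent=2, ensure_ascii=False)
```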
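Finally, a toy sketch of the rule-based half of the supervised-validation layer; the individual checks and the equal weighting are invented for illustration, and the expert-review step that complements them is not shown:

```python
def quality_score(doc: dict) -> float:
    """Score an inventory record in [0, 1] from simple rule-based checks.

    Each boolean check contributes equally; the real scoring rubric is
    not described in the abstract.
    """
    exchanges = doc.get("exchanges", [])
    checks = [
        bool(doc.get("name")),                # has a process name
        len(exchanges) > 0,                   # at least one exchange
        all(ex.get("unit") for ex in exchanges),          # units present
        all(isinstance(ex.get("amount"), (int, float))    # numeric amounts
            for ex in exchanges),
    ]
    return sum(checks) / len(checks)
```

Records scoring below a configurable threshold could be routed to expert review rather than ingested directly, which matches the abstract's emphasis on transparent provenance before database ingestion.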
How much time do you ideally wish for your contribution? 15 minutes