12–17 Oct 2025
CEA Grenoble
Europe/Zurich timezone

From PDF to JSON-LD: An Automated n8n–AI Pipeline for Rapid, Validated LCA Inventory Creation

16 Oct 2025, 11:05
15m
CEA Grenoble

Presentation open data T2: Data - backbone of LCAs

Speakers

Alejandro Padilla-Rivera (Instituto de Ingeniería, Universidad Nacional Autónoma de México)
Mr Luis Trujillo (Universidad Autónoma Metropolitana)

Description

This work develops an end-to-end, low-code workflow that automatically converts heterogeneous documents (reports, theses and open-data PDFs) into high-quality, machine-readable Life Cycle Assessment (LCA) inventories. Built on the open-source orchestrator n8n, the pipeline (i) prunes non-informative pages with custom Python APIs, (ii) extracts text while preserving tables through the free LLMWhisperer API, (iii) enriches content via DeepSeek LLM nodes for domain-specific entity recognition, (iv) normalises and annotates data with JavaScript routines to yield standards-compliant JSON-LD, and (v) uploads the output to a MySQL-backed Mexican LCA web platform. A supervised-validation layer, combining rule-based checks and expert review, assigns data-quality scores and ensures transparent provenance before database ingestion. Tested on twenty academic theses, the system cut manual curation time by 80 % and produced consistent inventories in under five minutes per document; metadata enrichment improved downstream query performance by ~30 % compared with hand-curated entries. By eliminating repetitive tasks, enforcing schema uniformity and providing a direct bridge from source document to live database, the workflow accelerates the population of national and international LCA repositories. Key challenges include processing poorly scanned files, harmonising domain-specific nomenclature and scaling the validation module, while opportunities lie in multilingual expansion, uncertainty quantification and broader integration with circular-economy datasets. Overall, coupling open-source automation, advanced language models and supervised quality assurance offers a replicable blueprint for reliable, rapid LCA data generation that lowers barriers for researchers and strengthens global open-data infrastructure.
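The normalisation and rule-based validation steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract says normalisation runs in JavaScript routines inside n8n and does not specify the JSON-LD schema, the validation rules or the scoring formula. The vocabulary IRI, the property names (`exchanges`, `dataQuality`, `ruleScore`), the unit aliases and the functions `normalise_flow`, `rule_checks` and `build_inventory` are all hypothetical, chosen only to show the shape of such a step.

```python
import json

# Hypothetical vocabulary IRI; the abstract does not name the target schema.
CONTEXT = {"@vocab": "https://example.org/lca#"}

# Hypothetical unit synonyms used to harmonise extracted units.
UNIT_ALIASES = {"kilogram": "kg", "kgs": "kg", "kwh": "kWh", "megajoule": "MJ"}


def normalise_flow(raw):
    """Trim strings, canonicalise the unit and coerce the amount to float."""
    unit = str(raw.get("unit", "")).strip()
    try:
        amount = float(raw.get("amount"))
    except (TypeError, ValueError):
        amount = None  # unparseable amounts are flagged later by rule_checks
    return {
        "@type": "Exchange",
        "name": str(raw.get("name", "")).strip(),
        "amount": amount,
        "unit": UNIT_ALIASES.get(unit.lower(), unit),
        "direction": str(raw.get("direction", "")).strip().lower(),
    }


def rule_checks(flow):
    """Return the names of the (illustrative) rules that this flow fails."""
    failed = []
    if not flow["name"]:
        failed.append("missing_name")
    if flow["amount"] is None or flow["amount"] < 0:
        failed.append("invalid_amount")
    if not flow["unit"]:
        failed.append("missing_unit")
    if flow["direction"] not in ("input", "output"):
        failed.append("invalid_direction")
    return failed


def build_inventory(process_name, source_document, raw_flows):
    """Assemble a JSON-LD-style document with a simple rule-based score."""
    flows = [normalise_flow(r) for r in raw_flows]
    issues = []
    for index, flow in enumerate(flows):
        failed = rule_checks(flow)
        if failed:
            issues.append({"index": index, "failed": failed})
    # Assumed score: the share of flows that pass every rule.
    score = (len(flows) - len(issues)) / len(flows) if flows else 0.0
    return {
        "@context": CONTEXT,
        "@type": "Process",
        "name": process_name,
        "source": source_document,  # provenance: the originating file
        "exchanges": flows,
        "dataQuality": {
            "ruleScore": round(score, 2),
            "issues": issues,
            "needsExpertReview": bool(issues),
        },
    }


if __name__ == "__main__":
    raw = [
        {"name": " Electricity ", "amount": "12.5", "unit": "kwh", "direction": "Input"},
        {"name": "CO2", "amount": "n/a", "unit": "kg", "direction": "output"},
    ]
    print(json.dumps(build_inventory("Cement production", "thesis_001.pdf", raw), indent=2))
```

In this sketch the second flow fails the amount rule, so the document is flagged for expert review rather than ingested directly, mirroring the combination of rule-based checks and expert review that the abstract describes.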

How much time do you ideally wish for your contribution? 15 minutes

Authors

Alejandro Padilla-Rivera (Instituto de Ingeniería, Universidad Nacional Autónoma de México)
Mr Luis Trujillo (Universidad Autónoma Metropolitana)
Mr Marcos Nolasco (Universidad Autónoma Metropolitana)

Co-authors

Mr Ivan Vásquez (Universidad Nacional Autónoma de México)
Dr Patricia Güereca-Hernández (Instituto de Ingeniería, Universidad Nacional Autónoma de México)
