nlp streaming#2264

Merged
dayesouza merged 3 commits into main from nlp
Mar 3, 2026
@dayesouza dayesouza commented Mar 3, 2026

Workflow profiling for txt:

  • runtime: -6.8%
  • peak memory: -30%
  • memory delta: -31%

Findings:

  1. The concurrent_requests and async_mode configuration fields in extract_graph_nlp were never functional. The underlying derive_from_rows_asyncio_threads helper has a bug: it passes an async function to asyncio.to_thread(). The worker thread only creates the coroutine object; the coroutine body then runs on the event loop rather than in a separate thread. In practice, the code was already running sequentially.
    For the NLP pipeline, threading would not help anyway, since spaCy and TextBlob are CPU-bound and constrained by the GIL. We therefore removed the unused threading logic from build_noun_graph and replaced it with a plain sequential loop. This matches the actual execution behavior before the change, while reducing overhead and lowering memory usage.
  2. Edge weights may differ slightly from prior versions: duplicate noun phrases within a single text unit are now deduplicated before co-occurrence counting.
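
The to_thread behavior described in finding 1 can be reproduced in isolation. This is a minimal sketch, not the actual derive_from_rows_asyncio_threads code; the function names work, sync_work, run_async_in_to_thread and run_sync_in_to_thread are hypothetical and exist only for illustration.

```python
import asyncio
import threading


async def work() -> int:
    # Report which thread actually executes the coroutine body.
    return threading.get_ident()


def sync_work() -> int:
    # Plain function: its body runs wherever it is called.
    return threading.get_ident()


async def run_async_in_to_thread() -> int:
    # to_thread calls work() in a worker thread, but calling an async
    # function only creates a coroutine object; no body code has run yet.
    coro = await asyncio.to_thread(work)
    # The body executes here, on the event loop's thread.
    return await coro


async def run_sync_in_to_thread() -> int:
    # With a regular function, the body really runs in the worker thread.
    return await asyncio.to_thread(sync_work)


caller_thread = threading.get_ident()
async_body_thread = asyncio.run(run_async_in_to_thread())
sync_body_thread = asyncio.run(run_sync_in_to_thread())

print("async body ran on caller thread:", async_body_thread == caller_thread)
print("sync body ran on caller thread:", sync_body_thread == caller_thread)
```

The async body reports the caller's thread, while the plain function reports a worker thread. That is why wrapping an async function in asyncio.to_thread() adds overhead without adding parallelism.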

This pull request refactors and enhances the NLP-based graph extraction workflow to support streaming results directly into output tables, improving scalability and efficiency. The main changes include updating the workflow to process data using storage tables instead of dataframes, implementing streaming writes for entities and relationships, and restructuring the graph extraction logic for better performance and maintainability.

Workflow and API changes:

  • The extract_graph_nlp workflow now processes data using Table objects for text units, entities, and relationships, enabling more efficient handling of large datasets and streaming output.
  • Entities and relationships are streamed into their respective output tables as they are extracted, replacing the previous approach of writing entire dataframes at once. Sample rows are returned for downstream use.
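
The streaming pattern described above can be sketched roughly as follows. The Table API is not shown in this description, so InMemoryTable, its write method, stream_rows, and the sample_size parameter are all hypothetical stand-ins, not the project's real interface.

```python
import asyncio
from typing import Any, Iterable


class InMemoryTable:
    """Hypothetical stand-in for an output table that accepts one row at a time."""

    def __init__(self) -> None:
        self.rows: list[dict[str, Any]] = []

    async def write(self, row: dict[str, Any]) -> None:
        self.rows.append(row)


async def stream_rows(
    rows: Iterable[dict[str, Any]],
    table: InMemoryTable,
    sample_size: int = 5,
) -> list[dict[str, Any]]:
    """Write each row as it is produced and keep only a small sample in memory."""
    sample: list[dict[str, Any]] = []
    for row in rows:
        await table.write(row)
        if len(sample) < sample_size:
            sample.append(row)
    return sample


def generate_entities(count: int) -> Iterable[dict[str, Any]]:
    # A generator, so the full entity list is never materialized by the caller.
    for i in range(count):
        yield {"id": i, "title": f"ENTITY_{i}"}


entity_table = InMemoryTable()
entity_sample = asyncio.run(stream_rows(generate_entities(100), entity_table))
```

Here the caller holds only the sample rows. The full output lives in the table, which in a real backend would be storage rather than a Python list.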

Graph extraction logic improvements:

  • The noun graph extraction (build_noun_graph.py) is refactored to operate on storage tables, with improved async batching and progress logging, and more efficient node and edge construction.
  • Edge extraction is rewritten for clarity and performance, using dictionaries and batching rather than dataframe groupby operations.
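
As an illustration of dictionary-based edge counting with per-text-unit deduplication (finding 2), here is a minimal sketch. The function count_cooccurrence_edges and the sample data are hypothetical, and the real build_noun_graph may weight or normalize edges differently.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrence_edges(
    text_units: dict[str, list[str]],
) -> Counter:
    """Count how many text units each unordered noun-phrase pair co-occurs in."""
    weights: Counter = Counter()
    for phrases in text_units.values():
        # Deduplicate within the text unit so a repeated phrase does not
        # inflate the pair count; sort so each pair has one canonical order.
        unique_phrases = sorted(set(phrases))
        for source, target in combinations(unique_phrases, 2):
            weights[(source, target)] += 1
    return weights


sample_units = {
    "unit-1": ["graph", "node", "graph"],  # "graph" appears twice
    "unit-2": ["node", "graph"],
}
edge_weights = count_cooccurrence_edges(sample_units)
```

With deduplication, the pair ("graph", "node") gets weight 2, one per text unit. Counting the raw list in unit-1 would have produced a higher weight, which is the kind of difference from prior versions that finding 2 describes.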

General updates:

  • Copyright headers updated to 2026, and module docstrings clarified.
  • A semversioner patch release is documented for the streaming graph extraction feature.

@dayesouza dayesouza requested a review from a team as a code owner March 3, 2026 14:08
@dayesouza dayesouza merged commit bb9afeb into main Mar 3, 2026
18 checks passed
@dayesouza dayesouza deleted the nlp branch March 3, 2026 21:59
2 participants