nlp streaming by dayesouza · Pull Request #2264 · microsoft/graphrag

dayesouza · 2026-03-03T14:08:08Z

Workflow profiling for txt:

runtime: -6.8%
peak memory: -30%
memory delta: -31%

Findings:

The concurrent_requests and async_mode configuration fields in extract_graph_nlp were never actually functional. The underlying derive_from_rows_asyncio_threads helper contains a bug: it passes an async function into asyncio.to_thread(). The worker thread therefore only creates the coroutine object, and the coroutine body silently runs on the event loop instead of in a separate thread. In practice, this means the code was already running sequentially.
For the NLP pipeline, threading would not provide any benefit anyway, since spaCy and TextBlob are CPU‑bound and constrained by the GIL. We therefore removed the unused threading logic from build_noun_graph and replaced it with a straightforward sequential loop. This matches the real execution behavior prior to the change, while reducing overhead and lowering memory usage.
Edge weights may differ slightly from prior versions: duplicate noun phrases within a single text unit are now deduplicated before co-occurrence counting.
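
The asyncio.to_thread() behavior described in the first finding can be reproduced in isolation. The sketch below is not the graphrag helper; `extract` and `main` are hypothetical names used only for illustration. It shows that handing an async function to asyncio.to_thread() yields an un-run coroutine object, and that the body executes on the event loop thread once it is awaited.

```python
import asyncio
import threading


async def extract(row: int) -> tuple[int, int]:
    # Hypothetical per-row worker: report which thread runs the coroutine body.
    return row, threading.get_ident()


async def main() -> tuple[list[tuple[int, int]], int]:
    loop_thread = threading.get_ident()
    results = []
    for row in range(3):
        # The worker thread only *calls* extract(row). Calling an async
        # function creates a coroutine object; none of its body runs yet.
        coro = await asyncio.to_thread(extract, row)
        assert asyncio.iscoroutine(coro)
        # The body runs here, on the event loop thread, one row at a time.
        results.append(await coro)
    return results, loop_thread


if __name__ == "__main__":
    rows, loop_thread = asyncio.run(main())
    print(all(thread_id == loop_thread for _, thread_id in rows))
```

Every recorded thread id matches the event loop's thread, which is consistent with the observation that the old code path was effectively sequential.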

This pull request refactors and enhances the NLP-based graph extraction workflow to support streaming results directly into output tables, improving scalability and efficiency. The main changes include updating the workflow to process data using storage tables instead of dataframes, implementing streaming writes for entities and relationships, and restructuring the graph extraction logic for better performance and maintainability.

Workflow and API changes:

The extract_graph_nlp workflow now processes data using Table objects for text units, entities, and relationships, enabling more efficient handling of large datasets and streaming output. [1] [2]
Entities and relationships are streamed into their respective output tables as they are extracted, replacing the previous approach of writing entire dataframes at once. Sample rows are returned for downstream use.
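
As a rough illustration of the streaming pattern described above, the sketch below writes rows to a table one at a time and keeps only a small sample in memory. `InMemoryTable`, its `write` method, `extract_entities`, and `stream_to_table` are all hypothetical stand-ins invented for this example; they are not the graphrag Table API, whose exact interface is not shown in this description.

```python
import asyncio
from collections.abc import AsyncIterator
from typing import Any

Row = dict[str, Any]


class InMemoryTable:
    """Hypothetical stand-in for a storage table with an async write method."""

    def __init__(self) -> None:
        self.rows: list[Row] = []

    async def write(self, row: Row) -> None:
        self.rows.append(row)


async def extract_entities(text_units: list[Row]) -> AsyncIterator[Row]:
    # Toy extractor (assumption): one entity row per whitespace-separated token.
    for unit in text_units:
        for token in unit["text"].split():
            yield {"title": token.upper(), "text_unit_id": unit["id"]}


async def stream_to_table(
    rows: AsyncIterator[Row], table: InMemoryTable, sample_size: int = 5
) -> list[Row]:
    # Write each row as soon as it is produced instead of building a full
    # dataframe first; retain only the first few rows as a sample.
    sample: list[Row] = []
    async for row in rows:
        await table.write(row)
        if len(sample) < sample_size:
            sample.append(row)
    return sample


if __name__ == "__main__":
    table = InMemoryTable()
    units = [{"id": "t1", "text": "graph rag"}, {"id": "t2", "text": "noun phrase graph"}]
    sample = asyncio.run(stream_to_table(extract_entities(units), table, sample_size=2))
    print(len(table.rows), sample)
```

With this shape, peak memory is bounded by the sample size plus whatever the table backend buffers, rather than by the full result set.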

Graph extraction logic improvements:

The noun graph extraction (build_noun_graph.py) is refactored to operate on storage tables, with improved async batching and progress logging, and more efficient node and edge construction.
Edge extraction is rewritten for clarity and performance, using dictionaries and batching rather than dataframe groupby operations.
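
A minimal sketch of dictionary-based co-occurrence counting with per-text-unit deduplication follows. It assumes an edge's weight is the number of text units in which both noun phrases appear; the actual weighting in build_noun_graph.py is not shown here, and `count_cooccurrences` is a hypothetical name.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(text_units: dict[str, list[str]]) -> dict[tuple[str, str], int]:
    """Count, per phrase pair, the number of text units containing both phrases."""
    weights: Counter[tuple[str, str]] = Counter()
    for phrases in text_units.values():
        # Deduplicate within the text unit, then sort so each pair has one
        # canonical (source, target) ordering.
        unique = sorted(set(phrases))
        for source, target in combinations(unique, 2):
            weights[(source, target)] += 1
    return dict(weights)


if __name__ == "__main__":
    units = {
        "t1": ["graph", "graph", "node"],
        "t2": ["graph", "node", "edge"],
    }
    print(count_cooccurrences(units))
```

In this example the pair ("graph", "node") gets weight 2, one per text unit. Without deduplication, the repeated "graph" in t1 would count that pair twice within a single unit, which is the kind of difference in edge weights noted in the findings.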

General updates:

Copyright headers updated to 2026, and module docstrings clarified. [1] [2]
A semversioner patch release is documented for the streaming graph extraction feature.

dayesouza added 2 commits March 3, 2026 14:05

fix cooccurences

adf6b57

Merge remote-tracking branch 'origin/main' into nlp

6abd9b6

dayesouza requested a review from a team as a code owner March 3, 2026 14:08

unused async fixes

efcdd95

natoverse approved these changes Mar 3, 2026

dayesouza merged commit bb9afeb into main Mar 3, 2026
18 checks passed

dayesouza deleted the nlp branch March 3, 2026 21:59

dayesouza merged 3 commits into main from nlp
