Data Pipeline
We combine PubMed Central (PMC) full texts with OpenAlex metadata, then filter for high-impact oncology and metabolism research.
Full-Text Corpus: 7M+ PMC Articles
Citation Metadata: 40M+ OpenAlex Records
Aligned & Filtered: 6.6M Relevance Matched
Curated Vocab: 3,400+ Metabolites & Genes
End-to-End Workflow
1. Ingest & Align
Pull full-text XML from PMC and align each article with its OpenAlex citation metrics.
2. Relevance Filtering
Score papers on text signal (section boosts and keyword density), citation count, recency, and journal prestige.
3. KG-RAG Extraction
Extract entities and hyperedges using LLMs. Each hyperedge is a verified claim.
4. Graph Construction
Build a bipartite graph (Entity Nodes ↔ Evidence Nodes) in Neo4j.
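The graph construction step can be sketched in plain Python. This is a minimal illustration that assumes each hyperedge carries an evidence id and a list of entity names. The field names `id` and `entities`, the function `build_bipartite_graph`, and the sample records are all hypothetical. It uses in-memory dictionaries rather than Neo4j and is not the pipeline's actual loader.

```python
from collections import defaultdict


def build_bipartite_graph(hyperedges):
    """Link each evidence node to every entity node it mentions.

    `hyperedges` is a list of dicts with hypothetical keys:
    'id' (evidence node id) and 'entities' (list of entity names).
    Returns two adjacency maps, one per side of the bipartite graph.
    """
    entity_to_evidence = defaultdict(set)
    evidence_to_entity = {}
    for edge in hyperedges:
        evidence_to_entity[edge["id"]] = set(edge["entities"])
        for entity in edge["entities"]:
            entity_to_evidence[entity].add(edge["id"])
    return dict(entity_to_evidence), evidence_to_entity


# Made-up placeholder records, for illustration only.
hyperedges = [
    {"id": "ev1", "entities": ["metabolite_A", "gene_X"]},
    {"id": "ev2", "entities": ["metabolite_A", "gene_Y", "gene_Z"]},
]

entity_to_evidence, evidence_to_entity = build_bipartite_graph(hyperedges)
print(sorted(entity_to_evidence["metabolite_A"]))  # evidence nodes citing metabolite_A
```

Because edges only run between an entity and an evidence node, a hyperedge that connects three or more entities stays a single evidence node instead of being split into pairwise links.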
Relevance Scoring Algorithm
Every paper receives a continuous relevance score to prioritize high-quality, impactful science.
Each component is mapped to a 0–1 quantile rank (`qrank`).
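The exact definition of `qrank` is not given here, so the following is one plausible sketch: each value is replaced by its rank position scaled to the 0–1 range, with tied values sharing their average position. The tie handling and the single-item fallback of 0.5 are assumptions.

```python
def qrank(values):
    """Map each value to a 0-1 quantile rank; ties share the average rank."""
    n = len(values)
    if n == 0:
        return []
    if n == 1:
        return [0.5]  # assumption: a lone value sits at the midpoint
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_pos = (i + j) / 2  # average 0-based position within the tie
        for k in range(i, j + 1):
            ranks[order[k]] = avg_pos / (n - 1)
        i = j + 1
    return ranks


print(qrank([10, 20, 20, 40]))  # [0.0, 0.5, 0.5, 1.0]
```

Ranking before weighting puts components with very different raw scales (word densities, citation counts, years) on the same 0–1 footing.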
1. Text Signal (35%)
Combines a section-aware boost with local keyword density (matches per 1,000 words), weighted 80/20.
\[
\begin{aligned}
T_{1} &= \mathrm{qrank}(\text{Section Boost}) \\
T_{2} &= \mathrm{qrank}(\text{Density per 1k words}) \\
\text{Text Signal} &= 0.8 \cdot T_{1} + 0.2 \cdot T_{2}
\end{aligned}
\]
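As an illustration of the density term, the sketch below counts vocabulary hits per 1,000 words. It assumes single-token keywords and a simple lowercase alphanumeric tokenizer. The function name `keyword_density_per_1k` and the sample sentence are hypothetical, and the section boost is not modelled because its rules are not specified here.

```python
import re


def keyword_density_per_1k(text, keywords):
    """Return keyword hits per 1,000 words (single-token keywords only)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        return 0.0
    vocab = {k.lower() for k in keywords}
    hits = sum(1 for w in words if w in vocab)
    return 1000.0 * hits / len(words)


# Toy sentence: 10 words, 2 of which are in the vocabulary.
sample = "glutamine uptake was measured alongside glucose uptake in all samples"
density = keyword_density_per_1k(sample, ["glutamine", "glucose"])
print(density)  # 200.0
```

The raw density would then be passed through `qrank` across the corpus to obtain \(T_{2}\).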
2. Citation Signal (35%)
Citation counts are log-transformed to dampen outliers before ranking.
\[
\begin{aligned}
C &= \log(1 + \text{Citation Count}) \\
\text{Citation Signal} &= \mathrm{qrank}(C)
\end{aligned}
\]
3. Recency Signal (15%)
Down-weights older work. The offset from a baseline year is clipped to ±10 years so that very old or very recent publication dates do not dominate.
\[
\begin{aligned}
R &= \mathrm{clip}(\text{Pub Year} - \text{Baseline}, -10, 10) \\
\text{Recency Signal} &= \mathrm{qrank}(R)
\end{aligned}
\]
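The clipping step can be shown with a few sample years. The baseline year is not stated in this section, so `BASELINE = 2015` below is a placeholder, and `recency_offset` is a hypothetical helper name.

```python
def recency_offset(pub_year, baseline, lo=-10, hi=10):
    """Clip (pub_year - baseline) into the range [lo, hi]."""
    return max(lo, min(hi, pub_year - baseline))


BASELINE = 2015  # placeholder: the real baseline is not given in the text

years = [1990, 2010, 2015, 2024, 2030]
offsets = [recency_offset(y, BASELINE) for y in years]
print(offsets)  # [-10, -5, 0, 9, 10]
```

Every paper more than ten years before the baseline gets the same offset of -10, so such papers tie in the subsequent quantile ranking.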
4. Journal Prestige (15%)
A blended score of multiple impact metrics.
\[
\begin{aligned}
J ={} & 0.55 \cdot \mathrm{qrank}(\text{SJR}) \\
& + 0.20 \cdot \mathrm{qrank}(\text{Cites/Doc}) \\
& + 0.15 \cdot \mathrm{qrank}(\text{CiteScore}) \\
& + 0.10 \cdot \mathrm{qrank}(\text{H-index})
\end{aligned}
\]
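The blend is a fixed-weight sum, sketched below under the assumption that the four journal metrics have already been converted to quantile ranks. The dictionary keys, the function name `journal_prestige`, and the sample values are hypothetical.

```python
# Weights taken from the formula above; the key names are illustrative.
JOURNAL_WEIGHTS = {
    "sjr": 0.55,
    "cites_per_doc": 0.20,
    "citescore": 0.15,
    "h_index": 0.10,
}


def journal_prestige(qranks):
    """Blend pre-computed 0-1 quantile ranks into a single prestige score."""
    missing = set(JOURNAL_WEIGHTS) - set(qranks)
    if missing:
        raise KeyError(f"missing journal metrics: {sorted(missing)}")
    return sum(weight * qranks[name] for name, weight in JOURNAL_WEIGHTS.items())


# Made-up quantile ranks for one journal.
example = {"sjr": 0.9, "cites_per_doc": 0.8, "citescore": 0.7, "h_index": 0.6}
j_score = journal_prestige(example)
print(round(j_score, 3))  # 0.82
```

Because the weights sum to 1 and every input lies in 0–1, the blended score also stays in 0–1.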
Final Relevance Score
\[
\text{Score} = 0.35 \cdot \text{Text Signal} + 0.35 \cdot \text{Citation Signal} + 0.15 \cdot \text{Recency Signal} + 0.15 \cdot J
\]
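As a worked example with hypothetical component values (text 0.80, citation 0.60, recency 0.50, journal prestige 0.82), the weighted sum evaluates to:
\[
\begin{aligned}
\text{Score} &= 0.35 \cdot 0.80 + 0.35 \cdot 0.60 + 0.15 \cdot 0.50 + 0.15 \cdot 0.82 \\
&= 0.280 + 0.210 + 0.075 + 0.123 \\
&= 0.688
\end{aligned}
\]
Text and citation signals together carry 70% of the weight, so in this example they contribute 0.49 of the 0.688 total.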
Why KG-RAG?
Standard LLM
- Hallucinates links
- Loses citations
- Black-box reasoning
KG-RAG (Ours)
- Grounded evidence
- Traceable hyperedges
- Graph-based logic