MetaKnogic Architecture

MetaKnogic (Metabolite Knowledge-Logic HyperGraph) maps the metabolic landscape of cancer. It ingests millions of full-text articles, scores each one with a quantile-based relevance model, and extracts multi-entity relationships using Knowledge-Graph RAG (KG-RAG).

The Problem

Biomedical knowledge is scattered across millions of unstructured articles. Critical relationships, such as how a nutrient shapes a metabolic pathway that drives tumor behavior, are buried in prose. Standard vector search retrieves similar passages but often misses the explicit logic connecting these entities.

Our Solution

We model biological entities as nodes and evidence sentences as hyperedges. A standard knowledge-graph edge links exactly two entities; a hyperedge connects several at once (e.g., Gene + Metabolite + Cancer Subtype), so the full scientific context of a claim stays in one piece.
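A minimal sketch of this data model, assuming a hypothetical schema (the class names, fields, and placeholder entities below are illustrative and not taken from the MetaKnogic codebase). It also shows what is lost when a three-entity hyperedge is flattened into pairwise edges.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class Entity:
    """A biological entity node (hypothetical schema)."""
    name: str
    kind: str  # e.g. "gene", "metabolite", "cancer_subtype"


@dataclass(frozen=True)
class Hyperedge:
    """One evidence sentence linking any number of entities."""
    sentence: str
    source_id: str                # placeholder identifier of the source article
    entities: FrozenSet[Entity]   # unordered set of linked entities


def pairwise_edges(edge: Hyperedge) -> Set[FrozenSet[str]]:
    """Flatten a hyperedge into the binary edges a pairwise KG would store."""
    names = sorted(e.name for e in edge.entities)
    return {frozenset((a, b)) for i, a in enumerate(names) for b in names[i + 1:]}


# Placeholder entities G, M, S; not a real finding.
example = Hyperedge(
    sentence="Illustrative sentence: gene G alters metabolite M in subtype S.",
    source_id="EXAMPLE-0001",
    entities=frozenset({
        Entity("G", "gene"),
        Entity("M", "metabolite"),
        Entity("S", "cancer_subtype"),
    }),
)

print(len(example.entities))         # entities joined by a single hyperedge
print(len(pairwise_edges(example)))  # binary edges needed to approximate it
```

The pairwise form needs three separate edges, and none of them records that all three entities appeared together in the same sentence.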

Data Pipeline

We combine PMC full-texts with OpenAlex metadata, filtering for high-impact oncology/metabolism research.

Full-Text Corpus: 7M+ PMC articles

Citation Metadata: 40M+ OpenAlex records

Aligned & Filtered: 6.6M relevance-matched articles

Curated Vocab: 3,400+ metabolites & genes

End-to-End Workflow

1. Ingest & Align

Pull XML from PMC and align with OpenAlex citation metrics.

2. Relevance Filtering

Score papers based on keyword density, impact factor, and recency.

3. KG-RAG Extraction

Extract entities and hyperedges using LLMs. Each hyperedge is a verified claim.

4. Graph Construction

Build a bipartite graph (Entity Nodes ↔ Evidence Nodes) in Neo4j.
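The bipartite layout in step 4 can be sketched in memory as follows. This is a plain-Python stand-in, not Neo4j code: the function names, node identifiers, and toy evidence records are all assumptions made for illustration. Each evidence node is adjacent to every entity it mentions, so a multi-entity query becomes a set intersection over evidence nodes.

```python
from collections import defaultdict


def build_bipartite(hyperedges):
    """Return two adjacency maps: entity -> evidence ids, evidence id -> entities."""
    entity_to_evidence = defaultdict(set)
    evidence_to_entity = {}
    for evidence_id, entities in hyperedges.items():
        evidence_to_entity[evidence_id] = set(entities)
        for entity in entities:
            entity_to_evidence[entity].add(evidence_id)
    return dict(entity_to_evidence), evidence_to_entity


def shared_evidence(entity_to_evidence, *entities):
    """Evidence nodes adjacent to every one of the given entity nodes."""
    sets = [entity_to_evidence.get(e, set()) for e in entities]
    return set.intersection(*sets) if sets else set()


# Toy hyperedges with placeholder identifiers (not real data).
hyperedges = {
    "ev1": ["gene:G", "metabolite:M", "subtype:S"],
    "ev2": ["gene:G", "metabolite:M"],
    "ev3": ["metabolite:M", "subtype:T"],
}
e2v, v2e = build_bipartite(hyperedges)

print(sorted(shared_evidence(e2v, "gene:G", "metabolite:M")))  # ['ev1', 'ev2']
print(sorted(shared_evidence(e2v, "gene:G", "subtype:S")))     # ['ev1']
```

In a graph database the same structure would be two node labels joined by a single relationship type; the intersection above corresponds to matching an evidence node connected to all requested entities.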

Relevance Scoring Algorithm

Every paper receives a continuous relevance score to prioritize high-quality, impactful science. Each component is mapped to a 0–1 quantile rank (`qrank`).
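The exact definition of `qrank` is not given here. One plausible reading, shown below as an assumption, is a min-max scaled average rank: the smallest value maps to 0, the largest to 1, and tied values share the mean of their ranks.

```python
def qrank(values):
    """Map each value to a 0-1 quantile rank (assumed definition).

    Ties receive the average of their zero-based ranks, and ranks are
    divided by n - 1 so the minimum maps to 0.0 and the maximum to 1.0.
    """
    n = len(values)
    if n == 0:
        return []
    if n == 1:
        return [0.5]  # a single value has no meaningful rank; use the midpoint
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return [r / (n - 1) for r in ranks]


print(qrank([10, 20, 20, 40]))  # [0.0, 0.5, 0.5, 1.0]
```

Because only the ordering matters, the result is insensitive to the scale and units of the underlying component, which is what makes the four signals comparable before they are weighted.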

1. Text Signal (35%)

Section-aware boosting and local keyword density.

T_{1} = \mathrm{qrank}(\text{Section Boost}) \\ T_{2} = \mathrm{qrank}(\text{Density per 1k words}) \\ \text{Text Signal} = 0.8 \cdot T_{1} + 0.2 \cdot T_{2}
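As a rough illustration of the second term and the 0.8/0.2 blend, the sketch below computes keyword density per 1,000 words and combines two already-ranked inputs. The tokenizer, the single-word keyword matching, and the function names are assumptions; how the section boost is computed is not specified here, so it is passed in as a ready-made quantile rank.

```python
import re


def density_per_1k(text, keywords):
    """Keyword hits per 1,000 words (single-word keywords only in this sketch)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        return 0.0
    vocab = {k.lower() for k in keywords}
    hits = sum(1 for w in words if w in vocab)
    return 1000.0 * hits / len(words)


def text_signal(t1, t2):
    """Blend the two quantile-ranked components: 0.8 * T1 + 0.2 * T2."""
    return 0.8 * t1 + 0.2 * t2


# Toy passage: 8 words, 2 keyword hits.
toy = "glutamine uptake and glutamine metabolism in tumor cells"
print(density_per_1k(toy, ["glutamine"]))  # 250.0

# T1 and T2 are assumed to be quantile ranks computed over the whole corpus.
print(round(text_signal(0.9, 0.5), 4))     # 0.82
```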

2. Citation Signal (35%)

Citation counts are log-transformed to dampen the influence of outliers.

C = \log(1 + \text{Citation Count}) \\ \text{Citation Signal} = \mathrm{qrank}(C)
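A small sketch of the transform, assuming the natural logarithm (the base is not stated) and a hypothetical function name. It shows how counts spanning four orders of magnitude are compressed into a narrow range while their order is preserved.

```python
import math


def citation_feature(citation_count):
    """C = log(1 + citation count); log1p handles a zero count without error."""
    return math.log1p(citation_count)


counts = [0, 9, 99, 9999]  # toy values, not real citation data
features = [citation_feature(c) for c in counts]

for count, feature in zip(counts, features):
    print(count, round(feature, 3))
```

Since the logarithm is monotonic, it leaves the ordering of papers unchanged, so a purely rank-based `qrank` would give the same result on raw counts. The transform matters wherever the magnitude of C is used directly, for example in binned or interpolated quantiles.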

3. Recency Signal (15%)

Down-weights older work while clipping extremes.

R = \mathrm{clip}(\text{Pub Year} - \text{Baseline}, -10, 10) \\ \text{Recency Signal} = \mathrm{qrank}(R)
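The clipping step can be sketched as below. The baseline year is not specified here, so `baseline_year` is a hypothetical parameter and the value 2015 in the example is purely illustrative.

```python
def recency_feature(pub_year, baseline_year):
    """R = clip(pub_year - baseline_year, -10, 10)."""
    return max(-10, min(10, pub_year - baseline_year))


# With an illustrative baseline of 2015:
print(recency_feature(1990, 2015))  # -10 (clipped from -25)
print(recency_feature(2020, 2015))  # 5
print(recency_feature(2030, 2015))  # 10 (clipped from 15)
```

Clipping means that any two papers more than ten years older than the baseline are treated as equally old, so very old papers are not penalized further.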

4. Journal Prestige (15%)

A blended score of multiple impact metrics.

J = 0.55 \cdot \mathrm{qrank}(\text{SJR}) \\ + \; 0.20 \cdot \mathrm{qrank}(\text{Cites/Doc}) \\ + \; 0.15 \cdot \mathrm{qrank}(\text{CiteScore}) \\ + \; 0.10 \cdot \mathrm{qrank}(\text{H-index})
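A worked instance of the blend, using the four weights from the formula. The dictionary keys and the input ranks are made up for illustration; each input is assumed to be a quantile rank already computed across journals.

```python
# Weights taken from the formula for J; the key names are hypothetical.
JOURNAL_WEIGHTS = {
    "sjr": 0.55,
    "cites_per_doc": 0.20,
    "citescore": 0.15,
    "h_index": 0.10,
}


def journal_prestige(ranks):
    """Weighted sum of quantile-ranked journal metrics."""
    return sum(weight * ranks[name] for name, weight in JOURNAL_WEIGHTS.items())


# Illustrative quantile ranks for one journal (not real data).
example_ranks = {"sjr": 0.8, "cites_per_doc": 0.6, "citescore": 0.5, "h_index": 0.4}
print(round(journal_prestige(example_ranks), 4))  # 0.675
```

The weights sum to 1, so J stays in the 0-1 range whenever its inputs do.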

Final Relevance Score

\text{Score} = 0.35 \cdot \text{Text Signal} + 0.35 \cdot \text{Citation Signal} + 0.15 \cdot \text{Recency Signal} + 0.15 \cdot J
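The final combination can be checked with a short sketch. The function and argument names are hypothetical, and the input values are illustrative; each argument is assumed to be the 0-1 signal produced by the corresponding component above.

```python
def relevance_score(text, citation, recency, journal):
    """Weighted sum of the four 0-1 signals (weights 35/35/15/15)."""
    return 0.35 * text + 0.35 * citation + 0.15 * recency + 0.15 * journal


# Illustrative inputs, not real paper data:
# 0.35*0.82 + 0.35*0.60 + 0.15*0.50 + 0.15*0.675
#   = 0.287 + 0.210 + 0.075 + 0.10125
score = relevance_score(text=0.82, citation=0.60, recency=0.50, journal=0.675)
print(round(score, 5))  # 0.67325
```

Text and citation signals together carry 70% of the weight, so a paper with weak keyword relevance cannot reach a high score on journal prestige and recency alone.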

Why KG-RAG?

Standard LLM

Hallucinates links
Loses citations
Black-box reasoning

KG-RAG (Ours)

Grounded evidence
Traceable Hyperedges
Graph-based logic
