MetaKnogic

MetaKnogic Architecture

MetaKnogic (short for Metabolite Knowledge-Logic HyperGraph) maps the metabolic landscape of cancer. It ingests millions of full-text articles, scores them with a relevance model, and extracts multi-entity relationships using Knowledge-Graph RAG (KG-RAG).

The Problem

Biomedical knowledge is scattered across millions of unstructured articles. Critical relationships - like how a nutrient shapes a metabolic pathway driving tumor behavior - are buried in prose. Standard vector search often misses the explicit logic connecting these entities.

Our Solution

We model biological entities as nodes and evidence sentences as hyperedges. Unlike standard knowledge graphs (KGs), whose edges link only two entities, our hyperedges connect several entities at once (e.g., Gene + Metabolite + Cancer Subtype), preserving the full scientific context of each statement.
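As a rough illustration of the idea, a hyperedge can be represented as one evidence sentence plus the set of all entities it links. The sketch below is not the project's actual schema: the class names, the `source_id` field, and the placeholder entities are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    kind: str  # e.g. "gene", "metabolite", "cancer_subtype" (illustrative labels)


@dataclass(frozen=True)
class Hyperedge:
    sentence: str        # the evidence sentence itself
    source_id: str       # hypothetical identifier of the source article
    entities: frozenset  # every Entity the sentence connects


def make_hyperedge(sentence, source_id, entities):
    """Build a hyperedge; it must connect at least two distinct entities."""
    members = frozenset(entities)
    if len(members) < 2:
        raise ValueError("a hyperedge needs at least two distinct entities")
    return Hyperedge(sentence, source_id, members)


# Placeholder data only; not taken from any real article.
edge = make_hyperedge(
    "Placeholder sentence linking GENE_A, METABOLITE_B and SUBTYPE_C.",
    "PLACEHOLDER-1",
    [
        Entity("GENE_A", "gene"),
        Entity("METABOLITE_B", "metabolite"),
        Entity("SUBTYPE_C", "cancer_subtype"),
    ],
)
```

A pairwise graph would need three separate edges to cover these entities and would lose the fact that all three appear in the same statement; the single hyperedge keeps them together.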

Data Pipeline

We combine PubMed Central (PMC) full texts with OpenAlex metadata and filter for high-impact oncology and metabolism research.

  • Full-Text Corpus: 7M+ PMC Articles
  • Citation Metadata: 40M+ OpenAlex Records
  • Aligned & Filtered: 6.6M Relevance Matched
  • Curated Vocab: 3,400+ Metabolites & Genes

End-to-End Workflow

  • 1. Ingest & Align
    Pull XML from PMC and align it with OpenAlex citation metrics.

  • 2. Relevance Filtering
    Score papers on text signal (section boost and keyword density), citations, journal prestige, and recency.

  • 3. KG-RAG Extraction
    Extract entities and hyperedges using LLMs. Each hyperedge is a claim backed by its evidence sentence.

  • 4. Graph Construction
    Build a bipartite graph (Entity Nodes ↔ Evidence Nodes) in Neo4j.
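The graph-construction step can be sketched as expanding each hyperedge into one evidence node plus one link per entity it mentions. The function name, the input shape (evidence id mapped to entity names), and the placeholder ids below are assumptions for illustration; loading the resulting rows into Neo4j is deliberately left out.

```python
def to_bipartite(hyperedges):
    """Expand hyperedges into entity nodes, evidence nodes, and links.

    hyperedges: dict mapping an evidence id to the entity names it connects.
    Returns (entity_nodes, evidence_nodes, links), where each link is an
    (entity_name, evidence_id) pair. Links only ever join an entity to an
    evidence node, which is what makes the graph bipartite.
    """
    entity_nodes = set()
    evidence_nodes = []
    links = []
    for evidence_id, entities in hyperedges.items():
        evidence_nodes.append(evidence_id)
        for name in sorted(set(entities)):  # de-duplicate within one sentence
            entity_nodes.add(name)
            links.append((name, evidence_id))
    return sorted(entity_nodes), evidence_nodes, links


# Placeholder hyperedges, not real extractions.
entities, evidence, links = to_bipartite({
    "ev1": ["GENE_A", "METABOLITE_B", "SUBTYPE_C"],
    "ev2": ["GENE_A", "METABOLITE_B"],
})
```

With this layout, a hyperedge of any size costs one evidence node and as many links as it has entities, so the multi-entity context survives in a store that only supports pairwise relationships.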

Relevance Scoring Algorithm

Every paper receives a continuous relevance score to prioritize high-quality, impactful science. Each component is mapped to a 0–1 quantile rank (`qrank`).
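The page does not spell out how `qrank` is computed. One plausible reading is an empirical quantile with ties sharing the same rank; the mid-rank convention in the sketch below is an assumption, not the project's confirmed definition.

```python
from bisect import bisect_left, bisect_right


def qrank(values):
    """Map each value to a quantile rank in [0, 1] (assumed mid-rank scheme).

    For a value v, the rank is the average of the fraction of values strictly
    below v and the fraction at or below v, so tied values get equal ranks.
    """
    ordered = sorted(values)
    n = len(ordered)
    ranks = []
    for v in values:
        below = bisect_left(ordered, v)       # count of values < v
        at_or_below = bisect_right(ordered, v)  # count of values <= v
        ranks.append((below + at_or_below) / (2 * n))
    return ranks


ranks = qrank([10, 20, 20, 40])
```

Because only the ordering of values matters, quantile ranks put components measured on very different scales (counts, years, journal metrics) on a common 0–1 footing before they are weighted.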

1. Text Signal (35%)
Section-aware boosting and local keyword density.
\[ \begin{aligned} T_{1} &= \mathrm{qrank}(\text{Section Boost}) \\ T_{2} &= \mathrm{qrank}(\text{Density per 1k words}) \\ \text{Text Signal} &= 0.8 \cdot T_{1} + 0.2 \cdot T_{2} \end{aligned} \]
2. Citation Signal (35%)
Log-transformed citation counts to normalize outliers.
\[ \begin{aligned} C &= \log(1 + \text{Citation Count}) \\ \text{Citation Signal} &= \mathrm{qrank}(C) \end{aligned} \]
3. Recency Signal (15%)
Down-weights older work while clipping extremes.
\[ \begin{aligned} R &= \mathrm{clip}(\text{Pub Year} - \text{Baseline}, -10, 10) \\ \text{Recency Signal} &= \mathrm{qrank}(R) \end{aligned} \]
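The clipping step can be sketched as follows. The baseline year is not specified on this page, so it is passed in as a parameter; the function name is also an assumption. The result would still need to be converted to a quantile rank across the corpus.

```python
def year_offset(pub_year, baseline, limit=10):
    """Clip (pub_year - baseline) to the range [-limit, limit]."""
    return max(-limit, min(limit, pub_year - baseline))


# With a hypothetical baseline of 2010:
offsets = [year_offset(y, 2010) for y in (1990, 2015, 2024)]
```

Clipping means a paper 30 years older than the baseline is treated the same as one 10 years older, so very old or very new outliers cannot stretch the scale.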
4. Journal Prestige (15%)
A blended score of multiple impact metrics.
\[ \begin{aligned} J &= 0.55 \cdot \mathrm{qrank}(\text{SJR}) \\ &\quad + 0.20 \cdot \mathrm{qrank}(\text{Cites/Doc}) \\ &\quad + 0.15 \cdot \mathrm{qrank}(\text{CiteScore}) \\ &\quad + 0.10 \cdot \mathrm{qrank}(\text{H-index}) \end{aligned} \]
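A minimal sketch of the journal blend, assuming each metric has already been converted to a 0–1 quantile rank. The dictionary keys and function name are illustrative choices; only the weights come from the formula above.

```python
# Weights as stated in the Journal Prestige formula; they sum to 1.
JOURNAL_WEIGHTS = {
    "sjr": 0.55,
    "cites_per_doc": 0.20,
    "citescore": 0.15,
    "h_index": 0.10,
}


def journal_prestige(ranked):
    """Blend quantile-ranked journal metrics into a single 0-1 score.

    ranked: dict with one 0-1 quantile rank per key in JOURNAL_WEIGHTS.
    """
    missing = JOURNAL_WEIGHTS.keys() - ranked.keys()
    if missing:
        raise KeyError(f"missing metrics: {sorted(missing)}")
    return sum(weight * ranked[name] for name, weight in JOURNAL_WEIGHTS.items())


# Hypothetical ranks for one journal.
j = journal_prestige(
    {"sjr": 0.9, "cites_per_doc": 0.8, "citescore": 0.7, "h_index": 0.6}
)
```

Since the weights sum to 1 and every input lies in 0–1, the blended score stays in 0–1 as well.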
Final Relevance Score
\[ \text{Score} = 0.35 \cdot \text{Text Signal} + 0.35 \cdot \text{Citation Signal} + 0.15 \cdot \text{Recency Signal} + 0.15 \cdot J \]
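Putting the pieces together, the weighted sum can be sketched as below. The function names and the example inputs are assumptions; the weights are the ones listed for the four components, and every input is expected to be a 0–1 signal.

```python
# Component weights as stated above: 35% / 35% / 15% / 15%.
WEIGHTS = {"text": 0.35, "citation": 0.35, "recency": 0.15, "journal": 0.15}


def text_signal(t1, t2):
    """Combine the section-boost rank (t1) and the density rank (t2)."""
    return 0.8 * t1 + 0.2 * t2


def relevance_score(text, citation, recency, journal):
    """Weighted sum of the four 0-1 signals; the result is also in 0-1."""
    signals = {
        "text": text,
        "citation": citation,
        "recency": recency,
        "journal": journal,
    }
    for name, value in signals.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} signal must be in [0, 1], got {value}")
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)


# Hypothetical signal values for a single paper.
score = relevance_score(
    text=text_signal(0.9, 0.5),  # 0.8 * 0.9 + 0.2 * 0.5 = 0.82
    citation=0.6,
    recency=0.4,
    journal=0.7,
)
```

For these inputs the score works out to 0.35 · 0.82 + 0.35 · 0.6 + 0.15 · 0.4 + 0.15 · 0.7 ≈ 0.662.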
Why KG-RAG?
Standard LLM
  • Hallucinates links
  • Loses citations
  • Black-box reasoning
KG-RAG (Ours)
  • Grounded evidence
  • Traceable Hyperedges
  • Graph-based logic