The Scale of This Search

"Better to over-collect and triage than under-collect and redo."

We queried three major literature databases -- PubMed, Europe PMC, and Semantic Scholar -- with 213 structured queries covering the lung cancer multi-omics and AI/ML landscape.


The Funnel

                    T H E   F U N N E L

        +-----------------------------------------+
        |                                         |
        |         78,775  Raw API Hits            |
        |                                         |
        +-------------------+---------------------+
                            |
                     -11,896 duplicates
                            |
        +-------------------v---------------------+
        |                                         |
        |       66,879  Unique Papers             |
        |                                         |
        +------+----------+----------+------------+
               |          |          |
          +----v---+ +----v----+ +---v-----+       +----------+
          | Tier A | | Tier B  | | Tier C  |       | Retracted|
          |   49   | |  2,861  | | 63,969  |       |   208    |
          | Must   | | Strong  | | Back-   |       | Excluded |
          | Cite   | | Support | | ground  |       |          |
          +--------+ +---------+ +---------+       +----------+

By the Numbers

Metric                                            Value
------------------------------------------------  -------
Total raw hits                                    78,775
Unique after deduplication                        66,879
Non-retracted (master RIS)                        66,671
Tier A: must-cite papers (score >= 0.7)           49
Tier B: supporting evidence (0.4 <= score < 0.7)  2,861
Tier C: background reference (score < 0.4)        63,969
Bridge papers spanning 3+ themes                  1,928
Retracted papers flagged and excluded             208
Anchor PMIDs recovered                            34 / 35
Search Architecture

  • Primary source: PubMed (NCBI E-utilities) -- up to 800 results per query, relevance-sorted
  • Supplementary: Europe PMC -- 3 broadest queries per theme, up to 200 results each
  • Enrichment: Semantic Scholar -- anchor PMIDs + ~30 top papers per theme for citation counts
  • Deduplication: PMID-based primary, DOI secondary, fuzzy title match (Levenshtein distance <= 3) tertiary
  • Scoring: Title/abstract keyword density + journal tier + citation count + recency + article type
  • Caching: All API responses cached for reproducible re-runs
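
The three-level deduplication described above might look roughly like the following. This is a minimal sketch, not the pipeline's actual code: the `Record` class, `normalize_title`, and `deduplicate` are hypothetical names, the title normalization (lowercasing, collapsing punctuation) is an assumption, and the fuzzy title check here compares every record against every kept title, which is quadratic and would need blocking or indexing at the scale of 78,775 hits.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    title: str
    pmid: Optional[str] = None
    doi: Optional[str] = None


def levenshtein(a: str, b: str, limit: int = 3) -> int:
    """Edit distance between a and b, or some value > limit once it must exceed limit."""
    if abs(len(a) - len(b)) > limit:
        return limit + 1
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        if min(cur) > limit:  # row minima never decrease, so we can stop early
            return limit + 1
        prev = cur
    return prev[-1]


def normalize_title(title: str) -> str:
    """Lowercase and collapse punctuation/whitespace (an assumed normalization)."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def deduplicate(records, max_distance: int = 3):
    """Keep the first occurrence of each paper: PMID, then DOI, then fuzzy title."""
    seen_pmids, seen_dois, kept_titles, unique = set(), set(), [], []
    for rec in records:
        doi = rec.doi.lower() if rec.doi else None
        title = normalize_title(rec.title)
        is_dup = (
            (rec.pmid is not None and rec.pmid in seen_pmids)
            or (doi is not None and doi in seen_dois)
            or any(levenshtein(title, t, max_distance) <= max_distance
                   for t in kept_titles)
        )
        # Remember identifiers even for dropped records so later copies still match.
        if rec.pmid:
            seen_pmids.add(rec.pmid)
        if doi:
            seen_dois.add(doi)
        if not is_dup:
            kept_titles.append(title)
            unique.append(rec)
    return unique
```

Because `or` short-circuits, the cheaper identifier checks run first and the title comparison is only reached when neither PMID nor DOI matches, mirroring the primary/secondary/tertiary order in the list.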
