Data Science Publications

The NIAID Office of Data Science and Emerging Technologies (ODSET) highlights publications that feature innovative uses of data science and bioinformatics in infectious, immune-mediated and allergic disease research.

Explore NIAID data science publications on PubMed:

If you would like to feature a publication on this page, please contact datascience@niaid.nih.gov. Publications should feature research related to infectious, immunologic, and allergic diseases; include data science or a related discipline; and cite NIAID funding in the manuscript. Please include in your email:

  • The title of your published article.
  • A link to the article.
  • A 50-60 word description of the article. 

Partially characterized topology guides reliable anchor-free scRNA-integration

April 4, 2025
Communications Biology

The tool, scCRAFT, enables reliable single-cell RNA-seq integration by preserving high-confidence within-batch cell-to-cell topology through a dual-resolution triplet loss.

Why the growth of arboviral diseases necessitates a new generation of global risk maps and future projections

April 4, 2025
PLOS Computational Biology

Authors describe how current approaches to mapping arboviral diseases have become unnecessarily siloed, ignoring the strengths and weaknesses of different data types and methods. This places limits on data and model output comparability. Authors propose a new generation of risk mapping models that jointly infer risk from multiple data types.

Quantitative characterization of tissue states using multiomics and ecological spatial analysis

April 1, 2025
Nature Genetics

Multiomics and ecological spatial analysis (MESA) calculates ecodiversity-inspired metrics in spatially resolved omics integrated with single-cell data, enabling the quantitative comparison of tissue states across a range of conditions.

Putting computational models of immunity to the test—An invited challenge to predict B. pertussis vaccination responses

March 31, 2025
PLOS Computational Biology

Systems vaccinology studies have been used to build computational models that predict individual vaccine responses and identify the factors contributing to differences in outcome. Comparing such models is challenging due to variability in study designs. To address this, authors established a community resource to compare models predicting B. pertussis booster responses and generate experimental data for the explicit purpose of model evaluation.

Automatic detection and extraction of key resources from tables in biomedical papers

March 20, 2025
BioData Mining

Authors introduce four end-to-end table extraction pipelines to extract and faithfully reconstruct key resource tables from biomedical papers in PDF format. The pipelines employ machine learning approaches for key resource table page identification, “Table Transformer” models for table detection, and table structure recognition. Authors also introduce a character-level generative pre-trained transformer (GPT) language model for scientific tables pre-trained on over 11 million scientific tables. 

kir-mapper: A Toolkit for Killer-Cell Immunoglobulin-Like Receptor (KIR) Genotyping From Short-Read Second-Generation Sequencing Data

March 17, 2025
HLA Immune Response Genetics

Authors present kir-mapper, a toolkit to analyse killer cell immunoglobulin-like receptor (KIR) genes from short-read sequencing, focusing on detecting KIR alleles, copy number variation, as well as SNPs and InDels in the context of the hg38 reference genome. kir-mapper can be used with whole-genome sequencing (WGS), whole-exome sequencing (WES) and sequencing data generated after probe-based capture methods.

TRain: T-cell receptor automated immunoinformatics

March 6, 2025
BMC Bioinformatics

Authors introduce an open-source Python tool that streamlines the path from full T-cell receptor (TCR) sequence information to predicted 3D structures of TCR-peptide-major histocompatibility complex (TCR-pMHC) assemblies, using well-established tools. Analyzing these predicted complexes can provide deeper insight into the binding properties of TCRs and help shed light on one of the key steps in adaptive immune responses.

Revisiting the Plasmodium falciparum druggable genome using predicted structures and data mining

March 4, 2025
npj Drug Discovery

Leveraging recent advances in protein structure prediction, authors systematically assessed the Plasmodium falciparum genome, with review eventually yielding 27 high-priority antimalarial target candidates. This study also provides a genome-wide data resource for P. falciparum and implements a generalizable framework for systematically evaluating and prioritizing novel pathogenic disease targets.

Precise mycobacterial species and subspecies identification using the PEP-TORCH peptidome algorithm

March 4, 2025
EMBO Molecular Medicine

This study introduces the PEP-TORCH Peptidome Algorithm, an innovative LC-MS/MS-based approach for the accurate, rapid, and comprehensive identification of mycobacterial species and subspecies, including co-infections, directly from liquid culture samples.

Viral genomic features predict Orthopoxvirus reservoir hosts

February 26, 2025
Communications Biology

Authors applied machine learning models incorporating both host ecological and viral genomic features to predict likely reservoirs of orthopoxviruses (OPVs). They demonstrated that adding viral genomic features to host ecological traits improved the accuracy of potential OPV host predictions, highlighting the importance of host-virus molecular interactions in predicting potential host species. Authors also identified geographic hotspots rich in potential OPV hosts.

VaxBot-HPV: a GPT-based chatbot for answering HPV vaccine-related questions

February 19, 2025
JAMIA Open

The human papillomavirus (HPV) vaccine is an effective measure to prevent and control the diseases caused by HPV. However, widespread misinformation and vaccine hesitancy remain significant barriers to its uptake. This study focuses on the development of VaxBot-HPV, a chatbot aimed at improving health literacy and promoting vaccination uptake by providing information and answering questions about the HPV vaccine.

A comparative study of antibiotic resistance patterns in Mycobacterium tuberculosis

February 11, 2025
Scientific Reports

This study leverages the Bacterial and Viral Bioinformatics Resource Center (BV-BRC) to analyze over 27,000 Mycobacterium tuberculosis (MTB) genomes, providing a comprehensive, large-scale overview of antimicrobial resistance (AMR) prevalence and resistance patterns. Authors used MTB++, the newest and most comprehensive AI-based MTB drug resistance profiler, to predict the resistance profile of each of the 27,000 MTB isolates and then used feature analysis to identify key genes associated with resistance.

Conditional similarity triplets enable covariate-informed representations of single-cell data

February 9, 2025
BMC Bioinformatics

Authors introduce a novel approach that incorporates measured covariates when optimizing model parameters, ultimately specifying per-sample encodings that accurately reflect both immune signatures and additional clinical information.

Challenging a paradigm: Staggered versus single-pulse mass dog vaccination strategy for rabies elimination

February 7, 2025
PLOS Computational Biology

Authors constructed a stochastic metapopulation model to examine how the timing of pulsed vaccination campaigns across patches can affect metapopulation dynamics. They explored general metapopulation dynamics for pulsed vaccinations, parameterized the model for canine rabies in Arequipa, Peru, and simulated how the timing of the planned vaccination campaign, staggered over 6 months versus a single yearly pulse, affected the prospects for regional rabies elimination.

Systematic collection, annotation, and pattern analysis of viral vaccines in the VIOLIN vaccine knowledgebase

February 7, 2025
Frontiers in Cellular and Infection Microbiology

To better understand and design viral vaccines, it is critical to systematically collect, annotate, and analyse viral vaccines and to identify enriched patterns among them. Authors systematically collected experimentally verified viral vaccines from the literature, manually annotated them, and stored the information in the VIOLIN vaccine database. Enriched patterns were identified from systematic analysis of the viral vaccines and vaccine antigens.

zol and fai: large-scale targeted detection and evolutionary investigation of gene clusters

February 5, 2025
Nucleic Acids Research

Many universally and conditionally important genes are genomically aggregated within clusters. Here, the authors introduce fai and zol, which together enable large-scale comparative analysis of different types of gene clusters and mobile genetic elements, such as biosynthetic gene clusters (BGCs) or viruses. Fundamentally, they overcome a current bottleneck in reliably performing comprehensive orthology inference at large scale across broad taxonomic contexts and thousands of genomes.

Multi-strain phage induced clearance of bacterial infections

February 4, 2025
PLOS Computational Biology

Authors combine theory and computational models of in vivo phage therapy to study the efficacy of a phage cocktail composed of two complementary phages. The work is motivated by the example of Pseudomonas aeruginosa facing two phages, LUZ19v and PAK_P1, that exploit different surface receptors.

Leveraging public AI tools to explore systems biology resources in mathematical modeling

February 4, 2025
npj Systems Biology and Applications

Authors investigated the usage of public Artificial Intelligence (AI) tools in exploring systems biology resources in mathematical modeling. They tested public AI’s understanding of mathematics in models, related systems biology data, and the complexity of model structures.

Rsero: A user-friendly R package to reconstruct pathogen circulation history from seroprevalence studies

February 3, 2025
PLOS Computational Biology

The authors introduce an R package, Rsero, that implements a series of serocatalytic models and estimates the force of infection (FOI, i.e., the rate at which susceptible individuals become infected) from age-stratified seroprevalence data using Bayesian methods. The package also contains a series of features to perform model comparison and visualise model fit. 

A multi-omics spatial framework for host-microbiome dissection within the intestinal tissue microenvironment

January 31, 2025
Nature Communications

Authors introduce Microbiome Cartography (MicroCart), a framework for simultaneous in situ probing of host and microbiome across multiple spatial modalities. They demonstrate MicroCart by investigating gut host and microbiome changes in a murine colitis model, using spatial proteomics, transcriptomics, and glycomics. 

Longitudinal Viral Load Clustering for People With HIV Using Functional Principal Component Analysis

January 29, 2025
AIDS Research and Treatment

Longitudinal measures of viral load (VL) are critical in monitoring HIV status. While multiple lab indicators exist for monitoring VL, research on clustering historical or longitudinal VL measures is limited. This study uses functional data clustering to classify longitudinal VL patterns and characterize each cluster by demographics, comorbidities, social behaviors, and CD4 count.

Tandem Repeats Provide Evidence for Convergent Evolution to Similar Protein Structures

January 24, 2025
Genome Biology and Evolution

Fast protein structural search programs are revolutionizing our ability to detect remote homologs. It is presently unclear whether strong protein structure matches arise solely through homology or may result from convergent evolution to similar protein structures. Shared ancestry is the basis for many evolutionary analyses, and analogous structures could taint search results. Here, the author shows that strong structure matches lacking evidence for homology in their underlying sequences are depleted in multidomain proteins and enriched in structural repeats.

Harnessing Big Heterogeneous Data to Evaluate the Potential Impact of HIV Responses Among Key Populations in Sub-Saharan Africa: Protocol for the Boloka Data Repository Initiative

January 22, 2025
JMIR Publications

This study aimed to harness big heterogeneous data on HIV in South Africa, harmonizing and analyzing it to inform a targeted HIV response for greater impact in Sub-Saharan Africa.

A practical guide to FAIR data management in the age of multi-OMICS and AI

January 20, 2025
Frontiers in Immunology

With the rapid development of, and increasing interest in, artificial intelligence (AI) for the life sciences, improving how biologic data is organized and shared is more pressing than ever for scientific progress. Here, the authors outline a practical approach to multi-modal data management and FAIR sharing that extends the longevity and utility of data by allowing facile use and reuse, accelerating scientific discovery in the biomedical sciences.

Training data diversity enhances the basecalling of novel RNA modification-induced nanopore sequencing readouts

January 15, 2025
Nature Communications

Accurately basecalling sequence backbones in the presence of nucleotide modifications remains a significant challenge in nanopore sequencing bioinformatics. Here, authors report that basecallers exposed to diverse training modifications gain the generalizability to analyze novel modifications.