The NIAID Office of Data Science and Emerging Technologies (ODSET) highlights publications that feature innovative uses of data science and bioinformatics in infectious, immune-mediated and allergic disease research.
Explore NIAID data science publications on PubMed:
- NIAID-funded publications that use data science or computational biology (since 2023).
- Publications funded or co-funded by ODSET.
- Publications from the Harnessing Big Data to Halt HIV initiative.
If you would like to feature a publication on this page, please contact datascience@niaid.nih.gov. Publications should feature research related to infectious, immunologic, and allergic diseases; include data science or a related discipline; and cite NIAID funding in the manuscript. Please include in your email:
- The title of your published article.
- A link to the article.
- A 50-60 word description of the article.
TRain: T-cell receptor automated immunoinformatics
March 6, 2025 BMC Bioinformatics
Authors introduced an open-source Python tool that streamlines going from full T-cell receptor (TCR) sequence information to predicted 3D structures of TCRs bound to peptide-major histocompatibility complexes (pMHCs), using well-established tools. Analyzing these predicted complexes can provide deeper insights into the binding properties of TCRs and can help shed light on one of the key steps in adaptive immune responses.
Revisiting the Plasmodium falciparum druggable genome using predicted structures and data mining
March 4, 2025 npj Drug Discovery
Leveraging recent advances in protein structure prediction, authors systematically assessed the Plasmodium falciparum genome; subsequent review yielded 27 high-priority antimalarial target candidates. This study also provides a genome-wide data resource for P. falciparum and implements a generalizable framework for systematically evaluating and prioritizing novel pathogenic disease targets.
Viral genomic features predict Orthopoxvirus reservoir hosts
February 26, 2025 Communications Biology
Authors applied machine learning models incorporating both host ecological and viral genomic features to predict likely reservoirs of orthopoxviruses (OPVs). Adding viral genomic features to host ecological traits improved the accuracy of potential OPV host predictions, highlighting the importance of host-virus molecular interactions in predicting potential host species. Authors also identified geographic hotspots rich in potential OPV hosts.
VaxBot-HPV: a GPT-based chatbot for answering HPV vaccine-related questions
February 19, 2025 JAMIA Open
The human papillomavirus (HPV) vaccine is an effective measure to prevent and control diseases caused by HPV. However, widespread misinformation and vaccine hesitancy remain significant barriers to its uptake. This study focuses on the development of VaxBot-HPV, a chatbot aimed at improving health literacy and promoting vaccination uptake by providing information and answering questions about the HPV vaccine.
A comparative study of antibiotic resistance patterns in Mycobacterium tuberculosis
February 11, 2025 Scientific Reports
This study leverages the Bacterial and Viral Bioinformatics Resource Center (BV-BRC) to analyze more than 27,000 Mycobacterium tuberculosis (MTB) genomes, providing a comprehensive, large-scale overview of antibiotic resistance prevalence and resistance patterns. Authors used MTB++, the newest and most comprehensive AI-based MTB drug resistance profiler, to predict the resistance profile of each of the 27,000 MTB isolates, and then used feature analysis to identify key genes associated with resistance.
Conditional similarity triplets enable covariate-informed representations of single-cell data
February 9, 2025 BMC Bioinformatics
Authors introduce a novel approach that incorporates measured covariates when optimizing model parameters, ultimately specifying per-sample encodings that accurately reflect both immune signatures and additional clinical information.
Challenging a paradigm: Staggered versus single-pulse mass dog vaccination strategy for rabies elimination
February 7, 2025 PLOS Computational Biology
Authors constructed a stochastic metapopulation model to examine how the timing of pulsed vaccination campaigns across patches can affect metapopulation dynamics. They explored general metapopulation dynamics for pulsed vaccinations, parameterized the model for canine rabies in Arequipa, Peru, and simulated how the timing of the planned vaccination campaign, staggered over 6 months versus a single yearly pulse, affected the prospects for regional rabies elimination.
zol and fai: large-scale targeted detection and evolutionary investigation of gene clusters
February 5, 2025 Nucleic Acids Research
Many universally and conditionally important genes are genomically aggregated within clusters. Here, the authors introduce fai and zol, which together enable large-scale comparative analysis of different types of gene clusters and mobile genetic elements, such as biosynthetic gene clusters (BGCs) or viruses. Fundamentally, they overcome a current bottleneck in reliably performing comprehensive orthology inference at large scale across broad taxonomic contexts and thousands of genomes.
Multi-strain phage induced clearance of bacterial infections
February 4, 2025 PLOS Computational Biology
Authors combine theory and computational models of in vivo phage therapy to study the efficacy of a phage cocktail composed of two complementary phages. The work is motivated by the example of Pseudomonas aeruginosa facing two phages, LUZ19v and PAK_P1, that exploit different surface receptors.
Leveraging public AI tools to explore systems biology resources in mathematical modeling
February 4, 2025 npj Systems Biology and Applications
Authors investigated the use of public artificial intelligence (AI) tools for exploring systems biology resources in mathematical modeling. They tested how well public AI tools understand the mathematics in models, related systems biology data, and the complexity of model structures.
Rsero: A user-friendly R package to reconstruct pathogen circulation history from seroprevalence studies
February 3, 2025 PLOS Computational Biology
The authors introduce an R package, Rsero, that implements a series of serocatalytic models and estimates the force of infection (FOI, i.e., the rate at which susceptible individuals become infected) from age-stratified seroprevalence data using Bayesian methods. The package also contains a series of features to perform model comparison and visualize model fit.
A multi-omics spatial framework for host-microbiome dissection within the intestinal tissue microenvironment
January 31, 2025 Nature Communications
Authors introduce Microbiome Cartography (MicroCart), a framework for simultaneous in situ probing of host and microbiome across multiple spatial modalities. They demonstrate MicroCart by investigating gut host and microbiome changes in a murine colitis model, using spatial proteomics, transcriptomics, and glycomics.
Longitudinal Viral Load Clustering for People With HIV Using Functional Principal Component Analysis
January 29, 2025 AIDS Research and Treatment
Longitudinal measures of viral load (VL) are critical in monitoring HIV status. While multiple lab indicators exist for monitoring measures of VL, research on clustering historical/longitudinal VL measures is limited. This study uses functional data clustering to classify longitudinal VL patterns and characterize each cluster by demographics, comorbidities, social behaviors, and CD4 count.
Harnessing Big Heterogeneous Data to Evaluate the Potential Impact of HIV Responses Among Key Populations in Sub-Saharan Africa: Protocol for the Boloka Data Repository Initiative
January 22, 2025 JMIR Publications
This study aimed to harness big heterogeneous data on HIV in South Africa, harmonizing and analyzing it to inform a targeted HIV response for greater impact in Sub-Saharan Africa.
A practical guide to FAIR data management in the age of multi-OMICS and AI
January 20, 2025 Frontiers in Immunology
With the rapid development of, and increasing interest in, artificial intelligence (AI) for the life sciences, improving how biological data are organized and shared is more pressing than ever for scientific progress. Here, the authors outline a practical approach to multi-modal data management and FAIR (Findable, Accessible, Interoperable, Reusable) sharing, which extends the longevity and utility of data by making use and reuse easier and accelerates scientific discovery in the biomedical sciences.
Training data diversity enhances the basecalling of novel RNA modification-induced nanopore sequencing readouts
January 15, 2025 Nature Communications
Accurately basecalling sequence backbones in the presence of nucleotide modifications remains a significant challenge in nanopore sequencing bioinformatics. Here, authors report that basecallers exposed to diverse training modifications gain the generalizability to analyze novel modifications.
Fragmenstein: predicting protein–ligand structures of compounds derived from known crystallographic fragment hits using a strict conserved-binding–based methodology
January 13, 2025 Journal of Cheminformatics
Current strategies centered on either merging or linking initial hits from fragment-based drug design (FBDD) crystallographic screens generally do not fully leverage 3D structural information. Here, the authors show that an algorithmic approach (Fragmenstein) that ‘stitches’ the ligand atoms from this structural information together can provide more accurate and reliable predictions for protein–ligand complex conformation than general methods such as pharmacophore-constrained docking. This method is likely to further benefit future drug design campaigns and be integrated in future pipelines.
nipalsMCIA: flexible multi-block dimensionality reduction in R via nonlinear iterative partial least squares
January 12, 2025 Bioinformatics
With the increased reliance on multi-omics data for bulk and single-cell analyses, the availability of robust approaches to perform unsupervised learning for clustering, visualization, and feature selection is imperative. Authors introduce nipalsMCIA, an implementation of multiple co-inertia analysis (MCIA) for joint dimensionality reduction that solves the objective function using an extension to Nonlinear Iterative Partial Least Squares.
Prediction of cccDNA dynamics in hepatitis B patients by a combination of serum surrogate markers
January 9, 2025 PLOS Computational Biology
Authors employed a recently developed multiscale mathematical model describing intra- and intercellular viral propagation and applied it to hepatitis B virus (HBV)-infected patients under treatment. They developed a model that can predict intracellular HBV dynamics using extracellular viral markers in peripheral blood, including HBsAg, HBV DNA, and HBcrAg. Importantly, the model's predictions of the amount of covalently closed circular DNA (cccDNA) in patients over time correlated well with cccDNA quantified from paired liver biopsies.
Learning predictive signatures of HLA type from T-cell repertoires
January 6, 2025 PLOS Computational Biology
The deep sequencing of immune repertoires from blood samples promises to offer diagnostic and precision medicine tools, and to help with the analysis of treatments and vaccinations and the design of immunotherapies. This study shows how, by training models on large datasets of annotated and unannotated repertoire sequencing, the human leukocyte antigen (HLA) type of patients can be computationally determined from their T-cell repertoire. This tool could be useful for typing repertoires for which the HLA is unknown, and for gaining insight into the HLA restriction of T-cell receptor epitope specificity. It also provides lists of T-cell receptors that are associated with each HLA allele.
Characterizing cell-type spatial relationships across length scales in spatially resolved omics data
January 3, 2025 Nature Communications
Authors introduce CRAWDAD, an R package for quantifying cell-type spatial relationships across length scales in tissues using spatial omics data, enabling the identification of consistent as well as sample-specific cell-type spatial relationships across multiple spatial omics datasets.
Integrating electronic health records and GWAS summary statistics to predict the progression of autoimmune diseases from preclinical stages
January 2, 2025 Nature Communications
The authors propose a Genetic Progression Score (GPS) model to predict autoimmune disease progression from preclinical stages, integrating genome-wide association study summary statistics and electronic health records from the All of Us and BioVU biobanks.
The Impact of Ancestry on Genome-Wide Association Studies
January 1, 2025 Pacific Symposium on Biocomputing
Decisions regarding how the genetic data are combined in genome-wide association studies (GWAS) can mask important findings that might serve individuals of ancestries that have been historically underrepresented in genetic studies. New methods that consider ancestry-specific variants in conjunction with the shared variants need to be developed.
A Prospective Comparison of Large Language Models for Early Prediction of Sepsis
January 1, 2025 Pacific Symposium on Biocomputing
Authors present a comparative study on the performance of two popular open-source large language models for early prediction of sepsis: Llama-3 8B and Mixtral 8x7B. The primary goal was to determine whether a smaller model could achieve comparable predictive accuracy to a significantly larger model in the context of sepsis prediction using clinical data.
A multivalent binding model infers antibody Fc species from systems serology
December 23, 2024 PLOS Computational Biology
Authors developed a computational model that interprets data from systems serology, allowing them to infer detailed, engagement-relevant structural information about antibodies that is difficult to measure directly. The model was applied to existing data from COVID-19 and HIV studies, revealing new insights into how antibody structure relates to vaccine efficacy and disease progression. The model also helped to identify which measurements in systems serology are most informative, potentially streamlining future studies.