Associate Investigator, The Feinstein Institute for Medical Research
Phone: (516) 562-1076
Dr. Wentian Li obtained his BS from Beijing University (physics) and PhD from Columbia University (physics and complex systems). He has a diverse background, having worked at the Condensed Matter Theory Group at Brookhaven National Laboratory, the Center for Complex Systems Research at the University of Illinois at Urbana-Champaign, the Santa Fe Institute, the Lab of Statistical Physics at Rockefeller University, the Computational Biology group at Cold Spring Harbor, and the Statistical Genetics Lab at Columbia University Medical Center and New York State Psychiatric Institute. Before joining the Feinstein Institute for Medical Research, he was an assistant professor at Rockefeller University. He joined the Center for Genomics and Human Genetics to apply his genetic analysis expertise to data generated in studying rheumatoid arthritis. Besides traditional genetic analyses such as pedigree/linkage analysis and case-control genetic association analysis, he collaborates with many groups at Feinstein on microarray expression analysis. More recently, with the high demand for a biological understanding of the data, he has also been doing bioinformatics and systems biology analysis.
Dr. Li’s scope of expertise is the statistical, mathematical, computational, and quantitative analysis of biomedical data. As the types of data evolve with biotechnology, the types of analyses in which Dr. Li is involved are also constantly shifting. Whether the shift is from family genetic data to population data, from common variants typed by chip to rare variants obtained by next-generation sequencing, from microarray expression data to sequence-based expression profiles, from intermediate-sized data to big data, or from pure statistical analysis to bioinformatics and biology-annotated data analysis, Dr. Li’s work never stands still. He has published roughly 110 peer-reviewed publications, 60% of them as first or senior author.
Over the years, Dr. Li has reviewed more than 370 scientific papers for 95 different journals. He was an editorial board member of Bioinformatics from 2005 to 2008, and has been an editorial board member of the Journal of Theoretical Biology since 2005. In 2013, he became one of the two editors-in-chief of Computational Biology and Chemistry. Since then, he has helped to edit a special issue on “Complexity in Genomes”, and is producing another special issue, “Advances in System Biology”. He served as an ad hoc reviewer for NIH’s Tumor Microenvironment Network grant proposal in 2011, as well as a grant reviewer for the Centre National de la Recherche Scientifique (CNRS) in France, the Biotechnology and Biological Sciences Research Council (BBSRC) in the UK, the Ministry of Science and Technology in India, and the Wellcome Trust in the UK. He was a co-organizer of the satellite meeting on genomic complexity at the European Conference on Complex Systems (ECCS’12), and has been a program committee member for the annual Asia-Pacific Bioinformatics Conference since 2007.
Our research focus is to assess, analyze, summarize, and annotate biomedical data, both to help biologists understand the topics of their study and to generate our own theoretical hypotheses on biological phenomena and human diseases.
The genetic study of complex diseases has reached a stage where all amino-acid-changing, stop-codon-introducing, and splice-site-altering variants can be investigated relatively easily, whereas non-coding variants remain difficult to study. We choose to pursue a reasonably established topic in bioinformatics, the transcription factor binding site (TFBS), as an entry point to understanding non-coding variants. The effect of a single-nucleotide polymorphism (SNP) in a TFBS can be measured by the binding score difference between the reference allele and the variant allele. Combining the binding score change with other information, such as the openness of the chromatin structure, the evolutionary conservation of the studied position, the distance to the transcription start site (TSS), and enhancer locations predicted by other experimental and analytic means, may help us to prioritize and filter variants that are more likely to be functional.
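The binding score difference described above can be sketched with a position weight matrix (PWM). This is a minimal illustration, not the group's actual pipeline: the 4-bp count matrix, the uniform background frequency, and the pseudocount are all hypothetical values chosen for the example, and the function names are invented here.

```python
import math

# Hypothetical position count matrix for a 4-bp motif (illustrative values only).
COUNTS = {
    "A": [8, 1, 0, 7],
    "C": [1, 0, 9, 1],
    "G": [0, 8, 1, 1],
    "T": [1, 1, 0, 1],
}
BACKGROUND = 0.25   # assumed uniform background base frequency
PSEUDOCOUNT = 0.5   # assumed pseudocount to avoid log(0)


def log_odds_matrix(counts, pseudocount=PSEUDOCOUNT, background=BACKGROUND):
    """Convert base counts per position into log2-odds scores per position."""
    length = len(counts["A"])
    matrix = []
    for i in range(length):
        total = sum(counts[b][i] for b in "ACGT") + 4 * pseudocount
        matrix.append({
            b: math.log2((counts[b][i] + pseudocount) / total / background)
            for b in "ACGT"
        })
    return matrix


def binding_score(site, matrix):
    """Sum the per-position log-odds scores of a site as long as the motif."""
    return sum(column[base] for base, column in zip(site, matrix))


def snp_score_change(ref_site, offset, alt_base, matrix):
    """Score of the variant allele minus score of the reference allele."""
    alt_site = ref_site[:offset] + alt_base + ref_site[offset + 1:]
    return binding_score(alt_site, matrix) - binding_score(ref_site, matrix)


matrix = log_odds_matrix(COUNTS)
# A C>T change at motif position 2 of the hypothetical site "AGCA".
delta = snp_score_change("AGCA", 2, "T", matrix)
print(f"binding score change: {delta:.2f}")
```

A negative change suggests the variant allele weakens the predicted binding; in practice such a score would be only one input alongside chromatin openness, conservation, and distance to the TSS.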
The number of genes genetically associated with a disease, the number of mRNA transcripts differentially expressed between diseased and normal samples, and the number of candidate genes can easily reach three digits (hundreds) per disease, if not thousands. To understand this huge number of factors, grouping genes and connecting the dots are essential. We propose a new organizing principle for gene sets, based on the level of their association (physical, chemical, biological, and medical). This principle has certain advantages over the commonly used MSigDB organization structure and is more general than the Gene Ontology (GO). We routinely employ gene-network/systems-biology programs such as GeneMANIA to examine candidate genes for hypothesis generation. In particular, we are interested in deepening our understanding of rheumatoid arthritis and other autoimmune diseases through the systems biology approach.
Although next-generation sequencing (NGS) offers much promise in discovering rare/personal variants, indels, and other copy number variations, the short read length (around 150 base pairs) prevents its application to repeated regions. The current reference human genome misses 8% of its content due to the difficulty in uniquely mapping redundant contigs. Of the remaining 92%, 1% occurs more than once at the 1000-base level and is thus unmappable. We have thoroughly examined this 1% of the genome and overlapped it with other genomic annotations. Close to 4% of protein-coding genes, 7% of pseudogenes, 7% of tRNAs, and 1% of microRNAs overlap with the unmappable regions. Since repeated regions are also hotspots for segmental duplication events or recent copy number variations, a better catalog of them helps our understanding of genomic instability and their potential role in diseases.
We are interested in any statistics and characterization of the human genome. The well-known quantities, such as genome size, number of genes, number of chromosomes, and global guanine-cytosine content (GC%), are all studied from an evolutionary perspective. The less-studied quantities include the number of GC%-high and GC%-low domains, the number of transcription factor (TF) genes, the number of microRNAs, the number of genetic variants in a population, etc. The number of targets per TF, in particular, has immediate consequences for the impact of a mutation in a TF gene. TFs with a large number of regulation targets (e.g. master genes for cell subtype development, master circadian genes, etc.) function differently from TFs with a single target. We have observed that the distribution of the number of targets follows a power law, indicating the coexistence of both ubiquitous and unique TFs.
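One common way to quantify a power-law distribution such as the one described for targets per TF is to estimate its exponent by maximum likelihood. The sketch below uses a widely used approximate estimator for discrete data, alpha = 1 + n / sum(ln(x_i / (x_min - 1/2))); whether this matches the analysis actually performed is an assumption, and the target counts are made-up illustrative numbers, not real data.

```python
import math


def powerlaw_exponent(values, xmin=1):
    """Approximate maximum-likelihood exponent for discrete power-law data.

    Only values >= xmin are used; xmin is an assumed lower cutoff.
    """
    tail = [x for x in values if x >= xmin]
    if not tail:
        raise ValueError("no values at or above xmin")
    log_sum = sum(math.log(x / (xmin - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum


# Hypothetical numbers of targets per TF (illustrative only).
targets_per_tf = [1, 1, 1, 1, 2, 2, 3, 5, 10, 40, 250]
alpha = powerlaw_exponent(targets_per_tf, xmin=1)
print(f"estimated exponent: {alpha:.2f}")
```

An exponent estimate alone does not establish that the data follow a power law; a goodness-of-fit comparison against alternatives such as a log-normal distribution would normally accompany it.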
Absolute pitch (AP) is the ability (phenotype) of an individual to identify a musical note without using a reference tone. The contributions of genes and environment to absolute pitch, as well as their interaction, have been constantly debated. Through linkage analysis of a combined dataset of both absolute pitch and synesthesia, we have observed two chromosome regions, one on chromosome 6 and another on chromosome 2, which may contain AP genes. Besides data analysis, we also plan to carry out systems biology analysis by exploring the space of genes involved in early brain development.
Thousands of genes, called circadian genes, are expressed at a certain time period within a 24-hour cycle. Some of them are circadian in the suprachiasmatic nucleus (SCN), and thus central, while other genes exhibit passive oscillation in peripheral organs. We found circadian genes to be enriched among differentially expressed genes in chronic lymphocytic leukemia (CLL). We also observe that half of the genes genetically associated with rheumatoid arthritis (RA) are circadian genes. As a comparison, a similar analysis of genes genetically associated with schizophrenia yields 30%, and the proportion of circadian genes among all genes reported in the literature is 10%-20%. On the other hand, the level of immune response, the level of inflammatory cytokines, and clinical symptoms such as pain are reported to oscillate in RA patients over a 24-hour cycle. We intend to explore the circadian oscillation aspect of a complex disease.
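Whether a proportion such as "half of the RA-associated genes are circadian" exceeds the 10%-20% background can be checked with a one-sided hypergeometric test. The sketch below is a generic enrichment calculation, not necessarily the test used in this work, and every count in the usage example is a hypothetical placeholder.

```python
from math import comb


def enrichment_pvalue(total_genes, circadian_genes, set_size, overlap):
    """One-sided hypergeometric P(X >= overlap).

    X is the number of circadian genes in a random gene set of size
    set_size drawn from total_genes, of which circadian_genes are circadian.
    """
    upper = min(circadian_genes, set_size)
    tail = sum(
        comb(circadian_genes, k) * comb(total_genes - circadian_genes, set_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / comb(total_genes, set_size)


# Hypothetical counts: 20000 genes, 15% circadian, 100 disease genes, 50 circadian.
p = enrichment_pvalue(total_genes=20000, circadian_genes=3000, set_size=100, overlap=50)
print(f"enrichment p-value: {p:.3g}")
```

The result depends strongly on the assumed background proportion, so repeating the calculation at both ends of the reported 10%-20% range would be a reasonable sensitivity check.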
I constantly introduce concepts from one field to another, which is characteristic of interdisciplinary research. One example is the introduction of the volcano plot and regularized t-statistics, which are used in microarray analysis, to genetic association analysis. This results in a new concept of regularized chi-square statistics. The volcano plot readily reveals the role of the allele frequency of a variant. It also points to new criteria for selecting associated SNPs by combining both test p-values and odds ratios. Another example is the introduction of Menzerath’s law from quantitative linguistics to genomic studies. Menzerath’s law in language states that the longer the word, the shorter its syllables. In the genomic context, we ask whether genes with a larger number of exons tend to have smaller exon sizes. This was indeed observed to be true, and a more recent study shows that this feature can be used to separate different types of genes.
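The two axes of a volcano plot for a case-control SNP can be sketched as the log2 odds ratio and the negative log10 p-value of a standard Pearson chi-square test (1 degree of freedom) on the 2x2 allele count table. This illustrates only the plain statistics; it does not implement the regularized chi-square statistic mentioned above, whose definition is not given here. The function name and the allele counts are hypothetical, and all four counts are assumed to be nonzero.

```python
import math


def volcano_point(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Return (log2 odds ratio, -log10 p-value) for a 2x2 allele count table.

    Assumes all four counts are positive.
    """
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    odds_ratio = (case_alt * ctrl_ref) / (case_ref * ctrl_alt)
    # Pearson chi-square statistic for a 2x2 table (no continuity correction).
    chi2 = n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2 / (
        (case_alt + case_ref)
        * (ctrl_alt + ctrl_ref)
        * (case_alt + ctrl_alt)
        * (case_ref + ctrl_ref)
    )
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return math.log2(odds_ratio), -math.log10(p_value)


# Hypothetical allele counts for one SNP (illustrative only).
x, y = volcano_point(case_alt=60, case_ref=40, ctrl_alt=40, ctrl_ref=60)
print(f"log2(OR) = {x:.2f}, -log10(p) = {y:.2f}")
```

Plotting these pairs for many SNPs puts effect size on the horizontal axis and significance on the vertical axis, which is what allows selection criteria that combine both quantities.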
Field of Study: Physics
Field of Study: Physics and Complex Systems