Pengyi Yang - Omic Data Scientist

About me

I obtained my PhD in bioinformatics from the School of Information Technologies, University of Sydney, in 2012. I then moved to the United States and completed an interdisciplinary Research Fellowship in the Systems Biology Group, ESCBL, at the National Institutes of Health, characterising transcriptomic and epigenomic regulation in embryonic stem cells (ESCs) using ultrafast sequencing data. I returned to Australia in late 2015 on a Sydney University Postdoctoral Fellowship (DVCR) to pursue my own research in systems biology. I'm now affiliated with the School of Mathematics and Statistics (SoMS) and the Charles Perkins Centre, University of Sydney. I was offered a Lectureship in Statistics in April 2016 and a Discovery Early Career Researcher Award (DECRA) by the Australian Research Council (ARC). I'm currently teaching STAT5003.


  • Our work on adaptive sampling for classification problems has been accepted and will appear at the International Joint Conference on Artificial Intelligence (IJCAI-17) in Melbourne. Read the pre-print version here: PDF

  • Taiyun has his first publication, as a co-second author, on integrative analysis of transcription factors in embryonic stem cells (ESCs). Congratulations to Taiyun! Read the full text here

  • Dinuka has joined our group through the Talented Student Program (USyd). He will be working on visualising large-scale omics datasets.

  • Our KinasePA Shiny app was recently published in Proteomics. Read more: Text

Research interests

My research interests are in the broad areas of Computational and Systems Biology with a focus on cell signaling, epigenetic, and transcriptional networks. Specifically, I am interested in developing computational methods and statistical models to reconstruct and characterize signaling cascades, and epigenetic and transcriptional networks that underlie cellular homeostasis, proliferation, differentiation, and cell-fate decisions.

Studying biological pathways at a systems level is essential, for it is not always possible to understand the behavior of complex systems by scaling up properties of individual components. Systematic study of complex interactions in biological networks from a global viewpoint allows us to discover fundamental principles that are not intuitive and to uncover global properties that emerge only when interactions between individual components are integrated. I am using systems biology approaches to integrate and analyze heterogeneous high-throughput “–omics” data with the goal of generating testable hypotheses and predictions. My ultimate objective is to discover critical cell signaling pathways that regulate epigenetic landscapes and gene expression programs controlling cell type identity. Results from these studies will contribute to a comprehensive understanding of the cross-talk among cell signaling, epigenetic, and transcriptional regulation.




✢: Co-first author
#: Corresponding/Co-corresponding author


Systems biology:
Yang, P.#, Oldfield, A., Kim, T., Yang, A., Yang, J. & Ho, J.# (2017) Integrative analysis identifies co-dependent gene expression regulation of BRG1 and CHD7 at distal regulatory sites in embryonic stem cells. Bioinformatics, [Full Text], [PDF]

Methodology and tools:
Yang, P., Liu, W. & Yang, J. (2017) Positive unlabeled learning via wrapper-based adaptive sampling. Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI), Pre-print [PDF]


Systems biology:
Zheng, X., Yang, P., Lackford, B., Bennett, B., Wang, L., Li, H., Wang, Y., Miao, Y., Foley, J., Fargo, D., Jin, Y., Williams, C., Jothi, R. & Hu, G. (2016) CNOT3-dependent mRNA deadenylation safeguards the pluripotent state. Stem Cell Reports, 7(5), 897-910 [Text]

Minard, A., Tan, S., Yang, P., Fazakerley, D., Domanova, W., Parker, B., Humphrey, S., Jothi, R., Stöckli, J. & James, D. (2016) mTORC1 is a major regulatory node in the FGF21 signaling network in adipocytes. Cell Reports, 17(1), 29-36 [Pubmed]

Methodology and tools:
Yang, P., Patrick, E., Humphrey, S., Ghazanfar, S., James, D., Jothi, R. & Yang, J. (2016). KinasePA: Phosphoproteomics data annotation using hypothesis driven kinase perturbation analysis. Proteomics, 16(13), 1868-1871 [Text], [Online tool]

Yang, P.#, Humphrey, S., James, D., Yang, J. & Jothi, R.# (2016). Positive-unlabeled ensemble learning for kinase substrate prediction from dynamic phosphoproteomics data. Bioinformatics, 32(2), 252-259. [Pubmed], [Predictions]

Lu, C., Wang, J., Zhang, Z., Yang, P. & Yu, G. (2016). NoisyGOA: Noisy GO annotations prediction using taxonomic and semantic similarity. Computational Biology and Chemistry, Corrected Proof.

Domanova, W., Krycer, J., Chaudhuri, R., Yang, P., Vafaee, F., Fazakerley, D., Humphrey, S., James, D. & Kuncic, Z. (2016). Unraveling kinase activation dynamics using kinase-substrate relationships from temporal large-scale phosphoproteomics studies. PLoS One, 11(6), e0157763.


Systems biology:
Pathania, R., Ramachandran, S., Elangovan, S., Padia, R., Yang, P., Cinghu, S., Veeranan-Karmegam, R., Fulzele, S., Pei, L., Chang, C., Choi, J., Shi, H., Manicassamy, S., Prasad, P., Sharma, S., Ganapathy, V., Jothi, R. & Thangaraju, M. (2015). DNMT1 is essential for mammary and cancer stem cell maintenance and tumorigenesis. Nature Communications, 6, 6910. [Pubmed]

Hoffman, N., Parker, B., Chaudhuri, R., Fisher-Wellman, K., Kleinert, M., Humphrey, S., Yang, P., Holliday, M., Trefely, S., Fazakerley, D., Stöckli, J., Burchfield, J., Jensen, T., Jothi, R., Kiens, B., Wojtaszewski, J., Richter, E. & James, D. (2015). Global phosphoproteomic analysis of human skeletal muscle reveals a network of exercise-regulated kinases and AMPK substrates. Cell Metabolism, 22(5), 922-935. [Pubmed]

Methodology and tools:
Yang, P.#, Zheng, X., Jayaswal, V., Hu, G., Yang, J. & Jothi, R. (2015). Knowledge-based analysis for detecting key signaling events from time-series phosphoproteomics data. PLoS Computational Biology, 11(8), e1004403. [Pubmed]


Systems biology:
Oldfield, A., Yang, P., Conway, A., Cinghu, S., Freudenberg, J., Yellaboina, S. & Jothi, R. (2014). Histone-fold domain protein NF-Y promotes chromatin accessibility for cell type-specific master transcription factors. Molecular Cell, 55(5), 708-722. [Pubmed]

Ma, X., Yang, P., Kaplan, W., Lee, B., Wu, L., Yang, J., Yasunaga, M., Sato, K., Chisholm, D. & James, D. (2014). ISL1 regulates peroxisome proliferator-activated receptor γ activation and early adipogenesis via bone morphogenetic protein 4-dependent and -independent mechanisms. Molecular and Cellular Biology, 34(19), 3607-3617. [Pubmed]

Lackford, B., Yao, C., Charles, G., Weng, L., Zheng, X., Choi, E., Xie, X., Wan, J., Xing, Y., Freudenberg, J., Yang, P., Jothi, R., Hu, G. & Shi, Y. (2014). Fip1 regulates mRNA alternative polyadenylation to promote stem cell self‐renewal. EMBO Journal, 33(8), 878-889. [Pubmed]

Methodology and tools:
Yang, P., Patrick, E., Tan, S., Fazakerley, D., Burchfield, J., Gribben, C., Prior, M., James, D. & Yang, J. (2014). Direction pathway analysis of large-scale proteomics data reveals novel features of the insulin action pathway. Bioinformatics, 30(6), 808-814. [Pubmed]

Yang, P.#, Yoo, P., Fernando, J., Zhou, B., Zhang, Z. & Zomaya, A. (2014). Sample subset optimization techniques for imbalanced and ensemble learning problems in bioinformatics applications. IEEE Transactions on Cybernetics, 44(3), 445-455. [IEEE Xplore] [PDF]


Systems biology:
Humphrey, S., Yang, G., Yang, P., Fazakerley, D. J., Stöckli, J., Yang, J. & James, D. (2013). Dynamic adipocyte phosphoproteome reveals that Akt directly regulates mTORC2. Cell Metabolism, 17(6), 1009-1020. [Pubmed]

Methodology and tools:
Yang, P., Liu, W., Zhou, B., Chawla, S. & Zomaya, A. (2013). Ensemble-based wrapper methods for feature selection and class imbalance learning. In Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Lecture Notes in Artificial Intelligence 7818, Springer Berlin Heidelberg, 544-555. [Text]

Yang, P., Yang, J., Zhou, B. & Zomaya, A. (2013). Stability of feature selection algorithms and ensemble feature selection methods in bioinformatics. In Biological Knowledge Discovery Handbook: Preprocessing, Mining and Postprocessing of Biological Data, Wiley, New Jersey, USA, 333-352. [PDF]


Yang, P., Humphrey, S., Fazakerley, D., Prior, M., Yang, G., James, D. & Yang, J. (2012). Re-fraction: a machine learning approach for deterministic identification of protein homologues and splice variants in large-scale MS-based proteomics. Journal of Proteome Research, 11(5), 3035-3045. [Pubmed]

Yang, P.#, Ma, J., Wang, P., Zhu, Y., Zhou, B. & Yang, J. (2012). Improving X! Tandem on peptide identification from mass spectrometry by self-boosted Percolator. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9(5), 1273-1280. [Pubmed]

Wang, P., Yang, P. & Yang, J. (2012). OCAP: an open comprehensive analysis pipeline for iTRAQ. Bioinformatics, 28(10), 1404-1405. [Pubmed]


Yang, P.✢, #, Ho, J., Yang, J. & Zhou, B. (2011). Gene-gene interaction filtering with ensemble of filters. BMC Bioinformatics, 12, S10. [Pubmed]

Yang, P., Zhang, Z., Zhou, B. & Zomaya, A. (2011). Sample subset optimization for classifying imbalanced biological data. In Proceedings of the 15th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Lecture Notes in Artificial Intelligence 6635, Springer Berlin Heidelberg, 333-344. [Text]


Yang, P.#, Ho, J., Zomaya, A. & Zhou, B. (2010). A genetic ensemble approach for gene-gene interaction identification. BMC Bioinformatics, 11(1), 524. [Pubmed]

Wang, P., Yang, P., Arthur, J. & Yang, J. (2010). A dynamic wavelet-based algorithm for pre-processing tandem mass spectrometry data. Bioinformatics, 26(18), 2242-2249. [Pubmed]

Yoo, P., Ho, Y., Ng, J., Charleston, M., Saksena, N., Yang, P. & Zomaya, A. (2010). Hierarchical kernel mixture models for the prediction of AIDS disease progression using HIV structural gp120 profiles. BMC Genomics, 11, S22. [Pubmed]

Yang, P.#, Zhang, Z., Zhou, B. & Zomaya, A. (2010). A clustering based hybrid system for biomarker selection and sample classification of mass spectrometry data. Neurocomputing, 73(13), 2317-2331. [Text]

Yang, P.#, Zhou, B., Zhang, Z. & Zomaya, A. (2010). A multi-filter enhanced genetic ensemble system for gene selection and sample classification of microarray data. BMC Bioinformatics, 11, S5. [Pubmed]

Yang, P., Yang, J., Zhou, B. & Zomaya, A. (2010). A review of ensemble methods in bioinformatics. Current Bioinformatics, 5(4), 296-308. [Text]

Li, L., Yang, P., Ou, L., Zhang, Z. & Cheng, P. (2010). Genetic algorithm-based multi-objective optimisation for QoS-aware web services composition. In Proceedings of the 4th International Conference on Knowledge Science, Engineering and Management (KSEM), Lecture Notes in Computer Science 6291, Springer Berlin Heidelberg, 549-554. [Text]


Yang, P.#, Xu, L., Zhou, B., Zhang, Z. & Zomaya, A. (2009). A particle swarm based hybrid system for imbalanced medical data sampling. BMC Genomics, 10, S34. [Pubmed]

Zhang, Z., Yang, P., Wu, X. & Zhang, C. (2009). An agent-based hybrid system for microarray data analysis. IEEE Intelligent Systems, 24(5), 53-63. [PDF]

Yang, P.# & Zhang, Z. (2009). An embedded two-layer feature selection approach for microarray data analysis. IEEE Intelligent Informatics Bulletin, 10(1), 24-32. [PDF]

Yang, P., Tao, L., Xu, L. & Zhang, Z. (2009). Multiagent framework for bio-data mining. In Proceedings of the 4th Rough Sets and Knowledge Technology (RSKT), Lecture Notes in Computer Science 5589, Springer Berlin Heidelberg, 200-207. [Text]


Zhang, Z. & Yang, P.# (2008). An ensemble of classifiers with genetic algorithm-based feature selection. IEEE Intelligent Informatics Bulletin, 9(1), 18-24. [PDF]

Yang, P. & Zhang, Z. (2008). A clustering based hybrid system for mass spectrometry data analysis. In Proceedings of the 3rd Pattern Recognition in Bioinformatics (PRIB), Lecture Notes in Bioinformatics 5265, Springer Berlin Heidelberg, 98-109. [Text]

Yang, P. & Zhang, Z. (2008). A hybrid approach to selecting susceptible single nucleotide polymorphisms for complex disease analysis. In Proceedings of BioMedical Engineering and Informatics (BMEI), IEEE, 214-218. [PDF]


Yang, P. & Zhang, Z. (2007). Hybrid methods to select informative gene sets in microarray data classification. In Proceedings of the 20th Australian Joint Conference on Artificial Intelligence (AI), Lecture Notes in Artificial Intelligence 4830, Springer Berlin Heidelberg, 811-815. [Text]


Level 5 West (5W83), D17
Charles Perkins Centre
School of Mathematics & Statistics
Faculty of Science
The University of Sydney
NSW, 2006

Mobile: +61-452536773
Email: pengyi DOT yang AT sydney DOT edu DOT au