Machine Learning Transforms Insect Identification

Application

Research by Adamowicz’s team allows environmental managers and conservation scientists to identify insect species more quickly and accurately and to monitor ecosystem health in real time, rather than relying on slower traditional methods. More accurate species identification and biodiversity monitoring, especially for insects and other arthropods, helps policymakers and land managers base decisions about habitat protection and conservation strategies on measurable data.

Challenge

Biodiversity faces threats from human activities such as habitat destruction, pollution, overharvesting, and agricultural practices. Among arthropods (insects, spiders, and related groups), scientists continue to discover new species, changing our understanding of diversity. Together, human pressure and the pace of species discovery outstrip the capacity of traditional methods to classify and monitor biodiversity. There is an urgent need for real-time monitoring and decision-making tools that can accurately assess how human activities affect biodiversity. Machine learning methods (computer systems that learn and improve from experience) can enhance biomonitoring tools, enabling them to process vast amounts of data and generate useful predictions about species and ecosystem status.

Did You Know?

DNA barcoding uses genetic markers to identify species accurately, while image-based classification uses visual patterns and machine learning to categorize species quickly. Together, these methods enhance biodiversity monitoring: images allow rapid identification of insects, and DNA analysis confirms species identity for greater accuracy.

Research

Dr. Sarah Adamowicz’s research program developed bioinformatics strategies that combine biodiversity research, machine learning, and data science. Focusing on arthropods, the research used growing arthropod databases, including extensive metabarcoding reference libraries (collections of genetic information used to identify species), to train advanced machine learning models. The project developed and validated new models that can quickly and accurately predict ecosystem services, health indicators, and biodiversity status. The team also improved image-based classification for insects, where accurate classification at detailed taxonomic levels has historically been difficult, and successfully used bigram-based classification (analyzing pairs of adjacent characters in text) to improve the recognition of species names.
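To make the idea of analyzing character pairs more concrete, the following is a minimal sketch in Python. It is not the team's method: the scoring rule, the function names, and the short list of example names are all assumptions chosen for illustration. It builds a profile of character pairs from a few known species names and then measures how much of a candidate string is made of pairs found in that profile.

```python
from collections import Counter


def char_bigrams(text: str) -> list[str]:
    """Return overlapping pairs of adjacent characters, lowercased."""
    s = text.lower()
    return [s[i:i + 2] for i in range(len(s) - 1)]


def build_profile(names: list[str]) -> Counter:
    """Count every character pair that occurs in the known names."""
    profile = Counter()
    for name in names:
        profile.update(char_bigrams(name))
    return profile


def bigram_score(candidate: str, profile: Counter) -> float:
    """Fraction of the candidate's character pairs that appear in the profile."""
    grams = char_bigrams(candidate)
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in profile) / len(grams)


# Hypothetical training list; a real system would use a large reference list.
KNOWN_NAMES = [
    "Apis mellifera",
    "Bombus terrestris",
    "Culicoides stellifer",
    "Drosophila melanogaster",
]

profile = build_profile(KNOWN_NAMES)

# Higher scores mean the string shares more character pairs with known names.
for text in ["Apis mellifera", "Bombus stellifer", "xq"]:
    print(text, round(bigram_score(text, profile), 2))
```

A score like this could feed a simple threshold or a trained classifier. Where to set such a threshold depends on the reference list and the text being searched, so none is suggested here.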

Results

Adamowicz’s team significantly advanced the use of bioinformatic and machine learning strategies to improve biodiversity analysis, particularly for understudied invertebrates. The team developed computational tools including image-based classifiers, text analysis algorithms for species names, and DNA barcode-based deep learning models. For example, BarcodeBERT, a transformer-based model trained on invertebrate DNA barcodes, outperformed other methods in biodiversity classification and could even classify insects using previously unseen images. The project also introduced a new data analysis method, written in the statistical programming language R, to assess genetic differences in flies across Canada and Greenland, revealing how traits such as habitat, larval diet, and geography relate to genetic variation. To address missing trait data, the team developed a strategy that combines random forest models (a type of machine learning) with evolutionary information to fill in gaps accurately.

Impact

Adamowicz’s team combines machine learning with large-scale genetic and trait datasets to improve biodiversity monitoring and support extensive ecological and evolutionary studies, offering scalable bioinformatics systems that can be applied globally. This research improves species identification, especially for insects and other arthropods, enabling better real-time decision-making for biodiversity management. By drawing on expanding species reference libraries and machine learning, the findings deepen our understanding of insect biodiversity and of what insects need to survive. This information helps decision-makers assess the impacts of their practices more precisely and use measurable biodiversity data to support conservation efforts. The team’s research advances bioinformatics approaches that can be replicated elsewhere, contributing to broader efforts to classify species and develop meaningful conservation strategies.

Learn More

Arias, P. M., Sadjadi, N., Safari, M., Gong, Z., Wang, A. T., Lowe, S. C., Haurum, J. B., Zarubiieva, I., Steinke, D., Kari, L., Chang, A., & Taylor, G. (2023). BarcodeBERT: Transformers for biodiversity analysis. arXiv. https://doi.org/10.48550/arXiv.2311.02401

Burgess, P., Betini, G., Cholewka, A., DeWaard, J., DeWaard, S., Griswold, C., Hebert, P., MacDougall, A. S., McCann, K. S., McGroarty, J., Miller, E., Perez, K., Ratnasingham, S., Steinke, D., Wright, E., Zakharov, E., & Fryxell, J. M. (2024). Spatio-temporal determinants of arthropod biodiversity across an agro-ecosystem landscape. Authorea. https://doi.org/10.22541/au.170664765.55770723/v1

Castellanos-Labarcena, J., Steinke, D., & Adamowicz, S. J. (2024). Anomalous latitudinal gradients in parasitoid wasp diversity – Hotspots in regions with larger temperature range. Journal of Animal Ecology. https://doi.org/10.1111/1365-2656.14196

Castellanos-Labarcena, J., Milian-Garcia, Y., Elliott, T. A., Steinke, D., Hanner, R., & Adamowicz, S. J. (2025). Single specimen genome assembly of Culicoides stellifer shows evidence of a non-retroviral endogenous viral element. BMC Genomics. https://doi.org/10.1186/s12864-025-11449-5

Diao, J., Elliott, T. A., Adamowicz, S. J., & Hanner, R. (2025). Long-term impacts of peat-based soil amendments promote ecological recovery in a boreal forest mine site in northern Ontario, Canada. Restoration Ecology. https://doi.org/10.1111/rec.70038

Fernando, M. A. T. M., Fu, J., & Adamowicz, S. J. (2025). Testing phylogenetic placement accuracy of DNA barcode sequences on a fish backbone tree: Implications of backbone tree completeness and species representation. Ecology and Evolution. https://doi.org/10.1002/ece3.70817

Gharaee, Z., Gong, Z., Pellegrino, N., Zarubiieva, I., Haurum, J. B., Lowe, S. C., McKeown, J. T. A., Ho, C., McLeod, J., Wei, Y.-Y. C., Agda, J., Ratnasingham, S., Steinke, D., Chang, A., Taylor, G., & Fieguth, P. (2023). A step towards worldwide biodiversity assessment: The BIOSCAN-1M insect dataset. arXiv. https://doi.org/10.48550/arXiv.2307.10455

Hempel, C. A., Buchner, D., Mack, L., Brasseur, M. V., Tulpan, D., Leese, F., & Steinke, D. (2023). Predicting environmental stressor levels with machine learning: A comparison between amplicon sequencing, metagenomics, and total RNA sequencing based on taxonomically assigned data. Frontiers in Microbiology, 14. https://doi.org/10.3389/fmicb.2023.1217750

Majoros, S. E., Elliott, T. A., & Adamowicz, S. J. (2025). CanFlyet: Habitat zone and diet trait dataset for Diptera species of Canada and Greenland. Biodiversity Data Journal. https://bdj.pensoft.net/article/129610/

Majoros, S. E., Adamowicz, S. J., & Cottenie, K. (2023). Novel pipeline for large-scale comparative population genetics. bioRxiv. https://doi.org/10.1101/2023.01.23.524574

May, J. A., Feng, Z., & Adamowicz, S. J. (2023). A real data-driven simulation strategy to select an imputation method for mixed-type trait data. PLOS Computational Biology. https://doi.org/10.1371/journal.pcbi.1010154

Raffington, J., Steinke, D., & Tulpan, D. (2020). Recognition of arthropod species names using bigram-based classification. Research Square. https://doi.org/10.21203/rs.3.rs-26532/v1