Mapping the Research Landscape of Semaglutide: A Data-Driven Exploration

Map of centers with scientific and clinical publications for semaglutide: zoom in and out and click on bubbles to see centers.

Semaglutide has gained immense attention in recent years, not just as a treatment for type 2 diabetes but also as a groundbreaking weight-loss drug. This analytics project seeks to answer two key questions: Who are the key research institutions behind semaglutide? Which centers are driving both clinical trials and peer-reviewed research on this medication?

To answer these questions, I combined data from PubMed.gov and ClinicalTrials.gov to identify and geographically map the leading institutions involved in semaglutide-related research.

 


 

Why This Matters

Understanding who’s leading the research on semaglutide can benefit:

 

    • Pharmaceutical companies looking to identify influential clinical sites,

    • Academic institutions exploring collaboration opportunities,

    • Data scientists interested in how public datasets can reveal insights.

 


 

Data Sources Used

The data came from two public biomedical repositories:

 

    1. PubMed – A database of biomedical literature. I extracted metadata on articles mentioning “semaglutide,” including affiliations.

    2. ClinicalTrials.gov – A registry of clinical studies. I used SQL to filter and extract all trials where semaglutide was listed as an intervention.

Both sources provide complementary views: one focused on peer-reviewed academic output, the other on ongoing or completed clinical work.
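
To make the SQL filtering step concrete, here is a minimal sketch using Python's built-in sqlite3 module. The two-table schema (interventions, facilities), the column names, and the placeholder rows are assumptions for illustration only; the real ClinicalTrials.gov download has many more tables and fields, and this is not the project's actual query.

```python
import sqlite3

# Hypothetical, simplified schema loaded into an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE interventions (nct_id TEXT, name TEXT);
    CREATE TABLE facilities (nct_id TEXT, name TEXT, city TEXT, state TEXT);
""")

# Placeholder rows (not real trials) so the query has something to run on.
conn.executemany("INSERT INTO interventions VALUES (?, ?)", [
    ("NCT-A", "Semaglutide"),
    ("NCT-A", "Placebo"),
    ("NCT-B", "Metformin"),
    ("NCT-C", "semaglutide 2.4 mg"),
])
conn.executemany("INSERT INTO facilities VALUES (?, ?, ?, ?)", [
    ("NCT-A", "Site One", "Boston", "MA"),
    ("NCT-B", "Site Two", "Dallas", "TX"),
    ("NCT-C", "Site One", "Boston", "MA"),
    ("NCT-C", "Site Three", "Dallas", "TX"),
])

# Keep only trials that list semaglutide as an intervention (case-insensitive),
# then count distinct trials per facility.
QUERY = """
    SELECT f.name AS facility, COUNT(DISTINCT f.nct_id) AS n_trials
    FROM facilities AS f
    WHERE f.nct_id IN (
        SELECT nct_id FROM interventions
        WHERE LOWER(name) LIKE '%semaglutide%'
    )
    GROUP BY f.name
    ORDER BY n_trials DESC, facility
"""
trial_counts = conn.execute(QUERY).fetchall()
print(trial_counts)
```

Matching on a lowercased substring catches dose-specific labels such as "semaglutide 2.4 mg", at the cost of also matching combination products, so the results would still need a manual check.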

 


 

Step-by-Step Methodology

1. Data Collection

 

    • ClinicalTrials.gov: I loaded the registry's public download files into a database and used SQL to filter for trials where semaglutide was the primary intervention.

    • PubMed: I used a Python-based custom API script (built on the Entrez module from Biopython) to fetch metadata for articles with “semaglutide” in the title or abstract, including author affiliations.
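
The PubMed step can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's script: the helper names (build_term, extract_affiliations, fetch_affiliations) and the retmax default are hypothetical, while Entrez.esearch, Entrez.efetch, and Entrez.read are Biopython's documented calls. The network function imports Biopython lazily so the parsing helper can be tried without it.

```python
def build_term(drug):
    """Build a PubMed query restricted to the title/abstract fields."""
    return f"{drug}[Title/Abstract]"


def extract_affiliations(article):
    """Collect affiliation strings from one parsed PubmedArticle record."""
    authors = (
        article.get("MedlineCitation", {})
        .get("Article", {})
        .get("AuthorList", [])
    )
    affiliations = []
    for author in authors:
        for info in author.get("AffiliationInfo", []):
            text = str(info.get("Affiliation", "")).strip()
            if text:
                affiliations.append(text)
    return affiliations


def fetch_affiliations(drug, email, retmax=100):
    """Search PubMed and return affiliation strings (needs network + biopython)."""
    from Bio import Entrez  # imported here so the helpers above work without it

    Entrez.email = email  # NCBI asks callers to identify themselves
    handle = Entrez.esearch(db="pubmed", term=build_term(drug), retmax=retmax)
    ids = Entrez.read(handle)["IdList"]
    handle.close()
    if not ids:
        return []

    handle = Entrez.efetch(db="pubmed", id=",".join(ids), retmode="xml")
    records = Entrez.read(handle)
    handle.close()

    results = []
    for article in records["PubmedArticle"]:
        results.extend(extract_affiliations(article))
    return results


# Offline check with a hand-built record shaped like Entrez.read() output.
sample = {
    "MedlineCitation": {
        "Article": {
            "AuthorList": [
                {"AffiliationInfo": [{"Affiliation": " Mayo Clinic Rochester "}]},
                {"AffiliationInfo": []},
            ]
        }
    }
}
print(build_term("semaglutide"), extract_affiliations(sample))
```

Authors without affiliation data are skipped silently, which is one reason raw counts from this step undercount some institutions.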

2. Data Cleaning & Standardization

 

    • Organization names were messy and inconsistent (e.g., “Mayo Clinic”, “Mayo Cl”, “Mayo Clinic Rochester”).

    • I grouped similar names with Python scripts, then validated and corrected the groupings manually in Excel.
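
One simple way to group name variants like the Mayo Clinic examples is prefix matching against a hand-curated list of canonical names. The sketch below is an assumption about how such a script could work, not the project's actual cleaning code; the function names and the min_len threshold are hypothetical.

```python
import re


def normalize(name):
    """Lowercase, drop apostrophes, and turn other punctuation into spaces."""
    lowered = re.sub(r"['’]", "", name.lower())
    return re.sub(r"[^a-z0-9]+", " ", lowered).strip()


def canonicalize(name, canonical_names, min_len=6):
    """Return the canonical name a variant belongs to, or None if unmatched.

    A variant matches when it equals the canonical name, extends it
    ("Mayo Clinic Rochester"), or is a truncation of it ("Mayo Cl").
    """
    key = normalize(name)
    for canonical in canonical_names:
        target = normalize(canonical)
        if key == target or key.startswith(target + " "):
            return canonical
        # Only accept truncations that are long enough to be unambiguous.
        if len(key) >= min_len and target.startswith(key):
            return canonical
    return None


CANONICAL = ["Mayo Clinic", "Brigham and Women's Hospital"]
variants = ["Mayo Clinic", "Mayo Cl", "Mayo Clinic Rochester",
            "Brigham and Womens Hospital", "Unknown Institute"]
grouped = {v: canonicalize(v, CANONICAL) for v in variants}
print(grouped)
```

Names that come back as None are the ones worth exporting for manual review, which is where a spreadsheet pass fits in.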

3. Geocoding Organizations

 

    • Once I had a clean list of institutions, I assigned latitude and longitude values to each one by merging the list with geocoded coordinates.

    • This enabled mapping each research hub accurately.
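
The merge can be pictured as a lookup join between per-institution counts and a coordinates table. In this sketch the function name, the counts, and the city-level coordinates are placeholders chosen for illustration; they are not the project's data or results.

```python
def attach_coordinates(counts, coordinates):
    """Join institution counts with a name -> (lat, lon) lookup.

    Returns (located, missing): rows that got coordinates, and the names
    that still need manual geocoding.
    """
    located, missing = [], []
    for name, count in counts.items():
        if name in coordinates:
            lat, lon = coordinates[name]
            located.append({"name": name, "count": count, "lat": lat, "lon": lon})
        else:
            missing.append(name)
    return located, missing


# Placeholder counts and approximate city-level coordinates (illustrative only).
counts = {
    "Brigham and Women's Hospital": 3,
    "Dallas Diabetes Research Center": 2,
    "Institution Without Address": 1,
}
coordinates = {
    "Brigham and Women's Hospital": (42.36, -71.06),    # Boston, MA (approx.)
    "Dallas Diabetes Research Center": (32.78, -96.80),  # Dallas, TX (approx.)
}
located, missing = attach_coordinates(counts, coordinates)
print(located)
print("needs manual geocoding:", missing)
```

Keeping the unmatched names in a separate list makes it easy to see which institutions need an approximate or hand-entered location.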

4. Visualization with Leaflet.js

 

    • Using the Leaflet.js library, I built an interactive map that shows:

       

        • Circles sized by number of publications/trials

        • Popups containing institution names and counts

        • Color-coded markers to differentiate between clinical trials, publications, or both.
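
The marker logic described above can be prepared on the Python side and exported as JSON for the map page to read. This is a sketch under assumptions: the color values, the square-root radius scaling, and the field names are hypothetical choices, and the numbers are placeholders. On the Leaflet side, L.circleMarker accepts a radius (in pixels) and a color option, so each record maps onto one marker.

```python
import json
import math

# Hypothetical color scheme for the three marker categories.
COLORS = {"trials": "#1f77b4", "publications": "#ff7f0e", "both": "#2ca02c"}


def category(n_pubs, n_trials):
    """Classify an institution by the kind of activity it has."""
    if n_pubs > 0 and n_trials > 0:
        return "both"
    return "publications" if n_pubs > 0 else "trials"


def to_marker(inst):
    """Turn one institution record into the fields a circle marker needs."""
    total = inst["n_pubs"] + inst["n_trials"]
    return {
        "name": inst["name"],
        "lat": inst["lat"],
        "lon": inst["lon"],
        # Square-root scaling keeps large centers from swamping the map.
        "radius": round(4 + 3 * math.sqrt(total), 1),
        "color": COLORS[category(inst["n_pubs"], inst["n_trials"])],
        "popup": f"{inst['name']}: {inst['n_pubs']} publications, "
                 f"{inst['n_trials']} trials",
    }


# Placeholder institutions and counts, for illustration only.
institutions = [
    {"name": "Center A", "lat": 42.36, "lon": -71.06, "n_pubs": 9, "n_trials": 16},
    {"name": "Center B", "lat": 32.78, "lon": -96.80, "n_pubs": 0, "n_trials": 4},
]
markers = [to_marker(inst) for inst in institutions]
print(json.dumps(markers, indent=2))
```

Scaling the radius by the square root of the count makes circle area, rather than diameter, roughly proportional to activity, which tends to read more fairly on a map.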

 


 

Key Findings

 

    • The top US locations by publication volume included:

       

        • Boston, MA

        • Dallas, TX

        • Indianapolis, IN

    • The leaders in clinical trial activity were:

       

        • Brigham and Women's Hospital

        • Dallas Diabetes Research Center

        • University of Texas Southwestern Medical Center

    • Notably, Boston and Dallas emerged as geographic hotspots for semaglutide-related research, both academically and clinically.

 


 

Challenges Encountered

 

    • Name Variants: Institutions often appeared under multiple names.

    • Affiliation Extraction: PubMed metadata doesn’t always clearly label institution names, requiring text pattern extraction.

    • Location Matching: Some organizations lacked full address data, requiring approximations or manual geocoding.

 


 

Tech Stack

 

    • Languages: SQL, Python

    • Libraries: pandas, requests, json, biopython, openpyxl

    • Tools: Excel (for grouping and cleaning), Leaflet.js (for visualization)

    • APIs: Entrez API (PubMed), ClinicalTrials.gov (data dumps), OpenCage Geocoder

 


 

Conclusion

This project demonstrated how public data, combined with simple but effective data wrangling and visualization tools, can yield powerful insights into the global research landscape of a single drug.

Semaglutide is at the forefront of metabolic health innovation, and this analysis provides a foundation for identifying key players and potential collaborators in this space.

 


 

What’s Next

 

    • Adding author-level analysis to identify individual key opinion leaders.

    • Mapping timeline trends in publications vs. trial registrations.

    • Expanding this model to other popular GLP-1 receptor agonists.

 


 

Check Out the Code

🧪 GitHub Repository 

Contains:

 

    • All cleaned datasets

    • Jupyter notebooks for analysis

    • Code to generate the map


If you’re working on similar healthtech projects or have ideas to expand this further, feel free to reach out or connect with me!
