Confluence Knowledge Base Analytics Pipeline
Context
Managing a large-scale Confluence knowledge base of 10,000+ pages across multiple spaces, with inadequate visibility into content quality, usage patterns, and workflow status.
Challenge
- Confluence’s native analytics were too limited to be useful: Out-of-the-box reporting provided only basic page views and couldn’t track content quality metrics, workflow states, or structural relationships
- Previous internal attempts had failed: Earlier projects couldn’t successfully extract Comala workflow status, a critical metric for tracking content approval and governance
- Manual audits took days to complete
- No systematic way to track knowledge base health and growth over time
- Inability to identify outdated content, orphaned pages, or workflow bottlenecks
- No data-driven insights for content governance decisions
Solution
Built an automated Python pipeline that extracts, enriches, and tracks comprehensive metadata from Confluence spaces on a weekly schedule. The system:
- Extracts 20+ metrics per page: content statistics (word count, attachments, tables), user activity (edit count, contributors, last editor), structural data (depth, child pages), and workflow status (an example row is sketched just after this list)
- Successfully integrates Comala workflow data: Solved the technical challenge that blocked previous attempts by correctly querying both status and parameters endpoints
- Implements versioning: Creates weekly snapshots with automatic archiving to track changes over time
- Handles scale: Processes thousands of pages with pagination, error handling, and progress tracking
- Maintains history: Appends to historical dataset while managing storage with configurable retention policies
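To make the per-page metrics concrete, a single output row might look roughly like the dictionary below; every field name and value here is illustrative rather than the exact production schema.

# Hypothetical shape of one enriched page record written to the weekly CSV;
# field names and values are examples only, not the actual output schema.
example_row = {
    "page_id": "123456",              # Confluence page ID
    "title": "VPN Setup Guide",
    "space_key": "ITKB",
    "word_count": 842,                # content statistics
    "attachment_count": 3,
    "table_count": 2,
    "edit_count": 17,                 # user activity
    "contributor_count": 4,
    "last_editor": "jdoe",
    "depth": 3,                       # structural data
    "child_page_count": 5,
    "workflow_state": "Approved",     # Comala workflow status
}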
Impact
- Reduced audit time from days to minutes: Weekly automated snapshots replace manual reviews
- Unlocked workflow governance: First system to successfully track Comala approval states at scale, enabling identification of bottlenecks and stalled content
- Enabled data-driven content governance: Identified 2,000+ orphaned pages and 500+ outdated documents
- Improved visibility: Stakeholders can now track knowledge base growth, content quality trends, and workflow bottlenecks
- Scalable foundation: System now tracks 15,000+ pages across 5 spaces with minimal maintenance
Tech Stack
- Python 3.x – core scripting language for pipeline orchestration and data processing
- Confluence REST API – primary data source for page metadata and content
- Comala Workflow API – workflow status and parameters extraction
- Requests library – HTTP client for API interactions with authentication
- CSV data format – structured output for analysis and reporting
- Datetime & scheduling – weekly automation with ISO week formatting
This combination was chosen to create a lightweight, maintainable solution that could be deployed quickly without additional infrastructure. The pipeline runs on existing systems and outputs to CSV for maximum compatibility with analytics tools like Power BI, Excel, and Tableau.
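To illustrate how these pieces fit together, here is a minimal, self-contained sketch. The environment variable names, snapshot directory, and file naming are assumptions rather than the production configuration, and the REST path may need a /wiki prefix on Confluence Cloud.

import csv
import datetime
import os
import requests
from requests.auth import HTTPBasicAuth

# Credentials and base URL from environment variables (variable names are assumptions).
auth = HTTPBasicAuth(os.environ["CONFLUENCE_USER"], os.environ["CONFLUENCE_API_TOKEN"])
base_url = os.environ["CONFLUENCE_BASE_URL"].rstrip("/")

# ISO week label used to version the weekly snapshot, e.g. "kb_2024-W45.csv".
year, week, _ = datetime.date.today().isocalendar()
snapshot_path = f"snapshots/kb_{year}-W{week:02d}.csv"

# Smoke test: fetch a handful of pages and write their metadata to CSV.
resp = requests.get(f"{base_url}/rest/api/content", params={"type": "page", "limit": 5}, auth=auth, timeout=30)
resp.raise_for_status()
rows = [{"page_id": p["id"], "title": p["title"]} for p in resp.json().get("results", [])]

os.makedirs("snapshots", exist_ok=True)
with open(snapshot_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["page_id", "title"])
    writer.writeheader()
    writer.writerows(rows)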
Skills used
- Python programming
- REST API integration
- Data pipeline design
- Error handling & pagination
- Data enrichment
- Workflow automation
- CSV data processing
- Version control & archiving
Key Technical Decisions
- Solved Comala integration by combining two API endpoints (status + parameters) where previous attempts used only one (a hedged sketch of this merge follows this list)
- Chose weekly snapshots over real-time syncing to balance API load and data freshness
- Implemented automatic archiving to prevent storage bloat while maintaining 12 weeks of rolling history
- Used simplified user activity extraction to avoid expensive version history API calls
- Structured for extensibility: easy to add new metrics or data sources
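A hedged sketch of the two-endpoint merge described above is shown here. The Comala REST paths and response shapes are assumptions that vary by Comala product and version, so the body is illustrative rather than the project’s actual get_comala_status implementation.

import requests

def get_comala_status(page_id, auth, base_url):
    # Combine Comala's status and parameters endpoints into one record.
    # Endpoint paths and JSON shapes below are assumptions; adjust to your Comala version.
    result = {"workflow_state": "", "workflow_parameters": {}}
    # Endpoint 1: current workflow status (approval state) for the page.
    status_resp = requests.get(f"{base_url}/rest/cw/1/content/{page_id}/status", auth=auth, timeout=30)
    if status_resp.ok:
        result["workflow_state"] = status_resp.json().get("state", {}).get("name", "")
    # Endpoint 2: workflow parameters attached to the page (e.g. review owners, due dates).
    params_resp = requests.get(f"{base_url}/rest/cw/1/content/{page_id}/parameters", auth=auth, timeout=30)
    if params_resp.ok:
        payload = params_resp.json()
        if isinstance(payload, list):
            result["workflow_parameters"] = {p.get("name"): p.get("value") for p in payload}
    return result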
Code Overview
# Main workflow: Extract → Transform → Load with versioning
def main():
    # 1. Authenticate and set up the API client
    auth, base_url = setup_auth()
    all_data = []
    # 2. Extract data from all configured spaces
    for space_key in SPACE_KEYS:
        pages = get_all_pages_in_space(space_key, auth, base_url)
        # 3. Enrich each page with additional metrics
        for page in pages:
            child_count = get_child_page_count(page["id"], auth, base_url)
            content_info = get_page_content_info(page["id"], auth, base_url)
            user_activity = get_simplified_user_activity(page, auth, base_url)
            workflow_data = get_comala_status(page["id"], auth, base_url)
            # 4. Transform and structure the data into a flat row
            page_data = extract_required_data(
                page, workflow_data, child_count, content_info, user_activity
            )
            all_data.append(page_data)
    # 5. Load: write to versioned CSV files, then prune old snapshots
    # week_info: ISO year/week label for this snapshot (see "Datetime & scheduling" above)
    write_to_csv(all_data, week_info)
    archive_old_snapshots()
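For reference, a paginated extraction helper along the lines of get_all_pages_in_space could look like the sketch below, written against Confluence’s standard content endpoint; the expand parameters and overall structure are illustrative assumptions, not the production code.

import requests

def get_all_pages_in_space(space_key, auth, base_url, limit=100):
    # Fetch every page in a space, following Confluence's start/limit pagination.
    pages = []
    start = 0
    while True:
        resp = requests.get(
            f"{base_url}/rest/api/content",
            params={
                "spaceKey": space_key,
                "type": "page",
                "start": start,
                "limit": limit,
                "expand": "history,ancestors",  # illustrative expansions
            },
            auth=auth,
            timeout=30,
        )
        resp.raise_for_status()
        results = resp.json().get("results", [])
        pages.extend(results)
        if len(results) < limit:
            break  # last page of results reached
        start += limit
    return pages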
Key Features:
- Pagination handling for large datasets
- Multi-source data enrichment (core API + workflow API + content analysis)
- Graceful error handling with fallback values
- Configurable retention and archiving (sketched below)
- Weekly versioning with ISO week format
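As an illustration of the retention step, here is a short sketch assuming the weekly snapshots live in a snapshots/ directory with ISO week file names and that anything older than the 12-week rolling window is moved into an archive subfolder; the layout and function body are assumptions, not the project’s actual archive_old_snapshots.

import os
import shutil

SNAPSHOT_DIR = "snapshots"                           # hypothetical layout
ARCHIVE_DIR = os.path.join(SNAPSHOT_DIR, "archive")
WEEKS_TO_KEEP = 12                                   # rolling history window

def archive_old_snapshots():
    # Move all but the newest WEEKS_TO_KEEP snapshots into the archive folder.
    if not os.path.isdir(SNAPSHOT_DIR):
        return
    os.makedirs(ARCHIVE_DIR, exist_ok=True)
    snapshots = sorted(
        f for f in os.listdir(SNAPSHOT_DIR)
        if f.startswith("kb_") and f.endswith(".csv")
    )
    for old_file in snapshots[:-WEEKS_TO_KEEP]:
        shutil.move(os.path.join(SNAPSHOT_DIR, old_file), os.path.join(ARCHIVE_DIR, old_file))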
Concluding remarks
This pipeline demonstrates my ability to identify gaps in existing tools, design scalable data solutions, and deliver measurable business impact. The project showcases end-to-end thinking, from problem identification through technical implementation to ongoing maintenance considerations. It’s representative of the type of automation and analytics work I excel at: combining technical execution with strategic business value.