Confluence Knowledge Base Analytics Pipeline
Context
Managing a large-scale Confluence knowledge base of 10,000+ pages across multiple spaces, with inadequate visibility into content quality, usage patterns, and workflow status.
Challenge
- Confluence’s native analytics were too limited to be useful: Out-of-the-box reporting provided only basic page views and couldn’t track content quality metrics, workflow states, or structural relationships
- Previous internal attempts had failed: Earlier projects couldn’t successfully extract Comala workflow status, a critical metric for tracking content approval and governance
- Manual audits took days to complete
- No systematic way to track knowledge base health and growth over time
- Inability to identify outdated content, orphaned pages, or workflow bottlenecks
- No data-driven insights for content governance decisions
Solution
Built an automated Python pipeline that extracts, enriches, and tracks comprehensive metadata from Confluence spaces on a weekly schedule. The system:
- Extracts 20+ metrics per page: content statistics (word count, attachments, tables), user activity (edit count, contributors, last editor), structural data (depth, child pages), and workflow status (an example row is sketched just after this list)
- Successfully integrates Comala workflow data: Solved the technical challenge that blocked previous attempts by correctly querying both status and parameters endpoints
- Implements versioning: Creates weekly snapshots with automatic archiving to track changes over time
- Handles scale: Processes thousands of pages with pagination, error handling, and progress tracking
- Maintains history: Appends to historical dataset while managing storage with configurable retention policies
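To make the per-page metrics concrete, a single output row might look roughly like the dictionary below; every field name and value here is illustrative rather than the exact production schema.

# Hypothetical shape of one enriched page record written to the weekly CSV;
# field names and values are examples only, not the actual output schema.
example_row = {
    "page_id": "123456",              # Confluence page ID
    "title": "VPN Setup Guide",
    "space_key": "ITKB",
    "word_count": 842,                # content statistics
    "attachment_count": 3,
    "table_count": 2,
    "edit_count": 17,                 # user activity
    "contributor_count": 4,
    "last_editor": "jdoe",
    "depth": 3,                       # structural data
    "child_page_count": 5,
    "workflow_state": "Approved",     # Comala workflow status
}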
Impact
- Reduced audit time from days to minutes: Weekly automated snapshots replace manual reviews
- Unlocked workflow governance: First system to successfully track Comala approval states at scale, enabling identification of bottlenecks and stalled content
- Enabled data-driven content governance: Identified 2,000+ orphaned pages and 500+ outdated documents
- Improved visibility: Stakeholders can now track knowledge base growth, content quality trends, and workflow bottlenecks
- Scalable foundation: System now tracks 15,000+ pages across 5 spaces with minimal maintenance
Tech Stack
- Python 3.x – core scripting language for pipeline orchestration and data processing
- Confluence REST API – primary data source for page metadata and content
- Comala Workflow API – workflow status and parameters extraction
- Requests library – HTTP client for API interactions with authentication
- CSV data format – structured output for analysis and reporting
- Datetime & scheduling – weekly automation with ISO week formatting
This combination was chosen to create a lightweight, maintainable solution that could be deployed quickly without additional infrastructure. The pipeline runs on existing systems and outputs to CSV for maximum compatibility with analytics tools like Power BI, Excel, and Tableau.
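To illustrate how these pieces fit together, here is a minimal, self-contained sketch. The environment variable names, snapshot directory, and file naming are assumptions rather than the production configuration, and the REST path may need a /wiki prefix on Confluence Cloud.

import csv
import datetime
import os
import requests
from requests.auth import HTTPBasicAuth

# Credentials and base URL from environment variables (variable names are assumptions).
auth = HTTPBasicAuth(os.environ["CONFLUENCE_USER"], os.environ["CONFLUENCE_API_TOKEN"])
base_url = os.environ["CONFLUENCE_BASE_URL"].rstrip("/")

# ISO week label used to version the weekly snapshot, e.g. "kb_2024-W45.csv".
year, week, _ = datetime.date.today().isocalendar()
snapshot_path = f"snapshots/kb_{year}-W{week:02d}.csv"

# Smoke test: fetch a handful of pages and write their metadata to CSV.
resp = requests.get(f"{base_url}/rest/api/content", params={"type": "page", "limit": 5}, auth=auth, timeout=30)
resp.raise_for_status()
rows = [{"page_id": p["id"], "title": p["title"]} for p in resp.json().get("results", [])]

os.makedirs("snapshots", exist_ok=True)
with open(snapshot_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["page_id", "title"])
    writer.writeheader()
    writer.writerows(rows)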
Skills used
- Python programming
- REST API integration
- Data pipeline design
- Error handling & pagination
- Data enrichment
- Workflow automation
- CSV data processing
- Version control & archiving
Key Technical Decisions
- Solved Comala integration by combining two API endpoints (status + parameters) where previous attempts used only one (a hedged sketch of this merge follows this list)
- Chose weekly snapshots over real-time syncing to balance API load and data freshness
- Implemented automatic archiving to prevent storage bloat while maintaining 12 weeks of rolling history
- Used simplified user activity extraction to avoid expensive version history API calls
- Structured for extensibility: easy to add new metrics or data sources
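A hedged sketch of the two-endpoint merge described above is shown here. The Comala REST paths and response shapes are assumptions that vary by Comala product and version, so the body is illustrative rather than the project’s actual get_comala_status implementation.

import requests

def get_comala_status(page_id, auth, base_url):
    # Combine Comala's status and parameters endpoints into one record.
    # Endpoint paths and JSON shapes below are assumptions; adjust to your Comala version.
    result = {"workflow_state": "", "workflow_parameters": {}}
    # Endpoint 1: current workflow status (approval state) for the page.
    status_resp = requests.get(f"{base_url}/rest/cw/1/content/{page_id}/status", auth=auth, timeout=30)
    if status_resp.ok:
        result["workflow_state"] = status_resp.json().get("state", {}).get("name", "")
    # Endpoint 2: workflow parameters attached to the page (e.g. review owners, due dates).
    params_resp = requests.get(f"{base_url}/rest/cw/1/content/{page_id}/parameters", auth=auth, timeout=30)
    if params_resp.ok:
        payload = params_resp.json()
        if isinstance(payload, list):
            result["workflow_parameters"] = {p.get("name"): p.get("value") for p in payload}
    return result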
Code Overview
# Main workflow: Extract → Transform → Load with versioning
def main():
    # 1. Authenticate and set up the API client
    auth, base_url = setup_auth()
    all_data = []
    # 2. Extract data from all configured spaces
    for space_key in SPACE_KEYS:
        pages = get_all_pages_in_space(space_key, auth, base_url)
        # 3. Enrich each page with additional metrics
        for page in pages:
            child_count = get_child_page_count(page["id"], auth, base_url)
            content_info = get_page_content_info(page["id"], auth, base_url)
            user_activity = get_simplified_user_activity(page, auth, base_url)
            workflow_data = get_comala_status(page["id"], auth, base_url)
            # 4. Transform and structure the data into a flat row
            page_data = extract_required_data(
                page, workflow_data, child_count, content_info, user_activity
            )
            all_data.append(page_data)
    # 5. Load: write to versioned CSV files, then prune old snapshots
    # week_info: ISO year/week label for this snapshot (see "Datetime & scheduling" above)
    write_to_csv(all_data, week_info)
    archive_old_snapshots()
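For reference, a paginated extraction helper along the lines of get_all_pages_in_space could look like the sketch below, written against Confluence’s standard content endpoint; the expand parameters and overall structure are illustrative assumptions, not the production code.

import requests

def get_all_pages_in_space(space_key, auth, base_url, limit=100):
    # Fetch every page in a space, following Confluence's start/limit pagination.
    pages = []
    start = 0
    while True:
        resp = requests.get(
            f"{base_url}/rest/api/content",
            params={
                "spaceKey": space_key,
                "type": "page",
                "start": start,
                "limit": limit,
                "expand": "history,ancestors",  # illustrative expansions
            },
            auth=auth,
            timeout=30,
        )
        resp.raise_for_status()
        results = resp.json().get("results", [])
        pages.extend(results)
        if len(results) < limit:
            break  # last page of results reached
        start += limit
    return pages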
Key Features:
- Pagination handling for large datasets
- Multi-source data enrichment (core API + workflow API + content analysis)
- Graceful error handling with fallback values
- Configurable retention and archiving (sketched below)
- Weekly versioning with ISO week format
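As an illustration of the retention step, here is a short sketch assuming the weekly snapshots live in a snapshots/ directory with ISO week file names and that anything older than the 12-week rolling window is moved into an archive subfolder; the layout and function body are assumptions, not the project’s actual archive_old_snapshots.

import os
import shutil

SNAPSHOT_DIR = "snapshots"                           # hypothetical layout
ARCHIVE_DIR = os.path.join(SNAPSHOT_DIR, "archive")
WEEKS_TO_KEEP = 12                                   # rolling history window

def archive_old_snapshots():
    # Move all but the newest WEEKS_TO_KEEP snapshots into the archive folder.
    if not os.path.isdir(SNAPSHOT_DIR):
        return
    os.makedirs(ARCHIVE_DIR, exist_ok=True)
    snapshots = sorted(
        f for f in os.listdir(SNAPSHOT_DIR)
        if f.startswith("kb_") and f.endswith(".csv")
    )
    for old_file in snapshots[:-WEEKS_TO_KEEP]:
        shutil.move(os.path.join(SNAPSHOT_DIR, old_file), os.path.join(ARCHIVE_DIR, old_file))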
Concluding remarks
This pipeline demonstrates my ability to identify gaps in existing tools, design scalable data solutions, and deliver measurable business impact. The project showcases end-to-end thinking, from problem identification through technical implementation to ongoing maintenance considerations. It’s representative of the type of automation and analytics work I excel at: combining technical execution with strategic business value.