Confluence Intelligent Auto-Labeling System

Context

Managing a large Confluence knowledge base with consistent taxonomy and discoverability requires manual labeling of thousands of pages. Teams follow naming conventions like “Project Name — Page Title” but labels aren’t automatically applied, making content difficult to find and organize.

Challenge

Manual labeling is unsustainable: With 12,000+ pages and new content added daily, manual tagging is impossible to maintain
Inconsistent taxonomy: Different team members apply labels differently, creating fragmentation
Lost hierarchical context: Child pages don’t inherit parent labels, breaking logical groupings
Poor discoverability: Users can’t filter or find content effectively without proper labels
No pattern recognition: Structured title formats (e.g., “Engineering — API Documentation”) aren’t leveraged for automatic categorization

Solution

Built an intelligent auto-labeling system that extracts labels from page titles using pattern recognition and propagates them hierarchically through parent-child relationships. The system:

Extracts labels from title patterns: Recognizes formats like “Project — Title” and automatically creates appropriate labels
Implements hierarchical inheritance: Child pages automatically inherit all parent labels, maintaining logical taxonomy
Processes individual words: Breaks multi-word labels into individual searchable terms (e.g., “API Documentation” becomes both “api_documentation” and “api” + “documentation”)
Filters stop words: Excludes common words (“and”, “the”, “or”) to prevent noise
Handles special cases: Recognizes patterns like “[Archived]” and converts them to standardized labels
Avoids duplicates: Checks existing labels before applying new ones to prevent redundancy
Processes incrementally: Can run daily to label only recent pages without reprocessing the entire space

Impact

Eliminated manual labeling: Automated process that previously required hours of manual work per week
Improved discoverability: Users can now filter and search content by project, team, status, and topic
Consistent taxonomy: Standardized labeling across all pages regardless of author
Maintained hierarchy: Child pages automatically inherit context from parents, creating logical groupings
Scalable solution: Runs continuously as new content is added, keeping up with organizational growth

Technologies

Python, Confluence REST API, Regular Expressions, Hierarchical Data Structures, Logging

Key Technical Decisions

Used regex pattern matching on titles rather than NLP to avoid false positives and maintain control
Implemented hierarchical label propagation to automatically maintain context through page trees
Chose incremental processing (7-day window) over full-space processing for efficiency
Built stop word filtering to balance comprehensive labeling with noise reduction
Designed for idempotency so the script can run repeatedly without creating duplicates
Used set-based deduplication to handle complex inheritance chains efficiently

Code Overview

# Main workflow: Fetch → Filter → Extract → Inherit → Apply

def main():
    # 1. Fetch all pages from the space
    all_pages = fetch_all_pages(SPACE_KEY)
    
    # 2. Filter to recent pages (last N days)
    recent_pages = filter_recent_pages(all_pages, DAYS_BACK)
    
    # 3. Build hierarchical label structure
    # - Extract direct labels from each page title
    # - Propagate parent labels down to children
    page_labels = build_page_hierarchy(recent_pages)
    
    # 4. Apply all labels (direct + inherited) to each page
    for page_id, label_info in page_labels.items():
        add_labels(
            page_id,
            label_info['direct_labels'],
            label_info['inherited_labels']
        )

Key Features:

Pattern-based label extraction using regex
Hierarchical label inheritance through ancestor relationships
Incremental processing for efficiency
Duplicate detection and avoidance
Comprehensive logging for monitoring

View full code on GitHub →

Example Label Extraction

Title: Engineering — API Documentation

Extracted labels:

engineering_api_documentation (complete label)
engineering (individual word)
api (individual word)
documentation (individual word)

Inherited from parent “Engineering”:

engineering
tech
platform

Final labels applied: engineering, api, documentation, tech, platform, engineering_api_documentation

Concluding remarks

This system demonstrates practical application of pattern recognition and hierarchical data structures to solve a real organizational problem. The solution balances automation with control, using explicit patterns rather than machine learning to ensure predictable, maintainable results. It showcases my ability to identify inefficiencies, design scalable automation, and implement systems that deliver immediate operational value.