Confluence Intelligent Auto-Labeling System

Context

Keeping a large Confluence knowledge base consistently organized and discoverable requires labeling thousands of pages by hand. Teams follow naming conventions like “Project Name — Page Title”, but nothing turns those titles into labels, so content is difficult to find and organize.

Challenge

  • Manual labeling is unsustainable: With 12,000+ pages and new content added daily, manual tagging is impossible to maintain
  • Inconsistent taxonomy: Different team members apply labels differently, creating fragmentation
  • Lost hierarchical context: Child pages don’t inherit parent labels, breaking logical groupings
  • Poor discoverability: Users can’t filter or find content effectively without proper labels
  • No pattern recognition: Structured title formats (e.g., “Engineering — API Documentation”) aren’t leveraged for automatic categorization

Solution

Built an intelligent auto-labeling system that extracts labels from page titles using pattern recognition and propagates them hierarchically through parent-child relationships. The system:

  • Extracts labels from title patterns: Recognizes formats like “Project — Title” and automatically creates appropriate labels
  • Implements hierarchical inheritance: Child pages automatically inherit all parent labels, maintaining logical taxonomy
  • Processes individual words: Breaks multi-word labels into individual searchable terms (e.g., “API Documentation” becomes both “api_documentation” and “api” + “documentation”)
  • Filters stop words: Excludes common words (“and”, “the”, “or”) to prevent noise
  • Handles special cases: Recognizes patterns like “[Archived]” and converts them to standardized labels
  • Avoids duplicates: Checks existing labels before applying new ones to prevent redundancy
  • Processes incrementally: Can run daily to label only recent pages without reprocessing the entire space
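The extraction steps listed above can be sketched roughly as follows. This is a minimal illustration, not the production code: the helper name `extract_labels`, the stop-word list, and the exact regular expressions are assumptions, and the sketch treats each segment on either side of the dash separator as one multi-word label, which is only one reading of the description.

```python
import re

# Assumed values for illustration; the real lists are not given in this write-up.
STOP_WORDS = {"and", "the", "or"}
# Whitespace-delimited em dash, en dash, or hyphen between title segments.
SEPARATOR = re.compile(r"\s+[\u2014\u2013-]\s+")
BRACKET_TAG = re.compile(r"\[([^\]]+)\]")  # e.g. "[Archived]"
WORD = re.compile(r"[a-z0-9]+")


def extract_labels(title: str) -> set[str]:
    """Return the labels implied by a page title (hypothetical helper)."""
    labels: set[str] = set()

    # Special case: bracketed status tags such as "[Archived]" become labels.
    for tag in BRACKET_TAG.findall(title):
        tag_words = WORD.findall(tag.lower())
        if tag_words:
            labels.add("_".join(tag_words))
    remainder = BRACKET_TAG.sub(" ", title).strip()

    # Only titles that follow the structured two-part pattern yield more labels.
    parts = [p for p in SEPARATOR.split(remainder) if p.strip()]
    if len(parts) < 2:
        return labels

    for part in parts:
        words = [w for w in WORD.findall(part.lower()) if w not in STOP_WORDS]
        if not words:
            continue
        labels.add("_".join(words))  # complete label, e.g. "api_documentation"
        labels.update(words)         # individual searchable words
    return labels
```

Under these assumptions, a title such as “[Archived] Engineering — Old API and Notes” yields archived, engineering, old_api_notes, old, api, and notes, while an unstructured title such as “Meeting notes” yields no labels at all.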

Impact

  • Eliminated manual labeling: Automated a task that previously took hours of manual work each week
  • Improved discoverability: Users can now filter and search content by project, team, status, and topic
  • Consistent taxonomy: Standardized labeling across all pages regardless of author
  • Maintained hierarchy: Child pages automatically inherit context from parents, creating logical groupings
  • Scalable solution: Runs on a recurring schedule and processes only recent pages, so it keeps pace with organizational growth

Technologies

Python, Confluence REST API, Regular Expressions, Hierarchical Data Structures, Logging

Key Technical Decisions

  • Used regex pattern matching on titles rather than NLP to avoid false positives and maintain control
  • Implemented hierarchical label propagation to automatically maintain context through page trees
  • Chose incremental processing (7-day window) over full-space processing for efficiency
  • Built stop word filtering to balance comprehensive labeling with noise reduction
  • Designed for idempotency so the script can run repeatedly without creating duplicates
  • Used set-based deduplication to handle complex inheritance chains efficiently
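Two of these decisions, the incremental window and idempotent application, are easy to show in a few lines. The sketch below is illustrative only: the page shape (a dict with a timezone-aware `last_modified` value) and the helper name `labels_to_add` are assumptions, and the real field names depend on how pages are fetched.

```python
from datetime import datetime, timedelta, timezone

DAYS_BACK = 7  # window size named in the decisions above


def filter_recent_pages(pages, days_back=DAYS_BACK, now=None):
    """Keep pages whose last-modified time falls inside the window.

    Assumes each page is a dict with a timezone-aware `last_modified`
    datetime; this shape is hypothetical.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days_back)
    return [p for p in pages if p["last_modified"] >= cutoff]


def labels_to_add(desired, existing):
    """Return only the labels that are not already on the page.

    Because this is a set difference, a second run with the same inputs
    returns an empty set, which is what makes repeated runs safe.
    """
    return set(desired) - set(existing)
```

For instance, `labels_to_add({"api", "tech"}, {"tech"})` returns `{"api"}`, and calling it again once "api" is on the page returns an empty set, so no write is needed.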

Code Overview

# Main workflow: Fetch → Filter → Extract → Inherit → Apply

def main():
    # 1. Fetch all pages from the space
    all_pages = fetch_all_pages(SPACE_KEY)
    
    # 2. Filter to recent pages (last N days)
    recent_pages = filter_recent_pages(all_pages, DAYS_BACK)
    
    # 3. Build hierarchical label structure
    # - Extract direct labels from each page title
    # - Propagate parent labels down to children
    page_labels = build_page_hierarchy(recent_pages)
    
    # 4. Apply all labels (direct + inherited) to each page
    for page_id, label_info in page_labels.items():
        add_labels(
            page_id,
            label_info['direct_labels'],
            label_info['inherited_labels']
        )
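The inheritance step in the workflow above could look something like the sketch below. It is an assumption-laden illustration rather than the actual implementation: pages are taken to be dicts with `id`, `title`, and `ancestors` (a list of ancestor page ids), and the title extractor is passed in as a parameter so the block stands on its own, whereas the overview calls `build_page_hierarchy` with a single argument.

```python
def build_page_hierarchy(pages, extract_labels):
    """Map page id -> {'direct_labels': set, 'inherited_labels': set}.

    Assumed input shape (hypothetical): each page is a dict with 'id',
    'title', and 'ancestors', a list of ancestor page ids.
    `extract_labels` is any callable that turns a title into labels.
    """
    # First pass: labels each page earns from its own title.
    direct = {p["id"]: set(extract_labels(p["title"])) for p in pages}

    # Second pass: union the direct labels of every known ancestor.
    result = {}
    for page in pages:
        inherited = set()
        for ancestor_id in page.get("ancestors", []):
            # Ancestors missing from `pages` contribute nothing here.
            inherited |= direct.get(ancestor_id, set())
        result[page["id"]] = {
            "direct_labels": direct[page["id"]],
            # Drop overlap so a label is never listed twice for one page.
            "inherited_labels": inherited - direct[page["id"]],
        }
    return result
```

Two limitations of this sketch are worth noting: ancestors that are not in the input list pass nothing down, and only title-derived labels propagate, so labels that were added to a parent by hand would need to be fetched separately.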

Key Features:

  • Pattern-based label extraction using regex
  • Hierarchical label inheritance through ancestor relationships
  • Incremental processing for efficiency
  • Duplicate detection and avoidance
  • Comprehensive logging for monitoring

Example Label Extraction

Title: Engineering — API Documentation

Extracted labels:

  • engineering_api_documentation (complete label)
  • engineering (individual word)
  • api (individual word)
  • documentation (individual word)

Inherited from parent “Engineering”:

  • engineering
  • tech
  • platform

Final labels applied: engineering, api, documentation, tech, platform, engineering_api_documentation
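The merge in this example amounts to a set union, which is why “engineering” appears only once even though it is both extracted and inherited. A minimal sketch, with `merge_labels` as a hypothetical helper name:

```python
def merge_labels(direct, inherited):
    """Combine direct and inherited labels, dropping duplicates."""
    return set(direct) | set(inherited)


direct = {"engineering_api_documentation", "engineering", "api", "documentation"}
inherited = {"engineering", "tech", "platform"}

final_labels = merge_labels(direct, inherited)
print(sorted(final_labels))
# ['api', 'documentation', 'engineering', 'engineering_api_documentation', 'platform', 'tech']
```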

Concluding remarks

This system demonstrates a practical application of pattern recognition and hierarchical data structures to a real organizational problem. The solution balances automation with control, using explicit patterns rather than machine learning to keep results predictable and maintainable. It showcases my ability to identify inefficiencies, design scalable automation, and implement systems that deliver immediate operational value.