Confluence Intelligent Auto-Labeling System
Context
Managing a large Confluence knowledge base with consistent taxonomy and discoverability requires manual labeling of thousands of pages. Teams follow naming conventions like “Project Name — Page Title” but labels aren’t automatically applied, making content difficult to find and organize.
Challenge
- Manual labeling is unsustainable: With 12,000+ pages and new content added daily, manual tagging is impossible to maintain
- Inconsistent taxonomy: Different team members apply labels differently, creating fragmentation
- Lost hierarchical context: Child pages don’t inherit parent labels, breaking logical groupings
- Poor discoverability: Users can’t filter or find content effectively without proper labels
- No pattern recognition: Structured title formats (e.g., “Engineering — API Documentation”) aren’t leveraged for automatic categorization
Solution
Built an intelligent auto-labeling system that extracts labels from page titles using pattern recognition and propagates them hierarchically through parent-child relationships. The system:
- Extracts labels from title patterns: Recognizes formats like “Project — Title” and automatically creates appropriate labels
- Implements hierarchical inheritance: Child pages automatically inherit all parent labels, maintaining logical taxonomy
- Processes individual words: Breaks multi-word labels into individual searchable terms (e.g., “API Documentation” becomes both “api_documentation” and “api” + “documentation”)
- Filters stop words: Excludes common words (“and”, “the”, “or”) to prevent noise
- Handles special cases: Recognizes patterns like “[Archived]” and converts them to standardized labels
- Avoids duplicates: Checks existing labels before applying new ones to prevent redundancy
- Processes incrementally: Can run daily to label only recent pages without reprocessing the entire space
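The extraction rules above can be sketched roughly as follows. This is a minimal illustration, not the production code: the function name `extract_labels`, the stop-word list beyond the three words named above, the `SPECIAL_MARKERS` mapping, and the exact separator regex are all assumptions made for the example.

```python
import re

# Assumed stop-word list; the write-up names only "and", "the", "or".
STOP_WORDS = {"and", "the", "or"}

# Assumed special-case markers; only "[Archived]" is mentioned in the write-up.
SPECIAL_MARKERS = {"[archived]": "archived"}

# Matches the "Project <dash> Title" separator: an em dash, en dash, or hyphen
# surrounded by whitespace.
SEPARATOR = re.compile(r"\s+[\u2014\u2013-]\s+")


def extract_labels(title: str) -> set[str]:
    """Return the set of labels derived from a single page title."""
    labels: set[str] = set()
    lowered = title.lower()

    # Special cases: turn bracketed markers into standardized labels.
    for marker, label in SPECIAL_MARKERS.items():
        if marker in lowered:
            labels.add(label)
            lowered = lowered.replace(marker, " ")

    # Titles that do not follow the naming convention get no word labels.
    if not SEPARATOR.search(lowered):
        return labels

    # Individual words, minus stop words.
    words = [w for w in re.findall(r"[a-z0-9]+", lowered) if w not in STOP_WORDS]
    labels.update(words)

    # Complete label: all kept words joined with underscores.
    if len(words) > 1:
        labels.add("_".join(words))
    return labels
```

Returning a set means repeated words in a title collapse into a single label, and titles that do not match the convention are left alone rather than guessed at.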
Impact
- Eliminated manual labeling: Automated a process that previously required hours of manual work per week
- Improved discoverability: Users can now filter and search content by project, team, status, and topic
- Consistent taxonomy: Standardized labeling across all pages regardless of author
- Maintained hierarchy: Child pages automatically inherit context from parents, creating logical groupings
- Scalable solution: Runs on a recurring schedule as new content is added, keeping up with organizational growth
Technologies
Python, Confluence REST API, Regular Expressions, Hierarchical Data Structures, Logging
Key Technical Decisions
- Used regex pattern matching on titles rather than NLP to avoid false positives and maintain control
- Implemented hierarchical label propagation to automatically maintain context through page trees
- Chose incremental processing (7-day window) over full-space processing for efficiency
- Built stop word filtering to balance comprehensive labeling with noise reduction
- Designed for idempotency so the script can run repeatedly without creating duplicates
- Used set-based deduplication to handle complex inheritance chains efficiently
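To illustrate the propagation and set-based deduplication decisions, here is a small sketch of how inherited labels could be computed by walking each page's ancestor chain. The function name `propagate_labels` and the two input mappings (`direct_labels` and `parent_of`, keyed by page ID) are hypothetical shapes chosen for the example; the real script may represent the hierarchy differently.

```python
from typing import Optional


def propagate_labels(
    direct_labels: dict[str, set[str]],
    parent_of: dict[str, Optional[str]],
) -> dict[str, set[str]]:
    """Return, for each page, the labels inherited from all of its ancestors."""
    inherited: dict[str, set[str]] = {}
    for page_id in direct_labels:
        labels: set[str] = set()
        seen = {page_id}  # guards against accidental cycles in the input
        parent = parent_of.get(page_id)
        while parent is not None and parent not in seen:
            seen.add(parent)
            # Set union deduplicates labels shared by several ancestors.
            labels |= direct_labels.get(parent, set())
            parent = parent_of.get(parent)
        inherited[page_id] = labels
    return inherited
```

Because both direct and inherited labels are sets, the final label set for a page is simply their union, and a label that appears at several levels of the tree is only counted once.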
Code Overview
# Main workflow: Fetch → Filter → Extract → Inherit → Apply
def main():
    # 1. Fetch all pages from the space
    all_pages = fetch_all_pages(SPACE_KEY)

    # 2. Filter to recent pages (last N days)
    recent_pages = filter_recent_pages(all_pages, DAYS_BACK)

    # 3. Build hierarchical label structure
    #    - Extract direct labels from each page title
    #    - Propagate parent labels down to children
    page_labels = build_page_hierarchy(recent_pages)

    # 4. Apply all labels (direct + inherited) to each page
    for page_id, label_info in page_labels.items():
        add_labels(
            page_id,
            label_info['direct_labels'],
            label_info['inherited_labels']
        )
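The filtering step in the workflow could look something like the sketch below. It is an assumption-laden illustration: the `last_modified` key and its ISO 8601 format (with a UTC offset or trailing "Z") are hypothetical, and the optional `now` parameter is added here only to make the function testable. The actual field name depends on how pages are fetched from Confluence.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def filter_recent_pages(
    pages: list[dict],
    days_back: int,
    now: Optional[datetime] = None,
) -> list[dict]:
    """Keep pages whose last-modified timestamp falls within the window."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days_back)
    recent = []
    for page in pages:
        # Assumed field: a timezone-aware ISO 8601 string under "last_modified".
        stamp = page["last_modified"].replace("Z", "+00:00")
        if datetime.fromisoformat(stamp) >= cutoff:
            recent.append(page)
    return recent
```

With a 7-day window and a daily run, each page is seen on several consecutive runs, which is why the labeling step needs to be safe to repeat.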
Key Features:
- Pattern-based label extraction using regex
- Hierarchical label inheritance through ancestor relationships
- Incremental processing for efficiency
- Duplicate detection and avoidance
- Comprehensive logging for monitoring
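Duplicate detection comes down to a set difference between the labels a page should have and the labels it already has. The helper below is a hypothetical sketch of that check (the name `labels_to_add` and its signature are assumptions); it deliberately leaves out the Confluence API call that would fetch existing labels and post the missing ones.

```python
def labels_to_add(
    existing: set[str],
    direct: set[str],
    inherited: set[str],
) -> list[str]:
    """Return the labels still missing from a page, in a stable sorted order."""
    desired = direct | inherited
    return sorted(desired - existing)
```

A second run over the same page, once the missing labels have been applied, yields an empty list, which is what makes the script safe to run repeatedly.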
Example Label Extraction
Title: Engineering — API Documentation
Extracted labels:
- engineering_api_documentation (complete label)
- engineering (individual word)
- api (individual word)
- documentation (individual word)
Inherited from parent “Engineering”:
- engineering
- tech
- platform
Final labels applied: engineering, api, documentation, tech, platform, engineering_api_documentation
Concluding remarks
This system demonstrates a practical application of pattern recognition and hierarchical data structures to a real organizational problem. The solution balances automation with control, using explicit patterns rather than machine learning to ensure predictable, maintainable results. It showcases my ability to identify inefficiencies, design scalable automation, and implement systems that deliver immediate operational value.