Chapter 4 · Monitoring

Understand the essentials of monitoring to ensure system reliability and optimal performance.

Monitoring is about gathering the right data to make informed decisions about the software we build and maintain. In this chapter, we cover how to use data and metrics to get a clear picture of how a system is behaving, create dashboards to communicate that picture, and pick the right tools to improve observability. We also discuss being oncall, managing costs, understanding the numbers, measuring changes with A/B tests, handling emergencies, and managing risks.

Chapter Contents

  • 4.1 Data
    • 4.1.1 Collecting Data
    • 4.1.2 Processing Data
    • 4.1.3 Leveraging Data
    • 4.1.4 Reporting Data
    • 4.1.5 Visualizing Data
    • 4.1.6 Mining Data
    • 4.1.7 Securing Data
  • 4.2 Machine Metrics
    • 4.2.1 Using Machine Metrics
      • 4.2.1.1 Troubleshooting
      • 4.2.1.2 Resource Allocation and Capacity Planning
      • 4.2.1.3 Performance Optimization
      • 4.2.1.4 Cost Optimization
    • 4.2.2 Setting Up Machine Metrics
  • 4.3 Dashboarding
    • 4.3.1 Metrics on Dashboard
    • 4.3.2 Structuring Dashboards
  • 4.4 Tooling
    • 4.4.1 Why Tooling
    • 4.4.2 Advanced Tools
    • 4.4.3 Building Your Own Tool
  • 4.5 Oncall
    • 4.5.1 Runbooks
    • 4.5.2 Escalation
    • 4.5.3 Taking Responsibility
    • 4.5.4 Stress in Oncall
  • 4.6 Cost
    • 4.6.1 Running Cost
    • 4.6.2 Cost Cutting
  • 4.7 Metrics
    • 4.7.1 Health Metrics
    • 4.7.2 Decision Making with Metrics
    • 4.7.3 Planning and Experimenting Through Metrics
  • 4.8 A/B Testing
    • 4.8.1 Core Elements of A/B Testing
    • 4.8.2 Maintaining Balance and Preventing Contamination
    • 4.8.3 Collecting Metrics
    • 4.8.4 Running Experiments
    • 4.8.5 Experiment Review
  • 4.9 Incidents
    • 4.9.1 Triaging
    • 4.9.2 Communication
    • 4.9.3 Mitigation
    • 4.9.4 Postmortem
    • 4.9.5 Promoting Learnings
  • 4.10 Documenting Runbooks
    • 4.10.1 Writing Runbooks
    • 4.10.2 Playing Runbooks
  • 4.11 Kill Switches
  • 4.12 Risk Management
    • 4.12.1 Identifying Risks
    • 4.12.2 Prioritizing Risks