Are we monitoring our tools?

Background

Why?

Many organizations overlook Key Performance Indicators (KPIs) or performance metrics for the tools they use. These tools may be subscribed to from a third party or built in-house.

For the latter, the developers are in control: they can build a proper monitoring mechanism into the tool during the Software Development Lifecycle (SDLC).

For the former, teams can end up placing the entire onus of monitoring the tool’s performance on the third party. In that scenario, the organization has no trail of why, how, and when a downtime occurred, what happened, or what impact followed.

In tech industries, where tools (internal or third-party) play a major role in the growth journey, it becomes imperative to have an automated way of tracking the KPIs for those tools.

How could monitoring/tracking a tool help?

  1. Enabling data-driven insights and reports on the tools used within an organization, to support technology decisions.
  2. Monitoring multiple tools at a time (during trial versions) for the same job, to compare them before deciding which one to subscribe to. With monitoring, this decision becomes quantifiable through metrics on uptime, scalability, and performance.
  3. Improving transparency about the Return on Investment (ROI) of the tool.
  4. Understanding whether a change to the subscription plan is required, i.e. whether to invest more in a higher plan or less in a lower one.
  5. Checking how well the tools deliver on the agreed Service Level Agreements (SLAs).
  6. Finding the root cause of issues in a tool.
  7. Monitoring proactively in real time to spot errors and service failures before they make an impact.
  8. Identifying the scope for improvement in tools built in-house.

Together, these benefits improve business agility by quantifying the strengths and limitations of each tool in real time.

System Design

How?

For automation, we need a robust infrastructure to help achieve the above goals.

At a technical level, we first need to determine the amount of data and the number of requests we could be handling. Both depend on:

  1. The number of tools (applications or services) that we are planning to track.
  2. The frequency with which we track any tool.
  3. The number of requests that the tool is managing in a given time interval.
  4. The depth of information we’d like to track.

Assumptions

  1. The number of tools that an organization uses and the amount of data that each generates per request/ping is finite and deterministic, so we can scale our infrastructure horizontally by adding more nodes.
  2. All the tools that we are monitoring are internal to the organization, so they are kept within the organization’s Virtual Private Cloud (VPC), protected by AWS WAF (Web Application Firewall). When selecting a third-party tool, we should discuss whether it will provide the information related to the above metrics once monitoring starts. If the tool or application is built in-house, the development team should incorporate this capability into the tool.
  3. The monitoring tool needs to be configurable, so that whenever a new tool or service is added, we can easily update the monitoring system’s configuration to start tracking it.
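
One way to read the third assumption is as a small registry that the monitoring system loads at startup. The following is a minimal sketch in Python, not the article's actual configuration: the field names (`name`, `owner`, `poll_interval_s`, `topics`) and the JSON layout are assumptions made for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolConfig:
    """One monitored tool. All field names here are hypothetical."""
    name: str
    owner: str                      # "in-house" or "third-party"
    poll_interval_s: int = 60       # how often we expect a ping
    topics: tuple = ("All_Pings",)  # topics the tool publishes to


def load_registry(raw_json: str) -> dict:
    """Parse a JSON list of tool entries into a name -> ToolConfig map."""
    registry = {}
    for entry in json.loads(raw_json):
        cfg = ToolConfig(
            name=entry["name"],
            owner=entry["owner"],
            poll_interval_s=int(entry.get("poll_interval_s", 60)),
            topics=tuple(entry.get("topics", ["All_Pings"])),
        )
        if cfg.poll_interval_s <= 0:
            raise ValueError(f"{cfg.name}: poll_interval_s must be positive")
        if cfg.name in registry:
            raise ValueError(f"duplicate tool name: {cfg.name}")
        registry[cfg.name] = cfg
    return registry


# Example configuration with two made-up tools.
EXAMPLE = """
[
  {"name": "ci-server", "owner": "in-house", "poll_interval_s": 30},
  {"name": "ticketing", "owner": "third-party",
   "topics": ["All_Pings", "Errored_Pings"]}
]
"""

registry = load_registry(EXAMPLE)
```

With a layout like this, tracking a new tool would mean appending one entry to the configuration rather than changing code.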

Data Requirement

  1. Assume an organization uses a total of N (= 100) tools, both in-house and external third-party.
  2. Let the action frequency be f, meaning f actions on the tool per second, which generate f log messages. f can be in the range 0–10; assume f = 10 (i.e. the worst-case scenario).
  3. Assume the number of requests that each ping represents, r, is in the range 0–100; take r = 100 for the worst case.
  4. Assume that each ping that we receive from the tool averages 1 KB (each ping is discussed in detail later).

With the above, we may infer that the total data generated per second is:

Total Data generated per second = N * f * r * 1 KB

Based on our assumptions,
= 100 * 10 * 100 * 1 KB
= 100,000 KB
≈ 100 MB

Also, number of requests per second = N * f * r = 100,000 requests

At this scale the load isn’t too high, because we are only talking about the internal tools used by the organization.

If each compute unit can process 5,000 Requests Per Second (RPS), we’d need at most 20 compute units at peak load.
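
The arithmetic above can be written as a small helper, which makes it easy to redo the estimate under different assumptions. This is only a sketch: the function name `estimate_load` and its parameter names are made up here, and 1 MB is taken as 1,000 KB to match the rounded figure above.

```python
import math

KB_PER_MB = 1000  # decimal units, matching the ~100 MB figure above


def estimate_load(n_tools, actions_per_s, requests_per_ping,
                  ping_kb=1.0, unit_rps=5000):
    """Return (requests per second, MB per second, compute units needed)."""
    requests_per_s = n_tools * actions_per_s * requests_per_ping
    mb_per_s = requests_per_s * ping_kb / KB_PER_MB
    # Round up: a partially loaded compute unit still has to exist.
    units = math.ceil(requests_per_s / unit_rps)
    return requests_per_s, mb_per_s, units


# Worst case from the assumptions: N = 100, f = 10, r = 100.
rps, mb_per_s, units = estimate_load(n_tools=100, actions_per_s=10,
                                     requests_per_ping=100)
print(rps, mb_per_s, units)
```

With the worst-case inputs this reproduces the figures above: 100,000 requests per second, about 100 MB per second, and 20 compute units.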

Dissection of a Ping from the Tool

Structure of each ping

Database requirement

We would use Amazon DynamoDB for this purpose, as we don’t expect structured data. Also, not every field may be populated on every ping.
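
To illustrate why a schemaless store suits pings whose fields are not always populated, here is a minimal sketch of a ping record with optional fields. The field names below are hypothetical and are not the article's ping structure; the only point is that unset fields are dropped before the record is stored, so each item carries just what the tool actually sent.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Ping:
    """Hypothetical ping shape; only tool_id and timestamp are required."""
    tool_id: str
    timestamp: str                      # ISO 8601, UTC
    status: Optional[str] = None        # e.g. "ok" or "error"
    request_count: Optional[int] = None
    latency_ms: Optional[float] = None
    error_message: Optional[str] = None


def to_item(ping: Ping) -> dict:
    """Drop unset fields so the stored item holds only what was sent."""
    return {k: v for k, v in asdict(ping).items() if v is not None}


item = to_item(Ping(tool_id="ci-server", timestamp="2022-01-01T00:00:00Z",
                    status="ok", request_count=42))
print(json.dumps(item))
```

Two pings from different tools can therefore end up as items with different attribute sets, which a fixed relational schema would handle less naturally.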

Design

  1. The tools send a ping to Amazon Simple Notification Service (SNS), to one or both of the topics All_Pings and Errored_Pings.
  2. SNS forwards notifications from All_Pings and Errored_Pings to their respective Simple Queue Service (SQS) queues.
  3. We don’t send messages directly from the SNS topics to Lambda, so that messages aren’t lost (reliability). Once the Lambda picks up the messages, it performs 3 simple tasks:
    a. Scrubs bad data
    b. Removes PII, if any
    c. Adds placeholder messages for the tools that did not send any (maybe they are not active anymore), to indicate the same.
  4. For the Lambda that processes all the pings, we use AWS Step Functions to first push the data to Amazon DynamoDB and then to QuickSight, which along with SageMaker can render dashboards with visualizations.
  5. For the Lambda that processes the errored pings, we use AWS Step Functions to send a notification to email and Slack.
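
The three Lambda tasks in step 3 can be sketched as plain Python, without any AWS SDK calls. This is an illustration under stated assumptions, not the article's implementation: it assumes each SQS record body is the JSON envelope SNS delivers (with the published text under `Message`, i.e. raw message delivery is off), that a ping is a JSON object with hypothetical `tool_id` and `timestamp` fields, and that masking email addresses stands in for PII removal. Flagging silent tools per batch is also a simplification; a real system would more likely track this over a time window.

```python
import json
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def parse_record(record):
    """Unwrap one SQS record holding an SNS envelope; None if malformed."""
    try:
        envelope = json.loads(record["body"])
        ping = json.loads(envelope["Message"])
    except (KeyError, TypeError, ValueError):
        return None
    if not isinstance(ping, dict) or "timestamp" not in ping:
        return None
    if not isinstance(ping.get("tool_id"), str):
        return None
    return ping


def redact(ping):
    """Mask email addresses in string fields (a stand-in for PII removal)."""
    return {k: EMAIL_RE.sub("[redacted]", v) if isinstance(v, str) else v
            for k, v in ping.items()}


def handler(event, context=None, expected_tools=frozenset()):
    """Scrub, redact, and flag tools that sent nothing in this batch."""
    pings = []
    for record in event.get("Records", []):
        ping = parse_record(record)
        if ping is not None:            # task a: drop bad data
            pings.append(redact(ping))  # task b: remove PII
    seen = {p["tool_id"] for p in pings}
    for tool_id in sorted(expected_tools - seen):  # task c: silent tools
        pings.append({"tool_id": tool_id, "status": "no_ping"})
    return pings
```

The `expected_tools` argument is a hypothetical addition with a default value, so the function keeps the usual `handler(event, context)` shape that Lambda invokes.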

Metrics Captured

What?

We’d like to show the metrics for each tool related to:

  1. Uptime/downtime (time series)
  2. Requests per tool per unit time (time series)
  3. Number of successful vs. unsuccessful requests
  4. Drill-downs when we click on the unsuccessful requests, to do Root Cause Analysis (RCA) (tabular representation)
  5. The users/apps that use the tool the most
  6. The types of requests the app/user was making
  7. Duration of past downtimes
  8. With SageMaker integration, we can also show some predictions for the usage
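
A few of these metrics can be derived directly from stored pings. The sketch below assumes each ping is a dict with hypothetical `tool_id`, `status` ("ok" or "error"), and optional `user` fields, and it approximates uptime as the share of pings reporting "ok"; a real dashboard would likely compute uptime over time windows instead.

```python
from collections import Counter, defaultdict


def summarize(pings):
    """Per-tool success/failure counts and a simple uptime ratio."""
    counts = defaultdict(Counter)
    for ping in pings:
        counts[ping["tool_id"]][ping["status"]] += 1
    summary = {}
    for tool_id, by_status in counts.items():
        total = sum(by_status.values())
        summary[tool_id] = {
            "successful": by_status["ok"],
            "unsuccessful": total - by_status["ok"],
            "uptime_ratio": by_status["ok"] / total,
        }
    return summary


def top_users(pings, n=3):
    """Most frequent values of the hypothetical 'user' field."""
    return Counter(p["user"] for p in pings if "user" in p).most_common(n)


sample = [
    {"tool_id": "a", "status": "ok", "user": "u1"},
    {"tool_id": "a", "status": "error", "user": "u1"},
    {"tool_id": "a", "status": "ok", "user": "u2"},
    {"tool_id": "b", "status": "ok"},
]
report = summarize(sample)
heaviest = top_users(sample, n=1)
```

Here `report` covers metrics 1 and 3 in a simplified form, and `heaviest` corresponds to metric 5.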

Copyright © 2022 Ankit Sarraf — All rights reserved

Opinions expressed are personal and do not necessarily represent the views or opinions of any organization.
