Developing is just a small part of an application’s lifecycle. After it goes live, there is maintenance to do. Measuring is critical to know what is going on, to prevent downtime and, when something does go wrong, to understand why.
Adding value from the beginning
There are several metric-gathering tools and stacks, each with advantages and disadvantages. After trying different approaches, we concluded that the key requirement is that producing and consuming metrics should be easy for an application.
For example, we started out exposing metrics with SNMP. Creating and maintaining MIBs was hard, so developers tended to add metrics only on demand. We then switched to the TICK stack (Telegraf, InfluxDB, Chronograf, and Kapacitor), which lets applications produce metrics with minimal effort. This way, every application provides some basic measurements by default, without a single line of code.
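With the TICK stack, applications (or Telegraf on their behalf) ship measurements to InfluxDB using its line protocol. As a minimal sketch of what a produced metric looks like, here is a small formatter in Python; the measurement, tag, and field names are made up for illustration:

```python
import time

def to_line_protocol(measurement, tags, fields, ts=None):
    """Render one point in InfluxDB line protocol:
    measurement,tag=val field=val timestamp(ns)."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    parts = []
    for k, v in sorted(fields.items()):
        if isinstance(v, bool):      # check bool first: bool is an int subclass
            parts.append(f"{k}={str(v).lower()}")
        elif isinstance(v, int):     # integer fields carry an 'i' suffix
            parts.append(f"{k}={v}i")
        elif isinstance(v, str):     # string fields are double-quoted
            parts.append(f'{k}="{v}"')
        else:                        # floats go bare
            parts.append(f"{k}={v}")
    if ts is None:
        ts = int(time.time() * 1e9)  # nanosecond timestamp
    return f"{measurement},{tag_str} {','.join(parts)} {ts}"

# A hypothetical uptime point for a service called "billing":
line = to_line_protocol("uptime", {"service": "billing"}, {"seconds": 14400}, ts=0)
# → uptime,service=billing seconds=14400i 0
```

In practice a client library or Telegraf handles this formatting, but the point is how little an application has to say to produce a usable metric.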
Consuming these measurements, for instance by creating new dashboards, shouldn’t be hard either. We found Grafana to be the perfect tool for this: it is so simple to use that anyone can create a dashboard to visualize data, from hardcore developers wanting to know low-level details to product analysts who have never coded.
Technical metrics are usually the first ones to appear. Our applications produce some technical metrics by default:

- uptime
- memory usage
- processor usage
- garbage collection
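To give an idea of what these defaults look like, here is a rough sketch, using only the Python standard library, of how an application could gather this kind of baseline data itself. In our setup the agent collects most of this automatically; the metric names below are illustrative:

```python
import gc
import os
import time

_START = time.monotonic()  # process start reference for uptime

def default_metrics():
    """Baseline technical metrics an app can expose with no custom code."""
    gen0, gen1, gen2 = gc.get_count()  # pending objects per GC generation
    return {
        "uptime_seconds": time.monotonic() - _START,
        "gc_gen0_objects": gen0,
        "gc_gen1_objects": gen1,
        "gc_gen2_objects": gen2,
        "pid": os.getpid(),            # handy tag for per-process metrics
    }

snapshot = default_metrics()
```

A periodic reporter would flush a snapshot like this every few seconds, which is all it takes for every service to show up on a dashboard from day one.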
Other metrics, like latency, timeouts, errors, and request counts, are gathered as well. These metrics may seem useful only to developers, but that’s not true. The following is a real-life example involving the first thing we measured: uptime. Every application reported its uptime, and we displayed it on a monitor in the office.
Figure 1: Uptime of 4 services
Once, one of the product owners saw the dashboard and said: “Hey! Why is service D showing an uptime of 4 hours if all services started at the same time?”.
We hadn’t noticed any malfunction: the process monitor had detected that the service died and restarted it, so we saw it back online. But the question remained: why did service D die?
As you can see, something as trivial as service uptime can give you helpful information. In fact, it was another metric that gave us a clue about what was going on.
This is where measuring gets interesting and where a lot of value can be added. We can measure how our system is being used and which features are the most or least frequently used. Is our API being used as intended? Do we need to change something to make it more performant? Where are the bottlenecks?
These metrics need some thought. It’s not about recording every API call, but about measuring what’s important for our business. Some endpoints may produce several different metrics, while others may produce none at all.
If our system interacts with other systems, it’s crucial to know the latency and communication errors. How many 2xx/3xx, 4xx, and 5xx responses are our services getting while trying to consume another API?
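A minimal sketch of tallying downstream responses by status class, assuming a counter like this is flushed periodically to the metrics backend (the class name and shape are illustrative, not from any particular library):

```python
from collections import Counter

class StatusClassCounter:
    """Tallies HTTP responses from a downstream API by status class
    (2xx/3xx/4xx/5xx)."""

    def __init__(self):
        self.counts = Counter()

    def record(self, status_code):
        # 204 → "2xx", 404 → "4xx", 503 → "5xx", etc.
        self.counts[f"{status_code // 100}xx"] += 1

    def snapshot(self):
        return dict(self.counts)

# Simulated responses from a downstream API call site:
counter = StatusClassCounter()
for code in (200, 201, 301, 404, 500, 502, 200):
    counter.record(code)
# counter.snapshot() → {"2xx": 3, "3xx": 1, "4xx": 1, "5xx": 2}
```

In a real service you would call `record` in the HTTP client wrapper, tagged with the name of the downstream system, so a spike in 5xx from one dependency stands out immediately.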
Gathering measurements is vital for making decisions. We need metrics from the hosts running our software, from the database, from the network, and from our services themselves. However, raw data is hard to understand, and even a small system may produce thousands of measurements every couple of seconds. What humans do understand well is graphical representation. That’s why a proper visualization tool is vital.
Our choice is Grafana, for being very simple yet very powerful.
Take a look at our Jenkins dashboard in figure 2. It gives a clear overview: we can quickly see job durations and select different nodes, phases, and components. At a glance it tells us:
- If one node is slower than another
- If one component is taking more time to build than another
- If one component is taking more time to build, compared to previous builds.
Figure 2: Jenkins dashboard
Of course, not every measurement needs an associated dashboard. Some metrics are only needed for occasional troubleshooting, while others may only be consumed by alerts.
While having dashboards is very useful, it’s unfeasible for a human to watch dashboards 24/7 in all possible combinations. Besides, some conditions are hard for the human eye to detect. That’s why it’s essential to have a system that consumes our measurements and triggers alerts.
There are several types of alerts, ranging from very simple to complex queries. Some of the most common ones are:
- Deadman: it triggers an alert if we don’t receive any data for some time.
- Threshold: triggers an alert if some measurement is below or above some value. It has some variants:
  - Fixed threshold: just a number. Example: if disk usage is above 90%, it triggers an alert.
  - Comparison to another measurement. Example: if there are more running jobs than available nodes, it triggers an alert.
  - Gaussian threshold. Example: if the duration of a job is more than three standard deviations from previous builds, it triggers an alert. This is particularly powerful because we don’t have to define a fixed threshold; we compare with previous measurements instead. It’s another way of saying “if it’s taking longer than usual.”
- Based on calculations. While some measurements map directly to alerts, others require calculations. Example: we measure uptime and trigger an alert when its derivative is negative (meaning a restart).
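To make these alert types concrete, here is a sketch in plain Python of a deadman check, a Gaussian threshold, and the uptime-derivative restart detector. The window sizes and thresholds are illustrative; in the TICK stack these rules would live in Kapacitor rather than application code:

```python
import statistics

def deadman(last_seen_ts, now, window=60.0):
    """Deadman check: alert when no data has arrived within `window` seconds."""
    return (now - last_seen_ts) > window

def gaussian_threshold(history, value, sigmas=3.0):
    """Alert when `value` is more than `sigmas` standard deviations
    away from the mean of previous measurements."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(value - mean) > sigmas * stdev

def restarted(prev_uptime, curr_uptime):
    """Uptime should only grow; a drop (negative derivative) means a restart."""
    return curr_uptime < prev_uptime

# Previous build durations in seconds (illustrative data):
durations = [60, 62, 58, 61, 59, 60]
slow = gaussian_threshold(durations, 95)   # far above usual → alert
usual = gaussian_threshold(durations, 61)  # within normal range → no alert
dead = deadman(last_seen_ts=0, now=120)    # no data for 2 minutes → alert
bounced = restarted(prev_uptime=4000, curr_uptime=120)  # uptime dropped → alert
```

The Gaussian variant is the one that caught our “longer than usual” builds without anyone ever choosing a magic number.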
Be careful when designing alerts: not everything is critical, and not every alert should wake you up. If a service restarts, the problem should be addressed, but not at 3 in the morning. At some companies, developers answer the alerts for an application’s first months in production; those systems end up with very good alerting rules.
System metrics are useful for many reasons, from troubleshooting and making sure your system works correctly, to high-level analysis and product improvement. They can also serve as input for other systems (like scaling a cluster up or down).
It’s easy to implement and useful from the very beginning. So, if you are not already doing it, what are you waiting for?