Building in logging into an application can sound easy, but once you reach a certain scale it will all become a bit much. Diving through millions of lines in order to understand what is going wrong is pretty painful. To help you with this, I wrote some of my ideas around this issue that will make you want to look at your logs again.
Make sure whatever you put in the log message has a meaning to someone reading it. Seeing errors like
Something went wrong, please try again may be something you want to show your users but it will not help you if you encounter it in your log stream.
Instead explain in short what for example did not succeed. If let’s say you are calling an user service to fetch a users information, your message could show something like this:
Failed to fetch user information from user-service (uid:123).
If you have a case where you are logging an error and the case is known, it will not hurt to provide some sort of reference to some documentation explaining the case.
Set a strategy
Having a good strategy around how and what to log is important. Make sure you understand how to use log levels and make sure all your applications respect these levels.
Although log levels are something that can be interpreted any way you like. A common error is to log in a level that is higher than needed. Most likely, for your specific application a specific error is not a good thing, but make sure it’s actionable before logging it as an ERROR.
Here are some of my log level interpretations/guidelines.
- Gives information about everything that is going on in high detail. Example: Fetching information from the database took 12ms. These messages should not be used in production due to the high volume but could help troubleshoot application flow issues.
- Gives information in more detail about what your application is doing. Example: Calling an external service. These messages will most likely be ignored in a production scenario but may help troubleshooting for fellow colleagues.
- Indicates normal application behaviour where a significant event took place. Example: A new user registration succeeded. These messages will most likely not mean much from an operations perspective but do provide a skeleton of what happened.
- Indicates an handled situation your application is build to handle but is worth logging. Example: If for example your application uploads reports to an s3 bucket every hour but is build to retry uploading the next hour. In this case it's worth logging this as an WARN level as it may help to explain your users why the reports where not available. It may be useful to send out an informational message when these volumes get to high.
- Indicates an significant recoverable error occurred that should be investigated asap. Example: Your application encountered an error that should never have happened but can be investigated the next day. These individual errors should be logged in something like slack and may cause an on-call event when the volume gets high.
- Indicates a critical situation that your application could not recover from and requires immediate investigation. Example: Your application encountered something it could not handle and is about to stop. This information should help understand why it stopped working. These individual errors should result in someone on-call being paged.
With the log levels in mind, you most likely want to run your applications in INFO mode most of the time. If your need to investigate something more deeply, enabling DEBUG for a short while should be fine.
With an increase in traffic comes an increase in logs. This however increases the cost, both in storage price and in reading time. The more logs you have, the harder is becomes to find what you are looking for.
When developing a specific part of your application, keep in mind that your log entries may not cleanly follow each other. If for example two user registrations take place at the same time, these logs will be mixed within each other. Because of this it’s useful to add some sort of context to the message. It may not be clear in all cases what context to provide, in some cases you don’t even have a meaningful identifier. This can be solved by adding a
request id and/or
correlation id to every set of logs.
A request id is a unique identifier that is used for this http request only, any log generated in this request, will carry this information.
A correlation id is unique to the full request and will be provided to other services so they can use it. This helps troubleshoot issues across different services.
Equally important is to make sure you are able to identify the source of your log message when it ends up in a centralized system. Adding the name of the service, host, container, region, … may all help in understanding why something weird happened. Note that most centralized logging systems may already take care of this for you.