
How to Troubleshoot Microservices on Google Cloud Platform and App Engine

This is the second article in the series dedicated to building microservices on Google Cloud Platform and App Engine (check the intro article here). Now that we know how to build microservices on GCP and App Engine, we will focus on how to inspect them for potential bugs.

This time I will explain how to take advantage of the logging and monitoring features in Google Cloud. Because your service will not always work as expected, Google Cloud offers a powerful and robust solution called Cloud Operations (formerly Stackdriver), which is the basis of our process for troubleshooting microservices.

These are the steps we will be following:

    1. Cloud Operations
    • 1.1. Cloud Logging
    • 1.2. Cloud Monitoring
    • 1.3. Alerts
    • 1.4. Debugger
    • 1.5. Profiler
    2. Practical Example
    • 2.1. Solution Architecture
    • 2.2. Troubleshooting Microservices
    3. Log Points
    4. Snapshots

1. Cloud Operations (Stackdriver)

Cloud Operations is the suite for monitoring all your services on Google Cloud from a single place. It includes a centralized logging platform where all logs are stored at the project level. Stackdriver has a lot of cool features (if you want to know more about it, I recommend you consult the documentation here).

1.1. Cloud Logging

Cloud Logging is the centralized logging solution of Google Cloud that helps you store, search and analyze all of your logs and events, and alert on them.

Regardless of the service you are using, you can also ingest custom logs from any source by taking advantage of the logging agent, which is based on fluentd.

And since App Engine has built-in support for Cloud Logging, all your logs are sent to the cloud automatically, which allows you to trace any request processed by your microservices.
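To make the request tracing a bit more concrete, here is a minimal sketch of how a log line could be tied to the request that produced it. It assumes the service writes JSON lines to stdout and that incoming requests carry the X-Cloud-Trace-Context header in the form TRACE_ID/SPAN_ID;o=OPTIONS. The helper name traceFieldFromHeader, the project ID and the sample header value are made up for illustration.

```javascript
// Build the value for the "logging.googleapis.com/trace" log field from the
// X-Cloud-Trace-Context request header ("TRACE_ID/SPAN_ID;o=1").
// Hypothetical helper: the name and the project ID below are assumptions.
function traceFieldFromHeader(headerValue, projectId) {
  if (!headerValue) return undefined;
  const traceId = headerValue.split('/')[0];
  if (!traceId) return undefined;
  return `projects/${projectId}/traces/${traceId}`;
}

// A log entry that carries the trace of the request being processed.
const entry = {
  severity: 'INFO',
  message: 'Email Service: request received',
  'logging.googleapis.com/trace': traceFieldFromHeader(
    '105445aa7843bc8bf206b12000100000/1;o=1',
    'my-project'
  ),
};
console.log(JSON.stringify(entry));
```

With the trace field in place, the entries written while handling one request can be grouped together when you browse the logs.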

By default, all Stackdriver logs are stored for 30 days, but you can also configure sinks to export them to another storage option like BigQuery or Cloud Storage for longer retention.

As a best practice, you should write all your logs to standard output (stdout), which is supported in all programming languages.
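As a small illustration of logging to stdout, the sketch below writes each entry as a single JSON line. To my understanding, Cloud Logging picks up the severity and message fields of such lines on App Engine, but treat that as something to confirm in the documentation. The helper names (formatLogEntry, logInfo, logError) and the reportId field are hypothetical.

```javascript
// Format one log entry as a single JSON line.
// "severity" and "message" are the fields Cloud Logging is expected to
// recognize; anything in "extra" is kept as additional structured data.
function formatLogEntry(severity, message, extra = {}) {
  return JSON.stringify({ ...extra, severity, message });
}

// Hypothetical convenience wrappers around stdout and stderr.
function logInfo(message, extra) {
  console.log(formatLogEntry('INFO', message, extra));
}

function logError(message, extra) {
  console.error(formatLogEntry('ERROR', message, extra));
}

logInfo('Email Service: Report 42 trying...', { reportId: 42 });
logError('Email Service: Report 42 failure', { reportId: 42 });
```

Keeping the identifier in its own field (reportId here) makes it easier to filter all entries for one notification later on.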

1.2. Cloud Monitoring

Cloud Monitoring collects metrics from all our solutions in real time. Services like Compute Engine send performance insights such as CPU load, network traffic and disk I/O. Cloud Monitoring can also get metrics from other cloud providers like Amazon Web Services (AWS).

It's also good to know that App Engine provides out-of-the-box support for Cloud Monitoring.

1.3. Alerts

So far so good, but what happens if you have a critical microservice for your organization and it just goes down?

With Cloud Operations you can configure alerts and health checks based on different things like log entries or your microservice's availability. Alerts support different notification channels like email, SMS, the GCP console mobile app, Slack, PagerDuty, or webhooks if you need something more customizable (for instance, I built a webhook some weeks ago to integrate Stackdriver alerting with Google Chat).

This way, you will know right away if your application is not behaving as expected.
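As a rough idea of what such a webhook integration involves, here is a sketch of the translation step only: turning an alert payload into the body of a Google Chat message. The incident field names used here (policy_name, state, summary, url) follow my reading of the alerting webhook payload and should be checked against a real notification. The function name toChatMessage and the sample data are invented for this example.

```javascript
// Convert an alerting webhook payload into a simple Google Chat text message.
// Field names under "incident" are assumptions; verify them with a real payload.
function toChatMessage(alert) {
  const incident = (alert && alert.incident) || {};
  const state = (incident.state || 'unknown').toUpperCase();
  const lines = [
    `[${state}] ${incident.policy_name || 'Unnamed policy'}`,
    incident.summary || 'No summary provided.',
  ];
  if (incident.url) lines.push(incident.url);
  // Google Chat incoming webhooks accept a JSON body with a "text" field.
  return { text: lines.join('\n') };
}

// Made-up sample payload, for illustration only.
const sample = {
  incident: {
    policy_name: 'email-service down',
    state: 'open',
    summary: 'Uptime check failed.',
    url: 'https://example.com/incident/1',
  },
};
console.log(toChatMessage(sample).text);
```

The webhook itself would receive the alert over HTTP, call a function like this, and POST the result to the Chat webhook URL.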

1.4. Debugger

Debugger is an amazing feature that helps you (provided some code adjustments) inspect your microservice in real time with very little impact on performance.

With it you can create log points without changing your code, or take a snapshot of your microservice to inspect variable values and the call stack.

From the developer's standpoint this is amazing, because you can debug your service under production traffic with a very low impact on performance.

1.5. Profiler

Profiler is another feature that helps you check your code's performance in real time with low overhead. It shows you which parts of your application or service are consuming the most resources.

2. Practical Example

Following the example from our previous article on building microservices, a sales company requires that each time a new invoice is generated by their ERP system, several teams receive a notification over email, SMS or chat with relevant information.

This was in production for 6 weeks and working fine, but for some reason the notifications stopped arriving yesterday.

Today we’ll troubleshoot the entire solution to find the gaps and then we will configure monitoring and alerting to minimize the impact in the future.

All the code can be found here. To complete this tutorial you need to follow part 1 first.

2.1. Solution Architecture


We have this architecture; now let's add a small bug into our code. In the app.post method definition, let's include the following code on line 24.

throw new Error("Service Timeout...");

The new method will look like this:

app.post('/', async (req, res) => {
    const notification = decodeBase64Json(req.body.message.data); // Pub/Sub push data arrives base64-encoded
    try
    {
        console.log(`Email Service: Report ${notification.id} trying...`);
        sendEmail();
        throw new Error("Service Timeout..."); // injected bug: simulates a timeout
        console.log(`Email Service: Report ${notification.id} success :-)`); // unreachable while the bug is in place
        res.status(204).send();
    }
    catch (ex) {
        console.log(`Email Service: Report ${notification.id} failure: ${ex}`);
        res.status(500).send(); // the message is not acknowledged, so Pub/Sub will retry it
    }
});
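The decodeBase64Json helper comes from part 1 and is not shown here. As a reminder of what it has to do, here is one possible version, assuming the message payload is a base64-encoded JSON string (which is how Pub/Sub push delivers message data). The body of the function is my assumption, not necessarily the code from part 1.

```javascript
// Hypothetical implementation of the decodeBase64Json helper used above:
// decode the base64 string, then parse the resulting JSON text.
function decodeBase64Json(data) {
  return JSON.parse(Buffer.from(data, 'base64').toString('utf8'));
}

// Round trip with a made-up notification.
const encoded = Buffer.from(JSON.stringify({ id: 1001 })).toString('base64');
console.log(decodeBase64Json(encoded).id); // prints 1001
```

Note that a malformed payload would make JSON.parse throw, and in the handler above that call sits outside the try block.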

2.2. Troubleshooting Microservices

For today, the first step is to check whether the messages are being delivered to Cloud Pub/Sub, just to be sure that our architecture is working fine.

So you need to go to Cloud Pub/Sub and open the subscription “email-service-pub”.

As you will notice, the notifications are being received in the topic but they are not being acknowledged by the email-service. This is actually one of the coolest things about Pub/Sub: it keeps the message in the queue until the subscriber processes it successfully, and you can also configure the retries and the time between those retries.

See in the screenshot below: the number of undelivered messages is 5 so far and the retry policy is set to 'retry immediately'.

Given that the problem is not in Pub/Sub, let's check our service logs.

We go to Cloud Logging and select our email service resource. Our logs will be there, as App Engine provides out-of-the-box support for Stackdriver.
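The reason those messages stay undelivered ties back to the status code our handler returns. As far as I know, a push subscription treats only a handful of HTTP status codes as an acknowledgement and retries on anything else. The sketch below encodes that rule; the list of codes reflects my reading of the Pub/Sub documentation, and the function name isAcknowledged is made up.

```javascript
// Status codes that a Pub/Sub push subscription is documented to treat as an
// acknowledgement (assumption: verify against the current Pub/Sub docs).
const ACK_STATUS_CODES = new Set([102, 200, 201, 202, 204]);

// Returns true when the response would acknowledge the message,
// false when Pub/Sub would keep it and deliver it again.
function isAcknowledged(statusCode) {
  return ACK_STATUS_CODES.has(statusCode);
}

console.log(isAcknowledged(204)); // true: the success path of our handler
console.log(isAcknowledged(500)); // false: the failure path, so the message is retried
```

This is why the five notifications keep piling up: every attempt ends in the catch block and answers with 500.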

Here we can see that the service is not working well, and the logs are saying that we probably have a timeout issue with the SMTP relay that is sending the emails. But first let's verify our assumption by debugging our service in real time.

For this we should apply some changes to our code.

First, install the client library:

npm install @google-cloud/debug-agent

Then, inside index.js, let's add one more require:

require('@google-cloud/debug-agent').start({ allowExpressions: true });

The allowExpressions: true option is important because it enables us to check the content of our variables in real time.

As for the dependency in package.json, here is the code:

"dependencies": {
    "@google-cloud/debug-agent": "^5.1.3",
    "express": "^4.17.1"
}

Now let’s deploy our service again and check the debugger.

The first step is to connect your source code to the Debugger, which has two amazing tools that will help us during our microservices troubleshooting.

3. Log Points

Log points help us add log entries to our code without actually changing the code and deploying our microservice again. I know that developers will love it, but that doesn’t mean you can skip best practices in your code. The good part is that you can rely on Google Cloud to speed up the investigation of bugs in production.

Let’s add our first log point in our code.

Select the Logpoint tab, then click on line 23 to add a new log point. Just include the message “Hello World!!” and after a few seconds you will be able to see the log entry in the Stackdriver Logging panel at the bottom of the page.

Now let’s include some variables.

Edit the logpoint expression by clicking on the pencil, then add the {notification.id} expression just to log the content of this constant. Now you will be able to see this reflected in Stackdriver Logging too at the bottom.
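To picture what the {notification.id} expression does to the message, here is a purely illustrative sketch of placeholder expansion. It is not the Debugger's implementation, and the function name expandLogpoint is invented. It only shows how a template with an expression in braces turns into the text you see in the logs.

```javascript
// Illustrative only: replace each {path.to.value} placeholder in a template
// with the matching value from "scope". Unknown placeholders are left as-is.
function expandLogpoint(template, scope) {
  return template.replace(/\{([^}]+)\}/g, (match, path) => {
    const value = path
      .trim()
      .split('.')
      .reduce((current, key) => (current == null ? undefined : current[key]), scope);
    return value === undefined ? match : String(value);
  });
}

// Made-up scope standing in for the variables visible on the logpoint's line.
const scope = { notification: { id: 1001 } };
console.log(expandLogpoint('Report id: {notification.id}', scope)); // prints "Report id: 1001"
```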

4. Snapshots

Select the index.js file and then create a snapshot to see what is happening in our code by clicking on line 25.

Right after this, you will see a new blue arrow at line 25, and the snapshot tab takes a picture of all the variables in your code in real time without impacting your service's performance.

In the snapshot you can see the value of each variable involved in the execution of this method - including the entire exception.

This is very useful for us in identifying the issue. Our assumption is correct: we have a problem with the SMTP relay, so we can act on it (we advise the team that supports that service, they find the root cause, and after they apply the fix our service is back to normal).

Now that we have identified the issue, let's remove the bug and see how our microservice resumes work with the help of Cloud Pub/Sub.

Now, if we go back to the Pub/Sub subscription panel, we can see that the 5 messages were delivered properly to the email microservice, and the service logs look good too.
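For reference, here is a sketch of how the handler logic could look once the thrown error is removed. It is written as a plain function so it can be tried without Express. It also awaits the send call, so a rejected promise lands in the catch block; this assumes sendEmail returns a promise, which the original code does not confirm. The names handlePush and the stub sendEmail passed in below are hypothetical.

```javascript
// Stand-in for the helper from part 1 (assumed implementation).
function decodeBase64Json(data) {
  return JSON.parse(Buffer.from(data, 'base64').toString('utf8'));
}

// Handler logic without the injected bug. Returns the HTTP status to send:
// 204 acknowledges the message, 500 asks Pub/Sub to deliver it again.
async function handlePush(body, sendEmail) {
  const notification = decodeBase64Json(body.message.data);
  try {
    console.log(`Email Service: Report ${notification.id} trying...`);
    await sendEmail(notification); // assumption: sendEmail is asynchronous
    console.log(`Email Service: Report ${notification.id} success :-)`);
    return 204;
  } catch (ex) {
    console.log(`Email Service: Report ${notification.id} failure: ${ex}`);
    return 500;
  }
}

// Try it with a made-up message and a stub that always succeeds.
const body = {
  message: { data: Buffer.from(JSON.stringify({ id: 7 })).toString('base64') },
};
handlePush(body, async () => {}).then((status) => console.log(status));
```

In the Express route, the same function could be wired in with something like res.status(await handlePush(req.body, sendEmail)).send().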

Conclusion

The whole Cloud Operations suite of GCP can be key to troubleshooting microservices and doing root cause analysis, so I hope that you found this tutorial useful for your current and future projects.




by Pablo Portillo

Google Cloud Professional Certified Architect and Solutions Architect with more than 6 years of experience with cloud technologies. Pablo worked with companies in Aeronautics, Geolocation, or BPOs.
  • Santa Ana, El Salvador
