wttech / aet

AET - a system that detects visual changes on web sites and performs basic page health checks
https://wttech.blog/open-source/
Apache License 2.0
146 stars 49 forks source link

Error messages are not verbose enough #71

Open jwadolowski opened 7 years ago

jwadolowski commented 7 years ago

Hey,

quite recently I encountered a bunch of AET issues. Unfortunately none of them was easy to troubleshoot and fix, mostly due to not enough feedback on client side. That's rather frustrating experience for someone who considers AET as a platform and just executes a bunch of test suites against predefined endpoint.

I have 3 stories that prove above standpoint.

First story

I have 2 environment in my project. Both of them have dedicated test suites and we used to run them on a quite regular basis. Everything was working fine on Thursday, we did app deployment and rerun AET Jenkins job the next day. It failed with the following error:

[INFO] Scanning for projects...
[INFO]                                                                         
[INFO] ------------------------------------------------------------------------
[INFO] Building XYZ 1.0.0
[INFO] ------------------------------------------------------------------------
[INFO] 
[INFO] --- aet-maven-plugin:2.0.1:run (default-cli) @ aet ---
[INFO] Successfully connected to tcp://10.aaa.bbb.ccc:61616
[INFO] HV000001: Hibernate Validator 4.3.2.Final
[INFO] HV000002: Ignoring XML configuration.
[INFO] CorrelationID: xyz-xyz-uat-suite-1488205130948
[INFO] ********************************************************************************
[INFO] ********************** Job Setup finished at 14:18:52.324.**********************
[INFO] *** Suite is now processed by the system, progress will be available below. ****
[INFO] ********************************************************************************
[INFO] [14:18:52.426]: COLLECTED: [success: 0, total: 49] ::: COMPARED: [success: 0, total: 0]
[WARNING] null
[WARNING] null
[WARNING] null
[WARNING] null
[WARNING] null
[INFO] [14:19:15.441]: COLLECTED: [success: 1, total: 49] ::: COMPARED: [success: 3, failed: 5, total: 8]
[WARNING] null
[WARNING] null
[WARNING] null
[WARNING] null
[WARNING] null
[INFO] [14:19:16.443]: COLLECTED: [success: 2, total: 49] ::: COMPARED: [success: 6, failed: 10, total: 16]
...
[INFO] [14:30:53.475]: COLLECTED: [success: 48, total: 49] ::: COMPARED: [success: 144, failed: 240, total: 384]
[WARNING] null
[WARNING] null
[WARNING] null
[WARNING] null
[WARNING] null
[INFO] Received report message: FinishedSuiteProcessingMessage{correlationId=xyz-xyz-uat-suite-1488205130948, status=OK, errors=[]}
[INFO] Report is available at https://<AETDOMAIN>.com/report.html?company=xyz&project=xyz&correlationId=xyz-xyz-uat-suite-1488205130948
[INFO] Suite processing finished.
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 12:25 min
[INFO] Finished at: 2017-02-27T14:31:15+00:00
[INFO] Final Memory: 11M/129M
[INFO] ------------------------------------------------------------------------

I opened the report, but for all the pages there was a message that said something about Unknown error (w/o any description at all), but apparently all the pages were visited because JS and status code errors were present in the report (all confirmed by log entries on the server).

Thankfully we had an AET engineer on board and he found out something was wrong with data collected in MongoDB. We made a few attempts to take some pattern IDs directly out of the database and injected that into report app URL. Unfortunately we ended up with the following in every single case:

{
  "message": "Unable to get artifact with id : 58ad96527700471ab4ec98f0 form {company='XYZ', project='ABC'}"
}

All in all we decided to remove all the data related to that suite from MongoDB (that wasn't that important for us at that time) and rerun the test - all was fine this time. It's hard to say what exactly went wrong, but based just on client logs and w/o deep understanding of what's going on under the hood it was impossible to track that bug down. There was no information about MongoDB whatsoever in the log.

It's worth to mention that MongoDB itself was 100% healthy and had no downtime or availability issues.

Second story

Recently we made a few changes related the way HTTPS gets terminated and how we pass on HTTP traffic from load balancer to Apache web server. Unfortunately due to webserver misconfiguration we accidentally introduced basic auth dialog on certain paths.

The change was implemented and we run AET as usual, but we ended up with this:

[INFO] Scanning for projects...
[INFO]                                                                         
[INFO] ------------------------------------------------------------------------
[INFO] Building XYZ 1.0.0
[INFO] ------------------------------------------------------------------------
[INFO] 
[INFO] --- aet-maven-plugin:2.0.1:run (default-cli) @ aet ---
[INFO] Successfully connected to tcp://10.xxx.yyy.zzz:61616
[INFO] HV000001: Hibernate Validator 4.3.2.Final
[INFO] HV000002: Ignoring XML configuration.
[INFO] CorrelationID: XYZ-XYZ-uat-suite-1488186346394
[INFO] ********************************************************************************
[INFO] ********************** Job Setup finished at 09:05:47.727.**********************
[INFO] *** Suite is now processed by the system, progress will be available below. ****
[INFO] ********************************************************************************
[INFO] [09:05:47.809]: COLLECTED: [success: 0, total: 49] ::: COMPARED: [success: 0, total: 0]
[WARNING] Report will be generated after timeout - some results might be missing!
[INFO] [09:07:49.136]: COLLECTED: [success: 0, total: 49] ::: COMPARED: [success: 0, total: 0]
[INFO] Received report message: FinishedSuiteProcessingMessage{correlationId=XYZ-XYZ-uat-suite-1488186346394, status=OK, errors=[Failed to generate reports because it took too much time!]}
[INFO] Report is available at https://aet-XYZcom.cognifide.com/report.html?company=XYZ&project=XYZ&correlationId=XYZ-XYZ-uat-suite-1488186346394
[ERROR] Failed to obtain xUnit report from: http://10.254.167.105:8181/xunit?company=XYZ&project=XYZ&correlationId=XYZ-XYZ-uat-suite-1488186346394
[INFO] Suite processing finished.
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 04:04 min
[INFO] Finished at: 2017-02-27T09:09:50+00:00
[INFO] Final Memory: 14M/135M
[INFO] ------------------------------------------------------------------------

As you can see there was information only about some timeout. Based on the output itself it's hard to say where exactly that timeout happened and what might be the root cause. As in previous story there were Unknown errors on every single page. I wish there's a message about basic auth dialog somewhere to speed up troubleshooting.

There was also a similar story with the identical effect - basic auth credentials were changed, but no one updated Authorization header in the XML file, so the entire test failed. Again - no information about 401 or any other indication that something's wrong with the request itself.

Third story

By default Apache Karaf comes with 5 collector workers. We stumbled upon a situation where we had to decrease that value.

The background story behind that was, as usual, triggered by failed Maven AET build:

[INFO] ------------------------------------------------------------------------
[INFO] Building XYZ 1.0.0
[INFO] ------------------------------------------------------------------------
[INFO] 
[INFO] --- aet-maven-plugin:2.0.1:run (default-cli) @ aet ---
[INFO] Successfully connected to tcp://10.xxx.yyy.zzz:61616
[INFO] HV000001: Hibernate Validator 4.3.2.Final
[INFO] HV000002: Ignoring XML configuration.
[INFO] CorrelationID: XYZ-XYZ-integration-suite-1488200959249
[INFO] ********************************************************************************
[INFO] ********************** Job Setup finished at 13:09:20.591.**********************
[INFO] *** Suite is now processed by the system, progress will be available below. ****
[INFO] ********************************************************************************
[INFO] [13:09:21.022]: COLLECTED: [success: 0, total: 49] ::: COMPARED: [success: 0, total: 0]
[WARNING] Report will be generated after timeout - some results might be missing!
[INFO] [13:11:22.879]: COLLECTED: [success: 0, total: 49] ::: COMPARED: [success: 0, total: 0]
[INFO] Received report message: FinishedSuiteProcessingMessage{correlationId=XYZ-XYZ-integration-suite-1488200959249, status=OK, errors=[Failed to generate reports because it took too much time!]}
[INFO] Report is available at https://aet-XYZcom.cognifide.com/report.html?company=XYZ&project=XYZ&correlationId=XYZ-XYZ-integration-suite-1488200959249
[ERROR] Failed to obtain xUnit report from: http://10.xxx.yyy.zzz:8181/xunit?company=XYZ&project=XYZ&correlationId=XYZ-XYZ-integration-suite-1488200959249
[INFO] Suite processing finished.

URL that points to xUnit report returned 500 error code and the following body:

HTTP ERROR 500

Problem accessing /xunit. Reason:

    Server Error
Caused by:

java.lang.NullPointerException
    at com.cognifide.aet.xunit.MetadataToXUnitConverter.convert(MetadataToXUnitConverter.java:64)
    at com.cognifide.aet.rest.XUnitServlet.generateXUnitAndRespondWithIt(XUnitServlet.java:113)
    at com.cognifide.aet.rest.XUnitServlet.process(XUnitServlet.java:86)
    at com.cognifide.aet.rest.BasicDataServlet.doGet(BasicDataServlet.java:79)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:618)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:725)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:652)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:445)
    at org.ops4j.pax.web.service.jetty.internal.HttpServiceServletHandler.doHandle(HttpServiceServletHandler.java:70)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:556)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:227)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1044)
    at org.ops4j.pax.web.service.jetty.internal.HttpServiceContext.doHandle(HttpServiceContext.java:124)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:372)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:189)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:978)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.ops4j.pax.web.service.jetty.internal.JettyServerHandlerCollection.handle(JettyServerHandlerCollection.java:81)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:367)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:486)
    at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
    at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:926)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:988)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
    at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
    at org.eclipse.jetty.server.nio.BlockingChannelConnector$BlockingChannelEndPoint.run(BlockingChannelConnector.java:298)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Thread.java:745)

It didn't say much what's going on. After a session with AET engineer we figured out that the server we run Apache Karaf on wasn't powerful enough to handle 5 Firefox browsers simultaneously, so we decreased that value to 3 and magically it started to work.

As in previous stories there was no indication what's going on and without support from AET engineer it's almost impossible to track the problem down on your own.


As you can see it can be quite tricky to troubleshoot and resolve issues you may encounter during your day-to-day work with AET.

Is it possible to make AET client logs more descriptive to give users some hints what actually went wrong? I believe there's a huge room for improvement in this area.

I'm quite convinced that without help of AET engineers (btw huge thanks for that!) I wouldn't be able to fix those issues.

Thanks, Kuba

tomekmwagner commented 7 years ago

Hi Kuba,

Thank you very much for your plenteous contribution and time to write down these painpoints.

We will analyze these things very carefully and most probably expand our logging system to give more meaningful messages.

Thanks, Tomek

jwadolowski commented 7 years ago

Here's yet another mysterious error

[INFO] Scanning for projects...
[INFO]                                                                         
[INFO] ------------------------------------------------------------------------
[INFO] Building xxxxx 1.0.0
[INFO] ------------------------------------------------------------------------
[INFO] 
[INFO] --- aet-maven-plugin:2.0.1:run (default-cli) @ aet ---
[INFO] Successfully connected to tcp://10.x.y.z:61616
[INFO] HV000001: Hibernate Validator 4.3.2.Final
[INFO] HV000002: Ignoring XML configuration.
[INFO] CorrelationID: xxxxx-xxxxx-uat-suite-1489587812901
[INFO] ********************************************************************************
[INFO] ********************** Job Setup finished at 14:23:34.265.**********************
[INFO] *** Suite is now processed by the system, progress will be available below. ****
[INFO] ********************************************************************************
[INFO] [14:23:34.341]: COLLECTED: [success: 0, total: 60] ::: COMPARED: [success: 0, total: 0]
[WARNING] Report will be generated after timeout - some results might be missing!
[INFO] [14:25:35.524]: COLLECTED: [success: 0, total: 60] ::: COMPARED: [success: 0, total: 0]
[INFO] Received report message: FinishedSuiteProcessingMessage{correlationId=xxxxx-xxxxx-uat-suite-1489587812901, status=OK, errors=[Failed to generate reports because it took too much time!]}
[INFO] Report is available at https://zzzzzzzz.yyyyyyyyyy.com/report.html?company=xxxxx&project=xxxxx&correlationId=xxxxx-xxxxx-uat-suite-1489587812901
[ERROR] Failed to obtain xUnit report from: http://10.x.y.z:8181/xunit?company=xxxxx&project=xxxxx&correlationId=xxxxx-xxxxx-uat-suite-1489587812901
[INFO] Suite processing finished.
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 04:04 min
[INFO] Finished at: 2017-03-15T14:27:36+00:00
[INFO] Final Memory: 14M/137M
[INFO] ------------------------------------------------------------------------

For every single page the following error is displayed:

aet_report_processing_error