zotonic / zotonic

Zotonic - The Erlang Web Framework & CMS
http://zotonic.com/
Apache License 2.0
817 stars 206 forks source link

Zotonic restarting every 5 minutes #546

Closed simonsmithies closed 11 years ago

simonsmithies commented 11 years ago

My zotonic server is restarting every 5 minutes for no reason I can see. Less frequently it will also crash during the restart - based on the logs this happened 9 times between midnight and 0500 this morning.

I am running latest, but have seen similar problems ever since upgrading to the 0.9 release. This continues to occur even with all sites disabled.

Have completely rebuilt the server, this time with Erlang 15b01 built from source; Postgres 9.1.7; ImageMagick 6.6.0-4.

Running on Port 80 via Authbind.

Had none of these problems on 0.8 on a very similar server and configuration.

console.log: http://pastebin.com/TBun9gVh crash.log: http://pastebin.com/5QgmdTxF error.log: http://pastebin.com/3m9uVTVa

mworrell commented 11 years ago

That is strange indeed.

Looks like almost exactly every 5 minutes. I just looked if we have a timer somewhere that waits 5 minutes before doing something, but couldn't find it.

Can you maybe set the debug level of the console to debug? lager:set_loglevel(lager_console_backend, debug).

Then we might see more detail about what is happening during those 5 minutes.

kaos commented 11 years ago

I'll do it.. as I'm currently logged on to simon's server... btw, he's running on master too. If I can't find what's causing this, will try 0.9.x and see if it makes a difference..

mmzeeman commented 11 years ago

Hi,

There is an IDLE_TIMEOUT inside mochiweb which is set to 5 minutes. It is used to time out slow senders. Could this repeated restarts be caused by a slow sender? Probably not as the site supervisor should handle this.

Btw, it also looks like the crash and error logs are overwritten by a heart restart. This erases traces of possible errors.

Maas

That is strange indeed.

Looks like almost exactly every 5 minutes. I just looked if we have a timer somewhere that waits 5 minutes before doing something, but couldn't find it.

Can you maybe set the debug level of the console to debug? lager:set_loglevel(lager_console_backend, debug).

Then we might see more detail about what is happening during those 5 minutes.

— Reply to this email directly or view it on GitHub.

kaos commented 11 years ago

I had a tail -f on it, didn't see anything new...

mworrell commented 11 years ago

That Mochiweb idle timout is consistent with the echild error in the logs. Looked like it was an inet error there.

kaos commented 11 years ago

why would mochiweb timeout?

arjan commented 11 years ago

BTW it should not be too hard to move the old logs into a different place (directory with the current date+time) before starting zotonic..?

mworrell commented 11 years ago

Maas said it was a protection against slow senders. Opening a port, sending a byte, doing nothing etc.

In this way Mochiweb closes the port when such extremely slow senders are active.

mmzeeman commented 11 years ago

Otherwise a lot of slow senders can swamp resources of the machine.

why would mochiweb timeout?

— Reply to this email directly or view it on GitHub.

simonsmithies commented 11 years ago

Running without heart has certainly helped -- the server has not restarted since Andreas made that change.

I then started the six sites on the server, and stopped three of them a minute or two later. In between starting the six sites and stopping 3 of them, five new errors were logged, without a restart - http://pastebin.com/hVjGiM47 - so these are the first since Andreas's restart.

And as I write this, it has now crashed and restarted with the new error below - I spoke too soon!

2013-04-04 21:31:04.155 [error] <0.164.0> gen_server z_sites_dispatcher terminated with reason: no such process or port in call to gen_server:call('z_trans_server$jail', table) 2013-04-04 21:31:04.168 [error] <0.164.0> CRASH REPORT Process z_sites_dispatcher with 0 neighbours crashed with reason: no such process or port in call to gen_server:call('z_trans_server$jail', table) 2013-04-04 21:31:04.169 [error] <0.137.0> Supervisor z_sites_sup had child z_sites_dispatcher started with z_sites_dispatcher:start_link() at <0.164.0> exit with reason no such process or port in call to gen_server:call('z_trans_server$jail', table) in context child_terminated 2013-04-04 21:31:04.170 [error] <0.13221.15> CRASH REPORT Process <0.13221.15> with 0 neighbours crashed with reason: {{noproc,{gen_server,call,['z_trans_server$jail',table]}},{gen_server,call,[z_sites_dispatcher,{dispatch,"jail.co.nz","/",'GET',http}]}} 2013-04-04 21:31:04.173 [info] <0.15333.15>@z_sites_manager:handle_cast:195 Site started: zotonic_status (<0.15344.15>)

Note that jail.co.nz is one of the sites I shut down at 21:23:28.808.

kaos commented 11 years ago

About the failed to depickle messages. Could it be that you had the a page loaded from a different zotonic instance, then updating dns so the next request would end up at another instance with a different key, thus rendering the pickled values unparsable? Just a wild theory..

About heart issues. I'm wondering if there could be some discrepancies in your version of erlang? Or a mix of versions, in case there is an older version laying around somewhere...

simonsmithies commented 11 years ago

Your wild theory on the depickle is good - I did update DNS at around that time and I haven't seen the error again overnight - everything seems stable.

In terms of having two versions of erlang, when I rebuilt the VPS I elected to backup the old files. So everything that was previously there -- including erlang -- is sitting in /old/. But there should be no reference to that stuff in any environment variables, profiles etc ... I can't really see how that would be an issue. Am going to rm -rf /old/ now though and try with the standard start script.

Another possibility on heart: if the system clock changes by more than HEART_BEAT_TIMEOUT, heart will "timeout and try to reboot the system" according to the erlang docs. I'm on a virtuozzo/parallels vps and according to their docs the clock is driven from the hardware node which I take to be the physical machine my instance is running on. Maybe that's changing - will ask my VPS provider about that.

simonsmithies commented 11 years ago

More thoughts re system clock: The times in console.log seem incorrect I started 3 sites on the server one after the other, yet the times logged for those events are 06:53:18.534; 06:53:24.153 and 06:58:27.557 so the last one appears the have happened 5minutes after the first two, yet I started them pretty much together - as fast as you can click the 3 start buttons so maybe something's up with the clock on my VPS and from the erlang docs, a change in the system clock can trigger a heart restart I've noticed before that sometimes console.log will appear to go backwards -- later events have earlier timestamps

kaos commented 11 years ago

Yes indeed. If the system clock is unstable, so will an heart monitored erlang system be: https://github.com/erlang/otp/blob/maint/erts/etc/common/heart.c#L424

simonsmithies commented 11 years ago

On the VPS that has the problem, erlang:now() and erlang:localtime() return different results.

For example:

1> {calendar:now_to_local_time(erlang:now()), erlang:localtime()}.
{{{2013,4,5},{22,35,29}},{{2013,4,5},{22,30,29}}}

Have logged this as an Erlang bug.

Note that although the above example shows a 5 minute difference, the difference varies each time you do the comparison.

Comment from kaos re this:

there could be some timestamp comparison that starts off being 5 minutes into the future, then as time pass, it catches up and triggers a restart... somehow..

mmzeeman commented 11 years ago

Do you call it a lot in your application? If you do, then erlang:now can get ahead of the time of the real world. The reason for this is that now returns a unique timestamp on the vm. It even locks the schedulers to do this. If you don't need this it is better, and faster, to use os:timestamp.

On 5 apr. 2013, at 19:22, Simon Smithies notifications@github.com wrote:

On the VPS that has the problem, erlang:now() and erlang:localtime() return different results.

For example:

1> {calendar:now_to_local_time(erlang:now()), erlang:localtime()}. {{{2013,4,5},{22,35,29}},{{2013,4,5},{22,30,29}}} Have logged this as an Erlang bug.

Note that although the above example shows a 5 minute difference, the difference varies each time you do the comparison.

Comment from kaos re this:

there could be some timestamp comparison that starts off being 5 minutes into the future, then as time pass, it catches up and triggers a restart... somehow..

— Reply to this email directly or view it on GitHub.

simonsmithies commented 11 years ago

I'm just using Zotonic as a CMS, so any calls to erlang:now() would be from core zotonic functions. Nothing specific to my sites.

On Sat, Apr 6, 2013 at 7:50 AM, Maas-Maarten Zeeman < notifications@github.com> wrote:

Do you call it a lot in your application? If you do, then erlang:now can get ahead of the time of the real world. The reason for this is that now returns a unique timestamp on the vm. It even locks the schedulers to do this. If you don't need this it is better, and faster, to use os:timestamp.

On 5 apr. 2013, at 19:22, Simon Smithies notifications@github.com wrote:

On the VPS that has the problem, erlang:now() and erlang:localtime() return different results.

For example:

1> {calendar:now_to_local_time(erlang:now()), erlang:localtime()}. {{{2013,4,5},{22,35,29}},{{2013,4,5},{22,30,29}}} Have logged this as an Erlang bug.

Note that although the above example shows a 5 minute difference, the difference varies each time you do the comparison.

Comment from kaos re this:

there could be some timestamp comparison that starts off being 5 minutes into the future, then as time pass, it catches up and triggers a restart... somehow..

— Reply to this email directly or view it on GitHub.

— Reply to this email directly or view it on GitHubhttps://github.com/zotonic/zotonic/issues/546#issuecomment-15973737 .

simonsmithies commented 11 years ago

Have upgraded Erlang to R15B03-1. I executed this at varous points in the upgrade process:

1> {calendar:now_to_local_time(erlang:now()), erlang:localtime()}.

Found that the times returned were identical at every check, until I ran make on Zotonic. At that point, erlang:now() shifted to erlang:localtime() plus 5 minutes exactly.

So somehow Zotonic make is setting erlang:now() back by 5 minutes.

mworrell commented 11 years ago

That is strange. Especially the exactly 5 minutes. That doesn't make sense to me...

I tried it on my work station (Mac OS X, Core i5):

(zotonic001@Lamma)15> {calendar:now_to_local_time(erlang:now()), erlang:localtime()}.
{{{2013,4,13},{10,44,33}},{{2013,4,13},{10,44,33}}}
(zotonic001@Lamma)16> z:m().
Recompile: priv/sites/maxclass/modules/mod_maxclass_message/support/maxclass_message_resend
Recompile: priv/sites/maxclass/modules/mod_maxclass_membership/models/m_maxclass_membership_notes
ok
(zotonic001@Lamma)17> {calendar:now_to_local_time(erlang:now()), erlang:localtime()}.
{{{2013,4,13},{10,45,4}},{{2013,4,13},{10,45,4}}}

Nothing strange here.

mmzeeman commented 11 years ago

Hi,

It really is strange. Looking into the erlang:now implementation right now. It is way down in the core, with all its scheduler locks and all.

There is some workaround code in place to get the time. The comment says for beta linux 2.4 and 2.5 kernels.

See: erts/emulator/sys/unix/sys_times.c, line 66

It right shifts the time found with 3 bits. It is probably a long shot, but I still wonder… Simon runs Debian 6, which is linux 2.6 if I'm not mistaken. So this workaround should not be in effect right?

That is strange. Especially the exactly 5 minutes. That doesn't make sense to me...

I tried it on my work station (Mac OS X, Core i5):

(zotonic001@Lamma)15> {calendar:now_to_local_time(erlang:now()), erlang:localtime()}. {{{2013,4,13},{10,44,33}},{{2013,4,13},{10,44,33}}} (zotonic001@Lamma)16> z:m(). Recompile: priv/sites/maxclass/modules/mod_maxclass_message/support/maxclass_message_resend Recompile: priv/sites/maxclass/modules/mod_maxclass_membership/models/m_maxclass_membership_notes ok (zotonic001@Lamma)17> {calendar:now_to_local_time(erlang:now()), erlang:localtime()}. {{{2013,4,13},{10,45,4}},{{2013,4,13},{10,45,4}}} Nothing strange here.

— Reply to this email directly or view it on GitHub.

mmzeeman commented 11 years ago

Could you compile and run this? If this returns > 1000, then the workaround will kick in.

On all the unix boxes I have tried this on it returns 100.

--- snip here ---

include

include

int main(int argc, char **argv) { printf("_SC_CLK_TCK: %ld\n", sysconf(_SC_CLK_TCK)); return 0; }

That is strange. Especially the exactly 5 minutes. That doesn't make sense to me...

I tried it on my work station (Mac OS X, Core i5):

(zotonic001@Lamma)15> {calendar:now_to_local_time(erlang:now()), erlang:localtime()}. {{{2013,4,13},{10,44,33}},{{2013,4,13},{10,44,33}}} (zotonic001@Lamma)16> z:m(). Recompile: priv/sites/maxclass/modules/mod_maxclass_message/support/maxclass_message_resend Recompile: priv/sites/maxclass/modules/mod_maxclass_membership/models/m_maxclass_membership_notes ok (zotonic001@Lamma)17> {calendar:now_to_local_time(erlang:now()), erlang:localtime()}. {{{2013,4,13},{10,45,4}},{{2013,4,13},{10,45,4}}} Nothing strange here.

— Reply to this email directly or view it on GitHub.

simonsmithies commented 11 years ago

Returns 100 - _SC_CLK_TCK: 100

mmzeeman commented 11 years ago

Thanks. That rabbit hole is closed.

On 14 apr. 2013, at 00:30, Simon Smithies notifications@github.com wrote:

Returns 100 - _SC_CLK_TCK: 100

— Reply to this email directly or view it on GitHub.

simonsmithies commented 11 years ago

I have copied everything from that VPS to an Amazon EC2 instance - just had to do something to get it working. Is running like a charm. Should've done this weeks ago.

Simon Smithies Sent with Sparrow (http://www.sparrowmailapp.com/?sig)

On Sunday, 14 April 2013 at 7:28 PM, Maas-Maarten Zeeman wrote:

Thanks. That rabbit hole is closed.

On 14 apr. 2013, at 00:30, Simon Smithies notifications@github.com wrote:

Returns 100 -
_SC_CLK_TCK: 100


Reply to this email directly or view it on GitHub.

— Reply to this email directly or view it on GitHub (https://github.com/zotonic/zotonic/issues/546#issuecomment-16347242).

kaos commented 11 years ago

Regarding the time diff after a zotonic build. I noticed that the error went away again after a minute or so..

kaos commented 11 years ago

The system clock is real funky:

32>  {calendar:now_to_local_time(erlang:now()), erlang:localtime(),os:cmd("date")}.
{{{2013,4,16},{2,42,5}},
 {{2013,4,16},{2,42,5}},
 "Tue Apr 16 02:42:05 NZST 2013\n"}
33>  {calendar:now_to_local_time(erlang:now()), erlang:localtime(),os:cmd("date")}.
{{{2013,4,16},{2,47,10}},
 {{2013,4,16},{2,42,10}},
 "Tue Apr 16 02:47:10 NZST 2013\n"}

And after a few minutes, when it skips back, the erl shell borks (stops responding, heart would've killed it), and a new shell reveals:

1> {calendar:now_to_local_time(erlang:now()), erlang:localtime(),os:cmd("date")}.
{{{2013,4,16},{2,44,51}},
 {{2013,4,16},{2,44,51}},
 "Tue Apr 16 02:44:51 NZST 2013\n"}
kaos commented 11 years ago

This error was caused by the system clock jumping 5 minutes back and forth, which Erlang doesn't handle, at all.

simonsmithies commented 11 years ago

I investigated this with the company that provided the VPS: They found a kernel fault in the hardware node that was causing the time to move on every VPS on that node. They moved my instance to another node two days ago and Zotonic has been running perfectly on that ever since.

mworrell commented 11 years ago

It was an interesting quest :)

kaos commented 11 years ago

Wow!

Thanks for the feedback. That was a lessons learned, indeed :)

2013/4/17 Marc Worrell notifications@github.com

It was an interesting quest :)

— Reply to this email directly or view it on GitHubhttps://github.com/zotonic/zotonic/issues/546#issuecomment-16531611 .