Thread: ELB: 3 minutes delay in detecting instance health
This question is answered.
Replies: 9 - Pages: 1 - Last Post: Dec 4, 2009 4:44 AM by: enetpulse
Posts: 47 - Registered: 8/26/08
ELB: 3 minutes delay in detecting instance health
Posted: Aug 3, 2009 11:24 AM PDT
We have been observing a delay of 3-4 minutes before an elb starts to report instance health correctly. Combined with Auto Scaling, this seems to cause a number of additional timeout errors on the end users' side when an auto scaling group launches instances (which typically happens when a service is under heavy load).
For example, here is what I did to reproduce the behavior.
I have an elb named heavy-app:
AWS> elb-describe-lbs --show-long --headers
LOAD-BALANCER,NAME,DNS-NAME,HEALTH_CHECK,AVAILABILITY-ZONES,INSTANCE-ID,LISTENERS,CREATED_TIME
LOAD-BALANCER,heavy-app,heavy-app-123456789.us-east-1.elb.amazonaws.com,"{interval=15,target=HTTP:8080/?max=100,timeout=10,healthy-threshold=3,unhealthy-threshold=8}",us-east-1c,(nil),"{protocol=HTTP,lb-port=80,instance-port=8080}",2009-08-01T05:29:06Z
and an auto scaling group associated with the heavy-app elb:
AWS> as-describe-auto-scaling-groups --headers
AUTO-SCALING-GROUP GROUP-NAME LAUNCH-CONFIG AVAILABILITY-ZONES LOAD-BALANCERS MIN-SIZE MAX-SIZE DESIRED-CAPACITY
AUTO-SCALING-GROUP heavy-app-c2 heavy-app-config us-east-1c heavy-app 0 0 0
At this stage, there were no instances associated with the elb, as min-size was set to 0.
Now, I let the auto scaling group launch two new instances by changing min-size and max-size at 2009-08-03T16:50:00+0000.
AWS> as-update-auto-scaling-group heavy-app-c2 --max-size 2 --min-size 2
OK-Updated AutoScalingGroup
About 30 seconds later (2009-08-03T16:50:29+0000), 2 new instances were actually launched according to ec2-describe-instances. In this experiment, an instance usually takes 4-5 minutes after launch to become ready to serve.
As soon as I got the instances' public host names, I started to check the following periodically, every 15 seconds or so, from 2009-08-03T16:51:46+0000:
(1) whether elb-describe-instance-health reports the two instances as healthy or not
(2) whether the two instances respond to an HTTP GET request at their port 8080
(3) whether the elb responds to an HTTP GET request at port 80
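Checks (2) and (3) could be scripted along the following lines. This is only a sketch in Python, not the script used in the post: the function names `http_ok` and `poll` are made up for illustration, the URLs you pass in are your own instance and elb host names, and check (1) is left out because it depends on the elb-describe-instance-health command-line tool.

```python
import http.client
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if an HTTP GET to url completes with a 2xx/3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, 4xx/5xx, or a malformed response.
        return False


def poll(urls, rounds, interval=15.0, sleep=time.sleep):
    """Probe every URL once per round; return one {url: bool} dict per round."""
    history = []
    for i in range(rounds):
        history.append({url: http_ok(url) for url in urls})
        if i + 1 < rounds:
            sleep(interval)  # 15 seconds by default, as in the experiment
    return history
```

Calling `poll([...], rounds=40)` with the two instance URLs on port 8080 and the elb URL on port 80 would cover about ten minutes at the default interval.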
As I said before, the two instances are not available during the first 4-5 minutes. Therefore, elb-describe-instance-health should report them as OutOfService during this period. However, what I found is:
(a) Up until 2009-08-03T16:53:07+0000, the two instances were reported as "InService" although neither was reachable at port 8080.
(b) At 2009-08-03T16:53:07+0000, elb-describe-instance-health started to report them as "OutOfService".
(c) One instance finally became ready at port 8080 at 16:55:25, and the other at 16:55:26.
(d) At 2009-08-03T16:55:30, elb-describe-instance-health started to report them as "InService" and I could access the app at the elb port 80.
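The length of the misreporting window follows directly from the timestamps quoted in the post. A small Python sketch of the arithmetic (the variable names are illustrative; the exact moment the instances were registered with the elb is not given, so the launch time is used as an upper bound and the start of polling as a lower bound):

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S"
launched = datetime.strptime("2009-08-03T16:50:29", FMT)         # instances launched
polling_started = datetime.strptime("2009-08-03T16:51:46", FMT)  # first check
out_of_service = datetime.strptime("2009-08-03T16:53:07", FMT)   # event (b)
first_ready = datetime.strptime("2009-08-03T16:55:25", FMT)      # event (c)
in_service = datetime.strptime("2009-08-03T16:55:30", FMT)       # event (d)

# Upper bound: time from launch until the elb first reported OutOfService.
since_launch = (out_of_service - launched).total_seconds()
# Lower bound: the part of that window actually observed by polling.
observed = (out_of_service - polling_started).total_seconds()
# Time from the first instance being ready until InService was reported again.
recovery_lag = (in_service - first_ready).total_seconds()

print(since_launch, observed, recovery_lag)
```

So the incorrect "InService" report was observed for at least 81 seconds and lasted at most 158 seconds counted from launch, while the transition back to "InService" trailed actual readiness by only about 5 seconds.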
In this experiment, there was a FALSE NEGATIVE period of about 3 minutes. Here is what CloudWatch tells me.
AWS> mon-get-stats HealthyHostCount --period 60 --statistics "Average,Maximum,Minimum,Sum" --namespace "AWS/ELB" --dimensions "LoadBalancerName=heavy-app" --headers --start-time 2009-08-03T16:50:00.000Z
Time                 Samples  Average  Sum   Minimum  Maximum  Unit
2009-08-03 16:53:00  2.0      0.5      1.0   0.0      1.0      Count
2009-08-03 16:54:00  8.0      0.0      0.0   0.0      0.0      Count
2009-08-03 16:55:00  8.0      0.375    3.0   0.0      2.0      Count
2009-08-03 16:56:00  8.0      2.0      16.0  2.0      2.0      Count
2009-08-03 16:57:00  8.0      2.0      16.0  2.0      2.0      Count
AWS> mon-get-stats UnHealthyHostCount --period 60 --statistics "Average,Maximum,Minimum,Sum" --namespace "AWS/ELB" --dimensions "LoadBalancerName=heavy-app" --headers --start-time 2009-08-03T16:50:00.000Z
Time                 Samples  Average  Sum   Minimum  Maximum  Unit
2009-08-03 16:53:00  2.0      1.5      3.0   1.0      2.0      Count
2009-08-03 16:54:00  8.0      2.0      16.0  2.0      2.0      Count
2009-08-03 16:55:00  8.0      1.625    13.0  0.0      2.0      Count
2009-08-03 16:56:00  8.0      0.0      0.0   0.0      0.0      Count
2009-08-03 16:57:00  8.0      0.0      0.0   0.0      0.0      Count
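To read the HealthyHostCount output above more mechanically, the rows can be parsed and checked in a few lines of Python. This is a sketch only: the rows are copied from the output above, the helper name `parse_rows` is made up, and it assumes each timestamp marks the start of its 60-second period.

```python
import math
from datetime import datetime

HEALTHY = """\
2009-08-03 16:53:00 2.0 0.5 1.0 0.0 1.0 Count
2009-08-03 16:54:00 8.0 0.0 0.0 0.0 0.0 Count
2009-08-03 16:55:00 8.0 0.375 3.0 0.0 2.0 Count
2009-08-03 16:56:00 8.0 2.0 16.0 2.0 2.0 Count
2009-08-03 16:57:00 8.0 2.0 16.0 2.0 2.0 Count
"""


def parse_rows(text):
    """Turn whitespace-separated mon-get-stats rows into dicts."""
    rows = []
    for line in text.strip().splitlines():
        date, clock, samples, average, total, minimum, maximum, _unit = line.split()
        rows.append({
            "time": datetime.strptime(f"{date} {clock}", "%Y-%m-%d %H:%M:%S"),
            "samples": float(samples),
            "average": float(average),
            "sum": float(total),
            "minimum": float(minimum),
            "maximum": float(maximum),
        })
    return rows


rows = parse_rows(HEALTHY)

# Sanity check: Average * Samples should equal Sum on every row.
consistent = all(math.isclose(r["average"] * r["samples"], r["sum"]) for r in rows)

# Whole minutes between the requested start time and the first datapoint.
query_start = datetime(2009, 8, 3, 16, 50, 0)
minutes_without_data = int((rows[0]["time"] - query_start).total_seconds() // 60)

# First period in which every sample saw both instances healthy.
first_all_healthy = next(r["time"] for r in rows if r["minimum"] == 2.0)

print(consistent, minutes_without_data, first_all_healthy)
```

The three-minute gap before the first datapoint is what the next paragraph interprets as health checks not having run.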
According to the above, no health check was performed during the first 3 minutes after the instances were launched AND associated with the elb.
To me, there are two potential causes. One is that the instance health status at the elb is set to "InService" by default. The other is that an elb does not perform instance health checks for a few minutes after it is notified that a new instance has been added. If either is true, fixing it would resolve the problem.
In the above, I have described the symptoms when the very first instances were added to an elb, for the sake of simplicity. But in practice, things could be even worse. The same thing happens when an auto scaling group scales up. Here is an example. There are two existing, running instances registered with an elb, and two new instances are added to the elb due to heavy load. Neither of the two added instances is ready to serve for the first 5 minutes, but the elb thinks they are available. Then we observe an increase in the number of timeouts at the user's end for a few minutes. I suspect the elb distributes requests to all four instances, including the two new ones that are not yet ready, which might result in the increased timeouts. Increased timeouts due to auto scaling activities would defeat the purpose of auto scaling.
Or did I miss something about using elb+as correctly?
Posts: 26 - Registered: 1/12/09
Re: ELB: 3 minutes delay in detecting instance health
Posted: Aug 3, 2009 2:03 PM PDT
in response to: wizardofcrowds
Helpful
Congratulations on your very thorough and effective use of the AWS toolset to diagnose the issue.
The situation you describe (newly registered instances showing as InService even though they may not yet be running) is a known issue listed in the ELB release notes at:
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2533&categoryID=86
Which states:
"Regardless of the actual health of an instance, the LoadBalancer will assume a newly registered instance is healthy and put it into the InService state. If the instance is unhealthy, it will be transitioned to the OutOfService state on the first health check."
You are right that in general, this creates a window where it is possible for ELB to send traffic to an instance that is not yet ready to process the traffic. There are, however, a few other behaviors of ELB that reduce this risk:
- ELB will not attempt to send any traffic to instances that are not yet in the RUNNING state, even if elb-describe-instance-health reports InService.
- ELB will not send traffic to an instance if it cannot even open a TCP connection to that instance's application port, even if elb-describe-instance-health reports InService.
So there are at least two possible ways to avoid sending traffic to an instance that is not yet ready to handle that traffic, until the underlying ELB problem is fixed:
1) Make sure that your application does not begin accepting connections until it can properly handle traffic over those connections.
2) If, for some reason, your application must start accepting connections before it is able to process requests correctly, then one can tune the ELB health check parameters 'UnhealthyThreshold' and/or 'Interval' to be more aggressive: you would want to tune them so that in the interval between the instance transitioning to 'running' and the application starting to accept connections, enough ELB health checks have failed so that the instance has been marked OutOfService.
Please accept our apologies for this problem. We realize that this impacts the availability of our customers' applications, which we take very seriously.
Best Regards,
Chris
Posts: 47 - Registered: 8/26/08
Re: ELB: 3 minutes delay in detecting instance health
Posted: Aug 3, 2009 2:59 PM PDT
in response to: Christopher@AWS
Thank you Chris, for the known issue page and further tips.
Our instances do not open port 8080 until everything is ready on the instance. Thus, based on the two specs you described, I should say our elb should not have sent requests to the newly added instances until they are ready. In other words, the time window issue I reported here does not explain the increased timeouts during scaling up activities. Since we have no means to look inside the elb, it is difficult to identify these timeout causes on our side. Perhaps, the root cause of this timeout issue might be that a virtual load balancer in an elb does not outperform other software based load balancers (e.g. nginx/haproxy) as some people reported here in this forum.
Thanks again for your reply. And if AWS could post a note whenever elb is improved during the beta period, we would really appreciate it.
Posts: 15 - Registered: 11/20/08
Re: ELB: 3 minutes delay in detecting instance health
Posted: Aug 4, 2009 4:12 PM PDT
in response to: wizardofcrowds
Correct
The reason for the increased timeouts is that when a new instance is added, our current default behavior is to mark the instance as InService and start health checks. Because your UnhealthyThreshold is set to 8 with 15 second intervals and a 10 second timeout, the minimum amount of time before the load balancer would report OutOfService would be 2 minutes and 10 seconds.
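The 2 minutes 10 seconds figure is consistent with eight 15-second intervals plus one 10-second timeout. That decomposition is inferred from the numbers rather than spelled out in the reply, so treat the formula in this Python sketch as an assumption; the function name is made up for illustration.

```python
def seconds_until_out_of_service(interval, timeout, unhealthy_threshold):
    """Assumed model: one interval per failed check, plus one final timeout."""
    return unhealthy_threshold * interval + timeout


# heavy-app settings from the thread: interval=15, timeout=10, unhealthy-threshold=8
current = seconds_until_out_of_service(interval=15, timeout=10, unhealthy_threshold=8)
minutes, seconds = divmod(current, 60)

# Hypothetical tuning: same interval and timeout, threshold lowered to 2.
reduced = seconds_until_out_of_service(interval=15, timeout=10, unhealthy_threshold=2)

print(minutes, seconds, reduced)
```

Under this model, lowering UnhealthyThreshold from 8 to 2 alone would cut the window from 130 seconds to 40 seconds.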
We appreciate your investigation and raising this concern with us. We are working to correct this issue to prevent such timeouts from occurring. In the meantime, you can significantly reduce the period of timeouts by reducing the UnhealthyThreshold and/or interval of your load balancers.
Thanks,
Erik
Posts: 47 - Registered: 8/26/08
Re: ELB: 3 minutes delay in detecting instance health
Posted: Aug 4, 2009 7:17 PM PDT
in response to: Erik@AWS
Erik, many thanks for pointing out what could be done on my side!
Posts: 71 - Registered: 7/17/08
Re: ELB: 3 minutes delay in detecting instance health
Posted: Aug 20, 2009 10:50 AM PDT
in response to: wizardofcrowds
Has this bug been squashed yet?
i.e., when an instance is added to an ELB, is its default state now OutOfService until the instance passes the health check?
I ask as we had an instance accepting traffic for 10+ mins even though apache was clearly returning a 500 error.
The solutions above do not apply as we need to test application functionality by hitting apache.
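The difference between the two defaults can be illustrated with a toy threshold-based state machine in Python. This is a simplified model built only from what is said in this thread (consecutive failures up to UnhealthyThreshold, consecutive successes up to HealthyThreshold); it is not ELB's actual implementation, and the function name `simulate` is made up.

```python
def simulate(check_results, initial_state, healthy_threshold, unhealthy_threshold):
    """Return the reported state after each health check result."""
    state = initial_state
    ok_streak = fail_streak = 0
    timeline = []
    for ok in check_results:
        if ok:
            ok_streak, fail_streak = ok_streak + 1, 0
        else:
            ok_streak, fail_streak = 0, fail_streak + 1
        if state == "InService" and fail_streak >= unhealthy_threshold:
            state = "OutOfService"
        elif state == "OutOfService" and ok_streak >= healthy_threshold:
            state = "InService"
        timeline.append(state)
    return timeline


# An instance that fails its first 10 checks while booting, then passes 5.
checks = [False] * 10 + [True] * 5


def wrongly_in_service(initial_state):
    """Count failed checks after which the instance was still InService."""
    timeline = simulate(checks, initial_state,
                        healthy_threshold=3, unhealthy_threshold=8)
    return sum(1 for ok, s in zip(checks, timeline) if s == "InService" and not ok)


print(wrongly_in_service("InService"), wrongly_in_service("OutOfService"))
```

With the heavy-app thresholds from earlier in the thread (3 healthy, 8 unhealthy), starting in InService leaves the instance reported as healthy across 7 failed checks, while starting in OutOfService never does.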
Posts: 2,112 - Registered: 7/10/08
Re: ELB: 3 minutes delay in detecting instance health
Posted: Aug 20, 2009 1:12 PM PDT
in response to: filife
The solution suggested above is to increase the frequency and decrease the threshold of the health checks. This will have the effect of causing ELB to discover sooner that the instance is not really InService.
Why won't this work for you?
Posts: 71 - Registered: 7/17/08
Re: ELB: 3 minutes delay in detecting instance health
Posted: Aug 20, 2009 3:34 PM PDT
in response to: Shlomo Swidler
Upon provisioning, the system boots up with apache started.
There is no code on the system.
It takes 2-3 mins to fully provision our server (retrieve codebase, run puppet, etc.).
The final step is a wget to the localhost to a specific test URL.
Our healthcheck URL requires a file to exist on the local file system.
The healthcheck will fail until that file is created. That file is created only after the wget retrieves specific content.
We need apache running, but during the entire provisioning process our healthcheck is in a failed state.
Currently I'm running at 10 second intervals and 2 fails to go unhealthy,
so 20 seconds max, I imagine.
It would be nice if it were just disabled right off the bat. F5s, Foundries and Nortel hardware LBs all wait for a successful healthcheck before enabling.
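A health check that fails until provisioning drops a marker file, as described in this post, could look roughly like the following. It is a Python stand-in for illustration only: the post uses apache, and the marker path, port, handler class, and function names here are all hypothetical.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

MARKER = "/var/run/app-provisioned"  # hypothetical path written as the last provisioning step


def health_status(marker_path=MARKER):
    """Return 200 once the marker file exists, 503 while still provisioning."""
    return 200 if os.path.isfile(marker_path) else 503


class HealthHandler(BaseHTTPRequestHandler):
    """Answers every GET with the current health status."""

    def do_GET(self):
        status = health_status()
        self.send_response(status)
        self.end_headers()
        self.wfile.write(b"ok\n" if status == 200 else b"provisioning\n")


def serve(port=8080):
    """Run the health endpoint until interrupted."""
    HTTPServer(("", port), HealthHandler).serve_forever()
```

The web server is up the whole time, so a plain TCP connect succeeds during provisioning; only the HTTP status tells the load balancer that the instance is not ready.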
Posts: 15 - Registered: 11/20/08
Re: ELB: 3 minutes delay in detecting instance health
Posted: Aug 20, 2009 3:55 PM PDT
in response to: filife
We are working on this fix. We appreciate your patience, and thank you for bringing it to our attention.
Posts: 1 - Registered: 10/28/09
Re: ELB: 3 minutes delay in detecting instance health
Posted: Dec 4, 2009 4:44 AM PST
in response to: Erik@AWS
Hi, I have the same issue and this is seemingly still not fixed. What is the ETA on this? Preferably in a way that registers a new server as "OutOfService" by default until the health check has been performed.
Thanks,
Elfar