1 of 55

Tales of 60+ billion req a day

java optimizations @ pocketmath

2 of 55

About

Liu Dapeng

Java Tech Lead

PocketMath

World's largest, 100% self-serve, mobile demand-side platform (DSP) for programmatic mobile.

3 of 55

What is OpenRTB (Real Time Bidding)

Exchange: "Ad needed. Do you want it?"

Bidder: "Yes! I will pay $0.01."

4 of 55

Challenges

Exchange

Thousands of advertisers

Throughput 65b req/day, 750k req/sec

Latency 100~200 ms, round trip

5 of 55

ELB (Elastic Load Balancer)

Transport Layer LB

AJP

6 of 55

ELB (Elastic Load Balancer)

Transport Layer LB

7 of 55

99th percentile latency

100ms -> 20ms

8 of 55

<Connector port="1443"
           scheme="https"
           secure="true"
           SSLEnabled="true"
           SSLCACertificateFile="ca.crt"
           SSLCertificateFile="domain.crt"
           SSLCertificateKeyFile="domain.key"
           ... />

<Listener className="org.apache.catalina.core.AprLifecycleListener"
          SSLEngine="on" />

<Connector port="8080"
           protocol="HTTP/1.1"
           maxConnections="12000"
           maxThreads="128"
           ... />

9 of 55

Party!

10 of 55

ExecutorService executor = ...;

try {
    future = executor.submit(new Task(...));
    response = future.get(TIMEOUT, MILLISECONDS);
    return response;
} catch (TimeoutException e) {
    return NO_BID;
}
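A minimal runnable sketch of the pattern on this slide, for readers who want to try it. The class name `TimeoutBidSketch`, the `String` responses, the pool size, and the 100 ms budget are all assumptions made for illustration; the slide elides the real types. Note that, as on the slide, nothing cancels the task on timeout, so a slow task keeps holding a worker thread after the caller has already answered with a no-bid.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutBidSketch {
    // Placeholder values; the real NO_BID response and timeout are not given in the slides.
    static final String NO_BID = "NO_BID";
    static final long TIMEOUT_MS = 100;

    // Hands the task to a pool thread and waits at most TIMEOUT_MS for its result.
    static String bid(ExecutorService executor, Callable<String> task)
            throws InterruptedException, ExecutionException {
        Future<String> future = executor.submit(task);
        try {
            return future.get(TIMEOUT_MS, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // The caller gives up here, but the task itself is not cancelled
            // and keeps running on its worker thread until it finishes.
            return NO_BID;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        try {
            // A fast task should come back with its own response.
            System.out.println(bid(executor, () -> "BID 0.01"));
            // A task slower than the budget should fall through to NO_BID.
            System.out.println(bid(executor, () -> {
                Thread.sleep(500);
                return "too late";
            }));
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Every request in this design costs two threads: the servlet thread blocked in `future.get(...)` and the worker thread running the task.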

11 of 55

Servlet Threads

Worker Threads

12 of 55

Servlet Threads

Worker Threads

13 of 55

Servlet Threads

Worker Threads

14 of 55

Servlet Threads

Worker Threads

Timeout

15 of 55

Servlet Threads

Worker Threads

Timeout

16 of 55

ExecutorService executor = ...;

try {
    future = executor.submit(new Task(...));
    response = future.get(TIMEOUT, MILLISECONDS);
    return response;
} catch (TimeoutException e) {
    return NO_BID;
}

17 of 55

Processing Pipeline

Stage1

Stage2

Stage3

Stage4

18 of 55

return new Task(...).call();
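A small sketch of what this one-liner amounts to: the task runs directly on the calling (servlet) thread, with no executor hand-off and no `Future`. The `Task` body, the `String` request and response types, and the `handle` method are hypothetical stand-ins, since the slides elide the real ones. The slides also do not say how the timeout is enforced after this change, so the sketch leaves that out.

```java
import java.util.concurrent.Callable;

public class InlineCallSketch {
    // Hypothetical stand-in for the real bidding task.
    static class Task implements Callable<String> {
        private final String request;

        Task(String request) {
            this.request = request;
        }

        // Narrowed to throw no checked exception, so callers can invoke it directly.
        @Override
        public String call() {
            return "BID for " + request;
        }
    }

    // Runs the task on the caller's thread: no queueing, no second thread,
    // no context switch between submit and get.
    static String handle(String request) {
        return new Task(request).call();
    }

    public static void main(String[] args) {
        System.out.println(handle("request-1"));
    }
}
```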

19 of 55

99th percentile latency

20ms -> <10ms

20 of 55

21 of 55

Server Fever

Server CPU Usage

22 of 55

  • $ perf top
  • JDK Native -> GC
  • Attach JProfiler
  • Full GC

23 of 55

JVM Heap

Young Gen

Old Gen

24 of 55

JVM Heap

Young Gen

Old Gen

25 of 55

New JVM flag

-XX:NewSize=13G
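`-XX:NewSize` sets the initial size of the young generation. One way to check what the JVM actually ended up with is to list the heap memory pools from inside the process. The sketch below uses the standard `java.lang.management` API; the class name `HeapPoolsSketch` is made up for illustration, and the pool names it prints depend on the collector in use.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

public class HeapPoolsSketch {
    // Returns one line per heap pool (eden, survivor, old gen, ...) with its sizes in MiB.
    static List<String> describeHeapPools() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip metaspace, code cache and other non-heap pools
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            // getMax() is -1 when the pool has no defined maximum.
            long maxMiB = usage.getMax() < 0 ? -1 : usage.getMax() >> 20;
            lines.add(String.format("%-25s committed=%d MiB, max=%d MiB",
                    pool.getName(), usage.getCommitted() >> 20, maxMiB));
        }
        return lines;
    }

    public static void main(String[] args) {
        describeHeapPools().forEach(System.out::println);
    }
}
```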

26 of 55

Panadol.sh

$ jcmd org.apache.catalina.startup.Bootstrap GC.run
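`GC.run` asks the target JVM to call `java.lang.System.gc()`. The same request can be made from inside the process, and the collector counters show whether it had any effect. This is a sketch with an invented class name, `ExplicitGcSketch`; it assumes explicit GC has not been disabled (for example with `-XX:+DisableExplicitGC`), in which case the counters will not move.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class ExplicitGcSketch {
    // Sums the collection counts reported by every collector in this JVM.
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 when the collector does not report a count
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long before = totalCollections();
        System.gc(); // the in-process equivalent of what GC.run requests
        long after = totalCollections();
        System.out.println("collections before=" + before + ", after=" + after);
    }
}
```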

27 of 55

After the daily dose of panadol

28 of 55

29 of 55

Traffic is on the rise

30b+ -> 40b+

30 of 55

A New Problem

31 of 55

100 requests

50 responses in time

20 requests now

Internet

Exchange

PocketMath

32 of 55

33 of 55

Observation

  • More TCP connections
  • More spillover
  • Tomcat drops the connections too early!
  • keepAliveTimeout = 5min

34 of 55

When in doubt, read the doc!

35 of 55

Keep Alive Has A Counter!

<Connector port="8080"
           ...
           maxKeepAliveRequests="-1"
           ... />
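To see why the counter matters: Tomcat's `maxKeepAliveRequests` defaults to 100, and `-1` means unlimited. With the default, the server closes each connection after 100 requests even if the keep-alive timeout is far away. The arithmetic can be sketched as below, under a deliberately simplified model (one client sending requests back to back on one connection at a time, no timeouts or errors). The class and method names are invented for the example.

```java
public class KeepAliveSketch {
    // How many TCP connections a client must open to send `requests` requests
    // when the server closes each connection after `maxKeepAliveRequests` requests.
    static long connectionsNeeded(long requests, int maxKeepAliveRequests) {
        if (requests <= 0) {
            return 0;
        }
        if (maxKeepAliveRequests == -1) {
            return 1; // unlimited: one connection can carry everything
        }
        if (maxKeepAliveRequests < 1) {
            throw new IllegalArgumentException("expected -1 or a positive limit");
        }
        // Ceiling division: a partially used connection still has to be opened.
        return (requests + maxKeepAliveRequests - 1) / maxKeepAliveRequests;
    }

    public static void main(String[] args) {
        long requests = 1_000_000;
        System.out.println("limit 100: " + connectionsNeeded(requests, 100) + " connections");
        System.out.println("limit -1:  " + connectionsNeeded(requests, -1) + " connection");
    }
}
```

Under this model, a million requests need 10,000 connections with the default limit and a single connection with `-1`.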

36 of 55

37 of 55

Time to Party?

38 of 55

Traffic is on the rise

40b+ -> 60b+

39 of 55

40 of 55

100 requests

50 responses in time

20 requests now

Internet

Exchange

PocketMath

41 of 55

$ netstat -s | grep 'connections established'

42 of 55

43 of 55

The unusual log

Total time for which application threads were stopped: 0.0043330 seconds

Total time for which application threads were stopped: 4 seconds

-XX:+PrintGCApplicationStoppedTime
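Lines like the two above are easy to miss among the many short pauses. A small filter that pulls out only the long ones can be sketched as follows. The class name `StoppedTimeSketch` and the one-second threshold are assumptions for illustration; the pattern matches the message text shown on this slide and tolerates extra text after it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StoppedTimeSketch {
    // Matches the stopped-time message and captures the number of seconds.
    private static final Pattern STOPPED = Pattern.compile(
            "Total time for which application threads were stopped: ([0-9]+(?:\\.[0-9]+)?) seconds");

    // Returns the pause lengths (in seconds) that are at or above the threshold.
    static List<Double> longPauses(List<String> logLines, double thresholdSeconds) {
        List<Double> pauses = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = STOPPED.matcher(line);
            if (m.find()) {
                double seconds = Double.parseDouble(m.group(1));
                if (seconds >= thresholdSeconds) {
                    pauses.add(seconds);
                }
            }
        }
        return pauses;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Total time for which application threads were stopped: 0.0043330 seconds",
                "Total time for which application threads were stopped: 4 seconds");
        System.out.println(longPauses(lines, 1.0));
    }
}
```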

44 of 55

45 of 55

WTH is a Safepoint

  • GC
  • Optimization
  • Lock unbiasing

46 of 55

Safepoint cont.

  • blocked on a lock
  • blocked on a synchronized block
  • waiting on a monitor
  • parked
  • blocked on blocking IO

-XX:+PrintSafepointStatistics

http://psy-lob-saw.blogspot.sg/2015/12/safepoints.html

47 of 55

grep -E "^[0-9]+\\.[0-9]+" catalina.log

48 of 55

The Culprit

Instruction

Repo

Compiler

Engine

Requests

49 of 55

Quick Experiment

50 of 55

fix

51 of 55

52 of 55

53 of 55

The Matrix

                   Before    After             Change
Daily requests     30b       60b               2X
Bidding servers    ~60       47                78%
99% latency        100ms     single-digit ms   10~20X

Auctions           2,104,721,951
Auction timeouts   0.0% (6,300)
Transfer errors    0.0% (5,764)

54 of 55

Fast and Furious

55 of 55