Webtools:Scalability: Difference between revisions

From MozillaWiki


==Infrastructure==
===[http://knowledgehub.zeus.com/docs Zeus]===
Zeus is the replacement for Netscaler.
===[http://www.citrix.com/English/ps2/products/product.asp?contentID=21679 Netscaler]===
''[http://support.citrix.com/article/CTX112300 Manuals and Release Notes]''


One of the best ways to lower the load on a server is to reduce its traffic, which is essentially the goal of the Netscaler.  The Netscaler sits between the internet and the webservers and provides:
* SSL offloading:  SSL connections are made with the Netscaler (which has hardware dedicated to setting up SSL connections) instead of the webservers. Connections to the webservers are always over HTTP, so if you're writing code that expects https, remember that the webheads never see the SSL connection.


* Content Caching:  Outbound content (webhead -> internet) is cached according to rules on the netscaler.  To get the current set of rules, you'll need to talk to IT. The default set of rules is on page 336 of the [http://support.citrix.com/article/CTX112300 2nd install guide]


* Some monitoring/balancing of the web servers. Webservers that respond faster are  given more traffic.  If a webserver becomes unresponsive, the netscaler will  discontinue sending requests to it until it comes back to life.


=== Production Webserver Cluster ===


Behind the Netscaler is a cluster of 12 webservers (+/- a few at any given time). Most sites are on one or more webservers. When developing code, realize that requests can go to any of the webservers at any time, so it's important to make your code independent of the specific server (use the db for sessions, etc.)
* [https://nagios.mozilla.org/graphs/mpt/Systems/webapp_81.html Load graphs] (ldap login) for the webheads are available


Line 27: Line 31:


* A read-only slave is available, and is generally only a few seconds (<10) behind the master.
* When doing large batch jobs, or expensive queries that will lock db tables, it's best to use the read-only slave so the master can keep working.
* [https://nagios.mozilla.org/graphs/mpt/Systems/ Load graphs] (ldap login) for the db servers are available


Line 39: Line 43:


Currently, development is done on standalone virtual machines running all the software (mysql, apache, etc.).  Generally, development servers are only available in the VPN.


== Coding ==
Line 58: Line 59:
This is a php accelerator that basically compiles the php, and then stores that value.  When the page is requested, it can skip recompiling and just serve what it already has.


* AMOv3 is using this, and we're relatively happy with it.  We had some strange issues with apache segfaulting ([https://bugzilla.mozilla.org/show_bug.cgi?id=375300 bug 375300]) but after removing a file from eAccelerator's cache, it has stopped.  The file was doing nothing special, and it's still a mystery why it caused seg faults.


=== [http://pecl.php.net/package/memcache/ Memcache] ===


This has been a great app for us, since it's simple and effective.  Any data that is accessed often and can be hashed is a good candidate for memcache.


* In AMOv2 we stored the complete page output in memcache using the URL as the key
* In AMOv3 (remora) we're storing db query results using the query as a key
* [http://en.wikipedia.org/wiki/Memcached Additional Memcache Info]
==Profiling==
If you're using up a lot of CPU on the web servers, profiling the code is a great way to tell where the bottlenecks are.  You should be able to get similar profiles no matter what machine you run on, so the development machines are fine.
===[http://xdebug.org/ Xdebug]===
* They provide [http://xdebug.org/docs-profiling.php documentation on using it]
===[http://pecl.php.net/package/apd APD]===
* [http://www.linuxjournal.com/article/7213 A good APD walkthrough]
I'm adding links to the two most popular php profiling tools because I've had mixed results with both.  If one is giving you segfaults, try the other.  Both generate files that can be read by [http://kcachegrind.sourceforge.net/cgi-bin/show.cgi KCachegrind], a good tool for visualizing the data.  Both also have command-line utilities.
==Load Testing==
Once a site is written, it's a good idea to load test it to get an idea of how many hits per second it can handle.  Whichever program you use, keep in mind the infrastructure you're testing on.  Testing from one machine to one server will give you an idea of what your code can do, but it's definitely not the same as a cluster of machines.  The same goes for server->database connections.
===[http://httpd.apache.org/docs/2.0/programs/ab.html ab]===
A pretty basic/simple benchmarking program.  It works, but it's important to realize that it isn't distributed, so you might be maxing out the source machine and not the server.
===[http://grinder.sourceforge.net/ Grinder]===
This is a good idea (distributed benchmarking), but we didn't get it to work as advertised. Overall disappointing - maybe revisit when it matures.
===[http://github.com/oremj/logreplay log_replay]===
This is a python script that oremj wrote.  Given a log, it will replay the hits.  This gives you the advantage of replaying an actual set of hits across multiple pages, giving you a good distribution of hits.  This is mostly useful if you have actually had people using the site.

Latest revision as of 21:00, 26 January 2011

THIS PAGE IS A WORKING DRAFT
The page may be difficult to navigate, and some information on its subject might be incomplete and/or evolving rapidly.

Scalability and Performance

This document is a short summary of our infrastructure and software for developing high performance web apps. When a new project is being considered/written, it should be planned with the following in mind, so it can scale well.

Infrastructure

Zeus

Zeus is the replacement for Netscaler.

Netscaler

Manuals and Release Notes

One of the best ways to lower the load on a server is to reduce its traffic, which is essentially the goal of the Netscaler. The Netscaler sits between the internet and the webservers and provides:

  • SSL offloading: SSL connections are made with the Netscaler (which has hardware dedicated to setting up SSL connections) instead of the webservers. Connections to the webservers are always over HTTP, so if you're writing code that expects https, remember that the webheads never see the SSL connection.
  • Content Caching: Outbound content (webhead -> internet) is cached according to rules on the netscaler. To get the current set of rules, you'll need to talk to IT. The default set of rules is on page 336 of the 2nd install guide
  • Some monitoring/balancing of the web servers. Webservers that respond faster are given more traffic. If a webserver becomes unresponsive, the netscaler will discontinue sending requests to it until it comes back to life.
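
Because SSL ends at the load balancer, code on a webhead cannot inspect its own connection to learn whether the visitor used https. One common workaround is to read a header that the load balancer adds to each request. The sketch below is in Python for brevity (the sites themselves are PHP). The header name X-Forwarded-Proto and the function name are assumptions, not something this page confirms; check with IT what, if anything, the Netscaler or Zeus is configured to send.

```python
def original_scheme(environ, header="HTTP_X_FORWARDED_PROTO"):
    """Return the scheme the visitor used, given a WSGI-style environ dict.

    Assumption: the load balancer adds an X-Forwarded-Proto header.
    If the header is missing or unrecognized, fall back to what the
    webhead itself saw, which behind SSL offloading is plain http.
    """
    forwarded = environ.get(header, "").strip().lower()
    if forwarded in ("http", "https"):
        return forwarded
    return environ.get("wsgi.url_scheme", "http")
```

Only trust such a header when requests can reach the webheads solely through the load balancer; otherwise a client could set it directly.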

Production Webserver Cluster

Behind the Netscaler is a cluster of 12 webservers (+/- a few at any given time). Most sites are on one or more webservers. When developing code, realize that requests can go to any of the webservers at any time, so it's important to make your code independent of the specific server (use the db for sessions, etc.)

  • Load graphs (ldap login) for the webheads are available
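
To illustrate "use the db for sessions", here is a minimal sketch in Python (the sites themselves are PHP). An in-memory SQLite database stands in for the shared MySQL server, and the class name, table name, and columns are all hypothetical. The point is only that session data lives in the database, so whichever webhead receives the next request can load it.

```python
import json
import sqlite3
import uuid


class DbSessionStore:
    """Keep session data in a shared database instead of on one webhead."""

    def __init__(self, conn):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS sessions (id TEXT PRIMARY KEY, data TEXT)"
        )

    def create(self, data):
        # Store the session as JSON under a random id and return that id.
        session_id = uuid.uuid4().hex
        self.conn.execute(
            "INSERT INTO sessions (id, data) VALUES (?, ?)",
            (session_id, json.dumps(data)),
        )
        self.conn.commit()
        return session_id

    def load(self, session_id):
        # Return the stored dict, or None if the id is unknown.
        row = self.conn.execute(
            "SELECT data FROM sessions WHERE id = ?", (session_id,)
        ).fetchone()
        return json.loads(row[0]) if row else None


# Two "webheads" sharing one database: either can serve the next request.
shared_db = sqlite3.connect(":memory:")
webhead_a = DbSessionStore(shared_db)
webhead_b = DbSessionStore(shared_db)
sid = webhead_a.create({"user": "example"})
```

Here webhead_b.load(sid) returns the data that webhead_a stored, which is the property the cluster needs.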


Database Servers

There are 4 (+/- a couple) database servers running MySQL.

  • A read-only slave is available, and is generally only a few seconds (<10) behind the master.
  • When doing large batch jobs, or expensive queries that will lock db tables, it's best to use the read-only slave so the master can keep working.
  • Load graphs (ldap login) for the db servers are available
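
One way to apply the advice above is to choose the connection per query rather than per application. This is a rough sketch in Python (the sites themselves are PHP); the class and its method names are hypothetical, and SQLite connections stand in for the real MySQL master and slave.

```python
import sqlite3


class ConnectionRouter:
    """Send writes to the master and heavy reads to the read-only slave."""

    def __init__(self, master, slave):
        self.master = master
        self.slave = slave

    def for_write(self):
        return self.master

    def for_read(self, needs_fresh_data=False):
        # The slave can lag the master by a few seconds, so a read that must
        # see a write made moments ago still goes to the master.
        return self.master if needs_fresh_data else self.slave


# Stand-ins for the real connections (assumption: any DB-API objects work).
master = sqlite3.connect(":memory:")
slave = sqlite3.connect(":memory:")
router = ConnectionRouter(master, slave)
```

Batch jobs and expensive reports would call router.for_read(), while code that has just written a row and needs to read it back would pass needs_fresh_data=True.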

Staging Servers

Staging servers are non-virtual machines, but not on a cluster.

  • Sites can be set up to SVN up themselves via cron jobs, so changes pushed to a tag can be seen automatically.

Development Servers

Currently, development is done on standalone virtual machines running all the software (mysql, apache, etc.). Generally, development servers are only available in the VPN.

Coding

Reasonably efficient code should always be a first step when considering performance issues (don't put queries in loops, etc.). Due to the large amount of traffic we get, we need to supplement our code with additional caching software.
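
As a concrete case of "don't put queries in loops", the sketch below contrasts one query per id with a single query for all ids. It is in Python with an in-memory SQLite database for brevity (the sites themselves are PHP on MySQL), and the addons table and function names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE addons (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO addons (id, name) VALUES (?, ?)",
    [(1, "alpha"), (2, "beta"), (3, "gamma")],
)


def names_query_in_loop(conn, ids):
    """Slow pattern: one round trip to the database per id."""
    names = []
    for addon_id in ids:
        row = conn.execute(
            "SELECT name FROM addons WHERE id = ?", (addon_id,)
        ).fetchone()
        names.append(row[0])
    return names


def names_single_query(conn, ids):
    """Better: fetch every row in one query, then order the results in code."""
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, name FROM addons WHERE id IN ({placeholders})", list(ids)
    ).fetchall()
    by_id = dict(rows)
    return [by_id[addon_id] for addon_id in ids]
```

Both functions return the same names for ids that exist, but the second makes one round trip to the database however many ids are requested.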


APC

This is a php accelerator (caches compiled code).

  • We've had issues with segfaulting when this was enabled. Currently none of our projects use this product.

eAccelerator

This is a php accelerator that basically compiles the php, and then stores that value. When the page is requested, it can skip recompiling and just serve what it already has.

  • AMOv3 is using this, and we're relatively happy with it. We had some strange issues with apache segfaulting (bug 375300) but after removing a file from eAccelerator's cache, it has stopped. The file was doing nothing special, and it's still a mystery why it caused seg faults.

Memcache

This has been a great app for us, since it's simple and effective. Any data that is accessed often and can be hashed is a good candidate for memcache.

  • In AMOv2 we stored the complete page output in memcache using the URL as the key
  • In AMOv3 (remora) we're storing db query results using the query as a key
  • Additional Memcache Info
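
The AMOv3 approach of keying cached results on the query can be sketched as follows. This is not the remora code: it is a Python illustration in which DictCache stands in for a memcache client, SQLite stands in for MySQL, and every name is hypothetical. The key is a hash of the query because memcached keys are limited to 250 bytes and cannot contain whitespace.

```python
import hashlib
import sqlite3


class DictCache:
    """Stand-in for a memcache client; only get() and set() are used."""

    def __init__(self):
        self.store = {}

    def get(self, key):
        return self.store.get(key)

    def set(self, key, value):
        self.store[key] = value


def cache_key(sql, params):
    # Hash the query text and parameters instead of using them raw, so the
    # key is short and free of spaces regardless of the query.
    raw = sql + "|" + repr(tuple(params))
    return "query:" + hashlib.sha256(raw.encode("utf-8")).hexdigest()


def cached_query(conn, cache, sql, params=()):
    """Return rows for the query, hitting the database only on a cache miss."""
    key = cache_key(sql, params)
    rows = cache.get(key)
    if rows is None:
        rows = conn.execute(sql, params).fetchall()
        cache.set(key, rows)
    return rows
```

The sketch never expires entries. With a real memcache client you would normally pass an expiry time when storing, so that stale query results age out.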

Profiling

If you're using up a lot of CPU on the web servers, profiling the code is a great way to tell where the bottlenecks are. You should be able to get similar profiles no matter what machine you run on, so the development machines are fine.

Xdebug

APD

I'm adding links to the two most popular php profiling tools because I've had mixed results with both. If one is giving you segfaults, try the other. Both generate files that can be read by KCachegrind, a good tool for visualizing the data. Both also have command-line utilities.

Load Testing

Once a site is written, it's a good idea to load test it to get an idea of how many hits per second it can handle. Whichever program you use, keep in mind the infrastructure you're testing on. Testing from one machine to one server will give you an idea of what your code can do, but it's definitely not the same as a cluster of machines. The same goes for server->database connections.

ab

A pretty basic/simple benchmarking program. It works, but it's important to realize that it isn't distributed, so you might be maxing out the source machine and not the server.

Grinder

This is a good idea (distributed benchmarking), but we didn't get it to work as advertised. Overall disappointing - maybe revisit when it matures.

log_replay

This is a python script that oremj wrote. Given a log, it will replay the hits. This gives you the advantage of replaying an actual set of hits across multiple pages, giving you a good distribution of hits. This is mostly useful if you have actually had people using the site.
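
The idea behind replaying a log can be sketched in a few lines of Python. This is not oremj's script, only an illustration under stated assumptions: the log is in Apache common or combined format, only GET requests are replayed, and the caller supplies the fetch function (for example urllib.request.urlopen). All names here are hypothetical.

```python
import re

# Matches the request part of an Apache common/combined log line,
# e.g. ... "GET /en-US/firefox/ HTTP/1.1" 200 ...
REQUEST_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})'
)


def requests_from_log(lines):
    """Yield (method, path) for each log line that contains a request."""
    for line in lines:
        match = REQUEST_RE.search(line)
        if match:
            yield match.group("method"), match.group("path")


def replay(lines, fetch, base_url):
    """Replay the GET requests in the log against base_url using fetch(url).

    fetch is supplied by the caller, which keeps this sketch usable
    without a network. Returns the number of requests replayed.
    """
    count = 0
    for method, path in requests_from_log(lines):
        if method == "GET":  # replaying POSTs blindly could change data
            fetch(base_url.rstrip("/") + path)
            count += 1
    return count
```

Because the paths come from real traffic, the replayed requests hit pages in roughly the proportions that visitors actually requested them.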