Enterprise PHP – Caching Part 1
Elephants Never Forget and Neither Does Caching
You may have found that creating a working web application is much easier than getting it to scale well. With the LAMP stack, it’s probably more of a love-hate relationship: most will agree that PHP’s simplicity is an incredible pro when developing, but a con with regard to scalability. I believe there is a preconceived notion that, because the LAMP stack does not by default provide the enterprise features you would find in the J2EE space, PHP simply cannot perform like a heavyweight Java application. Caching, and bytecode caching in particular, dispels that idea. Bytecode caching is one of the easiest and most common ways to greatly improve the performance and scalability of PHP applications.
In web applications, you’ll find there are two typical kinds of caching: object and output. Object caching lets a system keep pieces of data that are expensive to calculate or retrieve in memory, which lowers the cost of repeated requests. Output caching, either full or partial page, works by capturing the HTML that an application produces, which allows repeated requests to bypass the application server completely. Because PHP is a scripting language, it allows for a third type: bytecode caching.
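To make the difference between object and output caching concrete, here is a minimal sketch. It assumes a file-backed store in the system temp directory so that it runs without any extension; the helper names cache_path, cached_value and cached_output are hypothetical and not part of PHP or APC. A real deployment would more likely keep entries in memory, and a full-page output cache usually sits in front of the application rather than inside it.

```php
<?php
// Hypothetical helper: maps a cache key to a file in the temp directory.
function cache_path($key)
{
    return sys_get_temp_dir() . '/demo_cache_' . md5($key);
}

// Object caching: reuse a stored value while it is fresh, otherwise
// compute it once and store it for later requests.
function cached_value($key, $ttl, $compute)
{
    $file = cache_path('obj_' . $key);
    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        return unserialize(file_get_contents($file));
    }
    $value = call_user_func($compute);
    file_put_contents($file, serialize($value), LOCK_EX);
    return $value;
}

// Output caching: capture the HTML a callback prints and return the
// stored copy on later calls instead of rendering again.
function cached_output($key, $ttl, $render)
{
    $file = cache_path('html_' . $key);
    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        return file_get_contents($file);
    }
    ob_start();                 // start buffering everything that is echoed
    call_user_func($render);
    $html = ob_get_clean();     // fetch the buffer and stop buffering
    file_put_contents($file, $html, LOCK_EX);
    return $html;
}
```

In both helpers the expensive callback only runs when there is no fresh entry for the key, which is the whole point: the cost is paid once per time-to-live window instead of once per request.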
Architecture that is Worlds Apart
To better understand bytecode caching, I think it’s necessary to explain the fundamental differences between a more traditional “compiled” language such as Java and a scripting language, in this instance PHP. The application lifecycle of a Java program involves the JVM loading class files of Java bytecode on demand and either interpreting or JIT compiling them. Once a class has been loaded, it stays in memory and the file does not need to be accessed again. Java application servers handle many requests within the same process space, so once a block of code has been loaded, optimized or compiled, that work persists between requests. The price is paid once while the benefits carry forward.
PHP has no state between requests. Because requests are not handled in the same process space*, at the end of a request everything the interpreter has loaded and processed is discarded. Therefore, on every request each PHP file involved is loaded from disk, converted into Zend bytecode and then interpreted by the Zend Engine. Please allow that to digest for a moment.
If you come from the enterprise space this might seem a bit crazy, but if you consider how Unix/Linux is designed, it makes a lot of sense. Creating a large, all-encompassing system is simply not how things are done. Instead, tasks are accomplished by creating small tools that each do one thing extremely well and piecing them together. Because the Zend Engine, which ultimately runs your PHP application, is extremely modular, developers are able to write extensions that fill in missing or desired functionality, such as bytecode caching. Bytecode caches work by keeping a copy of the Zend bytecode outside of the PHP process space, typically in shared memory, so it can be preserved between script executions, thus removing the need to reload the files from disk and parse them again.
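The lookup a bytecode cache performs can be illustrated with a toy model. This is only a sketch under loud assumptions: token_get_all() stands in for compilation to Zend bytecode, a plain PHP array stands in for the memory that outlives a request, and the class name ToyBytecodeCache is hypothetical. It is not how APC is implemented; it only shows the idea of compiling once and optionally checking the file's modification time before reusing the result.

```php
<?php
// Toy model only: token_get_all() stands in for compiling to Zend bytecode,
// and a plain array stands in for memory shared between requests.
class ToyBytecodeCache
{
    private $entries = array();
    private $stat;
    public $compiles = 0;   // how many times a file was actually "compiled"

    public function __construct($stat = true)
    {
        $this->stat = $stat;
    }

    public function load($path)
    {
        if (isset($this->entries[$path])) {
            $entry = $this->entries[$path];
            // With stat checks on, a changed mtime invalidates the entry.
            // With stat checks off, the cached copy is always trusted.
            if (!$this->stat || $entry['mtime'] === $this->mtime($path)) {
                return $entry['code'];
            }
        }
        // Cache miss: read the file from disk and "compile" it.
        $this->compiles++;
        $code = token_get_all(file_get_contents($path));
        $this->entries[$path] = array(
            'mtime' => $this->mtime($path),
            'code'  => $code,
        );
        return $code;
    }

    private function mtime($path)
    {
        clearstatcache();   // avoid PHP's own cached stat results
        return filemtime($path);
    }
}
```

Loading the same unchanged file twice leaves the compile counter at 1, which is the saving a real bytecode cache gives you on every request after the first.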
Tools of the Trade
There are a number of bytecode caching modules available, including APC (Alternative PHP Cache), XCache, eAccelerator and Zend Optimizer+. The amazing thing about bytecode caching is that it is completely transparent to your application. There are many things in this world that claim to magically “make things better” without any effort, but bytecode caching actually delivers on that claim. I am focusing on APC because it is maintained by members of the core PHP development team and has shown itself to be extremely compatible with the changes in the 5.x branch of PHP. Assuming you already have Apache2 installed with PHP5, installing APC is extremely simple.
Debian/Ubuntu
sudo apt-get install php-apc
sudo /etc/init.d/apache2 restart
Fedora Core/RHEL
yum -y install php-pecl-apc
/etc/init.d/httpd restart
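On either distribution, once the web server has restarted, a quick way to confirm that the extension was picked up is to ask PHP directly. The following is a minimal sketch; describe_apc_status is a hypothetical helper name, while extension_loaded() and ini_get() are standard PHP functions and apc.enabled is APC's on/off setting.

```php
<?php
// Hypothetical helper: reports whether the APC extension is available.
function describe_apc_status()
{
    if (!extension_loaded('apc')) {
        return 'APC is not loaded; check that the package installed and the web server was restarted.';
    }
    // apc.enabled is the extension's on/off switch; "0" or "" means off.
    $enabled = ini_get('apc.enabled') ? 'enabled' : 'disabled';
    return 'APC is loaded and ' . $enabled . '.';
}

echo describe_apc_status(), "\n";
```

Request this script through the web server rather than the command line, since the command-line PHP binary may read a different configuration.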
Windows
- apc.shm_size – Amount of memory allocated to caching. Defaults to 32M.
- apc.ttl – Number of seconds an entry may sit idle in a slot that another entry needs before it can be evicted. Idle time is measured from the entry’s last access.
- apc.stat – Whether APC should stat a file before returning the cached version to check if it has been updated. Set to 1 for development, 0 for production. With 0, changed files are not picked up until the cache is cleared or the web server is restarted.
- apc.max_file_size – The largest file you plan on caching. Defaults to 1M.
- apc.num_files_hint – An estimate of the number of files that will be cached. This helps APC optimize its memory use and should be easy to determine.
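Put together, the settings above might look like the following fragment in php.ini (or in a separate APC ini file, if your distribution uses one). The values are illustrative assumptions for a production server, not recommendations from the original text. Some older APC releases expect apc.shm_size as a bare number of megabytes without the M suffix, so check the documentation for your installed version.

```ini
; Illustrative values only; tune them for your own application.
apc.shm_size = 64M          ; memory allocated to the cache
apc.ttl = 3600              ; seconds an idle entry may keep a slot another entry needs
apc.stat = 0                ; production: do not stat files on every request
apc.max_file_size = 1M      ; files larger than this are not cached
apc.num_files_hint = 1000   ; rough count of PHP files in the application
```

Restart the web server after changing these values so they take effect.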