|Viewing Issue Advanced Details|
|ID||Category||Type||Reproducibility||Date Submitted||Last Update|
|0000296||[In-Portal CMS] Optimization||feature request||N/A||2009-09-16 00:56||2010-08-31 14:19|
|Reporter||Dmitry||View Status||public||Project Name||In-Portal CMS|
|ETA||none||Fixed in Version||Product Version||5.1.0|
|Target Version||Icebox||Product Build|
|Time Estimate||No estimate|
|Summary||0000296: Research and Create Optimization Plan|
We need to improve performance.
Let's aim to cut page load time in half (2x)!
Let's start with detailed profiling of PHP, then MySQL and other aspects, using Zend Server/Profiler as the main benchmarking tool.
We should definitely use Memcached, but also consider caching entire content on the file system.
Results, ideas and other notes can be added here and then moved into separate tasks.
|Steps To Reproduce|
|Tags||No tags attached.|
|Change Log Message|
We have a very good starting point with our current memcache integration (see 0000107), but it could still be improved.
Fallback for web servers without memcache
As a fallback for web servers without memcache we could implement a memcache analog (without distributed data storage, of course), where the actual data is stored in the database and an agent cleans expired records on a regular basis. If one slow query (calculating the data) is replaced with one quick query (retrieving the cached data), that's still a performance boost, because total query runtime decreases. This will of course increase the script's total memory usage, because data retrieved from our pseudo-memcache is held locally in memory, but if it helps to decrease script runtime, then it's worth it.
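The fallback described above could look roughly like the sketch below. It is only an illustration in Python with SQLite standing in for the real database; the table name, column names and the DbCache class are assumptions, not existing In-Portal code.

```python
import pickle
import sqlite3
import time


class DbCache:
    """Memcache-like store backed by a database table (hypothetical schema)."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "name TEXT PRIMARY KEY, data BLOB, expires REAL)"
        )

    def set(self, name, value, ttl=None, now=None):
        # ttl=None means the record never expires on its own.
        now = time.time() if now is None else now
        expires = None if ttl is None else now + ttl
        self.conn.execute(
            "REPLACE INTO cache (name, data, expires) VALUES (?, ?, ?)",
            (name, pickle.dumps(value), expires),
        )

    def get(self, name, now=None):
        # One quick lookup by primary key replaces the slow calculation query.
        now = time.time() if now is None else now
        row = self.conn.execute(
            "SELECT data FROM cache WHERE name = ? "
            "AND (expires IS NULL OR expires > ?)",
            (name, now),
        ).fetchone()
        return None if row is None else pickle.loads(row[0])

    def purge_expired(self, now=None):
        # What the periodic agent would run; returns the number of rows removed.
        now = time.time() if now is None else now
        cur = self.conn.execute(
            "DELETE FROM cache WHERE expires IS NOT NULL AND expires <= ?",
            (now,),
        )
        return cur.rowcount
```

Expired rows are already ignored by get, so the agent only has to keep the table small; it does not affect correctness.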
Cached data expiration
When using memcache there is one slight problem: we can't ask it what data is stored without knowing the exact key under which the data was originally stored. As a workaround we have implemented a "serial number" system. The idea is that we keep a set (as many as we need) of variables called serial numbers (e.g. CacheRebuildSerial, PermissionRebuildSerial, MasterSerial, etc.) that are also stored in memcache. Serial number variables are stored with infinite expiration time (they never expire on their own), and variables that use them should also be stored with the same (infinite) expiration time. Each serial is a number that is incremented whenever the data related to it changes.
When a decision is made to store some data in memcache, the cache key name should also contain the current serial number of the relevant type. For example, a user permission check result is calculated once and then stored in a cached variable named "UserPermission-<user_id>-<user_groups>-<category_id>-<PermissionRebuildSerial>". If the user is added to or removed from a group, then the "<user_groups>" part of the key will differ from the cached one and the data will be recalculated when it is next requested (not at the moment the user's group list changes). The old key will be cleaned up automatically by memcache (not by expiration time) during a regular garbage collection run. When some group permission is changed without changing user groups, the "PermissionRebuildSerial" variable is incremented, which invalidates all data that was stored using its previous value as part of the cached variable name.
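To make the serial number idea concrete, here is a minimal sketch in Python. A plain dict stands in for memcache, and the SerialCache class and its method names are made up for illustration; only the key layout and the PermissionRebuildSerial name come from the description above.

```python
class SerialCache:
    """Dict stands in for memcache; keys embed the current serial value."""

    def __init__(self):
        self.store = {}

    def serial(self, name):
        # Serials live in the cache itself and are stored without expiration.
        return self.store.setdefault(name, 1)

    def bump(self, name):
        # Called when the related data changes, e.g. a group permission edit.
        self.store[name] = self.serial(name) + 1

    def permission_key(self, user_id, user_groups, category_id):
        groups = ",".join(str(g) for g in sorted(user_groups))
        serial = self.serial("PermissionRebuildSerial")
        return f"UserPermission-{user_id}-{groups}-{category_id}-{serial}"

    def get_permission(self, user_id, user_groups, category_id, compute):
        key = self.permission_key(user_id, user_groups, category_id)
        if key not in self.store:
            self.store[key] = compute()  # slow path, runs once per key
        return self.store[key]
```

Nothing is ever deleted explicitly: after bump("PermissionRebuildSerial") the old keys are simply never asked for again and are left for the cache to discard.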
Why cache
Caching decreases total script runtime, but sometimes increases the amount of memory used by the script. When cached data is stored outside of the main script (as in memcache), the script's memory usage does not increase dramatically.
What to put in cache
If the same database queries are executed from different places in the system, we should first determine whether this is really needed and, if it is, place the query result in the cache for later use. This way we directly reduce the total database query count and decrease database server load. We could also put executed template parts (pure HTML) or even whole executed templates (pure HTML) in the cache. This reduces load on both the web server and the database server, because we simply fetch the requested template part (or template) and don't calculate anything.
Problems with using the cache
Usually data from the database or from files is cached, and we should ensure that when the source data changes, the cached data is updated as well. This is not easy to implement when we are caching overly complex data patterns. It's easy to reset the cache when group permissions are changed, as in the example above, but it's hard when a template gathers data from different places using different criteria and the whole template is cached.
For example, we have a template that shows top-level categories/sections, and we cache it. Logically, the cache should be reset when one of the top-level categories/sections is directly manipulated (add/edit/delete). The problem is that when manipulating categories we have no idea which template does what with them. That's why we can only reset the global category cache serial, invalidating all cached templates at once.
Another example: we have a template where links from a category are shown. We should reset such a cached template when links from the given category are manipulated, but when manipulating links we don't know how the template operates on this data, so we reset all templates that display links.
Based on the examples above, whole-template caching is not really effective if data is changed too often. But if 9 out of 10 visitors see the cached page and one changes data on it, resulting in a cache reset, that's still a gain. Given all of the above, it would be good to know what we should reset on individual data changes. So we should somehow link a template to the data it displays (the more accurate the link, the longer the cache will be used).
Here is a simple linking logic for starters: collect all unit config prefixes used during template parsing and store them, for example, in the ThemeFiles table (in a new UsedPrefixes column), where we have one record for each template in a theme. This limits us to caching only Front-End templates, but that's no big deal, because the administrative console is too dynamic to cache anyway. We should also use the serials of all unit config prefixes used on the template in the cache key name for cached template storage. When something is changed, we increment the serial number associated with the given unit config prefix. This way the cache will be reset automatically.
Prefixes such as "u" (current user), "lang" (current language) and "theme" (current theme) are used on every template anyway, so user registrations/user profile changes, site translation changes and theme changes will reset all templates at once, but that's always a good thing, because we don't want to have a cached template with an outdated phrase translation.
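A rough sketch of how the per-prefix serials could end up in a template cache key (Python, for illustration only). The function names, the key layout and the "c" and "l" prefixes for categories and links are assumptions; "u", "lang" and "theme" are the prefixes mentioned above.

```python
def template_cache_key(template, used_prefixes, serials):
    """Build a cache key from the serials of every prefix the template uses.

    used_prefixes: what would be recorded in the proposed UsedPrefixes column.
    serials: mapping of prefix -> current serial number (hypothetical storage).
    """
    parts = [f"{p}:{serials.get(p, 0)}" for p in sorted(set(used_prefixes))]
    return f"Template-{template}-" + "-".join(parts)


def on_data_change(prefix, serials):
    # Incrementing one prefix serial changes the key of every template using it.
    serials[prefix] = serials.get(prefix, 0) + 1
```

With this layout a category change only touches templates whose prefix list contains the category prefix; a template that only shows links keeps its key and stays cached.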
The second step of caching is to define some special serials during template parsing, which will also be listed in the ThemeFiles table in the given template's record. The current values of these serial numbers will of course be used when forming the cache key under which the cached template data is stored. These additional serial numbers must already be implemented, which means we can only specify serials that something will actually increment automatically (for instance on a specific data change). One of these special serials could be "Category-N-DataChange", where N is the category id. This serial will be incremented only when the data of category N or its children is changed. The initial set of special serials could be inspired by the parameters we give to the "InitList" and "PrintList" tags, like "parent_cat_id", "recursive" and so on.
There are also some templates that should never be cached as a whole. These are templates that show different content depending on the user who is viewing them, e.g. private message list, user profile or topic/post list. Some templates should take into account who is viewing them on a global scale (logged-in user or guest). This way we will have two cached versions of a template (for logged-in users and for guests), and this raises a storage problem, because we store the data in the ThemeFiles table, where there is only one record per template. We would then have to move all our new data storage to another place, such as the PhrasesCache table.
Also, when we have custom tags that, e.g., display the current user's membership or username on each template, we cannot cache these templates completely either. This is one major problem with the "advanced" theme, because we show the current user's name and surname in a side box on every template. Because of this we would have to divide every template into 3 parts: before the tricky tag(s), the tricky tag(s), after the tricky tag(s).
This division will also fail, because of the design/element/side box/content box template infrastructure, where the actual dynamic piece is stored in some included file and is not directly linked to the target template.
So we are really stuck here with themes where the current user is shown on every template, and we won't be creating cached template versions for each user on the site individually.
There could also be special serials for non-category-based data caching, but nothing comes to mind at the moment.
What we have already
We already have methods in the kApplication class named setCache and getCache. They store a given piece of data under a given cache key and return it. The cache only lives during the current script run and is rebuilt during the next script run. For a start this reduces the total database query count, but it doesn't go beyond that, because the cache is not available on the next script run. I've tried to store that cache in memcache instead of the current script and came across several problems:
- cache keys don't include serials of any kind -> the cache won't be reset when the cached data changes
- not all cache keys include in their name all variable parameters used in calculating the cached value
For example, the permission query result is cached without including the user groups at the moment of caching, and because of that it is never reset. Another example: every category-based item, and even categories themselves, have a Filename field where the URL part representing them is stored. That URL part is used for building mod-rewrite links to those items. Besides, the queries used for retrieving these URL parts by a given ID are a major problem when printing a list of items (each item has a different ID, resulting in one more query per item). So these retrieved URL parts are also cached in case we build more than one link to the same item. And again, this type of cache is not reset when an item's URL part is changed.
To parse a mod-rewrite URL we use 3 to 6 queries on each page (to get language, theme, category, category item, template, etc.). There is some caching that remembers the parsing result (variable set) for each given URL and stores it in the database. But this never got used: when a later parse produces a variable set that differs from the previously parsed set for the same URL (i.e. a link named "my_link" was deleted and another link named "my_link" was created later), we don't know which cached variable set is affected, because the actual URL is stored as an md5 hash (for quick access) and the parsed variable set is stored as a serialized array.
We could probably tie this cache reset to 1st-level (by unit config prefix) caching. For example, we know that a URL was parsed to the ids of a language, theme, category and link. When the language/theme/category/link table is changed (not necessarily the record with the parsed id), we could reset this cache. This will, for example, reset all URLs containing a link name when a new link is added to the database. It's brutal, but it allows us to reset at least something. We can't be more specific in resetting this type of cached data, because we most probably won't create an individual serial number for each record in the database, and since everything will be stored in memcache we can't get all the URLs that were associated with a given link/category when it is changed/deleted.
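The "brutal" reset for parsed URLs could be sketched like this (Python, illustration only). The key layout, the function names and the "c"/"l" prefixes are assumptions; the md5 of the URL and the per-table serials follow the description above.

```python
import hashlib


def url_cache_key(url, serials, tables=("lang", "theme", "c", "l")):
    """md5 of the URL plus the current serial of every table the parse touches."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    suffix = "-".join(str(serials.get(t, 0)) for t in tables)
    return f"ParsedUrl-{digest}-{suffix}"


def parse_cached(url, cache, serials, parse):
    key = url_cache_key(url, serials)
    if key not in cache:
        cache[key] = parse(url)  # the 3 to 6 queries happen only here
    return cache[key]
```

Any change to the links table bumps its serial and so drops every parsed URL at once, including ones that had nothing to do with the changed link. That is the price of not keeping a serial per record.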
Speeding up things
What is always slow is file inclusion, and for a template where a lot of data from different tables is displayed, the class/template/unit config file inclusions slow down the process. In the current "advanced" theme the data is split across too many templates; e.g. we could store all module side boxes in the same template as separate DefineElement block tags. We could also merge all module content boxes the same way. This will lower the load on the file system.
I don't think we should cache class definitions in any case, but we could cache whole unit configs and have a master UnitConfigCacheSerial serial, so that direct unit config file changes can be synchronized with the cached version.
Another strange thing: an empty template with static text (no tags) takes about 19 database queries (with agents) or 13 database queries (without agents) to show. That's strange: why run any database queries when no data from the database is shown? And this count is for the case when the session is not auto-created as in releases before version 5.0.1 of In-Portal.
The engine that calculates New/Hot/Pop/Pick/Featured records is totally inefficient when we have, for example, around 90000 links (and only 200 of them are pop). I propose creating a separate table where we store the ids of New/Hot/Pop/Pick/Featured records, and selecting them with one query when required. Checking a table of 200 records is faster than checking a table of 90000 records anyway. Data in the new table will be updated automatically when a record obtains or loses one of the predefined statuses. This will be most efficient for New/Hot/Pop statuses, because in "Auto" mode they are calculated on the fly using a special formula.
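The separate status table proposed above could be as simple as the following sketch (Python with SQLite, for illustration). The item_status table, its columns and the function names are hypothetical, not part of the current schema.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical side table: one row per (item, status) pair.
    conn.execute(
        "CREATE TABLE item_status ("
        "item_id INTEGER, status TEXT, PRIMARY KEY (item_id, status))"
    )


def set_status(conn, item_id, status, active):
    # Called when an item gains or loses a status such as 'pop' or 'hot'.
    if active:
        conn.execute(
            "INSERT OR IGNORE INTO item_status (item_id, status) VALUES (?, ?)",
            (item_id, status),
        )
    else:
        conn.execute(
            "DELETE FROM item_status WHERE item_id = ? AND status = ?",
            (item_id, status),
        )


def ids_with_status(conn, status):
    # One small query instead of evaluating the formula over every link.
    rows = conn.execute(
        "SELECT item_id FROM item_status WHERE status = ? ORDER BY item_id",
        (status,),
    )
    return [r[0] for r in rows]
```

The list pages would then join or filter on the returned ids, so the cost depends on the number of flagged records rather than on the size of the links table.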
For database query profiling I propose pumping around 100000 records of data into each table and measuring page loads then. This way it will be easier to tell whether the last optimization improved anything or not.
For PHP code optimization we should first subtract application initialization time from total script runtime and optimize each part separately. For example (mostly in the administrative console), each ajax query performs the whole application initialization cycle, but it should not. I don't know what to strip from application initialization during ajax requests, because the application doesn't know in advance what the ajax request will do, so it can't skip loading some of the stuff.
About event processing: we have a cool event processing engine, but the number of events called each time is huge relative to the amount of data being displayed. For example, for an empty template we call 104 events, and the method kEventManager::_getHooks is called 208 times (once before and once after event processing). The function "getmicotime" is called 30 times with the debugger turned off. It would be interesting to know who needs time measurement during a regular script run (with the debugger turned off). The function "constOn" is called 36 times; that's rather inefficient and should be replaced by a direct check of the constant's value, because this reduces the number of function calls and makes the script run faster. Not a big improvement, but still something. The function "array_merge_recursive2" is called 26 times. Maybe in some places we should use a simple "array_merge" instead. Of course, when parsing an empty template, file operations use most of the script runtime: 4 calls of PreloadConfigFile take 163 milliseconds.
|When emulating memcache functionality we could use the "SHOW TABLE STATUS" query (Data_length column) to get the disk space used by the table. This way we could remove variables without expiration when there is no room for storing new variables (the amount of room is defined via a configuration variable).|
Regarding the inability to cache a whole page because of user-specific content: we can define blocks in the main template (this won't work for includes) that won't be cached, while the other parts of the page will be cached.
A template compiled this way will be mostly HTML, but the non-cacheable parts will be present as PHP code. The only limitation of this idea is that non-cacheable parts can only be located in the main template; it won't work for included templates, because it would be really hard to trace which included template has non-cacheable parts and how they should appear in the resulting template.
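As a rough illustration of a template that is mostly static HTML with a few live blocks, here is a sketch in Python. In the real engine the dynamic parts would be PHP code; the compile_template and render names and the tuple format are invented for this example.

```python
def compile_template(parts):
    """parts: list of ("static", html) or ("dynamic", callable) tuples.

    Adjacent static parts are merged once, at compile (cache) time.
    """
    compiled = []
    for kind, value in parts:
        if kind == "static" and compiled and compiled[-1][0] == "static":
            compiled[-1] = ("static", compiled[-1][1] + value)
        else:
            compiled.append((kind, value))
    return compiled


def render(compiled, context):
    # Only the dynamic parts run code on each request.
    return "".join(
        value if kind == "static" else value(context)
        for kind, value in compiled
    )
```

The compiled list is what would be stored in the cache; per request only render runs, and only the blocks marked as dynamic are evaluated for the current user.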
Another interesting idea: we cache 100% for guests, but for logged-in users we cache the common part of the page and load the user-specific part later via ajax.
This is not suitable for sites with a lot of user-dependent content. But we could combine all such content into a single ajax request and return it in JSON format.
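A minimal sketch of the single ajax response, in Python for illustration. The block names and the user_blocks_json function are hypothetical; the only point is that all user-specific blocks travel in one JSON document.

```python
import json


def user_blocks_json(user, providers):
    """Render every user-specific block and return them in one JSON document.

    providers: mapping of block name -> function(user) returning an HTML string.
    """
    return json.dumps({name: fn(user) for name, fn in providers.items()})
```

On the page side, one request would fetch this document and each value would be placed into the placeholder with the matching block name.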
|2010-08-31 14:19||alex||version||=> 5.1.0|
|2010-03-02 17:57||Dmitry||Relationship added||parent of 0000592|
|2010-03-01 14:20||Dmitry||Relationship added||parent of 0000588|
|2010-02-11 14:34||Dmitry||Relationship added||parent of 0000107|
|2010-01-12 10:54||alex||Status||active => needs work|
|2010-01-12 10:54||alex||Target Version||5.1.0 => Icebox|
|2009-10-07 06:41||alex||Note Added: 0000829|
|2009-10-03 07:52||administrator||Status||reviewed and tested => active|
|2009-10-03 07:46||administrator||Priority||@60@ => normal|
|2009-09-24 15:12||alex||Note Added: 0000543|
|2009-09-23 11:43||alex||Note Added: 0000514|
|2009-09-16 17:58||Dmitry||Assigned To||andrew => alex|
|2009-09-16 09:07||alex||Note Added: 0000453|
|2009-09-16 00:57||Dmitry||Assigned To||=> andrew|
|2009-09-16 00:57||Dmitry||Status||active => reviewed and tested|
|2009-09-16 00:56||Dmitry||New Issue|