Detecting Performance Changes

For every new minor version of the project (or for every project release), developers should usually generate a new batch of performance profiles using the same concrete configuration of resource collection (i.e. the same set of collectors and postprocessors run on the same commands). These profiles are then assigned to the minor version to preserve the history of the project's performance. However, every change of the project, and hence every new minor version, can introduce a performance degradation, and manually evaluating whether a degradation has happened is hard.

Perun allows one to automatically check for performance degradation between various minor versions within the history and thus protect the project against potential degradations introduced by new minor versions. One can employ multiple strategies for different configurations of profiles, each suitable for a concrete type of degradation or performance bug. Potential performance changes are then reported for pairs of profiles, together with more precise information, such as the location, the rate, or the confidence of the detected change. This information then helps developers evaluate whether the detected changes are real or spurious. Spurious warnings can naturally happen, since the collection of data is based on dynamic analysis and real runs of the program, both of which can be heavily influenced by the environment or by other aspects, such as higher processor utilization.

The detection of a performance change is always performed between two profiles with the same configuration (i.e. collected by the same collectors, postprocessed using the same postprocessors, and collected for the same combination of command, arguments, and workload). These profiles correspond to some minor version (the so-called target) and one of its predecessors (the so-called baseline). The baseline profile does not necessarily have to come from the direct predecessor (i.e. the old head) of the target minor version; it can be found deeper in the version hierarchy (e.g. in the root of the project or in a minor version from two days ago). During the check of one profile corresponding to the target, we find the nearest baseline profile in the history. For one pair of target and baseline profiles we can then use multiple methods, and each method can report multiple performance changes (such as optimizations and degradations).

[Figure: diff-analysis.svg]
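The pairing of a target profile with its nearest baseline can be illustrated by the following minimal sketch. The data structures and helper names are simplified illustrations (not Perun's internal API): each minor version is represented simply as a list of (configuration, profile) pairs.

from typing import Iterable, Optional

# A collection configuration: (command, arguments, workload, collector, postprocessors)
Config = tuple


def find_baseline(target_config: Config,
                  parent_versions: Iterable[list[tuple[Config, str]]]) -> Optional[str]:
    """Walk the parent minor versions (nearest first) and return the first registered
    profile whose collection configuration matches the target's configuration."""
    for registered_profiles in parent_versions:
        for config, profile in registered_profiles:
            if config == target_config:
                return profile        # the nearest comparable baseline profile
    return None                       # no comparable baseline exists in the history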

Results of Detection

Between a pair of target and baseline profiles one can use multiple methods, each suitable for a specific type of change. Each such method can then yield multiple reports about detected performance changes (some of which can, however, be spurious). Each degradation report can contain the following details:

  1. Type of the change—the overall classification of the performance change, which can be one of the following values, representing both certain and uncertain answers:

    No Change:

    Represents that the performance of the given uniquely identified resource group has not changed and stayed the same (within some bound of error). By default, these changes are not reported in the standard output, but they can be made visible by increasing the verbosity of the command line interface (see Command Line Interface for how to increase the verbosity of the output).

    Total Degradation or Total Optimization:

    Represents an overall program degradation or optimization. The overall degradation or optimization report may actually be further divided into per-binary or per-file reports (e.g., a standalone report for mybin and its library mylib as done by Exclusive Time Outliers).

    Not in Baseline or Not in Target:

    Represents a performance change caused by new or deleted resources, e.g., functions that are newly introduced (resp. newly missing) in the new project version. Reporting these changes is useful, since even a simple function refactoring may introduce a serious performance slowdown or speedup.

    Severe Degradation or Severe Optimization:

    Represents that the performance of a resource group has severely degraded (resp. optimized), i.e., got severely worse (resp. better) with a high confidence. Each report usually also shows the confidence of this result, e.g. by the value of the coefficient of determination (see Regression Analysis), which quantifies how well the prediction or regression models of both versions fit the data.

    Degradation or Optimization:

    Represents that the performance of a resource group has degraded (resp. optimized), i.e., got worse (resp. better) with a fairly high confidence. Each report usually also shows the confidence of this result, e.g. by the value of the coefficient of determination (see Regression Analysis), which quantifies how well the prediction or regression models of both versions fit the data.

    Maybe Degradation or Maybe Optimization:

    Represents a detected performance change which is either unverified or has a low confidence (so the change can be either a false positive or a false negative). This classification allows methods to provide a broader evaluation of performance changes.

    Unknown:

    Represents that the given method could not determine anything at all.

  2. Subtype of the change—a more detailed description of the type of the change, e.g. that the change was in complexity order (the performance model degraded from a linear model to a power model) or in ratio (the average speed degraded two times).

  3. Confidence—an indication of how likely it is that the degradation is real and not spurious or caused by badly collected data. The actual form of the confidence depends on the underlying detection method; e.g. for methods based on Regression Analysis it can correspond to the coefficient of determination, which shows how well the function models fit the actually measured values.

  4. Location—the unique identification of the group of resources, such as the name of the function, the precise chunk of code, or the line in the code.

If the underlying method does not detect any change between two profiles, nothing is reported by default. However, this behaviour can be changed by increasing the verbosity of the output (see Command Line Interface for how to increase the verbosity of the output).

Detection Methods

Currently, we support three simple strategies for the detection of performance changes:

  1. Best Model Order Equality, which is based on the results of Regression Analysis and only checks, for each uniquely identified group of resources, whether the best performance (or prediction) model has changed (considering a lexicographic ordering of model types), e.g. whether the best model changed from linear to quadratic.

  2. Average Amount Threshold, which computes averages as a representation of the performance of each uniquely identified group of resources. Each average of the target is then compared with the average of the baseline, and if their ratio exceeds a certain threshold interval, the method reports the change.

  3. Exclusive Time Outliers, which identifies outliers within the function exclusive time deltas. The outliers are identified using three different statistical techniques, resulting in three different change severity categories based on which technique discovered the outlier.

Refer to Create Your Own Degradation Checker to create your own detection method.

Best Model Order Equality

The Best Model Order Equality method chooses the best model (i.e. the one with the highest coefficient of determination) as the representative of the performance of each group of uniquely identified resources (e.g. those corresponding to the same function). Each pair of baseline and target best models is then compared lexicographically (e.g. the linear model is lexicographically smaller than the quadratic model), and any change in this ordering is reported as either an Optimization or a Degradation if the minimal confidence of the models is above a certain threshold.

  • Detects: Order changes; Optimization and Degradation

  • Confidence: Minimal coefficient of determination of best models of baseline and target minor versions

  • Limitations: Profiles postprocessed by Regression Analysis
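The following minimal sketch illustrates the idea of the comparison; the explicit ordering of model types and the confidence threshold are illustrative stand-ins for the lexicographic ordering and threshold that Perun actually uses.

# Illustrative ordering of model types from "best" to "worst" performance.
MODEL_ORDER = ["constant", "logarithmic", "linear", "quadratic", "power", "exponential"]
CONFIDENCE_THRESHOLD = 0.15   # illustrative minimal coefficient of determination


def check_best_model(baseline_best, target_best):
    """Compare the best models of one uid; each model is a (type, r_square) pair."""
    confidence = min(baseline_best[1], target_best[1])
    if confidence < CONFIDENCE_THRESHOLD:
        return "Unknown", confidence          # models fit the data too poorly to decide
    baseline_rank = MODEL_ORDER.index(baseline_best[0])
    target_rank = MODEL_ORDER.index(target_best[0])
    if target_rank > baseline_rank:
        return "Degradation", confidence      # e.g. linear -> power
    if target_rank < baseline_rank:
        return "Optimization", confidence     # e.g. power -> linear
    return "No Change", confidence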

An example of the output generated by the BMOE method is as follows:

* 1eb3d6: Fix the degradation of search
|    | * 7813e3: Implement new version of search
|   > collected by complexity+regression_analysis for cmd: '$ mybin'
|     > applying 'best_model_order_equality' method
|       - Optimization         at SLList_search(SLList*, int)
|           from: power -> to: linear (with confidence r_square = 0.99)
|
* 7813e3: Implement new version of search
|    | * 503885: Fix minor issues
|   > collected by complexity+regression_analysis for cmd: '$ mybin'
|     > applying 'best_model_order_equality' method
|       - Degradation          at SLList_search(SLList*, int)
|           from: linear -> to: power (with confidence r_square = 0.99)
|
* 503885: Fix minor issues

In the output above, we detected an Optimization between commits 1eb3d6 (target) and 7813e3 (baseline), where the best performance model of the running time of the SLList_search function changed from a power model to a linear one. For the methods based on Regression Analysis we use the coefficient of determination (\(r^2\)) to represent the confidence, and take the minimal coefficient of determination of the target and baseline models as the confidence of the detected change. Since \(r^2\) is close to the value 1.0 (which would mean that the model precisely fits the measured values), the best models fit the data tightly and hence the detected optimization is likely not spurious.

Average Amount Threshold

The Average Amount Threshold method groups all of the resources according to the unique identifier (uid; e.g. the function name) and then computes the averages of resource amounts as performance representatives of the baseline and target profiles. The computed averages are then compared (by division), and according to the set threshold the checker detects either an Optimization or a Degradation (the threshold is a ratio of 2.0 for detecting degradation and 0.5 for detecting optimization, i.e. a two-fold slow-down or speed-up); a minimal sketch of this comparison is shown after the summary below.

  • Detects: Ratio changes; Optimization and Degradation

  • Confidence: None

  • Limitations: None
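As mentioned above, a minimal sketch of this comparison follows; the resource representation is simplified to (uid, amount) pairs, and the 2.0 and 0.5 thresholds follow the ratios mentioned above.

from collections import defaultdict

DEGRADATION_RATIO = 2.0    # target is at least two times slower
OPTIMIZATION_RATIO = 0.5   # target is at least two times faster


def averages_per_uid(resources):
    """Group (uid, amount) resources and compute the average amount for each uid."""
    groups = defaultdict(list)
    for uid, amount in resources:
        groups[uid].append(amount)
    return {uid: sum(amounts) / len(amounts) for uid, amounts in groups.items()}


def average_amount_threshold(baseline_resources, target_resources):
    """Yield (uid, change, baseline average, target average) for significant changes."""
    baseline_avg = averages_per_uid(baseline_resources)
    target_avg = averages_per_uid(target_resources)
    for uid in baseline_avg.keys() & target_avg.keys():
        ratio = target_avg[uid] / baseline_avg[uid]
        if ratio >= DEGRADATION_RATIO:
            yield uid, "Degradation", baseline_avg[uid], target_avg[uid]
        elif ratio <= OPTIMIZATION_RATIO:
            yield uid, "Optimization", baseline_avg[uid], target_avg[uid]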

An example of the output generated by the AAT method is as follows:

* 1eb3d6: Fix the degradation of search
|    | * 7813e3: Implement new version of search
|   > collected by complexity+regression_analysis for cmd: '$ mybin'
|     > applying 'average_amount_threshold' method
|       - Optimization         at SLList_search(SLList*, int)
|           from: 60677.98ms -> to: 135.29ms
|
* 7813e3: Implement new version of search
|    | * 503885: Fix minor issues
|   > collected by complexity+regression_analysis for cmd: '$ mybin'
|     > applying 'average_amount_threshold' method
|       - Degradation          at SLList_search(SLList*, int)
|           from: 156.48ms -> to: 60677.98ms
|
* 503885: Fix minor issues

In the output above, we detected an Optimization between commits 1eb3d6 (target) and 7813e3 (baseline), where the average running time of the SLList_search function changed from roughly one minute to about a hundred milliseconds. For these detected changes we report no confidence at all.

Exclusive Time Outliers

This detection method is based on finding outliers in the deltas of function exclusive (self) times (i.e., function durations without the durations of their callee functions). The Exclusive Time Outliers method does not require any pre-computed models and works on profiles generated by the Tracer collector.

We use three different methods for detecting the outliers:
  1. Modified z-score

  2. IQR multiple

  3. Standard deviation multiple

The outliers identified by the modified z-score are regarded as Severe Optimization or Severe Degradation changes, since they are very distant from the expected values.

The outliers identified by the IQR multiple are regarded as ordinary Degradation or Optimization.

The outliers found by the standard deviation multiple are rather insignificant, thus we report them only as Maybe Degradation or Maybe Optimization.
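The following sketch illustrates how the three techniques can be combined over a list of per-function exclusive time deltas (target minus baseline). The constants (3.5 for the modified z-score, 1.5 for the IQR fences, 2 for the standard deviation multiple) are common rules of thumb, not necessarily the exact values used by Perun.

import statistics


def classify_deltas(deltas):
    """Classify exclusive time deltas by outlier severity (assumes at least a few values)."""
    median = statistics.median(deltas)
    mad = statistics.median(abs(d - median) for d in deltas)   # median absolute deviation
    q1, _, q3 = statistics.quantiles(deltas, n=4)
    iqr = q3 - q1
    mean, stddev = statistics.mean(deltas), statistics.stdev(deltas)

    for delta in deltas:
        mod_z = 0.6745 * (delta - median) / mad if mad else 0.0
        if abs(mod_z) > 3.5:                                   # very distant from the median
            yield delta, "Severe Degradation" if delta > 0 else "Severe Optimization"
        elif delta > q3 + 1.5 * iqr or delta < q1 - 1.5 * iqr:
            yield delta, "Degradation" if delta > 0 else "Optimization"
        elif stddev and abs(delta - mean) > 2 * stddev:
            yield delta, "Maybe Degradation" if delta > 0 else "Maybe Optimization"
        else:
            yield delta, "No Change"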

This method utilizes two configuration values from the perun config:

  • degradation.location_filter: a regex used to filter the checked locations (binaries),

  • degradation.cutoff: a float value that defines the cut-off threshold for the relative degradation rate (the total location exclusive time delta in %).
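For example, the following snippet (with purely illustrative values) restricts the check to locations matching the given regular expression and ignores changes whose relative rate stays below five percent:

degradation:
  location_filter: .*mylib.*
  cutoff: 5.0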

Note that this method has certain limitations stemming from the use of outliers: it might not work properly with certain distributions of delta values. However, we always report the Total Degradation or Total Optimization, so even in such cases the user is informed about the total change and may utilize another, more suitable detection method (e.g., the Average Amount Threshold).

  • Detects: Exclusive time changes; Optimization and Degradation.

  • Confidence: IQR multiple for severe and ordinary changes and StdDev multiple for potential changes.

  • Limitations: Profiles collected by Trace Collector.

An example of the ETO method output:

Python 3.11.0a7

...
at _ctypes.cpython-31:
└ 136.92ms (9.19%): time Total Degradation base: 1353.431 targ: 1490.351
    (with confidence N/A = 0.0)
at _ctypes_callproc:
└ 2.84ms (0.19%): time Degradation base: 589.177 targ: 592.02
    (with confidence IQR multiple = 5.48)
at _ctypes_get_fielddesc:
└ 52.9ms (3.55%): time Severe Degradation base: 76.473 targ: 129.378
    (with confidence IQR multiple = 110.46)
at _ctypes_init_fielddesc:
└ 77.95ms (5.23%): time Not in Baseline base: 0.0 targ: 77.953
    (with confidence IQR multiple = 162.98)
...

10 changes | +--
optimization(+), 3 degradations(-)

In the example above, we detected a Severe Degradation of the function _ctypes_get_fielddesc compared to the previous version profile (v3.10.4). The absolute exclusive time difference is 52.9ms (from 76.473ms to 129.378ms), and the relative difference of 3.55% (i.e., 52.9ms out of the total 1490.351ms of target exclusive time) represents its contribution to the overall slowdown of the program (in this case, the CPython ctypes library). The confidence is reported as an IQR multiple of 110.46.

Fast Check

This module contains the method for detecting performance changes using regression analysis.

The method classifies the performance change between two profiles according to the metrics and models computed from these profiles, based on regression analysis.

Linear Regression

This module contains the method for detecting performance changes using linear regression.

The method classifies the performance change between two profiles according to the metrics and models computed from these profiles, based on linear regression.

Polynomial Regression

This module contains the method for detecting performance changes using polynomial regression.

The method classifies the performance change between two profiles according to the metrics and models computed from these profiles, based on polynomial regression.

Configuring Degradation Detection

We apply concrete methods of performance change detection to concrete pairs of profiles according to rules specified over the profile collection configuration. By configuration we mean the tuple (command, arguments, workload, collector, postprocessors), which represents how the data were collected for the given minor version. This way, for each new version of the project, it is meaningful to collect new data using the same configuration and then compare the results. The actual rules are specified in configuration files under degradation.strategies. The strategies are specified as an ordered list, and all of the applicable rules are collected through all of the configurations (starting from the runtime configuration, through the local ones, up to the global configuration). This yields a list of rules (each rule represented as a key-value dictionary) ordered by the priority of their application. For each pair of tested profiles, we then iterate through this ordered list and either apply the first rule that is applicable (when the degradation.apply key is set to first) or apply all applicable rules (when the degradation.apply key is set to all).

The example of configuration snippet that sets rules and strategies for one project can be as follows:

degradation:
  apply: first
  strategies:
    - type: mixed
      postprocessor: regression_analysis
      method: bmoe
    - cmd: mybin
      type: memory
      method: bmoe
    - method: aat

This list of strategies will first try to apply the Best Model Order Equality method to either mixed profiles postprocessed by Regression Analysis or to memory profiles collected from the command mybin. All other profiles will be checked using the Average Amount Threshold. Note that the applied methods can be specified either by their full name or by a short string formed by taking the first letter of each word of the method's name, so e.g. bmoe stands for Best Model Order Equality.
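A minimal sketch of how such an ordered list of rules can be resolved for one profile configuration follows; it is simplified and, unlike Perun itself, does not merge the runtime, local, and global configurations.

def applicable_methods(profile_config, strategies, apply_mode="first"):
    """Collect the detection methods whose rule keys all match the profile configuration.

    Both profile_config and the rules are plain dictionaries, e.g.
    {"cmd": "mybin", "type": "memory", "postprocessor": "regression_analysis"}.
    """
    methods = []
    for rule in strategies:
        constraints = {key: value for key, value in rule.items() if key != "method"}
        if all(profile_config.get(key) == value for key, value in constraints.items()):
            methods.append(rule["method"])
            if apply_mode == "first":
                break                 # 'first': stop at the first applicable rule
    return methods                    # 'all': every applicable rule is applied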

Create Your Own Degradation Checker

New performance change checkers can be registered within Perun in several steps. The checkers have only small requirements: they have to yield the reports about degradations as instances of DegradationInfo objects, specified as follows:

class perun.utils.structs.DegradationInfo(res: PerformanceChange, loc: str, fb: str, tt: str, t: str = '-', rd: float = 0, ct: str = 'no', cr: float = 0, pi: list[tuple[PerformanceChange, float, float, float]] | None = None, rdr: float = 0.0)[source]

The returned results for performance check methods

Variables:
  • result (PerformanceChange) – result of the performance change; it can be an optimization, a degradation, no change, or a certain type of unknown result

  • type (str) – string representing the type of the degradation, e.g. “order” degradation

  • location (str) – location, where the degradation has happened

  • from_baseline (str) – value or model representing the baseline, i.e. from which the new version was optimized or degraded

  • to_target (str) – value or model representing the target, i.e. to which the new version was optimized or degraded

  • confidence_type (str) – type of the confidence we have in the detected degradation, e.g. r^2

  • confidence_rate (float) – value of the confidence we have in the detected degradation

  • rate_degradation_relative (float) – relative rate of the degradation

to_storage_record() → str[source]

Transforms the degradation info to a storage_record

Returns:

string representation of the degradation as a stored record in the file

You can register your new performance change checker as follows:

  1. Run perun utils create check my_degradation_checker to generate a new module in the perun/check directory with the following structure. The command uses predefined templates for new degradation checkers and creates my_degradation_checker.py according to the supplied command line arguments (see Utility Commands for more information about the interface of the perun utils create command):

    /perun
    |-- /check
        |-- __init__.py
        |-- average_amount_threshold.py
        |-- my_degradation_checker.py
    
  2. Implement the my_degradation_checker.py file, including a module docstring with a brief description of the change check, following this structure (a more complete sketch of such a checker is shown after these steps):

1"""..."""
2
3from perun.utils.structs import DegradationInfo
4
5
6def my_degradation_checker(baseline_profile, target_profile):
7    """..."""
8    yield DegradationInfo("...")
  3. Next, in the __init__.py module register the short string for your new method as follows:

--- /home/runner/work/perun/perun/docs/_static/templates/degradation_init.py
+++ /home/runner/work/perun/perun/docs/_static/templates/degradation_init_new_check.py
@@ -3,6 +3,7 @@
     short_strings = {
         "aat": "average_amount_threshold",
         "bmoe": "best_model_order_equality",
+        "mdc": "my_degradation_checker",
     }
     if strategy in short_strings.keys():
         return short_strings[strategy]
  4. Preferably, verify that the registration did not break anything in Perun, and if you are not using a developer installation, reinstall Perun:

    make test
    make install
    
  5. At this point you can start using your check via perun check head, perun check all, or perun check profiles.

  6. If you think your checker could help others, please consider making a Pull Request.
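Putting the steps together, a complete checker might look roughly like the sketch below. It compares average amounts per uid and is purely illustrative: the way the resources are read from the profiles and the PerformanceChange.Degradation enum member are assumptions, so consult perun.utils.structs and the existing checkers for the actual interfaces.

"""My degradation checker: reports a change when the average amount per uid doubles."""

from perun.utils.structs import DegradationInfo, PerformanceChange


def _averages(profile):
    """Compute the average 'amount' per 'uid'.

    Simplified: assumes the profile can be iterated as resource dictionaries with
    'uid' and 'amount' keys; in Perun, resources are read through the profile query API.
    """
    groups = {}
    for resource in profile:
        groups.setdefault(resource["uid"], []).append(resource["amount"])
    return {uid: sum(values) / len(values) for uid, values in groups.items()}


def my_degradation_checker(baseline_profile, target_profile, **_):
    """Yield a DegradationInfo for every uid whose average amount at least doubled."""
    baseline_avg = _averages(baseline_profile)
    target_avg = _averages(target_profile)
    for uid in baseline_avg.keys() & target_avg.keys():
        ratio = target_avg[uid] / baseline_avg[uid]
        if ratio >= 2.0:
            yield DegradationInfo(
                res=PerformanceChange.Degradation,   # assumed enum member, see structs
                loc=uid,
                fb=f"{baseline_avg[uid]:.2f}",
                tt=f"{target_avg[uid]:.2f}",
                t="ratio",
                rd=ratio,
            )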

Degradation CLI

The Command Line Interface contains a group of commands for running the checks in the current project—perun check head (for running the check for one minor version of the project, e.g. the current head) and perun check all (for iterative application of the degradation check to all minor versions of the project). The first command is mostly meant to be run as a hook after each new commit (naturally, after a successful run of perun run matrix generating the new batch of profiles), while the latter is meant to be used for new projects, after crawling through the whole history of the project and collecting the profiles. Additionally, perun check profiles can be used for an isolated comparison of two standalone profiles (either registered in the index or stored as standalone files).
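For instance, the check can be (illustratively) triggered from a simple post-commit hook that first collects the new batch of profiles and then checks the new head:

    #!/bin/sh
    # .git/hooks/post-commit (illustrative): collect new profiles, then check the new head
    perun run matrix
    perun check head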

perun check head

Checks for changes in performance between the specified minor version (or the current head) and its predecessor minor versions.

The command iterates over all of the registered profiles of the specified minor version (target, e.g. the head) and tries to find the nearest predecessor minor version (baseline) where a profile with the same configuration as the tested target profile exists. When it finds such a pair, it runs the check according to the strategies set in the configuration (see Configuring Degradation Detection or Perun Configuration files).

By default the hash corresponds to the head of the current project.

perun check head [OPTIONS] <hash>

Arguments

<hash>

Optional argument

perun check all

Checks for changes in performance for the specified interval of version history.

The command crawls through the whole history of project versions starting from the specified <hash>, and for all of the registered profiles (corresponding to some target minor version) it tries to find a suitable predecessor profile (corresponding to some baseline minor version) and runs the performance check according to the strategies set in the configuration (see Configuring Degradation Detection or Perun Configuration files).

perun check all [OPTIONS] <hash>

Arguments

<hash>

Optional argument

perun check profiles

Checks for changes in performance between two profiles.

The command checks for changes between two isolated profiles, which can be stored among the pending profiles, registered in the index, or simply stored in the filesystem. For the pair of profiles <baseline> and <target>, the command then runs the performance check according to the strategies set in the configuration (see Configuring Degradation Detection or Perun Configuration files).

<baseline> and <target> profiles will be looked up in the following steps:

  1. If the profile is in the form i@i (i.e., an index tag), then the ith record registered in the index of minor version <hash> will be used.

  2. If the profile is in the form i@p (i.e., a pending tag), then the ith profile stored in .perun/jobs will be used.

  3. The profile is looked up within the index of minor version <hash> for a match. In case the <profile> is registered there, it will be used.

  4. The profile is looked up within the .perun/jobs directory. In case there is a match, the found profile will be used.

  5. Otherwise, the directory is walked for any match. Each found match must be confirmed by the user.

perun check profiles [OPTIONS] <baseline> <target>

Options

-m, --minor <hash>

Will check the index of different minor version <hash> during the profile lookup.

Arguments

<baseline>

Required argument

<target>

Required argument
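As an illustration, the baseline and target can be, for instance, two index tags, or a stored profile file and a pending tag (the concrete tags and file names below are only examples):

    perun check profiles 0@i 1@i
    perun check profiles stored_profile.perf 0@p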