Analyzing the Impact of Cache Access Patterns on System Performance

Table 1. Performance comparison between naive and optimized matrix multiplication

Matrix size	100	500	1000	1500	2000	2500	3000
Naive algorithm execution time	0.005	0.622	5.177	25.763	51.578	116.024	193.515
Optimized algorithm execution time	0.004	0.384	3.070	12.480	22.462	49.304	82.696
Speedup	1.337	1.620	1.686	2.064	2.296	2.353	2.340

Speedup definition: speedup = execution time before optimization / execution time after optimization.

The higher the speedup, the more effective the optimization.
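As a worked check of the definition against Table 1, the size-2000 column gives 51.578 / 22.462 ≈ 2.296, which matches the listed speedup. The helper below is a minimal sketch of that calculation; the name speedup is illustrative and does not come from the original experiment code. The size-100 entry (1.337) cannot be reproduced from the rounded times 0.005 and 0.004, presumably because it was computed from unrounded measurements.

```cpp
#include <cassert>

// Speedup as defined above: time before optimization / time after optimization.
// `speedup` is an illustrative helper name, not part of the original code.
double speedup(double time_before, double time_after) {
    assert(time_after > 0.0);  // a zero or negative time would make the ratio meaningless
    return time_before / time_after;
}
```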

Analysis:

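The naive matrix multiplication algorithm computes each element of the result matrix c by iterating over its rows and columns. With this access pattern, matrix a is read with stride 1 and therefore has good spatial locality: consecutive accesses touch adjacent memory addresses, which favors cache hits.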

Matrix b, however, is read with a stride of size (one full row), so consecutive accesses are far apart in memory and the cache hit rate is low.
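The two strides described above can be seen directly in a sketch of the naive loop nest. This is a minimal illustration, not the code used for the measurements: it assumes square n x n matrices of double stored in row-major order in a flat std::vector, and the names matmul_ijk and n are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive i-j-k multiplication of two row-major n x n matrices (illustrative sketch).
std::vector<double> matmul_ijk(const std::vector<double>& a,
                               const std::vector<double>& b,
                               std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    std::vector<double> c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                // a[i*n + k]: k advances by 1 element -> stride 1
                // b[k*n + j]: k advances by n elements -> stride n (a full row)
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
    return c;
}
```

In the innermost loop, every step of k moves n elements forward in b, so for large n each read of b is likely to land on a different cache line.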

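To improve cache behavior, the access pattern can be changed. Instead of computing one element of c at a time, iterate over each element of matrix a and accumulate its contribution into the corresponding positions of the result matrix c. Matrix b is then read contiguously, i.e. with stride 1. This raises the cache hit rate for b and noticeably improves performance (up to roughly 2.3x in Table 1).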
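The reordered access pattern can be sketched as an i-k-j loop nest. As before, this is a minimal illustration under assumptions rather than the measured code: row-major n x n matrices of double in a flat std::vector, with the hypothetical names matmul_ikj and n.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// i-k-j multiplication of two row-major n x n matrices (illustrative sketch).
// Each element a[i][k] is read once and its contribution is accumulated into row i of c.
std::vector<double> matmul_ikj(const std::vector<double>& a,
                               const std::vector<double>& b,
                               std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    std::vector<double> c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const double aik = a[i * n + k];  // fixed during the inner loop
            for (std::size_t j = 0; j < n; ++j) {
                // b[k*n + j] and c[i*n + j]: j advances by 1 element -> stride 1
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
    return c;
}
```

The arithmetic is the same as in the naive version; only the loop order changes, so that both b and c are walked along rows in the innermost loop.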

The matrix multiplication example shows that the memory access pattern has a significant effect on cache performance.

Test code for the cache hierarchy and the L1 cache line size

#include <iostream>
#include <vector>
#include <random>
#include <chrono>
#include <string.h> // memset

std::random_device rd; // seed source for the random number generator
std::mt19937 gen(rd());

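// Working-set sizes in KB (main() multiplies by 1024 before calling test_cache).
std::vector<unsigned int> sizes{8, 16, 32, 64, 128, 256, 384, 512, 768, 1024, 1536, 2048, 3072, 4096, 5120, 6144, 7168, 8192, 10240, 12288, 16384};
// Access strides in bytes, used by test_cache_line.
std::vector<int> strides{ 1,2,4,8,16,32,64,96,128,192,256,512,1024,1536,2048 };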

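// Time 2^26 random reads over a buffer of `size` bytes. The time is expected
// to rise once the buffer no longer fits in a given cache level.
void test_cache(int size){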
    int n = size / sizeof(char);
    char *arr = new char[n];  // allocate the buffer
    memset(arr, 1, sizeof(char) * n); // initialize every byte to 1
    std::uniform_int_distribution<int> num(0, n - 1); // random index in [0, n - 1]
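    std::vector<int> position;
    int cnt = 1 << 26;
    position.reserve(cnt); // avoid repeated reallocation while filling
    // Generate the random indices up front so that the timed loop below
    // does not include the cost of the random number generator.
    for (int i = 0; i < cnt; i++){
        position.push_back(num(gen));
    }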
    int sum = 0;
    std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < cnt; i++){
        sum += arr[position[i]]; // random access into the buffer
    }
    std::chrono::high_resolution_clock::time_point t2 = std::chrono::high_resolution_clock::now();

    std::chrono::duration<double> t = std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1);

    double total = t.count();
    // Printing sum keeps the compiler from optimizing away the timed loop.
    std::cout << "size = " << (size >> 10) << "KB, time = " << total << "s, checksum = " << sum << std::endl;

    delete[] arr;
}

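// Time strided reads over a 64 MB buffer for each stride in `strides`.
// The time is expected to change once the stride reaches the cache line size.
void test_cache_line(){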
    unsigned int size = (1 << 26);
    int n = size / sizeof(char);
    char *arr = new char[n];
    memset(arr, 1, n * sizeof(char));

    for (auto s : strides){
        int sum = 0;

        std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();
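        // Each pass reads one byte every s bytes. Repeating the pass s times
        // keeps the total number of reads at about n for every stride.
        for(int i = 0; i < s; i++){
            for(int j = 0; j < n; j += s){
                sum += arr[j];
            }
        }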
        std::chrono::high_resolution_clock::time_point t2 = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> t = std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1);
        double total = t.count();
        // Printing sum keeps the compiler from optimizing away the timed loop.
        std::cout << "stride = " << s << "Byte, time = " << total << "s, checksum = " << sum << std::endl;
    }
    delete[] arr;
}
int main(){
    for (auto s : sizes){
        test_cache(s * 1024);
    }
    test_cache_line();
    return 0;
}