Creating image histogram is slow

B-LechCode commented 2 months ago

Describe the bug When loading a big image, the calculation of the histogram is slow.

To Reproduce Load big image, wait until the histogram shows up

Expected behavior Calculation of histograms is an O(n) operation, it possibly could be implemented in a faster way.

Desktop (please complete the following information):

OS: Linux
[Linux] KDE, Debian 12, X11

**Suggestion: I could elaborate faster options if desired. We should think about type independent implementations based on the idea of variable internal image types

woelper commented 2 months ago

Thanks!

The relevant function can be found here:

https://github.com/woelper/oculante/blob/0ed1ca0a2ac300b6243bdd952eec9957cf161e88/src/utils.rs#L154

If you find any way to speed up the code, I'd be glad to know. Pixels are only iterated over once and sorted per channel. Maybe you have some ideas to improve that!

Also please note that it is not just the histogram that is created - there is also other information in the extended image information, such as exif and fully transparent pixels.

I should also note that image information is processed after the image has loaded. This is to not delay image loading by performing any other unrelated operations - this means the histogram will not be finished at the time the image is loaded.

awxkee commented 1 month ago

This is because you used the HashMap, hash map is very slow and generally do not needed for a histogram except images with 16+ bit depth and also for this bit depth your algorithm probably will fail because your approach will require at least u32::MAX ^ 3 RAM to compute hist.

I see your case is more for 8 bit, probably 16 and this definitely do better by hands because the amount of colors is also determined so (1 << bit_depth) - 1 is count of colors.

So for 8- bit images it is

    let hist_r: [u32; 256] = [0; 256];
    let hist_g: [u32; 256] = [0; 256];
    let hist_b: [u32; 256] = [0; 256];
    for image in rgba_image.as_bytes().chunks_exact(3) {
        hist_r[image[0]] += 1;
        hist_g[image[1]] += 1;
        hist_b[image[2]] += 1;
    }

And for 8+ bit images

    let hist_r: [u32; u16::MAX as usize] = [0; u16::MAX as usize];
    let hist_g: [u32; u16::MAX as usize] = [0; u16::MAX as usize];
    let hist_b: [u32; u16::MAX as usize] = [0; u16::MAX as usize];
    for image in rgba_image.as_bytes().chunks_exact(3) {
        hist_r[image[0]] += 1;
        hist_g[image[1]] += 1;
        hist_b[image[2]] += 1;
    }

Computing a unique count of colors is more tricky, at least you have to check that your HashSet do not grow to max ( 4 GiB ) for 8 bit images, since not every device may support this.

For computing distinct amount of colors there are few options: 1 - you may reduce maximum possible colors to limited set from 8 bit to 6 bit, allocate limited store, and shift every color right by 2, and keep in mind that your pallete all the time less than real.

    const LIMITED_MAX: usize = (1 << 6) * (1 << 6) * ( 1 << 6);
    let distinct_colors: [u32; LIMITED_MAX] = [0; LIMITED_MAX];
    let limited_rgba = (r >> 2 ) | ((g >> 2) << 6) | ((b >> 2) << 12) | ((a >> 2) << 18);

2 - Remove default hasher from hashset ( it is definitely slow ) because you do not need this and the color is the hash itself, and somehow solve issues with probably 4 GiB RAM usage for full hist.

Something like this ones

#[derive(Debug)]
pub struct Hist {
    pub map: HashSet<u32, U32Hasher1>,
}

pub struct U32Hasher1(pub u32);

impl std::hash::Hasher for U32Hasher1 {
    #[inline(always)]
    fn finish(&self) -> u64 {
        u64::from(self.0).wrapping_mul(0x517cc1b727220a95)
    }
    #[inline(always)]
    fn write_u32(&mut self, i: u32) {
        self.0 = i;
    }

    fn write(&mut self, _bytes: &[u8]) {
        unimplemented!()
    }
    fn write_u8(&mut self, _i: u8) {
        unimplemented!()
    }
    fn write_u16(&mut self, _i: u16) {
        unimplemented!()
    }
    fn write_u64(&mut self, _i: u64) {
        unimplemented!()
    }
    fn write_u128(&mut self, _i: u128) {
        unimplemented!()
    }
    fn write_usize(&mut self, _i: usize) {
        unimplemented!()
    }
    fn write_i8(&mut self, _i: i8) {
        unimplemented!()
    }
    fn write_i16(&mut self, _i: i16) {
        unimplemented!()
    }
    fn write_i32(&mut self, _i: i32) {
        unimplemented!()
    }
    fn write_i64(&mut self, _i: i64) {
        unimplemented!()
    }
    fn write_i128(&mut self, _i: i128) {
        unimplemented!()
    }
    fn write_isize(&mut self, _i: isize) {
        unimplemented!()
    }
}

impl std::hash::BuildHasher for U32Hasher1 {
    type Hasher = Self;

    #[inline(always)]
    fn build_hasher(&self) -> Self {
        Self(0)
    }
}

3 - Try implement and use probabilistic filters that answer on question is it in set? like these two Cuckoo and Bloom. Those two might be more effective especially for RAM usage since they are optimized to index data.

4 - Also I might prefer to just approximate amount of permutation using combinatorics from hist, this would be the fastest method and probably the most unprecise

The main performance degradation I think almost definitely in distinct color counting.

B-LechCode commented 1 month ago

Thank you for elaborating! I thought about using direct array access, too. Calculation of a histogram is really fast with this method. For 32 Bit float images I would map it to a histogram with 10 to 16 bit index size - calculation of the bin index would then only add a multiplication.

Regarding distinct colors: In my opinion using lower precision is a good idea. When I have the time to take some measurements, I will have a look at the code to check whether hist or distinct colors is the slower part.

Edit: I did a split on hist and distinct colors and the Results for a 12000 by 6000 pixel image were : 5.09s for histogram and transparent pixels 3.82s for distinct colors (652516)

awxkee commented 1 month ago

So hist easily can be replaced with direct array that will be really fast.

If some branching and lowering precision is acceptable there are possibility to some improvement again. The main photo images and etc even it is RGBA, in fact they are RGB with constant alpha, so, it's possible to effective scan an image if it is with constant alpha almost at copy speed and if it is then use RGB565 which exactly fits into u16::MAX

Here is the scan loop, it is necessary to scan by lines to fail early if alpha is not constant. And this loop easily can be converted into SSE, NEON. However LLVM not able to vectorize it automatically

fn test_image_alpha(image: &[u8], width: u32, height: u32) -> bool {
    if image.len() < 4 {
        return true;
    }
    let first_alpha = image[3];
    let mut sums: u32 = 0;
    let mut offset = 0usize;
    for _ in 0..height {
        let lane = &image[offset..];
        for x in 0..width {
            // x^x always = 0
            sums += first_alpha.bitxor(lane[x as usize*4 + 3]) as u32;
        }
        // we're not care about the result, if alpha is constant it must be 0
        if sums != 0 {
            return false;
        }
        offset += width as usize * 4;
    }
    sums == 0
}

And if alpha is constant then calculating distinct amount is easy.

    let rgb_565 = ((b >> 3) << 11) | ((g >> 2 ) << 5) | (r >> 3);
    const LEVEL: usize = u16::MAX as usize;
    let distinct: [u16; LEVEL] = [0; LEVEL];
    distinct[rgb_565] += 1;

However different path for RGBA still be required

B-LechCode commented 1 month ago

@awxkee As far as I could see, the alpha channel is not considered in distinct color counting, therefore a direct rgb565 calculation could be used.

What do you think of a 256 sized array holding HashSets as a full precision alternative? This could speed things up because of smaller Sub-Sets?

awxkee commented 1 month ago

This is always better to know what exactly stored in full precision.

Actually it may be even easier even than RGB u8 calculation.

For example you can fit CIE LAB into u8 bins sized 100, 184, 200 with precision loss less than 1%, And Jzazbz you almost have no chance to renormalize into u8. So I think if any of your value can be easily renormalized into 0..255, 0..100 with 0-2% precision loss it should be fine.

awxkee commented 1 month ago

If you want to generic full precision hist alternative you could use 16 bit bins size. This will cover most cases. In any way, once you used HashMap/HashSet you're slowing down memory access at least 10 times. Anything with fixed size array will be more practical for this task

B-LechCode commented 1 month ago

Update: I implemented it with the array approach. The loop with histogram and transparent pixels now took 169.15ms. The distinct color runtime dropped about one second just by using u32 instead of the rgba struct.

Will dive deeper soon :)

Edit: By doing: `let mut pix32: Vec<> = img.pixels().map(|p| (p.0[0] as u32) << 16 | (p.0[1] as u32) <<8 | p.0[2] as u32).collect(); pix_32.par_sort_unstable();

cc = 1; { let mut old_val = pix_32[0]; for pix32 in pix_32.iter() { if(pix32 != old_val) { cc += 1;
} old_val = pix32; } }`

The runtime reduced to roughly 700 ms. I know - this can only be applied if we have enough memory. For grayscale there are even simpler options.

awxkee commented 1 month ago

Looks nice! You may try also this one, this probably should be faster however will require 1GiB RAM.

fn scan_colors_flat(image: &[u8]) -> u64 {
    const FIXED_RGB_SIZE: usize = 1 << 24;
    let mut flat_map= vec![0u32; FIXED_RGB_SIZE];

    for chunk in image.chunks_exact(4) {
        let pos = (chunk[0] as usize) << 16 | (chunk[1] as usize) << 8 | chunk[2] as usize;
        flat_map[pos] += 1;
    }

    let mut full_colors = 0u64;
    for &intensity in flat_map.iter() {
        if intensity > 0 {
            full_colors += 1;
        }
    }

    full_colors
}

awxkee commented 1 month ago

And if you wish to try reduced colors there is an option with simply degrading less perceptual blue channel and done same

fn scan_colors_flat(image: &[u8]) -> u64 {
    const FIXED_RGB_SIZE: usize = 1 << 23;
    let mut flat_map= vec![0u32; FIXED_RGB_SIZE];

    for chunk in image.chunks_exact(4) {
        let pos = (chunk[0] as usize) << 15 | (chunk[1] as usize) << 7 | (chunk[2] as usize >> 1);
        flat_map[pos] += 1;
    }

    let mut full_colors = 0u64;
    for &intensity in flat_map.iter() {
        if intensity > 0 {
            full_colors += 1;
        }
    }

    full_colors
}

And even more faster with saturating arithmetics and 16 Mb of ram

fn scan_colors_sat(image: &[u8]) -> u64 {
    const FIXED_RGB_SIZE: usize = 1 << 24;
    let mut flat_map= vec![0u8; FIXED_RGB_SIZE];

    for chunk in image.chunks_exact(4) {
        let pos = (chunk[0] as usize) << 16 | (chunk[1] as usize) << 8 | (chunk[2] as usize);
        flat_map[pos] = flat_map[pos].saturating_add(1);
    }

    let mut full_colors = 0u64;
    for &intensity in flat_map.iter() {
        if intensity > 0 {
            full_colors += 1;
        }
    }

    full_colors
}

awxkee commented 1 month ago

I think saturating method best one at the moment even it is without any limitations. However if needed extending to counting with alpha still is not easy. Perhaps it can be even faster and suitable for rgba if we’ll use binary mask as accumulator :) Instead of u8 calculate binary offset and set bit to one if there are color :)

awxkee commented 1 month ago

Here it is. Quite more ugly and slower however able to count exact with alpha

fn scan_colors_binary(image: &[u8]) -> u32 {
    const FIXED_RGB_SIZE: usize = 1 << 29;
    let mut flat_map = vec![0u8; FIXED_RGB_SIZE];

    for chunk in image.chunks_exact(4) {
        let pos: u32 = u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
        let full = pos >> 3;
        let remainder = pos - full;

        let bit = 1 << remainder;
        flat_map[full as usize] |= bit;
    }

    let mut full_colors = 0u32;
    for intensity in flat_map.chunks_exact(4) {
        let color = u32::from_le_bytes([intensity[0], intensity[1], intensity[2], intensity[3]]);
        full_colors += color.count_ones();
    }

    full_colors
}

However RGB scan is excellent

fn scan_colors_binary_rgb(image: &[u8]) -> u32 {
    const FIXED_RGB_SIZE: usize = 1 << 21;
    let mut flat_map = vec![0u8; FIXED_RGB_SIZE];

    for chunk in image.chunks_exact(4) {
        let pos: u32 = u32::from_le_bytes([chunk[0], chunk[1], chunk[2], 0]);
        let full = pos >> 3;
        let remainder = pos - (full << 3);

        let bit = 1 << remainder;
        flat_map[full as usize] |= bit;
    }

    let mut full_colors = 0u32;
    for intensity in flat_map.chunks_exact(4) {
        let color = u32::from_le_bytes([intensity[0], intensity[1], intensity[2], intensity[3]]);
        full_colors += color.count_ones();
    }

    full_colors
}

B-LechCode commented 1 month ago

That is the way I wanted to do it, too. Unfortunately, it crashes with "SIGSEGV: invalid permissions for mapped object" I passed it with let cc2 = scan_colors_flat(img.as_bytes());

Do you have an idea?

awxkee commented 1 month ago

Have no idea. Maybe ensure that image is RGBA and you have enough memory to allocate ?

   let rgba_image = img.to_rgba8();
    let rgba_bytes = rgba_image.as_bytes();

B-LechCode commented 1 month ago

Fixed it - I declared the LUT as an array, which was too much for the stack.

awxkee commented 1 month ago

Surprised that it signalling protected zone writing instead of stack overflow. This methods except scan_colors_binary_rgb definitely won't fit into stack.

B-LechCode commented 1 month ago

I compared your "scan_colors_flat" with my bitwise implementation "scan_colors_flat2":

fn scan_colors_flat2(img: &RgbaImage) -> u64 { const FIXED_RGB_SIZE: usize = 24; const SUB_INDEX_SIZE: usize = 5; let SUB_INDEX_MASK = 2u32.pow(SUB_INDEX_SIZE as u32)-1; const MAIN_INDEX_SIZE: usize = 1 << (FIXED_RGB_SIZE-SUB_INDEX_SIZE); let mut flat_map = vec![0u32; MAIN_INDEX_SIZE];

for chunk in img.pixels() {
    let pos = (chunk.0[0] as u32) << 16 | (chunk.0[1] as u32) << 8 | chunk.0[2] as u32;
    let pos_main = pos>>SUB_INDEX_SIZE;
    let pos_sub = pos & SUB_INDEX_MASK;
    flat_map[pos_main as usize] = flat_map[pos_main as usize] | 1 << pos_sub;
}

let mut full_colors = 0u64;
for &intensity in flat_map.iter() {
    full_colors += intensity.count_ones() as u64;
}

full_colors

}

The results: Mine: 208.56ms Yours: 296.14ms

With a smaller image (6152x4877): Mine: 72.59ms Yours: 69.82ms

With a even smaller image (913x924): Mine: 3.99ms Yours: 18.76ms

So, speed depends on pixel count, but both options are dramatically faster. I'd stick to the bitwise calculation for saving memory.

PS: How can I do that perfect markup that you use for your code?

awxkee commented 1 month ago

If you wish to test my bitwise implementation this is the ones scan_colors_binary_rgb. scan_colors_flat were just first attempt, good but not the best one. They should be comparable

Here it is. Quite more ugly and slower however able to count exact with alpha

fn scan_colors_binary(image: &[u8]) -> u32 {
    const FIXED_RGB_SIZE: usize = 1 << 29;
    let mut flat_map = vec![0u8; FIXED_RGB_SIZE];

    for chunk in image.chunks_exact(4) {
        let pos: u32 = u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
        let full = pos >> 3;
        let remainder = pos - full;

        let bit = 1 << remainder;
        flat_map[full as usize] |= bit;
    }

    let mut full_colors = 0u32;
    for intensity in flat_map.chunks_exact(4) {
        let color = u32::from_le_bytes([intensity[0], intensity[1], intensity[2], intensity[3]]);
        full_colors += color.count_ones();
    }

    full_colors
}

However RGB scan is excellent

fn scan_colors_binary_rgb(image: &[u8]) -> u32 {
    const FIXED_RGB_SIZE: usize = 1 << 21;
    let mut flat_map = vec![0u8; FIXED_RGB_SIZE];

    for chunk in image.chunks_exact(4) {
        let pos: u32 = u32::from_le_bytes([chunk[0], chunk[1], chunk[2], 0]);
        let full = pos >> 3;
        let remainder = pos - (full << 3);

        let bit = 1 << remainder;
        flat_map[full as usize] |= bit;
    }

    let mut full_colors = 0u32;
    for intensity in flat_map.chunks_exact(4) {
        let color = u32::from_le_bytes([intensity[0], intensity[1], intensity[2], intensity[3]]);
        full_colors += color.count_ones();
    }

    full_colors
}

To do markup just add rust after ``` here is it ```rust

awxkee commented 1 month ago

There is no any plan to count colors in u64. I understood this later, since in 8-bit it is still maximum colors of RGB image is 2^24 then u32 is enough and u64 may suffer on 32-bit systems.

And probably you can do

        let pos: u32 =  chunk.0 & 0x00ffffff;

to determine your pos if color stored as u32. This probably will save some time on LLVM optimization. However it may sometimes get this by itself.

awxkee commented 1 month ago

Here is NEON version of popcnt if needed. However speed increasing not so much.

unsafe fn popcnt_neon(image: *const u8, amount: usize) -> u32 {
    let full_iter = amount >> 6;
    let mut acc = vdupq_n_u32(0);
    for i in 0..full_iter {
        let lanes = vld1q_u8_x4(image.add(i << 6));
        acc = vaddq_u32(acc, vpaddlq_u16(vpaddlq_u8(vcntq_u8(lanes.0))));
        acc = vaddq_u32(acc, vpaddlq_u16(vpaddlq_u8(vcntq_u8(lanes.1))));
        acc = vaddq_u32(acc, vpaddlq_u16(vpaddlq_u8(vcntq_u8(lanes.2))));
        acc = vaddq_u32(acc, vpaddlq_u16(vpaddlq_u8(vcntq_u8(lanes.3))));
    }
    let start = full_iter << 6;
    let rem = amount - start;
    let mut full_colors =
        vget_lane_u64::<0>(vpaddl_u32(vadd_u32(vget_low_u32(acc), vget_high_u32(acc)))) as u32;
    if rem > 0 {
        let safe_slice = slice::from_raw_parts(image.add(start), rem);
        let iterator = safe_slice.chunks_exact(4);
        for intensities in iterator.clone() {
            let color = u32::from_le_bytes([
                intensities[0],
                intensities[1],
                intensities[2],
                intensities[3],
            ]);
            full_colors += color.count_ones();
        }

        for &intensity in iterator.remainder() {
            full_colors += intensity.count_ones();
        }
    }

    full_colors
}

fn scan_colors_binary_rgb(image: &[u8]) -> u32 {
    const FIXED_RGB_SIZE: usize = 1 << 21;
    let mut flat_map = vec![0u8; FIXED_RGB_SIZE];

    for chunk in image.chunks_exact(4) {
        let pos: u32 = u32::from_le_bytes([chunk[0], chunk[1], chunk[2], 0]);
        let full = pos >> 3;
        let remainder = pos - (full << 3);

        let bit = 1 << remainder;
        flat_map[full as usize] |= bit;
    }
    let mut full_colors = unsafe { popcnt_neon(flat_map.as_ptr(), flat_map.len()) };

    full_colors
}

B-LechCode commented 1 month ago

Thank you, @awxkee. I combined your "scan_colors_binary_rgb" function with my implementation and integrated it into the loop which calculates the histogram. It is interesting that masking was slower than shift and substraction for the bit shift.

It takes 224 ms to calculate everything for the mentioned test image.

awxkee commented 1 month ago

Glad to hear it helps. I may only note that for u32 hist if you’re expecting huge images saturating arithmetics will be required since fully white large image will overflow.

awxkee commented 1 month ago

Thank you, @awxkee. I combined your "scan_colors_binary_rgb" function with my implementation and integrated it into the loop which calculates the histogram. It is interesting that masking was slower than shift and substraction for the bit shift.

It takes 224 ms to calculate everything for the mentioned test image.

Perhaps you did not the best benchmarking. LLVM understood idea in full and generates absolutely identical code for both methods.

Here you can check codegen.

B-LechCode commented 1 month ago

Perhaps you did not the best benchmarking. LLVM understood idea in full and generates absolutely identical code for both methods.

Here you can check codegen.

Thanks for checking, then it was just the uncertainty of my benchmark. Nevertheless - our solution works fast :) For future benchmarking, I need to do multiple runs and maybe I shouldn't use a notebook in bed for it.

Glad to hear it helps. I may only note that for u32 hist if you’re expecting huge images saturating arithmetics will be required since fully white large image will overflow.

That's true: Currently the maximum sized image would be sqrt(2^32 -1) -> 65535 x 65535 for the histogram. The Application limits the image size to the maximum texture size, which was always 16k x 16k on the platforms I tested.

I will address that in the Big Images branch, there I'm working on overcoming that limitation, since I often work with images with a width of 8 to 16 times the limit.

woelper commented 1 month ago

Thanks again for your work, it is crazy fast now!

B-LechCode commented 1 month ago

Thanks again for your work, it is crazy fast now!

Thanks, it was a pleasure :)

woelper / oculante

Creating image histogram is slow #416