Getting Started With CUDA C on an Nvidia Jetson: A Meaningful Algorithm

Mohammed Billoo●May 11, 2024

In the previous blog post, Getting Started With CUDA C on an Nvidia Jetson: GPU Architecture, I described how the underlying architecture of a GPU makes it ideal for graphics and image processing. In an earlier blog post, Getting Started With CUDA C on an Nvidia Jetson: Hello CUDA World!, we implemented a simple array addition algorithm on both a GPU and a CPU. Unfortunately, we observed no meaningful performance advantage of the GPU implementation over the CPU implementation. In this blog post, we will learn about and implement an algorithm that is better suited to a GPU, and observe a meaningful performance improvement over a naive CPU implementation.

Specifically, we will implement an algorithm to "blur" a large image (1000 pixels by 1000 pixels), such as the one below. (Note: the image below has been scaled down, but if you're interested in the original image, please reach out to me at mab@mab-labs.com.)


"Blurring" an image simply involves replacing each pixel with the average of the pixels around it, which makes the image less sharp. The output of our implementation should result in the following image:

The following steps outline the process for blurring an image:

Select an appropriate box size over which to calculate the average.
Center the box on a pixel, starting at the top left of the image.
Sum the pixels that lie inside both the image and the box (the overlap is important because, near the edges, part of the box falls outside the image).
Divide the sum by the number of pixels that were summed.
Write the calculated average to the corresponding pixel of the output image, then repeat for every other pixel.

The following snippet shows the GPU kernel implementation:

1. __global__ void blur_image(unsigned char *in, unsigned char *out, int w, int h)
2. {
3.     int col = blockIdx.x * blockDim.x + threadIdx.x;
4.     int row = blockIdx.y * blockDim.y + threadIdx.y;
5.
6.     if (col < w && row < h) {
7.         int pix_val = 0;  // running sum of the pixel values inside the box
8.         int pixels = 0;   // number of box pixels that fall inside the image
9.
10.        for (int blur_row = -BOX_SIZE; blur_row < BOX_SIZE + 1; ++blur_row) {
11.            for (int blur_col = -BOX_SIZE; blur_col < BOX_SIZE + 1; ++blur_col) {
12.                int current_row = row + blur_row;
13.                int current_col = col + blur_col;
14.
15.                if (current_row > -1 && current_row < h && current_col > -1 && current_col < w) {
16.                    pix_val += in[current_row * w + current_col];
17.                    pixels++;
18.                }
19.            }
20.        }
21.        out[row * w + col] = (unsigned char) (pix_val / pixels);
22.    }
23. }

The following bullet points outline relevant portions of the implementation:

Lines 3 and 4 calculate the column and row of the pixel that the particular CUDA thread is responsible for. Recall from the previous blog posts that the image needs to be stored in memory as a flat array, one row after another, to simplify this calculation.
Line 6 ensures the CUDA thread operates on a pixel within the image. This check is needed because the CUDA framework simply executes the kernel function across the number of threads specified in the kernel invocation, and that number can exceed the number of pixels in the image.
The for loops on lines 10 and 11 iterate over the pixels in the imaginary box over which we calculate the average.
Lines 12 and 13 calculate the image row and column of the current pixel inside this imaginary box.
The if statement in line 15 ensures that the pixel in the imaginary box is inside the actual image.
Lines 16 and 17 add the pixel's value to the running sum and increment the pixel count.
Finally, line 21 calculates the average pixel value and places it in the output array, representing the blurred image.
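
To see why the check on line 6 matters, the sketch below works through the launch arithmetic in plain C. The 16 by 16 block size (BLOCK_DIM) and the helper names blocks_to_cover and surplus_threads are assumptions made for illustration; the article does not state the launch configuration that was actually used.

```c
#include <assert.h>

#define BLOCK_DIM 16  /* assumed threads per block along each axis; not from the article */

/* Number of blocks needed to cover `pixels` along one axis (ceiling division). */
static int blocks_to_cover(int pixels)
{
    assert(pixels > 0);
    return (pixels + BLOCK_DIM - 1) / BLOCK_DIM;
}

/* Threads launched beyond the image edge along one axis.
   These are the threads that the check on line 6 of the kernel turns away. */
static int surplus_threads(int pixels)
{
    return blocks_to_cover(pixels) * BLOCK_DIM - pixels;
}

/* In CUDA host code, these values would feed the kernel launch, for example:
 *
 *   dim3 block(BLOCK_DIM, BLOCK_DIM);
 *   dim3 grid(blocks_to_cover(w), blocks_to_cover(h));
 *   blur_image<<<grid, block>>>(in, out, w, h);
 */
```

Under these assumptions, a 1000-pixel axis needs 63 blocks, which launches 1008 threads along that axis, 8 more than there are pixels. Without the check on line 6, those extra threads would read and write outside the image buffers.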

(Note: The CPU implementation and logic to consume an image in JPEG or PNG format are exercises left to the reader.)
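
For readers who want a starting point for the CPU exercise, here is a minimal sketch in plain C. It is not the CPU implementation that was timed for this article. The function name blur_image_cpu and the BOX_SIZE value of 1 (a 3 by 3 box) are assumptions for illustration. It follows the same per-pixel steps as the kernel, except that two outer loops visit every pixel in turn instead of one GPU thread per pixel.

```c
#include <assert.h>

#define BOX_SIZE 1  /* assumed: box extends 1 pixel on every side (3x3); value not given in the article */

/* Hypothetical CPU counterpart of the blur_image kernel.
   `in` and `out` are w*h grayscale images stored row after row. */
static void blur_image_cpu(const unsigned char *in, unsigned char *out, int w, int h)
{
    assert(in != 0 && out != 0 && w > 0 && h > 0);

    for (int row = 0; row < h; ++row) {
        for (int col = 0; col < w; ++col) {
            int pix_val = 0;  /* running sum of pixel values inside the box */
            int pixels = 0;   /* number of box pixels that fall inside the image */

            for (int blur_row = -BOX_SIZE; blur_row <= BOX_SIZE; ++blur_row) {
                for (int blur_col = -BOX_SIZE; blur_col <= BOX_SIZE; ++blur_col) {
                    int current_row = row + blur_row;
                    int current_col = col + blur_col;

                    /* Skip box positions that fall outside the image. */
                    if (current_row >= 0 && current_row < h &&
                        current_col >= 0 && current_col < w) {
                        pix_val += in[current_row * w + current_col];
                        pixels++;
                    }
                }
            }

            /* Average over the pixels actually summed, not the full box size. */
            out[row * w + col] = (unsigned char)(pix_val / pixels);
        }
    }
}
```

For a 3 by 3 image that is zero everywhere except for a center value of 90, this sketch produces 10 at the center (90 / 9), 15 on each edge (90 / 6), and 22 at each corner (90 / 4, truncated). The edge and corner values show the effect of dividing by the number of pixels inside the image rather than by the full box size.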

If we compare the performance of the GPU-based implementation against a sample CPU-based implementation on an image 1000 pixels by 1000 pixels large, we notice that the GPU-based implementation is 1000 times faster! Moreover, the larger the image, the greater the advantage of the GPU-based implementation, since more operations are performed on the GPU itself.
