Seam carving is a technique
for content-aware resizing, which changes the size of an image while
attempting to preserve the interesting content of the image. The basic
idea behind the technique is to remove low-energy seams: paths of pixels
that are similar to their surroundings. A removed seam can be either vertical or
horizontal, to reduce either the width or height of an image.
For example, heres a panorama I took a few years ago near Spruce Knob, WV.
The image is super wide, but we can carve it down.
Heres an animation of the process, generated by the
seam-carver I implemented recently
as a parallel benchmark for
mpl.
Getting real parallel speedups in this implementation turned out to be an
interesting challenge. For typical images, which are fairly small,
I found that the most obvious approach for parallelizing the algorithm was not
able to extract enough parallelism, because it required too small of a
grain size.
So, I designed a new way of parallelizing the algorithm which works much
better on small images. This post presents some of the details.
The Seam Carving Algorithm
The idea behind seam-carving is to consider all possible seams in an image,
select the one which is least important, and remove it. A seam is a path of
pixels that cuts across an image, touching one pixel in each row (or column)
along the way. When we remove a seam, we reduce either the width or height
of the image by 1. We can repeat this process as many times as desired to
shrink the image.
Well base the importance of seams on a notion of pixel energy, which
can be defined in
many different ways.
For seam-carving, the
color difference gradient
seems to work well, which measures how different a pixels color is in
comparison to its surroundings. The energy of a seam is just the
cumulative energies of all pixels touched by the seam. With this setup,
the seam with minimum energy should be unimportant and safe to remove from
an image.
To compute minimum seam energies, we can use a simple
dynamic programming
algorithm. For simplicity here, lets focus on vertical seams. Let
\(M(i,j)\) be the minimum energy of
any partial seam that ends at row \(i\) and column \(j\), and let
\(E(i,j)\) be the energy of the pixel at that position. These minimum seam
energies are related by the following equation:
\[M(i,j) = E(i,j) + \min(M(i-1,j-1), M(i-1,j), M(i-1,j+1))\]
A picture might help. For each position, we look at the minimum partial
seams that end above it, pick the smallest, and then add the next pixel.
Once \(M\) has been fully computed, we can find the minimum overall seam by
working backwards, from bottom to top, by walking upwards along the path of
minimum seam energies starting at the smallest in the bottommost row.
Finding the minimum seam in parallel
Row-major order (doesnt work)
The dependencies within the equation for \(M\) appear to offer a lot of
parallelism. Perhaps the most immediately obvious approach is to compute
\(M\) in row-major order,
where first we do all of row 0, and then all of row 1, etc. With this ordering,
the values within a row can be computed entirely in parallel, because each
\(M(i, j)\) only depends on three values \(M(i-1, \ldots)\) from the previous row.
However, theres a major problem in practice with row-major order:
typical images are only at most a few thousand pixels wide, and each pixel
only requires a small amount of work. Thats not enough work (per row)
to parallelize effectively! Well need a
granularity
of at least, say, a thousand pixels to amortize the cost of parallelism itself.
This pretty much rules out using row-major order for typical, small images.
Is there a better to compute \(M\) without reducing the grain size?
Triangular-blocked strips
To extract more parallelism without reducing the grain size, we need a
better way to partition up the work. This is something that probably
could be done automatically (e.g. take a look at
Rezaul Chowdhurys
work), but here well just try to design something ourselves.
What are all of the dependencies for a single value \(M(i,j)\)? Visually,
the dependencies form a triangle with the pointy-end pointing down:
We can use triangles like these to split up the work in a nice way. First,
imagine grouping adjacent rows into strips. Within one strip,
cover the top of the strip with a bunch downward-pointing triangles. The
leftover space will look like a bunch of upward-pointing triangles, each
of which has all of its dependencies already satisfied (this is not immediately
obvious but drawing a picture will help). So, we
can process one strip in two parallel rounds, where first we do a bunch of
downward-pointing triangles in parallel, and then we do a bunch of
upward-pointing triangles to fill in the leftover space.
Heres a picture using triangles that are 6 pixels wide at the base. Note that
if we use triangles of even base-width, then these tile naturally with no
overlap.
How wide should each triangle be in practice? Recall that were shooting for a
target granularity of maybe one or two thousand pixels. This corresponds to
triangles with base-widths of somewhere between 64 (1056 pixels in the triangle)
and 90 (2070 pixels in the triangle).
So, in a smallish image, e.g. 1000 pixels wide, we can fit
at least 11 triangles horizontally, suggesting a maximum possible
speedup of 11x or more. This is a big improvement over the row-major strategy,
which (using a granularity of 1000 pixels) would not be able to extract any
parallelism from such an image.
Implementation and performance
With mpl, I implemented both the row-major
and triangular strategies to compare their performance. The code for the
triangular strategy is available in the examples/src/seam-carve/ subdirectory
of the mpl repo. Ill leave the row-major
code as an exercise to the reader. (Its not too bad!)
Here are some performance numbers for
removing 10 seams from an image of size approximately 2600×600. I did a little
bit of tuning, and ended up choosing a row-major granularity of 1000 pixels,
and a triangular base-width granularity of 88 (1980 pixels).
Strategy | P = 1 | P = 10 | P = 20 | P = 30 |
---|---|---|---|---|
row-major | 0.66 | 0.38 | 0.38 | 0.38 |
triangular | 0.78 | 0.17 | 0.12 | 0.09 |
On 1 processor (P = 1), the row-major strategy is faster by about 15%, but
on more processors quickly hits a maximum speedup of about 2x, seeing no
additional improvement above 10 processors.
In contrast, the triangular strategy continues to get faster as the number
of processors increases, with self-speedups of about 5x on 10 processors,
6.5x on 20, and 9x on 30.
Despite having a mild amount of overhead on 1 processor, we can see that
the triangular strategy provides significant gains. On higher core counts,
it is as much as 4x faster than the row-major strategy.
Discussion
One might wonder: why are the speedups we got so far away from theoretical
limits? With a base-width of 88, shouldnt we be able to get up to 30x speedup
on an image of width 2600? Well, unfortunately, there are a large number of
barriers to achieving such good speedup in practice: ineffective cache
utilization, memory bandwidth bottlenecks, non-optimal scheduling and
many more.
In the case of seam carving, there are two major limiting factors.
- Seam carving is largely memory-bound, i.e., for each memory access it
only does a small amount of work before the next access. To get around this
bottleneck, we might need better cache utilization, and so we would have to
more carefully lay out the values of \(M(i,j)\) in memory. (For this
implementation, I used a simple flat-array layout, where \(M(i,j)\) lives at
index \(iW + j\). We could instead try storing these values in a
hierarchical structure, indexing first by strip, and then by triangle, to
better improve locality.) - Seam carving has high span, performing a small amount of work in
between each synchronization. For example in the row-major strategy, we use
a barrier between each row, and in the triangular strategy, we use two
barriers per strip. This can cause the scheduler to perform too many thread
migrations. To mitigate this, one thing we could try is to let triangles
overlap slightly, to increase the size of each strip. This trades a small
amount of duplicated work for fewer barrier synchronizations overall, which
could prove to be hugely beneficial.