02_Photo Object Detection

Code Summary: Tiled YOLO Object Detection on Large Images

Here’s the deal, I’ve been taking photos for nearly 10 years. Amounting for just over 30k photos from various sources (Iphone, Sony, Nikon, Etc). Wouldn’t it be interesting to tag these photos to create a searchable repo for what’s being captured? Maybe, here we go.

This script performs object detection on a high-resolution image by splitting it into tiles, running YOLO inference on each tile, and merging the results back into a single set of detections.

Overview

Uses Ultralytics YOLO models (yolo11n and yolo11x) for object detection. (Your mileage my vary, but when the largest yolo model is only 100 mbs, why not use it. We arent pushing to prod.)
Handles large images by tiling them into fixed-size patches. (This is important due to YOLO resizing the images to 640x640, That majority of my photos were taken with full frame camera, thus much larger.)
Reassembles detections into original image coordinates.
Removes duplicate detections using Non-Maximum Suppression (NMS).
Visualizes the final detections on the original image.

Key Steps

1. Model Initialization

Loads two YOLO models:
- yolo11n.pt (lightweight)
- yolo11x.pt (high-accuracy, used for inference)
Uses the Ultralytics YOLO API for prediction.

2. Image Preparation

Loads a JPEG image from disk. (There are other file formats, preprocessing for that is beyond the scope of what im trying to show here. Maybe for the next post.)
Converts it to RGB and rotates it 90 degrees for correct orientation. (When opened, PIL auto assigns an orientation, this impacts the quality of the object detection. IE., boats are typically sitting on the water.)

3. Image Tiling

Splits the large image into fixed-size tiles (default: 640×640 pixels).
Supports configurable overlap between tiles (set to 0 here). (When tiling, you are left with alot of edge images, slicing objects and adding little value. When you photograph something, it tends to be in the middle, duh.)
Stores each tile along with its (x, y) offset in the original image.

4. Tile-Based Inference

Runs YOLO predictions independently on each tile.
Extracts bounding boxes, confidence scores, and class IDs.
Translates tile-relative bounding box coordinates back to original image coordinates.

5. Cross-Tile Non-Maximum Suppression

Combines all detections from every tile.
Applies class-aware NMS using torchvision.ops.nms to remove duplicate detections caused by tiling.
Uses an IoU threshold of 0.5.

6. Confidence Filtering

Filters final detections by confidence score (keeps detections with confidence > 0.50). (Adjustable, probably should be 80%.)

7. Visualization

Draws bounding boxes and class labels on the original image using OpenCV.
Displays the annotated image using PIL.

Output

A single annotated image showing detected objects across the full original image.
Detections are:
- Merged across tiles
- De-duplicated
- Filtered by confidence

Use Case

This approach is well-suited for:

High-resolution photography
Allows collection of image objects
Any image too large for direct YOLO inference at native resolution

It preserves detection accuracy while avoiding GPU memory constraints.

Lets do some imports.

from ultralytics import YOLO
from PIL import Image
import numpy as np
import torch
import cv2


model_n = YOLO("yolo11n.pt") # Small
model_x = YOLO("yolo11x.pt") # XL
path = 'E:\\8_Life\\2025_08_14_France\\DSC00304.JPG'

Functions applied.

def tile_image(image: Image.Image, tile_size: int, overlap: int):
    """Splits an image into overlapping tiles."""
    img_width, img_height = image.size
    tiles = []
    
    # Iterate over the image with a stride (tile_size - overlap)
    stride = tile_size - overlap
    for y in range(0, img_height, stride):
        for x in range(0, img_width, stride):
            # Ensure the tile doesn't go out of bounds
            # Adjust coordinates for the last tiles to fit within the image
            left = x
            top = y
            right = min(x + tile_size, img_width)
            bottom = min(y + tile_size, img_height)

            # If the adjusted tile is smaller than the required size,
            # you might want to adjust its start coordinates to maintain the size
            # For simplicity here we just use the bounds
            if right - left < tile_size:
                left = img_width - tile_size
            if bottom - top < tile_size:
                top = img_height - tile_size
            
            # Crop the tile
            tile = image.crop((left, top, right, bottom))
            tiles.append({'tile': tile, 'coords': (left, top)})
            
            # Stop if reached the end of the row
            if x + tile_size >= img_width:
                break
        # Stop if reached the end of the column
        if y + tile_size >= img_height:
            break
            
    return tiles


def predict_on_tiles(model, tiles):
    """Runs predictions on image tiles and records original coordinates."""
    all_detections = []
    for tile_info in tiles:
        tile_img = tile_info['tile']
        left, top = tile_info['coords']
        
        # Run inference
        # Ultralytics predict function returns a list of Results objects
        results_list = model.predict(source=tile_img, verbose=False)
        
        for results in results_list:
            boxes = results.boxes.xyxy.cpu().numpy()
            confs = results.boxes.conf.cpu().numpy()
            clss = results.boxes.cls.cpu().numpy()

            # Rescale bounding box coordinates to the original image
            for box, conf, cls in zip(boxes, confs, clss):
                x1, y1, x2, y2 = box
                original_x1 = x1 + left
                original_y1 = y1 + top
                original_x2 = x2 + left
                original_y2 = y2 + top
                all_detections.append([original_x1, original_y1, original_x2, original_y2, conf, cls])
                
    return np.array(all_detections)


# 3. Remove Duplicates using Non-Maximum Suppression (NMS)
def apply_nms(detections, iou_threshold=0.5):
    """Applies Non-Maximum Suppression to filter duplicate detections."""
    if len(detections) == 0:
        return []

    # Convert to a format suitable for NMS (xyxy, confidence, class)
    boxes = detections[:, :4]
    scores = detections[:, 4]
    classes = detections[:, 5]
    
    # Convert numpy arrays to torch tensors
    boxes_tensor = torch.from_numpy(boxes)
    scores_tensor = torch.from_numpy(scores)
    classes_tensor = torch.from_numpy(classes)

    # We need class-aware NMS or apply NMS per class
    unique_classes = torch.unique(classes_tensor)
    keep_indices = []

    for cls in unique_classes:
        cls_indices = (classes_tensor == cls).nonzero(as_tuple=False).flatten()
        cls_boxes = boxes_tensor[cls_indices]
        cls_scores = scores_tensor[cls_indices]
        
        # Apply NMS
        keep = torch.ops.torchvision.nms(cls_boxes, cls_scores, iou_threshold)
        keep_indices.append(cls_indices[keep])
    
    keep_indices = torch.cat(keep_indices, dim=0)
    
    # Return filtered detections
    return detections[keep_indices.cpu().numpy()]


large_image = Image.open(path).convert("RGB").rotate(90)
TILE_SIZE = 640  # Standard YOLO input size
OVERLAP = 0 # No overlap

Fancy print statements.

print("Tiling image...")
tiles = tile_image(large_image, TILE_SIZE, OVERLAP)
print(f"Generated {len(tiles)} tiles.")
print("Running predictions on tiles...")
all_detections = predict_on_tiles(model=model_x, tiles=tiles)
print(f"Found {len(all_detections)} total detections before NMS.")
print("Applying Non-Maximum Suppression...")
final_detections = apply_nms(all_detections, iou_threshold=0.50)
print(f"Found {len(final_detections)} unique detections after NMS.")

Tiling image...
Generated 35 tiles.
Running predictions on tiles...
Found 34 total detections before NMS.
Applying Non-Maximum Suppression...
Found 29 unique detections after NMS.

# Filter Detections:
filtered_arr = final_detections[final_detections[:, 4] > 0.50]

img_cv2 = cv2.cvtColor(np.array(large_image), cv2.COLOR_RGB2BGR)
for det in filtered_arr:
    x1, y1, x2, y2 = map(int, det[:4])
    conf, cls = det[4], det[5]
    label = f"{model_x.names[cls]}: {conf:.2f}"
    cv2.rectangle(img_cv2, (x1, y1), (x2, y2), (0, 255, 0), 2)
    cv2.putText(img_cv2, label, (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

Image.fromarray(img_cv2, 'RGB').show()

Lets take a look!

Output Image

TODO

~~Remove extra code~~
~~Split up code blocks~~
~~Personalize description~~
Spell check
Looks like YOLO doesn’t detect flags.

Home

Last Updated: 2026-01-27