
Getting Started with gemini-2.5-flash-image: A Practical Guide
A hands-on guide to image generation and compositing with gemini-2.5-flash-image, including a compression trick that cuts your input token cost by around 90%.
I was building a FastAPI microservice for image compositing. The goal was to keep it modular enough that any upstream service could drop it in without much modification. A single endpoint: send a background scene and a product image, get back a composited result. Alongside the engineering work, I had a genuine question I wanted to answer: is generative AI actually good at this kind of task in a real pipeline, or does it only hold up in demos? Image compositing is concrete enough to test properly: either the product looks like it belongs in the scene or it does not.
gemini-2.5-flash-image turned out to be the right tool for the job. It can receive images as input and return a modified image as output, which means compositing and scene-aware placement happen in a single API call with no separate inpainting model. This guide covers what I landed on: how to structure inputs, how to write prompts that produce consistent results, and a compression trick that cuts input token costs significantly.
Why gemini-2.5-flash-image
The traditional approach to AI-assisted image compositing involves multiple steps: a background removal model, a placement heuristic, a color correction pass, a shadow generation function. Each stage introduces its own failure modes and tuning requirements. gemini-2.5-flash-image collapses all of this into a single multimodal generation step — you describe what you want, provide the reference images, and the model handles the rest.
Flash is also the right model tier for this work. It's fast enough for interactive feedback loops, cheap enough to run at catalog scale, and capable enough to understand spatial relationships and lighting from scene images. If you've been using GPT-4o for vision tasks and putting up with the latency and cost, Flash is worth a direct comparison.
gemini-2.5-flash-image supports IMAGE in the response_modalities config - meaning the model can return an image directly, not just describe one. This is what enables image editing and compositing in a single call.
Setup

pip install google-genai Pillow opencv-python

import os
import base64
import cv2
from io import BytesIO
from PIL import Image
from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

Encoding Images for the API
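This guide passes every input image to the API as raw bytes wrapped in types.Part.from_bytes(). Recent versions of the google-genai SDK also accept PIL Image objects in contents, but raw bytes keep the interface uniform when images arrive from other services. The helper reads a file, base64-encodes it, decodes it back to bytes, and builds the part. This is the only image input method used throughout this guide.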

def _encode_image_to_base64(path: str) -> str:
    """Read an image file and return its contents as a base64 string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def _image_part(path: str, mime_type: str = "image/png") -> types.Part:
    """Build an image part from a file. mime_type must match the file's format."""
    b64 = _encode_image_to_base64(path)
    return types.Part.from_bytes(
        data=base64.b64decode(b64),
        mime_type=mime_type,
    )

The encode-then-decode pattern may look redundant, and for a single local file it is. It pays off when images arrive from an upstream service as base64 strings, which is common in API pipelines. Keeping the interface consistent means you can swap the source without changing anything at the call site.
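One wrinkle with that interface: _image_part defaults to image/png, so a JPEG arriving from upstream would be mislabeled unless the caller passes the right type. The following is a minimal sketch of decoding an upstream base64 payload and guessing its MIME type from the leading bytes. It assumes only PNG, JPEG, and WebP inputs, and sniff_image_mime and decode_upstream_image are hypothetical helper names, not part of the SDK.

```python
import base64
from typing import Tuple


def sniff_image_mime(data: bytes) -> str:
    """Guess an image MIME type from its magic bytes (PNG, JPEG, WebP only)."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    # WebP is a RIFF container with the tag "WEBP" at byte offset 8.
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    raise ValueError("unrecognized image format")


def decode_upstream_image(b64_payload: str) -> Tuple[bytes, str]:
    """Decode a base64 string from an upstream service into (bytes, mime_type)."""
    raw = base64.b64decode(b64_payload)
    return raw, sniff_image_mime(raw)
```

The returned pair can be handed to types.Part.from_bytes(data=raw, mime_type=mime) in place of the file-based helper.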
Text to Image
The simplest use case: generate an image from a text description. Specify IMAGE in response_modalities and iterate through the response parts to find the binary output.

response = client.models.generate_content(
    model="gemini-2.5-flash-image",
    contents=(
        "High-end commercial product photography. A luxury glass perfume bottle "
        "centered on a polished white Carrara marble surface. Warm, diffused studio "
        "lighting from camera-left casting a soft gradient shadow to the right. "
        "Photorealistic render. Clean background, no clutter. Shallow depth of field."
    ),
    config=types.GenerateContentConfig(
        response_modalities=["IMAGE", "TEXT"],
    ),
)

# The response can mix text and image parts; write out the image part.
for part in response.candidates[0].content.parts:
    if part.inline_data:
        # inline_data.mime_type reports the actual format (typically image/png),
        # so the file extension should follow it.
        with open("output.png", "wb") as f:
            f.write(part.inline_data.data)

Prompt specificity matters more than length. Naming the surface material (Carrara marble), the light source position (camera-left), and the shadow direction gives the model enough spatial context to produce more consistent results across multiple generations.
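The loop above assumes the response always has a candidate with content and at least one image part. A slightly more defensive sketch is shown here: first_image is a hypothetical helper that walks the same attributes (candidates, content.parts, inline_data) and raises if no image came back. The stand-in response built from SimpleNamespace exists only so the snippet runs without an API call.

```python
from types import SimpleNamespace
from typing import Any, Tuple


def first_image(response: Any) -> Tuple[bytes, str]:
    """Return (data, mime_type) of the first image part, or raise if there is none."""
    for candidate in getattr(response, "candidates", None) or []:
        content = getattr(candidate, "content", None)
        for part in getattr(content, "parts", None) or []:
            inline = getattr(part, "inline_data", None)
            if inline is not None and inline.data:
                # Fall back to PNG only if the part carries no MIME type.
                return inline.data, inline.mime_type or "image/png"
    raise ValueError("response contained no image part")


# Stand-in for a real response: one text part followed by one image part.
fake_response = SimpleNamespace(candidates=[SimpleNamespace(content=SimpleNamespace(parts=[
    SimpleNamespace(inline_data=None, text="Here is your image."),
    SimpleNamespace(inline_data=SimpleNamespace(data=b"\x89PNG-bytes", mime_type="image/png")),
]))])

data, mime = first_image(fake_response)
```

Raising on a missing image keeps an empty or text-only response from silently producing a zero-byte file.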
Image to Image: Compositing
This is where Flash earns its place. Pass a background scene and a product image as separate parts, then describe the compositing task. The model understands spatial context — it identifies surfaces, infers lighting direction, and places the product with environmental awareness rather than just pasting it on top.

response = client.models.generate_content(
    model="gemini-2.5-flash-image",
    contents=[
        _image_part("background.png"),
        _image_part("product.png"),
        (
            "Two inputs are provided: [1] a rendered 3D retail environment, "
            "[2] a product photograph with a transparent background.\n\n"
            "Composite the product into the retail scene with full photorealism:\n"
            "- Place the product on the most contextually appropriate surface "
            "(shelf, counter, table, or floor display)\n"
            "- Scale it proportionally to surrounding objects so it reads as "
            "physically present in the scene\n"
            "- Adapt the product lighting to match the scene - adjust color "
            "temperature, highlight direction, and shadow softness to the "
            "dominant light source\n"
            "- Render a soft contact shadow beneath the product base, shaped "
            "to its footprint\n"
            "- Feather the product edges to eliminate any halo or hard-cutout "
            "artifacts\n"
            "- Preserve every detail of the product label, material, and color "
            "exactly - do not alter the product itself, only its integration "
            "into the scene"
        ),
    ],
    config=types.GenerateContentConfig(
        response_modalities=["IMAGE", "TEXT"],
    ),
)

# Write the returned image part to disk.
with open("composited.png", "wb") as f:
    for part in response.candidates[0].content.parts:
        if part.inline_data:
            f.write(part.inline_data.data)

The numbered references ([1], [2]) tie each instruction to the correct input image, in the order the images appear in contents. The API doesn't require this, but it reduces ambiguity when the model is reasoning about two images with different roles. The explicit constraint on preserving product details matters: without it, the model will occasionally reinterpret the product's shape or color to better match the scene.
That last constraint took me longer to land on than I expected. Early versions of the prompt kept producing results where the product edges looked slightly clipped — not a hard cutout, but off enough that the composite looked pasted rather than placed. I rewrote the instruction several times trying to isolate what was causing it. At one point I was close to scrapping the approach entirely and falling back to an OpenCV inpainting pipeline. What eventually fixed it was the explicit feathering instruction combined with the strict preservation constraint. Removing either one brought the clipping back.
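That kind of trial and error is easier to repeat when the prompt is assembled from a list of rules, so a single instruction can be dropped and the result compared. A minimal sketch follows, using the condensed rule wording from later in this guide; COMPOSITING_RULES and build_prompt are hypothetical names introduced for illustration.

```python
from typing import Collection, Sequence

COMPOSITING_RULES = (
    "Place the product on the most contextually appropriate surface",
    "Scale proportionally to surrounding objects",
    "Match lighting: color temperature, highlight direction, shadow softness",
    "Render a soft contact shadow shaped to the product footprint",
    "Feather edges to eliminate halo artifacts",
    "Preserve product label, material, and color exactly",
)

PROMPT_HEADER = (
    "Two inputs are provided: [1] a rendered 3D retail environment, "
    "[2] a product photograph with a transparent background.\n\n"
    "Composite the product into the retail scene with full photorealism:\n"
)


def build_prompt(rules: Sequence[str], omit: Collection[int] = ()) -> str:
    """Join the header and bullet rules, skipping any rule indices listed in omit."""
    kept = [rule for index, rule in enumerate(rules) if index not in omit]
    return PROMPT_HEADER + "\n".join(f"- {rule}" for rule in kept)


full_prompt = build_prompt(COMPOSITING_RULES)
# Ablation: drop the feathering rule (index 4) to see whether clipping returns.
no_feather_prompt = build_prompt(COMPOSITING_RULES, omit={4})
```

Running the same image pair through both prompts gives a direct side-by-side check of what each instruction contributes.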
Cutting Input Cost with Compression
Gemini bills image input by token count, which grows with pixel area. A 4K background image carries a significant token cost. The model doesn't need that resolution to understand surface context, lighting direction, or available placement zones. Those are semantic properties that read fine at 768px. Downscaling before the API call and upscaling after was the most effective cost reduction I found.

from PIL import Image
from typing import Tuple

def compress(
    path: str,
    max_px: int = 768,
    out: str = "tmp_compressed.png",
) -> Tuple[int, int]:
    """Downscale to max_px on the longest side. Returns original (width, height)."""
    img = Image.open(path)
    original_size = img.size
    # thumbnail() resizes in place, keeps the aspect ratio, and never enlarges.
    img.thumbnail((max_px, max_px), Image.LANCZOS)
    img.save(out, "PNG")
    return original_size

768px is a practical default for compositing tasks: enough for the model to read surface type, light direction, and object scale. Below 512px, the model starts to lose fine detail that it needs to match lighting accurately. For text-to-image generation there is no reference input, so input resolution does not apply.
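To put a rough number on the saving, here is a back-of-the-envelope estimate. It assumes the tiling rule documented for recent Gemini models (a flat 258 tokens for images up to 384px on both sides, otherwise 768x768 tiles at 258 tokens each); the constants and the simple ceiling-based tile count are assumptions, and actual billing for gemini-2.5-flash-image may differ. estimate_image_tokens is a hypothetical helper, and client.models.count_tokens is the way to measure real usage.

```python
# Assumed constants; verify against current Gemini pricing documentation.
TOKENS_PER_TILE = 258
TILE_PX = 768
SMALL_IMAGE_PX = 384


def estimate_image_tokens(width: int, height: int) -> int:
    """Approximate input tokens for one image under the assumed tiling rule."""
    if width <= SMALL_IMAGE_PX and height <= SMALL_IMAGE_PX:
        return TOKENS_PER_TILE
    # Integer ceiling division: number of tiles needed along each axis.
    tiles_x = -(-width // TILE_PX)
    tiles_y = -(-height // TILE_PX)
    return tiles_x * tiles_y * TOKENS_PER_TILE


# A 4K background versus the same image after compress(max_px=768).
before = estimate_image_tokens(3840, 2160)
after = estimate_image_tokens(768, 432)
saving = 1 - after / before
print(f"{before} -> {after} tokens ({saving:.0%} saved)")
```

Under these assumptions the estimate falls from 3870 to 258 tokens for the background, which is the order of saving this guide is built around.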
With both images compressed, the full compositing call becomes:

original_size = compress("background.png", out="bg_small.png")
compress("product.png", out="product_small.png")

response = client.models.generate_content(
    model="gemini-2.5-flash-image",
    contents=[
        _image_part("bg_small.png"),
        _image_part("product_small.png"),
        "... compositing prompt ...",
    ],
    config=types.GenerateContentConfig(
        response_modalities=["IMAGE", "TEXT"],
    ),
)

with open("composited_small.png", "wb") as f:
    for part in response.candidates[0].content.parts:
        if part.inline_data:
            f.write(part.inline_data.data)

Upscaling the Output
The model returns output at its native generation resolution. To restore the original dimensions, resize using OpenCV with INTER_CUBIC interpolation. Cubic upscaling preserves edge contrast better than bilinear on photographic output — this matters here because the composited edges and shadows are the most visually sensitive parts of the result.

def upscale_result(
    src: str,
    original_size: Tuple[int, int],
    dst: str,
) -> None:
    """Upscale back to original resolution using cubic interpolation."""
    img = cv2.imread(src)
    if img is None:
        # cv2.imread returns None instead of raising when the file is missing or unreadable.
        raise FileNotFoundError(f"could not read image: {src}")
    # original_size is (width, height), the order cv2.resize expects for dsize.
    resized = cv2.resize(img, original_size, interpolation=cv2.INTER_CUBIC)
    cv2.imwrite(dst, resized)

upscale_result("composited_small.png", original_size, "composited_final.png")

cv2.resize() expects (width, height), the same order as PIL Image.size. If you derive original_size from a NumPy array (img.shape), the order is (height, width, channels), and you need to swap the first two entries before passing it to resize.
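The ordering pitfall is easy to guard against with a tiny helper. The sketch below also adds a check that is an assumption on my part rather than documented behavior: the model is not guaranteed to return an image with the same aspect ratio as the background, and resizing to the original dimensions would then stretch the result. size_from_shape and aspect_ratio_drift are hypothetical names.

```python
from typing import Sequence, Tuple


def size_from_shape(shape: Sequence[int]) -> Tuple[int, int]:
    """Convert a NumPy image shape (height, width[, channels]) to (width, height)."""
    height, width = shape[0], shape[1]
    return (width, height)


def aspect_ratio_drift(src_size: Tuple[int, int], dst_size: Tuple[int, int]) -> float:
    """Relative difference between the aspect ratios of two (width, height) sizes."""
    src_ratio = src_size[0] / src_size[1]
    dst_ratio = dst_size[0] / dst_size[1]
    return abs(src_ratio - dst_ratio) / dst_ratio


# Shape as reported by img.shape for a 4K BGR image loaded with cv2.imread.
original_size = size_from_shape((2160, 3840, 3))

# A 16:9 output upscales cleanly; a square output would be stretched.
clean = aspect_ratio_drift((1024, 576), original_size)
stretched = aspect_ratio_drift((1024, 1024), original_size)
```

A drift above a few percent is a reasonable signal to crop or pad before resizing, or to flag the result for review.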
Putting It Together
The full pipeline — compress, call, upscale — fits in a single function:

def composite_product(
    background_path: str,
    product_path: str,
    output_path: str,
    max_px: int = 768,
) -> None:
    original_size = compress(background_path, max_px, "bg_small.png")
    compress(product_path, max_px, "product_small.png")

    response = client.models.generate_content(
        model="gemini-2.5-flash-image",
        contents=[
            _image_part("bg_small.png"),
            _image_part("product_small.png"),
            (
                "Two inputs: [1] a rendered 3D retail environment, "
                "[2] a product photograph with a transparent background.\n\n"
                "Composite the product into the retail scene with full photorealism:\n"
                "- Place the product on the most contextually appropriate surface\n"
                "- Scale proportionally to surrounding objects\n"
                "- Match lighting: color temperature, highlight direction, shadow softness\n"
                "- Render a soft contact shadow shaped to the product footprint\n"
                "- Feather edges to eliminate halo artifacts\n"
                "- Preserve product label, material, and color exactly"
            ),
        ],
        config=types.GenerateContentConfig(
            response_modalities=["IMAGE", "TEXT"],
        ),
    )

    with open("composited_small.png", "wb") as f:
        for part in response.candidates[0].content.parts:
            if part.inline_data:
                f.write(part.inline_data.data)

    upscale_result("composited_small.png", original_size, output_path)

What to Expect
For common product categories — fragrance, footwear, apparel, accessories — placement accuracy is high. The model reliably identifies shelves, tables, and counters, and scales the product appropriately. Lighting is where results vary most: scenes with a single dominant light source composite well, but environments with complex, multidirectional lighting occasionally produce mismatched shadows.
The compress-and-upscale approach introduces a small fidelity tradeoff on fine product details — label text, material texture — that upscaling alone doesn't fully recover. For catalog workflows where an artist reviews output before it goes live, this is acceptable. For fully automated pipelines where output goes directly to production, test your specific product categories before committing to a max_px floor.
As for the original question — whether generative AI is actually capable of this in a real pipeline — the answer is yes, but with an honest caveat. Scene understanding and lighting matching were better than I expected going in. Where it fell short was consistency: the same prompt on the same images will not always produce the same result. For a workflow where a human reviews before anything ships, that is manageable. For a fully automated flow, you need validation logic around it. The FastAPI microservice ended up being genuinely plug-and-play across different systems, which was the goal. Whether the generative compositing step is right for a given use case still depends on the tolerance for variability in the output.
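For the automated case, the validation logic can start as a bounded retry loop around the generation call. This is a minimal sketch, not what the service described here runs: generate_with_validation is a hypothetical helper, generate stands for any zero-argument callable that returns image bytes, and is_valid stands for whatever check fits the pipeline (size, aspect ratio, a similarity score against the product image).

```python
from typing import Callable, Optional


def generate_with_validation(
    generate: Callable[[], bytes],
    is_valid: Callable[[bytes], bool],
    max_attempts: int = 3,
) -> Optional[bytes]:
    """Call generate() until is_valid() accepts the result or attempts run out."""
    for _ in range(max_attempts):
        result = generate()
        if is_valid(result):
            return result
    # Every attempt was rejected; the caller decides whether to fall back or escalate.
    return None


# Dry run with stand-ins: the first two outputs are rejected, the third passes.
outputs = iter([b"", b"bad", b"\x89PNG-ok"])
image = generate_with_validation(
    generate=lambda: next(outputs),
    is_valid=lambda data: data.startswith(b"\x89PNG"),
)
```

Returning None rather than raising leaves room for a fallback, such as routing the item to human review.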