Brainstorming an AI-Powered Photo Critic Service

Posted by Aug on February 29, 2024

Abstract:
This post outlines initial brainstorming for an AI-powered photo critic service. It proposes a multi-stage approach to analyze and critique uploaded photographs, starting with basic metadata extraction, followed by deeper image analysis using Convolutional Neural Networks (CNNs) for scene/object recognition, and richer understanding with CLIP for image captioning and aesthetic evaluation. Finally, a Large Language Model (LLM) would synthesize these findings to generate a constructive critique.

Estimated reading time: 4 minutes

I’ve been mulling over an interesting idea: what if you could get an AI to act as your personal photo critic? You’d upload a photograph, and an AI system would analyze it, offering constructive feedback and thoughtful critiques. It’s a fun concept to explore, and it got me thinking about how one might technically approach building such a service.

Here’s a sketch of my initial brainstorming:

The Core Challenge: Choosing the Right AI Models and Approach

Setting aside the application architecture for a moment (for quick prototyping, something like npx create-llama might be a starting point), the most critical decision is how to structure the AI processing. We’re all hearing about powerful multi-modal LLMs that can understand images, audio, and text. However, from my experience, it’s often more effective to pre-process and refine the information an LLM receives, rather than just throwing raw data at it.

You could think of an AI Photo Critic as a specialized form of Retrieval Augmented Generation (RAG). The system needs specific domain knowledge about what constitutes a “good” or “bad” photo from various artistic and technical standpoints. It needs to know what elements to focus on in an image. Before even attempting a critique, it would be highly beneficial for the system to first understand the photo’s content (people, food, pets, landscape) and its general attributes.

My thinking is to use a multi-phased approach for metadata extraction and analysis, moving from simple data to increasingly complex understanding:

Phase 1: Basic Metadata Extraction

This is the low-hanging fruit – information often embedded directly in the image file (a small sketch of this step follows the list):

  • Filename, image resolution (dimensions)
  • Camera make and model (from EXIF data)
  • Dominant colors, overall color palette
  • GPS coordinates (if available in EXIF data)

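To make Phase 1 concrete, here’s a minimal Python sketch using Pillow; the helper name extract_basic_metadata is my own, and real EXIF handling would need to be more defensive:

```python
# A minimal Phase 1 sketch, assuming Pillow is installed (pip install Pillow).
from PIL import Image
from PIL.ExifTags import TAGS

def extract_basic_metadata(path: str) -> dict:
    img = Image.open(path)
    meta = {"filename": path, "resolution": img.size}  # (width, height)

    # Camera make/model from the primary EXIF directory.
    exif = img.getexif()
    for tag_id, value in exif.items():
        tag = TAGS.get(tag_id, tag_id)
        if tag in ("Make", "Model"):
            meta[tag] = str(value).strip()

    # GPS data lives in a separate EXIF sub-directory (IFD 0x8825),
    # present only if the camera or phone recorded a location.
    gps_ifd = exif.get_ifd(0x8825)
    if gps_ifd:
        meta["gps"] = dict(gps_ifd)

    # Dominant colors via a coarse downscale-and-count approach.
    small = img.convert("RGB").resize((64, 64))
    top = sorted(small.getcolors(64 * 64), reverse=True)[:5]
    meta["dominant_colors"] = [rgb for _, rgb in top]
    return meta
```
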
Phase 2: Deeper Image Analysis with Convolutional Neural Networks (CNNs)

CNNs are a classic tool in computer vision and could help extract more detailed visual features (a rough sketch follows the list):

  • Scene Recognition: What’s the general setting of the photo (e.g., a beach, a forest, a cityscape), and are any notable landmarks present (e.g., the Eiffel Tower, a famous monument)?
  • Object Detection: Identify specific objects within the scene.
  • Location Guessing: Infer location (e.g., “Paris”) if prominent landmarks are recognized (and GPS isn’t available).
  • Face Detection & Analysis: If faces are present, detect them, potentially analyze features, and even attempt to recognize expressions (happy, sad, etc.).

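As a starting point, here’s a rough sketch of the classification piece using a pretrained ResNet-50 from torchvision; the helper name top_labels is mine. ImageNet labels lean toward objects, so a Places365-trained model would suit scene recognition better, and object or face detection would need separate detectors:

```python
# A rough Phase 2 sketch using a pretrained ResNet from torchvision
# (pip install torch torchvision).
import torch
from torchvision import models
from torchvision.models import ResNet50_Weights

weights = ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()

def top_labels(img, k: int = 5):
    """Top-k ImageNet labels with probabilities for a PIL image."""
    batch = preprocess(img).unsqueeze(0)  # shape: (1, 3, 224, 224)
    with torch.no_grad():
        probs = model(batch).softmax(dim=1)[0]
    scores, idxs = probs.topk(k)
    return [(weights.meta["categories"][int(i)], float(s))
            for s, i in zip(scores, idxs)]
```
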
Phase 3: Richer Understanding with CLIP (Contrastive Language-Image Pre-training)

CLIP, developed by OpenAI, excels at connecting images and text (a small zero-shot classification sketch follows the list). It could be used for:

  • Image Captioning: Generate a descriptive sentence or two about the image. (CLIP itself scores image-text similarity rather than generating text, so this would likely lean on a CLIP-based captioner such as ClipCap, or a dedicated captioning model like BLIP.)
  • Aesthetic Evaluation/Classification: Attempt to classify the photo’s style (e.g., impressionistic, minimalist, street photography) or even relate it to known artists or photographic styles.
  • Photo Type Classification: Categorize the photo (e.g., landscape, urban, portrait, still life).
  • Outlier Detection: Identify unusual subject matter or unconventional compositions that might be worth commenting on.

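Here’s a small zero-shot photo-type classification sketch using CLIP through Hugging Face transformers; the candidate labels are just illustrative, and captioning proper would need a generative model on top, as noted above:

```python
# A small Phase 3 sketch: zero-shot classification with CLIP via
# Hugging Face transformers (pip install transformers torch pillow).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate labels are illustrative only; swap in whatever taxonomy you like.
labels = ["a landscape photo", "an urban street photo",
          "a portrait", "a still life", "a food photo"]

def classify_photo_type(path: str):
    image = Image.open(path)
    inputs = processor(text=labels, images=image,
                       return_tensors="pt", padding=True)
    outputs = model(**inputs)
    # logits_per_image: similarity of the image to each text prompt.
    probs = outputs.logits_per_image.softmax(dim=1)[0]
    return sorted(zip(labels, probs.tolist()), key=lambda x: -x[1])
```
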
The information gathered from these initial phases (Basic Metadata, CNN analysis, CLIP analysis) would form a rich “context” to be passed to a sophisticated, instruction-following LLM.

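To give a sense of what that combined context might look like, here’s a purely illustrative example; the field names and values are invented for the sake of the sketch:

```python
# Illustrative only: a context dict assembled from the earlier phases.
photo_context = {
    "metadata": {
        "resolution": (6000, 4000),
        "camera": "Fujifilm X-T4",
        "dominant_colors": [(34, 87, 122), (201, 178, 140)],
        "gps": None,
    },
    "cnn_analysis": {
        "scene": "beach at sunset",
        "objects": ["person", "surfboard"],
        "faces": 1,
    },
    "clip_analysis": {
        "caption": "A surfer walking along the shoreline at dusk.",
        "photo_type": "landscape / environmental portrait",
        "style_tags": ["minimalist", "golden hour"],
    },
}
```
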
Phase 4: Generating the Critique with a Large Language Model (LLM)

Finally, an LLM would take all this structured information (a sketch follows the list) and:

  • Generate a Coherent Critique: Synthesize the findings into a human-readable critique, explaining what works well and what could be improved.
  • Reason about Aesthetics: Discuss elements like composition, lighting, use of complementary colors, leading lines, etc.
  • Understand Cultural Significance: If relevant (e.g., the photo is of the Eiffel Tower), incorporate notes on its cultural context.
  • Enable Interactive Dialogue: Ideally, allow for a back-and-forth conversation where the user can ask follow-up questions about the critique.

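Putting it together, here’s a minimal sketch of the critique step, assuming the OpenAI Python SDK and the photo_context dict from the earlier example; any capable instruction-following LLM could be substituted:

```python
# A sketch of Phase 4, assuming the OpenAI Python SDK (pip install openai).
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_critique(photo_context: dict) -> str:
    system = (
        "You are a thoughtful photography critic. Given structured analysis "
        "of a photo, write a constructive critique covering composition, "
        "lighting, color, and subject, noting strengths before weaknesses."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; pick any instruction-following model
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": json.dumps(photo_context, indent=2)},
        ],
    )
    return response.choices[0].message.content
```

For the interactive dialogue, the same messages list could simply be extended with the user’s follow-up questions and re-sent.
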
Further Investigation Needed (My “Todos”)

This is all very high-level, of course. The next step would be to investigate specific models and technologies:

  • CNN Models: Explore well-established architectures like VGGNet, ResNet, or Inception for scene and object recognition.
  • Multi-Modal LLMs / Vision Language Models (VLMs): For the final critique generation and potentially some earlier phases, look into models like LLaVA, CogVLM, or Google’s Gemini (which superseded Bard). Perhaps a specialized VLM could even handle some of the CNN/CLIP tasks.
  • CLIP Implementation: OpenAI’s CLIP is a strong candidate for the image-text understanding phase. It might even be sufficient for a basic prototype to get captioning and some classification working.

Building a truly insightful AI photo critic is a complex challenge, but breaking it down into these phases makes it feel more approachable. It’s an exciting intersection of computer vision, natural language processing, and a touch of artistic sensibility!