<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://darthgera123.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://darthgera123.github.io/" rel="alternate" type="text/html" hreflang="en" /><updated>2025-08-13T09:49:15+00:00</updated><id>https://darthgera123.github.io/feed.xml</id><title type="html">Pulkit Gera  पुलकित गेरा</title><subtitle>Pulkit Gera is an Applied Scientist Intern at Flawless AI.
</subtitle><author><name>Pulkit Gera</name><email>testandplayalltime@gmail.com</email></author><entry><title type="html">What does it mean to win?</title><link href="https://darthgera123.github.io/blogs/advice/winning/" rel="alternate" type="text/html" title="What does it mean to win?" /><published>2025-07-21T00:00:00+00:00</published><updated>2025-07-21T00:00:00+00:00</updated><id>https://darthgera123.github.io/blogs/advice/winning</id><content type="html" xml:base="https://darthgera123.github.io/blogs/advice/winning/"><![CDATA[<ul class="large-only" id="markdown-toc">
  <li><a href="#what-does-it-mean-to-win" id="markdown-toc-what-does-it-mean-to-win">What does it mean to win?</a></li>
  <li><a href="#what-does-it-mean-to-win-1" id="markdown-toc-what-does-it-mean-to-win-1">What does it mean to win?</a></li>
  <li><a href="#what-does-it-mean-to-win-2" id="markdown-toc-what-does-it-mean-to-win-2">What does it mean to win?</a></li>
</ul>

<p>Every time I come home, I feel like I have stepped into an old memory of mine.</p>

<p>The streets look the same, the same shops selling the same food. I don’t even find it hard to navigate. It weirdly makes me happy and sad at the same time, as if the city didn’t grow up with me while so much changed in my own life. I feel like I have just slipped back into the lifestyle I had before leaving and nothing has changed. It’s one of the most fascinating thoughts to me.</p>

<p><img src="/assets/img/winning/shadab.jpeg" alt="team" class="tail" width="640" height="540" loading="lazy" /></p>

<p>I visited Shadab to eat biryani. It was insane how it tasted exactly the same and was still the best biryani I have had. 15 years ago, my dad brought it home. 8 years ago, I shifted to Hyderabad and 3 years ago I left. Yet the smell, the taste, the cushions everything was exactly the same.</p>

<p>Visiting home gives me a chance to catch up with everyone. Some comment that I look fatter, while some notice I have a new hairstyle. However, one thing that’s constant is that everyone still feels the same, no matter how much they have progressed. Although the initial nervousness holds us both back, after a while it feels like it’s the same guy, only this time he is wearing chic shoes instead of the torn slippers from college. Talking to them makes you realize how similar things still are, even though the metrics might have changed.</p>

<p><img src="/assets/img/winning/magritte.jpg" alt="team" class="tail" width="640" height="540" loading="lazy" /></p>

<p>Suddenly they are chasing cars instead of 9.0 GPAs, or they are worried about being stuck in a loop instead of being sad about low grades on a project. Talking to them makes me feel that all of us have that small kid inside who just wants to be part of a group, play with friends and make their parents proud. Meeting them is like travelling in an alternate timeline, since our paths diverged after school or college and we chased different things. It just makes me wonder if I took the wrong path, and whether I should have led my life the way they did.</p>

<h2 id="what-does-it-mean-to-win">What does it mean to win?</h2>
<p>When we were younger, the answer was easy. Study hard for two years, get into a good college, and everything will work out. That’s what we were told. And since everyone around us was chasing the same goal, it felt like the only way forward.</p>

<p>But life didn’t stay that simple.</p>

<p>Now, people are on completely different journeys. Some are making big money in top companies. Others are doing master’s degrees or PhDs while carrying student loans. Some are starting families. Some are building startups. Some are planning world trips, while others are thinking of moving back home. And everyone is now scrambling to find a partner, something that was actively discouraged throughout our lives.
<img src="/assets/img/winning/girl-before-a-mirror.jpg" alt="team" class="tail" width="640" height="540" loading="lazy" /></p>

<p>Everybody is confused. No matter what decision you take, it always feels like you are doing something wrong. You try to optimize for everything and end up failing because you can’t have it all. Even though you might be living the life you dreamt of, you still feel like you are falling behind, that maybe you should have done things differently.</p>

<p>Instagram and Linkedin have made the situation far worse because everyone only sees each other’s highlights. You don’t just see one person’s win. You see five in a row. Someone’s engagement. Someone’s product launch. Someone’s promotion. Someone’s Europe trip. And without even realising it, your brain merges all of them together and compares that supercut to your one quiet life. It makes you feel like you’re losing a race you never signed up for. A win for someone might not be important for someone else.</p>

<p>On top of that, everyone feels that life stops at 30, so all accomplishments must be done before then. In the previous generation, people got married young and then built their life together. Now, we’re expected to “fix” everything about ourselves first — finances, fitness, careers, emotional wounds — and only then find someone who’s equally fixed.</p>

<p><img src="/assets/img/winning/Edvard-Munch-The-Scream.webp" alt="team" class="tail" width="640" height="540" loading="lazy" /></p>

<p>There’s this idea that once we get X, our life is sorted. Over the years, I have found this to be the source of all my unhappiness. While this idea helps us focus and drop everything else, we forget to enjoy our own life. We justify every bit of suffering by telling ourselves that once we achieve X, everything will have been worth it. Yet nobody ever feels relief on reaching X, because they are already busy chasing another Y. Meanwhile, life moves on. Your parents grow older. Your health starts to shift. The habits you ignored begin to catch up. And suddenly, life becomes one long to-do list.</p>

<p>A win is supposed to feel like freedom. So why doesn’t it?</p>

<p>There’s an old Greek myth about a man named Sisyphus. He was punished by the gods for cheating death and was condemned to roll a heavy boulder up a hill, only for it to roll back down every time he reached the top. Over and over, forever. Conventional wisdom says that’s eternal suffering.</p>

<p><img src="/assets/img/winning/sisyphus.png" alt="team" class="tail" width="640" height="540" loading="lazy" /></p>

<p>Albert Camus, a French philosopher, had the absurd idea that Sisyphus is happy. Because even in the absurdity of his fate, Sisyphus owns his struggle. He accepts that there is no final meaning, and still chooses to push the boulder.</p>

<p>Of course, this doesn’t mean that suffering is noble or that people in pain should just “accept it.” That would be cruel. But there is something quietly powerful in choosing to keep going, in finding purpose not in a final destination but in the act itself.</p>

<p>You see, life will go by and sometimes it’s going to feel great and sometimes it’s terrible. However most people often get trapped in the past or the future. The past has happened, it won’t hurt you anymore. The future will happen regardless of us worrying about it. The only thing that can change is the present. So if we don’t stay alive in the present, the regrets would just keep adding up.</p>

<p>Funnily enough, even the great figures of history remained trapped in this. Franz Kafka, one of the most influential writers of the 20th century, died before most of his work was even published. He believed he had failed. Vincent van Gogh painted over 2,000 pieces, but sold only one painting in his entire life. He died feeling like his life had amounted to nothing.</p>

<h2 id="what-does-it-mean-to-win-1">What does it mean to win?</h2>

<p>Perhaps to find the answer, we first need to be kind to ourselves. Maybe we should start by looking at ourselves the way we look at our best friend. Acknowledge the hardships we’ve overcome. Notice how much we’ve grown. Recognize the effort it takes just to keep moving forward every day. And most importantly, stop treating ourselves like a constant project under construction.</p>

<p><img src="/assets/img/winning/kohli.webp" alt="team" class="tail" width="640" height="540" loading="lazy" /></p>

<p>Virat Kohli is one of the greatest cricketers of all time. But for years, he was stuck on 70 centuries. Most players never even get close to that number, but the noise around him was loud. He admitted later that he was struggling with mental health. No matter how hard he tried, one ball would always get him out. Then came that game against Afghanistan. In his mind, he had let go. He stopped chasing the hundred. And suddenly, it happened. He hit a six and reached century number 71, after almost three years. But instead of roaring in celebration, he laughed. In disbelief. Because at that moment, the weight lifted. That, in a career full of legendary innings, was the one that freed him. He scored 12 more centuries after that.</p>

<p>That’s what a win is supposed to feel like. A culmination of all the pain, the silence, the trying. Whether you get there by doing hard things, challenging yourself or forming connections, as long as you keep trying, I feel that is a win. It may not be flashy or even noteworthy, but as long as it’s liberating, it’s a win.</p>

<p><img src="/assets/img/winning/zuko.jpeg" alt="team" class="tail" width="640" height="540" loading="lazy" /></p>

<p>In Avatar: The Last Airbender, Uncle Iroh asks Zuko, “Who are you? And what do you want?” I think it’s worthwhile writing down the answer to this every now and then. It might genuinely help us center ourselves and find the things we want to achieve this year. Instead of getting swayed by Instagram/LinkedIn posts, we should really think about what we want to get done.</p>

<p>Chasing X will never give us happiness, but enjoying the journey will. We need to find happiness in the chase; only then will the win feel like a win. For that, we need to remain in the present as much as possible.</p>

<h2 id="what-does-it-mean-to-win-2">What does it mean to win?</h2>

<p>For me, it’s the moment when it works. That instant when I forget everything else and look at it in wonder and amazement. When I don’t care about anything else in the world and just stare at it. That’s the moment when I feel I made it. That, for me, is to win. That feeling is what I keep chasing in every hard task I attempt.</p>

<p>Winning is not about the world. There is no metric to define it. Sometimes it’s getting through a hard week. Sometimes it’s being honest with yourself. Sometimes it’s laughing again after a long time. And sometimes, it’s letting go of all the pressure and just playing for the love of it, like Kohli did that day.</p>

<p>We don’t need to chase someone else’s definition of success. We don’t need to live on timelines set by people who don’t know our story. The truth is, this is your life. You are the one who gets to write it.</p>

<p>Maybe winning is just choosing to live, care and keep going on your own terms. And if you are doing that, you are already winning.</p>

<p><img src="/assets/img/winning/wanderer.webp" alt="team" class="tail" width="640" height="540" loading="lazy" /></p>]]></content><author><name>Pulkit Gera</name><email>testandplayalltime@gmail.com</email></author><category term="blogs" /><category term="advice" /><summary type="html"><![CDATA[A contemplative blog about the question bothering us all]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://darthgera123.github.io/assets/img/winning/mohammed_ali.jpg" /><media:content medium="image" url="https://darthgera123.github.io/assets/img/winning/mohammed_ali.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Spice Terminator- SO101 Arm That Wipes Plates</title><link href="https://darthgera123.github.io/blogs/hackathon/robot/" rel="alternate" type="text/html" title="Spice Terminator- SO101 Arm That Wipes Plates" /><published>2025-06-16T00:00:00+00:00</published><updated>2025-06-16T00:00:00+00:00</updated><id>https://darthgera123.github.io/blogs/hackathon/robot</id><content type="html" xml:base="https://darthgera123.github.io/blogs/hackathon/robot/"><![CDATA[<!-- ## Introduction -->
<ul id="markdown-toc">
  <li><a href="#setting" id="markdown-toc-setting">Setting</a></li>
  <li><a href="#meet-the-team" id="markdown-toc-meet-the-team">Meet the Team</a></li>
  <li><a href="#toolbox" id="markdown-toc-toolbox">Toolbox</a></li>
  <li><a href="#build-phase" id="markdown-toc-build-phase">Build Phase</a></li>
  <li><a href="#demo-day-chaos" id="markdown-toc-demo-day-chaos">Demo Day Chaos</a></li>
  <li><a href="#presentation" id="markdown-toc-presentation">Presentation</a></li>
  <li><a href="#awards" id="markdown-toc-awards">Awards</a></li>
  <li><a href="#final-thoughts" id="markdown-toc-final-thoughts">Final Thoughts</a></li>
</ul>

<iframe style="width: 100%; aspect-ratio: 16 / 9;" src="https://www.youtube.com/embed/XDA7q8ReSxw?si=DCzmk3UP15n8E6oF" title="Demo Video" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="">
</iframe>

<p>Cleaning dishes is genuinely one of the most annoying yet unavoidable parts of the day. You come home, tired after work. Maybe you gather enough energy to cook yourself a decent meal. But the moment you’re done eating, the last thing you want to do is clean up.</p>

<p>So… what if you could teach a robot to do it for you?</p>

<p>What if you could sit in your room, watch it in action, and make sure it doesn’t break something?</p>

<p>What if you could deploy a bunch of them in kitchens or factories, train them remotely, and just monitor the results?</p>

<p>That’s what we built — and it ended up winning us <strong>Most Novel Solution</strong> at the HuggingFace LeRobot hackathon.</p>

<p><img src="/assets/img/robot/spice.png" alt="team" class="tail" width="640" height="540" loading="lazy" /></p>

<p>We call it Spice Terminator — a robot powered by imitation learning that wipes plates squeaky clean.</p>

<h2 id="setting">Setting</h2>

<p>Hugging Face organized the <a href="https://huggingface.co/LeRobot-worldwide-hackathon">LeRobot Worldwide Hackathon</a> 2025 — a global celebration of AI-powered robotics. It was open to all, from hardcore robotics folks to curious tinkerers. The London chapter, hosted by <a href="https://www.linkedin.com/company/society-for-technological-advancement/">SoTA</a>, was where we joined in.
We took part in the Imitation Learning track, where the challenge was to teach robots how to complete real-world tasks through demonstrations. Our chosen mission: teaching it to clean dishes.</p>

<h2 id="meet-the-team">Meet the Team</h2>

<p><img src="/assets/img/robot/sota-hf-hackathon-89.jpg" alt="team" class="tail" width="640" height="540" loading="lazy" /></p>

<p>Our team had a diverse background — different skill sets, but all aligned in motivation.</p>

<p>I work as an Applied Scientist at Flawless AI, building lip sync tools for film production.</p>

<p><a href="https://www.linkedin.com/in/wwwhatley/">William</a>, a seasoned founder and product manager, brought structure and clarity to our project.</p>

<p><a href="https://www.linkedin.com/in/sanangarayev/">Sanan</a>, a Robotics Master’s student at UCL, had hands-on experience with vision-language architectures.</p>

<p>Atharv and <a href="https://www.linkedin.com/in/ton-hoang-n-11a2a0105/">Bill</a>, both Research Engineers, had already worked with the SO-101 embodiment and brought the hardware expertise to the table.</p>

<p>We had access to two pairs of SO-101 arms — one acting as the leader, the other as the follower. The follower had a camera mounted, and we also had an Intel RealSense camera mounted overhead.</p>

<p>For William and me, this was the first time working with embodied agents in the physical world. And I won’t lie — I felt like a kid in a magic show. Watching the robot arms move and respond, I was just standing there with my jaw dropped. It felt unreal.</p>

<h2 id="toolbox">Toolbox</h2>
<p><img src="/assets/img/robot/so101.jpeg" alt="team" class="tail" width="640" height="540" loading="lazy" /></p>

<p>The SO‑101 is the upgraded version of the SO‑100 arm, developed by RobotStudio and Hugging Face. It’s designed to work with the open-source LeRobot library and costs around $130 (excluding 3D-printed parts). Affordable and open — a nice combo.
For intelligence, we leaned on vision-language-action (VLA) models. These models understand both visuals and instructions, and can generalize actions across different robot embodiments. We used <a href="https://huggingface.co/blog/nvidia/gr00t-n1-5-so101-tuning">NVIDIA GR00T 1.5</a> and <a href="https://huggingface.co/docs/lerobot/smolvla">SmolVLA</a> to train our robot.
Our setup included:</p>

<ul>
  <li>A leader-follower SO-101 pair</li>
  <li>A wrist-mounted camera on the follower</li>
  <li>An overhead Intel RealSense camera</li>
</ul>

<p><img src="/assets/img/robot/env.jpg" alt="team" class="tail" width="640" height="540" loading="lazy" /></p>

<p>The environment?
A plate with red sauce on it, a bowl with a wet sponge, and some clutter we threw in to make things harder: spoons, cans, tissues, whatever we could find.
The task sounded simple on paper: pick up the sponge, wipe the sauce off the plate, and put the sponge back in the bowl.
But in practice? Way harder than it looks.</p>

<p>Challenges for this robot arm would be:</p>

<ul>
  <li>Picking the sponge up correctly by its edge, which was slippery.</li>
  <li>Aligning the sponge with the plate based on how it was picked up.</li>
  <li>Tracking progress, since the sauce could spread while wiping.</li>
  <li>Coping with occlusion, since the arm would sometimes block the top camera entirely.</li>
  <li>Finally, deciding whether the plate was clean “enough”, with no precise metric to rely on, only what it had learned.</li>
</ul>

<p>We even moved the plate mid-task, added more sauce, and tested how well it adapted. The robot had to generalize where the plate, sponge, and bowl were placed and ignore other objects.</p>

<h2 id="build-phase">Build Phase</h2>

<p>This was my favorite part. During planning, Will took the lead — outlining goals, defining OKRs, and keeping us grounded. He proposed an amazing devtool idea: a browser-based interface to visualize and control the robot. It turned out to be a huge asset.</p>

<p><img src="/assets/img/robot/gui.jpeg" alt="team" class="tail" width="640" height="540" loading="lazy" /></p>

<p>On the day of the hackathon:</p>

<ul>
  <li>Atharv and Bill handled hardware setup.</li>
  <li>Sanan calibrated the robots with laser-sharp focus (seriously, no false moves).</li>
  <li>Bill and I worked on designing the experiment space and prepping props.</li>
  <li>Sanan and Atharv began exploring the VLA models and got the initial pipelines and environments running.</li>
  <li>Will developed the web interface and tested it thoroughly.</li>
</ul>

<p><img src="/assets/img/robot/build.jpg" alt="team" class="tail" width="640" height="540" loading="lazy" /></p>

<p>Bill and I took charge of collecting the demonstration data. I started by teleoperating the leader arm — guiding it to pick the sponge, wipe the plate, and return the sponge to the bowl. After the first 25 episodes, Bill took over, and honestly, he was a natural. He was super smooth with the arm, and by the end, we had collected <a href="https://huggingface.co/datasets/LeRobot-worldwide-hackathon/83-Cracked-Team-combined-567-sponge-wipe">75 episodes</a> across varying levels of difficulty. Then I worked on sorting the data and fine-tuning the GR00T 1.5 model while Sanan fine-tuned SmolVLA. Atharv ensured that the inference code worked and didn’t break the robot, and handled anything that went wrong.</p>
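<p>For readers who have not seen this style of data collection, the sketch below shows the general shape of a teleoperation recording loop: the follower mirrors the leader while observations and actions are logged. The leader, follower and camera objects and their methods are hypothetical stand-ins, not the LeRobot API.</p>

<pre><code class="language-python">import time

def record_episode(leader, follower, cameras, seconds=30, fps=30):
    """Generic teleoperation recording loop. The leader/follower/camera objects
    and their methods are hypothetical stand-ins, not the LeRobot API."""
    frames = []
    for _ in range(seconds * fps):
        action = leader.read_joint_positions()        # human moves the leader arm
        follower.write_joint_positions(action)        # follower mirrors the motion
        frames.append({
            "images": {name: cam.read() for name, cam in cameras.items()},
            "state": follower.read_joint_positions(),
            "action": action,
        })
        time.sleep(1.0 / fps)
    return frames   # one demonstration episode; we collected 75 of these
</code></pre>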

<p>There were hiccups, of course — dependency hell, broken imports, and unexpected JSON naming. But eventually, the robot learned to clean the plate.</p>

<p>The first time it worked, we just froze. Then the cheers came. Seeing it actually execute a full clean felt like magic.</p>

<h2 id="demo-day-chaos">Demo Day Chaos</h2>

<p>Now, as with any real robotics demo, things didn’t go perfectly. Once we hit “record,” the robot started acting up:</p>

<ul>
  <li>Picked the bowl instead of the sponge.</li>
  <li>Wiped the wrong part of the plate.</li>
  <li>Got stuck mid-task.</li>
  <li>Didn’t want to leave the plate.</li>
</ul>

<p>But after some careful adjustments, it worked. It picked the sponge, wiped the sauce, and returned the sponge. We clapped every single time it worked. It was like watching a close game on TV with fingers crossed.</p>

<iframe style="width: 100%; aspect-ratio: 16 / 9;" src="https://www.youtube.com/embed/x0MpfS3JfY0?si=eMGsflqvMIe7I-nu" title="Demo Video" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="">
</iframe>

<p>We even tested it on a greasy paper lunchbox top — and it wiped that too!</p>

<h2 id="presentation">Presentation</h2>
<p><img src="/assets/img/robot/sota-hf-hackathon-12.jpg" alt="team" class="tail" width="640" height="540" loading="lazy" /></p>

<p>During the showcase, every team brought their A-game. They built projects like:</p>

<ul>
  <li>Robot pipetting liquid into test tubes and picking the whole stack.</li>
  <li>Drawing with a custom robot arm.</li>
  <li>Robots that played chess.</li>
  <li>Arm that picked bread and put it into a toaster.</li>
  <li>Arm that picked garbage off a conveyor belt.</li>
  <li>Robot that scanned the environment and figured out the shortest route.
<img src="/assets/img/robot/sota-hf-hackathon-17.jpg" alt="team" class="tail" width="640" height="540" loading="lazy" /></li>
</ul>

<p>People designed and 3D printed custom grippers, vision rigs, and environments. The level of execution across the board was genuinely inspiring. Every single project was jaw-dropping, and honestly it was a privilege to see them live.</p>

<h2 id="awards">Awards</h2>
<p><img src="/assets/img/robot/sota-hf-hackathon-101.jpeg" alt="team" class="tail" width="640" height="540" loading="lazy" /></p>

<p>We ended up tying for <strong>Most Novel Solution</strong>. While our problem was simpler than some, the creativity and robustness of our approach was recognized.
Everyone brought something unique, but I think what helped us stand out was:</p>
<ul>
  <li>Tight team roles</li>
  <li>Working, adaptive demo</li>
  <li>Clear, testable problem</li>
</ul>

<p>I had learned from a previous hackathon how critical it is to optimize the demo, not just the codebase. That definitely paid off this time.</p>

<h2 id="final-thoughts">Final Thoughts</h2>

<p>This was easily one of the best weekends I’ve had in a long time. The team vibe was spot-on — collaborative, efficient, and always fun.
Huge shoutout to the <a href="https://lu.ma/9slgjcs5?tk=xoUuI6">SOTA</a> organizers for creating such a welcoming, high-energy atmosphere. The food was great, the people were amazing, and the projects were next level. They were super helpful and gave us access to <a href="https://www.ori.co/">Ori</a> GPUs and any hardware we wanted at a moment’s notice.
I’m already looking forward to the next LeRobot hackathon.</p>]]></content><author><name>Pulkit Gera</name><email>testandplayalltime@gmail.com</email></author><category term="blogs" /><category term="hackathon" /><summary type="html"><![CDATA[Building a Robot Arm that wipes plates]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://darthgera123.github.io/assets/img/robot/team1.jpg" /><media:content medium="image" url="https://darthgera123.github.io/assets/img/robot/team1.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">VEEDxFal Hackathon- Building Global Ad Campaigns with GenAI</title><link href="https://darthgera123.github.io/blogs/hackathon/veedxfal/" rel="alternate" type="text/html" title="VEEDxFal Hackathon- Building Global Ad Campaigns with GenAI" /><published>2025-06-10T00:00:00+00:00</published><updated>2025-06-10T00:00:00+00:00</updated><id>https://darthgera123.github.io/blogs/hackathon/veedxfal</id><content type="html" xml:base="https://darthgera123.github.io/blogs/hackathon/veedxfal/"><![CDATA[<!-- ## Introduction -->

<ul id="markdown-toc">
  <li><a href="#mission-begins" id="markdown-toc-mission-begins">Mission Begins</a></li>
  <li><a href="#why-i-joined" id="markdown-toc-why-i-joined">Why I Joined</a></li>
  <li><a href="#meet-the-team" id="markdown-toc-meet-the-team">Meet the Team</a></li>
  <li><a href="#toolbox" id="markdown-toc-toolbox">Toolbox</a></li>
  <li><a href="#right-problem" id="markdown-toc-right-problem">Right Problem</a></li>
  <li><a href="#pivot" id="markdown-toc-pivot">Pivot</a></li>
  <li><a href="#mad-men" id="markdown-toc-mad-men">Mad Men</a></li>
  <li><a href="#our-product" id="markdown-toc-our-product">Our Product</a></li>
  <li><a href="#construction" id="markdown-toc-construction">Construction</a></li>
  <li><a href="#pitch-time" id="markdown-toc-pitch-time">Pitch Time</a></li>
  <li><a href="#what-we-learned" id="markdown-toc-what-we-learned">What We Learned</a></li>
  <li><a href="#final-thoughts" id="markdown-toc-final-thoughts">Final Thoughts</a></li>
</ul>

<iframe style="width: 100%; aspect-ratio: 16 / 9;" src="https://www.youtube.com/embed/s-c2kqhbXGk?si=5lkIICXc4ZzAZxW5" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="">
</iframe>

<p>What if Tom Cruise could pitch Corona in a London pub—then instantly switch scenes and charm a Mexican plaza with the same beer, all in flawless Spanish? That was our pitch: one product, infinite markets. Fully automated, hyper-localized ad campaigns—copy, visuals, voiceovers, and video—crafted by GenAI. No agencies. No weeks of back and forth. Just input your product, pick a region, and watch your global ad empire unfold. The mission was bold. The clock was ticking.</p>

<h2 id="mission-begins">Mission Begins</h2>

<p><a href="https://www.veed.io/">VEED</a>, <a href="https://fal.ai/">FAL.ai</a>, <a href="https://elevenlabs.io/">ElevenLabs</a>, <a href="http://photoroom.com/">Photoroom</a>, and <a href="https://www.sievedata.com/">Sieve</a> recently conducted a 24-hour Gen AI Hackathon. There were over 300 participants from various backgrounds competing to create a product using the various Gen AI APIs provided by the sponsors.</p>

<h2 id="why-i-joined">Why I Joined</h2>

<p>Over the last few years, I have spent time developing Gen AI algorithms and have always been amazed at how quickly products emerge from SOTA research. I was curious to see how people used these APIs to build actual products and find viable paths to profitability.</p>

<h2 id="meet-the-team">Meet the Team</h2>

<p><img src="/assets/img/veed/veed_team.jpeg" alt="team" class="tail" width="640" height="540" loading="lazy" /></p>

<p>My team had a diverse background. I work as an Applied Scientist at Flawless AI, developing solutions for lip sync in movie assets. <a href="https://www.linkedin.com/in/gorillaphant/">Rodolfo</a> has experience building AI tools for top online marketplace sellers across markets like Mexico, Brazil, Italy, the US, and the UK. <a href="https://www.linkedin.com/in/mu-jing-tsai/">Mujing</a>, on the other hand, is a seasoned frontend engineer and hackathon veteran. This mix gave us a strong foundation to cover multiple angles of product development.</p>

<h2 id="toolbox">Toolbox</h2>

<p>We had access to a wide variety of APIs provided by the sponsors. FAL hosted numerous APIs for text-to-image, text-to-video, image-to-video, and more. VEED served as an abstraction layer over many of them. What amazed me was how accessible they made SOTA models. Many participants didn’t even write code—they used the APIs as Lego blocks, and built genuinely cool things. Having spent hours in the past setting up GitHub repos just to run inference, this felt like a huge leap.</p>

<h2 id="right-problem">Right Problem</h2>

<p>While others started building quickly, we spent a lot of time brainstorming. It was intimidating to see how fast some teams were moving. People would often ask us what we were building, and for a while, our answer was an awkward, “No idea yet.”</p>

<p>We wanted to build something meaningful—something that could evolve into a viable product rather than a gimmick. Ideas we discussed included a Duolingo-style video chat app, an AI avatar real estate agent, a tool to generate marketing images from a few product photos, or even a sports analyst. I also wanted to explore legal tech since it’s so tedious and time-consuming—but none of us had domain expertise in law.</p>

<h2 id="pivot">Pivot</h2>

<p><img src="/assets/img/veed/mexico_tom.jpg" alt="team" class="tail" width="320" height="275" loading="lazy" /></p>

<p>Rodolfo brought up the idea of generating custom landing pages from user input. This resonated. His experience across global markets made us think—what if we built a system that could generate full marketing campaigns tailored for different regions? Something that would automatically produce ad copy, promotional images, product visuals, even videos—based on the product and the target market.</p>

<h2 id="mad-men">Mad Men</h2>

<p><img src="/assets/img/veed/madmen_a.webp" alt="team" class="tail" width="640" height="540" loading="lazy" /></p>

<p>Based on my multiple rewatches of Mad Men, I kept thinking about how much effort goes into a single ad campaign. Typically, there’s a team of copywriters, graphic designers, video editors, and more—all working together to brainstorm, prototype, and refine the messaging and visuals. This process can take weeks for just one campaign, let alone multiple campaigns tailored for different international markets.</p>

<p>Our product would drastically reduce the prototyping and iteration time by generating a first-pass campaign that professionals could then refine and personalize further.</p>

<h2 id="our-product">Our Product</h2>
<iframe style="width: 100%; aspect-ratio: 16 / 9;" src="https://www.youtube.com/embed/jXyfbFKChjE?si=BgVW9yDG2VPMiSyi" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="">
</iframe>

<ul>
  <li>The user provides a product image, basic description, and market (e.g., UK or Mexico).</li>
  <li>An agent generates ad copy.</li>
  <li>Using this and the product image, we generate appropriate backgrounds and use <a href="https://fal.ai/models/fal-ai/flux-pro/kontext/playground">Flux Kontext Pro</a> to place the product.</li>
  <li>The look and tone of the generated visuals change based on the target region.</li>
  <li>For video content:
    <ul>
      <li>The user selects a celebrity or actor and provides a script.</li>
      <li>ElevenLabs is used to generate voice audio.</li>
      <li><a href="https://fal.ai/models/fal-ai/flux-pro/kontext/playground">Flux Kontext</a> renders the actor holding the product.</li>
      <li><a href="https://fal.ai/models/fal-ai/hunyuan-avatar">Hunyuan Avatar</a> API animates the image + audio into a final video ad.</li>
    </ul>
  </li>
</ul>

<p>For example, Tom Cruise might promote Corona with “mates” in London and with “amigos” in Spanish—completely different cultural tones.</p>
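<p>For a rough picture of how the pieces chained together, here is a sketch of the pipeline flow. Every helper below is a hypothetical placeholder standing in for the corresponding copywriting agent, Flux Kontext, ElevenLabs or Hunyuan Avatar call; it is not our actual hackathon code.</p>

<pre><code class="language-python">def generate_ad_copy(description, market):
    """Hypothetical stand-in for the copywriting agent."""
    return f"{description} - localized for {market}"

def place_product(scene, product_image):
    """Hypothetical stand-in for Flux Kontext product placement."""
    return {"scene": scene, "product": product_image}

def make_voiceover(script, market):
    """Hypothetical stand-in for ElevenLabs text-to-speech."""
    return {"script": script, "market": market}

def animate_avatar(actor_frame, audio):
    """Hypothetical stand-in for the Hunyuan Avatar image + audio to video step."""
    return {"frame": actor_frame, "audio": audio}

def build_campaign(product_image, description, market, actor=None, script=None):
    """End-to-end flow described above, with every step stubbed out."""
    ad_copy = generate_ad_copy(description, market)
    still_ad = place_product(f"{market}-styled background for: {ad_copy}", product_image)
    video_ad = None
    if actor and script:
        video_ad = animate_avatar(place_product(actor, product_image),
                                  make_voiceover(script, market))
    return {"copy": ad_copy, "image": still_ad, "video": video_ad}

# e.g. build_campaign("corona.png", "A crisp, refreshing beer", "Mexico",
#                     actor="tom_cruise.png", script="Salud, amigos!")
</code></pre>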

<h2 id="construction">Construction</h2>

<p><img src="/assets/img/veed/shakira_mexico.jpg" alt="team" class="tail" width="320" height="275" loading="lazy" /></p>

<p>What struck me was how fast this all came together. Back in college, I hand-coded frontends and REST APIs. This time, Mujing introduced me to fast prototyping platforms and visual dev tools that saved hours. Within a few hours, we had a working pipeline.</p>

<p>But not everything went smoothly. Around 5am, during the final integration push, half the visual pipeline broke. Backgrounds didn’t render properly, the audio sync was glitching, and API rate limits started hitting us. We thought we might not even have a demo to show. Mujing, running on zero sleep, sat calmly at 6am and rewrote parts of the flow to stitch everything back together. Watching that come together felt like a movie montage moment—our very own last-minute hackathon miracle.</p>

<p>Another interesting thing was how others seemed more focused on just getting something out rather than worrying about the big picture. It was fast, unconstrained—probably the true spirit of indie hacking. Even the craziest ideas were just a few clicks away. While I was getting caught up thinking about the current limitations of video models and questioning how good or bad they were, Saba told us not to worry about it—just build. The tech would catch up. “Don’t get stuck thinking why it won’t work—focus on how amazing it can get.” That mindset shift really stuck with me.</p>

<h2 id="pitch-time">Pitch Time</h2>

<p><img src="/assets/img/veed/veed_crowd.jpeg" alt="team" class="tail" width="640" height="540" loading="lazy" /></p>

<p>After staying up for 30 hours straight, it was time to pitch. We had only two minutes to present. We rehearsed around 20 times to make sure we were crisp and tight. The atmosphere was electric. People from all walks of life—indie hackers, product managers, research students—came with different goals. Some wanted virality, some chased monetization, others just wanted to experiment. A lot of ideas revolved around monetizing Instagram and TikTok content.</p>

<h2 id="what-we-learned">What We Learned</h2>

<p>We were quietly confident but didn’t make the podium. <a href="https://www.linkedin.com/in/sabba-keynejad-8633b04b/">Saba</a>, VEED’s CEO, mentioned that some teams under-built while others over-built. What ultimately stood out was the product experience. In hindsight, we focused too much on explaining our positioning. Our “wow” moment came too late in the pitch. Meanwhile, winning teams had immediate visual appeal and clear storytelling.</p>

<h2 id="final-thoughts">Final Thoughts</h2>

<iframe style="width: 100%; aspect-ratio: 16 / 9;" src="https://www.youtube.com/embed/QVXPxJW8KFg?si=wCfekiFJEDbBA66p" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="">
</iframe>

<p>Even so, the experience was incredibly enriching. Both Rodolfo and Mujing were fantastic to work with and brought unique perspectives. I still believe our product had great potential, but I now understand the importance of how you present—not just what you build.</p>

<p>A huge thank you to the organizers for creating such a supportive and high-energy environment. The food, the vibes, the people—everything was top-notch. Looking forward to the next one!</p>]]></content><author><name>Pulkit Gera</name><email>testandplayalltime@gmail.com</email></author><category term="blogs" /><category term="hackathon" /><summary type="html"><![CDATA[Experience of building a Campaign Manager using GenAI]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://darthgera123.github.io/assets/img/veed/veed_hack.jpeg" /><media:content medium="image" url="https://darthgera123.github.io/assets/img/veed/veed_hack.jpeg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Let There Be Light! Diffusion Models and the Future of Relighting</title><link href="https://darthgera123.github.io/blogs/neural_rendering/relighting-diffusion/" rel="alternate" type="text/html" title="Let There Be Light! Diffusion Models and the Future of Relighting" /><published>2024-10-10T00:00:00+00:00</published><updated>2024-10-10T00:00:00+00:00</updated><id>https://darthgera123.github.io/blogs/neural_rendering/relighting-diffusion</id><content type="html" xml:base="https://darthgera123.github.io/blogs/neural_rendering/relighting-diffusion/"><![CDATA[<!-- ## Introduction -->
<ul id="markdown-toc">
  <li><a href="#solving-relighting" id="markdown-toc-solving-relighting">Solving Relighting</a></li>
  <li><a href="#diffusion-models" id="markdown-toc-diffusion-models">Diffusion Models</a></li>
  <li><a href="#dilightnet" id="markdown-toc-dilightnet">DiLightNet</a>    <ul>
      <li><a href="#introduction" id="markdown-toc-introduction">Introduction</a></li>
      <li><a href="#method" id="markdown-toc-method">Method</a></li>
      <li><a href="#implementation" id="markdown-toc-implementation">Implementation</a></li>
      <li><a href="#results" id="markdown-toc-results">Results</a></li>
      <li><a href="#limitations" id="markdown-toc-limitations">Limitations</a></li>
    </ul>
  </li>
  <li><a href="#neural-gaffer" id="markdown-toc-neural-gaffer">Neural Gaffer</a>    <ul>
      <li><a href="#introduction-1" id="markdown-toc-introduction-1">Introduction</a></li>
      <li><a href="#method-1" id="markdown-toc-method-1">Method</a></li>
      <li><a href="#implementation-1" id="markdown-toc-implementation-1">Implementation</a></li>
      <li><a href="#results-1" id="markdown-toc-results-1">Results</a></li>
      <li><a href="#limitations-1" id="markdown-toc-limitations-1">Limitations</a></li>
    </ul>
  </li>
  <li><a href="#relightful-harmonization" id="markdown-toc-relightful-harmonization">Relightful Harmonization</a>    <ul>
      <li><a href="#introduction-2" id="markdown-toc-introduction-2">Introduction</a></li>
      <li><a href="#method-2" id="markdown-toc-method-2">Method</a></li>
      <li><a href="#implementation-2" id="markdown-toc-implementation-2">Implementation</a></li>
      <li><a href="#results-2" id="markdown-toc-results-2">Results</a></li>
      <li><a href="#limitations-2" id="markdown-toc-limitations-2">Limitations</a></li>
    </ul>
  </li>
  <li><a href="#multi-illumination-synthesis" id="markdown-toc-multi-illumination-synthesis">Multi-Illumination Synthesis</a>    <ul>
      <li><a href="#introduction-3" id="markdown-toc-introduction-3">Introduction</a></li>
      <li><a href="#method-3" id="markdown-toc-method-3">Method</a></li>
      <li><a href="#implementation-3" id="markdown-toc-implementation-3">Implementation</a></li>
      <li><a href="#results-3" id="markdown-toc-results-3">Results</a></li>
      <li><a href="#limitations-3" id="markdown-toc-limitations-3">Limitations</a></li>
    </ul>
  </li>
  <li><a href="#lightit" id="markdown-toc-lightit">Lightit</a>    <ul>
      <li><a href="#introduction-4" id="markdown-toc-introduction-4">Introduction</a></li>
      <li><a href="#method-4" id="markdown-toc-method-4">Method</a></li>
      <li><a href="#implementation-4" id="markdown-toc-implementation-4">Implementation</a></li>
      <li><a href="#results-4" id="markdown-toc-results-4">Results</a></li>
      <li><a href="#limitations-4" id="markdown-toc-limitations-4">Limitations</a></li>
    </ul>
  </li>
  <li><a href="#takeaways" id="markdown-toc-takeaways">Takeaways</a></li>
  <li><a href="#references" id="markdown-toc-references">References</a></li>
</ul>

<p>Relighting is the task of rendering a scene under a specified target lighting condition, given an input scene. This is a crucial task in computer vision and graphics. However, it is an ill-posed problem, because the appearance of an object in a scene results from a complex interplay between factors like the light source, the geometry, and the material properties of the surface. These interactions create ambiguities. For instance, given a photograph of a scene, is a dark spot on an object due to a shadow cast by lighting or is the material itself dark in color? Distinguishing between these factors is key to effective relighting.
In this blog post, we discuss how different papers are tackling the problem of relighting via diffusion models. Relighting encompasses a variety of subproblems, including simple lighting adjustments, image harmonization, shadow removal and intrinsic decomposition. These areas are essential for refining scene edits, such as balancing color and shadow across composited images or decoupling material and lighting properties. We will first introduce the problem of relighting and briefly discuss diffusion models and ControlNets. We will then discuss different approaches that solve the problem of relighting in different types of scenes, ranging from single objects to portraits to large scenes.</p>

<h1 id="solving-relighting">Solving Relighting</h1>
<p>The goal is to decompose the scene into its fundamental components such as geometry, material, and light interactions and model them parametrically. Once solved then we can change it according to our preference. The appearance of a point in the scene can be described by the rendering equation as follows:
<img src="/assets/img/diffusion/rendering.png" alt="NERF" class="tail" width="640" height="540" loading="lazy" /></p>
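<p>In standard notation (which may differ slightly from the figure above), the rendering equation reads:</p>

<p>\[ L_o(\mathbf{x}, \omega_o) = L_e(\mathbf{x}, \omega_o) + \int_{\Omega} f_r(\mathbf{x}, \omega_i, \omega_o)\, L_i(\mathbf{x}, \omega_i)\, (\omega_i \cdot \mathbf{n})\, d\omega_i \]</p>

<p>Here \(L_o\) is the outgoing radiance at point \(\mathbf{x}\) in direction \(\omega_o\), \(L_e\) is the emitted radiance, \(f_r\) is the BRDF encoding the material, \(L_i\) is the incoming radiance and \(\mathbf{n}\) is the surface normal.</p>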

<p>Most methods aim to solve for each individual component of the rendering equation. Once solved, we can perform relighting and material editing. Since the lighting term L appears on both sides, this equation cannot be evaluated analytically and is solved either via Monte Carlo methods or approximation-based approaches.
An alternate approach is data-driven learning, where instead of explicitly modeling the scene properties, a model learns them directly from data. For example, instead of fitting a parametric function, a network can learn the material properties of the surface from data. Data-driven approaches have proven to be more powerful than parametric approaches. However, they require a huge amount of high-quality data, which is really hard to collect, especially for lighting and material estimation tasks.
<img src="/assets/img/diffusion/mpi.jpg" alt="MPI-Lightstage" class="tail" width="640" height="540" loading="lazy" />
<img src="/assets/img/diffusion/lightstage.jpg" alt="Lightstage" class="tail" width="640" height="540" loading="lazy" /></p>

<p>Datasets for lighting and material estimation are rare, as they require expensive, complex setups such as light stages to capture detailed lighting interactions. These setups are accessible to only a few organizations, limiting the availability of data for training and evaluation. There are no full-body ground-truth light stage datasets publicly available, which further highlights this challenge.</p>
<h1 id="diffusion-models">Diffusion Models</h1>
<p>Computer vision has experienced a significant transformation with the advent of pre-training on vast amounts of image and video data available online. This has led to the development of foundation models, which serve as powerful general-purpose models that can be fine-tuned for a wide range of specific tasks. Diffusion models work by learning to model the underlying data distribution from independent samples, gradually reversing a noise-adding process to generate realistic data. By leveraging their ability to generate high-quality samples from learned distributions, diffusion models have become essential tools for solving a diverse set of generative tasks.
<img src="/assets/img/diffusion/stablediffusion.png" alt="StableDiffusion" class="tail" width="640" height="540" loading="lazy" /></p>
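<p>Concretely, in the standard DDPM-style formulation the forward process corrupts a clean sample \(x_0\) with Gaussian noise, and the network \(\epsilon_\theta\) is trained to predict that noise (notation is generic, not tied to any specific paper below):</p>

<p>\[ q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big), \qquad \mathcal{L} = \mathbb{E}_{x_0, \epsilon, t}\big[\lVert \epsilon - \epsilon_\theta(x_t, t, c) \rVert^2\big] \]</p>

<p>The conditioning signal \(c\) (a text prompt, a lighting representation, etc.) is exactly the hook that the relighting methods discussed below exploit.</p>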

<p>One of the most prominent examples of such a foundation model is Stable Diffusion (SD), which was trained on the large-scale LAION-5B dataset consisting of 5 billion image-text pairs. It has encoded a wealth of general knowledge about visual concepts, making it suitable for fine-tuning on specific tasks. It has learnt fundamental relationships and associations during training, such as chairs having four legs or the typical structure of cars. This intrinsic understanding has allowed Stable Diffusion to generate highly coherent and realistic images and to be fine-tuned to predict other modalities. Based on this idea, the question arises whether we can leverage pretrained SD to solve the problem of scene relighting.</p>

<p>So how do we fine-tune LDMs? A naive approach is transfer learning: freeze the early layers (which capture general features) and fine-tune the model on the specific task. While this approach has been used by some papers such as Alchemist (for material transfer), it requires a large amount of paired data for the model to generalize well. Another drawback of this approach is the risk of catastrophic forgetting, where the model loses the knowledge gained during pretraining, limiting its ability to generalize across varied conditions.</p>

<p><img src="/assets/img/diffusion/controlnet.png" alt="ControlNet" class="tail" width="640" height="540" loading="lazy" /></p>

<p>Another approach to fine-tuning such large models is by introducing a ControlNet. Here, a copy of the network is made and the weights of the original network are frozen. During training, only the duplicate network’s weights are updated, and the conditioning signal is passed as input to the duplicate network. The original network continues to leverage its pretrained knowledge.</p>

<p>While this increases the memory footprint, the advantage is that we don’t lose the generalization capabilities acquired from training on large-scale datasets. The model retains its ability to generate high-quality outputs across a wide range of prompts while learning the task-specific relationships needed for the current task.</p>

<p>Additionally, it helps the model learn robust and meaningful connections between the control input and the desired output. By decoupling the control network from the core model, it avoids the risk of overfitting or catastrophic forgetting. It also needs significantly less paired data to train.</p>

<p>While there are other techniques for fine-tuning foundational models - such as LoRA (Low-Rank Adaptation) and others - we will focus on the two methods discussed: traditional transfer learning and ControlNet. These approaches are particularly relevant for understanding how various papers have tackled image-based relighting using diffusion models.</p>
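<p>As a rough mental model, here is a minimal PyTorch-style sketch of the ControlNet idea (the toy block, layer names and shapes are illustrative assumptions, not the actual ControlNet implementation): clone a pretrained block, freeze the original, and inject the clone’s output through a zero-initialized convolution so that training starts as a no-op.</p>

<pre><code class="language-python">import copy
import torch
import torch.nn as nn

class ControlledBlock(nn.Module):
    """Wraps one frozen backbone block with a trainable ControlNet-style copy."""

    def __init__(self, block, in_channels, out_channels, cond_channels):
        super().__init__()
        self.trainable = copy.deepcopy(block)       # duplicate that gets trained
        self.frozen = block
        for p in self.frozen.parameters():
            p.requires_grad_(False)                 # original weights stay fixed
        self.cond_in = nn.Conv2d(cond_channels, in_channels, 1)
        self.zero_conv = nn.Conv2d(out_channels, out_channels, 1)
        nn.init.zeros_(self.zero_conv.weight)       # zero-init so control starts as a no-op
        nn.init.zeros_(self.zero_conv.bias)

    def forward(self, x, cond):
        base = self.frozen(x)                                # pretrained pathway
        ctrl = self.trainable(x + self.cond_in(cond))        # conditioned duplicate
        return base + self.zero_conv(ctrl)                   # residual control signal

# Toy usage: a 64-channel block conditioned on a 12-channel signal (e.g. radiance cues).
block = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.SiLU())
layer = ControlledBlock(block, in_channels=64, out_channels=64, cond_channels=12)
x = torch.randn(1, 64, 32, 32)
cond = torch.randn(1, 12, 32, 32)
out = layer(x, cond)   # equals block(x) at initialization, since zero_conv outputs zeros
</code></pre>

<p>Because the zero convolution outputs zeros at initialization, the wrapped block initially behaves exactly like the frozen pretrained block, and the conditioning pathway is learned gradually without disturbing the pretrained weights.</p>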

<h1 id="dilightnet"><a href="https://dilightnet.github.io/">DiLightNet</a></h1>
<p><img src="/assets/img/diffusion/dilight-teaser.png" alt="Dilight-teaser" class="tail" width="640" height="540" loading="lazy" /></p>
<h2 id="introduction">Introduction</h2>
<p>This work proposes fine-grained control over relighting of an input image. The input image can either be generated or provided by the user. It can also change the material of the object based on a text prompt. The objective is to exert fine-grained control over the effects of lighting.</p>
<h2 id="method">Method</h2>
<p><img src="/assets/img/diffusion/dilight-method.png" alt="Dilight-teaser" class="tail" width="640" height="540" loading="lazy" /></p>

<p>Given an input image, the following preprocessing steps are applied:</p>
<ul>
  <li>Estimate the background and depth map using off-the-shelf SOTA models.</li>
  <li>Extract a mesh by triangulating the depth map.</li>
  <li>Generate 4 different radiance cue images. These are created by assigning the extracted mesh different materials and rendering it under the target lighting. The radiance cue images act as a basis for encoding lighting effects such as speculars, shadows and global illumination.
<img src="/assets/img/diffusion/dilight-input.png" alt="Dilight-input" class="tail" width="640" height="540" loading="lazy" /></li>
</ul>

<p>Once these images are generated, they train a ControlNet module. The input image and the mask are passed through an encoder-decoder network which outputs a 12-channel feature map. This is then multiplied with the radiance cue images, which are concatenated together channel-wise. Thus, during training, the noisy target image is denoised with this custom 12-channel image as the conditioning signal.</p>

<p>Additionally, an appearance seed is provided to produce a consistent appearance under different illumination. Without it, the network renders a different interpretation of the light-matter interaction each time. One can also provide more cues via text to alter the appearance, such as adding “plastic” or “shiny metallic” to change the material of the generated image.</p>
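<p>A minimal sketch of how this conditioning signal might be assembled (tensor shapes follow the description above; the encoder here is a simple stand-in for the actual encoder-decoder network):</p>

<pre><code class="language-python">import torch
import torch.nn as nn

def build_condition(image, mask, radiance_cues, encoder):
    """
    image:         (B, 3, H, W) provisional / input image
    mask:          (B, 1, H, W) foreground mask
    radiance_cues: (B, 4, 3, H, W) four renders of the extracted mesh under the target light
    encoder:       any network mapping a (B, 4, H, W) input to a (B, 12, H, W) feature map
    """
    feat = encoder(torch.cat([image, mask], dim=1))   # (B, 12, H, W) learned feature map
    cues = radiance_cues.flatten(1, 2)                # (B, 12, H, W) channel-wise concatenation
    return feat * cues                                # element-wise product fed to the ControlNet

# Toy usage with a single convolution standing in for the encoder-decoder network.
encoder = nn.Conv2d(4, 12, 3, padding=1)
cond = build_condition(torch.randn(2, 3, 64, 64), torch.rand(2, 1, 64, 64),
                       torch.randn(2, 4, 3, 64, 64), encoder)
</code></pre>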

<h2 id="implementation">Implementation</h2>
<p>The dataset was curated using 25K synthetic objects from Objaverse. Each object was rendered from 4 unique views and lit under 12 different lighting conditions, ranging from single and multiple point lights to environment maps and area lights. For training, the radiance cues were rendered in Blender.</p>

<p>The ControlNet module uses Stable Diffusion v2.1 as the base pretrained model. Training took roughly 30 hours on 8x NVIDIA V100 GPUs, and the training data was rendered in Blender at 512x512 resolution.</p>

<h2 id="results">Results</h2>
<p><img src="/assets/img/diffusion/dilight-results1.png" alt="Dilight-results" class="tail" width="640" height="540" loading="lazy" /></p>

<p>This figure shows the provisional image as reference and the corresponding target lighting under which the object is relit.</p>

<p><img src="/assets/img/diffusion/dilight-results2.png" alt="Dilight-results2" class="tail" width="640" height="540" loading="lazy" /></p>

<p>This figure shows how the text prompt can be used to change the material of the object.
<img src="/assets/img/diffusion/dilight-results3.png" alt="Dilight-results3" class="tail" width="640" height="540" loading="lazy" /></p>

<p>This figure shows more results of AI generated provisional images that are then rendered under different input environment light conditions.
<img src="/assets/img/diffusion/dilight-results-app.png" alt="Dilight-results4" class="tail" width="640" height="540" loading="lazy" /></p>

<p>This figure shows the different solutions the network comes up with to resolve the light interaction when the appearance seed is not fixed.</p>
<h2 id="limitations">Limitations</h2>

<p>Due to training on synthetic objects, the method does not handle real images very well and works much better with AI-generated provisional images. Additionally, the material-light interaction might not follow the intention of the prompt. Since it relies on depth maps for generating radiance cues, it may fail to produce satisfactory results when the depth estimate is poor. Finally, generating a video with a rotating light may not yield temporally consistent results.</p>

<h1 id="neural-gaffer"><a href="https://neural-gaffer.github.io/">Neural Gaffer</a></h1>
<p><img src="/assets/img/diffusion/gaffer_results1.png" alt="Gaffer-teaser" class="tail" width="640" height="540" loading="lazy" /></p>

<h2 id="introduction-1">Introduction</h2>

<p>This work proposes an end-to-end 2D relighting diffusion model. The model learns physical priors from a synthetic dataset featuring physically based materials and HDR environment maps. It can further be used to relight multiple views and to create a 3D representation of the scene.</p>

<h2 id="method-1">Method</h2>
<p><img src="/assets/img/diffusion/gaffer_method.png" alt="Gaffer-method" class="tail" width="540" height="740" loading="lazy" /></p>

<p>Given an image and a target HDR environment map, the goal is to learn a model that can synthesize a relit version of the image, which here is a single object. This is achieved by adapting a pre-trained Zero-1-to-3 model. Zero-1-to-3 is a diffusion model that is conditioned on the view direction to render novel views of an input image. They discard its novel view synthesis components. To incorporate lighting conditions, they concatenate the input image and environment map encodings with the denoising latent.</p>

<p>The input HDR environment map E is split into two components: E_l, a tone-mapped LDR representation capturing lighting details in low-intensity regions, and E_h, a log-normalized map preserving information across the full spectrum. Together, these provide the network with a balanced representation of the energy spectrum, ensuring accurate relighting without the generated output appearing washed out due to extreme brightness.</p>

<p>Additionally, the CLIP embedding of the input image is passed as input. Thus the inputs to the model are the input image, the LDR environment map, the log-normalized HDR environment map and the CLIP embedding of the input image, all conditioning the denoising network. This network is then used as a prior for further 3D object relighting.</p>
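<p>A rough sketch of the environment map preprocessing (the exact tone-mapping operator and normalization constants here are assumptions, chosen only to illustrate the LDR/log split described above):</p>

<pre><code class="language-python">import torch

def split_env_map(env_hdr, eps=1e-6):
    """env_hdr: (3, H, W) linear HDR environment map with non-negative values."""
    # E_l: tone-mapped LDR view, keeps detail in low-intensity regions
    e_l = env_hdr / (1.0 + env_hdr)                 # Reinhard-style tone map (assumed operator)
    # E_h: log-normalized view, preserves information across the full energy range
    log_e = torch.log(env_hdr + eps)
    e_h = (log_e - log_e.min()) / (log_e.max() - log_e.min() + eps)
    return e_l, e_h                                 # both roughly in [0, 1]
</code></pre>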

<h2 id="implementation-1">Implementation</h2>
<p>The model is trained on a custom Relit Objaverse dataset that consists of 90K objects. For each object there are 204 images rendered under different lighting conditions and viewpoints. In total, the dataset consists of 18.4M images at 512x512 resolution.</p>

<p>The model is fine-tuned from Zero-1-to-3’s checkpoint, and only the denoising network is fine-tuned. The input environment map is downsampled to 256x256 resolution. The model is trained on 8 A6000 GPUs for 5 days. Downstream tasks such as text-based relighting and object insertion can also be achieved.</p>

<h2 id="results-1">Results</h2>
<p><img src="/assets/img/diffusion/gaffer_results4.png" alt="Gaffer-results" class="tail" width="640" height="540" loading="lazy" /></p>

<p>This figure compares the relighting results of their method with IC-Light, another ControlNet based method. Their method can produce consistent lighting and color with the rotating environment map.</p>

<p><img src="/assets/img/diffusion/gaffer_results2.png" alt="Gaffer-results" class="tail" width="640" height="540" loading="lazy" /></p>

<p>This figure compares the relighting results of their method with DiLightNet, another ControlNet-based method. Their method can produce specular highlights and accurate colors.</p>

<h2 id="limitations-1">Limitations</h2>

<p>A major limitation is that it only produces low-resolution images (256x256). Additionally, it only works on objects and performs poorly for portrait relighting.</p>

<h1 id="relightful-harmonization"><a href="https://arxiv.org/abs/2312.06886">Relightful Harmonization</a></h1>
<p><img src="/assets/img/diffusion/harm-teaser.png" alt="Harmonization teaser" class="tail" width="640" height="540" loading="lazy" /></p>

<h2 id="introduction-2">Introduction</h2>

<p>Image Harmonization is the process of aligning the color and lighting features of the foreground subject with the background to make it a plausible composition. This work proposes a diffusion based approach to solve the task.</p>

<p><img src="/assets/img/diffusion/harm-method.png" alt="Harmonization method" class="tail" width="640" height="540" loading="lazy" /></p>

<h2 id="method-2">Method</h2>

<p>Given an input composite image, an alpha mask and a target background, the goal is to predict a relit portrait image. This is achieved by training a ControlNet to predict the harmonized image output.</p>

<p>In the first stage, a background ControlNet model is trained that takes the composite image and target background as input and outputs a relit portrait image. During training, the denoising network takes the noisy target image concatenated with the composite image and predicts the noise. The background is provided as conditioning via the ControlNet. Since background images by themselves are LDR, they do not provide a sufficient signal for relighting purposes.</p>

<p>In the second stage, an environment map ControlNet model is trained. The HDR environment map provides a much stronger signal for relighting, which gives far better results. However, at test time users only provide LDR backgrounds. To bridge this gap, the two ControlNet models are aligned with each other.</p>

<p>Finally, more data is generated using the environment map ControlNet model, and the background ControlNet model is fine-tuned on it to generate more photorealistic results.</p>
<h2 id="implementation-2">Implementation</h2>

<p>The dataset used for training consists of 400k image-pair samples curated using a light stage. In the third stage, an additional 200k synthetic samples were generated for the photorealism fine-tuning.</p>

<p>The model is finetuned from the InstructPix2Pix checkpoint and is trained on 8 A100 GPUs at 512x512 resolution.</p>

<h2 id="results-2">Results</h2>

<p><img src="/assets/img/diffusion/harm-results2.png" alt="Harm results" class="tail" width="640" height="540" loading="lazy" />
<img src="/assets/img/diffusion/harm-results1.png" alt="Harm res2" class="tail" width="540" height="640" loading="lazy" />
<img src="/assets/img/diffusion/harm-results4.png" alt="Harm res3" class="tail" width="640" height="540" loading="lazy" /></p>

<p>The figures show results on real-world test subjects. Their method is able to remove shadows and make the composition more plausible compared to other methods.</p>

<h2 id="limitations-2">Limitations</h2>

<p>While this method can plausibly relight the subject, it is not great at identity preservation and struggles to maintain the color of clothes or hair. It may also fail to eliminate shadows properly. Finally, it does not estimate albedo, which is crucial for modeling complex light interactions.</p>

<h1 id="multi-illumination-synthesis"><a href="https://repo-sam.inria.fr/fungraph/generative-radiance-field-relighting/content/paper.pdf">Multi-Illumination Synthesis</a></h1>
<p><img src="/assets/img/diffusion/multi-teaser.png" alt="multi teaser" class="tail" width="640" height="540" loading="lazy" /></p>

<h2 id="introduction-3">Introduction</h2>
<p>This work proposes a 2D relighting diffusion model that is further used to relight a radiance field of a scene. It first trains a ControlNet model to predict the scene under novel light directions. Then this model is used to generate more data which is eventually used to fit a relightable radiance field. We discuss the 2D relighting model in this section.</p>
<h2 id="method-3">Method</h2>
<p><img src="/assets/img/diffusion/multi-method.png" alt="multi method" class="tail" width="640" height="540" loading="lazy" /></p>

<p>Given a set of images X_i with a corresponding depth map D (computed via off-the-shelf methods) and light directions l_i, the goal is to predict the scene under a new light direction l_j. During training, the input to the denoising network is X_i under a random illumination and the depth map D, concatenated with the noisy target image X_j. The light direction is encoded with 4th-order spherical harmonics and injected as conditioning via the ControlNet.</p>

<p>Although this leads to decent results, there are some significant problems: the model does not preserve colors well, loses contrast, and produces distorted edges. To compensate for the color shift, they color-match the predictions to the input image by converting both to LAB space and normalizing each channel; the loss is then taken between the ground truth and the denoised output. To preserve edges, the decoder was pretrained on image inpainting tasks. This network is then used to synthesize the scene under novel light directions, which is in turn used to build a relightable radiance field representation.</p>
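
<p>The color-matching step can be sketched in a few lines. This is only an illustration of the idea, assuming float RGB images in [0, 1]; the exact normalization used in the paper may differ.</p>

<pre><code class="language-python">import numpy as np
from skimage import color

def match_color(pred_rgb, ref_rgb):
    # Convert both images to LAB and normalize each channel of the
    # prediction to the reference's per-channel mean and std.
    pred_lab = color.rgb2lab(pred_rgb)
    ref_lab = color.rgb2lab(ref_rgb)
    matched = np.empty_like(pred_lab)
    for c in range(3):
        p, r = pred_lab[..., c], ref_lab[..., c]
        matched[..., c] = (p - p.mean()) / (p.std() + 1e-6) * r.std() + r.mean()
    return color.lab2rgb(matched)
</code></pre>
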
<h2 id="implementation-3">Implementation</h2>
<p><img src="/assets/img/diffusion/multi-result2.png" alt="multi method" class="tail" width="640" height="540" loading="lazy" /></p>

<p>The method was developed on the Multi-Illumination dataset, which consists of 1000 real indoor scenes captured under 25 lighting directions. Each image also contains a diffuse and a metallic sphere, which are useful for obtaining the light direction in world coordinates. Additionally, some more scenes were rendered in Blender. The network was trained on images at 1536x1024 resolution, using 18 non-front-facing light directions over 1015 indoor scenes.</p>

<p>The ControlNet module was trained using the Stable Diffusion v2.1 model as the backbone. It was trained on multiple A6000 GPUs for 150K iterations.</p>
<h2 id="results-3">Results</h2>
<p><img src="/assets/img/diffusion/multi-results1.png" alt="multi res1" class="tail" width="640" height="540" loading="lazy" /></p>

<p>Here the diffuse spheres show the test-time light directions. As can be seen, the method renders plausible relighting results.</p>

<p><img src="/assets/img/diffusion/multi-result3.png" alt="multi res2" class="tail" width="640" height="540" loading="lazy" /></p>

<p>This figure shows how the specular highlights and shadows move with the changing light direction, as is evident from the shiny highlight on the kettle.</p>

<p><img src="/assets/img/diffusion/multi-illum.png" alt="multi res2" class="tail" width="640" height="540" loading="lazy" /></p>

<p>This figure compares results with other relightable radiance field methods. Their method clearly preserves color and contrast much better than the other methods.</p>
<h2 id="limitations-3">Limitations</h2>

<p>The method does not enforce physical accuracy and can produce incorrect shadows. It also struggles to remove existing shadows in a fully accurate way. Further, it only works reasonably on out-of-distribution scenes when the variation in lighting is small.</p>

<h1 id="lightit"><a href="https://arxiv.org/pdf/2403.10615">Lightit</a></h1>
<p><img src="/assets/img/diffusion/lightit-teaser.png" alt="lightit teaser" class="tail" width="640" height="540" loading="lazy" /></p>
<h2 id="introduction-4">Introduction</h2>
<p>This work proposes a single-view shading estimation method to generate paired images and their corresponding direct-light shading. This shading can then be used to guide generation and to relight a scene. They approach the problem as intrinsic decomposition, where the scene is split into reflectance and shading. We discuss the relighting component here.</p>
<h2 id="method-4">Method</h2>
<p><img src="/assets/img/diffusion/lightit-method.png" alt="lightit method" class="tail" width="640" height="540" loading="lazy" /></p>

<p>Given an input image, its corresponding surface normal, text conditioning and a target direct shading image, they generate a relit stylized image. This is achieved by training a ControlNet module.</p>

<p>During training, the noisy target image is passed to the denoising network along with the text conditioning. The normal and target direct-shading images are concatenated and passed through a Residual Control Encoder, and the resulting feature map is used to condition the network. Additionally, the feature map is reconstructed back via a Residual Control Decoder to regularize training.</p>
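
<p>A rough sketch of the conditioning path is shown below. The <code>encoder</code> and <code>decoder</code> callables stand in for the Residual Control Encoder and Decoder described above; this is an illustration of the idea, not the authors’ implementation.</p>

<pre><code class="language-python">import torch
import torch.nn.functional as F

def control_features_and_reg_loss(encoder, decoder, normals, shading):
    # Normals and the target direct-shading image are concatenated and
    # encoded into a feature map that conditions the denoising network.
    cond = torch.cat([normals, shading], dim=1)
    features = encoder(cond)
    # Decoding the features back and penalizing the reconstruction error
    # regularizes the control signal during training.
    recon = decoder(features)
    reg_loss = F.mse_loss(recon, cond)
    return features, reg_loss
</code></pre>
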
<h2 id="implementation-4">Implementation</h2>
<p><img src="/assets/img/diffusion/lightit-data.png" alt="lightit method" class="tail" width="640" height="540" loading="lazy" /></p>

<p>The dataset is built from the Outdoor Laval dataset, which consists of real-world outdoor HDR panoramas. From each panorama, 250 crops of 512x512 are extracted and various camera effects are applied, giving 51250 samples of LDR images and text prompts along with estimated normal and shading maps. The normal maps were estimated from depth maps obtained with off-the-shelf estimators.</p>

<p>The ControlNet module was finetuned from Stable Diffusion v1.5 and trained for two epochs. Other training details are not shared.</p>
<h2 id="results-4">Results</h2>
<p><img src="/assets/img/diffusion/lightit-results1.png" alt="lightit res1" class="tail" width="640" height="540" loading="lazy" /></p>

<p>This figure shows that the generated images feature lighting consistent with the target shading for custom stylized text prompts. This sets it apart from the other papers discussed, whose sole focus is photorealism.</p>

<p><img src="/assets/img/diffusion/lightit-results2.png" alt="lightit res2" class="tail" width="640" height="540" loading="lazy" /></p>

<p>This figure shows identity preservation under different lighting conditions.</p>

<p><img src="/assets/img/diffusion/lightit-results3.png" alt="lightit res3" class="tail" width="640" height="540" loading="lazy" /></p>

<p>This figure shows results on different styles and scenes under changing lighting conditions.</p>

<p><img src="/assets/img/diffusion/lightit-results4.png" alt="lightit res4" class="tail" width="640" height="540" loading="lazy" /></p>

<p>This figure compares relighting with another method. Utilizing the diffusion prior helps with generalization and with resolving shading ambiguity.</p>
<h2 id="limitations-4">Limitations</h2>

<p>Since this method assumes directional lighting, it cannot handle arbitrary light sources. It requires shading cues to generate images, which are non-trivial to obtain. Further, the method does not work for portraits or indoor scenes.</p>

<h1 id="takeaways">Takeaways</h1>
<p>We have discussed a non-exhaustive list of papers that leverage 2D diffusion models for relighting. We explored different ways to condition diffusion models for relighting, ranging from radiance cues and direct shading images to light directions and environment maps. Most of these methods show results on synthetic datasets and don’t generalize well to out-of-distribution data. More papers are coming out every day and the base models are also improving. Recently <a href="https://github.com/lllyasviel/IC-Light/discussions/98">IC-Light2</a> was released, which is a ControlNet model built upon Flux models. It will be interesting to see which direction it takes, as maintaining identity is tricky.</p>

<h1 id="references">References</h1>
<ul>
  <li><a href="https://github.com/lllyasviel/IC-Light">GitHub — lllyasviel/IC-Light: More relighting!</a></li>
  <li><a href="https://illuminerf.github.io/">IllumiNeRF — 3D Relighting without Inverse Rendering</a></li>
  <li><a href="https://neural-gaffer.github.io/">Neural Gaffer</a></li>
  <li><a href="https://dilightnet.github.io/">DiLightNet: Fine-grained Lighting Control for Diffusion-based Image Generation</a></li>
  <li><a href="https://arxiv.org/pdf/2312.06886">Relightful Harmonization</a></li>
  <li><a href="https://repo-sam.inria.fr/fungraph/generative-radiance-field-relighting/">A Diffusion Approach to Radiance Field Relighting using Multi-Illumination Synthesis</a></li>
  <li><a href="https://theaisummer.com/diffusion-models/">How diffusion models work: the math from scratch : AI Summer</a></li>
  <li><a href="https://arxiv.org/pdf/2403.18103">Tutorial on Diffusion Models for Imaging and Vision</a></li>
  <li><a href="https://www.youtube.com/watch?v=a4Yfz2FxXiY">Diffusion models from scratch in PyTorch</a></li>
  <li><a href="https://www.youtube.com/watch?v=S_il77Ttrmg">Diffusion Models — Live Coding Tutorial</a></li>
  <li><a href="https://www.youtube.com/watch?v=HoKDTa5jHvg">Diffusion Models - Paper Explanation - Math Explained</a></li>
  <li><a href="https://www.youtube.com/watch?v=i2qSxMVeVLI">How I Understand Diffusion Models</a> by Prof Jia Bin Huang</li>
  <li><a href="https://www.youtube.com/watch?v=H45lF4sUgiE">Denoising Diffusion Probabilistic Models:  DDPM Explained  Good intuition of math of diffusion models</a></li>
</ul>]]></content><author><name>Pulkit Gera</name><email>testandplayalltime@gmail.com</email></author><category term="blogs" /><category term="neural_rendering" /><summary type="html"><![CDATA[Discussing methods that achieve single image relighting with Diffusion models]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://darthgera123.github.io/assets/img/diffusion/teaser.png" /><media:content medium="image" url="https://darthgera123.github.io/assets/img/diffusion/teaser.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">No Chill, Just PhD; A Brutally Honest Application Journey</title><link href="https://darthgera123.github.io/blogs/advice/phd_advice/" rel="alternate" type="text/html" title="No Chill, Just PhD; A Brutally Honest Application Journey" /><published>2022-08-08T00:00:00+00:00</published><updated>2022-08-08T00:00:00+00:00</updated><id>https://darthgera123.github.io/blogs/advice/phd_advice</id><content type="html" xml:base="https://darthgera123.github.io/blogs/advice/phd_advice/"><![CDATA[<ul class="large-only" id="markdown-toc">
  <li><a href="#into-the-research-verse" id="markdown-toc-into-the-research-verse">Into the research-verse</a></li>
  <li><a href="#maple-dreams" id="markdown-toc-maple-dreams">Maple dreams</a></li>
  <li><a href="#leap-of-faith" id="markdown-toc-leap-of-faith">Leap of faith</a></li>
  <li><a href="#cold-hard-truths" id="markdown-toc-cold-hard-truths">Cold hard truths</a>    <ul>
      <li><a href="#logistics" id="markdown-toc-logistics">Logistics</a></li>
      <li><a href="#letter-of-recommendation" id="markdown-toc-letter-of-recommendation">Letter of Recommendation</a></li>
      <li><a href="#statement-of-purpose" id="markdown-toc-statement-of-purpose">Statement of Purpose</a></li>
    </ul>
  </li>
  <li><a href="#submission-time" id="markdown-toc-submission-time">Submission Time</a></li>
  <li><a href="#suit-up" id="markdown-toc-suit-up">Suit up</a></li>
  <li><a href="#plan-b" id="markdown-toc-plan-b">Plan B</a></li>
  <li><a href="#impossible" id="markdown-toc-impossible">(Im)possible</a></li>
</ul>

<p>It was around 11 pm as I stared at my laptop anxiously at the TOEFL checkout page. I was waiting while my mom found a card that worked for international transactions. I felt like a bowler at the top of his run-up who had visualized all possible futures but was still nervous about it. After all, if everything went to plan, it would mean leaving all my friends and family behind and venturing into a new country. It would mean eating cuppa noodles and earning peanuts for the next four to five years while my friends made fat stacks. It was a considerable risk, but “if you keep holding onto yesterday, what will you be tomorrow?”</p>

<p>Getting admission to a Ph.D. program at CMU is not impossible. One of my batchmates got in. But it’s really, really hard. While most of you bested JEE to enter IIIT-H, the competition for CMU is international. Your competition consists of a rich, white American student with perfect referees, recommendations, and resources. A French student with multiple collaborative projects. A student from a premier Chinese university with multiple A* conference publications. While you are an enthusiastic Indian student with one decent publication and one killer recommendation. Is that enough to take on the world, find a perfect advisor and get into one of the most competitive programs in the world?? Let me help you understand through my journey, which is not CMU but its German counterpart, MPI-INF, Saarbrucken.</p>

<h2 id="into-the-research-verse">Into the research-verse</h2>
<p>My journey started when I joined IIIT Hyderabad in 2017 in the B.Tech+MS program. All I knew was the PCM taught in the coaching centers and had no clue what research entailed or even meant. All my batchmates were engrossed in competitive programming for the first two years or solving issues in open source projects. While I enjoyed the latter, I didn’t get selected for GSoC twice, which broke my spirit. At the same time, I was introduced to machine learning by one of my seniors. Although he made it sound fascinating, I was pretty disappointed as it looked closer to statistics instead of computer science.</p>

<p><img src="/assets/img/advice/stylegan.png" alt="StyleGAN" class="tail" width="640" height="540" loading="lazy" /></p>

<p>At the end of my second year, I interned with DreamVU and joined CVIT under <a href="https://faculty.iiit.ac.in/~pjn/">Prof. P.J Narayanan</a>. At this time, I spent a lot of time exploring deep learning and computer vision and came across this <a href="https://this-person-does-not-exist.com/en">link</a>. This was the moment when I fell in love with computer vision. Generating photorealistic faces from a bunch of random numbers felt too good to be true, but here I was refreshing the website and seeing faces generated at my whim. Mind you, this was 2019. The quality wasn’t as good as it is today, but even then, it was unbelievable.</p>

<p>Unfortunately, everything wasn’t exactly rosy for me. For starters, the group worked on the intersection of vision and graphics, a field that was just taking baby steps. The work was fascinating, but the learning curve was way too steep. Reading one paper took 2-3 days since I had to first understand the jargon and previous works to understand what was going on. It took me an entire semester to catch up with it. It was exhausting and soul-crushing, but the promise of amazing results kept me going. I worked with another senior who held my hand and helped me understand how to read papers and set up projects.
<img src="/assets/img/advice/nerf.jpg" alt="NeRF" class="tail" width="640" height="540" loading="lazy" /></p>

<p>However, I had no projects at the end of my third year. My “single-degree” friends had fancy internships, whereas the rest of my “dual-degree” friends either had ongoing projects or were preparing for internships. It was a low point for me. I had to make a decision and commit to it. At this time, <a href="https://www.matthewtancik.com/nerf">another paper dropped</a>, which reignited my passion for the field. I sat down with my advisor and decided on a project. Life seemed set again, and I made research my number one priority.</p>

<p>Except, a couple of months later, it all came crashing down as the project went beyond our scope. But I was more determined than ever. I spent hours reading papers and listening to webinars on these topics. Luckily, my senior came up with an exciting idea, and we followed it up.</p>

<h2 id="maple-dreams">Maple dreams</h2>
<p><img src="/assets/img/advice/laval.jpg" alt="Laval" class="tail" width="540" height="540" loading="lazy" /></p>

<p>In the meantime, instead of applying for industry internships, I applied to the <a href="https://www.mitacs.ca/en/programs/globalink/globalink-research-internship">MITACS program</a>, a funded research internship program for Canadian universities. My objective was to explore research as, at this point, I decided to work as a Research Engineer or Scientist. I was interviewed by <a href="http://vision.gel.ulaval.ca/~jflalonde/">Prof. Jean-François Lalonde</a>, who also worked in the same domain. My interview went reasonably well, and keeping my SOP aligned to his interests worked in my favor. I got my result in December, and after a long, long time, I finally got a win. I began to believe again.</p>

<p>With a month to go before the deadline for the conference, I went back to college and synced up with my senior. Amidst the covid restrictions and coursework, we worked days and nights and converted one working example into an ICCV submission. It was one of the most challenging times in my college life and also the most fruitful one since I learned the nuances of writing a paper firsthand. A couple of months later, my remote internship started, and finally, my life felt back on track.</p>

<p>Results came back from the conference in the middle of my fourth-year summer, and our paper had been rejected. I wasn’t too disappointed, but another paper had also scooped us. We decided to iron out some issues and submit it to the next conference before our paper became irrelevant. We tweaked the paper and were surprised to find that the paper that scooped us performed worse in our setting. Although the updated paper got accepted at ICVGIP, my advisor wasn’t happy with my efforts, and it was a wake-up call for me to push my limits.</p>

<p>On the other hand, my internship went well. I enjoyed working with Prof Jean and realized that research is approached differently in other places. For them, a publication is merely a checkpoint toward solving a more significant problem. It was an interesting way to look at things and intrigued me. At the end of the internship, he offered me a chance to continue working on the problem and include it in my thesis. I discussed this with my advisor, and he agreed to collaborate.</p>

<h2 id="leap-of-faith">Leap of faith</h2>
<p>This was when I had my moment of truth. Ultimately, placement season was coming up, and it clashed with the dates of Ph.D. deadlines. I had a long discussion with my advisors and many seniors about this, and I had to choose. It was a tough decision because I didn’t feel ready to start a Ph.D. and wasn’t even sure if I would get one. I realized that only working in industry research groups would add any significant value, and it was worth a shot applying in this cycle. At the end of the day, I had to get three killer LoRs and solid research projects on my SOP, and working as an SDE wouldn’t change my situation. And then, I took the first step, sat out of my college placements, and started preparing for TOEFL. I was scared and anxious because I was betting a lot on the back of very little. But <a href="https://www.youtube.com/watch?v=yoS74R-qKIY">Miles Morales</a> gave me enough motivation to take a leap of faith. 
<img src="/assets/img/advice/miles.png" alt="Miles" class="tail" width="540" height="540" loading="lazy" /></p>

<p>Last week of September 2021, I gave the TOEFL, which was surprisingly straightforward; the speaking section was the only tricky part. What I hadn’t realized was that I had already committed a huge mistake. Most people had approached professors at different universities from August onwards. Some had preliminary interviews, which had a huge say in their final results. On the other hand, I had just made a shortlist of 3 universities where I would apply, namely UMD, CMU, and Cornell. It took lengthy discussions with my seniors to make me realize that I would need to apply to at least ten universities in the US and make a priority list depending on which universities were safe, moderate, or ambitious.</p>

<h2 id="cold-hard-truths">Cold hard truths</h2>

<h3 id="logistics">Logistics</h3>
<p><img src="/assets/img/advice/cmu.jpg" alt="CMU" class="tail" width="540" height="540" loading="lazy" /></p>

<p>I found that in American and some Canadian Universities, you apply to the department, not the professor. An admission committee reviews your application, and if you are applying for a Ph.D., they may send your material to the professor mentioned. But ultimately, it’s the university’s decision if they want to admit you or not. Hence it becomes crucial that your GPAs and TOEFL/GRE scores clear the bar and that you have recommendations, preferably from people who are in some way associated with that university. In addition, a Ph.D. program is five years long, where you have to take courses for the first two years, and you may get a different advisor than you intended.</p>

<p>Further, you only get funded for nine months, and for three months, you must find some internships to feed yourself. Teaching Assistantships are extremely important since they help you with extra cash and are considered vital contributions to the university. Hence it is essential to have some experience with it beforehand. Application fees are pretty high and cost around 100 dollars per application.</p>

<p>In contrast, Ph.D. programs typically last three to four years in Europe. The project proposals are decided beforehand, and many of them are funded by industry, so there is less academic freedom compared to the US. Further, you apply to the professor here and not to the university. The professor directly conducts all interviews and usually guides you on funding and the contract. In Germany and France, funding typically comes from the government. In the UK, funding typically comes from industry, and one is also expected to pay a substantial fee per semester, which isn’t the case elsewhere. 
<img src="/assets/img/advice/ellis.png" alt="ELLIS" class="tail" width="540" height="540" loading="lazy" /></p>

<p>Furthermore, TAship is not as crucial for the selection and is a means of extra money. Also, you are funded all year round and only start internships in your pre-final year. In addition, there are programs like IMPRS and ELLIS which hire students. IMPRS is an MPI-Germany initiative where you are assigned two advisors from different groups. ELLIS, on the other hand, is a pan-Europe program where you are assigned two advisors from two different countries. They are very competitive programs, and you have multiple rounds of interviews before getting selected. In 2022, around 1300 students applied, and only 60 got selected.</p>

<p>Personally, it didn’t matter much which continent I went to, as the labs I was targeting were doing excellent work. Finding a supportive advisor and group instead of a branded university was far more important. However, it is essential to consider pay, cost of living, and lifestyle since one would not be earning much. It is much easier to survive in Bath than London on the same pay, for instance. This was also the reason I was hesitant to apply to China and South Korea since the language and cultural barriers would be a bridge too big to cross.</p>

<h3 id="letter-of-recommendation">Letter of Recommendation</h3>
<p>Another thing I realized was that the “Letter of Recommendation” largely determines your admission chances. Three letters are required in the US/Canada, and for most European countries, two are enough. These letters are mailed directly by the professors. The letters must come across as genuine since it is straightforward to smoke out the fake ones. Here is an <a href="https://mitadmissions.org/apply/parents-educators/writingrecs/">excellent guide</a> on what your letter should be like while applying.</p>

<p>Further, these letters must come from well-established and active professors. Connections matter a lot. Anybody would more likely select a student recommended by someone they know.
I can’t emphasize enough how big a role the LoRs play. Inform your professor in advance (ideally two months before the deadline) since they might get requests from multiple students, and each student might apply to ten different programs. Another thing to note is that some professors might only give you letters for two-three programs.</p>

<h3 id="statement-of-purpose">Statement of Purpose</h3>
<p><img src="/assets/img/advice/harvey.webp" alt="Harvey" class="tail" width="540" height="540" loading="lazy" /></p>

<p>Finally, the “Statement of Purpose” is where you sell yourself. This is where you need to convince the admission committee and the professors that you have what it takes and that they should bet on you. A Ph.D. is a paid position, and they invest a lot of time and resources in you. This document should outline how you alone are capable of meeting their expectations. A Ph.D. means becoming an expert in a super-specific topic, and it’s imperative that your SoP reflects that. Your SoP is not a biography. It needs to be the story of your topic. In a nutshell, these are the questions your SoP should tackle:</p>
<ul>
  <li>What is the area of study?</li>
  <li>Who are you? What makes you qualified to tackle this subject?</li>
  <li>What exactly is the problem? What can be done about it?</li>
  <li>Why do you want to work with this professor?</li>
  <li>How can you contribute to the school?</li>
</ul>

<p>Don’t start with how it has been your childhood dream to do a Ph.D. and how <a href="https://www.youtube.com/watch?v=t86sKsR4pnk">Iron Man</a> inspired you. Professors are looking for people with a vision and not a backstory. Show them that you have that vision. Hype yourself up and prove to them that despite your average publication record, you deserve a chance to work with the top research groups in the world. <a href="https://www.stoodnt.com/blog/how-demonstrate-potential-passion-purpose-sop-for-ms-phd/">Here</a> are some <a href="https://maria-antoniak.github.io/2020/11/27/phd-applications.html">links</a> to guides on writing SoPs; you can check <a href="https://www.overleaf.com/read/rrmrftfzrrmz">mine</a> out here. I had to rewrite my SoP at least three times and get it checked by many helpful seniors. Some universities, like Cornell, Washington, and Toronto, run programs that are willing to review your SoP. There is also a mentorship program by SIGGRAPH which I found very helpful; I was paired with a senior from Stanford who helped me shape my SoP. In my experience, there is no better resource than your seniors. And the best part is that they are more excited about your wins than you are. I asked 5-6 seniors to help me shape my documents and prepare for the interviews.</p>

<p>The more specific your SoP is, the more genuine it looks. In addition, some universities like Michigan ask for a personal statement. It is a short document that outlines your story. The focus is on your skills and habits; you can mention your hardships and difficulties here.</p>

<h2 id="submission-time">Submission Time</h2>
<p>At the start of December, I had all my documents ready. I had applied to ELLIS already and was preparing to finish the US/Canada universities applications. However, amongst all the chaos, I got confused about Cornell’s admission deadline and missed out on my dream school. I was devastated but then ensured that the remaining applications were correctly filled. Each time I had to submit TOEFL scores, I had to pay 20 dollars extra. By 17th December, I had filled all the forms and was waiting for my advisor to submit his letters. Professors usually get extra time to submit the letters. I waited for the Christmas break to end and then cold-emailed professors in Europe with my documents. By the first week of January, I had applied to 10 universities in North America and emailed 16 different European professors.</p>

<p><img src="/assets/img/advice/fine.jpeg" alt="Fine" class="tail" width="540" height="540" loading="lazy" /></p>

<p>Reading your SoP an infinite number of times gives you a sense of achievement, but I was also a nervous wreck. The next three months were the most stressful months of my college life. <a href="https://www.thegradcafe.com/">GradCafe</a> posted regular updates of which places were giving out admits, and I was anxiously waiting. Every day without an update felt like a knife to the back. I got rejected by ELLIS in their screening round. It was very disappointing since my application didn’t even reach the professors. A few days later, I started getting replies from some European Professors. Some weren’t willing to take me, some had already filled the position, some asked me to apply later in May, and some thankfully asked me to schedule an interview. Inspired by this success, I emailed all the American/Canadian professors I applied to, hoping they might take a second look at my application.</p>

<h2 id="suit-up">Suit up</h2>
<p>I had my first Ph.D. interview in mid-January. I had applied to the <a href="https://www.mpi-inf.mpg.de/departments/visual-computing-and-artificial-intelligence">Visual Computing and AI</a> group at MPI-INF. It consisted of 5 professors tackling the same larger problem from different angles. Since I was comfortable with all of them, I emailed each of them separately. My first interview was conducted by Prof Vladislav Golyanik and Prof Christian Theobalt. I was asked to make a presentation and present my past work. Luckily, both were happy with my work, and my research interests matched theirs. I was very nervous, but all the questions were about my work, and then we discussed the logistics of my Ph.D. there. I had more interviews lined up, and at this point, I felt invincible and on top of the world.</p>

<p>Soon, my world came crashing down. I was rejected from Washington and CMU robotics programs. Both these programs had an insanely high number of applicants. Washington accepted 60 students from a pool of 2400 applications, while CMU took 35 from 786 applications for their robotics program. Toronto had similar statistics as well. Since I didn’t get any interview calls from North America, I realized that it was highly unlikely that I had any chance there. It was a low point for me, and I started making backup plans.</p>

<p>After a couple of weeks of excruciating wait, I got the offer. All the hard work of the last five years culminated at that moment. My job was not done as I had to submit my second paper and thesis, but this was my biggest achievement to date. The next few months were brutal. I didn’t receive any more interview calls and got rejected from everywhere. It made me realize that I am nowhere close to where I am supposed to be and that I had a long, long way to go.</p>

<h2 id="plan-b">Plan B</h2>
<p>In the meantime, I had been looking for pre-doc positions, preferably in computer vision as a plan B. Very few companies worked on it in India, like TCS, Adobe, Verisk, and Mercedes. I got accepted at TCS and was fascinated by the projects they did in 3D vision. They also paid well. I got rejected from Adobe and Microsoft, while I turned down Google and Verisk after I got my offer. Further, I was planning on emailing and finding funded RA positions across the globe. Many seniors had done the same and got into remarkable places.</p>

<h2 id="impossible">(Im)possible</h2>
<p><img src="/assets/img/advice/gabba.jpg" alt="Gabba" class="tail" width="540" height="540" loading="lazy" /></p>

<p>So coming back to the question, “Can you get into a Ph.D. program at a top university?”. In my experience, with my credentials, it’s a definite no. But with a couple more publications, who knows? Getting into a Ph.D. programme is extremely challenging and depends on many factors. However, the most important factor will be the match between your research interests and theirs. They are betting 3-5 years on you, and hence you need to show proof that you deserve it. It’s hard, really, really hard. But it all comes down to whether you are passionate enough to risk it all. You will get a chance to write an award-winning paper. Like most things in life, it boils down to persistence and determination. So go for it. Don’t think you aren’t good enough. Don’t think you can’t do it. Nobody had won a test match at the <a href="https://www.youtube.com/watch?v=7SGFVbKArmo">Gabba</a> for 35 years, yet a young Indian team dared to dream and beat a dominant Australia at home. You cracked JEE, you thrived at IIIT-H, you are more than good enough to do research at any lab you want.</p>]]></content><author><name>Pulkit Gera</name><email>testandplayalltime@gmail.com</email></author><category term="blogs" /><category term="advice" /><summary type="html"><![CDATA[My journey of getting into a PhD program and advice for how you should go about it]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://darthgera123.github.io/assets/img/advice/title.jpg" /><media:content medium="image" url="https://darthgera123.github.io/assets/img/advice/title.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Relighting and Material Editing with Implicit Representations</title><link href="https://darthgera123.github.io/blogs/neural_rendering/relighting-nerf/" rel="alternate" type="text/html" title="Relighting and Material Editing with Implicit Representations" /><published>2021-07-04T00:00:00+00:00</published><updated>2021-07-04T00:00:00+00:00</updated><id>https://darthgera123.github.io/blogs/neural_rendering/relighting-nerf</id><content type="html" xml:base="https://darthgera123.github.io/blogs/neural_rendering/relighting-nerf/"><![CDATA[<!-- ## Introduction -->
<ul class="large-only" id="markdown-toc">
  <li><a href="#implicit-representation" id="markdown-toc-implicit-representation">Implicit Representation</a></li>
  <li><a href="#nerf" id="markdown-toc-nerf">NeRF</a></li>
  <li><a href="#rendering-equation" id="markdown-toc-rendering-equation">Rendering Equation</a></li>
  <li><a href="#nerv" id="markdown-toc-nerv">NeRV</a></li>
  <li><a href="#nerd" id="markdown-toc-nerd">NeRD</a></li>
  <li><a href="#nerfactor" id="markdown-toc-nerfactor">NeRFactor</a></li>
  <li><a href="#takeaways" id="markdown-toc-takeaways">Takeaways</a></li>
  <li><a href="#references" id="markdown-toc-references">References</a></li>
</ul>

<p>One of the most significant challenges of Computer Vision has been recovering the scene from images. If we can understand the scene and represent it somehow, we will be able to view the scene from new viewpoints. This is called Image-Based Rendering (IBR).</p>

<p>The idea is to generate a 3D reconstruction from 2D images and synthesize novel views. Further, if we wish to retrieve the material and lighting of the scene, among other properties, it is referred to as Inverse Rendering.</p>

<p>There are many different ways to represent 3D objects. Classical methods include representing them as meshes, voxels, or point clouds. These have been extensively studied over the years and have their advantages and disadvantages. Typically they are memory-intensive, cannot represent highly detailed objects/scenes, or require a lot of computation. While point clouds can scale well, they usually falter at defining surfaces. 
<img src="/assets/img/relighting/comparison.png" alt="Comparison" /></p>

<p>We discuss a new class of representations called implicit representations, which are making a lot of noise for all the right reasons. One of them is the now-famous Neural Radiance Fields, or NeRF, which has spawned 15-20 variants within the last year alone. NeRF is fantastic at representing an entire scene and letting us view it from any point. However, we cannot edit the scene in any fashion. So we further go through the variants that can perform relighting and material editing while using NeRF as the scene representation.</p>

<h1 id="implicit-representation">Implicit Representation</h1>

<p>Recently there has been a new class of representations called implicit representations. The main difference is that here we learn a function that describes the geometry. For example, for a circle, the implicit equation is f(x,y) = x^2+y^2-R^2, where R is the circle’s radius. For any point (x,y), we know if it is on the circle, inside the circle, or outside the circle. Thus, given many points and information about their position w.r.t. the circle, we can estimate the circle’s radius.</p>
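
<p>As a tiny example, the sign of f tells us where a point lies relative to the circle:</p>

<pre><code class="language-python">import numpy as np

def f(x, y, R=1.0):
    # Implicit equation of a circle of radius R centered at the origin.
    return x**2 + y**2 - R**2

pts = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 2.0]])
print(np.sign(f(pts[:, 0], pts[:, 1])))   # [-1.  0.  1.] : inside, on, outside
</code></pre>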

<p>Similarly, we extend this same idea for 3D as well. We know if points are on, inside, or outside a particular surface, and thus we can estimate our object surface. And what better function approximators are there than neural networks, which are “universal function approximators.”</p>

<p>There are 2 classes of implicit representations, depending on how we want to render the scene. Surface representations aim to find the surface of the object and its corresponding color. In contrast, volumetric representations do not explicitly look for a surface but instead model the density at each point and its corresponding color.</p>

<p><img src="/assets/img/relighting/occupancy.png" alt="Occupancy" class="tail" width="640" height="540" loading="lazy" /></p>

<p>Implicit surface representations include Occupancy Networks and Signed Distance Fields (SDFs). The idea here is that we have a neural network that predicts, for a given point, its position w.r.t. the object, i.e., whether it is on the surface, inside the object, or outside the object. Therefore, when we shoot a ray and sample points on it, the network learns their position w.r.t. the object. Using this, we can then sample points closer to the surface and find the surface.</p>

<p><img src="/assets/img/relighting/sdf.png" alt="SDF" class="tail" width="640" height="540" loading="lazy" /></p>

<p>The main difference between an occupancy network and a signed distance field is that the occupancy network gives a binary answer: if the point is outside, the output is 0; if it is inside, it is 1; and on the surface, the value is 0.5. Signed distance fields, on the other hand, give the distance of the point from the surface. Thus our job is to find all the points satisfying f(x) = 0, with negative values inside the object and positive values outside. We can find the surfaces using ray marching or sphere tracing. The color of the surface can similarly be found by having the network output a color for a particular 3D point. While these representations are quite popular, they only work where rays actually intersect a surface.</p>
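
<p>Sphere tracing is easy to sketch: we step along the ray by the signed distance returned at the current point, so the steps shrink as we approach the surface. The analytic sphere SDF below is just a stand-in for a learned network.</p>

<pre><code class="language-python">import numpy as np

def sdf_sphere(p, radius=1.0):
    return np.linalg.norm(p) - radius

def sphere_trace(origin, direction, sdf=sdf_sphere, steps=64):
    t = 0.0
    for _ in range(steps):
        # Advance by the distance to the nearest surface; near the surface
        # this distance (and hence the step) goes to zero.
        t += sdf(origin + t * direction)
    return origin + t * direction

hit = sphere_trace(np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0]))
print(hit)   # approximately [0, 0, -1], the front of the unit sphere
</code></pre>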

<p>Other works using implicit surface representations include SRN, Differentiable Volumetric Rendering, PIFu, etc.</p>

<h1 id="nerf">NeRF</h1>
<p><img src="/assets/img/relighting/nerf.png" alt="NERF" class="tail" width="640" height="540" loading="lazy" /></p>

<p>Instead of finding the surfaces of all the objects, we can instead perform volumetric rendering. This is where NeRF and its variants come into the picture. The idea is that instead of learning a surface, we learn the entire volume, which includes not only the objects but also the effects of the medium. Neural Volumes was one of the first works to encode the scene this way, but it used a voxel-based representation, which is not scalable.</p>

<p>NeRF, on the other hand, uses MLPs to encode the scene. For a ray shot from every pixel, we sample points along the ray. Every point has a 3D location and a corresponding viewing direction. We pass this 5D vector through the network and obtain the corresponding color and volume density. We do this for all the samples on the ray and then composite them together to get a pixel color. NeRF has two networks: a coarse one, which samples points uniformly along the ray, and a finer one, which does everything the same except that it uses importance sampling. What this means is that we sample more points where the density is higher, i.e., around the objects. Taking the viewing direction into account helps model view-dependent effects such as specularities, e.g., reflections on shiny surfaces. Compositing this information uses classic volume rendering techniques, which give us the final image.</p>
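
<p>The compositing step itself is compact. Below is a condensed sketch of the standard volume rendering weights used by NeRF; variable names are illustrative.</p>

<pre><code class="language-python">import torch

def composite(sigmas, colors, deltas):
    # sigmas: (N,) densities, colors: (N, 3), deltas: (N,) distances between samples
    alphas = 1.0 - torch.exp(-sigmas * deltas)
    # Transmittance: probability that the ray reaches each sample unoccluded.
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alphas + 1e-10]), dim=0)[:-1]
    weights = alphas * trans
    rgb = (weights[:, None] * colors).sum(dim=0)
    return rgb, weights   # the weights also drive importance sampling for the fine network
</code></pre>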

<p>So far, the methods we have covered can represent the scene’s geometry well and in a reasonably memory-efficient manner. However, as we have noticed, these methods directly learn and predict the color of a specific point of the surface or scene. They therefore bake in the material and lighting effects, which we cannot edit. So although these networks can perform view synthesis pretty well, they cannot change the lighting of the scene or the object’s material.</p>

<h1 id="rendering-equation">Rendering Equation</h1>
<p>Before we move ahead, let us understand how computer graphics models material and lighting. Consider a scene with one light source, some objects, and a camera. Now we want to know what a point on the object looks like. We can use some good old physics to compute this. Using energy balance, at a particular point we can say that:
<img src="/assets/img/relighting/eq1.png" alt="Equation" /></p>

<p>That is, the difference between the power leaving an object and the power entering it is equal to the difference between the power it emits and the power it absorbs. To enforce energy balance at a surface, the exitant radiance Lo must equal the emitted radiance plus the fraction of incident radiance that is scattered. Emitted radiance is given by Le, and scattered radiance is given by the scattering equation, which gives</p>

<p><img src="/assets/img/relighting/rendering_eq.png" alt="Rendering Equation" class="tail" width="640" height="540" loading="lazy" /></p>

<p>Do not worry if it looks too technical. At a particular point, we are summing up the contribution of light reflected across a hemisphere. The factor f is called the bidirectional reflectance distribution function, or BRDF, which tells us how much power is reflected and absorbed for a particular material; in other words, the BRDF characterizes the material. There are many BRDF models, like Cook-Torrance, Disney, etc. If the BRDF is different at every point, as in a texture, we call it a Spatially Varying BRDF or SVBRDF.</p>

<p>There is another version called the surface version of the rendering equation, which we will be referring to in the future as well:
<img src="/assets/img/relighting/surface_eq.png" alt="Surface Equation" /></p>

<p>Here p’ is our surface point, and p is the observer or camera. p’’ is the surface from which the light ray arrives at p’, and A is the set of all surfaces. G is the geometric coupling term, which stands for:</p>

<p><img src="/assets/img/relighting/visibility.png" alt="Visibility" /></p>

<p>V is the visibility function, which is 1 if the two surfaces can see each other and 0 otherwise.</p>

<p>Now that we understand how material and lighting are modeled, we can understand the various threads of works done to give us material and lighting editing capabilities in implicit representations.</p>

<h1 id="nerv">NeRV</h1>
<p><img src="/assets/img/relighting/nerv_res1.png" alt="NeRV results" class="tail" width="640" height="540" loading="lazy" /></p>

<p>Neural Reflectance and Visibility Fields for Relighting and View Synthesis, or NeRV, attempts to relight the scene with multiple point light sources. In NeRF, we assume that no point sampled on the ray reflects light. However, since we want to perform relighting, we need to model how each point reacts to direct and indirect illumination. Thus, instead of each point being an emitter, we now need to compute the reflectance function at each point.</p>

<p><img src="/assets/img/relighting/nerv_vis.png" alt="NeRV Visualization" class="tail" width="640" height="540" loading="lazy" /></p>

<p>So, to begin with, we replace NeRF’s radiance MLP with two MLPs: a “shape” MLP that outputs volume density σ and a “reflectance” MLP that outputs BRDF parameters for any input 3D point. The BRDF model used by the method consists of a 3-dimensional albedo vector and a scalar roughness.</p>

<p>Now we could compute the per-point reflectance analytically for each point along the ray, but we would need to query the visibility for every point the ray hits after hitting one point. This operation is very expensive, and that is only for direct illumination; for indirect illumination, we would need to keep doing the same thing recursively. So instead, we have a Visibility MLP and a Distance MLP: the Visibility MLP computes the visibility factor at a given point, whereas the Distance MLP computes the termination point of a ray after one bounce.
<img src="/assets/img/relighting/nerv_mat.png" alt="NeRV Material" class="tail" width="640" height="540" loading="lazy" /></p>

<p>So to sum up, here is what happens:</p>
<ul>
  <li>Sample each ray and query the shape and reflectance MLPs for the volume densities, surface normals, and BRDF parameters</li>
  <li>Shade each point along the ray with direct illumination. Compute this by using the visibility and BRDF values predicted by the corresponding MLPs at each sampled point.</li>
  <li>Shade each point along the ray with indirect illumination. Use the predicted endpoint and then compute its effect by sampling along that ray and combining the contribution of each point.</li>
  <li>Combine all these quantities as in NeRF to get the final image.</li>
</ul>

<p>NeRV is designed to work specifically with multiple point light sources, and training is compute-intensive. Once trained, we can modify the BRDF parameters and do material editing for the entire scene. 
<img src="/assets/img/relighting/nerv_results.png" alt="NeRV" class="tail" width="640" height="540" loading="lazy" /></p>

<blockquote>
  <p>TLDR; NeRV uses a Shape MLP to predict volume, a BRDF MLP to predict albedo and roughness, Visibility MLP to predict visibility at each point, and Distance MLP to predict termination of ray after one bounce. The results are combined via the rendering equation for each point and then composited together like NeRF using classical volumetric rendering techniques.</p>
</blockquote>

<h1 id="nerd">NeRD</h1>
<p><img src="/assets/img/relighting/nerd.png" alt="NeRD results" class="tail" width="640" height="540" loading="lazy" /></p>

<p>Neural Reflectance Decomposition (NeRD) incorporates Physically Based Rendering (PBR) into the NeRF framework. As discussed earlier, the color at a point is the integral over the hemisphere of the product of the incoming lighting and the SVBRDF. A point could be dark due to its material, due to occlusion, or because its surface normal points away from the light. None of these factors are taken into account by NeRF, as it bakes in the radiance.</p>

<p><img src="/assets/img/relighting/nerd_net.png" alt="NeRD network" class="tail" width="640" height="540" loading="lazy" /></p>

<p>NeRD has two networks, namely the sampling network and the decomposition network. The sampling network outputs a view-independent but illumination-dependent color and the volume density of the scene. Like NeRF, points are sampled uniformly along the ray in this network, and the volume density is used to importance-sample points on the objects for the second network. The final ingredient in the sampling network is the illumination for that particular image. Instead of passing the environment light directly, we pass a spherical Gaussian representation of it; spherical Gaussians decompose a signal on the sphere into smooth lobes, much like a Fourier basis decomposes a 2D signal. The reason we do this is that we cannot evaluate the rendering equation analytically. Instead, we convert it into its spherical Gaussian form, where the integral reduces to a product operation that can be evaluated quickly. So from the sampling network we learn the per-image illumination, the volume density, and an illumination-dependent color.</p>
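
<p>A spherical Gaussian is just a smooth lobe on the sphere defined by an axis, a sharpness, and an amplitude, and evaluating one is a one-liner. The snippet below is only meant to make the representation concrete; NeRD fits a mixture of such lobes to the environment illumination.</p>

<pre><code class="language-python">import numpy as np

def sg_eval(v, axis, sharpness, amplitude):
    # G(v) = amplitude * exp(sharpness * (dot(v, axis) - 1)) for unit vectors v, axis.
    return amplitude * np.exp(sharpness * (np.dot(v, axis) - 1.0))

up = np.array([0.0, 0.0, 1.0])
side = np.array([1.0, 0.0, 0.0])
print(sg_eval(up, up, 10.0, 1.0))    # 1.0 at the lobe center
print(sg_eval(side, up, 10.0, 1.0))  # much smaller away from the axis
</code></pre>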

<p>The decomposition network extends the second network of NeRF. Along with color and volume density, we also compute a vector and pass it through another small autoencoder to output the BRDF parameters of the object. Here the BRDF parameters are different from NeRV as the model outputs albedo, metallic, and roughness. The autoencoder is there to optimize the training and improve results. Finally, we combine the outputs like NeRF and pass them through classical volume rendering to output the image.
<img src="/assets/img/relighting/nerd_result.png" alt="NeRD results" class="tail" width="640" height="540" loading="lazy" /></p>

<p>NeRD makes the color of the scene view independent and learns the lighting and the BRDF properties. Once learned, it is straightforward to model relighting as we know how illumination is combined to give color.</p>

<blockquote>
  <p>TLDR; NeRD decomposes the scene and learns the illumination and BRDF parameters of the scene separately. The two networks of NeRF are augmented to learn view independent and illumination dependent color, and once trained, it is straightforward to perform relighting.</p>
</blockquote>

<h1 id="nerfactor">NeRFactor</h1>
<p><img src="/assets/img/relighting/nerfactor2.png" alt="NeRFactor results" class="tail" width="640" height="540" loading="lazy" /></p>

<p>NeRFactor is quite different from all the works we have seen so far: it distills a trained NeRF model, which no other work in this line has done, and it operates on surface points while NeRF works on the entire volume. It can perform free-viewpoint relighting as well as material editing.</p>

<p><img src="/assets/img/relighting/nerfactor_net.png" alt="NeRFactor network" class="tail" width="640" height="540" loading="lazy" /></p>

<p>First, we train a NeRF network on the scene, keep the coarse network, and freeze its weights. We then train a BRDF MLP on the MERL dataset, which contains measured reflectance functions of 100 different materials. Then we initialize a normal map and a visibility map using the volume predicted by the pretrained NeRF. These maps are very noisy, and hence instead of freezing them, we use them only as initializations.
<img src="/assets/img/relighting/nerfactor_mat.png" alt="NeRFactor results" class="tail" width="640" height="540" loading="lazy" /></p>

<p>Now we first predict where the ray hits the surface. Using that point as input, we train 4 MLPs, namely a Light Visibility MLP, a BRDF Identity MLP, an Albedo MLP, and a Normal MLP. Since the visibility and normal MLPs start from our initializations, they are referred to as pretrained. We input the surface points and get the outputs: the BRDF Identity MLP outputs a latent vector z which will later be used for material editing, while the albedo network handles the diffuse color component. We also estimate the lighting for each surface point. NeRFactor can separate shadows from albedo by explicitly modeling light visibility and can synthesize realistic soft or hard shadows under arbitrary lighting conditions. All the outputs are then combined as in NeRF and rendered using classical volumetric rendering.
<img src="/assets/img/relighting/nerfactor1.png" alt="NeRFactor results" class="tail" width="640" height="540" loading="lazy" /></p>

<p>NeRFactor does not predict explicit BRDF parameters. Instead, it learns a latent vector that can easily be used to render material edits. On top of that, it does not take points anywhere in the volume but only on the object’s surface.</p>

<blockquote>
  <p>TLDR; NerFactor uses a trained NeRF to initialize normal and visibility maps and a trained BRDF MLP to learn the latent vector representation. It then searches for the points on the surface of the object and learns its various parameters. After learning, we can perform relighting and material editing.</p>
</blockquote>

<h1 id="takeaways">Takeaways</h1>
<p>We went through 3 methods that empower NeRF with at least relighting capabilities. NeRV does this by computing the effects of direct and indirect illumination at each point and approximating visibility and ray termination using MLPs. NeRFactor, on the other hand, decomposes the scene by first finding the object’s surface and then learning the lighting and BRDF parameters (in this case, a latent vector representation). NeRD is somewhere in the middle: its decomposition network computes the surface normal of the object using weighted sampling and uses it to render the scene, but it still runs for all points in the volume.</p>

<p>We observe that more and more methods are going towards a surface representation to gain more control over the editing of the scene as we are not very concerned with what happens to the medium. Very excited to see which direction this field takes two more papers down the line.</p>

<h1 id="references">References</h1>
<ul>
  <li><a href="http://viclw17.github.io/2018/06/30/raytracing-rendering-equation/">Ray Tracing Blog</a></li>
  <li><a href="https://pbr-book.org/3ed-2018/Light_Transport_I_Surface_Reflection/The_Light_Transport_Equation">PBRT book</a></li>
  <li><a href="https://pratulsrinivasan.github.io/nerv/">NeRV Project Page</a></li>
  <li><a href="https://markboss.me/publication/2021-nerd/">NeRD Project Page</a></li>
  <li><a href="http://people.csail.mit.edu/xiuming/projects/nerfactor/">NeRFactor Project Page</a></li>
  <li><a href="https://dellaert.github.io/NeRF/">NeRF Blog by Frank Dellaert</a></li>
  <li><a href="https://arxiv.org/pdf/2006.12057.pdf">Differentiable Rendering Survey</a></li>
</ul>]]></content><author><name>Pulkit Gera</name><email>testandplayalltime@gmail.com</email></author><category term="blogs" /><category term="neural_rendering" /><summary type="html"><![CDATA[Adding relighting and enabling material editing in Implicit Representations]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://darthgera123.github.io/assets/img/relighting/aurora.jpg" /><media:content medium="image" url="https://darthgera123.github.io/assets/img/relighting/aurora.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Yes, you can also watch anime</title><link href="https://darthgera123.github.io/blogs/anime/top10anime/" rel="alternate" type="text/html" title="Yes, you can also watch anime" /><published>2020-08-03T00:00:00+00:00</published><updated>2020-08-03T00:00:00+00:00</updated><id>https://darthgera123.github.io/blogs/anime/top10anime</id><content type="html" xml:base="https://darthgera123.github.io/blogs/anime/top10anime/"><![CDATA[<ul class="large-only" id="markdown-toc">
  <li><a href="#audience-choice" id="markdown-toc-audience-choice">Audience Choice</a>    <ul>
      <li><a href="#10-kaguya-sama---love-is-war" id="markdown-toc-10-kaguya-sama---love-is-war">10) Kaguya Sama - Love is War</a></li>
      <li><a href="#9-haikyuu" id="markdown-toc-9-haikyuu">9) Haikyuu</a></li>
      <li><a href="#8-demon-slayer" id="markdown-toc-8-demon-slayer">8) Demon Slayer</a></li>
      <li><a href="#7-one-punch-man" id="markdown-toc-7-one-punch-man">7) One-Punch Man</a></li>
      <li><a href="#6-steins-gate" id="markdown-toc-6-steins-gate">6) Steins Gate</a></li>
      <li><a href="#5-naruto" id="markdown-toc-5-naruto">5) Naruto</a></li>
      <li><a href="#4-code-geass" id="markdown-toc-4-code-geass">4) Code Geass</a></li>
      <li><a href="#3-fullmetal-alchemist-brotherhood" id="markdown-toc-3-fullmetal-alchemist-brotherhood">3) Fullmetal Alchemist: Brotherhood</a></li>
      <li><a href="#2-attack-on-titan" id="markdown-toc-2-attack-on-titan">2) Attack on Titan</a></li>
      <li><a href="#1-death-note" id="markdown-toc-1-death-note">1) Death Note</a></li>
    </ul>
  </li>
  <li><a href="#editors-choice" id="markdown-toc-editors-choice">Editor’s Choice</a>    <ul>
      <li><a href="#1-kimi-no-na-wa-your-name" id="markdown-toc-1-kimi-no-na-wa-your-name">1) Kimi no Na wa (Your Name)</a></li>
      <li><a href="#2-neon-genesis-evangelion" id="markdown-toc-2-neon-genesis-evangelion">2) Neon Genesis Evangelion</a></li>
      <li><a href="#3-psycho-pass" id="markdown-toc-3-psycho-pass">3) Psycho Pass</a></li>
      <li><a href="#4-hunter-x-hunter" id="markdown-toc-4-hunter-x-hunter">4) Hunter x Hunter</a></li>
      <li><a href="#5-nichijou---my-ordinary-life" id="markdown-toc-5-nichijou---my-ordinary-life">5) Nichijou - My Ordinary Life</a></li>
      <li><a href="#6-cowboy-bebop" id="markdown-toc-6-cowboy-bebop">6) Cowboy Bebop</a></li>
    </ul>
  </li>
</ul>
<p>With the Marvel Cinematic Universe making it socially acceptable for 30-year-olds to wear Captain America costumes, there has never been a better time to be a comic book fan. Roll back the years, however, and there was a time when even 13-year-olds reading comics were looked down upon. A similar culture has since arisen, not from the west but from the eastern side of the world: anime culture.</p>

<p>Anime is basically an abbreviation of animation, which is crudely also referred to as cartoons. However, that is where the similarities between the east and the west end. Anime is a multi-billion dollar industry in Japan and is celebrated more than the live-action industry. It is ingrained in the culture, as you can see in Tokyo’s version of Times Square, Akihabara.
You must be wondering: everyone has watched anime since they lost their first tooth, and I am not going to watch Doraemon, Shin-chan and Pokemon. Sure, those are probably the most well-known anime. But there is so much more to this medium, and the kind of groundbreaking stories it has told is incredible. Fun fact: The Matrix was inspired by an anime called Ghost in the Shell.</p>
<blockquote>
  <p>“Ok, I’ll give it a go, but what do I watch?”
Well, everyone has a personal top 10, so I won’t bias you here with mine. We had a poll in IIIT-H’s Anime &amp; Manga Facebook group and got our results. (Note: some entries are changed for the ease of first-time watchers.)</p>
</blockquote>

<h1 id="audience-choice">Audience Choice</h1>
<h2 id="10-kaguya-sama---love-is-war">10) Kaguya Sama - Love is War</h2>
<p><img src="/assets/img/anime/kaguya.png" alt="kaguya" class="tail" width="640" height="540" loading="lazy" /></p>

<p>One thing you need to realize about anime is that very few shows do not feature a high school, and this one is no different. Kaguya Sama is about two high school students who are pretty much in love and come from very different backgrounds. Is the problem their economic status? Not at all. It’s about who confesses first. A fun show where they go to nonsensical lengths to make each other confess. Watch it for the side characters and the Chika dance (you’ll know what I mean).</p>
<h2 id="9-haikyuu">9) Haikyuu</h2>
<p><img src="/assets/img/anime/haikyuu.png" alt="haikyuu" class="tail" width="640" height="540" loading="lazy" /></p>

<p>Haikyuu is the story of a volleyball team trying to win tournament after tournament. Now, I know what you are thinking: why would I watch a volleyball show when I have no interest in the sport? Well, the show embodies the best of sports itself. The drama, the camaraderie, the chills, the heartbreak, all packaged in a very realistic portrayal. It is probably the only anime that comes close to a real-world sports experience. This show reportedly increased the number of kids playing volleyball in Japan by 200%. So yeah, it’s good. And I assure you, you will be shouting with joy when Tsukki blocks Ushijima.</p>
<h2 id="8-demon-slayer">8) Demon Slayer</h2>
<p><img src="/assets/img/anime/demon.png" alt="demon" class="tail" width="640" height="540" loading="lazy" /></p>

<p>Imagine coming back home one day to find your entire family slaughtered by demons, and the only survivor, your sister, has (wait for it) been turned into a demon. Sounds like a bad day, huh? Demon Slayer is an anime that has won fans all over, but what really stands out is that it’s not a revenge story. With animation to blow you out of (or with) the water and an unforgettable opening, do let me know about the chills you get after episode 19.</p>

<h2 id="7-one-punch-man">7) One-Punch Man</h2>
<p><img src="/assets/img/anime/opm.png" alt="opm" class="tail" width="640" height="540" loading="lazy" /></p>

<p>“Just a hero for fun”, that’s what he says when asked about his backstory. One-Punch Man is one hell of a parody that pokes fun at all your superhero cliches and turns them on their head. Our hero stays true to his word and never needs more than a punch to win. Normally that would set up a very boring story, but this one is an expert at subverting expectations. And I am not even getting started on its god-tier animation and fight sequences. It tackles the very interesting dilemmas that come with being unknown yet the strongest man in the universe. A true masterpiece.</p>
<h2 id="6-steins-gate">6) Steins Gate</h2>
<p><img src="/assets/img/anime/steins.png" alt="steins" class="tail" width="640" height="540" loading="lazy" /></p>

<p>Time travel either makes for a very interesting story or a very confusing one. Fortunately for our group of misfits, it’s a masterpiece. Steins Gate is about Okabe Rintarou messing with his homemade microwave time machine and helping his friends achieve their dreams. However, there are consequences to everything, and once they catch up with Okabe he has to make some very difficult decisions to save all his loved ones. It has a wholesome story that quickly turns depressing. It was the 2nd highest-rated show on MyAnimeList (MAL) until very recently.</p>
<h2 id="5-naruto">5) Naruto</h2>
<p><img src="/assets/img/anime/naruto.png" alt="naruto" class="tail" width="540" height="540" loading="lazy" /></p>

<p>Ok, all right. Finally, an anime you have heard about because of all those Naruto and Sasuke memes. But in its essence, it is truly a wonderful show. Although notorious for its stupid fillers, Naruto teaches you a lot, especially about how to beat a villain by talking. It has probably the best fight of all time (Lee vs Gaara) and is overall a fun, enjoyable watch. It offers villains who are at times more popular than Naruto himself (yeah, Itachi), close hand-to-hand combat rarely seen in anime fights, a huge cast with a wide diversity of powers and moves, and a story about an outcast becoming the leader, full of heart and comedy. All I am going to say is, believe it.</p>
<h2 id="4-code-geass">4) Code Geass</h2>
<p><img src="/assets/img/anime/code.png" alt="code" class="tail" width="540" height="540" loading="lazy" /></p>

<p>Big Giant Robots? Check. Political Conspiracy? Check. Smart and Charming Hero? Check. Love Triangles? Check, check, check. Code Geass is one of those shows that hits like a truck. It follows Lelouch, the Prince of Britannia, who is out for revenge against his father and acquires a power that makes people obey him. The story is notorious for its cliffhangers, and yes, anything that can happen does happen in this show. However, what it is really famous for is its iconic ending, which, frankly, none of us saw coming.</p>
<h2 id="3-fullmetal-alchemist-brotherhood">3) Fullmetal Alchemist: Brotherhood</h2>
<p><img src="/assets/img/anime/fmab.png" alt="fmab" class="tail" width="640" height="540" loading="lazy" /></p>

<p>No top 10 anime list is ever complete without this show, and for good reason. Two brothers try to bring their mother back to life using alchemy, but at a terrible cost. Not only do they fail, but one loses an arm and a leg while the other has his soul attached to a metal body. Sounds depressing? It gets worse. However, it’s not all gloom and doom, and it is probably the most balanced show out there. With the lines between villains and heroes often blurred and ethical dilemmas coming up every now and then, it’s a show that has something for everybody.
One of those rare shows that never lets you get bored in its runtime, and with a soundtrack to die for, it is no wonder it has been the highest-rated show on MyAnimeList (MAL) for the last 10 years.</p>
<h2 id="2-attack-on-titan">2) Attack on Titan</h2>
<p><img src="/assets/img/anime/aot.png" alt="aot" class="tail" width="640" height="540" loading="lazy" /></p>

<p>By the looks of it, it is a show about large Titans eating humans and our struggle to survive. 25 episodes in, I’m sure you’ll agree when I say that’s just the tip of the iceberg. What sets it apart is that there is nothing you can predict in this show. Every episode keeps you guessing, and your views about the characters are constantly challenged. If you liked the vast world and unpredictability of Game of Thrones, you won’t be disappointed. Each season recontextualizes the entire show, and on a rewatch you realize just how wrong you were. Every season is in itself a different show, yet they all connect so well, and once you learn what’s in the basement, the world will never be the same again. Shinzou wo Sasageyo!!!</p>
<h2 id="1-death-note">1) Death Note</h2>
<p><img src="/assets/img/anime/death.png" alt="death" class="tail" width="640" height="540" loading="lazy" /></p>

<p>Alright then, you know this one. Everyone has watched it, and everyone knows that Light took a potato chip and ate it. Death Note is the anime that changed everyone’s view of anime. Light Yagami is one of the brightest students in the country, and one day he finds a notebook that kills any person whose name is written in it. So is he right to kill criminals, or does he deserve to go to jail? With a musical score to remember and a battle of wits like never before, this is one show that never gets old.</p>

<p>These 10 shows are not necessarily the best of anime. What I can assure you, however, is that once you have watched a few of them all the way through, your worldview will definitely change.</p>

<p>Now, being a writer, I wouldn’t be doing my job if I didn’t throw in some of my own suggestions, which I shall call the Editor’s Choice. Some of them are as follows:</p>

<h1 id="editors-choice">Editor’s Choice</h1>
<h2 id="1-kimi-no-na-wa-your-name">1) Kimi no Na wa (Your Name)</h2>
<p><img src="/assets/img/anime/yourname.png" alt="yourname" class="tail" width="640" height="540" loading="lazy" /></p>

<p>Taki and Mitsuha are two high school students from very different places who randomly wake up in each other’s bodies. The question? Do they ever meet? The animation? Well, it makes me want to live in that world. The soundtrack? Hard to forget even after years. It’s one movie that makes you laugh and cry at the same time. The moment the twist hits is second to none, and only a few anime movies can claim to give the same feeling of satisfaction and joy. A must watch, especially if you want to start watching anime.</p>
<h2 id="2-neon-genesis-evangelion">2) Neon Genesis Evangelion</h2>
<p><img src="/assets/img/anime/nge.png" alt="nge" class="tail" width="640" height="540" loading="lazy" /></p>

<p>Some anime redefine the genre and have an impact like no other. Today this one is considered the founding father of modern anime because of how it set character tropes. Humanity is attacked by aliens, and its last hope is a 14-year-old boy, Shinji, who must fight them with his robot. Sounds like a Transformers movie? Except that it descends into a study of the human mind, with implications people still don’t fully understand 25 years later.
Tip: watch The End of Evangelion to understand what happened in the last 2 episodes.</p>

<h2 id="3-psycho-pass">3) Psycho Pass</h2>
<p><img src="/assets/img/anime/psycho.png" alt="psycho" class="tail" width="640" height="540" loading="lazy" /></p>

<p>If you have grown up on the likes of Steven Spielberg and his sci-fi movies, you’ll definitely love this one. It is about a mass murderer on the loose who cannot be detected by a system that predicts who will commit a murder. Coupled with stylish music and ethical dilemmas, this show is guaranteed to shock you as well as mesmerize you with its plot.</p>
<h2 id="4-hunter-x-hunter">4) Hunter x Hunter</h2>
<p><img src="/assets/img/anime/hxh.png" alt="hxh" class="tail" width="640" height="540" loading="lazy" /></p>

<p>Hunter x Hunter is widely considered to be the best shounen (action) anime of all time. It’s about Gon searching for his father while becoming a Hunter (not the kind that kills animals), and it follows him overcoming various tests and challenges in his pursuit. What really sets it apart is the variety of those challenges and the way the power system works. While many series cheat with their rules, this one doesn’t, which sets up some of the most interesting fights ever (Gon vs Hisoka, Kurapika vs Uvogin).</p>

<h2 id="5-nichijou---my-ordinary-life">5) Nichijou - My Ordinary Life</h2>
<p><img src="/assets/img/anime/nichijou.jpg" alt="nichijou" class="tail" width="640" height="540" loading="lazy" /></p>

<p>Short stories about the daily lives of 3 high school girls. What could be interesting about this? Well, possibly everything. With over-the-top animation of mundane things, from not knowing how to order coffee at a Starbucks to accidentally boarding the train alone and leaving your friends behind, this anime never fails to make us laugh. Watch it for the not-so-ordinary life of Yukko and her friends.</p>

<h2 id="6-cowboy-bebop">6) Cowboy Bebop</h2>
<p><img src="/assets/img/anime/cowboy.jpeg" alt="cowboy" class="tail" width="640" height="540" loading="lazy" /></p>

<p>Set in 2071, this is about the space adventures of our 4 main characters: Spike, Faye, Jet and Edward. Each episode shows them tackling some adventure in the weirdest corners of space, along with their poor meal plans. While each episode stands on its own, the show is iconic for its vast array of amazing music, ranging from jazz to blues, and for how it tackles loss. Cowboy Bebop isn’t a coming-of-age story but an epilogue for our characters, and you will definitely cry when “Call Me Call Me” starts playing.</p>

<p>So that was one long article on some of the shows that we as a community have come to love and cherish. Most of them are now available on Netflix, and we have only scratched the surface. Anime as a medium will continue to grow and evolve, and some of its stories are simply hard to tell in live-action. So what are you waiting for?</p>]]></content><author><name>Pulkit Gera</name><email>testandplayalltime@gmail.com</email></author><category term="blogs" /><category term="anime" /><summary type="html"><![CDATA[Anime suggestions]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://darthgera123.github.io/assets/img/anime/title.png" /><media:content medium="image" url="https://darthgera123.github.io/assets/img/anime/title.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>