<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2026-03-27T19:58:17+00:00</updated><id>/feed.xml</id><title type="html">Vrroom’s Blog</title><subtitle>Blog of the Vrroom</subtitle><entry><title type="html">Harnessing Weak Supervision to Isolate Sign Language Communicators in Crowded News Videos</title><link href="/blog/2024/08/11/sign-detection.html" rel="alternate" type="text/html" title="Harnessing Weak Supervision to Isolate Sign Language Communicators in Crowded News Videos" /><published>2024-08-11T00:00:00+00:00</published><updated>2024-08-11T00:00:00+00:00</updated><id>/blog/2024/08/11/sign-detection</id><content type="html" xml:base="/blog/2024/08/11/sign-detection.html"><![CDATA[<script type="text/x-mathjax-config">
	MathJax.Hub.Config({
		tex2jax: {
			inlineMath: [['$', '$']]
		}
	});
</script>

<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>

<p><em>This post is part of a new foundation that we are trying to create – <a href="https://longtailai.org/">Longtail AI Foundation</a> – which works towards increasing accessibility for the hearing impaired. It describes our first steps towards creating a large-scale dataset – <code class="language-plaintext highlighter-rouge">isl-500</code> – for training bi-directional translation models between English and Indian Sign Language (ISL).</em></p>

<h1 id="introduction">Introduction</h1>

<div style="overflow-x: auto;">
<table style="border: none;">
<tr style="border: none;">
<td style="border: none; text-align: center">
  <img src="/assets/sign_detect.png" width="500px" />
</td>
</tr>
<caption style="caption-side: bottom;">Our Hand-Signer detection model is able to detect hand-signers in News Videos with (a) multiple people present (b) in multiple views. The hand-signers are marked by a bounding box.</caption>
</table>
</div>

<p>I believe that we can solve continuous sign language translation convincingly if we have access to truly large-scale datasets. Compare with audio transcription by considering Table 6 from <a href="https://cdn.openai.com/papers/whisper.pdf">Whisper</a>, the state-of-the-art audio transcription system. Its lowest word error rate (WER) of 9.9 requires 680K hours of noisy transcribed audio data. It is unlikely that such a dataset will ever exist for Sign Language Translation; the footprint of the hearing-impaired on the Internet is far too small.</p>

<p>This is not a problem, since the table also shows that with 6.8K hours, the model achieves a word error rate of 19.6. This is <em>pretty impressive</em> considering that OpenAI evaluated their models on datasets that, firstly, weren’t seen during training and, secondly and more importantly, came from a variety of sources and were made by entirely different groups.</p>

<p>This is in stark contrast with current SL research, where most papers report numbers on the validation sets of the same dataset used in training (commonly used datasets are CSL-Daily for Chinese SL and Phoenix-2014T for German SL; these datasets are on the order of 100 hours). While the numbers improve over the years, they aren’t indicative of how these SL translation systems perform in the wild, or of whether they bring meaningful impact to the lives of the hearing-impaired.</p>

<p>All this is to say that we need to build a dataset on the 5000-hour scale, and then we are good to go. But where can we find this data? Luckily, news broadcasters often include special news segments for the hearing-impaired. In these segments, a hand-signer gesticulates words while, simultaneously, an anchor reads out the news. By detecting the hand-signer and pairing their gestures with audio transcriptions, we can start making progress on this problem. The steps to detect hand-signers are described below.</p>

<div style="overflow-x: auto;">
<table style="border: none;">
<tr style="border: none;">
<td style="border: none; text-align: center">
  <img src="/assets/overview_sign_detect.png" width="400px" />
</td>
</tr>
<caption style="caption-side: bottom;">Overview of our System</caption> 
</table>
</div>

<p>We first run a human pose estimation model, <a href="https://github.com/IDEA-Research/DWPose">DWPose</a>, on news videos that we collected from YouTube. We design heuristics to label the pose sequences. These heuristics are described in detail later in this post. Some of these heuristics may abstain from labelling, while others may not be applicable at test time. We use <a href="https://github.com/snorkel-team/snorkel">Snorkel</a> to aggregate these heuristics and assign probabilistic labels to an unlabelled training set. Internally, Snorkel weighs agreement/disagreement between heuristics across the entire unlabelled training set to assign probabilistic labels. We use these probabilistic labels to train a hand-signer classifier on pose sequences. Since Snorkel is only applicable at training time, training this classifier is what lets us detect hand-signers later, on unseen videos.</p>
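<p>As a minimal sketch (with made-up heuristics and thresholds, not our actual ones), heuristic votes are stacked into a label matrix, one row per pose sequence and one column per heuristic, which is the form a Snorkel-style aggregator consumes:</p>

```python
import numpy as np

ABSTAIN, NOT_SIGNER, SIGNER = -1, 0, 1

def num_tracks_lf(seq):
    # hypothetical heuristic: many simultaneous tracks -> probably not a signer
    return NOT_SIGNER if seq["n_tracks"] > 4 else ABSTAIN

def movement_lf(seq):
    # hypothetical heuristic: lots of hand movement -> probably a signer
    return SIGNER if seq["hand_motion"] > 0.5 else NOT_SIGNER

def build_label_matrix(sequences, lfs):
    # one row per pose sequence, one column per heuristic; -1 marks abstention
    return np.array([[lf(s) for lf in lfs] for s in sequences])

seqs = [{"n_tracks": 6, "hand_motion": 0.1},
        {"n_tracks": 1, "hand_motion": 0.9}]
L = build_label_matrix(seqs, [num_tracks_lf, movement_lf])
# in practice, a matrix like L is what gets handed to Snorkel for aggregation
```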

<h1 id="snorkel-making-sense-of-weak-labels">Snorkel: Making sense of Weak Labels</h1>

<p>Let me be up front: I don’t know how Snorkel works. But let me use this space to pass on some intuition regarding why such a system may improve the quality of heuristic labels. Let’s say we have 5 heuristics: A, B, C, D and E. A, B and C agree on everything but are <em>random</em>, i.e. they label by flipping a coin. D and E, on the other hand, are pretty accurate.</p>

<p>Consider using majority vote to combine these heuristics. It’s evident that since A, B and C always vote the same way, they represent the majority. All the valuable information from heuristics D and E is lost in this exercise.</p>

<p>My guess is that Snorkel improves on majority vote by detecting correlations between heuristics and avoids double, or in this case, triple counting certain heuristics. Thus, its probabilistic labels are more reliable than the heuristics themselves.</p>
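<p>To make this concrete, here is a toy numpy simulation (my own illustration, not Snorkel’s actual algorithm). Three perfectly correlated coin-flip heuristics drown out two accurate ones under majority vote; counting the correlated trio only once recovers most of the lost accuracy:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
y = rng.integers(0, 2, n)        # ground-truth labels

coin = rng.integers(0, 2, n)     # one shared coin flip
A = B = C = coin                 # three perfectly correlated random heuristics
acc = 0.9                        # D and E are right 90% of the time
D = np.where(rng.random(n) < acc, y, 1 - y)
E = np.where(rng.random(n) < acc, y, 1 - y)

votes = np.stack([A, B, C, D, E])
majority = (votes.sum(axis=0) >= 3).astype(int)
mv_err = (majority != y).mean()  # ~0.5: the correlated trio always wins the vote

# de-duplicate: count the correlated trio as a single voter
dedup = np.stack([coin, D, E])
dd_err = ((dedup.sum(axis=0) >= 2).astype(int) != y).mean()  # much lower
```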

<h1 id="heuristics-for-classifying-hand-signers">Heuristics for classifying Hand-Signers</h1>

<p>Here I describe the heuristics I developed for classifying pose sequences.</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">num tracks</code>: This heuristic counts the number of humans simultaneously tracked in a sequence. If there are many people tracked, it’s unlikely that any one of them is a signer. This is a really coarse heuristic, but the table below shows that it is better than random (which will have an error rate of 0.5 for this binary classification problem).</li>
  <li><code class="language-plaintext highlighter-rouge">video path</code>: This heuristic assigns labels based on the data source. For example, if the sequence came from a Sign Language dictionary, we can be sure that the tracked person is indeed a hand-signer. This heuristic has a 0 error rate but poor coverage. It highlights a common tradeoff in designing heuristics: it is hard to design heuristics that label all samples in the dataset with a low error rate.</li>
  <li><code class="language-plaintext highlighter-rouge">legs visible</code>: While going through News Videos, I found that most hand-signers are either sitting, or are too close to the camera, so that their legs are not visible. Using pose estimation confidence scores, we can check this and assign a label.</li>
  <li><code class="language-plaintext highlighter-rouge">only one person</code>: This is similar to the <code class="language-plaintext highlighter-rouge">num tracks</code> heuristic. If there is only one person tracked, it’s likely that the person is a hand-signer.</li>
  <li><code class="language-plaintext highlighter-rouge">bounding box</code>: This heuristic checks if the bounding box of a person is too small compared to video dimensions. If so, it’s unlikely that the person is a hand-signer.</li>
  <li><code class="language-plaintext highlighter-rouge">movement</code>: This heuristic measures how much a person is moving their hands. If it is above a certain threshold, they are probably a hand-signer. This heuristic is pretty good, already achieving a pretty low error rate with 100% coverage.</li>
  <li><code class="language-plaintext highlighter-rouge">chest level</code>: This heuristic simply checks whether a person’s hands cross their chest. If so, they may be a hand-signer.</li>
</ul>
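<p>As an illustration, a heuristic like <code class="language-plaintext highlighter-rouge">movement</code> might look as follows (the input layout and the threshold here are made up for the sketch; the real thresholds were tuned on the dev set):</p>

```python
import numpy as np

ABSTAIN, NOT_SIGNER, SIGNER = -1, 0, 1

def movement_lf(wrists, threshold=0.02):
    # wrists: (T frames, 2 hands, 2 normalised coords) from pose estimation
    if len(wrists) < 2:
        return ABSTAIN                     # too short to judge
    # average frame-to-frame wrist displacement
    disp = np.linalg.norm(np.diff(wrists, axis=0), axis=-1)
    return SIGNER if disp.mean() > threshold else NOT_SIGNER

T = 50
still = np.zeros((T, 2, 2))                # hands never move
rng = np.random.default_rng(1)
signing = np.cumsum(rng.normal(0, 0.05, (T, 2, 2)), axis=0)  # jittery hands
```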

<table>
  <thead>
    <tr>
      <th>Heuristic</th>
      <th>Error Rate (dev) ↓</th>
      <th>Error Rate (test) ↓</th>
      <th>Coverage (dev) ↑</th>
      <th>Coverage (test) ↑</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>num tracks</td>
      <td>0.28</td>
      <td>0.27</td>
      <td>1.00</td>
      <td>1.00</td>
    </tr>
    <tr>
      <td>video path</td>
      <td>0.00</td>
      <td>0.00</td>
      <td>0.02</td>
      <td>0.02</td>
    </tr>
    <tr>
      <td>legs visible</td>
      <td>0.19</td>
      <td>0.15</td>
      <td>1.00</td>
      <td>1.00</td>
    </tr>
    <tr>
      <td>only one person</td>
      <td>0.05</td>
      <td>0.03</td>
      <td>0.35</td>
      <td>0.37</td>
    </tr>
    <tr>
      <td>bounding box</td>
      <td>0.23</td>
      <td>0.21</td>
      <td>1.00</td>
      <td>1.00</td>
    </tr>
    <tr>
      <td>movement</td>
      <td>0.05</td>
      <td>0.08</td>
      <td>1.00</td>
      <td>1.00</td>
    </tr>
    <tr>
      <td>chest level</td>
      <td>0.15</td>
      <td>0.16</td>
      <td>1.00</td>
      <td>1.00</td>
    </tr>
  </tbody>
</table>

<p><strong>Table 1:</strong> Labelling Functions: Error Rate (1 − Accuracy) and Coverage (the fraction of data points they don’t abstain from labelling) of different labelling functions. Note that our dev and test sets are class balanced i.e. they have equal number of hand-signers and non-hand-signers. The dev set was used to tune the thresholds in the heuristics.</p>
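<p>For reference, the two quantities reported in Table 1 can be computed as follows (a sketch, with -1 denoting abstention):</p>

```python
import numpy as np

def coverage(votes):
    # fraction of samples the heuristic does not abstain (-1) on
    votes = np.asarray(votes)
    return (votes != -1).mean()

def error_rate(votes, y):
    # 1 - accuracy, measured only on the samples the heuristic labels
    votes, y = np.asarray(votes), np.asarray(y)
    labelled = votes != -1
    return (votes[labelled] != y[labelled]).mean()

y     = np.array([1, 0, 1, 1, 0])
votes = np.array([1, 0, -1, 0, 0])   # abstains once, wrong once
```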

<h1 id="evaluation">Evaluation</h1>

<p>I compared our classifier trained on Snorkel’s probabilistic labels against a few reasonable baselines. The simplest one is a majority vote among all the non-abstaining heuristics. The table confirms our intuition that majority vote is not good. In fact, it’s worse than our best heuristic (<code class="language-plaintext highlighter-rouge">movement</code>).</p>
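<p>The majority-vote baseline is easy to state in code (a sketch; -1 marks abstention, and ties fall back to the negative class):</p>

```python
import numpy as np

def majority_vote(L):
    # L: (n_samples, n_heuristics) with labels in {0, 1} and -1 for abstain
    L = np.asarray(L)
    ones = (L == 1).sum(axis=1)
    zeros = (L == 0).sum(axis=1)
    return (ones > zeros).astype(int)    # ties / all-abstain rows fall to 0

L = np.array([[ 1,  1,  0, -1],
              [-1, -1,  0,  0],
              [ 1, -1, -1, -1]])
```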

<p>To control for the effect of Snorkel, we also trained the classifier on labels from the <code class="language-plaintext highlighter-rouge">movement</code> heuristic. It is not surprising that this doesn’t meaningfully impact the error rate. In fact, had it done so, I’d be more likely to attribute it to a bug in my code than to anything else.</p>

<p>Training on Snorkel’s probabilistic labels produces the best error rates. They are half those of our best-performing heuristic (both on the dev and the test set, incidentally).</p>

<table>
  <thead>
    <tr>
      <th>Method</th>
      <th>Error Rate (dev) ↓</th>
      <th>Error Rate (test) ↓</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>majority vote</td>
      <td>0.13</td>
      <td>0.12</td>
    </tr>
    <tr>
      <td>movement heuristic</td>
      <td>0.05</td>
      <td>0.08</td>
    </tr>
    <tr>
      <td>movement heuristic + classifier</td>
      <td>0.04</td>
      <td>0.08</td>
    </tr>
    <tr>
      <td>ours</td>
      <td>0.02</td>
      <td>0.04</td>
    </tr>
  </tbody>
</table>

<p><strong>Table 2:</strong> Error rates: We compare our final classifier against reasonable baselines (the best performing heuristic, a classifier trained on the best performing heuristic, and a majority vote among all heuristics). Pooling heuristic labels using Snorkel and training a classifier slashes the error rate by at least 50% on both our dev and test set.</p>

<h1 id="final-thoughts">Final Thoughts</h1>

<p>I should point out that Snorkel is not a magic wand. The heuristics one uses should already be pretty good and should broadly cover the dataset. Also, detecting hand-signers is quite a simple computer-vision problem in the era of GPT-4o. That being said, it is pretty exciting that this works at all, and it should make data acquisition cheaper in a lot of use cases.</p>

<p>Our code is available on <a href="https://github.com/Longtail-AI-Foundation/sign-detect">GitHub</a>. We have used it to clean ~500 hours of news videos (ISL) along with transcripts. If anyone has any use for this dataset/code, please reach out to us. Any thoughts/suggestions will be appreciated.</p>

<hr />]]></content><author><name></name></author><category term="blog" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Segmenting Comic book Frames</title><link href="/blog/2024/02/23/comic-frame-segmentation.html" rel="alternate" type="text/html" title="Segmenting Comic book Frames" /><published>2024-02-23T00:00:00+00:00</published><updated>2024-02-23T00:00:00+00:00</updated><id>/blog/2024/02/23/comic-frame-segmentation</id><content type="html" xml:base="/blog/2024/02/23/comic-frame-segmentation.html"><![CDATA[<p><em>This post is based on my project in my Computer Vision class last semester</em></p>

<h1 id="introduction">Introduction</h1>

<p>As I was learning classical techniques in my Computer Vision class, I came across a <a href="https://maxhalford.github.io/blog/comic-book-panel-segmentation/">blog post</a> by Max Halford on extracting frames from comic books. He developed a very interesting algorithm where he applied <em>Canny</em> to detect the boundary of frames, filled holes and fit bounding boxes to contiguous regions.</p>

<p>This elegant algorithm did the job very well but had its shortcomings. For one, it didn’t handle arbitrary, un-aligned polygons, and it didn’t work on <em>negative frames</em>, which don’t have a boundary of their own but rather are defined by the boundaries of neighboring frames.</p>

<p>Given the hype around <em>foundation models</em> for segmentation such as <a href="https://github.com/facebookresearch/segment-anything">SAM</a>, I approached this problem by procedurally generating a synthetic dataset of comic books and finetuning SAM to detect the corner points of frames.</p>

<div style="overflow-x: auto;">
<table style="border: none;">
<tr style="border: none;">
<td style="border: none; text-align: center">
  <img src="/assets/comic_frame_seg/comic_panels.png" width="300px" />
</td>
</tr>
<caption style="caption-side: bottom; text-align: center;">Failure cases of heuristic approaches: (Top) Frames from Pepper and Carrot by David Revoy are polygons and not axis-aligned bounding boxes. (Bottom) Negative frames may not have a well defined border. </caption>
</table>
</div>

<h1 id="procedural-comic-generator">Procedural Comic Generator</h1>

<p>There isn’t abundant data available for this problem. But that doesn’t mean we should hang our heads. A common technique that is widely used (see <a href="https://errollw.com/">Erroll Wood’s</a> work) is to procedurally generate training data.</p>

<p>In our case, this means simulating comic books. Note, we don’t really need to make gripping artwork and tell a story; we just need to generate panels that look like comics from 50,000 feet. To do this, I wrote a procedural generator of layouts that places random boxes on an empty image. I filled these boxes with images sampled from the <a href="https://danbooru.donmai.us/">Danbooru</a> dataset.</p>
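<p>A minimal sketch of the layout idea (the actual generator has more knobs for box size, shape and boundary properties) is a recursive guillotine split of the page:</p>

```python
import random

def split_layout(x, y, w, h, depth, rng, min_size=64):
    # recursively cut a rectangle into panel boxes with random guillotine cuts
    if depth == 0 or w < 2 * min_size or h < 2 * min_size:
        return [(x, y, w, h)]
    if rng.random() < 0.5:  # vertical cut
        cut = rng.randint(min_size, w - min_size)
        return (split_layout(x, y, cut, h, depth - 1, rng, min_size)
                + split_layout(x + cut, y, w - cut, h, depth - 1, rng, min_size))
    cut = rng.randint(min_size, h - min_size)  # horizontal cut
    return (split_layout(x, y, w, cut, depth - 1, rng, min_size)
            + split_layout(x, y + cut, w, h - cut, depth - 1, rng, min_size))

boxes = split_layout(0, 0, 1024, 1024, depth=3, rng=random.Random(0))
# the boxes tile the page exactly and can then be filled with images
```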

<p>In order to ensure that the sampled images were at least semi-coherent, I used the <a href="https://github.com/openai/CLIP">CLIP L/14 image encoder</a> to create an image index. While choosing images for a particular page, I sampled one image at random from Danbooru and filled the rest of the boxes with its k-nearest neighbors.</p>
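<p>The index itself amounts to nearest-neighbour search over image embeddings. A sketch, with random vectors standing in for precomputed CLIP L/14 embeddings:</p>

```python
import numpy as np

def knn_page_fill(embs, k, rng):
    # pick a random seed image and return it with its k nearest
    # neighbours under cosine similarity
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    seed = rng.integers(len(embs))
    sims = embs @ embs[seed]
    order = np.argsort(-sims)        # most similar first (the seed itself leads)
    return order[:k + 1].tolist()

rng = np.random.default_rng(0)
embs = rng.normal(size=(100, 32))          # stand-in for CLIP image embeddings
page = knn_page_fill(embs, k=5, rng=rng)   # 6 mutually similar images
```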

<p>With this procedural generator, I had complete control of the size, shape and boundary properties of the box, which I could set appropriately to simulate <em>negative</em> and <em>polygonal</em> frames.</p>

<div style="overflow-x: auto;">
<table style="border: none;">
<tr style="border: none;">
<td style="border: none; text-align: center">
  <img src="/assets/comic_frame_seg/procedural_frames.png" width="300px" />
</td>
<td style="border: none; text-align: center">
  <img src="/assets/comic_frame_seg/procedural2.png" width="300px" />
</td>
</tr>
<caption style="caption-side: bottom; text-align: center;">Initial and final version of our procedural comic generator: (Left) Initial version is just a bunch of boxes. (Right) Final version where I added images from Danbooru randomly to the boxes.</caption>
</table>
</div>

<h1 id="comic-segmentation">Comic Segmentation</h1>

<p>I used SAM as the backbone for my model. SAM is the state-of-the-art image segmentation model. It consists of a heavy, compute-expensive image encoder and a light-weight decoder, which answers segmentation queries. The heavy encoder encodes an image only once, after which multiple segmentation queries are answered cheaply. This division of labor is particularly useful for deployment, where an enterprise serving a user can optimize for both speed and cost by keeping the heavy encoder inference on the cloud and using the user’s device for light-weight decoder inference.</p>

<p>Since SAM predicts dense, per-pixel masks, I modified it to predict points instead. An overview of the model can be seen below. The procedurally generated comic page is fed to the image encoder (whose weights remain unchanged during training). A point is randomly sampled from a frame and given as a query/prompt. The light-weight decoder is trained to recover the corners of the frame.</p>

<div style="overflow-x: auto;">
<table style="border: none;">
<tr style="border: none;">
<td style="border: none; text-align: center">
  <img src="/assets/comic_frame_seg/architecture.png" width="500px" />
</td>
</tr>
<caption style="caption-side: bottom; text-align: center;">Model Overview</caption>
</table>
</div>

<p>I learned two lessons while training this model. Firstly, it was important to canonicalize the order in which the corners of the frame were predicted. Without this, the model got conflicting signals on the ordering of corner points and never converged. Secondly, it was important to use L1 instead of L2 loss since L2 optimized very quickly without improving the quality of predictions.</p>
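<p>For the first lesson, here is a sketch of one possible canonicalization: sort the corners by angle around the centroid, then start the sequence from the corner closest to the top-left. Without a fixed order, the regression targets are ambiguous:</p>

```python
import numpy as np

def canonicalize(corners):
    # order a quad's corners consistently so regression targets are unambiguous
    corners = np.asarray(corners, dtype=float)
    center = corners.mean(axis=0)
    ang = np.arctan2(corners[:, 1] - center[1], corners[:, 0] - center[0])
    corners = corners[np.argsort(ang)]           # consistent rotational order
    start = np.argmin(corners.sum(axis=1))       # corner closest to top-left
    return np.roll(corners, -start, axis=0)

quad = [(9, 1), (1, 1), (1, 9), (9, 9)]          # arbitrary input order
```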

<h1 id="evaluation">Evaluation</h1>

<p>I compared my method against the original SAM and Halford’s method. Note that Halford’s method is at a slight disadvantage in this comparison, since my method also uses a query (set to the center of the ground-truth frame to be predicted). Despite this, it is evident that our model, trained on our procedurally generated dataset, generalizes to “real-world” comics (Pepper and Carrot, abbreviated as P&amp;C), coming close to Halford in the process. It beats Halford on the procedurally generated dataset (abbreviated as Pr), since this dataset is designed to expose the flaws in his method.</p>

<table>
  <thead>
    <tr>
      <th>Method</th>
      <th>IoU (P&amp;C)</th>
      <th>PCK@0.1 (P&amp;C)</th>
      <th>L1 (P&amp;C)</th>
      <th>IoU (Pr)</th>
      <th>PCK@0.1 (Pr)</th>
      <th>L1 (Pr)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>SAM</td>
      <td>0.42</td>
      <td>0.52</td>
      <td>0.37</td>
      <td>0.81</td>
      <td>0.94</td>
      <td>0.08</td>
    </tr>
    <tr>
      <td>Halford</td>
      <td><strong>0.93</strong></td>
      <td>0.96</td>
      <td><strong>0.04</strong></td>
      <td>0.47</td>
      <td>0.61</td>
      <td>0.47</td>
    </tr>
    <tr>
      <td>Ours</td>
      <td>0.88</td>
      <td><strong>0.98</strong></td>
      <td>0.05</td>
      <td><strong>0.88</strong></td>
      <td><strong>0.99</strong></td>
      <td><strong>0.03</strong></td>
    </tr>
  </tbody>
</table>

<p>Here, IoU simply measures the area of intersection over union of the ground-truth and predicted frames. PCK@0.1 refers to the percentage of times the predicted frame corner lies within a certain radius of the ground-truth frame corner (0.1 refers to the radius as a fraction of the diagonal of the comic page). L1 is simply the L1 distance between the ground-truth and predicted corners.</p>
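<p>Concretely, PCK@0.1 can be computed like this (a sketch on made-up corner predictions):</p>

```python
import numpy as np

def pck(pred, gt, page_w, page_h, alpha=0.1):
    # fraction of predicted corners within alpha * page diagonal of ground truth
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    thresh = alpha * np.hypot(page_w, page_h)
    dists = np.linalg.norm(pred - gt, axis=1)
    return (dists <= thresh).mean()

gt   = np.array([[0, 0], [100, 0], [100, 100], [0, 100]])
pred = np.array([[5, 5], [100, 0], [100, 100], [80, 40]])  # last corner way off
```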

<p>Below are some qualitative results which demonstrate that our method works on “real-world” comics. We run it in two modes. On the left, we interactively provide a query and the model produces the corners. On the right, we sample a bunch of queries on the image, predict polygons and filter them using <em>non-maximal suppression</em>, as in the original SAM paper.</p>

<div style="overflow-x: auto;">
<table style="border: none;">
<tr style="border: none;">
<td style="border: none; text-align: center">
  <img src="/assets/comic_frame_seg/interactive.png" width="300px" />
</td>
<td style="border: none; text-align: center">
  <img src="/assets/comic_frame_seg/confidence.png" width="300px" />
</td>
</tr>
<caption style="caption-side: bottom; text-align: center;">Qualitative Results</caption>
</table>
</div>

<h1 id="final-thoughts">Final Thoughts</h1>

<p>There are still shortcomings to my method, and it can fail on complex, cluttered comic pages. Still, I prefer this approach to designing algorithms over composing OpenCV functions, because it is often easier to see how to improve the dataset than to design new heuristics. Once you do that, you almost have a guarantee that the neural network machinery will get you the results.</p>

<p>The annotated Pepper and Carrot dataset that I used for evaluation can be found in my <a href="https://drive.google.com/file/d/1z8OE8TC8eupC6_ZNxUSVyfvk4rSkVIgE">drive link</a>. All my code and checkpoints are available in my <a href="https://github.com/Vrroom/segment-anything-comic">Github Repo</a>. If you think of any improvements to my approach, feel free to reach out!</p>

<hr />]]></content><author><name></name></author><category term="blog" /><summary type="html"><![CDATA[This post is based on my project in my Computer Vision class last semester]]></summary></entry><entry><title type="html">Coloring with ControlNet</title><link href="/blog/2024/02/16/interactive-coloring.html" rel="alternate" type="text/html" title="Coloring with ControlNet" /><published>2024-02-16T00:00:00+00:00</published><updated>2024-02-16T00:00:00+00:00</updated><id>/blog/2024/02/16/interactive-coloring</id><content type="html" xml:base="/blog/2024/02/16/interactive-coloring.html"><![CDATA[<!-- Embed videos side by side -->
<div style="overflow-x: auto;">
<!--<tr style="border: none;"> -->
<!--<td style="border: none;"> -->
<table style="border: none;">
<tr style="border: none;">
<td style="border: none;">

<video width="320" height="240" controls="">
  <source src="/assets/int_color_vid_1.mp4" type="video/mp4" />
  Your browser does not support the video tag.
</video>

</td>
<td style="border: none;">

<video width="320" height="240" controls="">
  <source src="/assets/int_color_vid_2.mp4" type="video/mp4" />
  Your browser does not support the video tag.
</video>

</td>
</tr>
<caption style="caption-side: bottom; text-align: center;">A couple of cherry-picked examples that show how someone might use this model</caption>
</table>
</div>

<h2 id="introduction">Introduction</h2>

<p>I trained a ControlNet for interactively coloring line drawings. I was inspired partly by a <a href="https://twitter.com/lvminzhang/status/1612421180933406720">Twitter post</a> by Lvmin Zhang, the original author of the <a href="https://github.com/lllyasviel/ControlNet">ControlNet</a> and <a href="https://github.com/lllyasviel/style2paints">Style2Paints</a> projects, and partly by my niece, who can spend hours filling up her coloring books.</p>

<!-- <blockquote class="twitter-tweet"><p lang="en" dir="ltr">Color scribbles. Note that color scribbles in large diffusion models can be very tricky. We are still experimenting this to make it easier to control. <a href="https://t.co/WoYtJlsjbc">pic.twitter.com/WoYtJlsjbc</a></p>&mdash; style2paints (@lvminzhang) <a href="https://twitter.com/lvminzhang/status/1612421180933406720?ref_src=twsrc%5Etfw">January 9, 2023</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> -->

<p>More generally, I am excited by how deep learning provides a way to design new user interactions. The basic idea is to conceive of a new interaction (e.g. color strokes), procedurally generate examples of the interaction from some target (e.g. colored images) and train a model to map interactions to target (e.g. from color strokes to colored images).</p>

<div style="overflow-x: auto;">
<table style="border: none;">
<tr style="border: none;">
<td style="border: none;">
  <img src="/assets/int_color_overview.png" />
</td>
</tr>
<caption style="caption-side: bottom; text-align: center;">Overview</caption>
</table>
</div>

<p>Now that I’m winding down this experiment in order to reduce my expenses on Lambda Labs, I’m writing this blog post to document my progress.</p>

<h2 id="model-and-dataset">Model and Dataset</h2>

<p>I trained the popular ControlNet diffusion model (based on Stable Diffusion), which synthesizes images matching a given hint. Training such a model requires a paired dataset of target images and hints. Usually, during training, the hint is generated from the target image <em>procedurally</em>, in the simplest case by applying the <em>Canny</em> edge detector or a <em>pre-trained</em> depth-estimation model to the target image. Using this paired dataset, the model learns to go the other way, i.e. from the hint to the target image. Examples of hints from the original ControlNet paper are <em>Canny</em> and <em>HED</em> edges, <em>depth</em> maps, <em>human pose</em> and even simulated <em>doodles</em>.</p>

<p>In our case, we concoct two hints from the target image: a <em>Canny</em> edge map and a few procedurally generated <em>color strokes</em>. These are encoded as a 5-channel image, 1 channel for <em>Canny</em> and 4 for the RGBA color strokes. The extra <em>alpha</em> channel delineates the presence or absence of the stroke.</p>
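<p>A sketch of how such a 5-channel hint can be assembled (a dummy array stands in for the actual Canny edge map here):</p>

```python
import numpy as np

def make_hint(canny_edges, strokes_rgba):
    # stack the 1-channel edge map and 4-channel RGBA strokes into a
    # 5-channel conditioning image, scaled to [0, 1]
    edges = canny_edges.astype(np.float32)[..., None] / 255.0   # (H, W, 1)
    strokes = strokes_rgba.astype(np.float32) / 255.0           # (H, W, 4)
    return np.concatenate([edges, strokes], axis=-1)            # (H, W, 5)

h, w = 64, 64
edges = np.zeros((h, w), dtype=np.uint8)        # stand-in for cv2.Canny output
strokes = np.zeros((h, w, 4), dtype=np.uint8)
strokes[10:20, 10:20] = [255, 0, 0, 255]        # a red stroke; alpha marks presence
hint = make_hint(edges, strokes)
```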

<p>The target image set I used was a combination of <a href="https://huggingface.co/datasets/vivym/midjourney-messages">Midjourney Messages</a> and <a href="https://ai.google.com/research/ConceptualCaptions">Google Conceptual Captions</a>. Both of these are conveniently available as caption, image URL pairs. Since each training iteration of ControlNet is so slow (~1 it/s), I found it convenient to simply fetch these URLs in my data-loading loop instead of downloading all the files upfront. This saved me a lot of money on data storage.</p>

<p>I generated color strokes using a simple algorithm. I chose a random pixel in the target image and began a random walk for a certain number of steps. I picked up all the colors from the target during this walk and applied them to my color stroke map. With some probability, the random walk was made thicker than 1 pixel using a <code class="language-plaintext highlighter-rouge">max</code>-filter. Again, with some probability, the color stroke map was blurred to discourage the model from learning the trivial <code class="language-plaintext highlighter-rouge">copy</code> operation, where it simply copies the color stroke onto the target image without doing anything meaningful elsewhere.</p>

<p>The code for the color stroke generator can be found below:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">add_color_stroke</span> <span class="p">(</span><span class="n">color_strokes</span><span class="p">,</span> <span class="n">source_img</span><span class="p">)</span> <span class="p">:</span>
    <span class="c1"># get shape of image, starting point and length of walk
</span>    <span class="n">h</span><span class="p">,</span> <span class="n">w</span> <span class="o">=</span> <span class="n">source_img</span><span class="p">.</span><span class="n">shape</span><span class="p">[:</span><span class="mi">2</span><span class="p">]</span>
    <span class="n">st_y</span><span class="p">,</span> <span class="n">st_x</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">h</span> <span class="o">-</span> <span class="mi">1</span><span class="p">),</span> <span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">w</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
    <span class="n">L</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">200</span><span class="p">,</span> <span class="mi">1000</span><span class="p">)</span>

    <span class="c1"># construct walk
</span>    <span class="n">dirs</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">]],</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">int</span><span class="p">)</span>
    <span class="n">rng_idx</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="p">(</span><span class="n">L</span><span class="p">,))</span>
    <span class="n">steps</span> <span class="o">=</span> <span class="n">dirs</span><span class="p">[</span><span class="n">rng_idx</span><span class="p">]</span>
    <span class="n">px_points</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">cumsum</span><span class="p">(</span><span class="n">steps</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    <span class="n">px_points</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">+=</span> <span class="n">st_y</span>
    <span class="n">px_points</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">]</span> <span class="o">+=</span> <span class="n">st_x</span>

    <span class="c1"># find when walk goes out of bounds and truncate it
</span>    <span class="n">y_mask</span> <span class="o">=</span> <span class="p">(</span><span class="n">px_points</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">&lt;</span> <span class="n">h</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">px_points</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">)</span>
    <span class="n">x_mask</span> <span class="o">=</span> <span class="p">(</span><span class="n">px_points</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">]</span> <span class="o">&lt;</span> <span class="n">w</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">px_points</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">]</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">)</span>
    <span class="n">ff_id</span> <span class="o">=</span> <span class="n">find_first_false</span><span class="p">(</span><span class="n">y_mask</span> <span class="o">&amp;</span> <span class="n">x_mask</span><span class="p">)</span>
    <span class="n">px_points</span> <span class="o">=</span> <span class="n">px_points</span><span class="p">[:</span><span class="n">ff_id</span><span class="p">]</span>

    <span class="c1"># create mask from walk
</span>    <span class="n">mask</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">h</span><span class="p">,</span> <span class="n">w</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">uint8</span><span class="p">)</span>
    <span class="n">mask</span><span class="p">[</span><span class="n">px_points</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">px_points</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">]]</span> <span class="o">=</span> <span class="mi">255</span>

    <span class="c1"># optionally, dilate the walk
</span>    <span class="n">thickness</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">])</span>
    <span class="k">if</span> <span class="n">thickness</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="p">:</span>
        <span class="n">footprint</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">ones</span><span class="p">((</span><span class="n">thickness</span><span class="p">,</span> <span class="n">thickness</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">uint8</span><span class="p">)</span>
        <span class="n">mask</span> <span class="o">=</span> <span class="n">skimage</span><span class="p">.</span><span class="n">filters</span><span class="p">.</span><span class="n">rank</span><span class="p">.</span><span class="n">maximum</span><span class="p">(</span><span class="n">mask</span><span class="p">,</span> <span class="n">footprint</span><span class="p">)</span>

    <span class="c1"># copy over colors from input image
</span>    <span class="n">color_strokes</span><span class="p">[</span><span class="n">mask</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">,</span> <span class="p">:</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="n">source_img</span><span class="p">[</span><span class="n">mask</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">]</span>
    <span class="n">color_strokes</span><span class="p">[</span><span class="n">mask</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="n">mask</span><span class="p">[</span><span class="n">mask</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">]</span>
</code></pre></div></div>

<h2 id="infrastructure-and-costs">Infrastructure and Costs</h2>

<p>Now, onto some details that give a sense of the investment required to run such experiments. I rented an A10 24GB GPU on Lambda Labs. I have used GCP before, but I love the simplicity of what Lambda Labs offers. For example, I can work on their cloud machines using tools I already know, without having to worry about the <code class="language-plaintext highlighter-rouge">gcloud</code> SDK. This is a minor pain point, but I do think it makes a difference. Lambda Labs also offers a persistent file system, which means that your data is backed up even if you terminate your instances.</p>

<p>All that said, it is very expensive. Last year, the hourly cost of renting an A10 on Lambda Labs was $0.60, but this year they have increased it to $0.75. They justified this by saying they’ll provide more GPU instances to cover the demand. Even so, GPU instances are almost always unavailable (e.g. I have never seen an available A100 with my naked eye). Normally, I just run a script that polls their API for available GPUs and wait for something to happen.</p>

<p>For this experiment in particular, I ran a training job for around 2 months. Since only a batch size of 1 fits into the A10’s memory and ControlNets are finicky at low batch sizes, I used gradient accumulation for an effective batch size of 64. The total cost of this run must have been around $1,000.</p>
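<p>As an aside, gradient accumulation is simple but easy to get subtly wrong. Here is a minimal, framework-free sketch of the idea (the quadratic-loss gradient, learning rate and data below are illustrative, not the settings from my run): accumulate per-sample gradients over 64 micro-batches, then take a single averaged optimizer step.</p>

```python
import numpy as np

def sgd_with_accumulation(grad_fn, w, samples, accum=64, lr=0.1):
    """Plain SGD where one optimizer step uses gradients accumulated
    over `accum` micro-batches of size 1 (effective batch size = accum)."""
    acc = np.zeros_like(w)
    for i, x in enumerate(samples, start=1):
        acc += grad_fn(w, x)              # backward pass for one micro-batch
        if i % accum == 0:                # step only every `accum` micro-batches
            w = w - lr * acc / accum      # averaging emulates one big batch
            acc = np.zeros_like(w)        # reset the accumulator
    return w
```

<p>In PyTorch, the analogous pattern is calling <code class="language-plaintext highlighter-rouge">loss.backward()</code> on every micro-batch but <code class="language-plaintext highlighter-rouge">optimizer.step()</code> (and <code class="language-plaintext highlighter-rouge">optimizer.zero_grad()</code>) only on every 64th.</p>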

<h2 id="final-thoughts">Final Thoughts</h2>

<p>I’m aware of the broader discussions around Generative AI and its impact on people’s livelihoods. But I do think these models are quite inadequate for realizing the very specific visions for which companies tend to hire commercial artists. I feel this way not because I’m a commercial artist, but because, sadly, I have spent a lot of time cherry-picking examples from these models to make demos. And so, I know that they fail a lot. Maybe that’ll change tomorrow with OpenAI’s SORA, but even then, I’m sure customization will remain a key open problem. And to that end, perhaps all we can do is build better tools for people.</p>

<p>Anyway, before I get ahead of myself: at its current stage, this project can be improved a lot. I have two directions in mind at the moment. It may be that the current approach will work, if only trained longer. If I had to start over, I would build on <a href="https://github.com/Stability-AI/StableCascade">Stable Cascade</a>, the latest Latent Diffusion model by Stability AI. The core text-conditional diffusion model here operates on a very low-dimensional latent space, with spatial dimensions 12 by 12 compared to 32 by 32 for Stable Diffusion (for 512 by 512 images). For the same compute budget, this would give me 4x more training steps.</p>

<p>The other direction that I’m thinking of is that maybe this problem is not amenable to Latent Diffusion Models. While these may be adequate for structural hints (e.g. <em>Canny</em>, <em>human pose</em> or <em>depth</em> maps), it might be a better idea to train with the denoising diffusion loss in image space (where the exact pixel color values are compared). Luckily, there is a large image-space diffusion model, <a href="https://github.com/deep-floyd/IF">DeepFloyd IF</a>, that I can work with.</p>

<hr />]]></content><author><name></name></author><category term="blog" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Self Driving RC Car</title><link href="/robot/2022/07/16/self-driving.html" rel="alternate" type="text/html" title="Self Driving RC Car" /><published>2022-07-16T06:41:00+00:00</published><updated>2022-07-16T06:41:00+00:00</updated><id>/robot/2022/07/16/self-driving</id><content type="html" xml:base="/robot/2022/07/16/self-driving.html"><![CDATA[<script type="text/x-mathjax-config">
	MathJax.Hub.Config({
		tex2jax: {
			inlineMath: [['$', '$']]
		}
	});
</script>

<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>

<p align="center">
  <img src="/assets/car.jpeg" alt="drawing" width="400" />
<p />

<h1>Motivation</h1>

<p> 
I saw a talk where the speaker argued that modern computer vision systems should take more inspiration from biology. He described the mantis shrimp, a sea animal with two eyes that move independently of each other, and wondered whether its eyes enable better representations of the visual world. He asked why evolution chose this design for the shrimp, and how the design helped.
</p>

<p> 
Transforming these questions into scientific inquiry would be challenging, to say the least. Nevertheless, they inspired me to try to build a small-scale autonomous vehicle with mantis-shrimp-like eyes. I'm not there yet: my RC Car has only one eye, i.e. one camera. But it can drive itself, albeit only on a particular track. I'll use this post to describe my car. Before I begin, I would like to give huge credit to the <a href="https://github.com/autorope/donkeycar">Donkeycar</a> project. I was able to build my car only because of them.
</p>

<h1>Design</h1>

<p>
I haven't found good discussions on motor selection, battery selection, power consumption and the weight of RC cars on the Internet. My struggles in understanding these resulted in a lot of bad designs. I think it'll be useful to describe and justify my design here. 
</p>

<p align="center">
  <img src="/assets/schematic.png" alt="drawing" width="400" />
<p />

<p>
The schematic (<a href="https://www.electronicshub.org/raspberry-pi-l298n-interface-tutorial-control-dc-motor-l298n-raspberry-pi/">original source</a>) shows how to control a DC motor using a Raspberry Pi. A lithium-polymer battery powers both the Raspberry Pi (4B) and the L298N motor controller. The Raspberry Pi requires an operating voltage of 5V at 2A current, which it gets from the XY-3606 DC-to-DC converter (the blue board in the schematic). It is important to connect the XY-3606 to the Pi using a USB cable. I tried to connect the two directly using jumper wires. It didn't work: the Pi failed to boot. 
</p>

<p>
The Pi controls motor speed and direction by sending 3 signals to the L298N motor controller. Two of them (orange and green wires in the schematic) control start, brake and the direction (forward or reverse). The third (yellow wire) is a PWM signal which controls the voltage supplied to the motor, and hence the car's speed. A single L298N motor controller can drive 2 motors. My car has 4 motors and thus requires 2 such motor controllers. These are fastened to the underside of my car in order to save space.
</p> 

<p align="center">
  <img src="/assets/carbackside.jpeg" alt="drawing" width="400" />
<p />

<p> 
The total cost of all the components on the car is around 11,000 rupees. It is quite a lot, but note that the Raspberry Pi, its camera module and the Lithium-Polymer battery account for nearly 75% of the total cost. These can be reused for a lot of other things as well. 
</p> 

<p> 
Now that I have described my design, I'll try to justify it. My justifications are based on extremely crude models that don't represent the actual physics of the system. Still, they are useful in illustrating which variables are important and which are not.
</p> 

<h4>Battery</h4> 

<p>
Selecting the battery is obviously crucial. Batteries are quite expensive. Before purchasing one, you have to be sure that it can power the components in your circuit and run them for a sufficient amount of time. 
</p> 

<p>
I collected the power requirements of components in my car. Some I obtained from their respective datasheets. Others I estimated from their operating currents and voltages. I don't have the sources for these anymore. Treat them as ball-park estimates.
</p> 

<table>
  <tr>
    <th>Component</th>
    <th>Power Consumption (W)</th>
  </tr>
  <tr>
    <td>Raspberry Pi 4B</td>
    <td>10</td>
  </tr>
  <tr>
    <td>2 L298N motor controllers</td>
    <td>50</td>
  </tr>
  <tr>
    <td>4 BO motors</td>
    <td>10</td>
  </tr>
  <tr>
    <td>XY-3606</td>
    <td>5</td>
  </tr> 
  <tr>
    <td><b>Total</b></td>
    <td><b>75</b></td>
  </tr> 
</table> 

<p>
My <a href="https://robu.in/product/orange-1000mah-3s-30c60c-lithium-polymer-battery-pack-lipo/">Lithium-Polymer</a> battery supplies a voltage of 11.1V. Since $\text{Power} = \text{Voltage} \times \text{Current}$, the discharge rate is $\frac{\text{Power}}{\text{Voltage}} = \frac{75}{11.1} \approx 6.8A$. This is well within the rated discharge rate for my battery. 
</p>

<p>
I can also estimate the battery life from the battery capacity. The battery contains $1000mAh$ of charge. In SI units, this is $1000 \times 10^{-3} \times 3600 = 3600C$. Now, assuming that I can discharge the battery completely, I get $\frac{3600}{6.8} \approx 530s$, or about $9$ minutes of battery life. 
</p>
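<p>These back-of-the-envelope numbers are easy to redo for a different battery or component list. A small sketch, using only the figures from the table above and the battery's label:</p>

```python
# Rough battery-life estimate from the component table and the battery label.
power_w = {"Raspberry Pi 4B": 10, "2x L298N": 50, "4x BO motors": 10, "XY-3606": 5}
total_w = sum(power_w.values())              # 75 W at full load
voltage_v = 11.1                             # 3S LiPo nominal voltage
discharge_a = total_w / voltage_v            # ~6.8 A current draw
charge_c = 1000e-3 * 3600                    # 1000 mAh converted to coulombs
runtime_min = charge_c / discharge_a / 60    # ~9 minutes, if fully discharged
```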

<p>
This analysis should raise a lot of eyebrows. For one, Lithium-Polymer batteries should never be discharged completely. Doing so would render them unsafe for future use. Moreover, batteries don't supply constant power as they discharge. Power supply diminishes through battery operation. This means that my calculation of the discharge rate and battery life is not really accurate. Nevertheless, these back-of-the-envelope calculations helped me be sure that my car could run and would do so, at least for a few minutes. 
</p> 

<h4>Force</h4> 

<p> 
In order for the car to move, it has to overcome friction. Frictional force is often modelled as $F_{\text{friction}} = \mu \times M \times g$. The mass of my car $M$ is around 0.5Kg, the coefficient of friction $\mu$ is usually less than 1 and $g$ is acceleration due to gravity, $10 m/s^2$. If the motors can provide a force greater than $1 \times 0.5 \times 10 = 5N$, the car should move.
</p> 

<p> 
Motors convert electrical power to mechanical power. The cheap DC motors that I use, do so with around 50% efficiency. This means that we get $0.5 \times 10W = 5W$ of mechanical power to work with. 
</p> 

<p> 
Mechanical power is $F_{\text{motors}} \times v_{\text{car}}$. Since the motor turns at 150 rpm or about 15 rad/s, the velocity of the car is $v_{\text{car}} = \text{wheel radius} \times \omega = 2 cm \times 15 rad/s = 0.3 m/s$. Hence $F_{\text{motor}} = \frac{5W}{0.3m/s} \approx 17N$. This is more than $F_{\text{friction}}$ with some margin. The car should move. In practice, it does!
</p>
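<p>The same arithmetic in code form, so you can plug in your own motor and wheel numbers (all values are the rough estimates used above, not measurements):</p>

```python
from math import pi

# Friction the car must overcome: F = mu * M * g
mass_kg = 0.5
mu = 1.0                          # pessimistic coefficient of friction
g = 10.0                          # m/s^2
f_friction = mu * mass_kg * g     # 5 N

# Force the motors can deliver: P_mech = F * v
electrical_w = 10.0               # all 4 BO motors combined
mech_w = 0.5 * electrical_w       # ~50% efficient cheap DC motors -> 5 W
omega = 150 * 2 * pi / 60         # 150 rpm in rad/s
v = 0.02 * omega                  # 2 cm wheel radius -> ~0.3 m/s
f_motors = mech_w / v             # ~16 N

assert f_motors > f_friction      # the car should move
```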

<p> 
Even with these crude arguments, a few things become apparent. For example, your car shouldn't be too heavy. Perhaps this was already obvious. But some other things were not, at least to me. For example, the wheel radius is an important variable: it trades off the car's velocity against the motors' force. If your car doesn't move, try reducing the wheel radius. I learnt this the hard way when I bought monster wheels, only to bin them later when the car didn't move. 
</p>

<h1>Software</h1> 

<p>
I installed the software from the Donkeycar GitHub repository. It has a lot of useful features that make experiments easier. You can launch a web server on the Pi and control your car through a web browser. There is a convenient way to define parts of your car and the interfaces between them. It even saves data for you as you drive, which you can later use as training data for your neural networks.
</p>

<p>
Donkeycar also advises you to install Tensorflow or Pytorch on your Pi. These weren't easy to install: the pre-built python wheels didn't work for me. In the end, I abandoned them and instead installed the lightweight ONNX runtime for neural network inference. 
</p>

<p> 
Donkeycar also provides predefined templates for a 2-wheel differential drive using the L298N motor controllers. These specify which pins on the Pi control what. Differential drive lets you steer the car by controlling the speeds of the 2 motors. For example, to turn left, you would spin the left-side wheels slower than the right ones. Since my car has 4 wheels, I had to define my own template. You can find it in this <a href="https://github.com/autorope/donkeycar/pull/1019">pull request</a> for your reference. 
</p>
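<p>To make the idea concrete, here is a hypothetical mixing function (not Donkeycar's actual template code) that maps a throttle and steering command to per-side wheel speeds:</p>

```python
def differential_drive(throttle, steering):
    """Map throttle in [0, 1] and steering in [-1, 1] (negative = left)
    to (left, right) wheel speeds. Turning slows down the inner side."""
    left = throttle * (1 + min(steering, 0.0))    # slower left wheels turn us left
    right = throttle * (1 - max(steering, 0.0))   # slower right wheels turn us right
    return left, right
```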

<h1>Driving it around</h1> 

<p>
I enjoy driving the car around. Part of my motivation in building it was to show it off to other people. Such projects were quite common in my college, but they are rarer back at home. I kind of miss that environment now that I have graduated. It would be nice to have a community such as Donkeycar's where people got together to learn about this kind of stuff. People, regardless of age, will always find something like this very interesting. It can be an easy gateway to learning deeply about today's technological buzzwords such as IoT and Machine Learning. 
</p> 

<p> 
Sometimes when I'm driving the car around, people ask me how I built it, how much it cost, how it works and how it moves on its own. To the last question, I initially gave a vague reply: "Oh, it'll be tough to explain". One person pushed back: "You think I won't understand?". So I explained that I had recorded myself driving, and he immediately got it: "Oh, so it is copying your behaviour". This is quite accurate. Maybe I can explain it to people at different levels of detail. It is not always necessary to describe what neural networks are and what backpropagation is.
</p> 

<p> 
I am fascinated by how children interact with the car. They'll run in front of it or chase it. One toddler even tried to poke it, as if it were a dog. Sometimes, I let them drive it. It is a good way to stress-test the car. Since it is hard for them to control, they'll often drive it off the track, into a wall, or turn it upside-down. It makes them laugh, and so far my car is intact. 
</p>

<p align="center"> 
 <video width="400" height="400" controls="">
   <source src="/assets/movie.mp4" type="video/mp4" />
 </video>
</p>

<h1>Training Self-Driving Module</h1>

<p>
I collected training data using Donkeycar's web app by driving around a small track near my house. I visualized histograms of throttles and steering angles and found that I drove at a constant throttle. Hence I chose to only predict the steering angles through a neural network. 
</p>

<p>
The inputs to my neural network were the camera image and 5 previous steering commands. From this, the network predicted the next steering command. I augmented the training images by color jittering, adjusting sharpness and inverting them at random. 
</p>

<p>
I used the ImageNet-pretrained MobileNet-V2 network available in TorchVision as the convolutional branch in the network. I found that this model is at least twice as fast as ResNet-18 on the Pi. Finally, I held out a portion of the training data for validation and plotted network predictions along with ground truth steering commands. This is my <a href="https://colab.research.google.com/drive/1v24nXC2KOibLp5mg60A8y0_wthuUE04o?usp=sharing">Colab notebook</a> where I pondered over neural architectures and such.
</p> 

<p align="center">
  <img src="/assets/val.png" alt="val" width="400" />
<p />

<h1>Driving on its own</h1> 

<p> 
The trained neural network predicted steering controls almost perfectly on the validation set. But when I deployed it to my car, things didn't go so well. The car kept going off track. Even the camera stick broke, ending experiments for the day. 
</p>

<p>
I think the main problem was the difference in the training and deployment environments. I collected training data by driving the car at full throttle with Donkeycar backing up snapshots of $(\text{image}_t, \text{throttle}_t, \text{angle}_t)$ 20 times per second. On the other hand, the neural network could make steering predictions only 10 times per second. 
</p> 

<p> 
This was important because the network relied on previous predictions in order to make the next. In the training data, the car had moved by distance $d$ between subsequent frames, but when the model was deployed, the car moved by distance $2 \times d$ between subsequent neural network evaluations. By then, it would already be off track. 
</p>
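<p>The mismatch is easy to see with the numbers involved (20 Hz recording, 10 Hz inference, and the ~0.3 m/s top-speed estimate from earlier):</p>

```python
v_full = 0.3                 # m/s, estimated full-throttle speed
d_train = v_full / 20        # distance covered between recorded training frames
d_deploy = v_full / 10       # distance covered between network evaluations

# at full throttle, the deployed car out-runs its training distribution
assert abs(d_deploy - 2 * d_train) < 1e-12

# halving the throttle restores the spacing the network was trained on
assert abs((v_full / 2) / 10 - d_train) < 1e-12
```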

<p>
I had two options to fix this. The first was to speed up inference, possibly by quantizing the model. This wasn't straightforward because there is limited operator support for quantized models on ONNX. I could try <a href="https://pytorch.org/tutorials/intermediate/realtime_rpi.html">torch.jit</a> but it requires a 64 bit OS. My Raspberry Pi runs on the 32 bit OS since the PiCam library, which Donkeycar uses to interface with the camera, is deprecated in the 64 bit OS.
</p>

<p>
The second option was to simply run the car at half the speed. The deployed network would then observe new images after the car had covered the same distance as it had in the training data. This idea was not only easier to implement but, more importantly, successful. Below you can see the car moving on its own. It still crashes every now and then, but it's much better than before. 
</p>

<p align="center"> 
 <video width="400" height="400" controls="">
   <source src="/assets/drive.mp4" type="video/mp4" />
 </video>
</p>

<h1>Next Steps</h1>

<p>
There is a lot that excites me about this car. I want to explore its potential as a medium for education. It would be great if I could get even one other person in my area excited about this project and have them build their own car. 
</p>

<p> 
Right now, the car doesn't know anything about its own position and velocity. I would like to integrate sensors for odometry. With this, I could learn more about path planning and algorithms such as SLAM. I saw a <a href="https://www.youtube.com/watch?v=o9jEQZn6I6E">talk</a> by Joydeep Biswas and it would be fun to implement one of his papers on my car. In his talk, he envisions a fleet of autonomous vehicles. We could call such a fleet a Redundant Array of Inexpensive Cars. 
</p> 

<p> 
I can also try to improve inference on the Pi. The main bottleneck right now seems to be the dependency on the 32 bit OS. I should try to build a clone of PiCam compatible with the 64 bit OS. In this way I would also be giving something back to the Donkeycar project. 
</p>

<h1>Credits</h1> 

I depended on a lot of resources on the Internet at various moments. The biggest of them all is Donkeycar, which I have already mentioned. Apart from them, I used <a href="https://www.tomshardware.com/reviews/raspberry-pi-headless-setup-how-to,6028.html">Tom's Hardware</a> to figure out how to set up the Pi via SSH. I found out how to connect the Pi to a mobile hotspot from this <a href="https://medium.com/geekculture/how-to-connect-to-the-raspberry-pi-using-mobile-hotspot-2362a6b02efc">Medium Article</a>. I was able to build the circuit with help from a <a href="https://www.youtube.com/watch?v=lPyDtuzYE5s&amp;t=959s">YouTube Video</a> and this <a href="https://www.electronicshub.org/raspberry-pi-l298n-interface-tutorial-control-dc-motor-l298n-raspberry-pi/">Article</a>.
</video></p></p></p></p></video></p></p></p></p></p></p>]]></content><author><name></name></author><category term="robot" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Geometry 1</title><link href="/codeforces/2022/06/11/geometry.html" rel="alternate" type="text/html" title="Geometry 1" /><published>2022-06-11T11:18:11+00:00</published><updated>2022-06-11T11:18:11+00:00</updated><id>/codeforces/2022/06/11/geometry</id><content type="html" xml:base="/codeforces/2022/06/11/geometry.html"><![CDATA[<script type="text/x-mathjax-config">
    MathJax.Hub.Register.StartupHook("TeX Jax Ready",function () {
        MathJax.Hub.Insert(MathJax.InputJax.TeX.Definitions.macros,{
            cancel: ["Extension","cancel"],
            bcancel: ["Extension","cancel"],
            xcancel: ["Extension","cancel"],
            cancelto: ["Extension","cancel"]
        });
    });
	MathJax.Hub.Config({
		tex2jax: {
			inlineMath: [['$', '$']]
		}
	});
</script>

<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>

<p>Here I’ll share some Codeforces problems I solved by visualizing what happens on a 2D plane. The tricks I discuss here rely on <em>checking parity</em>, the <em>pigeonhole principle</em> and <em>Dilworth’s Lemma</em>.</p>

<h2 id="two-hundred-twenty-one"><a href="https://codeforces.com/contest/1562/problem/D1">Two Hundred Twenty One</a></h2>

<p>In this problem, we are given a sequence of $N$ numbers. Each number can be either $+1$ or $-1$.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>i  :  0  1  2  3  4  5  6  7  8  9  10  11  12 
a_i: +1 -1 +1 +1 +1 -1 +1 +1 -1 -1  +1  -1  +1 
</code></pre></div></div>

<p>For a range of the array  $i \in [l, r]$, there is an alternating sum:</p>

\[S_{lr} = \sum_{i = l}^r (-1)^{i - l} a_i\]

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>l  r   S_lr
-----------
0  0   1
0  3   2
1  3  -1
</code></pre></div></div>

<p>Given a range $[l, r]$, we have to find the minimum number of elements in range that we should remove so that the alternating sum becomes $0$. We are given a lot of such queries so we would like to compute the answer fast (in $O(1)$).</p>

<p>This is a very weird problem and, like many Codeforces problems, it is unlikely that we’ll come across it in the real world. Nevertheless, analysing it is quite fun. Earlier, I had a lot of trouble initiating this kind of analysis. I was a bit lost about what questions I should ask. Now I find it helpful to see what happens when <em>one</em> thing changes. In the present case, let’s see what happens when we remove one element from the sequence.</p>

<p><img src="/assets/ex-p1.png" alt="remove-one-element" /></p>

<p>In the image, $b_i$ takes into account the sign that $a_i$ was multiplied with while computing $S_{lr}$. Since $b_i$ is $\pm 1$, removing it changes the <em>parity</em> of the sum: if $S_{lr}$ is even initially, it becomes odd and vice versa. We can use this fact to lower-bound the number of elements we need to remove. Since $0$ is even, if $S_{lr}$ is odd, we need to remove at least $1$ element. On the other hand, if $S_{lr}$ is even and not $0$, then we need to remove at least $2$ elements.</p>

<p>In this problem, it turns out that we can always achieve these lower-bounds. To see this, let’s visualize the problem on a graph. Fix $l = 0$ and plot $S_{0r}$ for different $r$.</p>

<p><img src="/assets/ex-p2.png" alt="plot-1" /></p>

<p>The alternating sum of the entire array is $7$. If there was an index $j$ such that $S_{0j} = 3$, then we could just remove the element at $j + 1$ and obtain an alternating sum of $0$. Such a $j$ always exists. This is because $S_{0r}$ changes by $1$ at each $r$. You can think of it as a discrete version of the Intermediate Value Theorem from Calculus. In our example $j = 4$.</p>

<p><img src="/assets/ex-p3.png" alt="plot-2" /></p>

<p>After removing the red point, the green part will mirror about the x-axis, making the alternating sum $0$. Generalizing this observation, if $S_{lr}$ is odd, then we can always make the alternating sum $0$ by removing just $1$ element. We find the index $j$ such that $S_{lj} = \frac{S_{lr} - 1}{2}$ and remove the element at $j + 1$. If $S_{lr}$ is even, we remove any element to make the alternating sum odd. Then, we make the alternating sum $0$ using the trick above.</p>

<p><img src="/assets/ex-p4.png" alt="plot-3" /></p>

<p>In order to answer queries, simply compute the alternating sum in the given range. If the alternating sum is odd, the answer is $1$; if it is even and not $0$, the answer is $2$. The alternating sum can be computed for any range in $O(1)$ using prefix sums.</p>
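<p>Putting the pieces together, here is a short Python sketch of the prefix-sum bookkeeping (an illustration, not my actual submission):</p>

```python
def build_prefix(a):
    # pre[j] = sum of a[0..j-1], with the sign (-1)^i fixed by the global index i
    pre = [0]
    for i, x in enumerate(a):
        pre.append(pre[-1] + (x if i % 2 == 0 else -x))
    return pre

def min_removals(pre, l, r):
    """Minimum removals to make S_lr zero, for the 0-indexed inclusive range [l, r]."""
    s = pre[r + 1] - pre[l]
    if l % 2 == 1:
        s = -s          # flip so the first element of the range gets a + sign
    if s == 0:
        return 0
    return 1 if s % 2 == 1 else 2   # odd -> remove 1, even nonzero -> remove 2
```

<p>For the example array at the top of the post, this gives $1$ for the range $[0, 0]$ (sum $1$), $2$ for $[0, 3]$ (sum $2$) and $1$ for $[1, 3]$ (sum $-1$), matching the table.</p>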

<h2 id="training-session"><a href="https://codeforces.com/problemset/problem/1598/D">Training Session</a></h2>

<p>We are given $[n]$ and two arrays $A_1 \cdots A_n$ and $B_1 \cdots B_n$. All pairs $(A_i, B_i)$ are distinct. We have to find the number of ways we can select three distinct numbers $i, j, k \in [n]$ such that one of the following is true:</p>

<ul>
  <li>$A_i$, $A_j$, $A_k$ are distinct.</li>
  <li>$B_i$, $B_j$, $B_k$ are distinct.</li>
</ul>

<p>I wasn’t able to count these directly. Instead, I counted the triplets that <em>don’t</em> satisfy this condition (the <em>bad</em> triplets). The final answer was the difference between all $nC3$ possible triplets and the <em>bad</em> triplets.</p>

<p>Negating the conditions above, we find that <em>bad</em> triplets have to satisfy both of the following:</p>

<ul>
  <li>Any two of $A_i$, $A_j$ and $A_k$ are the same.</li>
  <li>Any two of $B_i$, $B_j$ and $B_k$ are the same.</li>
</ul>

<p>Again, plotting $(A_i, B_i)$ on a graph gave a way to count <em>bad</em> triplets. On a graph, you can immediately see that <em>bad</em> triplets form an <code class="language-plaintext highlighter-rouge">L</code> (e.g. points $1, 3, 4$ in the image below). This is because of the <a href="https://en.wikipedia.org/wiki/Pigeonhole_principle">Pigeonhole Principle</a>, which says that if there are more pigeons than holes, then some hole has more than one pigeon. It is a stupid principle, much like the pigeons themselves.</p>

<p>Anyway, to see why <em>bad</em> triplets always make an <code class="language-plaintext highlighter-rouge">L</code>, let $i_1, i_2$ be the indices for which $a_{i_1} = a_{i_2}$ and $j_1, j_2$ be the indices for which $b_{j_1} = b_{j_2}$. Since these are indices in a triplet (the holes), all of $i_1, i_2, j_1, j_2$ (the pigeons) can’t be distinct: $i_{k_1} = j_{k_2}$ for some $k_1, k_2 \in \{1, 2 \}$. In the image, the common index is $3$. Let’s refer to this common index as the <em>pivot</em>.</p>

<p><img src="/assets/ex-p5.png" alt="pivot" /></p>

<p>We’ll count the <em>bad</em> triplets by counting, for each index, the <em>bad</em> triplets in which that index is the <em>pivot</em>. There will be <em>bad</em> triplets with $1$ as the <em>pivot</em>, $2$ as the <em>pivot</em>, $3$ as the <em>pivot</em> and so on. All these subsets are disjoint and their union is the set of all <em>bad</em> triplets. We’ll get the sizes of these subsets individually and sum them up.</p>

<p>When an index $i$ is a <em>pivot</em>, one member of the triplet is horizontal to it and the other is vertical to it. If we count indices having the same $B$-value and $A$-value (<code class="language-plaintext highlighter-rouge">Bcnts[b]</code> and <code class="language-plaintext highlighter-rouge">Acnts[a]</code>), then the number of <em>bad</em> triplets with $i$ as the <em>pivot</em> is <code class="language-plaintext highlighter-rouge">(Bcnts[B[i]] - 1) * (Acnts[A[i]] - 1)</code>. We subtract one to avoid counting $i$ twice in the triplet. Referring to the image, you can see that the number of <em>bad</em> triplets in which $3$ is the <em>pivot</em> is $2 \times 3 = 6$.</p>

<p>By precomputing <code class="language-plaintext highlighter-rouge">Bcnts</code> and <code class="language-plaintext highlighter-rouge">Acnts</code>, we can count the <em>bad</em> triplets in $O(n)$ by iterating over all possible <em>pivots</em> and adding their contribution using the formula above. To solve the original problem, we subtract this number from $nC3$, the number of triplets in total.</p>
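<p>The whole count fits in a few lines of Python (a sketch, assuming 0-indexed arrays and the guarantee that all pairs $(A_i, B_i)$ are distinct):</p>

```python
from collections import Counter
from math import comb

def count_triplets(A, B):
    """Triplets i < j < k with all three A-values distinct or all three B-values distinct."""
    Acnts, Bcnts = Counter(A), Counter(B)
    # each bad triplet is counted exactly once, at its pivot index
    bad = sum((Acnts[a] - 1) * (Bcnts[b] - 1) for a, b in zip(A, B))
    return comb(len(A), 3) - bad
```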

<h2 id="manhattan-subarrays"><a href="https://codeforces.com/problemset/problem/1550/C">Manhattan Subarrays</a></h2>

<p>I like this problem because it is an interesting application of the <a href="https://ocw.mit.edu/courses/6-042j-mathematics-for-computer-science-fall-2010/efac321fdc8d0b27586ca35b04aab808_MIT6_042JF10_chap07.pdf">Dilworth’s Lemma</a>. We are given an array $a_1 \cdots a_n$. A triplet of <em>distinct</em> indices $i, j, k$ is considered <em>bad</em> if:</p>

\[\underbrace{|a_i - a_j| + |i - j|}_\textrm{manhattan distance between i, j} = \underbrace{|a_i - a_k| + |i - k|}_\textrm{manhattan distance between i, k} + \underbrace{|a_k - a_j| + |k - j|}_\textrm{manhattan distance between k, j}\]

<p>A contiguous subarray $a_l \cdots a_r$ is considered <em>good</em> if we can’t pick any <em>bad</em> triplet from it. We have to count the number of <em>good</em> subarrays.</p>

<p>I found the definition of a <em>bad</em> triplet quite convoluted, so I searched for simpler <em>but</em> equivalent definitions. I simplified the problem by asking the same question in one dimension. When does the following happen?</p>

\[|a_i - a_j| = |a_i - a_k| + |a_k - a_j|\]

<p>This happens <a href="https://en.wikipedia.org/wiki/If_and_only_if">if and only if</a> $a_k$ is between $a_i$ and $a_j$. The image below illustrates that when $a_k$ is between $a_i$ and $a_j$, the LHS and the RHS are equal. When it is not, as for $a_{k’}$, the segment $|a_i - a_{k’}|$ on its own is already longer than $|a_i - a_j|$. In this case:</p>

\[|a_i - a_j| &lt; |a_i - a_{k'}| + |a_{k'} - a_j|\]

<p><img src="/assets/ex-p6.png" alt="illustration" /></p>

<p>Back in our original two dimensional problem, we can infer that in a <em>bad</em> triplet, if $a_k$ is between $a_i$ and $a_j$, then $k$ is between $i$ and $j$: the equal $a$-terms cancel from both sides, leaving the same betweenness condition on the indices alone.</p>

\[\cancel{|a_i - a_j|} + |i - j| = \cancel{|a_i - a_k|} + |i - k| + \cancel{|a_k - a_j|} + |k - j|\]

\[\Rightarrow k\text{ is between }i\text{ and } j\]

<p>Likewise, if the triplet is <em>bad</em> and $k$ is between $i$ and $j$, then $a_k$ is between $a_i$ and $a_j$. On the other hand, if $k$ is not between $i$ and $j$ and $a_k$ is not between $a_i$ and $a_j$, then the triplet $i, j, k$ is not <em>bad</em>. Visualizing <em>bad</em> triplets on a graph, observe that their points form a monotonic sequence.</p>

<p><img src="/assets/ex-p7.png" alt="illustration" /></p>

<p>As you can imagine, <em>good</em> subarrays, i.e. those that don’t contain <em>bad</em> triplets, can’t be too long. If they are small enough, then we can simply enumerate all subarrays up to some size and count the <em>good</em> ones. We can obtain a bound on the size of <em>good</em> subarrays using partial orders and Dilworth’s Lemma.</p>

<p>Define a partial order as $(a_i, i) \leq (a_j, j)$ if $i \leq j$ and $a_i \leq a_j$. The image below shows what this partial order looks like. A directed arrow indicates that one point is $\leq$ the other. If $x \leq y$ and $y \leq z$ then $x \leq z$, so I haven’t bothered to draw the arrows from $x$ to $z$.</p>

<p><img src="/assets/ex-p8.png" alt="illustration" /></p>

<p>Notice that unrelated points, i.e. points where neither is $\leq$ the other, form a decreasing sequence. By Dilworth’s Lemma, in a subarray of length $N$, either there is an increasing subsequence of length greater than $\sqrt{N}$ or a decreasing subsequence of length greater than or equal to $\sqrt{N}$. If a subarray has length greater than $4$, then $\sqrt{N}$ exceeds $2$, so the subarray contains a monotone subsequence of length at least $3$, which is exactly a <em>bad</em> triplet.</p>

<p>Now, we can count <em>good</em> subarrays by enumerating all subarrays of size at most $4$ and checking whether they are good. The time complexity of doing so is $O(n)$.</p>
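<p>Here is a brute-force sketch of that count (all names are mine, not from a reference solution). Each element becomes the point $(i, a_i)$, and every triplet of every subarray of length at most $4$ is tested against the manhattan-distance condition:</p>

```python
from itertools import combinations

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def is_good(pts):
    # good iff in no triplet does one point lie on a manhattan
    # geodesic between the other two
    for t in combinations(range(len(pts)), 3):
        # try each of the three points as the middle point k
        for i, j, k in ((t[0], t[1], t[2]), (t[0], t[2], t[1]), (t[1], t[2], t[0])):
            if manhattan(pts[i], pts[j]) == manhattan(pts[i], pts[k]) + manhattan(pts[k], pts[j]):
                return False
    return True

def count_good_subarrays(a):
    total = 0
    for l in range(len(a)):
        # good subarrays have length at most 4, so longer ones are skipped
        for r in range(l, min(l + 4, len(a))):
            if is_good([(i, a[i]) for i in range(l, r + 1)]):
                total += 1
    return total
```

For $[1, 2, 3]$ only the full (monotone) subarray is <em>bad</em>, so the count is $5$.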

<h2 id="conclusion">Conclusion</h2>

<p>Visualizing problems obviously helps a lot. Also, the steps in the solutions above didn’t occur to me in the linear order in which they are presented; I trimmed a lot of tiny failure tracks while writing this. It just means that when you are solving problems, you are going to be led down the proverbial garden path. And that is ok!</p>

<p>I recommend reading the <a href="https://ocw.mit.edu/courses/6-042j-mathematics-for-computer-science-fall-2010/efac321fdc8d0b27586ca35b04aab808_MIT6_042JF10_chap07.pdf">post</a> that explains Dilworth’s Lemma. In fact, that MIT OCW course is quite good on the whole.</p>]]></content><author><name></name></author><category term="codeforces" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Binary Search 1</title><link href="/codeforces/2022/05/31/binary-search-v1.html" rel="alternate" type="text/html" title="Binary Search 1" /><published>2022-05-31T12:07:14+00:00</published><updated>2022-05-31T12:07:14+00:00</updated><id>/codeforces/2022/05/31/binary-search-v1</id><content type="html" xml:base="/codeforces/2022/05/31/binary-search-v1.html"><![CDATA[<script type="text/x-mathjax-config">
	MathJax.Hub.Config({
		tex2jax: {
			inlineMath: [['$', '$']]
		}
	});
</script>

<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>

<p>Binary Search applies to many problems on Codeforces. These problems can be framed as – find the largest $x \in [n]$ for which $f(x)$ is true. If $f$ has the monotonicity property:</p>

\[f(x) \Rightarrow f(x - 1)\]

<p>Then we can <em>binary search</em> for the largest $x$. This helps when evaluating $f$ on every $x$ is prohibitively expensive. When monotonicity holds, we can guess $O(\lg n)$ values of $x$ and evaluate $f$ on just those to find the largest.</p>
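<p>As a toy sketch of this template (the function name and predicate are mine, not from any problem), here is the search for the largest $x$ with $x^2 \leq 60$; the monotone $f$ is evaluated only $O(\lg n)$ times:</p>

```python
def largest_true(n, f):
    # largest x in 1..n with f(x) true, assuming f(x) implies f(x - 1)
    # and that f(1) is true
    l, r = 1, n
    while l < r:
        m = (l + r) // 2
        if f(m + 1):   # answer is at least m + 1
            l = m + 1
        else:          # answer is at most m
            r = m
    return l

print(largest_true(60, lambda x: x * x <= 60))  # 7, since 49 <= 60 < 64
```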

<h2 id="keshi-is-throwing-a-party"><a href="https://codeforces.com/contest/1610/problem/C">Keshi Is Throwing a Party</a></h2>

<p>This is an example where Binary Search yields a simple and efficient solution. In a nutshell, we are given $[n]$ and two arrays $a_1 \cdots a_n$, $b_1 \cdots b_n$. We have to find the largest set $S \subseteq [n]$ that satisfies – for all $i \in S$, there are at most $a_i$ elements in $S$ greater than $i$ and at most $b_i$ elements in $S$ smaller than $i$.</p>

<p>For example:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># assume for now that arrays are 1-indexed
</span><span class="n">n</span> <span class="o">=</span> <span class="mi">3</span>
<span class="n">a</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">]</span>
<span class="n">b</span> <span class="o">=</span> <span class="p">[</span><span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">]</span>
</code></pre></div></div>

<p>Then $S = \{1, 2\}$ is a <em>valid</em> subset.</p>

<p>The key to solving this problem is that the question – does a <em>valid</em> set of size $x$ exist? – has the monotonicity property. If a <em>valid</em> set of size $x$ exists, simply remove the last element to obtain a <em>valid</em> set of size $x - 1$.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># returns true if a valid set of size x exists
# def checker (x) : 
#   ...
</span><span class="n">l</span><span class="p">,</span> <span class="n">r</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span> <span class="n">n</span> 
<span class="k">while</span> <span class="n">l</span> <span class="o">&lt;</span> <span class="n">r</span> <span class="p">:</span> 
  <span class="n">m</span> <span class="o">=</span> <span class="p">(</span><span class="n">l</span> <span class="o">+</span> <span class="n">r</span><span class="p">)</span> <span class="o">//</span> <span class="mi">2</span>
  <span class="k">if</span> <span class="n">checker</span><span class="p">(</span><span class="n">m</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="p">:</span> 
    <span class="n">l</span> <span class="o">=</span> <span class="n">m</span> <span class="o">+</span> <span class="mi">1</span>
  <span class="k">else</span> <span class="p">:</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">m</span>

<span class="k">print</span><span class="p">(</span><span class="n">l</span><span class="p">)</span> 
</code></pre></div></div>

<p>Finally, we can write <code class="language-plaintext highlighter-rouge">checker</code> using a simple greedy strategy. We build the set $S$ by scanning through $1 \cdots n$ and greedily adding each $i$ that satisfies $|S| \leq b_i$ and $x - (|S| + 1) \leq a_i$. If at the end of the scan $|S| &lt; x$, we report that a <em>valid</em> set of size $x$ doesn’t exist.</p>
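<p>A sketch of this greedy <code>checker</code> in Python (0-indexed, with <code>a</code>, <code>b</code> and <code>x</code> passed explicitly, unlike the 1-indexed statement):</p>

```python
def checker(x, a, b):
    """Return True if a valid set of size x exists, greedily grown
    left to right; `size` is |S| so far."""
    size = 0
    for i in range(len(a)):
        # adding i gives it `size` smaller members, and at most
        # x - (size + 1) larger members can still follow it
        if size <= b[i] and x - (size + 1) <= a[i]:
            size += 1
            if size == x:
                return True
    return False
```

On the example above (<code>a = [1, 2, 1]</code>, <code>b = [2, 1, 1]</code>), <code>checker(2, a, b)</code> succeeds while <code>checker(3, a, b)</code> fails.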

<p>Why is <code class="language-plaintext highlighter-rouge">checker</code> correct? If <code class="language-plaintext highlighter-rouge">checker</code> is incorrect, that means that for some $x$, there is a <em>valid</em> $S$ of size $x$ but <code class="language-plaintext highlighter-rouge">checker</code> only finds $S’$ with $|S’| = k &lt; x$.</p>

\[S = \{i_1, \cdots, i_x\}\]

\[S' = \{j_1, \cdots, j_k\}\]

<p>Let $l$ be the smallest index where $S$ and $S’$ differ. There has to be some $l \in [k]$ for which $i_l \neq j_l$ because otherwise <code class="language-plaintext highlighter-rouge">checker</code> would have extended $S’$. Since <code class="language-plaintext highlighter-rouge">checker</code> picks the earliest item to add to the set: $j_l &lt; i_l$. Both $j_l$ and $i_l$ satisfy the <em>validity</em> condition.</p>

\[l - 1 \leq b[j_l]\]

\[l - 1 \leq b[i_l]\]

\[x - l \leq a[j_l]\]

\[x - l \leq a[i_l]\]

<p>Hence we can replace $i_l \in S$ by $j_l$. The new set $S^{(1)} = S - \{i_l\} + \{j_l\}$ is still <em>valid</em>. Repeating this process gives a <em>valid</em> set of size $x$ that agrees with $S’$ on every index in $[k]$, so no $l \in [k]$ exists where the two sets differ. But such an $l$ <em>has</em> to exist, and so we reach a contradiction, proving that <code class="language-plaintext highlighter-rouge">checker</code> is correct.</p>

<p>I’ll end the story here. <a href="https://codeforces.com/contest/1610/submission/136793593">Here</a> is my submission on Codeforces. In summary, under certain conditions, Binary Search allows us to convert an optimization problem (<em>find the largest x for which …</em>) into a decision problem (<em>does there exist an x for which …</em>). When applicable, Binary Search introduces only a small multiplicative $O(\lg n)$ overhead over the underlying decision problem.</p>

<p>I’m slightly disappointed that I haven’t been able to find an efficient dynamic-programming solution to this problem. If anyone knows of one, please tell me!</p>]]></content><author><name></name></author><category term="codeforces" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Why I do Codeforces</title><link href="/codeforces/2022/05/30/competitive-programming.html" rel="alternate" type="text/html" title="Why I do Codeforces" /><published>2022-05-30T06:10:14+00:00</published><updated>2022-05-30T06:10:14+00:00</updated><id>/codeforces/2022/05/30/competitive-programming</id><content type="html" xml:base="/codeforces/2022/05/30/competitive-programming.html"><![CDATA[<p>Recently, I spent a lot of time solving problems on Codeforces. These problems are similar to those I had encountered in undergraduate classes, such as Discrete Maths, Automata Theory and Algorithms; classes I didn’t do particularly well in.</p>

<p>Often, I was quite nervous before their exams. I didn’t back myself to be able to solve the problems. I guess I simply didn’t understand how problem-solving worked. I expected that either a solution would just pop into my head or it never would: I would just stick with the first idea and fruitlessly try to knead it into a solution. I didn’t play around with examples, create smaller subproblems, consider different hypotheses etc. On Codeforces, I could practise these strategies without consequences (such as grades).</p>

<p><img src="/assets/cf-activity.png" alt="My activity on CF" /></p>

<p>Having solved roughly 400 problems, I am more comfortable with the problem-solving process. Even if I don’t end up solving a problem, at least I give myself a fair chance. I try out examples, create hypotheses, either prove hypotheses or try to construct simple counter-examples. Sometimes, different hypotheses combine into a solution (as in square-root decomposition). If nothing works, I write a program to generate small examples and see if a pattern arises. If still nothing works, I go and smash the day’s Wordle and feel happy about that.</p>

<p>Now, instead of solving more problems, I want to take a step back and document some interesting problems and problem-solving techniques. This is slightly redundant work because problems are reviewed at length on Codeforces itself. Even so, I think that writing about them will improve my skills.</p>]]></content><author><name></name></author><category term="codeforces" /><summary type="html"><![CDATA[Recently, I spent a lot of time solving problems on Codeforces. These problems are similar to those I had encountered in undergraduate classes, such as Discrete Maths, Automata Theory and Algorithms; classes I didn’t do particularly well in.]]></summary></entry></feed>