<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2026-03-27T19:58:17+00:00</updated><id>/feed.xml</id><title type="html">Vrroom’s Blog</title><subtitle>Blog of the Vrroom</subtitle><entry><title type="html">Harnessing Weak Supervision to Isolate Sign Language Communicators in Crowded News Videos</title><link href="/blog/2024/08/11/sign-detection.html" rel="alternate" type="text/html" title="Harnessing Weak Supervision to Isolate Sign Language Communicators in Crowded News Videos" /><published>2024-08-11T00:00:00+00:00</published><updated>2024-08-11T00:00:00+00:00</updated><id>/blog/2024/08/11/sign-detection</id><content type="html" xml:base="/blog/2024/08/11/sign-detection.html"><![CDATA[<script type="text/x-mathjax-config">
	MathJax.Hub.Config({
		tex2jax: {
			inlineMath: [['$', '$']]
		}
	});
</script>

<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>

<p><em>This post is part of a new foundation that we are trying to create – <a href="https://longtailai.org/">Longtail AI Foundation</a> – which works towards increasing accessibility for the hearing impaired. It describes our first steps towards creating a large-scale dataset – <code class="language-plaintext highlighter-rouge">isl-500</code> – for training bi-directional translation models between English and Indian Sign Language (ISL).</em></p>

<h1 id="introduction">Introduction</h1>

<div style="overflow-x: auto;">
<table style="border: none;">
<tr style="border: none;">
<td style="border: none; text-align: center">
  <img src="/assets/sign_detect.png" width="500px" />
</td>
</tr>
<caption style="caption-side: bottom;">Our Hand-Signer detection model is able to detect hand-signers in News Videos with (a) multiple people present (b) in multiple views. The hand-signers are marked by a bounding box.</caption>
</table>
</div>

<p>I believe that we can solve continuous sign language translation convincingly if we have access to truly large-scale datasets. Compare with audio transcription by considering Table 6 from <a href="https://cdn.openai.com/papers/whisper.pdf">Whisper</a>, the state-of-the-art audio transcription system. Its lowest word error rate (WER) of 9.9 requires 680K hours of noisy transcribed audio data. It is unlikely that such a dataset will ever exist for Sign Language Translation; the footprint of the hearing-impaired on the Internet is far too small.</p>

<p>This is not a problem, since the table also shows that with 6.8K hours, the model achieves a word error rate of 19.6. This is <em>pretty impressive</em> considering that OpenAI evaluated their models on datasets that, firstly, weren’t seen during training and, secondly and more importantly, came from a variety of sources and were made by entirely different groups.</p>

<p>This is in stark contrast with current SL research, where most papers report numbers on the validation sets of the same dataset used in training (commonly used datasets are CSL-Daily for Chinese SL and Phoenix-2014T for German SL; these datasets are on the order of 100 hours). While the numbers improve over the years, they aren’t indicative of how these SL translation systems perform in the wild, or of whether they bring meaningful impact to the lives of the hearing-impaired.</p>

<p>All this is to say that we need to build a dataset on the 5000-hour scale, and then we are good to go. But where can we find this data? Luckily, news broadcasters often include special news segments for the hearing-impaired. In these segments, a hand-signer gesticulates words while, simultaneously, an anchor reads out the news. By detecting the hand-signer and pairing their gestures with audio transcriptions, we can start making progress on this problem. The steps to detect hand-signers are described below.</p>

<div style="overflow-x: auto;">
<table style="border: none;">
<tr style="border: none;">
<td style="border: none; text-align: center">
  <img src="/assets/overview_sign_detect.png" width="400px" />
</td>
</tr>
<caption style="caption-side: bottom;">Overview of our System</caption> 
</table>
</div>

<p>We first run a human pose estimation model, <a href="https://github.com/IDEA-Research/DWPose">DWPose</a>, on news videos that we collected from YouTube. We design heuristics to label the pose sequences. These heuristics are described in detail later in this post. Some of these heuristics may abstain from labelling, while others may not be applicable at test time. We use <a href="https://github.com/snorkel-team/snorkel">Snorkel</a> to aggregate these heuristics and assign probabilistic labels to an unlabelled training set. Internally, Snorkel weighs agreement/disagreement between heuristics across the entire unlabelled training set to assign probabilistic labels. We use these probabilistic labels to train a hand-signer classifier on pose sequences. Since Snorkel is only applicable at training time, training this classifier is what lets us detect hand-signers later, on unseen videos.</p>
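<p>As a minimal sketch (with made-up heuristics and thresholds, not our actual ones), heuristic votes are stacked into a label matrix, one row per pose sequence and one column per heuristic, which is the form a Snorkel-style aggregator consumes:</p>

```python
import numpy as np

ABSTAIN, NOT_SIGNER, SIGNER = -1, 0, 1

def num_tracks_lf(seq):
    # hypothetical heuristic: many simultaneous tracks -> probably not a signer
    return NOT_SIGNER if seq["n_tracks"] > 4 else ABSTAIN

def movement_lf(seq):
    # hypothetical heuristic: lots of hand movement -> probably a signer
    return SIGNER if seq["hand_motion"] > 0.5 else NOT_SIGNER

def build_label_matrix(sequences, lfs):
    # one row per pose sequence, one column per heuristic; -1 marks abstention
    return np.array([[lf(s) for lf in lfs] for s in sequences])

seqs = [{"n_tracks": 6, "hand_motion": 0.1},
        {"n_tracks": 1, "hand_motion": 0.9}]
L = build_label_matrix(seqs, [num_tracks_lf, movement_lf])
# in practice, a matrix like L is what gets handed to Snorkel for aggregation
```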

<h1 id="snorkel-making-sense-of-weak-labels">Snorkel: Making sense of Weak Labels</h1>

<p>Let me be up front: I don’t know how Snorkel works. But let me use this space to pass on some intuition regarding why such a system may improve the quality of heuristic labels. Let’s say we have 5 heuristics: A, B, C, D and E. A, B and C agree on everything but are <em>random</em>, i.e. they label by flipping a coin. D and E, on the other hand, are pretty accurate.</p>

<p>Consider using majority vote to combine these heuristics. It’s evident that since A, B and C always vote the same way, they represent the majority. All the valuable information from heuristics D and E is lost in this exercise.</p>

<p>My guess is that Snorkel improves on majority vote by detecting correlations between heuristics and avoids double, or in this case, triple counting certain heuristics. Thus, its probabilistic labels are more reliable than the heuristics themselves.</p>
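<p>To make this concrete, here is a toy numpy simulation (my own illustration, not Snorkel’s actual algorithm). Three perfectly correlated coin-flip heuristics drown out two accurate ones under majority vote; counting the correlated trio only once recovers most of the lost accuracy:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
y = rng.integers(0, 2, n)        # ground-truth labels

coin = rng.integers(0, 2, n)     # one shared coin flip
A = B = C = coin                 # three perfectly correlated random heuristics
acc = 0.9                        # D and E are right 90% of the time
D = np.where(rng.random(n) < acc, y, 1 - y)
E = np.where(rng.random(n) < acc, y, 1 - y)

votes = np.stack([A, B, C, D, E])
majority = (votes.sum(axis=0) >= 3).astype(int)
mv_err = (majority != y).mean()  # ~0.5: the correlated trio always wins the vote

# de-duplicate: count the correlated trio as a single voter
dedup = np.stack([coin, D, E])
dd_err = ((dedup.sum(axis=0) >= 2).astype(int) != y).mean()  # much lower
```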

<h1 id="heuristics-for-classifying-hand-signers">Heuristics for classifying Hand-Signers</h1>

<p>Here I describe the heuristics I developed for classifying pose sequences.</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">num tracks</code>: This heuristic counts the number of humans simultaneously tracked in a sequence. If there are many people tracked, it’s unlikely that any one of them is a signer. This is a really coarse heuristic, but the table below shows that it is better than random (which will have an error rate of 0.5 for this binary classification problem).</li>
  <li><code class="language-plaintext highlighter-rouge">video path</code>: This heuristic assigns labels based on the data source. For example, if the sequence came from a Sign Language dictionary, we can be sure that the tracked person is indeed a hand-signer. This heuristic has a 0 error rate but poor coverage. It highlights a common tradeoff in designing heuristics: it is hard to design heuristics that label all samples in the dataset with a low error rate.</li>
  <li><code class="language-plaintext highlighter-rouge">legs visible</code>: While going through News Videos, I found that most hand-signers are either sitting, or are too close to the camera, so that their legs are not visible. Using pose estimation confidence scores, we can check this and assign a label.</li>
  <li><code class="language-plaintext highlighter-rouge">only one person</code>: This is similar to the <code class="language-plaintext highlighter-rouge">num tracks</code> heuristic. If there is only one person tracked, it’s likely that the person is a hand-signer.</li>
  <li><code class="language-plaintext highlighter-rouge">bounding box</code>: This heuristic checks if the bounding box of a person is too small compared to video dimensions. If so, it’s unlikely that the person is a hand-signer.</li>
  <li><code class="language-plaintext highlighter-rouge">movement</code>: This heuristic measures how much a person is moving their hands. If it is above a certain threshold, they are probably a hand-signer. This heuristic is pretty good, already achieving a pretty low error rate with 100% coverage.</li>
  <li><code class="language-plaintext highlighter-rouge">chest level</code>: This heuristic simply checks whether a person’s hands cross their chest. If so, they may be a hand-signer.</li>
</ul>
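<p>As an illustration, a heuristic like <code class="language-plaintext highlighter-rouge">movement</code> might look as follows (the input layout and the threshold here are made up for the sketch; the real thresholds were tuned on the dev set):</p>

```python
import numpy as np

ABSTAIN, NOT_SIGNER, SIGNER = -1, 0, 1

def movement_lf(wrists, threshold=0.02):
    # wrists: (T frames, 2 hands, 2 normalised coords) from pose estimation
    if len(wrists) < 2:
        return ABSTAIN                     # too short to judge
    # average frame-to-frame wrist displacement
    disp = np.linalg.norm(np.diff(wrists, axis=0), axis=-1)
    return SIGNER if disp.mean() > threshold else NOT_SIGNER

T = 50
still = np.zeros((T, 2, 2))                # hands never move
rng = np.random.default_rng(1)
signing = np.cumsum(rng.normal(0, 0.05, (T, 2, 2)), axis=0)  # jittery hands
```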

<table>
  <thead>
    <tr>
      <th>Heuristic</th>
      <th>Error Rate (dev) ↓</th>
      <th>Error Rate (test) ↓</th>
      <th>Coverage (dev) ↑</th>
      <th>Coverage (test) ↑</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>num tracks</td>
      <td>0.28</td>
      <td>0.27</td>
      <td>1.00</td>
      <td>1.00</td>
    </tr>
    <tr>
      <td>video path</td>
      <td>0.00</td>
      <td>0.00</td>
      <td>0.02</td>
      <td>0.02</td>
    </tr>
    <tr>
      <td>legs visible</td>
      <td>0.19</td>
      <td>0.15</td>
      <td>1.00</td>
      <td>1.00</td>
    </tr>
    <tr>
      <td>only one person</td>
      <td>0.05</td>
      <td>0.03</td>
      <td>0.35</td>
      <td>0.37</td>
    </tr>
    <tr>
      <td>bounding box</td>
      <td>0.23</td>
      <td>0.21</td>
      <td>1.00</td>
      <td>1.00</td>
    </tr>
    <tr>
      <td>movement</td>
      <td>0.05</td>
      <td>0.08</td>
      <td>1.00</td>
      <td>1.00</td>
    </tr>
    <tr>
      <td>chest level</td>
      <td>0.15</td>
      <td>0.16</td>
      <td>1.00</td>
      <td>1.00</td>
    </tr>
  </tbody>
</table>

<p><strong>Table 1:</strong> Labelling Functions: Error Rate (1 − Accuracy) and Coverage (the fraction of data points they don’t abstain from labelling) of different labelling functions. Note that our dev and test sets are class balanced i.e. they have equal number of hand-signers and non-hand-signers. The dev set was used to tune the thresholds in the heuristics.</p>
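<p>For reference, the two quantities reported in Table 1 can be computed as follows (a sketch, with -1 denoting abstention):</p>

```python
import numpy as np

def coverage(votes):
    # fraction of samples the heuristic does not abstain (-1) on
    votes = np.asarray(votes)
    return (votes != -1).mean()

def error_rate(votes, y):
    # 1 - accuracy, measured only on the samples the heuristic labels
    votes, y = np.asarray(votes), np.asarray(y)
    labelled = votes != -1
    return (votes[labelled] != y[labelled]).mean()

y     = np.array([1, 0, 1, 1, 0])
votes = np.array([1, 0, -1, 0, 0])   # abstains once, wrong once
```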

<h1 id="evaluation">Evaluation</h1>

<p>I compared our classifier trained on Snorkel’s probabilistic labels against a few reasonable baselines. The simplest one is a majority vote among all the non-abstaining heuristics. The table confirms our intuition that majority vote is not good. In fact, it’s worse than our best heuristic (<code class="language-plaintext highlighter-rouge">movement</code>).</p>
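<p>The majority-vote baseline is easy to state in code (a sketch; -1 marks abstention, and ties fall back to the negative class):</p>

```python
import numpy as np

def majority_vote(L):
    # L: (n_samples, n_heuristics) with labels in {0, 1} and -1 for abstain
    L = np.asarray(L)
    ones = (L == 1).sum(axis=1)
    zeros = (L == 0).sum(axis=1)
    return (ones > zeros).astype(int)    # ties / all-abstain rows fall to 0

L = np.array([[ 1,  1,  0, -1],
              [-1, -1,  0,  0],
              [ 1, -1, -1, -1]])
```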

<p>To control for the effect of Snorkel, we also trained the classifier on labels from the <code class="language-plaintext highlighter-rouge">movement</code> heuristic. It is not surprising that this doesn’t meaningfully impact the error rate. In fact, had it done so, I’d be more likely to attribute it to a bug in my code than to anything else.</p>

<p>Training on Snorkel’s probabilistic labels produces the best error rates. They are half those of our best-performing heuristic (both on the dev and the test set, incidentally).</p>

<table>
  <thead>
    <tr>
      <th>Method</th>
      <th>Error Rate (dev) ↓</th>
      <th>Error Rate (test) ↓</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>majority vote</td>
      <td>0.13</td>
      <td>0.12</td>
    </tr>
    <tr>
      <td>movement heuristic</td>
      <td>0.05</td>
      <td>0.08</td>
    </tr>
    <tr>
      <td>movement heuristic + classifier</td>
      <td>0.04</td>
      <td>0.08</td>
    </tr>
    <tr>
      <td>ours</td>
      <td>0.02</td>
      <td>0.04</td>
    </tr>
  </tbody>
</table>

<p><strong>Table 2:</strong> Error rates: We compare our final classifier against reasonable baselines (the best performing heuristic, a classifier trained on the best performing heuristic, and a majority vote among all heuristics). Pooling heuristic labels using Snorkel and training a classifier slashes the error rate by at least 50% on both our dev and test set.</p>

<h1 id="final-thoughts">Final Thoughts</h1>

<p>I should point out that Snorkel is not a magic wand. The heuristics one uses should already be pretty good and should broadly cover the dataset. Also, detecting hand-signers is quite a simple computer-vision problem in the era of GPT-4o. That being said, it is pretty exciting that this works at all, and it should make data acquisition cheaper in a lot of use cases.</p>

<p>Our code is available on <a href="https://github.com/Longtail-AI-Foundation/sign-detect">GitHub</a>. We have used it to clean ~500 hours of news videos (ISL) along with transcripts. If anyone has any use for this dataset/code, please reach out to us. Any thoughts/suggestions will be appreciated.</p>

<hr />]]></content><author><name></name></author><category term="blog" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Segmenting Comic book Frames</title><link href="/blog/2024/02/23/comic-frame-segmentation.html" rel="alternate" type="text/html" title="Segmenting Comic book Frames" /><published>2024-02-23T00:00:00+00:00</published><updated>2024-02-23T00:00:00+00:00</updated><id>/blog/2024/02/23/comic-frame-segmentation</id><content type="html" xml:base="/blog/2024/02/23/comic-frame-segmentation.html"><![CDATA[<p><em>This post is based on my project in my Computer Vision class last semester</em></p>

<h1 id="introduction">Introduction</h1>

<p>As I was learning classical techniques in my Computer Vision class, I came across a <a href="https://maxhalford.github.io/blog/comic-book-panel-segmentation/">blog post</a> by Max Halford on extracting frames from comic books. He developed a very interesting algorithm where he applied <em>Canny</em> to detect the boundary of frames, filled holes and fit bounding boxes to contiguous regions.</p>

<p>This elegant algorithm did the job very well but had its shortcomings. For one, it didn’t handle arbitrary, un-aligned polygons, and it didn’t work on <em>negative frames</em>, which don’t have a boundary of their own but rather are defined by the boundaries of neighboring frames.</p>

<p>Given the hype around <em>foundation models</em> for segmentation such as <a href="https://github.com/facebookresearch/segment-anything">SAM</a>, I approached this problem by procedurally generating a synthetic dataset of comic books and finetuning SAM to detect the corner points of frames.</p>

<div style="overflow-x: auto;">
<table style="border: none;">
<tr style="border: none;">
<td style="border: none; text-align: center">
  <img src="/assets/comic_frame_seg/comic_panels.png" width="300px" />
</td>
</tr>
<caption style="caption-side: bottom; text-align: center;">Failure cases of heuristic approaches: (Top) Frames from Pepper and Carrot by David Revoy are polygons and not axis-aligned bounding boxes. (Bottom) Negative frames may not have a well defined border. </caption>
</table>
</div>

<h1 id="procedural-comic-generator">Procedural Comic Generator</h1>

<p>There isn’t abundant data available for this problem. But that doesn’t mean we should hang our heads. A common technique that is widely used (see <a href="https://errollw.com/">Erroll Wood’s</a> work) is to procedurally generate training data.</p>

<p>In our case, this means simulating comic books. Note, we don’t really need to make gripping artwork and tell a story; we just need to generate panels that look like comics from 50,000 feet. To do this, I wrote a procedural generator of layouts that places random boxes on an empty image. I filled these boxes with images sampled from the <a href="https://danbooru.donmai.us/">Danbooru</a> dataset.</p>
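<p>A minimal sketch of the layout idea (the actual generator has more knobs for box size, shape and boundary properties) is a recursive guillotine split of the page:</p>

```python
import random

def split_layout(x, y, w, h, depth, rng, min_size=64):
    # recursively cut a rectangle into panel boxes with random guillotine cuts
    if depth == 0 or w < 2 * min_size or h < 2 * min_size:
        return [(x, y, w, h)]
    if rng.random() < 0.5:  # vertical cut
        cut = rng.randint(min_size, w - min_size)
        return (split_layout(x, y, cut, h, depth - 1, rng, min_size)
                + split_layout(x + cut, y, w - cut, h, depth - 1, rng, min_size))
    cut = rng.randint(min_size, h - min_size)  # horizontal cut
    return (split_layout(x, y, w, cut, depth - 1, rng, min_size)
            + split_layout(x, y + cut, w, h - cut, depth - 1, rng, min_size))

boxes = split_layout(0, 0, 1024, 1024, depth=3, rng=random.Random(0))
# the boxes tile the page exactly and can then be filled with images
```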

<p>In order to ensure that the sampled images were at least semi-coherent, I used the <a href="https://github.com/openai/CLIP">CLIP L/14 image encoder</a> to create an image index. While choosing images for a particular page, I sampled one image at random from Danbooru and filled the rest of the boxes with its k-nearest neighbors.</p>
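<p>The index itself amounts to nearest-neighbour search over image embeddings. A sketch, with random vectors standing in for precomputed CLIP L/14 embeddings:</p>

```python
import numpy as np

def knn_page_fill(embs, k, rng):
    # pick a random seed image and return it with its k nearest
    # neighbours under cosine similarity
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    seed = rng.integers(len(embs))
    sims = embs @ embs[seed]
    order = np.argsort(-sims)        # most similar first (the seed itself leads)
    return order[:k + 1].tolist()

rng = np.random.default_rng(0)
embs = rng.normal(size=(100, 32))          # stand-in for CLIP image embeddings
page = knn_page_fill(embs, k=5, rng=rng)   # 6 mutually similar images
```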

<p>With this procedural generator, I had complete control of the size, shape and boundary properties of the box, which I could set appropriately to simulate <em>negative</em> and <em>polygonal</em> frames.</p>

<div style="overflow-x: auto;">
<table style="border: none;">
<tr style="border: none;">
<td style="border: none; text-align: center">
  <img src="/assets/comic_frame_seg/procedural_frames.png" width="300px" />
</td>
<td style="border: none; text-align: center">
  <img src="/assets/comic_frame_seg/procedural2.png" width="300px" />
</td>
</tr>
<caption style="caption-side: bottom; text-align: center;">Initial and final version of our procedural comic generator: (Left) Initial version is just a bunch of boxes. (Right) Final version where I added images from Danbooru randomly to the boxes.</caption>
</table>
</div>

<h1 id="comic-segmentation">Comic Segmentation</h1>

<p>I used SAM as the backbone for my model. SAM is the state-of-the-art image segmentation model. It consists of a heavy, compute-expensive image encoder and a light-weight decoder, which answers segmentation queries. The heavy encoder encodes an image only once, after which multiple segmentation queries are answered cheaply. This division of labor is particularly useful for deployment, where an enterprise serving a user can optimize for both speed and cost by keeping the heavy encoder inference on the cloud and using the user’s device for light-weight decoder inference.</p>

<p>Since SAM predicts dense, per-pixel masks, I modified it to predict points instead. An overview of the model can be seen below. The procedurally generated comic page is fed to the image encoder (whose weights remain unchanged during training). A point is randomly sampled from a frame and given as a query/prompt. The light-weight decoder is trained to recover the corners of the frame.</p>

<div style="overflow-x: auto;">
<table style="border: none;">
<tr style="border: none;">
<td style="border: none; text-align: center">
  <img src="/assets/comic_frame_seg/architecture.png" width="500px" />
</td>
</tr>
<caption style="caption-side: bottom; text-align: center;">Model Overview</caption>
</table>
</div>

<p>I learned two lessons while training this model. Firstly, it was important to canonicalize the order in which the corners of the frame were predicted. Without this, the model got conflicting signals on the ordering of corner points and never converged. Secondly, it was important to use L1 instead of L2 loss since L2 optimized very quickly without improving the quality of predictions.</p>
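<p>For the first lesson, here is a sketch of one possible canonicalization: sort the corners by angle around the centroid, then start the sequence from the corner closest to the top-left. Without a fixed order, the regression targets are ambiguous:</p>

```python
import numpy as np

def canonicalize(corners):
    # order a quad's corners consistently so regression targets are unambiguous
    corners = np.asarray(corners, dtype=float)
    center = corners.mean(axis=0)
    ang = np.arctan2(corners[:, 1] - center[1], corners[:, 0] - center[0])
    corners = corners[np.argsort(ang)]           # consistent rotational order
    start = np.argmin(corners.sum(axis=1))       # corner closest to top-left
    return np.roll(corners, -start, axis=0)

quad = [(9, 1), (1, 1), (1, 9), (9, 9)]          # arbitrary input order
```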

<h1 id="evaluation">Evaluation</h1>

<p>I compared my method against the original SAM and Halford’s method. Note that Halford’s method is at a slight disadvantage in this comparison, since my method also uses a query (set to the center of the ground-truth frame to be predicted). Despite this, it is evident that our model, trained on our procedurally generated dataset, generalizes to “real-world” comics (Pepper and Carrot, abbreviated as P&amp;C), coming close to Halford in the process. It beats Halford on the procedurally generated dataset (abbreviated as Pr), since this dataset is designed to expose the flaws in his method.</p>

<table>
  <thead>
    <tr>
      <th>Method</th>
      <th>IoU (P&amp;C)</th>
      <th>PCK@0.1 (P&amp;C)</th>
      <th>L1 (P&amp;C)</th>
      <th>IoU (Pr)</th>
      <th>PCK@0.1 (Pr)</th>
      <th>L1 (Pr)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>SAM</td>
      <td>0.42</td>
      <td>0.52</td>
      <td>0.37</td>
      <td>0.81</td>
      <td>0.94</td>
      <td>0.08</td>
    </tr>
    <tr>
      <td>Halford</td>
      <td><strong>0.93</strong></td>
      <td>0.96</td>
      <td><strong>0.04</strong></td>
      <td>0.47</td>
      <td>0.61</td>
      <td>0.47</td>
    </tr>
    <tr>
      <td>Ours</td>
      <td>0.88</td>
      <td><strong>0.98</strong></td>
      <td>0.05</td>
      <td><strong>0.88</strong></td>
      <td><strong>0.99</strong></td>
      <td><strong>0.03</strong></td>
    </tr>
  </tbody>
</table>

<p>Here, IoU simply measures the area of intersection over union of the ground-truth and predicted frames. PCK@0.1 refers to the percentage of times the predicted frame corner lies within a certain radius of the ground-truth frame corner (0.1 refers to the radius as a fraction of the diagonal of the comic page). L1 is simply the L1 distance between the ground-truth and predicted corners.</p>
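<p>Concretely, PCK@0.1 can be computed like this (a sketch on made-up corner predictions):</p>

```python
import numpy as np

def pck(pred, gt, page_w, page_h, alpha=0.1):
    # fraction of predicted corners within alpha * page diagonal of ground truth
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    thresh = alpha * np.hypot(page_w, page_h)
    dists = np.linalg.norm(pred - gt, axis=1)
    return (dists <= thresh).mean()

gt   = np.array([[0, 0], [100, 0], [100, 100], [0, 100]])
pred = np.array([[5, 5], [100, 0], [100, 100], [80, 40]])  # last corner way off
```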

<p>Below are some qualitative results which demonstrate that our method works on “real-world” comics. We run it in two modes. On the left, we interactively provide a query and the model produces the corners. On the right, we sample a bunch of queries on the image, predict polygons and filter them using <em>non-maximal suppression</em>, as in the original SAM paper.</p>

<div style="overflow-x: auto;">
<table style="border: none;">
<tr style="border: none;">
<td style="border: none; text-align: center">
  <img src="/assets/comic_frame_seg/interactive.png" width="300px" />
</td>
<td style="border: none; text-align: center">
  <img src="/assets/comic_frame_seg/confidence.png" width="300px" />
</td>
</tr>
<caption style="caption-side: bottom; text-align: center;">Qualitative Results</caption>
</table>
</div>

<h1 id="final-thoughts">Final Thoughts</h1>

<p>There are still shortcomings to my method, and it can fail on complex, cluttered comic pages. Still, I prefer this approach to designing algorithms over composing OpenCV functions, because it is often easier to see how to improve the dataset than to design new heuristics. Once you do that, you almost have a guarantee that the neural network machinery will get you the results.</p>

<p>The annotated Pepper and Carrot dataset that I used for evaluation can be found in my <a href="https://drive.google.com/file/d/1z8OE8TC8eupC6_ZNxUSVyfvk4rSkVIgE">drive link</a>. All my code and checkpoints are available in my <a href="https://github.com/Vrroom/segment-anything-comic">Github Repo</a>. If you think of any improvements to my approach, feel free to reach out!</p>

<hr />]]></content><author><name></name></author><category term="blog" /><summary type="html"><![CDATA[This post is based on my project in my Computer Vision class last semester]]></summary></entry><entry><title type="html">Coloring with ControlNet</title><link href="/blog/2024/02/16/interactive-coloring.html" rel="alternate" type="text/html" title="Coloring with ControlNet" /><published>2024-02-16T00:00:00+00:00</published><updated>2024-02-16T00:00:00+00:00</updated><id>/blog/2024/02/16/interactive-coloring</id><content type="html" xml:base="/blog/2024/02/16/interactive-coloring.html"><![CDATA[<!-- Embed videos side by side -->
<div style="overflow-x: auto;">
<!--<tr style="border: none;"> -->
<!--<td style="border: none;"> -->
<table style="border: none;">
<tr style="border: none;">
<td style="border: none;">

<video width="320" height="240" controls="">
  <source src="/assets/int_color_vid_1.mp4" type="video/mp4" />
  Your browser does not support the video tag.
</video>

</td>
<td style="border: none;">

<video width="320" height="240" controls="">
  <source src="/assets/int_color_vid_2.mp4" type="video/mp4" />
  Your browser does not support the video tag.
</video>

</td>
</tr>
<caption style="caption-side: bottom; text-align: center;">A couple of cherry-picked examples that show how someone might use this model</caption>
</table>
</div>

<h2 id="introduction">Introduction</h2>

<p>I trained a ControlNet for interactively coloring line drawings. I was inspired partly by a <a href="https://twitter.com/lvminzhang/status/1612421180933406720">Twitter post</a> by Lvmin Zhang, the original author of the <a href="https://github.com/lllyasviel/ControlNet">ControlNet</a> and <a href="https://github.com/lllyasviel/style2paints">Style2Paints</a> projects, and partly by my niece, who can spend hours filling up her coloring books.</p>

<!-- <blockquote class="twitter-tweet"><p lang="en" dir="ltr">Color scribbles. Note that color scribbles in large diffusion models can be very tricky. We are still experimenting this to make it easier to control. <a href="https://t.co/WoYtJlsjbc">pic.twitter.com/WoYtJlsjbc</a></p>&mdash; style2paints (@lvminzhang) <a href="https://twitter.com/lvminzhang/status/1612421180933406720?ref_src=twsrc%5Etfw">January 9, 2023</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> -->

<p>More generally, I am excited by how deep learning provides a way to design new user interactions. The basic idea is to conceive of a new interaction (e.g. color strokes), procedurally generate examples of the interaction from some target (e.g. colored images) and train a model to map interactions to target (e.g. from color strokes to colored images).</p>

<div style="overflow-x: auto;">
<table style="border: none;">
<tr style="border: none;">
<td style="border: none;">
  <img src="/assets/int_color_overview.png" />
</td>
</tr>
<caption style="caption-side: bottom; text-align: center;">Overview</caption>
</table>
</div>

<p>Now that I’m winding down this experiment in order to reduce my expenses on Lambda Labs, I’m writing this blog post to document my progress.</p>

<h2 id="model-and-dataset">Model and Dataset</h2>

<p>I trained the popular ControlNet diffusion model (based on Stable Diffusion), which synthesizes images matching a given hint. Training such a model requires a paired dataset of target images and hints. Usually, during training, the hint is generated from the target image <em>procedurally</em>, in the simplest case by applying the <em>Canny</em> edge detector or a <em>pre-trained</em> depth-estimation model to the target image. Using this paired dataset, the model learns to go the other way, i.e. from the hint to the target image. Examples of hints from the original ControlNet paper are <em>Canny</em> and <em>HED</em> edges, <em>depth</em> maps, <em>human pose</em> and even simulated <em>doodles</em>.</p>

<p>In our case, we concoct two hints from the target image: a <em>Canny</em> edge map and a few procedurally generated <em>color strokes</em>. These are encoded as a 5-channel image, 1 channel for <em>Canny</em> and 4 for the RGBA color strokes. The extra <em>alpha</em> channel delineates the presence or absence of the stroke.</p>
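<p>A sketch of how such a 5-channel hint can be assembled (a dummy array stands in for the actual Canny edge map here):</p>

```python
import numpy as np

def make_hint(canny_edges, strokes_rgba):
    # stack the 1-channel edge map and 4-channel RGBA strokes into a
    # 5-channel conditioning image, scaled to [0, 1]
    edges = canny_edges.astype(np.float32)[..., None] / 255.0   # (H, W, 1)
    strokes = strokes_rgba.astype(np.float32) / 255.0           # (H, W, 4)
    return np.concatenate([edges, strokes], axis=-1)            # (H, W, 5)

h, w = 64, 64
edges = np.zeros((h, w), dtype=np.uint8)        # stand-in for cv2.Canny output
strokes = np.zeros((h, w, 4), dtype=np.uint8)
strokes[10:20, 10:20] = [255, 0, 0, 255]        # a red stroke; alpha marks presence
hint = make_hint(edges, strokes)
```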

<p>The target image set I used was a combination of <a href="https://huggingface.co/datasets/vivym/midjourney-messages">Midjourney Messages</a> and <a href="https://ai.google.com/research/ConceptualCaptions">Google Conceptual Captions</a>. Both of these are conveniently available as caption, image URL pairs. Since each training iteration of ControlNet is so slow (~1 it/s), I found it convenient to simply fetch these URLs in my data-loading loop instead of downloading all the files upfront. This saved me a lot of money on data storage.</p>

<p>I generated color strokes using a simple algorithm. I chose a random pixel in the target image and began a random walk for a certain number of steps. I picked up all the colors from the target during this walk and applied them to my color stroke map. With some probability, the random walk was made thicker than 1 pixel using a <code class="language-plaintext highlighter-rouge">max</code>-filter. Again, with some probability, the color stroke map was blurred to discourage the model from learning the trivial <code class="language-plaintext highlighter-rouge">copy</code> operation, where it simply copies the color stroke onto the target image without doing anything meaningful elsewhere.</p>

<p>The code for the color stroke generator can be found below:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">add_color_stroke</span> <span class="p">(</span><span class="n">color_strokes</span><span class="p">,</span> <span class="n">source_img</span><span class="p">)</span> <span class="p">:</span>
    <span class="c1"># get shape of image, starting point and length of walk
</span>    <span class="n">h</span><span class="p">,</span> <span class="n">w</span> <span class="o">=</span> <span class="n">source_img</span><span class="p">.</span><span class="n">shape</span><span class="p">[:</span><span class="mi">2</span><span class="p">]</span>
    <span class="n">st_y</span><span class="p">,</span> <span class="n">st_x</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">h</span> <span class="o">-</span> <span class="mi">1</span><span class="p">),</span> <span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">w</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
    <span class="n">L</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">200</span><span class="p">,</span> <span class="mi">1000</span><span class="p">)</span>

    <span class="c1"># construct walk
</span>    <span class="n">dirs</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">]],</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">int</span><span class="p">)</span>
    <span class="n">rng_idx</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="p">(</span><span class="n">L</span><span class="p">,))</span>
    <span class="n">steps</span> <span class="o">=</span> <span class="n">dirs</span><span class="p">[</span><span class="n">rng_idx</span><span class="p">]</span>
    <span class="n">px_points</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">cumsum</span><span class="p">(</span><span class="n">steps</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    <span class="n">px_points</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">+=</span> <span class="n">st_y</span>
    <span class="n">px_points</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">]</span> <span class="o">+=</span> <span class="n">st_x</span>

    <span class="c1"># find when walk goes out of bounds and truncate it
</span>    <span class="n">y_mask</span> <span class="o">=</span> <span class="p">(</span><span class="n">px_points</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">&lt;</span> <span class="n">h</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">px_points</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">)</span>
    <span class="n">x_mask</span> <span class="o">=</span> <span class="p">(</span><span class="n">px_points</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">]</span> <span class="o">&lt;</span> <span class="n">w</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">px_points</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">]</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">)</span>
    <span class="n">ff_id</span> <span class="o">=</span> <span class="n">find_first_false</span><span class="p">(</span><span class="n">y_mask</span> <span class="o">&amp;</span> <span class="n">x_mask</span><span class="p">)</span>
    <span class="n">px_points</span> <span class="o">=</span> <span class="n">px_points</span><span class="p">[:</span><span class="n">ff_id</span><span class="p">]</span>

    <span class="c1"># create mask from walk
</span>    <span class="n">mask</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">h</span><span class="p">,</span> <span class="n">w</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">uint8</span><span class="p">)</span>
    <span class="n">mask</span><span class="p">[</span><span class="n">px_points</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">px_points</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">]]</span> <span class="o">=</span> <span class="mi">255</span>

    <span class="c1"># optionally, dilate the walk
</span>    <span class="n">thickness</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">])</span>
    <span class="k">if</span> <span class="n">thickness</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="p">:</span>
        <span class="n">footprint</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">ones</span><span class="p">((</span><span class="n">thickness</span><span class="p">,</span> <span class="n">thickness</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">uint8</span><span class="p">)</span>
        <span class="n">mask</span> <span class="o">=</span> <span class="n">skimage</span><span class="p">.</span><span class="n">filters</span><span class="p">.</span><span class="n">rank</span><span class="p">.</span><span class="n">maximum</span><span class="p">(</span><span class="n">mask</span><span class="p">,</span> <span class="n">footprint</span><span class="p">)</span>

    <span class="c1"># copy over colors from input image
</span>    <span class="n">color_strokes</span><span class="p">[</span><span class="n">mask</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">,</span> <span class="p">:</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="n">source_img</span><span class="p">[</span><span class="n">mask</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">]</span>
    <span class="n">color_strokes</span><span class="p">[</span><span class="n">mask</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="n">mask</span><span class="p">[</span><span class="n">mask</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">]</span>
</code></pre></div></div>

<h2 id="infrastructure-and-costs">Infrastructure and Costs</h2>

<p>Now, onto some details that give a sense of the investment required to run such experiments. I rented an A10 24GB GPU on Lambda Labs. I have used GCP before, but I love the simplicity of what Lambda Labs offers. For example, I can work on their cloud machines using tools I already know, without having to worry about the <code class="language-plaintext highlighter-rouge">gcloud</code> SDK. This is a minor pain point, but I do think it makes a difference. Lambda Labs also offers a persistent file system, which means that your data is backed up even if you terminate your instances.</p>

<p>All that said, it is very expensive. Last year, the hourly cost of renting an A10 on Lambda Labs was $0.60, but this year they have increased it to $0.75. They justified this by saying they’ll provide more GPU instances to cover the demand. Even so, GPU instances are almost always unavailable (e.g. I have never seen an available A100 with my naked eye). Normally, I just run a script that polls their API for available GPUs and wait for something to happen.</p>

<p>For this experiment in particular, I ran a training job for around 2 months. Since only a batch size of 1 fits into the A10’s memory and ControlNets are finicky at low batch sizes, I used gradient accumulation for an effective batch size of 64. The total cost of this run must have been around $1,000.</p>
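<p>As an aside, gradient accumulation is simple but easy to get subtly wrong. Here is a minimal, framework-free sketch of the idea (the quadratic-loss gradient, learning rate and data below are illustrative, not the settings from my run): accumulate per-sample gradients over 64 micro-batches, then take a single averaged optimizer step.</p>

```python
import numpy as np

def sgd_with_accumulation(grad_fn, w, samples, accum=64, lr=0.1):
    """Plain SGD where one optimizer step uses gradients accumulated
    over `accum` micro-batches of size 1 (effective batch size = accum)."""
    acc = np.zeros_like(w)
    for i, x in enumerate(samples, start=1):
        acc += grad_fn(w, x)              # backward pass for one micro-batch
        if i % accum == 0:                # step only every `accum` micro-batches
            w = w - lr * acc / accum      # averaging emulates one big batch
            acc = np.zeros_like(w)        # reset the accumulator
    return w
```

<p>In PyTorch, the analogous pattern is calling <code class="language-plaintext highlighter-rouge">loss.backward()</code> on every micro-batch but <code class="language-plaintext highlighter-rouge">optimizer.step()</code> (and <code class="language-plaintext highlighter-rouge">optimizer.zero_grad()</code>) only on every 64th.</p>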

<h2 id="final-thoughts">Final Thoughts</h2>

<p>I’m aware of the broader discussions around Generative AI and its impact on people’s livelihoods. But I do think these models are quite inadequate for realizing the very specific visions for which companies tend to hire commercial artists. I feel this way not because I’m a commercial artist, but because, sadly, I have spent a lot of time cherry-picking examples from these models to make demos. And so, I know that they fail a lot. Maybe that’ll change tomorrow with OpenAI’s SORA, but even then, I’m sure customization will remain a key open problem. And to that end, perhaps all we can do is build better tools for people.</p>

<p>Anyway, before I get ahead of myself: at its current stage, this project can be improved a lot. I have two directions in mind at the moment. It may be that the current approach will work, if only trained longer. If I had to start over, I would build on <a href="https://github.com/Stability-AI/StableCascade">Stable Cascade</a>, the latest Latent Diffusion model by Stability AI. The core text-conditional diffusion model here operates on a very low-dimensional latent space, with spatial dimensions 12 by 12 compared to 32 by 32 for Stable Diffusion (for 512 by 512 images). For the same compute budget, this would give me 4x more training steps.</p>

<p>The other direction that I’m thinking of is that maybe this problem is not amenable to Latent Diffusion Models. While these may be adequate for structural hints (e.g. <em>Canny</em>, <em>human pose</em> or <em>depth</em> maps), it might be a better idea to train with the denoising diffusion loss in image space (where the exact pixel color values are compared). Luckily, there is a large image-space diffusion model, <a href="https://github.com/deep-floyd/IF">DeepFloyd IF</a>, that I can work with.</p>

<hr />]]></content><author><name></name></author><category term="blog" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Self Driving RC Car</title><link href="/robot/2022/07/16/self-driving.html" rel="alternate" type="text/html" title="Self Driving RC Car" /><published>2022-07-16T06:41:00+00:00</published><updated>2022-07-16T06:41:00+00:00</updated><id>/robot/2022/07/16/self-driving</id><content type="html" xml:base="/robot/2022/07/16/self-driving.html"><![CDATA[<script type="text/x-mathjax-config">
	MathJax.Hub.Config({
		tex2jax: {
			inlineMath: [['$', '$']]
		}
	});
</script>

<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>

<p align="center">
  <img src="/assets/car.jpeg" alt="drawing" width="400" />
<p />

<h1>Motivation</h1>

<p> 
I saw a talk where the speaker argued that modern computer vision systems should take more inspiration from biology. He described the mantis shrimp, a sea animal with two eyes that move independently of each other, and wondered whether its eyes enable better representations of the visual world. He asked why evolution chose this design for the shrimp, and how the design helped.
</p>

<p> 
Transforming these questions into scientific inquiry would be challenging, to say the least. Nevertheless, they inspired me to try to build a small-scale autonomous vehicle with mantis-shrimp-like eyes. I'm not there yet: my RC Car has only one eye, i.e. one camera. But it can drive itself, albeit only on a particular track. I'll use this post to describe my car. Before I begin, I would like to give huge credit to the <a href="https://github.com/autorope/donkeycar">Donkeycar</a> project. I was able to build my car only because of them.
</p>

<h1>Design</h1>

<p>
I haven't found good discussions on motor selection, battery selection, power consumption and the weight of RC cars on the Internet. My struggles in understanding these resulted in a lot of bad designs. I think it'll be useful to describe and justify my design here. 
</p>

<p align="center">
  <img src="/assets/schematic.png" alt="drawing" width="400" />
<p />

<p>
The schematic (<a href="https://www.electronicshub.org/raspberry-pi-l298n-interface-tutorial-control-dc-motor-l298n-raspberry-pi/">original source</a>) shows how to control a DC motor using a Raspberry Pi. A lithium-polymer battery powers both the Raspberry Pi (4B) and the L298N motor controller. The Raspberry Pi requires an operating voltage of 5V at 2A current, which it gets from the XY-3606 DC-to-DC converter (the blue board in the schematic). It is important to connect the XY-3606 to the Pi using a USB cable. I tried to connect the two directly using jumper wires. It didn't work: the Pi failed to boot. 
</p>

<p>
The Pi controls motor speed and direction by sending 3 signals to the L298N motor controller. Two of them (orange and green wires in the schematic) control start, brake and the direction (forward or reverse). The third (yellow wire) is a PWM signal which controls the voltage supplied to the motor, and hence the car's speed. A single L298N motor controller can drive 2 motors. My car has 4 motors and thus requires 2 such motor controllers. These are fastened to the underside of my car in order to save space.
</p> 

<p align="center">
  <img src="/assets/carbackside.jpeg" alt="drawing" width="400" />
<p />

<p> 
The total cost of all the components on the car is around 11,000 rupees. It is quite a lot, but note that the Raspberry Pi, its camera module and the Lithium-Polymer battery account for nearly 75% of the total cost. These can be reused for a lot of other things as well. 
</p> 

<p> 
Now that I have described my design, I'll try to justify it. My justifications are based on extremely crude models that don't represent the actual physics of the system. Still, they are useful in illustrating which variables are important and which are not.
</p> 

<h4>Battery</h4> 

<p>
Selecting the battery is obviously crucial. Batteries are quite expensive. Before purchasing one, you have to be sure that it can power the components in your circuit and run them for a sufficient amount of time. 
</p> 

<p>
I collected the power requirements of components in my car. Some I obtained from their respective datasheets. Others I estimated from their operating currents and voltages. I don't have the sources for these anymore. Treat them as ball-park estimates.
</p> 

<table>
  <tr>
    <th>Component</th>
    <th>Power Consumption (W)</th>
  </tr>
  <tr>
    <td>Raspberry Pi 4B</td>
    <td>10</td>
  </tr>
  <tr>
    <td>2 L298N motor controllers</td>
    <td>50</td>
  </tr>
  <tr>
    <td>4 BO motors</td>
    <td>10</td>
  </tr>
  <tr>
    <td>XY-3606</td>
    <td>5</td>
  </tr> 
  <tr>
    <td><b>Total</b></td>
    <td><b>75</b></td>
  </tr> 
</table> 

<p>
My <a href="https://robu.in/product/orange-1000mah-3s-30c60c-lithium-polymer-battery-pack-lipo/">Lithium-Polymer</a> battery supplies a voltage of 11.1V. Since $\text{Power} = \text{Voltage} \times \text{Current}$, the discharge rate is $\frac{\text{Power}}{\text{Voltage}} = \frac{75}{11.1} \approx 6.8A$. This is well within the rated discharge rate for my battery. 
</p>

<p>
I can also estimate the battery life from the battery capacity. The battery contains $1000mAh$ of charge. In SI units, this is $1000 \times 10^{-3} \times 3600 = 3600C$. Now, assuming that I can discharge the battery completely, I get $\frac{3600}{6.8} \approx 530s$, or about $9$ minutes of battery life. 
</p>
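<p>These back-of-the-envelope numbers are easy to redo for a different battery or component list. A small sketch, using only the figures from the table above and the battery's label:</p>

```python
# Rough battery-life estimate from the component table and the battery label.
power_w = {"Raspberry Pi 4B": 10, "2x L298N": 50, "4x BO motors": 10, "XY-3606": 5}
total_w = sum(power_w.values())              # 75 W at full load
voltage_v = 11.1                             # 3S LiPo nominal voltage
discharge_a = total_w / voltage_v            # ~6.8 A current draw
charge_c = 1000e-3 * 3600                    # 1000 mAh converted to coulombs
runtime_min = charge_c / discharge_a / 60    # ~9 minutes, if fully discharged
```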

<p>
This analysis should raise a lot of eyebrows. For one, Lithium-Polymer batteries should never be discharged completely. Doing so would render them unsafe for future use. Moreover, batteries don't supply constant power as they discharge. Power supply diminishes through battery operation. This means that my calculation of the discharge rate and battery life is not really accurate. Nevertheless, these back-of-the-envelope calculations helped me be sure that my car could run and would do so, at least for a few minutes. 
</p> 

<h4>Force</h4> 

<p> 
In order for the car to move, it has to overcome friction. Frictional force is often modelled as $F_{\text{friction}} = \mu \times M \times g$. The mass of my car $M$ is around 0.5Kg, the coefficient of friction $\mu$ is usually less than 1 and $g$ is acceleration due to gravity, $10 m/s^2$. If the motors can provide a force greater than $1 \times 0.5 \times 10 = 5N$, the car should move.
</p> 

<p> 
Motors convert electrical power to mechanical power. The cheap DC motors that I use, do so with around 50% efficiency. This means that we get $0.5 \times 10W = 5W$ of mechanical power to work with. 
</p> 

<p> 
Mechanical power is $F_{\text{motors}} \times v_{\text{car}}$. Since the motor turns at 150 rpm or about 15 rad/s, the velocity of the car is $v_{\text{car}} = \text{wheel radius} \times \omega = 2 cm \times 15 rad/s = 0.3 m/s$. Hence $F_{\text{motor}} = \frac{5W}{0.3m/s} \approx 17N$. This is more than $F_{\text{friction}}$ with some margin. The car should move. In practice, it does!
</p>
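<p>The same arithmetic in code form, so you can plug in your own motor and wheel numbers (all values are the rough estimates used above, not measurements):</p>

```python
from math import pi

# Friction the car must overcome: F = mu * M * g
mass_kg = 0.5
mu = 1.0                          # pessimistic coefficient of friction
g = 10.0                          # m/s^2
f_friction = mu * mass_kg * g     # 5 N

# Force the motors can deliver: P_mech = F * v
electrical_w = 10.0               # all 4 BO motors combined
mech_w = 0.5 * electrical_w       # ~50% efficient cheap DC motors -> 5 W
omega = 150 * 2 * pi / 60         # 150 rpm in rad/s
v = 0.02 * omega                  # 2 cm wheel radius -> ~0.3 m/s
f_motors = mech_w / v             # ~16 N

assert f_motors > f_friction      # the car should move
```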

<p> 
Even with these crude arguments, a few things become apparent. For example, your car shouldn't be too heavy. Perhaps this was already obvious. But some other things were not, at least to me. For example, the wheel radius is an important variable: it trades off the car's velocity against the motors' force. If your car doesn't move, try reducing the wheel radius. I learnt this the hard way when I bought monster wheels, only to bin them later when the car didn't move. 
</p>

<h1>Software</h1> 

<p>
I installed the software from the Donkeycar GitHub repository. It has a lot of useful features that make experiments easier. You can launch a web server on the Pi and control your car through a web browser. There is a convenient way to define parts of your car and the interfaces between them. It even saves data for you as you drive, which you can later use as training data for your neural networks.
</p>

<p>
Donkeycar also advises you to install Tensorflow or Pytorch on your Pi. These weren't easy to install: the pre-built python wheels didn't work for me. In the end, I abandoned them and instead installed the lightweight ONNX runtime for neural network inference. 
</p>

<p> 
Donkeycar also provides predefined templates for a 2-wheel differential drive using the L298N motor controllers. These specify which pins on the Pi control what. Differential drive lets you steer the car by controlling the speeds of the 2 motors. For example, to turn left, you would spin the left-side wheels slower than the right ones. Since my car has 4 wheels, I had to define my own template. You can find it in this <a href="https://github.com/autorope/donkeycar/pull/1019">pull request</a> for your reference. 
</p>
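<p>To make the idea concrete, here is a hypothetical mixing function (not Donkeycar's actual template code) that maps a throttle and steering command to per-side wheel speeds:</p>

```python
def differential_drive(throttle, steering):
    """Map throttle in [0, 1] and steering in [-1, 1] (negative = left)
    to (left, right) wheel speeds. Turning slows down the inner side."""
    left = throttle * (1 + min(steering, 0.0))    # slower left wheels turn us left
    right = throttle * (1 - max(steering, 0.0))   # slower right wheels turn us right
    return left, right
```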

<h1>Driving it around</h1> 

<p>
I enjoy driving the car around. Part of my motivation in building it was to show it off to other people. Such projects were quite common in my college, but they are rarer back at home. I kind of miss that environment now that I have graduated. It would be nice to have a community such as Donkeycar's where people got together to learn about this kind of stuff. People, regardless of age, will always find something like this very interesting. It can be an easy gateway to learning deeply about today's technological buzzwords such as IoT and Machine Learning. 
</p> 

<p> 
Sometimes when I'm driving the car around, people ask me how I built it, how much it cost, how it works and how it moves on its own. To the last question, I initially gave a vague reply: "Oh, it'll be tough to explain". One person pushed back: "You think I won't understand?". So I explained that I had recorded myself driving, and he immediately got it: "Oh, so it is copying your behaviour". This is quite accurate. Maybe I can explain it to people at different levels of detail. It is not always necessary to describe what neural networks are and what backpropagation is.
</p> 

<p> 
I am fascinated by how children interact with the car. They'll run in front of it or chase it. One toddler even tried to poke it, as if it were a dog. Sometimes, I let them drive it. It is a good way to stress-test the car. Since it is hard for them to control, they'll often drive it off the track, into a wall, or turn it upside-down. It makes them laugh, and so far my car is intact. 
</p>

<p align="center"> 
 <video width="400" height="400" controls="">
   <source src="/assets/movie.mp4" type="video/mp4" />
 </video>
</p>

<h1>Training Self-Driving Module</h1>

<p>
I collected training data using Donkeycar's web app by driving around a small track near my house. I visualized histograms of throttles and steering angles and found that I drove at a constant throttle. Hence I chose to only predict the steering angles through a neural network. 
</p>

<p>
The inputs to my neural network were the camera image and 5 previous steering commands. From this, the network predicted the next steering command. I augmented the training images by color jittering, adjusting sharpness and inverting them at random. 
</p>

<p>
I used the ImageNet-pretrained MobileNet-V2 network available in TorchVision as the convolutional branch in the network. I found that this model is at least twice as fast as ResNet-18 on the Pi. Finally, I held out a portion of the training data for validation and plotted network predictions along with ground truth steering commands. This is my <a href="https://colab.research.google.com/drive/1v24nXC2KOibLp5mg60A8y0_wthuUE04o?usp=sharing">Colab notebook</a> where I pondered over neural architectures and such.
</p> 

<p align="center">
  <img src="/assets/val.png" alt="val" width="400" />
<p />

<h1>Driving on its own</h1> 

<p> 
The trained neural network predicted steering controls almost perfectly on the validation set. But when I deployed it to my car, things didn't go so well. The car kept going off track. Even the camera stick broke, ending experiments for the day. 
</p>

<p>
I think the main problem was the difference in the training and deployment environments. I collected training data by driving the car at full throttle with Donkeycar backing up snapshots of $(\text{image}_t, \text{throttle}_t, \text{angle}_t)$ 20 times per second. On the other hand, the neural network could make steering predictions only 10 times per second. 
</p> 

<p> 
This was important because the network relied on previous predictions in order to make the next. In the training data, the car had moved by distance $d$ between subsequent frames, but when the model was deployed, the car moved by distance $2 \times d$ between subsequent neural network evaluations. By then, it would already be off track. 
</p>
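<p>The mismatch is easy to see with the numbers involved (20 Hz recording, 10 Hz inference, and the ~0.3 m/s top-speed estimate from earlier):</p>

```python
v_full = 0.3                 # m/s, estimated full-throttle speed
d_train = v_full / 20        # distance covered between recorded training frames
d_deploy = v_full / 10       # distance covered between network evaluations

# at full throttle, the deployed car out-runs its training distribution
assert abs(d_deploy - 2 * d_train) < 1e-12

# halving the throttle restores the spacing the network was trained on
assert abs((v_full / 2) / 10 - d_train) < 1e-12
```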

<p>
I had two options to fix this. The first was to speed up inference, possibly by quantizing the model. This wasn't straightforward because there is limited operator support for quantized models on ONNX. I could try <a href="https://pytorch.org/tutorials/intermediate/realtime_rpi.html">torch.jit</a> but it requires a 64 bit OS. My Raspberry Pi runs on the 32 bit OS since the PiCam library, which Donkeycar uses to interface with the camera, is deprecated in the 64 bit OS.
</p>

<p>
The second option was to simply run the car at half the speed. The deployed network would then observe new images after the car had covered the same distance as it had in the training data. This idea was not only easier to implement but, more importantly, successful. Below you can see the car moving on its own. It still crashes every now and then, but it's much better than before. 
</p>

<p align="center"> 
 <video width="400" height="400" controls="">
   <source src="/assets/drive.mp4" type="video/mp4" />
 </video>
</p>

<h1>Next Steps</h1>

<p>
There is a lot that excites me about this car. I want to explore its potential as a medium for education. It would be great if I could get even one other person in my area excited about this project and have them build their own car. 
</p>

<p> 
Right now, the car doesn't know anything about its own position and velocity. I would like to integrate sensors for odometry. With this, I could learn more about path planning and algorithms such as SLAM. I saw a <a href="https://www.youtube.com/watch?v=o9jEQZn6I6E">talk</a> by Joydeep Biswas and it would be fun to implement one of his papers on my car. In his talk, he envisions a fleet of autonomous vehicles. We could call such a fleet a Redundant Array of Inexpensive Cars. 
</p> 

<p> 
I can also try to improve inference on the Pi. The main bottleneck right now seems to be the dependency on the 32 bit OS. I should try to build a clone of PiCam compatible with the 64 bit OS. In this way I would also be giving something back to the Donkeycar project. 
</p>

<h1>Credits</h1> 

I depended on a lot of resources on the Internet at various moments. The biggest of them all is Donkeycar, which I have already mentioned. Apart from them, I used <a href="https://www.tomshardware.com/reviews/raspberry-pi-headless-setup-how-to,6028.html">Tom's Hardware</a> to figure out how to set up the Pi via SSH. I found out how to connect the Pi to a mobile hotspot from this <a href="https://medium.com/geekculture/how-to-connect-to-the-raspberry-pi-using-mobile-hotspot-2362a6b02efc">Medium Article</a>. I was able to build the circuit with help from a <a href="https://www.youtube.com/watch?v=lPyDtuzYE5s&amp;t=959s">YouTube Video</a> and this <a href="https://www.electronicshub.org/raspberry-pi-l298n-interface-tutorial-control-dc-motor-l298n-raspberry-pi/">Article</a>.
</video></p></p></p></p></video></p></p></p></p></p></p>]]></content><author><name></name></author><category term="robot" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Geometry 1</title><link href="/codeforces/2022/06/11/geometry.html" rel="alternate" type="text/html" title="Geometry 1" /><published>2022-06-11T11:18:11+00:00</published><updated>2022-06-11T11:18:11+00:00</updated><id>/codeforces/2022/06/11/geometry</id><content type="html" xml:base="/codeforces/2022/06/11/geometry.html"><![CDATA[<script type="text/x-mathjax-config">
    MathJax.Hub.Register.StartupHook("TeX Jax Ready",function () {
        MathJax.Hub.Insert(MathJax.InputJax.TeX.Definitions.macros,{
            cancel: ["Extension","cancel"],
            bcancel: ["Extension","cancel"],
            xcancel: ["Extension","cancel"],
            cancelto: ["Extension","cancel"]
        });
    });
	MathJax.Hub.Config({
		tex2jax: {
			inlineMath: [['$', '$']]
		}
	});
</script>

<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>

<p>Here I’ll share some Codeforces problems I solved by visualizing what happens on a 2D plane. The tricks I discuss here rely on <em>checking parity</em>, the <em>pigeonhole principle</em> and <em>Dilworth’s Lemma</em>.</p>

<h2 id="two-hundred-twenty-one"><a href="https://codeforces.com/contest/1562/problem/D1">Two Hundred Twenty One</a></h2>

<p>In this problem, we are given a sequence of $N$ numbers. Each number can be either $+1$ or $-1$.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>i  :  0  1  2  3  4  5  6  7  8  9  10  11  12 
a_i: +1 -1 +1 +1 +1 -1 +1 +1 -1 -1  +1  -1  +1 
</code></pre></div></div>

<p>For a range of the array  $i \in [l, r]$, there is an alternating sum:</p>

\[S_{lr} = \sum_{i = l}^r (-1)^{i - l} a_i\]

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>l  r   S_lr
-----------
0  0   1
0  3   2
1  3  -1
</code></pre></div></div>

<p>Given a range $[l, r]$, we have to find the minimum number of elements in range that we should remove so that the alternating sum becomes $0$. We are given a lot of such queries so we would like to compute the answer fast (in $O(1)$).</p>

<p>This is a very weird problem and, like many Codeforces problems, it is unlikely that we’ll come across it in the real world. Nevertheless, analysing it is quite fun. Earlier, I had a lot of trouble initiating this kind of analysis. I was a bit lost about what questions I should ask. Now I find it helpful to see what happens when <em>one</em> thing changes. In the present case, let’s see what happens when we remove one element from the sequence.</p>

<p><img src="/assets/ex-p1.png" alt="remove-one-element" /></p>

<p>In the image, $b_i$ takes into account the sign that $a_i$ was multiplied with while computing $S_{lr}$. Since $b_i$ is $\pm 1$, removing it changes the <em>parity</em> of the sum: if $S_{lr}$ is even initially, it becomes odd and vice versa. We can use this fact to lower-bound the number of elements we need to remove. Since $0$ is even, if $S_{lr}$ is odd, we need to remove at least $1$ element. On the other hand, if $S_{lr}$ is even and not $0$, then we need to remove at least $2$ elements.</p>

<p>In this problem, it turns out that we can always achieve these lower-bounds. To see this, let’s visualize the problem on a graph. Fix $l = 0$ and plot $S_{0r}$ for different $r$.</p>

<p><img src="/assets/ex-p2.png" alt="plot-1" /></p>

<p>The alternating sum of the entire array is $7$. If there was an index $j$ such that $S_{0j} = 3$, then we could just remove the element at $j + 1$ and obtain an alternating sum of $0$. Such a $j$ always exists. This is because $S_{0r}$ changes by $1$ at each $r$. You can think of it as a discrete version of the Intermediate Value Theorem from Calculus. In our example $j = 4$.</p>

<p><img src="/assets/ex-p3.png" alt="plot-2" /></p>

<p>After removing the red point, the green part will mirror about the x-axis, making the alternating sum $0$. Generalizing this observation, if $S_{lr}$ is odd, then we can always make the alternating sum $0$ by removing just $1$ element. We find the index $j$ such that $S_{lj} = \frac{S_{lr} - 1}{2}$ and remove the element at $j + 1$. If $S_{lr}$ is even, we remove any element to make the alternating sum odd. Then, we make the alternating sum $0$ using the trick above.</p>

<p><img src="/assets/ex-p4.png" alt="plot-3" /></p>

<p>In order to answer queries, simply compute the alternating sum in the given range. If the alternating sum is odd, the answer is $1$; if it is even and not $0$, the answer is $2$. The alternating sum can be computed for any range in $O(1)$ using prefix sums.</p>
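<p>Putting the pieces together, here is a short Python sketch of the prefix-sum bookkeeping (an illustration, not my actual submission):</p>

```python
def build_prefix(a):
    # pre[j] = sum of a[0..j-1], with the sign (-1)^i fixed by the global index i
    pre = [0]
    for i, x in enumerate(a):
        pre.append(pre[-1] + (x if i % 2 == 0 else -x))
    return pre

def min_removals(pre, l, r):
    """Minimum removals to make S_lr zero, for the 0-indexed inclusive range [l, r]."""
    s = pre[r + 1] - pre[l]
    if l % 2 == 1:
        s = -s          # flip so the first element of the range gets a + sign
    if s == 0:
        return 0
    return 1 if s % 2 == 1 else 2   # odd -> remove 1, even nonzero -> remove 2
```

<p>For the example array at the top of the post, this gives $1$ for the range $[0, 0]$ (sum $1$), $2$ for $[0, 3]$ (sum $2$) and $1$ for $[1, 3]$ (sum $-1$), matching the table.</p>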

<h2 id="training-session"><a href="https://codeforces.com/problemset/problem/1598/D">Training Session</a></h2>

<p>We are given $[n]$ and two arrays $A_1 \cdots A_n$ and $B_1 \cdots B_n$. All pairs $(A_i, B_i)$ are distinct. We have to find the number of ways we can select three distinct numbers $i, j, k \in [n]$ such that one of the following is true:</p>

<ul>
  <li>$A_i$, $A_j$, $A_k$ are distinct.</li>
  <li>$B_i$, $B_j$, $B_k$ are distinct.</li>
</ul>

<p>I wasn’t able to count these directly. Instead, I counted the triplets that <em>don’t</em> satisfy this condition (the <em>bad</em> triplets). The final answer was the difference between all $nC3$ possible triplets and the <em>bad</em> triplets.</p>

<p>Negating the conditions above, we find that <em>bad</em> triplets have to satisfy both of the following:</p>

<ul>
  <li>Any two of $A_i$, $A_j$ and $A_k$ are the same.</li>
  <li>Any two of $B_i$, $B_j$ and $B_k$ are the same.</li>
</ul>

<p>Again, plotting $(A_i, B_i)$ on a graph gave a way to count <em>bad</em> triplets. On a graph, you can immediately see that <em>bad</em> triplets form an <code class="language-plaintext highlighter-rouge">L</code> (e.g. points $1, 3, 4$ in the image below). This is because of the <a href="https://en.wikipedia.org/wiki/Pigeonhole_principle">Pigeonhole Principle</a>, which says that if there are more pigeons than holes, then some hole has more than one pigeon. It is a stupid principle, much like the pigeons themselves.</p>

<p>Anyway, to see why <em>bad</em> triplets always make an <code class="language-plaintext highlighter-rouge">L</code>, let $i_1, i_2$ be the indices for which $a_{i_1} = a_{i_2}$ and $j_1, j_2$ be the indices for which $b_{j_1} = b_{j_2}$. Since these are indices in a triplet (the holes), all of $i_1, i_2, j_1, j_2$ (the pigeons) can’t be distinct: $i_{k_1} = j_{k_2}$ for some $k_1, k_2 \in \{1, 2 \}$. In the image, the common index is $3$. Let’s refer to this common index as the <em>pivot</em>.</p>

<p><img src="/assets/ex-p5.png" alt="pivot" /></p>

<p>We’ll count the <em>bad</em> triplets by counting, for each index, the <em>bad</em> triplets in which that index is the <em>pivot</em>. There will be <em>bad</em> triplets with $1$ as the <em>pivot</em>, $2$ as the <em>pivot</em>, $3$ as the <em>pivot</em> and so on. All these subsets are disjoint and their union is the set of all <em>bad</em> triplets. We’ll get the sizes of these subsets individually and sum them up.</p>

<p>When an index $i$ is a <em>pivot</em>, one member of the triplet is horizontal to it and the other is vertical to it. If we count indices having the same $B$-value and $A$-value (<code class="language-plaintext highlighter-rouge">Bcnts[b]</code> and <code class="language-plaintext highlighter-rouge">Acnts[a]</code>), then the number of <em>bad</em> triplets with $i$ as the <em>pivot</em> is <code class="language-plaintext highlighter-rouge">(Bcnts[B[i]] - 1) * (Acnts[A[i]] - 1)</code>. We subtract one to avoid counting $i$ twice in the triplet. Referring to the image, you can see that the number of <em>bad</em> triplets in which $3$ is the <em>pivot</em> is $2 \times 3 = 6$.</p>

<p>By precomputing <code class="language-plaintext highlighter-rouge">Bcnts</code> and <code class="language-plaintext highlighter-rouge">Acnts</code>, we can count the <em>bad</em> triplets in $O(n)$ by iterating over all possible <em>pivots</em> and adding their contribution using the formula above. To solve the original problem, we subtract this number from $nC3$, the number of triplets in total.</p>
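<p>The whole count fits in a few lines of Python (a sketch, assuming 0-indexed arrays and the guarantee that all pairs $(A_i, B_i)$ are distinct):</p>

```python
from collections import Counter
from math import comb

def count_triplets(A, B):
    """Triplets i < j < k with all three A-values distinct or all three B-values distinct."""
    Acnts, Bcnts = Counter(A), Counter(B)
    # each bad triplet is counted exactly once, at its pivot index
    bad = sum((Acnts[a] - 1) * (Bcnts[b] - 1) for a, b in zip(A, B))
    return comb(len(A), 3) - bad
```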

<h2 id="manhattan-subarrays"><a href="https://codeforces.com/problemset/problem/1550/C">Manhattan Subarrays</a></h2>

<p>I like this problem because it is an interesting application of the <a href="https://ocw.mit.edu/courses/6-042j-mathematics-for-computer-science-fall-2010/efac321fdc8d0b27586ca35b04aab808_MIT6_042JF10_chap07.pdf">Dilworth’s Lemma</a>. We are given an array $a_1 \cdots a_n$. A triplet of <em>distinct</em> indices $i, j, k$ is considered <em>bad</em> if:</p>

\[\underbrace{|a_i - a_j| + |i - j|}_\textrm{manhattan distance between i, j} = \underbrace{|a_i - a_k| + |i - k|}_\textrm{manhattan distance between i, k} + \underbrace{|a_k - a_j| + |k - j|}_\textrm{manhattan distance between k, j}\]

<p>A contiguous subarray $a_l \cdots a_r$ is considered <em>good</em> if we can’t pick any <em>bad</em> triplet from it. We have to count the number of <em>good</em> subarrays.</p>

<p>I found the definition of a <em>bad</em> triplet quite convoluted, so I searched for simpler <em>but</em> equivalent definitions. I simplified the problem by asking the same question in one dimension. When does the following happen?</p>

\[|a_i - a_j| = |a_i - a_k| + |a_k - a_j|\]

<p>This happens <a href="https://en.wikipedia.org/wiki/If_and_only_if">if and only if</a> $a_k$ is between $a_i$ and $a_j$. The image below illustrates that when $a_k$ is between $a_i$ and $a_j$, the LHS and the RHS are equal. When it is not, as for $a_{k’}$, the segment $|a_i - a_{k’}|$ on its own is already longer than $|a_i - a_j|$. In this case:</p>

\[|a_i - a_j| &lt; |a_i - a_{k'}| + |a_{k'} - a_j|\]

<p><img src="/assets/ex-p6.png" alt="illustration" /></p>

<p>Back in our original two dimensional problem, we can infer that in a <em>bad</em> triplet, if $a_k$ is between $a_i$ and $a_j$, then $k$ is between $i$ and $j$: the equal $a$-terms cancel from both sides, leaving the same betweenness condition on the indices alone.</p>

\[\cancel{|a_i - a_j|} + |i - j| = \cancel{|a_i - a_k|} + |i - k| + \cancel{|a_k - a_j|} + |k - j|\]

\[\Rightarrow k\text{ is between }i\text{ and } j\]

<p>Likewise, if the triplet is <em>bad</em> and $k$ is between $i$ and $j$, then $a_k$ is between $a_i$ and $a_j$. On the other hand, if $k$ is not between $i$ and $j$ and $a_k$ is not between $a_i$ and $a_j$, then the triplet $i, j, k$ is not <em>bad</em>. Visualizing <em>bad</em> triplets on a graph, observe that their points form a monotonic sequence.</p>

<p><img src="/assets/ex-p7.png" alt="illustration" /></p>

<p>As you can imagine, <em>good</em> subarrays, i.e. those that don’t contain <em>bad</em> triplets, can’t be too long. If they are small enough, then we can simply enumerate all subarrays up to some size and count the <em>good</em> ones. We can obtain a bound on the size of <em>good</em> subarrays using partial orders and Dilworth’s Lemma.</p>

<p>Define a partial order as $(a_i, i) \leq (a_j, j)$ if $i \leq j$ and $a_i \leq a_j$. The image below shows what this partial order looks like. A directed arrow indicates that one point is $\leq$ the other. If $x \leq y$ and $y \leq z$ then $x \leq z$, so I haven’t bothered to draw the arrows from $x$ to $z$.</p>

<p><img src="/assets/ex-p8.png" alt="illustration" /></p>

<p>Notice that unrelated points, i.e. points where neither is $\leq$ the other, form a decreasing sequence. By Dilworth’s Lemma, in a subarray of length $N$, either there is an increasing subsequence of length greater than $\sqrt{N}$ or a decreasing subsequence of length greater than or equal to $\sqrt{N}$. If a subarray has length greater than $4$, then $\sqrt{N}$ exceeds $2$, so the subarray contains a monotone subsequence of length at least $3$, which is exactly a <em>bad</em> triplet.</p>

<p>Now, we can count <em>good</em> subarrays by enumerating all subarrays of size at most $4$ and checking whether they are good. The time complexity of doing so is $O(n)$.</p>
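<p>Here is a brute-force sketch of that count (all names are mine, not from a reference solution). Each element becomes the point $(i, a_i)$, and every triplet of every subarray of length at most $4$ is tested against the manhattan-distance condition:</p>

```python
from itertools import combinations

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def is_good(pts):
    # good iff in no triplet does one point lie on a manhattan
    # geodesic between the other two
    for t in combinations(range(len(pts)), 3):
        # try each of the three points as the middle point k
        for i, j, k in ((t[0], t[1], t[2]), (t[0], t[2], t[1]), (t[1], t[2], t[0])):
            if manhattan(pts[i], pts[j]) == manhattan(pts[i], pts[k]) + manhattan(pts[k], pts[j]):
                return False
    return True

def count_good_subarrays(a):
    total = 0
    for l in range(len(a)):
        # good subarrays have length at most 4, so longer ones are skipped
        for r in range(l, min(l + 4, len(a))):
            if is_good([(i, a[i]) for i in range(l, r + 1)]):
                total += 1
    return total
```

For $[1, 2, 3]$ only the full (monotone) subarray is <em>bad</em>, so the count is $5$.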

<h2 id="conclusion">Conclusion</h2>

<p>Visualizing problems obviously helps a lot. Also, the steps in the solutions above didn’t occur to me in the linear order in which they are presented; I trimmed a lot of tiny failure tracks while writing this. It just means that when you are solving problems, you are going to be led down the proverbial garden path. And that is ok!</p>

<p>I recommend reading the <a href="https://ocw.mit.edu/courses/6-042j-mathematics-for-computer-science-fall-2010/efac321fdc8d0b27586ca35b04aab808_MIT6_042JF10_chap07.pdf">post</a> that explains Dilworth’s Lemma. In fact, that MIT OCW course is quite good on the whole.</p>]]></content><author><name></name></author><category term="codeforces" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Binary Search 1</title><link href="/codeforces/2022/05/31/binary-search-v1.html" rel="alternate" type="text/html" title="Binary Search 1" /><published>2022-05-31T12:07:14+00:00</published><updated>2022-05-31T12:07:14+00:00</updated><id>/codeforces/2022/05/31/binary-search-v1</id><content type="html" xml:base="/codeforces/2022/05/31/binary-search-v1.html"><![CDATA[<script type="text/x-mathjax-config">
	MathJax.Hub.Config({
		tex2jax: {
			inlineMath: [['$', '$']]
		}
	});
</script>

<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>

<p>Binary Search applies to many problems on Codeforces. These problems can be framed as – find the largest $x \in [n]$ for which $f(x)$ is true. If $f$ has the monotonicity property:</p>

\[f(x) \Rightarrow f(x - 1)\]

<p>Then we can <em>binary search</em> for the largest $x$. This helps when evaluating $f$ on every $x$ is prohibitively expensive. When monotonicity holds, we can guess $O(\lg n)$ values of $x$ and evaluate $f$ on just those to find the largest.</p>
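<p>As a toy sketch of this template (the function name and predicate are mine, not from any problem), here is the search for the largest $x$ with $x^2 \leq 60$; the monotone $f$ is evaluated only $O(\lg n)$ times:</p>

```python
def largest_true(n, f):
    # largest x in 1..n with f(x) true, assuming f(x) implies f(x - 1)
    # and that f(1) is true
    l, r = 1, n
    while l < r:
        m = (l + r) // 2
        if f(m + 1):   # answer is at least m + 1
            l = m + 1
        else:          # answer is at most m
            r = m
    return l

print(largest_true(60, lambda x: x * x <= 60))  # 7, since 49 <= 60 < 64
```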

<h2 id="keshi-is-throwing-a-party"><a href="https://codeforces.com/contest/1610/problem/C">Keshi Is Throwing a Party</a></h2>

<p>This is an example where Binary Search yields a simple and efficient solution. In a nutshell, we are given $[n]$ and two arrays $a_1 \cdots a_n$, $b_1 \cdots b_n$. We have to find the largest set $S \subseteq [n]$ that satisfies – for all $i \in S$, there are at most $a_i$ elements in $S$ greater than $i$ and at most $b_i$ elements in $S$ smaller than $i$.</p>

<p>For example:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># assume for now that arrays are 1-indexed
</span><span class="n">n</span> <span class="o">=</span> <span class="mi">3</span>
<span class="n">a</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">]</span>
<span class="n">b</span> <span class="o">=</span> <span class="p">[</span><span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">]</span>
</code></pre></div></div>

<p>Then $S = \{1, 2\}$ is a <em>valid</em> subset.</p>

<p>The key to solving this problem is that the question – does a <em>valid</em> set of size $x$ exist? – has the monotonicity property. If a <em>valid</em> set of size $x$ exists, simply remove the last element to obtain a <em>valid</em> set of size $x - 1$.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># returns true if a valid set of size x exists
# def checker (x) : 
#   ...
</span><span class="n">l</span><span class="p">,</span> <span class="n">r</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span> <span class="n">n</span> 
<span class="k">while</span> <span class="n">l</span> <span class="o">&lt;</span> <span class="n">r</span> <span class="p">:</span> 
  <span class="n">m</span> <span class="o">=</span> <span class="p">(</span><span class="n">l</span> <span class="o">+</span> <span class="n">r</span><span class="p">)</span> <span class="o">//</span> <span class="mi">2</span>
  <span class="k">if</span> <span class="n">checker</span><span class="p">(</span><span class="n">m</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="p">:</span> 
    <span class="n">l</span> <span class="o">=</span> <span class="n">m</span> <span class="o">+</span> <span class="mi">1</span>
  <span class="k">else</span> <span class="p">:</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">m</span>

<span class="k">print</span><span class="p">(</span><span class="n">l</span><span class="p">)</span> 
</code></pre></div></div>

<p>Finally, we can write <code class="language-plaintext highlighter-rouge">checker</code> using a simple greedy strategy. We build the set $S$ by scanning through $1 \cdots n$ and greedily adding each $i$ that satisfies $|S| \leq b_i$ and $x - (|S| + 1) \leq a_i$. If at the end of the scan $|S| &lt; x$, we report that a <em>valid</em> set of size $x$ doesn’t exist.</p>
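<p>A sketch of this greedy <code>checker</code> in Python (0-indexed, with <code>a</code>, <code>b</code> and <code>x</code> passed explicitly, unlike the 1-indexed statement):</p>

```python
def checker(x, a, b):
    """Return True if a valid set of size x exists, greedily grown
    left to right; `size` is |S| so far."""
    size = 0
    for i in range(len(a)):
        # adding i gives it `size` smaller members, and at most
        # x - (size + 1) larger members can still follow it
        if size <= b[i] and x - (size + 1) <= a[i]:
            size += 1
            if size == x:
                return True
    return False
```

On the example above (<code>a = [1, 2, 1]</code>, <code>b = [2, 1, 1]</code>), <code>checker(2, a, b)</code> succeeds while <code>checker(3, a, b)</code> fails.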

<p>Why is <code class="language-plaintext highlighter-rouge">checker</code> correct? If <code class="language-plaintext highlighter-rouge">checker</code> is incorrect, that means that for some $x$, there is a <em>valid</em> $S$ of size $x$ but <code class="language-plaintext highlighter-rouge">checker</code> only finds $S’$ with $|S’| = k &lt; x$.</p>

\[S = \{i_1, \cdots, i_x\}\]

\[S' = \{j_1, \cdots, j_k\}\]

<p>Let $l$ be the smallest index where $S$ and $S’$ differ. There has to be some $l \in [k]$ for which $i_l \neq j_l$ because otherwise <code class="language-plaintext highlighter-rouge">checker</code> would have extended $S’$. Since <code class="language-plaintext highlighter-rouge">checker</code> picks the earliest item to add to the set: $j_l &lt; i_l$. Both $j_l$ and $i_l$ satisfy the <em>validity</em> condition.</p>

\[l - 1 \leq b[j_l]\]

\[l - 1 \leq b[i_l]\]

\[x - l \leq a[j_l]\]

\[x - l \leq a[i_l]\]

<p>Hence we can replace $i_l \in S$ by $j_l$. The new set $S^{(1)} = S - \{i_l\} + \{j_l\}$ is still <em>valid</em>. Repeating this process gives a <em>valid</em> set of size $x$ that agrees with $S’$ on every index in $[k]$, so no $l \in [k]$ exists where the two sets differ. But such an $l$ <em>has</em> to exist, and so we reach a contradiction, proving that <code class="language-plaintext highlighter-rouge">checker</code> is correct.</p>

<p>I’ll end the story here. <a href="https://codeforces.com/contest/1610/submission/136793593">Here</a> is my submission on Codeforces. In summary, under certain conditions, Binary Search allows us to convert an optimization problem (<em>find the largest x for which …</em>) into a decision problem (<em>does there exist an x for which …</em>). When applicable, Binary Search introduces only a small multiplicative $O(\lg n)$ overhead over the underlying decision problem.</p>

<p>I’m slightly disappointed that I haven’t been able to find an efficient dynamic-programming solution to this problem. If anyone knows of one, please tell me!</p>]]></content><author><name></name></author><category term="codeforces" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Why I do Codeforces</title><link href="/codeforces/2022/05/30/competitive-programming.html" rel="alternate" type="text/html" title="Why I do Codeforces" /><published>2022-05-30T06:10:14+00:00</published><updated>2022-05-30T06:10:14+00:00</updated><id>/codeforces/2022/05/30/competitive-programming</id><content type="html" xml:base="/codeforces/2022/05/30/competitive-programming.html"><![CDATA[<p>Recently, I spent a lot of time solving problems on Codeforces. These problems are similar to those I had encountered in undergraduate classes, such as Discrete Maths, Automata Theory and Algorithms; classes I didn’t do particularly well in.</p>

<p>Often, I was quite nervous before their exams. I didn’t back myself to be able to solve the problems. I guess I simply didn’t understand how problem-solving worked. I expected that either a solution would just pop into my head or it never would: I would just stick with the first idea and fruitlessly try to knead it into a solution. I didn’t play around with examples, create smaller subproblems, consider different hypotheses etc. On Codeforces, I could practise these strategies without consequences (such as grades).</p>

<p><img src="/assets/cf-activity.png" alt="My activity on CF" /></p>

<p>Having solved roughly 400 problems, I am more comfortable with the problem-solving process. Even if I don’t end up solving a problem, at least I give myself a fair chance. I try out examples, create hypotheses, either prove hypotheses or try to construct simple counter-examples. Sometimes, different hypotheses combine into a solution (as in square-root decomposition). If nothing works, I write a program to generate small examples and see if a pattern arises. If still nothing works, I go and smash the day’s Wordle and feel happy about that.</p>

<p>Now, instead of solving more problems, I want to take a step back and document some interesting problems and problem-solving techniques. This is slightly redundant work because problems are reviewed at length on Codeforces itself. Even so, I think that writing about them will improve my skills.</p>]]></content><author><name></name></author><category term="codeforces" /><summary type="html"><![CDATA[Recently, I spent a lot of time solving problems on Codeforces. These problems are similar to those I had encountered in undergraduate classes, such as Discrete Maths, Automata Theory and Algorithms; classes I didn’t do particularly well in.]]></summary></entry></feed>