The TWIML AI Podcast with Sam Charrington
Overcoming Oscillations in Quantization-Aware Training, Variational On-the-Fly Personalization, and CITRIS: Causal Identifiability from Temporal Intervened Sequences.
Guest: Arash Behboodi
Forecasting from LiDAR via Future Object Detection, which proposes an end-to-end approach for detection and motion forecasting based on raw sensor measurements as opposed to ground-truth tracks. Finally, we discuss Aljosa’s third and final paper, Opening up Open-World Tracking, which proposes a new benchmark to analyze existing efforts in multi-object tracking and constructs a baseline for these tasks.
Guest: Aljosa Osep
Unsupervised Domain Generalization by Learning a Bridge Across Domains.
Guest: Kate Saenko
Imposing Consistency for Optical Flow Estimation, a paper that introduces novel and effective consistency strategies for optical flow estimation. The final paper we discuss is IRISformer: Dense Vision Transformers for Single-Image Inverse Rendering in Indoor Scenes, which proposes a transformer architecture to simultaneously estimate depths, normals, spatially-varying albedo, roughness, and lighting from a single image of an indoor scene. For each paper, we explore the motivations and challenges and get concrete examples to demonstrate each problem and solution presented.
Guest: Fatih Porikli
https://twimlai.com/podcast/twimlai/series/data-centric-ai.
Guest: D. Sculley
World Models and Attention for Reinforcement Learning, and The Sensory Neuron as a Transformer: Permutation-Invariant Neural Networks for Reinforcement Learning.
This interview is Nerd Alert certified, so get your notes ready!
PS. David is one of our favorite follows on Twitter (@hardmaru), so check him out and share your thoughts on this interview and his work!
Guest: David Ha
WiCluster: Passive Indoor 2D/3D Positioning using WiFi without Precise Labels, explores the use of RF signals to infer what an environment looks like, enabling the estimation of a person's movement.
We also discuss how machine learning and AI can help enable 5G and make it more efficient for these applications, the scenarios in which ML would allow for more effective delivery of connected services, and what might be possible in the near future.
Guest: Joseph Soriaga
conversations with Ville, we explored his experience building and deploying the open-source framework, Metaflow, while working at Netflix. Since our last chat, Ville has embarked on a few new journeys, including writing the upcoming book Effective Data Science Infrastructure, and commercializing Metaflow, both of which we dig into quite a bit in this conversation.
We reintroduce the problem that Metaflow was built to solve and discuss some of the unique use cases that Ville has seen since its release, the relationship between Metaflow and Kubernetes, and the maturity of services like batch and lambdas that allows a complete production ML system to be delivered. Finally, we discuss the degree to which Ville is focusing Outerbounds' efforts on building tools for the MLOps community, and what the future looks like for him and Metaflow.
Guest: Ville Tuulos
Jesse Engel.
Guest: Alexander Richard
Milind Tambe, as well as a project focused on using ML techniques to assist in the identification of people in need of housing resources, and ensuring that they get the best interventions possible.
If you enjoyed this conversation, I encourage you to check out our conversation with Milind Tambe from last year's TWIMLfest on Why AI Innovation and Social Impact Go Hand in Hand.
Guest: Eric Rice
which you can catch the videos for here).
Guests: Chris Fregly, Antje Barth
In our conversation with Jabran, we explore his team's recent endeavor into the complete mapping of which T-cells bind to which antigens through the Antigen Map Project. We discuss how Jabran's background in astrophysics and cosmology has translated to his current work in immunology and biology, the origins of the antigen map, a walkthrough of the biological aspects of the project, and how its focus was changed by the emergence of the coronavirus pandemic.
We talk through the biological advancements, the challenges of using machine learning in this setting, some of the more advanced ML techniques that they've tried that have not panned out (as of yet), the path forward for the antigen map to make a broader impact, and much more.
Guest: Jabran Zahid
Advancing Your Data Science Career During the Pandemic panel, where she shared her experience trying to navigate the suddenly hectic data science job market. Now, a year removed from that panel, we explore her book on data science careers, top insights for folks just getting into the field, ways that job seekers should be signaling that they have the required background, and how to approach and navigate failure as a data scientist.
We also spend quite a bit of time discussing Dask, an open-source library for parallel computing in Python, as well as use cases for the tool, the relationship between Dask and Kubernetes/Docker containers, where data scientists stand with regard to the software development toolchain, and much more!
Guest: Jacqueline Nolis
Today we're joined by Melanie Mitchell, Davis Professor at the Santa Fe Institute and author of Artificial Intelligence: A Guide for Thinking Humans. While Melanie has had a long career with a myriad of research interests, we focus on a few: complex systems, the understanding of intelligence and complexity, and her recent work on getting AI systems to make analogies. We explore examples of social learning, how it applies to AI contextually, and how to define intelligence.
We discuss potential frameworks that would help machines understand analogies, established benchmarks for analogy, and if there is a social learning solution to help machines figure out analogy. Finally, we talk through the overall state of AI systems, the progress we've made amid the limited concept of social learning, if we're able to achieve intelligence with current approaches to AI, and much more!
Guest: Melanie Mitchell
@samcharrington or @twimlai.
To follow along with the 2020 AI Rewind Series, head over to the series page.
Guest: Michael Bronstein
Guest: Sameer Singh
Guest: Pavan Turaga
Guest: Pablo Samuel Castro
here!
Guests: Sina Bahram, Cynthia Bennet, Chancey Fleet, Venkatesh Potluri, Meredith Ringel Morris
here.
Guest: Jeremy Howard
Visualizing The Consequences Of Climate Change Using Cycle-consistent Adversarial Networks,' and we're excited to pick her brain about the ways ML is currently being leveraged to help the environment. In our conversation, we explore the use of GANs to visualize the consequences of climate change, the evolution of different approaches she used, and the challenges of training GANs using an end-to-end pipeline.
Finally, we talk through Sasha's goals for the aforementioned panel, which is scheduled for Friday, October 23rd at 1 pm PT. Register for all of the great TWIMLfest sessions here!
Guest: Sasha Luccioni
The IBM Data Science Community site, which has over 10,000 members, provides a place for data scientists to collaborate, share knowledge, and support one another. It's also a great place to connect with other data scientists and to find information and resources to support your career.
Join and get a free month of select IBM Programs on Coursera.
Guests: Chris Nuernberger, Huda Nassar, Burak Kanber, Catherine Nelson, Gabriela de Queiroz, Avi Bryant, Chris Lattner
YouTube channel!
Guest: Rumman Chowdhury
wandb.com/twiml.
Guest: Lukas Biewald
Twitter for updates.
Big shout out to IBM for their support in helping to make this panel possible! IBM continues to support major initiatives -- applying data, knowledge, computing power and insights to solve the challenging problems presented by the coronavirus. Some of these initiatives include their work with the High-Performance Computing Consortium, providing detailed virus tracking information on the Weather Channel, and offering free access to Watson Assistant for COVID-19 related applications. Click here to find out more about IBM’s response.
Guests: Rex Douglass, Robert Munro, Lea Shanley, Gigi Yuen-Reed
The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning," which identifies three formal definitions of fairness in algorithms, the statistical limitations of each, and details how mathematical formalizations of fairness could be introduced into algorithms.
Guest: Sharad Goel
here.
Guest: Beidi Chen
Ultrasound Nerve Segmentation competition.
His secret sauce was a ground-up implementation of the U-net architecture (an encoder-decoder network), which hadn't been used in a Kaggle competition. To "preserve the localization information from the original images...I implemented [the U-net architecture]. I trained the network, and to my surprise, it worked."
He was not the only one to use a U-net in this competition, but he thinks his competitive edge came from being one of the only ones to learn how to build it from scratch. "If you did something different or you implemented your own and improved it a little bit, maybe more than what everybody else was using, you had a chance of doing a little better…"
David's solution for this competition also benefited from a lot of experimenting with data augmentation strategies. "When you train segmentation networks to avoid overfitting, you have to augment the image and the mask together and you have to find the right kind of augmentation strategies...so I hacked those and created a good augmentation strategy [and] framework."
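The joint image-and-mask augmentation David describes can be sketched in a few lines. The specific transforms below (horizontal flips and 90-degree rotations) are illustrative assumptions, not his exact strategy; the point is that image and mask receive the same random transform:

```python
import numpy as np

def augment_pair(image, mask, rng):
    """Apply the SAME random transform to an image and its segmentation mask.

    Augmenting the pair jointly keeps the mask aligned with the image, which
    lets a segmentation network see new examples without corrupting labels.
    """
    if rng.random() < 0.5:                      # random horizontal flip
        image, mask = image[:, ::-1], mask[:, ::-1]
    k = int(rng.integers(0, 4))                 # random 0/90/180/270 rotation
    return np.rot90(image, k), np.rot90(mask, k)

rng = np.random.default_rng(0)
img = np.arange(16).reshape(4, 4)
msk = (img > 7).astype(np.uint8)
aug_img, aug_msk = augment_pair(img, msk, rng)
# Because both were transformed together, the mask still labels the same pixels:
assert np.array_equal(aug_msk, (aug_img > 7).astype(np.uint8))
```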
David ultimately finished second out of 950 teams in the competition that year. Not bad for his second effort!
ProVision Body Scanners and Architectures for 3-Dimensional Data
Encouraged, David continued to participate in Kaggle. Following his early success was a competition, sponsored by the Department of Homeland Security, focused on classifying images from airport body scanners (a.k.a. Nude-O-Scopes, as Sam calls them). The goal was to create new algorithms that could more accurately predict threats and detect prohibited items during the screening process. It was the largest Kaggle competition both in terms of prize money ($1.5 million) and in terms of the size of the data set being used.
The Passenger Screening Algorithm Challenge was particularly interesting to David in its use of three-dimensional data. There were no existing best practices for how to build architectures that could process 3D data without downsizing or downsampling. Three-dimensional images require much more memory and storage to process than 2D images, but also create new opportunities. The third dimension provides a "third axis where you also can correlate features across multiple two-dimensional images because the volume is essentially a stack of two-dimensional images."
David was still in the middle of his PhD research and had already been thinking about three-dimensional data for CT or MRI images. His entry for the competition would apply the same architecture that he had been using to try to detect Parkinson's disease from brain scans.
His method involved dimensionality reduction by combining a 2D convolutional neural network (CNN) with a Long Short-Term Memory (LSTM) architecture that models sequences of data. Essentially, the CNN learned two-dimensional vectors from each of the images, and fed them into the LSTM, which could take advantage of the relationships between the frames. This allowed the team to avoid reducing the resolution of the input images.
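Structurally, the approach can be sketched as a per-slice encoder feeding a sequence model. The toy functions below are stand-ins (a real implementation would use trained CNN and LSTM layers), but they show how the volume is treated as an ordered stack of 2D slices:

```python
import numpy as np

def encode_slice(slice2d):
    """Stand-in for the 2D CNN: compress one slice into a feature vector."""
    return np.array([slice2d.mean(), slice2d.std(), slice2d.max()])

def aggregate_sequence(features, decay=0.5):
    """Stand-in for the LSTM: fold the ordered per-slice features into a
    single state, exploiting correlations between neighbouring slices."""
    state = np.zeros_like(features[0])
    for f in features:                    # process slices in stacking order
        state = decay * state + (1 - decay) * f
    return state

volume = np.random.default_rng(1).random((32, 64, 64))   # 32 stacked 2D slices
per_slice = [encode_slice(s) for s in volume]            # "CNN" stage
summary = aggregate_sequence(per_slice)                  # "LSTM" stage
print(summary.shape)   # (3,) -- one compact vector for the whole 3D volume
```

Because each slice is encoded in 2D before the sequence stage, the full-resolution volume never has to fit through a 3D network at once.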
Sam posits that "A lot of winning the competition is being on the winning side of information asymmetry." It can be hard to gauge where you might stand in the competition to make sure your efforts are worth it. David tries to plan ahead by recognizing signs of promise from the beginning, which for him means placing in the top 30 on the leaderboard as a result of his initial efforts in a problem.
Distracted Drivers and Data Augmentation
Building on challenges with processing image data, another Kaggle competition David participated in was the State Farm Distracted Driver Detection challenge. The problem was to identify distracted drivers by reviewing images to determine whether the driver was doing things like playing with the radio, using the phone or applying makeup.
The team's unique approach was to implement a creative data augmentation technique to train the model. The technique involved taking, for example, two images of a driver playing with the radio. They would then vertically or horizontally combine 75% of one image with 25% of the other to get an additional image. Because both source images show the same behavior, the spliced image still depicts a distracted driver and carries the same label.
Combining images in this way was a solution to avoid overfitting. The data they were dealing with had only a few examples of the distracted driving behavior they were trying to identify in the training set, causing their neural networks to tend to overfit. (That is, they memorized the few examples they found in the training set and struggled to generalize to an unseen validation set.) By combining the images, they both created additional training examples and broke the network's tendency to rely on spurious patterns in identifying examples of distracted driving.
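A minimal sketch of this splicing augmentation (the 75/25 split follows the description above; the images, shapes, and axis choice are illustrative):

```python
import numpy as np

def splice(img_a, img_b, frac=0.75, axis=0):
    """Join the first `frac` of img_a with the remainder of img_b along one
    axis. Both inputs show the same behavior (e.g. playing with the radio),
    so the spliced image keeps that label while looking new to the network."""
    cut = int(img_a.shape[axis] * frac)
    idx_a = [slice(None)] * img_a.ndim
    idx_b = [slice(None)] * img_b.ndim
    idx_a[axis] = slice(0, cut)
    idx_b[axis] = slice(cut, None)
    return np.concatenate([img_a[tuple(idx_a)], img_b[tuple(idx_b)]], axis=axis)

a = np.zeros((8, 8), dtype=np.uint8)     # image 1 of the behavior
b = np.ones((8, 8), dtype=np.uint8)      # image 2 of the same behavior
mix = splice(a, b, frac=0.75, axis=0)    # vertical splice: 6 rows of a, 2 of b
print(mix.shape)   # (8, 8) -- a brand-new training example with the same label
```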
Otherwise, though, they used an off-the-shelf model architecture, demonstrating that it's not always a unique architecture that wins the competition. According to David, you don't necessarily need a massive ensemble of models to win Kaggle competitions either:
"If you focus on one model, you can almost do as well as a massive ensemble, but oftentimes the ensemble is the easy way out. But the ensembles, there's a cost associated with that, at least for computer vision, in terms of GPU time. If you have infinite compute resources, you might be able to get away with ensembling, but oftentimes you have to weigh the cost of training many models with focusing on one and trying to get it as good as possible."
"These are some of my secrets."
David has a few tricks to share that apply to everyone, beginners and experts alike:
Keep it Simple. "The key...is that these solutions are usually simple...there's this idea that starting Kaggle [is] hard…I feel like a lot of challenges you just have to look at it with a creative approach and just opening your mind that the solution is simple."
Persistence. David also emphasizes that "Kaggle can be discouraging." But you have to believe you can do well and give it a shot regardless, even if you don't do well initially.
Reading Top Solutions. Another trick is to read the approaches from Kaggle winners so you can compare their solutions with your own to learn from what you could have improved.
Additional Tips
In addition to the tips David emphasized above, here are a few additional suggestions we gleaned from the interview:
Kaggle Discussion Forums. Digging into forums to see what other people are doing is a great way to learn what angles and perspectives others are using that might help you approach the challenge.
Teaming Up. In most of his competitions, David has teamed up with others who all bring unique perspectives and help with the challenges.
Kernels. Kaggle is collaborative and kernels might be a great place to get you started, but it's also a competition and as David puts it simply, "if you do what everybody else does, you're not going to win."
If you're interested in joining Kaggle, or want to be part of a supportive community of folks working on Kaggle projects together, check out our Kaggle study group! The group hosts virtual meetups every week. Learn more at http://twimlai.com/program/kaggle-team/.
Guest: David Odaibo
Bruno Gonçalves.
"The idea is essentially to look under the microscope of how science works, meaning for example, how it evolves over time, how collaboration occurs between different scientists, in between different fields. How scientists pick their research problems, how they, for example, move across different institutions, how nations develop expertise in different fields of research and so on."
In addition to predicting the trajectory of physics research, Matteo is also active in the computational epidemiology field. His work in that area involves building simulators that can model the spread of diseases like Zika or the seasonal flu at a global scale.
Science of Science
Matteo's background in economics and his interest in human behavior sparked his desire to explore the "science of science." Physics was the natural starting point since he already worked with many individuals in the field.
To build his models, Matteo uses a core data set of papers published in the journals of The American Physical Society. This dataset was chosen in part because of the robustness of its classification scheme, the Physics and Astronomy Classification Scheme (PACS), which provides references to affiliated topics, authors and publications for each of the papers in the archive. PACS also provides a consistent set of keywords for each of the papers.
These keywords are used to relate the various physics researchers to one another using an embedding model. In Matteo's case, the model they use is StarSpace, developed by Facebook AI Research.
As Matteo puts it, "We are treating each author as a bag of topics, a bag of research fields in which that author has worked. Then we use this bag of topics to infer the embeddings for each specific research sub-area."
Having created an embedding that relates the various research topics to one another, Matteo and his co-authors then use it to create what they call the Research Space Network (RSN). The RSN is a "mapping of the research space [created] by essentially looking at the expertise of authors to guide us on what it means for two topics to be similar to each other."
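As an illustration of the bag-of-topics idea, here is a toy stand-in for the StarSpace model, with made-up topics and hand-picked two-dimensional vectors (the real embeddings are learned from the PACS keywords):

```python
import numpy as np

# Hypothetical topic embeddings -- hand-picked 2-D vectors for illustration.
topics = {"quantum": np.array([1.0, 0.0]),
          "condensed": np.array([0.8, 0.6]),
          "astro": np.array([0.0, 1.0])}

def author_vector(bag):
    """An author is represented as the average of the topics they've worked in."""
    return np.mean([topics[t] for t in bag], axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

alice = author_vector(["quantum", "condensed"])
bob = author_vector(["quantum"])
carol = author_vector(["astro"])
# Authors with overlapping bags of topics end up close in the space:
print(cosine(alice, bob) > cosine(carol, bob))   # True
```

Topic-to-topic similarities derived this way are what the Research Space Network is built from.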
Principle of Relatedness
One of the main findings from the research so far is what Matteo refers to as a "fingerprint" of the scientific production of cities. The work is based on the idea of The Principle of Relatedness, an economics term that aims to measure the relationship between a nation's overall production, exports, expertise, and trade partners to predict what items the country should export next.
In applying this idea to their research, Matteo would look at all of the scientific publications from a city and use the embedding space to measure the level of relatedness, and predict the direction of the city's scientific knowledge. You can use a network to visually show the interactions between different vectors (science topics) and rank the probability that a city will enter a specific field. That ranking becomes your "classifier" and allows you to determine where that field will or will not be developed next.
If you were to plot out the topics of existing research in a city, you could see where the "knowledge density" collects, and note where the density is high, to predict the trajectory of research. If a country is in an intermediate stage of development, there's a higher chance of "jumping" to a different space.
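The knowledge-density idea above can be sketched as ranking candidate topics by their average similarity to a city's existing research (the topic vectors below are hypothetical, standing in for the learned Research Space Network):

```python
import numpy as np

# Hypothetical topic vectors standing in for the learned embedding space.
emb = {"optics": np.array([1.0, 0.1]),
       "photonics": np.array([0.9, 0.2]),
       "astro": np.array([0.0, 1.0]),
       "cosmology": np.array([0.1, 0.9])}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def knowledge_density(candidate, active):
    """Average similarity of a candidate topic to the topics a city is
    already active in; high density suggests a likely next entry."""
    return float(np.mean([cosine(emb[candidate], emb[t]) for t in active]))

city_active = ["optics"]                      # the city already works in optics
candidates = ["photonics", "cosmology"]
ranked = sorted(candidates, key=lambda t: knowledge_density(t, city_active),
                reverse=True)
print(ranked[0])   # photonics -- closest to the city's existing expertise
```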
Focus and Limitations
The focus, for now, is to find the best way of creating embeddings for a very specific problem, not for a variety of tasks.
For example, there is no weighting of a researcher's volume of work or its relative importance--the associations include anything they've been active in. Likewise, for some analyses, you might want to identify where the scientist is most active and remove any side projects or abandoned subjects.
None of these are considered in this paper. Rather, Matteo approaches the problem from the simplest possible scenario, effectively asking "What if we are blind?"
"We...get a big pile of papers from an author. We just list all the topics in which he has worked on and train on that."
They want to prove that you do not need to perform manual checks and optimizations to get useful results.
Performance Metrics
Matteo tested the results using a couple of different validations:
One approach was to visualize the RSN and regional fingerprints for assessment. This made it easy to see the macro areas where the PACS classification distinguishes the different subfields of physics. This hierarchy was not used at training time and the algorithm was able to determine the right classification.
The second method was to measure the predictive power of the algorithm by looking at each city at a given time period and listing the topics where they had a competitive advantage. Then they compared them using a standard metric like an ROC curve to see if the model was performing better than a random model.
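The ROC comparison can be reproduced with the rank-sum identity for AUC. This is a small self-contained version, since the actual evaluation pipeline isn't described in detail here:

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve via the rank-sum identity: the probability
    that a randomly chosen positive outscores a randomly chosen negative."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = ((pos[:, None] > neg[None, :]).sum()
            + 0.5 * (pos[:, None] == neg[None, :]).sum())
    return wins / (len(pos) * len(neg))

labels = np.array([1, 1, 0, 0, 1, 0])               # did the city enter the topic?
scores = np.array([0.9, 0.8, 0.3, 0.2, 0.7, 0.4])   # predicted relatedness
print(auc(scores, labels))   # 1.0 here; a random model would hover around 0.5
```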
What's Next?
While the goal is to eventually expand and apply these techniques to entire papers (vs just the PACS keywords), having a predetermined taxonomy and hierarchical structure laid out gives them a benchmark to validate their own observations.
Scaling this approach to other fields is something they are starting to work on. They've made some progress using the Microsoft Academic Graph, which includes all the different fields of science. As of now, they can't replicate the results they get when they apply the algorithm to physics, but the embedding space could be evolved to track things like the semantics of a term over time, or how authors tend to move through this space. There's also the possibility of finding gaps in the science and making connections that the field might not know to make.
Guest: Matteo Chinazzi
Online Classification with Complex Metrics on making models that optimize complex, non-decomposable metrics. (Non-decomposable here means you can't write the metric as an average, which would allow you to apply existing tools like gradient descent.)
Scaling up to More Complex Measures
To generalize this idea beyond simple binary classifiers, we have to think about the confusion matrix, which is a key statistical tool used in assessing classifiers. The confusion matrix measures the distribution of predictions that a classifier makes given an input with a certain label.
Sanmi's research provided guidance for building models that optimized arbitrary metrics based on the confusion matrix.
"Initially we work[ed out] linear weighted combinations. Eventually, we got to ratios of linear things, which captures things like F-measure. Now we're at the point where we can pretty much do any function of the confusion matrix."
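Sanmi's progression, from linear combinations of confusion-matrix entries to ratios of linear terms like F-measure, is easy to see on a binary confusion matrix. The example below is illustrative:

```python
import numpy as np

def confusion(y_true, y_pred):
    """Binary confusion-matrix entries: TP, FP, FN, TN."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    return tp, fp, fn, tn

y_true = np.array([1, 1, 1, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0])
tp, fp, fn, tn = confusion(y_true, y_pred)

accuracy = (tp + tn) / (tp + fp + fn + tn)   # a linear combination of entries
f1 = 2 * tp / (2 * tp + fp + fn)             # a ratio of linear terms (F-measure)
print(round(accuracy, 3), round(f1, 3))
```

Both metrics are functions of the same four entries; they differ only in the shape of that function, which is exactly the axis along which Sanmi's framework generalizes.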
Domain Experts and Metric Elicitation
Having developed a framework for optimizing classifiers against complex performance metrics, the next question Sanmi asked (because it was the next question asked of him), is which one should you choose for a particular problem? This is where metric elicitation comes in.
The idea is to flip the question around and try to determine good metrics for a particular problem by interacting with experts or users to determine which of the metrics we can now optimize for best approximate how the experts are making trade-offs against various types of predictions or classification errors.
For example, a doctor understands the costs associated with diagnosing or misdiagnosing someone with a disease. The trade-off factors could include treatment prices or side effects--factors that can be compressed to the pros/cons of predicting a diagnosis or not. Building a trade-off function for these decisions is difficult. Metric elicitation allows us to identify the preferences of doctors through a series of interactions with them, and to identify the trade-offs that correspond to those preferences. Once we know these trade-offs, we can build a metric that captures them, which allows you to optimize for those preferences directly in your models using the techniques Sanmi developed earlier.
In research developed with Gaurush Hiranandani and other colleagues at the University of Illinois, Performance Metric Elicitation from Pairwise Classifier Comparisons proposes a system of asking experts to rank pairs of preferences, kind of like an eye exam for machine learning metrics.
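The "eye exam" analogy can be sketched as a binary search driven by pairwise answers. The oracle below is a hypothetical expert with a hidden trade-off weight, not the paper's actual elicitation procedure:

```python
def elicit_tradeoff(prefers, lo=0.0, hi=1.0, steps=20):
    """Binary-search an expert's trade-off weight from pairwise answers.

    `prefers(a, b)` is an oracle answering "do you prefer weight a over b?"
    -- the pairwise 'eye exam' question; each answer halves the interval.
    """
    for _ in range(steps):
        mid = (lo + hi) / 2
        a, b = (lo + mid) / 2, (mid + hi) / 2
        if prefers(a, b):
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

# A hypothetical expert whose true (hidden) trade-off weight is 0.3:
true_w = 0.3
oracle = lambda a, b: abs(a - true_w) < abs(b - true_w)
print(round(elicit_tradeoff(oracle), 3))   # 0.3
```

Twenty pairwise questions pin the weight down to within about one part in a million, which is why simple comparisons are enough to recover a usable metric.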
Metric Elicitation and Inverse Reinforcement Learning
Sanmi notes that learning metrics in this manner is similar to inverse reinforcement learning, where reward functions are being learned, often by interaction with humans. However, the fields differ in that RL is more focused on replicating behavior rather than getting the reward function correct. Metric elicitation, on the other hand, is focused on replicating the same decision-making reward function as the human expert. Matching the model's reward function, as opposed to the model's behavior, has the benefit of greater generalizability, which allows metrics that are agnostic to data distribution or the specific learner you're using.
Sanmi mentions another interesting area of application around fairness and bias, where you have different measures of fairness that correspond to different notions of trade-offs. Upcoming research is focused on finding "elicitation procedures that build context-specific notions of metrics or statistics" that should be normalized across groups to reach a fairness goal in a specific setting.
Robust Distributed Learning
This interview also covers Sanmi's research into robust distributed learning, which aims to harden distributed machine learning systems against adversarial attacks.
Be sure to check out the full interview for the interesting discussion Sam and Sanmi had on both metric elicitation and robust distributed learning. The latter discussion starts about 33 minutes into the interview.
Guest: Sanmi Koyejo
Recent Advances in Algorithmic High-Dimensional Robust Statistics. The survey covers about 100 papers, exploring the techniques that have been developed in the space so far, and evaluates what direction the community should go in next. The survey will be published soon as part of the book Beyond Worst-Case Analysis.
Practical Implications: Data Poisoning and Implementation
One of the practical implications of robust statistics is the prevention of data poisoning. Data poisoning occurs when a system ingests data from outside sources and is vulnerable to malicious users who insert fake data that corrupts the behavior of the model.
While the potential for applications is large, what's holding back widespread implementation is that the algorithms use spectral methods and are not as automatic as the machine learning community wants them to be. Further, many real-world problems are non-convex, meaning that SGD can't be applied directly. Ilias believes that can change soon, and is currently working on eliminating the need for these bespoke algorithms by giving structure to the non-convex problems, formulating them in such a way that SGD can sufficiently solve them.
Guest: Ilias Diakonikolas
attention for being among the first to publicly warn about the coronavirus (COVID-19) that initially appeared in the Chinese city of Wuhan. How did the company's system of data gathering techniques and algorithms help flag the potential dangers of the disease? In this interview, Kamran shares how they use a variety of machine learning techniques to track, analyze and predict infectious disease outbreaks.
As a practicing physician based in Toronto, Kamran was directly impacted by the SARS outbreak in 2003. "We saw our hospitals completely overwhelmed. They went into lockdown. All elective procedures were canceled...even the city took on a different feel...there were billions of financial losses...and Toronto was just one of dozens." In the wake of that crisis, governments have been slow to act. Efforts like the International Health Regulations Treaty (2005), which aims to standardize communication about diseases, help but are not well enforced. It doesn't help that these nations are often unaware of the severity of an outbreak, or are hesitant to report a threat because of potential economic consequences.
Ultimately, his experience with the SARS crisis led Kamran to explore the role technology might play in anticipating outbreaks and predicting how they might spread. Kamran's insight ultimately lead to the creation of BlueDot, which applies machine learning to four main challenges in infectious disease tracking: Surveillance, Dispersion, Impact, and Communication.
Surveillance
The BlueDot engine gathers data on over 150 diseases and syndromes around the world, looking at over 100,000 online articles each day spanning 65 languages, searching every 15 minutes, 24 hours a day. This includes official data from organizations like the Centers for Disease Control or the World Health Organization, but also counts on less structured, local information from journalists and healthcare workers.
BlueDot's epidemiologists and physicians manually classified the data and developed a taxonomy so relevant keywords could be scanned efficiently. They later applied ML and NLP to train the system. Kamran points out that the algorithms in place perform "relatively low-complexity tasks, but they're incredibly high volume and there's an enormous amount of them, so we can simply train a machine to replicate our judgment [for classifying]".
As a result of their system's algorithms, only a handful of cases are flagged for human experts to analyze. In the case of COVID-19, the system highlighted articles in Chinese that reported 27 pneumonia cases associated with a market that had seafood and live animals in Wuhan.
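A toy version of that surveillance filter, with a made-up taxonomy and article titles, shows why the task is low-complexity but high-volume:

```python
# Made-up taxonomy keywords and article titles for illustration only.
taxonomy = {"pneumonia", "outbreak", "coronavirus"}

articles = [
    "27 pneumonia cases linked to a seafood market in Wuhan",
    "Local elections scheduled for spring",
    "Hospital staff report a new coronavirus strain",
]

# Only articles containing taxonomy keywords reach the human experts:
flagged = [a for a in articles if taxonomy & set(a.lower().split())]
print(len(flagged))   # 2
```

Each individual match is trivial, but run against 100,000 articles a day in 65 languages, automating it is what makes the human review tractable.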
Dispersion
Recognizing the role that travel plays in disease dispersion—especially in the age of air travel—BlueDot uses geographic information system (GIS) data and flight ticket sales to create a dispersion graph for each disease based on the airports connected to a city and where passengers are likely to fly. Not everyone travels by air, so they also use anonymized location data from 400 million mobile devices to track flows from outbreak epicenters to other parts of the region or world. The locations receiving the highest volume of travelers are identified and diligently evaluated for what the impact of the disease could be in the area.
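The core of the dispersion stage can be sketched as ranking destinations by outbound passenger volume from the epicenter (all figures below are invented for illustration):

```python
# All passenger volumes below are invented for illustration.
flights_from_epicenter = {
    "Bangkok": 120_000, "Hong Kong": 95_000, "Tokyo": 90_000,
    "Seoul": 70_000, "Taipei": 65_000, "Reykjavik": 1_200,
}

# Rank destinations by outbound volume: the top of the list is where
# imported cases are most likely to appear first.
at_risk = sorted(flights_from_epicenter, key=flights_from_epicenter.get,
                 reverse=True)
print(at_risk[:3])   # ['Bangkok', 'Hong Kong', 'Tokyo']
```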
For COVID-19, BlueDot applied this methodology to identify many of the cities among the first to receive the coronavirus, including Tokyo, Bangkok, Hong Kong, Seoul, and Taipei.
Impact
Once a virus leaves its region of origin, a wide variety of factors determine whether it will ultimately die out or grow into a full-fledged outbreak: A region may have better or worse public health infrastructure, hospitable or inhospitable climates, or varying economic resources. BlueDot's systems consider factors such as these to predict the potential impact on an identified area.
For example, if a virus is being spread by ticks, and Vancouver is in the middle of winter snow, the likelihood of an outbreak is very low because ticks would not survive that climate. However, the same virus might thrive in a humid environment like Florida, making the region at-risk for an outbreak.
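The tick example can be caricatured as a rule linking the disease's transmission vector to the destination's climate (the rules below are illustrative only, not BlueDot's model):

```python
def outbreak_risk(vector, climate):
    """Toy rules for illustration -- not BlueDot's actual impact model."""
    if vector == "tick" and climate == "snow":
        return "low"     # ticks would not survive a cold winter
    if vector == "tick" and climate == "humid":
        return "high"    # e.g. Florida's humid climate
    return "unknown"

print(outbreak_risk("tick", "snow"), outbreak_risk("tick", "humid"))   # low high
```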
Communication
If an area is determined to be at-risk, the focus shifts to providing early warnings to health officials, hospitals, airlines, and government agencies in public health, national defense, national security, and even agriculture. Kamran reiterates the importance of providing only the most relevant information to those who need it, referencing the ideas of Clay Shirky and his 2008 talk, "It's Not Information Overload. It's Filter Failure."
BlueDot first became aware of the pneumonia cases in Wuhan on December 31st, and in addition to notifying their clients and government stakeholders directly, they publicly released their findings in the Journal of Travel Medicine on January 14th.
Criticism and Limitations
These are incredibly difficult predictions to make, and the science behind the transmission of infectious diseases is complex and evolving every day. So, what is the proper role of technology? Kamran asserts that "by no means would [they] claim that AI has got this problem solved. It's just one of the tools in the toolbox."
In some cases, Kamran and his team may lack sufficient observations to develop a machine learning model for a particular disease. For this and other reasons, the company relies on a combination of approaches and a diverse team of specialists in their work.
With coronavirus already in full swing, BlueDot is looking more heavily at analyzing location data from mobile devices to provide a real-time understanding of how people are moving around. However, Kamran compares this to predicting the weather—the further ahead you're looking, the less accurate your prediction.
Despite the limitations, Kamran reinforces the value of the work by acknowledging that "Manually, it would take a hundred people around the clock [to process the data], and we have four people and a machine."
" data-search-guests="Kamran Khan">
Building Machine Learning Powered Applications: Going from Idea to Product.
Emmanuel began his career as a data scientist and went on to mentor over a hundred Ph.D. fellows looking to transition into machine learning as an AI program lead at Insight Data Science. His new book is the culmination of what he learned, and provides a guide for aspiring and practicing engineers and data scientists on how to approach ML projects systematically.
Structuring End-to-End Machine Learning Projects
In this interview, as in the book, Emmanuel shares his best practices for structuring and building projects. Emmanuel approaches new ML projects in four main stages:
- Formulating the problem and creating a plan: Here we want to think about the best possible approach to solving our specific problem. The goal is to simplify, simplify, simplify, and have a clear understanding of what your success metrics are before you start to build anything.
- Building a working pipeline and acquiring an initial dataset: Emmanuel recommends building an end-to-end data processing pipeline, albeit a simple one, right from the start, and walks us through how to test and evolve it. Like your pipeline, your dataset is also something you'll want to iterate on. Your data should inform your features and models, and not the other way around.
- Iterating on your models: Model development is inherently iterative, and Emmanuel shares his approach to developing and evaluating models. The latter depends on your ability to choose the evaluation metric most appropriate for your problem, and tools like confusion matrices, ROC curves, calibration curves, and various approaches to visualization can all come into play when trying to debug your models. Evaluating feature importance can also help here, as it allows you to check your assumptions about the problem.
- Deploying and monitoring: A number of non-technical and technical considerations come into play when unleashing your models on the real world. First off, we need to consider the ethical implications of our models as well as concerns like data ownership and bias. From a technical perspective, we need to choose a deployment option that makes sense for the way the model will be accessed by its users. We also want to build safeguards and sanity checks to protect us from model failures, and to monitor the model's predictions over time.
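The "simplest pipeline first, then iterate with an evaluation tool" idea from the stages above can be caricatured in a few lines. The toy labels and the majority-class baseline are invented for the example; they are not from the book.

```python
def majority_baseline(train_labels):
    """Stage 2: the simplest model that completes the pipeline end to end."""
    return max(set(train_labels), key=train_labels.count)

def confusion_matrix(y_true, y_pred):
    """Stage 3: one of the evaluation tools mentioned above."""
    counts = {}
    for t, p in zip(y_true, y_pred):
        counts[(t, p)] = counts.get((t, p), 0) + 1
    return counts

y_train = [0, 0, 1, 0, 1, 0]          # toy labels, invented for the example
y_test = [0, 1, 0, 1]
pred = majority_baseline(y_train)      # the majority class is 0
cm = confusion_matrix(y_test, [pred] * len(y_test))
print(cm)  # {(0, 0): 2, (1, 0): 2} -- the baseline misses every positive
```

The confusion matrix immediately exposes the baseline's weakness, which is exactly the signal you iterate against in stage three.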
The Extended Mind, by Andy Clark and David Chalmers, which suggests that the smartphone has become an extension of the human mind. But what Abeba emphasizes as most important are the disparities in how different groups of people are impacted by technology shifts, and the connection between privilege and control over that impact. AI is just the latest in a series of technological disruptions, and as Abeba notes, one with the potential to negatively impact disadvantaged groups in significant ways.
Harm of Categorization from ML Predictions
The inherent nature of so much of modern machine learning is to make predictions. An ethical approach to AI demands that we ask hard questions about those impacted by these predictions and assess the "harm of categorization." When an AI algorithm predicts that someone is more likely to be a criminal, less likely to be successful, or less qualified to receive credit, these predictions pose dangers that disproportionately impact disadvantaged populations versus those in more privileged positions.
Abeba's paper, Algorithmic Injustices Toward Relational Ethics, which recently won the best paper award at the Black in AI Workshop at NeurIPS, posits relational ethics and the relational mindset as a rethinking of those predictions. In other words, the question we should be asking is, why are certain demographics more at risk and how do we protect the welfare of those individuals most vulnerable to the social consequences of reductive labeling?
Her work also highlights that machine learning practices often rely on the assumption that the conditions they model are stable. This comes from the IID assumption, which holds that data points are independent and identically distributed. For example, you might behave a certain way at work, but at a party, you speak or act differently. This "code-switching" is natural to humans but violates ML algorithms' assumption that one's actions arise from a single distribution. For the most part, this dynamism is not something that ML sufficiently accounts for. As Abeba points out, the "nature of reality is that it is never stable... it is constantly changing." So, machine learning cannot be the final answer; it "cannot stabilize this continually moving nature of being." A relational ethics approach, however, accounts for change and assumes that solutions must be revised over time.
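A tiny numerical sketch of why the IID assumption matters: a threshold classifier fit on one pair of class distributions degrades sharply when the same classes "code-switch" to shifted distributions. All the distributions and numbers below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Train-time world: one feature, two classes (numbers invented for illustration).
a_train = rng.normal(0.0, 1.0, 1000)   # class A
b_train = rng.normal(3.0, 1.0, 1000)   # class B

# A deliberately simple model: threshold halfway between the class means.
threshold = (a_train.mean() + b_train.mean()) / 2

def accuracy(a, b, thr):
    """Fraction of points classified correctly by the fixed threshold."""
    return ((a < thr).sum() + (b >= thr).sum()) / (len(a) + len(b))

# IID holds: fresh samples from the same distributions.
acc_iid = accuracy(rng.normal(0.0, 1.0, 1000), rng.normal(3.0, 1.0, 1000), threshold)

# "Code-switching": both classes drift, so the learned threshold no longer fits.
acc_shifted = accuracy(rng.normal(2.0, 1.0, 1000), rng.normal(3.5, 1.0, 1000), threshold)

print(acc_iid, acc_shifted)  # the shifted accuracy is markedly lower
```

Nothing in the model changed; only the world did, which is precisely the instability Abeba describes.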
Robot Rights vs. Human Welfare
Abeba recently published another paper with her colleague Jelle van Dijk from the University of Twente, called "Robot Rights? Let's Talk about Human Welfare Instead." Like all good things, the paper came to life after a series of debates circulating on Twitter, and it comes down to two major concepts:
- Robots < humans. That is to say, robots cannot be granted or denied rights because machines are not the same as humans or any living being. The argument rests on a "philosophical post-Cartesian approach": being and knowing are sourced in the mind, which is embodied and enacted through a social environment. Robots arguably have neither conscious minds nor the embodied biological presence that constitutes existing as a "being" in the world around them.
- AI is not truly autonomous and never will be. This is because there is always a human involved to some degree. Another layer to this is the oversight of labor from "micro-workers" who contribute to AI without being acknowledged (like when you have to choose pictures of stop signs to prove you're not a bot).
here.
Crop Masking. Name that tree! This is essentially a classification task in which Gro seeks to identify what type of crop is growing in each pixel of a satellite image. The challenge is that conditions change often, and distinguishing between an orange tree and a tangerine crop might be easier said than done.
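Pixel-wise crop classification can be sketched as nearest-signature matching, which also shows why orange versus tangerine is hard: their spectral signatures can sit very close together. Gro's actual models are surely far richer; the signatures and the tiny "image" below are made up for the example.

```python
import numpy as np

# Hypothetical per-crop spectral signatures (mean reflectance in three
# satellite bands); real signatures would be learned from labeled fields.
signatures = {
    "orange":    np.array([0.12, 0.45, 0.60]),
    "tangerine": np.array([0.14, 0.43, 0.58]),  # deliberately close to orange
    "soy":       np.array([0.30, 0.55, 0.20]),
}

def classify_pixels(image):
    """Label every pixel with its nearest spectral signature.

    image: (H, W, bands) array of reflectances.
    """
    labels = list(signatures)
    centroids = np.stack([signatures[k] for k in labels])               # (C, bands)
    dists = np.linalg.norm(image[:, :, None, :] - centroids, axis=-1)   # (H, W, C)
    return np.array(labels, dtype=object)[dists.argmin(axis=-1)]

img = np.array([[[0.12, 0.45, 0.60], [0.30, 0.55, 0.20]]])  # one 1x2 "image"
crop_map = classify_pixels(img)
print(crop_map)
```

Because the orange and tangerine centroids differ by only a few hundredths, small changes in growing conditions can flip a pixel's label, which is the difficulty the paragraph describes.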
Droughts. Droughts are a major threat to farming and food production. To date, there is no standard international drought index that the world can agree on, and Gro wants to change that by analyzing environmental conditions to create an objective benchmark for severe droughts.
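No standard index exists, but one simple building block, in the spirit of the widely used Standardized Precipitation Index, is a z-score of current rainfall against its historical distribution; strongly negative values flag unusually dry conditions. The rainfall numbers below are hypothetical, and this is not Gro's benchmark.

```python
def drought_zscore(history, current):
    """Z-score of current rainfall against its historical distribution.

    Strongly negative values indicate unusually dry conditions; this is a
    crude building block, not a full drought index.
    """
    n = len(history)
    mean = sum(history) / n
    std = (sum((x - mean) ** 2 for x in history) / n) ** 0.5
    return (current - mean) / std

rain_mm = [80, 95, 70, 110, 90, 85, 100, 75]   # hypothetical monthly history
z = drought_zscore(rain_mm, 30)
print(round(z, 2))  # strongly negative: a severe-drought signal
```

An objective benchmark would need to combine signals like this across rainfall, soil moisture, and vegetation health, which is where the machine learning comes in.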
Knowledge Graph Automation. Gro ingests data from dozens of sources, and that information needs to be organized into a common, structured ontology, or knowledge graph. Gro uses machine learning models to automate this task, extracting data and updating how it flows into the knowledge graph.
The Data is So Good
Gro's models ingest "wildly different data types" to support the company's models and allow them to get a sense of a dynamic agriculture market. The majority, at least in volume, comes from satellite data, spanning the entire frequency range of the electromagnetic spectrum, including visible, ultraviolet, and infrared. This helps Gro deduce a wealth of information about crop growth and growing conditions around the globe.
In addition to satellite imagery, the company also collects a huge amount of time series data, much of it originating in PDFs or, worse, scanned paper reports issued by local governments.
The company's database currently holds over 55 million data series, and that number is doubling every 6-9 months. Reproducibility and attribution are extremely important, so the company ensures that each data point can be traced back to its source.
Despite the sheer number of data sources, the available data is not always sufficient. That's where Gro's own derived data series come into play: the company applies its machine learning models to data from multiple sources to create new, insightful data series, which also helps users overcome inconsistencies found in any individual source.
For the most part, the data Gro collects is surprisingly clean. As Nemo notes, it's "hard to lie to a satellite." Try me.
Modeling Lessons Learned
To deal with their scale, Gro has had to learn many lessons about developing effective machine learning models in agriculture. The keys to their success, according to Nemo, lie in:
- Choosing what to model. Gro carefully screens candidate problems, asking whether each is important and economically interesting for its user base.
- Don't come at a problem with a solution. This involves remaining "agnostic to technology" and being prepared to try different approaches to each issue.
- Build for the masses. The company actively builds general frameworks that can be applied to different situations and geographic regions.
- Pause, then go. Before launching a set of models, the team evaluates performance in distinctive ways, such as examining how the error is distributed spatially and temporally, and brings in domain expertise to guide feature engineering and model tweaks.
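The "spatially distributed error" check in the list above can be sketched as a per-region error breakdown. The `(region, y_true, y_pred)` schema and the yield numbers are invented for the example, not Gro's data.

```python
from collections import defaultdict

def error_by_region(records):
    """Mean absolute error per region.

    records: iterable of (region, y_true, y_pred) tuples, a hypothetical
    schema for the spatial error breakdown described above.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for region, y_true, y_pred in records:
        totals[region][0] += abs(y_true - y_pred)
        totals[region][1] += 1
    return {r: s / n for r, (s, n) in totals.items()}

preds = [
    ("Iowa", 100, 98), ("Iowa", 120, 121),
    ("Kansas", 80, 95), ("Kansas", 90, 70),
]
errs = error_by_region(preds)
print(errs)  # {'Iowa': 1.5, 'Kansas': 17.5}
```

A model with a good overall score but a Kansas-sized error cluster is exactly what this kind of pre-launch check is meant to catch.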
differential privacy, a topic we've covered here on the show quite extensively over the years. Differential privacy is a system for publicly sharing information about a dataset by describing patterns of groups within the dataset; the catch is that you have to do this without revealing information about the individuals in it.
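The canonical way to achieve this for a numeric query is the Laplace mechanism: add noise calibrated to the query's sensitivity. Here is a minimal sketch for a counting query (an illustration, not LinkedIn's production code).

```python
import random

def private_count(true_count, epsilon):
    """Release a count with the Laplace mechanism.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so Laplace(1/epsilon) noise gives
    epsilon-differential privacy for this single release.
    """
    scale = 1.0 / epsilon
    # A Laplace sample is the difference of two iid exponential samples.
    noise = random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    return true_count + noise

random.seed(0)
noisy = private_count(1000, epsilon=0.5)
print(noisy)  # close to 1000, but any one individual's presence is masked
```

Smaller epsilon means more noise and stronger privacy; the aggregate pattern survives while individual contributions are hidden.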
Ryan currently applies differential privacy at LinkedIn, but he has worked in the field, and on the related topic of federated learning, for quite some time. He was introduced to the subject as a PhD student at the University of Pennsylvania, where he worked closely with Aaron Roth, who we had the pleasure of interviewing back in 2018.
Ryan later worked at Apple, where he focused on the local model of differential privacy, meaning differential privacy is performed on individual users' local devices before being collected for analysis. (Apple uses this, for example, to better understand our favorite emojis 🤯 👍👏).
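The textbook primitive for the local model is randomized response: each user perturbs their own answer before it ever leaves the device, yet aggregate rates remain estimable. Apple's deployed mechanisms are more sophisticated; this is just the classic illustration of the idea.

```python
import random

def randomized_response(truth: bool) -> bool:
    """Flip a coin: report the truth half the time, a random bit otherwise.

    Any single report is plausibly deniable, but P(report=True) equals
    0.25 + 0.5 * true_rate, so the population rate can be recovered.
    """
    if random.random() < 0.5:
        return truth
    return random.random() < 0.5

def estimate_rate(reports):
    """Invert the bias: true_rate = (observed_rate - 0.25) / 0.5."""
    return (sum(reports) / len(reports) - 0.25) / 0.5

random.seed(1)
true_answers = [i < 300 for i in range(1000)]        # true rate: 30%
reports = [randomized_response(t) for t in true_answers]
print(round(estimate_rate(reports), 2))  # close to the true 0.3 rate
```

No server ever sees a trustworthy individual answer, which is the defining property of the local model.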
Not surprisingly, they do things a bit differently at LinkedIn. They utilize a central model, where the user's actual data is stored in a central database, with differential privacy applied before the data is made available for analysis.
(Another interesting use case that Ryan mentioned in the interview: the U.S. Census Bureau has announced plans to publish 2020 census data using differential privacy.)
Ryan recently put together a research paper with his LinkedIn colleague, David Durfee, that they presented as a spotlight talk at NeurIPS in Vancouver. The title of the paper is a bit daunting, but we break it down in the interview. You can check out the paper here: Practical Differentially Private Top-k Selection with Pay-what-you-get Composition.
There are two major components to the paper. First, they wanted to offer practical algorithms that you can layer on top of existing systems to achieve differential privacy for a very common type of query: the "Top-k" query, which means helping answer questions like "what are the top 10 articles that members are engaging with across LinkedIn?" Secondly, because privacy is reduced when users are allowed to make multiple queries of a differentially private system, Ryan's team developed an innovative way to ensure that their systems accurately account for the information the system returns to users over the course of a session. It's called Pay-what-you-get Composition.
[caption id="attachment_7063" align="aligncenter" width="600"]
This is a picture that Sam drew to show what's happening here.
[/caption]
One of the big innovations of the paper is discovering the connection between a common algorithm for implementing differential privacy, the exponential mechanism, and Gumbel noise, which is commonly used in machine learning.
Thanks to LinkedIn for sponsoring today's show! LinkedIn Engineering solves complex problems at scale to create economic opportunity for every member of the global workforce. AI and ML are integral aspects of almost every product the company builds for its members and customers. LinkedIn's highly structured dataset gives their data scientists and researchers the ability to conduct applied research to improve member experiences. To learn more about the work of LinkedIn Engineering, please visit engineering.linkedin.com/blog.
" data-search-guests="Ryan Rogers">
One of the really nice connections that we made in our paper was that actually the exponential mechanism can be implemented by adding something called Gumbel noise, rather than Laplace noise. Gumbel noise actually pops up in machine learning. It's something that you would do to report the category that has the highest weight, [using what is] called the Gumbel Max Noise Trick. It turned out that we could use that with the exponential mechanism to get a differentially private algorithm. [...] Typically, to solve top-k, you would use the exponential mechanism k different times; you can now do this in one shot by just adding Gumbel noise to [existing algorithms] and report the k values that are in the top [...] which made it a lot more efficient and practical.

When asked what he was most excited about for the future of differential privacy, Ryan cited the progress in open-source projects.
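That one-shot idea can be sketched directly: add Gumbel noise to each count and report the k largest. The `2 * k / epsilon` noise scale and the toy view counts below are illustrative assumptions; the paper handles sensitivity, unknown domains, and the pay-what-you-get composition accounting with far more care.

```python
import math
import random

def gumbel_top_k(counts, k, epsilon):
    """One-shot differentially private top-k via Gumbel noise.

    Adds Gumbel noise to every count and reports the k largest: the
    connection described above between the exponential mechanism and
    the Gumbel-max trick. The 2*k/epsilon scale is a rough peeling-style
    calibration; see the paper for the exact accounting.
    """
    scale = 2 * k / epsilon
    noisy = {
        item: c - scale * math.log(-math.log(random.random()))  # c + Gumbel
        for item, c in counts.items()
    }
    return sorted(noisy, key=noisy.get, reverse=True)[:k]

random.seed(0)
article_views = {"a": 5000, "b": 4800, "c": 120, "d": 90, "e": 60}
top = gumbel_top_k(article_views, k=2, epsilon=1.0)
print(top)  # ['a', 'b']
```

One pass over the counts replaces k sequential invocations of the exponential mechanism, which is where the efficiency gain comes from.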
This is the future of private data analytics. It's really important to be transparent with how you're doing things, otherwise if you're just touting that you're private and you're not revealing what it is, then is it really private?

He pointed out the open-source collaboration between Microsoft and Harvard's Institute for Quantitative Social Sciences. The project aims to create an open-source platform that allows researchers to share datasets containing personal information while preserving the privacy of individuals. Ryan expects such efforts to bring more people to the field, encouraging applications of differential privacy that work in practice and at scale. Listen to the interview with Ryan to get the full scope! And if you want to go deeper into differential privacy, check out our series of interviews on the topic from 2018.
@samcharrington or @twimlai.
To follow along with the 2019 AI Rewind Series, head over to the series page!
" data-search-guests="Nasrin Mostafazadeh">
@samcharrington or @twimlai!
" data-search-guests="Amir Zamir">
" data-search-guests="Timnit Gebru">
" data-search-guests="Chelsea Finn">
" data-search-guests="Zachary Lipton">
267. In our conversation with Tijmen, we discuss the ins and outs of compression and quantization of ML models, including how much models can actually be compressed and the best ways to achieve it. We also look at the recent "Lottery Ticket Hypothesis" paper and how it factors into this research, as well as best practices for training efficient networks. Finally, Tijmen recommends a few techniques for those interested, including tensor factorization and channel pruning.
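As a flavor of what quantization does, here is a generic uniform int8 round trip (a sketch, not Qualcomm's specific method): weights are mapped to 8-bit integers with one scale factor, and the reconstruction error is bounded by half a quantization step.

```python
import numpy as np

def quantize_int8(weights):
    """Uniform symmetric int8 quantization with a single scale factor."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=(4, 4)).astype(np.float32)  # toy weight tensor

q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale          # dequantize
err = np.abs(w - w_hat).max()
print(err <= scale / 2 + 1e-7)  # True: error bounded by half a quantization step
```

The int8 tensor needs a quarter of the storage of float32, which is the basic trade quantization makes between model size and reconstruction error.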
" data-search-guests="Tijmen Blankevoort">
Max Welling, Qualcomm has a hand in tons of machine learning research and hardware, and our conversation with Jeff is no different. We discuss how the various training frameworks fit into the developer experience when working with their chipsets, examples of federated learning in the wild, the role inference will play in data center devices and more.
" data-search-guests="Jeff Gehlhaar">
TWIMLcon conference, which will focus on the tools, technologies, and practices necessary to scale the delivery of machine learning and AI in the enterprise. The event will be held October 1st & 2nd in San Francisco and early bird registration is open today at twimlcon.com.
" data-search-guests="Yunfan Gerry Zhang">
" data-search-guests="Laurence Watson">
" data-search-guests="William Fehlman">
" data-search-guests="Judy Gichoya">
" data-search-guests="Karen Levy">
TWIML Talk #184, with Viviana Acquaviva, where we explore dark energy and star formation, and if you want to go way back, TWIML Talk #5 with Joshua Bloom which provides a great overview of the application of ML in astronomy.
" data-search-guests="Yashar Hezaveh">
@samcharrington or leave a comment below with your thoughts.
" data-search-guests="Rob Walker">
" data-search-guests="Lucas Joppa">
" data-search-guests="Justin Spelhaug">