This post presents the results of the 2020 Dask User Survey, which ran earlier this summer. Thanks to everyone who took the time to fill out the survey! These results help us better understand the Dask community and will guide future development efforts.
The raw data, as well as the start of an analysis, can be found in this binder:
Let us know if you find anything interesting in the data.
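If you want to poke at the data yourself, the analysis starts with something like the following sketch (the filename here is hypothetical; use whatever the binder repository actually ships):

```python
import pandas as pd

# Load the raw survey responses.
# The filename is a placeholder, not the actual path in the repository.
df = pd.read_csv("data/2020-survey-results.csv")
df.head()
```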
Most of the questions are the same as in 2019. We added a couple questions about deployment and dashboard usage. Let’s look at those first.
Among respondents who use a Dask package to deploy a cluster (about 53% of respondents), there’s a wide spread of methods.
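As one example of this kind of deployment, here's a minimal sketch using dask-jobqueue on a SLURM system (the resource numbers and job count are placeholders):

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Placeholder resources; adjust for your HPC system.
cluster = SLURMCluster(cores=8, memory="32GB", walltime="01:00:00")
cluster.scale(jobs=4)  # ask SLURM for four worker jobs
client = Client(cluster)
```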
Most people access the dashboard through a web browser. Those not using the dashboard are likely (hopefully) just using Dask on a single machine with the threaded scheduler (though the dashboard works fine on a single machine as well).
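If you're not using the dashboard yet, it's available whenever a distributed scheduler is running, including on a single machine. A minimal sketch:

```python
from dask.distributed import Client

client = Client()  # starts a local cluster; the dashboard comes with it
print(client.dashboard_link)  # usually http://127.0.0.1:8787/status
```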
Respondents’ learning material usage is fairly similar to last year. The most notable differences come from our survey form providing more options (our YouTube channel and “Gitter chat”). Other than that, examples.dask.org might be relatively more popular.
Just like last year, we’ll look at resource usage grouped by how often respondents use Dask.
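A sketch of that grouping with pandas (both column names are assumptions about the raw survey data):

```python
import pandas as pd

# Share of each resource within each usage-frequency group.
# Both column names are assumptions about the raw data.
usage = pd.crosstab(
    df["How often do you use Dask?"],
    df["What resources do you use Dask on?"],
    normalize="index",  # proportions within each frequency group
)
```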
A few observations:

- API usage remains about the same as last year (recall that about 20 fewer people took the survey, and respondents can select multiple APIs, so relative differences are most interesting). We added new choices for RAPIDS, Prefect, and XGBoost, each of which is somewhat popular (in the neighborhood of dask.Bag); see the counting sketch after this list.
- About 65% of our users use Dask on a cluster at least some of the time, which is similar to last year.
- Respondents continue to say that more documentation and examples would be the most valuable improvements to the project.
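Because respondents can select multiple APIs, counting mentions means splitting each answer first. A minimal sketch, assuming the answers live in a "Dask APIs" column and are ';'-separated (both assumptions about the raw data):

```python
# Count how many respondents mention each API.
# Column name and ';' delimiter are assumptions about the raw data.
api_counts = (
    df["Dask APIs"]
    .str.split(";")  # one multi-select answer -> list of APIs
    .explode()       # one row per (respondent, API) pair
    .value_counts()
)
```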
One interesting change comes from looking at “Which would help you most right now?” split by API group (dask.dataframe, dask.array, etc.). Last year, “More examples in my field” was the most important for all API groups (first table below), but in 2020 there are some differences (second table below).
2019 responses to “Which would help you most right now?”, counted by API group (in the original, cells were shaded by row, so darker meant a higher proportion of that API’s users choosing that priority):

| Dask APIs | Bug fixes | More documentation | More examples in my field | New features | Performance improvements |
|---|---|---|---|---|---|
| Array | 10 | 24 | 62 | 15 | 25 |
| Bag | 3 | 11 | 16 | 10 | 7 |
| DataFrame | 16 | 32 | 71 | 39 | 26 |
| Delayed | 16 | 22 | 55 | 26 | 27 |
| Futures | 12 | 9 | 25 | 20 | 17 |
| ML | 5 | 11 | 23 | 11 | 7 |
| Xarray | 8 | 11 | 34 | 7 | 9 |

2020 responses, same layout:

| Dask APIs | Bug fixes | More documentation | More examples in my field | New features | Performance improvements |
|---|---|---|---|---|---|
| Array | 12 | 16 | 56 | 15 | 23 |
| Bag | 7 | 5 | 24 | 7 | 16 |
| DataFrame | 24 | 21 | 67 | 22 | 41 |
| Delayed | 15 | 19 | 46 | 17 | 34 |
| Futures | 9 | 10 | 21 | 13 | 24 |
| ML | 6 | 4 | 21 | 9 | 12 |
| Xarray | 3 | 4 | 25 | 9 | 13 |
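For reference, here's a sketch of how a row-normalized version of these tables can be computed with pandas; the column names and ';' delimiter are again assumptions about the raw data:

```python
import pandas as pd

# One row per (respondent, API) pair, then cross-tabulate against priority.
exploded = df.assign(api=df["Dask APIs"].str.split(";")).explode("api")
table = pd.crosstab(
    exploded["api"], exploded["Which would help you most right now?"]
)

# Row-normalize so each API group sums to 1 -- the proportion that the
# original shading encoded.
normalized = table.div(table.sum(axis=1), axis=0)
```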
Examples are again the most important (for all API groups except Futures), but “Performance improvements” is now the second-most important improvement (and for Futures it’s the most important). How should we interpret this? A charitable interpretation is that Dask’s users are scaling to larger problems and running into new scaling challenges. A less charitable interpretation is that our users’ workflows are the same but Dask is getting slower!
SSH continues to be the most popular “cluster resource manager”. This was the big surprise last year, so we put in some work to make it nicer. Aside from that, not much has changed.
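For those deploying over SSH, a minimal sketch with dask.distributed's `SSHCluster` (hostnames are placeholders):

```python
from dask.distributed import Client, SSHCluster

# First host runs the scheduler, the rest run workers.
# Hostnames are placeholders.
cluster = SSHCluster(
    ["scheduler-host", "worker-1", "worker-2"],
    connect_options={"known_hosts": None},
)
client = Client(cluster)
```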
And Dask users are about as happy with its stability as last year.
Thanks again to all the respondents!