August 2022
67 posts: 5 entries, 24 links, 2 quotes, 36 beats
Aug. 1, 2022
storysniffer (via) Ben Welsh built a small Python library that guesses if a URL points to an article on a news website, or if it’s more likely to be a category page or /about page or similar. I really like this as an example of what you can do with a tiny machine learning model: the model is bundled as a ~3MB pickle file as part of the package, and the repository includes the Jupyter notebook that was used to train it.
Aug. 2, 2022
How I Used DALL·E 2 to Generate The Logo for OctoSQL (via) Jacob Martin gives a blow-by-blow account of his attempts at creating a logo for his OctoSQL project using DALL-E, spending $30 of credits and making extensive use of both the “variations” feature and the tool that lets you request modifications to existing images by painting over parts you want to regenerate. Really interesting to read as an example of a “real world” DALL-E project.
Aug. 3, 2022
Introducing sqlite-html: query, parse, and generate HTML in SQLite (via) Another brilliant SQLite extension module from Alex Garcia, this time written in Go. sqlite-html adds a whole family of functions to SQLite for parsing and constructing HTML strings, built on the Go goquery and cascadia libraries. Once again, Alex uses an Observable notebook to describe the new features, with embedded interactive examples that are backed by a Datasette instance running in Fly.
Aug. 4, 2022
Your documentation is complete when someone can use your module without ever having to look at its code. This is very important. This makes it possible for you to separate your module's documented interface from its internal implementation (guts). This is good because it means that you are free to change the module's internals as long as the interface remains the same.
Remember: the documentation, not the code, defines what a module does.
Aug. 5, 2022
Aug. 6, 2022
Microsoft® Open Source Software (OSS) Secure Supply Chain (SSC) Framework Simplified Requirements. This is really good: don’t get distracted by the acronyms, skip past the intro and head straight to the framework practices section, which talks about things like keeping copies of the packages you depend on, running scanners, tracking package updates and most importantly keeping an inventory of the open source packages you work so you can quickly respond to things like log4j.
I feel like I say this a lot these days, but if you had told teenage-me that Microsoft would be publishing genuinely useful non-FUD guides to open source supply chain security by 2022 I don’t think I would have believed you.
Aug. 7, 2022
Aug. 9, 2022
sqlite-zstd: Transparent dictionary-based row-level compression for SQLite. Interesting SQLite extension from phiresky, the author of that amazing SQLite WASM hack from a while ago which could fetch subsets of a large SQLite database using the HTTP range header. This extension, written in Rust, implements row-level compression for a SQLite table by creating compression dictionaries for larger chunks of the table, providing better results than just running compression against each row value individually.
Aug. 10, 2022
curl-impersonate (via) “A special build of curl that can impersonate the four major browsers: Chrome, Edge, Safari & Firefox. curl-impersonate is able to perform TLS and HTTP handshakes that are identical to that of a real browser.”
I hadn’t realized that it’s become increasingly common for sites to use fingerprinting of TLS and HTTP handshakes to block crawlers. curl-impersonate attempts to impersonate browsers much more accurately, using tricks like compiling with Firefox’s nss TLS library and Chrome’s BoringSSL.
How SQLite Helps You Do ACID (via) Ben Johnson’s series of posts explaining the internals of SQLite continues with a deep look at how the rollback journal works. I’m learning SO much from this series.
Introducing sqlite-http: A SQLite extension for making HTTP requests (via) Characteristically thoughtful SQLite extension from Alex, following his sqlite-html extension from a few days ago. sqlite-http lets you make HTTP requests from SQLite—both as a SQL function that returns a string, and as a table-valued SQL function that lets you independently access the body, headers and even the timing data for the request.
This write-up is excellent: it provides interactive demos but also shows how additional SQLite extensions such as the new-to-me “define” extension can be combined with sqlite-http to create custom functions for parsing and processing HTML.
Let websites framebust out of native apps (via) Adrian Holovaty makes a compelling case that it is Not OK that we allow native mobile apps to embed our websites in their own browsers, including the ability for them to modify and intercept those pages (it turned out today that Instagram injects extra JavaScript into pages loaded within the Instagram in-app browser). He compares this to frame-busting on the regular web, and proposes that the X-Frame-Options: DENY header which browsers support to prevent a page from being framed should be upgraded to apply to native embedded browsers as well.
I’m not convinced that reusing X-Frame-Options: DENY would be the best approach—I think it would break too many existing legitimate uses—but a similar option (or a similar header) specifically for native apps which causes pages to load in the native OS browser instead sounds like a fantastic idea to me.
Aug. 11, 2022
sethmlarson/pypi-data (via) Seth Michael Larson uses GitHub releases to publish a ~325MB (gzipped to ~95MB) SQLite database on a roughly monthly basis that contains records of 370,000+ PyPI packages plus their OpenSSF score card metrics. It’s a really interesting dataset, but also a neat way of packaging and distributing data—the scripts Seth uses to generate the database file are included in the repository.
datasette on Open Source Insights
(via)
Open Source Insights is "an experimental service developed and hosted by Google to help developers better understand the structure, security, and construction of open source software packages". It calculates scores for packages using various automated heuristics. A JSON version of the resulting score card can be accessed using https://deps.dev/_/s/pypi/p/{package_name}/v/
Litestream backups for Datasette Cloud (and weeknotes)
My main focus this week has been adding robust backups to the forthcoming Datasette Cloud.
[... 1,604 words]Aug. 12, 2022
Aug. 13, 2022
Bypassing macOS notarization (via) Useful tip from the geckodriver docs: if you’ve downloaded an executable file through your browser and now cannot open it because of the macOS quarantine feature, you can run “xattr -r -d com.apple.quarantine path-to-binary” to clear that flag so you can execute the file.