Recently I had to download 70+ files from web URLs I only had as raw text strings. There was no way I'd copy-paste those URLs into the browser address bar 70+ times. How do I automate this simple task?
Background
Working on an internal presentation, I found this deck about the Atlassian Design team via Slideshare. Lucky me, I thought: there was even a big Download Now button with a kind text inviting me to download and read offline. Unfortunately, albeit predictably, it didn't allow me to download unless I signed up for a subscription.
While they did offer a 30-day free trial, I was required to provide credit card info up-front to activate it. I know they're running a business and need to pay their server bills and what-not, but I didn't like the idea of giving away my credit card info (or PayPal details) just to download a reference document, knowing that I was probably never going to use the service on a regular basis.
Process
1. Look at the markup and determine the file source
Fine, I won't be able to download the source PDF or PPT without paying. Then what about the in-browser preview images? Wouldn't there be standard image files for the web?
Inspecting via dev tools, I found that the HTML markup for the slides was pretty straightforward.
```html
<!-- Removed irrelevant data attributes and other details for brevity -->

<div id="slide-container">
  <div class="slide" id="slide-0" data-index="0">...</div>
  <div class="slide" id="slide-1" data-index="1">...</div>
  <div class="slide" id="slide-2" data-index="2">...</div>
  <div class="slide" id="slide-3" data-index="3">...</div>
  <div class="slide current" id="slide-4" data-index="4">
    <picture>
      <source srcset="https://path.to/file-title-5-size.jpg 2048w" ... />
      <img
        class="slide-image"
        src="https://path.to/file-title-5-size.jpg 2048w"
        ...
      />
    </picture>
  </div>
  <div class="slide" id="slide-5" data-index="5">...</div>
  <div class="slide" id="slide-6" data-index="6">...</div>
  <!-- this goes on until `data-index="70"` -->
</div>
```
Inside `slide-container` is a series of `slide`s, each with a `data-index` indicating the slide number and each containing a `picture` element pointing to a corresponding CDN address. The image file naming convention also looked very clear: a hyphen-separated `file-title-##-####` format, in which the last parts of the string indicated the slide number and image width.
To test, I manually copy-pasted a few image file URLs for different slides and confirmed that I could indeed access these individual slides as `JPG` assets.
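As an aside, the `srcset` (and even the `src`) values in the markup above carry a `2048w` width descriptor after the URL, so anything copied straight out of those attributes needs the descriptor stripped before use. A minimal sketch, assuming attribute values shaped like the ones above:

```javascript
// Sketch: strip the "2048w" width descriptor off a srcset-style value,
// leaving just the bare image URL.
const urlFromSrcset = (srcset) => srcset.trim().split(/\s+/)[0];

console.log(urlFromSrcset("https://path.to/file-title-5-2048.jpg 2048w"));
// → https://path.to/file-title-5-2048.jpg
```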
2. Try downloading directly inside a browser
To automate around these assets, I needed the full list of URLs. Now that I knew the naming convention, this part was fairly easy. I first started with a plain `for` loop in the browser console.
```javascript
for (let i = 1; i < 72; i++) {
  console.log(`https://path.to/file-title-${i}-2048.jpg`);
}
```
```console
https://path.to/file-title-1-2048.jpg
https://path.to/file-title-2-2048.jpg
https://path.to/file-title-3-2048.jpg
https://path.to/file-title-4-2048.jpg
...
```
The snippet above sort of works. The URLs logged in the console are directly clickable and open up a new tab with the target image in it. It comes with huge pain points, though:
- I'd need to click the links 71 times to open each file.
- I'd need to hit `CMD` + `S` 71 times to save them individually.
- Chrome automatically converts the images to `WEBP` format, but I'd need `JPG`.
Pathetically ignorant of the Node environment, I would rather have stuck to the browser as much as possible, but it now felt inevitable that I'd have to do this away from the browser.
3. Use core Node.js modules
For a throwaway task like this, I didn't want any dependencies, so the obvious first step was to create a plain JavaScript file.
```bash
touch download_slides.js
```
Of the several resources that Google pointed me to, this article and this blog piece were the most succinct and practical for my needs.
I started by loading the modules.
```javascript
const fs = require("fs");
const https = require("https");
```
Then I generated the URL strings and stored them in a variable.
```javascript
const files = new Array(71)
  .fill("")
  .map((item, index) => `https://path.to/file-title-${index + 1}-2048.jpg`);
```
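Incidentally, the same list could be built with `Array.from`, which takes a length and a mapping callback in one go and skips the `.fill("")` step. A small equivalent sketch (using a different variable name to avoid clashing with `files` above):

```javascript
// Equivalent to the new Array(71).fill("").map(...) approach above.
const filesViaFrom = Array.from(
  { length: 71 },
  (_, index) => `https://path.to/file-title-${index + 1}-2048.jpg`
);

console.log(filesViaFrom[0]); // → https://path.to/file-title-1-2048.jpg
console.log(filesViaFrom.length); // → 71
```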
Created a function that returns a single HTTPS request as a Promise. Each response is streamed to a local file via the `fs.createWriteStream()` method.
```javascript
const download = (url, destPath) => {
  return new Promise((resolve, reject) => {
    https
      .get(url, (res) => {
        const fileStream = fs.createWriteStream(destPath);
        res.pipe(fileStream);
        // Resolve only once the file has been fully written,
        // and surface stream errors instead of swallowing them.
        fileStream.on("finish", () => resolve(true));
        fileStream.on("error", reject);
      })
      .on("error", reject);
  });
};
```
With those individual Promises, I created an array containing all the HTTPS requests for the assets.
```javascript
const createDownloadRequests = (urls) => {
  const requests = [];
  for (const url of urls) {
    const urlObj = new URL(url);
    const parts = urlObj.pathname.split("/");
    const filename = parts[parts.length - 1];
    requests.push(download(url, `./${filename}`));
  }
  return requests;
};
```
As a final step, I used `Promise.all()` to carry out all the downloads.
```javascript
(async () => {
  try {
    const requests = createDownloadRequests(files);
    await Promise.all(requests);
  } catch (err) {
    console.log(err);
  }
})();
```
The full code looks like this:
```javascript
const fs = require("fs");
const https = require("https");

const files = new Array(71)
  .fill("")
  .map((item, index) => `https://path.to/file-title-${index + 1}-2048.jpg`);

const download = (url, destPath) => {
  return new Promise((resolve, reject) => {
    https
      .get(url, (res) => {
        const fileStream = fs.createWriteStream(destPath);
        res.pipe(fileStream);
        fileStream.on("finish", () => resolve(true));
        fileStream.on("error", reject);
      })
      .on("error", reject);
  });
};

const createDownloadRequests = (urls) => {
  const requests = [];
  for (const url of urls) {
    const urlObj = new URL(url);
    const parts = urlObj.pathname.split("/");
    const filename = parts[parts.length - 1];
    requests.push(download(url, `./${filename}`));
  }
  return requests;
};

(async () => {
  try {
    const requests = createDownloadRequests(files);
    await Promise.all(requests);
  } catch (err) {
    console.log(err);
  }
})();
```
4. Run the script
```bash
node download_slides.js
```
This part was the most obvious. Running the script brought down all 71 `JPG`s I wanted. I ended up referencing and using only two of those slides. You might rightfully say I could've simply downloaded the two particular files rather than going through this hassle. But, hey, I learned a little something about Node in the process. 🤷♂️
What I’d do differently
The way it’s coded, this script is fairly limited in functionality and overly specific to this one use case.
- It does not provide any meaningful feedback, like:
  - what the status of each file is
  - which asset is currently being downloaded
  - whether the whole download process has completed
  - whether there were any specific errors, etc.
- The asset URL strings are mostly hard-coded and cannot be reused in other contexts. It also requires that I make sense of the file naming convention before running the script.
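For instance, the per-file feedback could come from swapping `Promise.all()` for `Promise.allSettled()`, which reports every outcome instead of bailing on the first rejection. A rough sketch with the network calls stubbed out (`fakeDownload` is a hypothetical stand-in for the real `download()` above):

```javascript
// Hypothetical stand-in for download(url, destPath): fails for "bad" URLs.
const fakeDownload = (url) =>
  url.includes("bad")
    ? Promise.reject(new Error(`failed: ${url}`))
    : Promise.resolve(true);

const urls = [
  "https://path.to/file-title-1-2048.jpg",
  "https://path.to/bad-file.jpg",
];

(async () => {
  // allSettled never rejects; each result carries its own status.
  const results = await Promise.allSettled(urls.map(fakeDownload));
  results.forEach((result, i) => {
    if (result.status === "fulfilled") {
      console.log(`done: ${urls[i]}`);
    } else {
      console.log(`error: ${urls[i]} (${result.reason.message})`);
    }
  });
})();
```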
If I were to face similar needs in the future, I might want to add an explicit feedback UI and would try to automate harvesting the target file names. Oh, and in TypeScript. Perhaps in Deno.
In any case, the presentation went well and I can come back to this blog post when/if I need to.