Large File Upload
Approach
A common approach is to split the file into smaller chunks, thereby reducing the risk of upload failure.
Details
How the underlying protocol is defined determines how the front-end and back-end interact, and in turn dictates how the front-end and back-end code are written.
Beyond the protocol, the problem also involves how the front-end controls concurrency and splits chunks efficiently, how the back-end stores and merges chunks efficiently, and how to guarantee chunk uniqueness.
There is no universal off-the-shelf solution on the market. Public-cloud OSS exists, but a product may need to be deployed to a private cloud, so it is still necessary to know how to implement this yourself.
Approach (Client-Side Chunking & Optimization)
The client splits the file, calculates the chunk hash and the full file hash, and then uses the hash to exchange file information with the server.
Calculating the hash is a CPU-intensive operation, which can cause long UI blocking.
Although Web Workers can be used to speed it up, my tests show that even with multi-threading, the blocking time for extremely large files (10 GB or more) still exceeds 30 seconds, which is unacceptable.
Therefore, I optimized the upload flow: on the assumption that most files are new, users can start uploading chunks before the full hash is available. This gives a near-zero-delay start of upload, and once the overall hash has been calculated, it is sent to the server to complete the file record.
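A minimal sketch of this flow, assuming hypothetical helpers (createFile, hashInWorker, uploadChunks, reportFileHash) that wrap the protocols described below:
async function upload(file, chunks) {
  const token = await createFile(file);    // obtain the upload token first
  const hashPromise = hashInWorker(file);  // full-file hash runs in parallel in a Web Worker
  await uploadChunks(token, chunks);       // chunks start uploading with near-zero delay
  const fileHash = await hashPromise;
  await reportFileHash(token, fileHash);   // supplement the hash once it is ready
}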
Communication Protocol Design
Four communication protocols (each sketched as an HTTP call after the list):
Create File Protocol: The front-end uses a HEAD request to obtain a unique upload token, which must be carried in all subsequent requests.
Hash Verification Protocol: The front-end sends a chunk hash or the entire file hash to the server, and receives the status of the chunk and the file.
Chunk Upload Protocol: The front-end sends the binary data of a chunk to the server for storage.
Chunk Merge Protocol: The front-end notifies the server that chunk merging can be completed.
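A sketch of the four protocols as fetch calls. The endpoint paths, header names, and the free variables (token, chunk, index, chunkHashes, fileHash) are illustrative assumptions, not the actual API:
// 1. Create File: a HEAD request that returns a unique upload token.
const res = await fetch('/upload/files', { method: 'HEAD' });
const token = res.headers.get('x-upload-token');
// 2. Hash Verification: report hashes, receive chunk/file status.
await fetch('/upload/hash', {
  method: 'POST',
  headers: { 'x-upload-token': token, 'content-type': 'application/json' },
  body: JSON.stringify({ chunkHashes, fileHash }),
});
// 3. Chunk Upload: send one chunk's binary data.
await fetch(`/upload/chunks/${index}`, {
  method: 'PUT',
  headers: { 'x-upload-token': token },
  body: chunk, // a Blob produced by file.slice
});
// 4. Chunk Merge: notify the server that merging can be completed.
await fetch('/upload/merge', { method: 'POST', headers: { 'x-upload-token': token } });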
Storage
Since a BFF layer is involved, server-side code needs to be written.
The biggest challenge is ensuring the uniqueness of each chunk. This uniqueness covers both storage uniqueness and transmission uniqueness. Storage uniqueness ensures that a chunk is not saved repeatedly, avoiding data redundancy; transmission uniqueness ensures that a chunk is not uploaded repeatedly, avoiding communication redundancy.
To guarantee that a chunk is not saved repeatedly, chunks must be decoupled from files, with no parent-child relationship between them: files are recorded independently and point to their chunks in order.
To guarantee that a chunk is not uploaded repeatedly, chunks must never be deleted. If chunks were deleted after a file is merged, the next upload attempt would find the corresponding chunks missing and have to re-upload them.
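A minimal sketch of this storage model, with in-memory maps standing in for whatever store the server actually uses (all names are illustrative):
// Chunks are stored once, keyed by their own hash, and never deleted.
const chunkStore = new Map(); // chunkHash -> chunk data (or a path on disk)
// Files are independent records that reference chunk hashes in order.
const fileStore = new Map();  // fileHash -> { name, size, chunkHashes: [...] }
function saveChunk(chunkHash, data) {
  // Storage uniqueness: an already-known chunk is never written twice.
  if (!chunkStore.has(chunkHash)) chunkStore.set(chunkHash, data);
}
function hasChunk(chunkHash) {
  // Transmission uniqueness: the client skips chunks the server reports as present.
  return chunkStore.has(chunkHash);
}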
Finally, the chunk merging logic. If chunks were truly merged into a large file, the large file's data would be redundant, and the whole process would be extremely time-consuming. Therefore, I handled it this way: when the server receives a merge request, it only performs simple checks (file size, number of chunks) and then generates a URL. When the user downloads the file, the server reads the chunk data sequentially using a stream and pipes the stream directly to the client.
Both merging and file access are efficient, and the server has no redundant storage.
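A Node.js sketch of this lazy merge, assuming chunks are stored as files named by their hash in a directory (CHUNK_DIR and the handler shape are assumptions). Nothing is concatenated on disk; the response is fed chunk by chunk:
const fs = require('node:fs');
const path = require('node:path');
function streamFile(res, fileRecord) {
  res.setHeader('content-length', fileRecord.size);
  const pipeNext = (i) => {
    if (i >= fileRecord.chunkHashes.length) return res.end();
    const stream = fs.createReadStream(path.join(CHUNK_DIR, fileRecord.chunkHashes[i]));
    stream.pipe(res, { end: false }); // keep the response open between chunks
    stream.on('end', () => pipeNext(i + 1));
  };
  pipeNext(0);
}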
Flow
First, notify the server that a file will be uploaded, requiring the server to return a unique identifier.
File chunking
// `file` is a File object, e.g. from an <input type="file"> element.
const chunkSize = 5 * 1024 * 1024; // 5 MB per chunk
const chunks = [];
let cur = 0;
while (cur < file.size) {
  // Blob.slice is cheap: it creates a view over the data, not a copy.
  chunks.push(file.slice(cur, cur + chunkSize));
  cur += chunkSize;
}
Generate fingerprint. To uniquely identify the file (enabling instant upload and resumable upload), the file's MD5 or SHA-256 needs to be computed.
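A sketch using the browser's built-in SubtleCrypto API (SHA-256). Note that SubtleCrypto has no incremental interface, so hashing a 10 GB file in one call is impractical; in practice the hashing would run chunk by chunk inside a Web Worker, e.g. with an incremental MD5 library:
async function hashBlob(blob) {
  const digest = await crypto.subtle.digest('SHA-256', await blob.arrayBuffer());
  // Render the digest as a hex string.
  return [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, '0')).join('');
}
// The same helper serves both per-chunk hashes and the full-file hash:
// const chunkHashes = await Promise.all(chunks.map(hashBlob));
// const fileHash = await hashBlob(file);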
Instant upload check. Before the actual upload, send the file hash to the server; the response falls into one of three scenarios (sketched after the list).
Scenario A: The file already exists on the server → directly return "Upload success" (instant upload).
Scenario B: Some chunks already exist on the server → return the list of received chunk indices (resume upload).
Scenario C: The file does not exist on the server → start a full upload.
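A sketch of this check; the endpoint and the response shape ({ fileExists, uploadedChunkIndices }) are assumptions:
const res = await fetch('/upload/hash', {
  method: 'POST',
  headers: { 'x-upload-token': token, 'content-type': 'application/json' },
  body: JSON.stringify({ fileHash, chunkHashes }),
});
const { fileExists, uploadedChunkIndices } = await res.json();
if (fileExists) {
  // Scenario A: instant upload, nothing more to send.
} else {
  // Scenarios B and C: upload only the chunks the server lacks
  // (uploadedChunkIndices is empty for a brand-new file).
  const pending = chunks.filter((_, i) => !uploadedChunkIndices.includes(i));
}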
Concurrent upload. Use FormData to wrap each chunk and send it via XMLHttpRequest or Fetch.
Concurrency control: Do not fire hundreds or thousands of requests at once. Maintain a send queue and limit concurrency to 3-6 so that browser resources are not exhausted and the page does not lag (see the sketch after the next point).
Retry mechanism: When a single chunk upload fails, automatic retry (e.g., 3 times) should be supported.
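A minimal sketch combining the send queue and the retry mechanism; uploadChunk is an assumed helper that sends one chunk (e.g. the PUT request sketched earlier):
async function uploadAll(chunks, { concurrency = 4, retries = 3 } = {}) {
  let next = 0;
  const worker = async () => {
    while (next < chunks.length) {
      const index = next++; // safe: JS is single-threaded, no await between check and claim
      for (let attempt = 0; ; attempt++) {
        try {
          await uploadChunk(chunks[index], index);
          break;
        } catch (err) {
          if (attempt >= retries) throw err; // give up after the configured retries
        }
      }
    }
  };
  await Promise.all(Array.from({ length: concurrency }, worker));
}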
Merge request. After all chunks are uploaded, the front-end sends a "merge" instruction to the server. The server assembles the chunks into the original file according to their indices and verifies that the resulting hash matches.
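A sketch of this final request, extending the merge call shown earlier with the data the server needs for its checks (the payload shape is an assumption):
await fetch('/upload/merge', {
  method: 'POST',
  headers: { 'x-upload-token': token, 'content-type': 'application/json' },
  body: JSON.stringify({ fileHash, chunkHashes, size: file.size }),
});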