Merge branch '351213-restructure-uploads-doc' into 'master'

Break up uploads developer docs into 3 sections

See merge request gitlab-org/gitlab!80392

Commit c7bf70dc authored Feb 28, 2022 by Marcin Sedlak-Jakubowski
17 changed files
--- a/doc/administration/object_storage.md
+++ b/doc/administration/object_storage.md
@@ -62,7 +62,7 @@ Using the consolidated object storage configuration has a number of advantages:
 - It enables the use of [encrypted S3 buckets](#encrypted-s3-buckets).
 - It [uploads files to S3 with proper `Content-MD5` headers](https://gitlab.com/gitlab-org/gitlab-workhorse/-/issues/222).

-Because [direct upload mode](../development/uploads.md#direct-upload)
+Because [direct upload mode](../development/uploads/implementation.md#direct-upload)
 must be enabled, only the following providers can be used:

 - [Amazon S3-compatible providers](#s3-compatible-connection-settings)

--- a/doc/administration/uploads.md
+++ b/doc/administration/uploads.md
@@ -67,7 +67,7 @@ For source installations the following settings are nested under `uploads:` and
 |---------|-------------|---------|
 | `enabled` | Enable/disable object storage | `false` |
 | `remote_directory` | The bucket name where Uploads will be stored| |
-| `direct_upload` | Set to `true` to remove Puma from the Upload path. Workhorse handles the actual Artifact Upload to Object Storage while Puma does minimal processing to keep track of the upload. There is no need for local shared storage. The option may be removed if support for a single storage type for all files is introduced. Read more on [direct upload](../development/uploads.md#direct-upload). | `false` |
+| `direct_upload` | Set to `true` to remove Puma from the Upload path. Workhorse handles the actual Artifact Upload to Object Storage while Puma does minimal processing to keep track of the upload. There is no need for local shared storage. The option may be removed if support for a single storage type for all files is introduced. Read more on [direct upload](../development/uploads/implementation.md#direct-upload). | `false` |
 | `background_upload` | Set to `false` to disable automatic upload. Option may be removed once upload is direct to S3 (if `direct_upload` is set to `true` it will override `background_upload`) | `true` |
 | `proxy_download` | Set to `true` to enable proxying all files served. Option allows to reduce egress traffic as this allows clients to download directly from remote storage instead of proxying all data | `false` |
 | `connection` | Various connection options described below | |

--- a/doc/architecture/blueprints/object_storage/index.md
+++ b/doc/architecture/blueprints/object_storage/index.md
@@ -196,7 +196,7 @@ require one bucket.

 ## Additional reading materials

-- [Uploads development documentation: The problem description](../../../development/uploads.md#the-problem-description).
+- [Uploads development guide](../../../development/uploads/index.md).
 - [Speed up the monolith, building a smart reverse proxy in Go](https://archive.fosdem.org/2020/schedule/event/speedupmonolith/): a presentation explaining a bit of workhorse history and the challenge we faced in releasing the first cloud-native installation.
 - [Object Storage improvements epic](https://gitlab.com/groups/gitlab-org/-/epics/483).
 - We are moving to GraphQL API, but [we do not support direct upload](https://gitlab.com/gitlab-org/gitlab/-/issues/280819).

--- a/doc/development/code_review.md
+++ b/doc/development/code_review.md
@@ -624,7 +624,7 @@ Enterprise Edition instance. This has some implications:
      [added to Omnibus](https://docs.gitlab.com/omnibus/settings/gitlab.yml#adding-a-new-setting-to-gitlabyml).
 1. **File system access** is not possible in a [cloud-native architecture](architecture.md#adapting-existing-and-introducing-new-components).
   Ensure that we support object storage for any file storage we need to perform. For more
-   information, see the [uploads documentation](uploads.md).
+   information, see the [uploads documentation](uploads/index.md).

 ### Review turnaround time


--- a/doc/development/contributing/merge_request_workflow.md
+++ b/doc/development/contributing/merge_request_workflow.md
@@ -81,7 +81,7 @@ request is as follows:
 1. If your MR touches code that executes shell commands, reads or opens files, or
   handles paths to files on disk, make sure it adheres to the
   [shell command guidelines](../shell_commands.md)
-1. If your code needs to handle file storage, see the [uploads documentation](../uploads.md).
+1. If your code needs to handle file storage, see the [uploads documentation](../uploads/index.md).
 1. If your merge request adds one or more migrations, make sure to execute all
   migrations on a fresh database before the MR is reviewed. If the review leads
   to large changes in the MR, execute the migrations again once the review is complete.

--- a/doc/development/fe_guide/content_editor.md
+++ b/doc/development/fe_guide/content_editor.md
@@ -47,7 +47,7 @@ The Content Editor requires two properties:

 - `renderMarkdown` is an asynchronous function that returns the response (String) of invoking the
 [Markdown API](../../api/markdown.md).
-- `uploadsPath` is a URL that points to a [GitLab upload service](../uploads.md#upload-encodings)
+- `uploadsPath` is a URL that points to a [GitLab upload service](../uploads/implementation.md#upload-encodings)
  with `multipart/form-data` support.

 See the [`WikiForm.vue`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/assets/javascripts/pages/shared/wikis/components/wiki_form.vue#L207)

--- a/doc/development/file_storage.md
+++ b/doc/development/file_storage.md
@@ -8,7 +8,7 @@ info: To determine the technical writer assigned to the Stage/Group associated w

 We use the [CarrierWave](https://github.com/carrierwaveuploader/carrierwave) gem to handle file upload, store and retrieval.

-File uploads should be accelerated by workhorse, for details please refer to [uploads development documentation](uploads.md).
+File uploads should be accelerated by workhorse, for details please refer to [uploads development documentation](uploads/index.md).

 There are many places where file uploading is used, according to contexts:


--- a/doc/development/index.md
+++ b/doc/development/index.md
@@ -227,7 +227,7 @@ the [reviewer values](https://about.gitlab.com/handbook/engineering/workflow/rev
 - [Working with merge request diffs](diffs.md)
 - [Approval Rules](approval_rules.md)
 - [Repository mirroring](repository_mirroring.md)
-- [File uploads](uploads.md)
+- [Uploads development guide](uploads/index.md)
 - [Auto DevOps development guide](auto_devops.md)
 - [Renaming features](renaming_features.md)
 - [Code Intelligence](code_intelligence/index.md)

--- a/doc/development/merge_request_performance_guidelines.md
+++ b/doc/development/merge_request_performance_guidelines.md
@@ -526,7 +526,7 @@ end

 The usage of shared temporary storage is required if your intent
 is to persistent file for a disk-based storage, and not Object Storage.
-[Workhorse direct_upload](uploads.md#direct-upload) when accepting file
+[Workhorse direct_upload](uploads/implementation.md#direct-upload) when accepting file
 can write it to shared storage, and later GitLab Rails can perform a move operation.
 The move operation on the same destination is instantaneous.
 The system instead of performing `copy` operation just re-attaches file into a new place.
@@ -550,7 +550,7 @@ that implements a seamless support for Shared and Object Storage-based persisten
 #### Data access

 Each feature that accepts data uploads or allows to download them needs to use
-[Workhorse direct_upload](uploads.md#direct-upload). It means that uploads needs to be
+[Workhorse direct_upload](uploads/implementation.md#direct-upload). It means that uploads needs to be
 saved directly to Object Storage by Workhorse, and all downloads needs to be served
 by Workhorse.

@@ -562,5 +562,5 @@ can time out, which is especially problematic for slow clients. If clients take
 to upload/download the processing slot might be killed due to request processing
 timeout (usually between 30s-60s).

-For the above reasons it is required that [Workhorse direct_upload](uploads.md#direct-upload) is implemented
+For the above reasons it is required that [Workhorse direct_upload](uploads/implementation.md#direct-upload) is implemented
 for all file uploads and downloads.
--- a/doc/development/packages.md
+++ b/doc/development/packages.md
@@ -151,7 +151,7 @@ During this phase, the idea is to collect as much information as possible about
  1. Empty file structure (API file, base service for this package)
  1. Authentication system for "logging in" to the package manager
  1. Identify metadata and create applicable tables
-  1. Workhorse route for [object storage direct upload](uploads.md#direct-upload)
+  1. Workhorse route for [object storage direct upload](uploads/implementation.md#direct-upload)
  1. Endpoints required for upload/publish
  1. Endpoints required for install/download
  1. Endpoints required for required actions
@@ -210,7 +210,7 @@ File uploads should be handled by GitLab Workhorse using object accelerated uplo
 the workhorse proxy that checks all incoming requests to GitLab intercept the upload request,
 upload the file, and forward a request to the main GitLab codebase only containing the metadata
 and file location rather than the file itself. An overview of this process can be found in the
-[development documentation](uploads.md#direct-upload).
+[development documentation](uploads/implementation.md#direct-upload).

 In terms of code, this means a route must be added to the
 [GitLab Workhorse project](https://gitlab.com/gitlab-org/gitlab-workhorse) for each upload endpoint being added
@@ -272,7 +272,7 @@ features must be implemented when the feature flag is removed.
 - File format guards (only accept valid file formats for the package type)
 - Name regex with validation
 - Version regex with validation
-- Workhorse route for [accelerated](uploads.md#how-to-add-a-new-upload-route) uploads
+- Workhorse route for [accelerated](uploads/working_with_uploads.md) uploads
 - Background workers for extracting package metadata (if applicable)
 - Documentation (how to use the feature)
 - API Documentation (individual endpoints with curl examples)

--- a/doc/development/shared_files.md
+++ b/doc/development/shared_files.md
@@ -11,5 +11,5 @@ servers in `shared/`, using a shared storage solution like NFS. Although this is
 some GitLab installations, it must not be the only file storage option for a given feature. This is
 because [cloud-native GitLab installations do not support it](architecture.md#adapting-existing-and-introducing-new-components).

-Our [uploads documentation](uploads.md) describes how to handle file storage in
+Our [uploads documentation](uploads/index.md) describes how to handle file storage in
 such a way that it supports both options: direct disk access and object storage.
--- a/doc/development/uploads.md
+++ b/doc/development/uploads.md
--- a/doc/development/uploads/background.md
+++ b/doc/development/uploads/background.md
+---
+stage: none
+group: unassigned
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Uploads guide: Why GitLab uses custom upload logic
+
+This page is for developers trying to better understand the history behind GitLab uploads and the
+technical challenges associated with uploads.
+
+## The problem description
+
+[GitLab Workhorse](https://gitlab.com/gitlab-org/gitlab-workhorse) has special rules for handling uploads.
+We process the upload in Workhorse to prevent occupying a Ruby process on I/O operations and because it is cheaper.
+This process can also directly upload to object storage.
+
+The following graph explains machine boundaries in a scalable GitLab installation. Without any Workhorse optimization in place, we can expect incoming requests to follow the numbers on the arrows.
+
+```mermaid
+graph TB
+    subgraph "load balancers"
+      LB(Proxy)
+    end
+
+    subgraph "Shared storage"
+       nfs(NFS)
+    end
+
+    subgraph "redis cluster"
+       r(persisted redis)
+    end
+    LB-- 1 -->Workhorse
+
+    subgraph "web or API fleet"
+      Workhorse-- 2 -->rails
+    end
+    rails-- "3 (write files)" -->nfs
+    rails-- "4 (schedule a job)" -->r
+
+    subgraph sidekiq
+      s(sidekiq)
+    end
+    s-- "5 (fetch a job)" -->r
+    s-- "6 (read files)" -->nfs
+```
+
+We have three challenges here: performance, availability, and scalability.
+
+### Performance
+
+Rails processes are expensive in terms of both CPU and memory. The Ruby [global interpreter lock](https://en.wikipedia.org/wiki/Global_interpreter_lock) adds to the cost because the Ruby process spends time on I/O operations in step 3, causing incoming requests to pile up.
+
+In order to improve this, [disk buffered upload](implementation.md#disk-buffered-upload) was implemented. With this, Rails no longer deals with writing uploaded files to disk.
+
+```mermaid
+graph TB
+    subgraph "load balancers"
+      LB(HA Proxy)
+    end
+
+    subgraph "Shared storage"
+       nfs(NFS)
+    end
+
+    subgraph "redis cluster"
+       r(persisted redis)
+    end
+    LB-- 1 -->Workhorse
+
+    subgraph "web or API fleet"
+      Workhorse-- "3 (without files)" -->rails
+    end
+    Workhorse -- "2 (write files)" -->nfs
+    rails-- "4 (schedule a job)" -->r
+
+    subgraph sidekiq
+      s(sidekiq)
+    end
+    s-- "5 (fetch a job)" -->r
+    s-- "6 (read files)" -->nfs
+```
+
+### Availability
+
+There's also an availability problem in this setup: NFS is a [single point of failure](https://en.wikipedia.org/wiki/Single_point_of_failure).
+
+To address this problem, a highly available (HA) object storage can be used, which is supported by [direct upload](implementation.md#direct-upload).
+
+### Scalability
+
+Scaling NFS is outside of our support scope, and NFS is not a part of cloud native installations.
+
+Features that require Sidekiq and do not use direct upload don't work without NFS. In Kubernetes, machine boundaries translate to pods, and in this case the uploaded file is written to the pod's private disk. Because the Sidekiq pod cannot reach into other pods, it fails to read the file.
--- a/doc/development/uploads/implementation.md
+++ b/doc/development/uploads/implementation.md
+---
+stage: none
+group: unassigned
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Uploads guide: How uploads work technically
+
+This page is for developers trying to better understand what kinds of uploads exist in GitLab and how they are implemented.
+
+## Kinds of uploads and how to choose between them
+
+We can identify three major use-cases for an upload:
+
+1. **storage:** if we are uploading for storing a file (like artifacts, packages, or discussion attachments). In this case [direct upload](#direct-upload) is the proper level as it's the least resource-intensive operation. Additional information can be found on [File Storage in GitLab](../file_storage.md).
+1. **in-controller/synchronous processing:** if we allow processing **small files** synchronously, using [disk buffered upload](#disk-buffered-upload) may speed up development.
+1. **Sidekiq/asynchronous processing:** Asynchronous processing must implement [direct upload](#direct-upload) because it's the only way to support Cloud Native deployments without a shared NFS.
+
+Selecting the proper acceleration is a tradeoff between speed of development and operational costs.
+
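The selection rules above can be condensed into a small lookup. This is only an illustrative sketch: the method name `recommended_upload` and the use-case symbols are hypothetical and do not exist in the GitLab codebase.

```ruby
# Hypothetical mapping from the three use cases described above to the
# upload technology this guide recommends for each of them.
def recommended_upload(use_case)
  case use_case
  when :storage, :async_processing
    # Storing files and Sidekiq processing both call for direct upload.
    :direct_upload
  when :sync_processing
    # Small files processed in the controller may use disk buffered upload.
    :disk_buffered_upload
  else
    raise ArgumentError, "unknown use case: #{use_case.inspect}"
  end
end
```

For example, `recommended_upload(:async_processing)` evaluates to `:direct_upload`, reflecting that asynchronous processing cannot rely on shared NFS in Cloud Native deployments.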
+For more details about currently broken features, see [epic &1802](https://gitlab.com/groups/gitlab-org/-/epics/1802).
+
+### Handling repository uploads
+
+Some features involve Git repository uploads without using a regular Git client.
+Some examples are uploading a repository file from the web interface and [design management](../../user/project/issues/design_management.md).
+
+Those uploads require the Rails controller to act as a Git client in lieu of the user.
+Those operations fall into the _in-controller/synchronous processing_ category, but we have no guarantees on the file size.
+
+In the case of an LFS upload, the file pointer is committed synchronously, but the file upload to object storage is performed asynchronously with Sidekiq.
+
+## Upload encodings
+
+By upload encoding we mean how the file is included within the incoming request.
+
+We have three kinds of file encoding in our uploads:
+
+1. <i class="fa fa-check-circle"></i> **multipart**: `multipart/form-data` is the most common, a file is encoded as a part of a multipart encoded request.
+1. <i class="fa fa-check-circle"></i> **body**: some APIs upload files as the whole request body.
+1. <i class="fa fa-times-circle"></i> **JSON**: some JSON APIs upload files as base64-encoded strings. This requires a change to GitLab Workhorse,
+   which is tracked [in this issue](https://gitlab.com/gitlab-org/gitlab/-/issues/325068).
+
+## Uploading technologies
+
+By uploading technologies we mean how all the involved services interact with each other.
+
+GitLab supports 3 kinds of uploading technologies. A brief description with a sequence diagram follows for each one. Diagrams are not meant to be exhaustive.
+
+### Rack Multipart upload
+
+This is the default kind of upload, and it's the most expensive in terms of resources.
+
+In this case, Workhorse is unaware of files being uploaded and acts as a regular proxy.
+
+When a multipart request reaches the rails application, `Rack::Multipart` leaves behind temporary files in `/tmp` and uses valuable Ruby process time to copy files around.
+
+```mermaid
+sequenceDiagram
+    participant c as Client
+    participant w as Workhorse
+    participant r as Rails
+
+    activate c
+    c ->>+w: POST /some/url/upload
+    w->>+r:  POST /some/url/upload
+
+    r->>r: save the incoming file on /tmp
+    r->>r: read the file for processing
+
+    r-->>-c: request result
+    deactivate c
+    deactivate w
+```
+
+### Disk buffered upload
+
+This kind of upload avoids wasting resources caused by handling upload writes to `/tmp` in rails.
+
+This optimization is not active by default on REST API requests.
+
+When enabled, Workhorse looks for files in multipart MIME requests, uploading
+any it finds to a temporary file on shared storage. The MIME data in the request
+is replaced with the path to the corresponding file before it is forwarded to
+Rails.
+
+To prevent abuse of this feature, Workhorse signs the modified request with a
+special header, stating which entries it modified. Rails ignores any
+unsigned path entries.
+
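The signing step can be illustrated with a minimal sketch. It assumes a shared secret and an HMAC over the list of rewritten field names; the secret, the method names, and the payload layout are all hypothetical and are not taken from Workhorse or Rails.

```ruby
require 'openssl'
require 'json'

# Hypothetical shared secret; a real deployment would load it from configuration.
SECRET = 'example-shared-secret'

# Proxy side: sign the names of the fields whose content was replaced by a path.
def sign_rewritten_fields(fields)
  payload = JSON.generate(fields.sort)
  signature = OpenSSL::HMAC.hexdigest('SHA256', SECRET, payload)
  { payload: payload, signature: signature }
end

# Application side: keep only the path entries the proxy vouched for.
# An invalid signature means no path entry is trusted at all.
def trusted_paths(params, header)
  expected = OpenSSL::HMAC.hexdigest('SHA256', SECRET, header[:payload])
  return {} unless OpenSSL.secure_compare(expected, header[:signature])

  signed = JSON.parse(header[:payload])
  params.select { |name, _path| signed.include?(name) }
end
```

With this shape, a client that injects its own path entry into the request gains nothing: the entry is absent from the signed list, so the application side drops it.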
+```mermaid
+sequenceDiagram
+    participant c as Client
+    participant w as Workhorse
+    participant r as Rails
+    participant s as NFS
+
+    activate c
+    c ->>+w: POST /some/url/upload
+
+    w->>+s: save the incoming file on a temporary location
+    s-->>-w: request result
+
+    w->>+r:  POST /some/url/upload
+    Note over w,r: file was replaced with its location<br>and other metadata
+
+    opt requires async processing
+      r->>+redis: schedule a job
+      redis-->>-r: job is scheduled
+    end
+
+    r-->>-c: request result
+    deactivate c
+    w->>-w: cleanup
+
+    opt requires async processing
+      activate sidekiq
+      sidekiq->>+redis: fetch a job
+      redis-->>-sidekiq: job
+
+      sidekiq->>+s: read file
+      s-->>-sidekiq: file
+
+      sidekiq->>sidekiq: process file
+
+      deactivate sidekiq
+    end
+```
+
+### Direct upload
+
+This is the most advanced acceleration technique we have in place.
+
+Workhorse asks Rails for temporary pre-signed object storage URLs and directly uploads to object storage.
+
+In this setup, an extra Rails route must be implemented in order to handle authorization. Examples of this can be found in:
+
+- [`Projects::LfsStorageController`](https://gitlab.com/gitlab-org/gitlab/-/blob/cc723071ad337573e0360a879cbf99bc4fb7adb9/app/controllers/projects/lfs_storage_controller.rb)
+  and [its routes](https://gitlab.com/gitlab-org/gitlab/-/blob/cc723071ad337573e0360a879cbf99bc4fb7adb9/config/routes/git_http.rb#L31-32).
+- [API endpoints for uploading packages](../packages.md#file-uploads).
+
+Direct upload falls back to _disk buffered upload_ when `direct_upload` is disabled inside the [object storage setting](../../administration/uploads.md#object-storage-settings).
+The answer to the `/authorize` call contains only a file system path.
+
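The fallback described above can be sketched as a function that builds the body of the `/authorize` answer. The method name and the keys `remote_object`, `store_url`, and `temp_path` are hypothetical placeholders chosen for illustration, not the actual response format.

```ruby
# Hypothetical helper that builds the body of an /authorize answer.
# With direct upload enabled it returns a pre-signed object storage URL;
# otherwise it returns only a file system path (disk buffered upload).
def authorize_response(direct_upload:, temp_path:, presigned_url: nil)
  if direct_upload && presigned_url
    { remote_object: { store_url: presigned_url } }
  else
    { temp_path: temp_path }
  end
end
```

Calling it with `direct_upload: false` yields a hash containing only `temp_path`, which mirrors the statement that the answer then contains only a file system path.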
+```mermaid
+sequenceDiagram
+    participant c as Client
+    participant w as Workhorse
+    participant r as Rails
+    participant os as Object Storage
+
+    activate c
+    c ->>+w: POST /some/url/upload
+
+    w ->>+r: POST /some/url/upload/authorize
+    Note over w,r: this request has an empty body
+    r-->>-w: presigned OS URL
+
+    w->>+os: PUT file
+    Note over w,os: file is stored in a temporary location. Rails selects the destination
+    os-->>-w: request result
+
+    w->>+r:  POST /some/url/upload
+    Note over w,r: file was replaced with its location<br>and other metadata
+
+    r->>+os: move object to final destination
+    os-->>-r: request result
+
+    opt requires async processing
+      r->>+redis: schedule a job
+      redis-->>-r: job is scheduled
+    end
+
+    r-->>-c: request result
+    deactivate c
+    w->>-w: cleanup
+
+    opt requires async processing
+      activate sidekiq
+      sidekiq->>+redis: fetch a job
+      redis-->>-sidekiq: job
+
+      sidekiq->>+os: get object
+      os-->>-sidekiq: file
+
+      sidekiq->>sidekiq: process file
+
+      deactivate sidekiq
+    end
+```
--- a/doc/development/uploads/index.md
+++ b/doc/development/uploads/index.md
+---
+stage: none
+group: unassigned
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Uploads development guide
+
+Uploads are an integral part of many GitLab features. To understand how GitLab handles uploads, refer to
+the following pages:
+
+- [Why GitLab uses custom upload logic.](background.md)
+- [How uploads work technically.](implementation.md)
+- [How to add new uploads.](working_with_uploads.md)
--- a/doc/development/uploads/working_with_uploads.md
+++ b/doc/development/uploads/working_with_uploads.md
--- a/rubocop/cop/gitlab/avoid_uploaded_file_from_params.rb
+++ b/rubocop/cop/gitlab/avoid_uploaded_file_from_params.rb
@@ -4,7 +4,7 @@ module RuboCop
  module Cop
    module Gitlab
      # This cop checks for `UploadedFile.from_params` usage.
-      # See https://docs.gitlab.com/ee/development/uploads.html#how-to-add-a-new-upload-route
+      # See https://docs.gitlab.com/ee/development/uploads/working_with_uploads.html
      #
      # @example
      #
@@ -34,7 +34,7 @@ module RuboCop
      #     end
      #   end
      class AvoidUploadedFileFromParams < RuboCop::Cop::Cop
-        MSG = 'Use the `UploadedFile` set by `multipart.rb` instead of calling `UploadedFile.from_params` directly. See https://docs.gitlab.com/ee/development/uploads.html#how-to-add-a-new-upload-route'
+        MSG = 'Use the `UploadedFile` set by `multipart.rb` instead of calling `UploadedFile.from_params` directly. See https://docs.gitlab.com/ee/development/uploads/working_with_uploads.html'

        def_node_matcher :calling_uploaded_file_from_params?, <<~PATTERN
          (send (const nil? :UploadedFile) :from_params ...)