add documentation

Commit e4e45f43 authored Jun 19, 2018 by Micaël Bergeron

Showing with 107 additions and 0 deletions

doc/administration/pseudonymizer.md +55 -0

doc/raketasks/pseudonymizer.md +52 -0

--- a/doc/administration/pseudonymizer.md
+++ b/doc/administration/pseudonymizer.md
+# Pseudonymizer
+
+## Object Storage Settings
+
+**In Omnibus installations:**
+
+1. Edit `/etc/gitlab/gitlab.rb` and add the following lines, replacing the
+   values with your own:
+
+    ```ruby
+    gitlab_rails['pseudonymizer_enabled'] = true
+    gitlab_rails['pseudonymizer_manifest'] = 'lib/pseudonymizer/manifest.yml'
+    gitlab_rails['pseudonymizer_upload_remote_directory'] = 'gitlab-elt'
+    gitlab_rails['pseudonymizer_upload_connection'] = {
+      'provider' => 'AWS',
+      'region' => 'eu-central-1',
+      'aws_access_key_id' => 'AWS_ACCESS_KEY_ID',
+      'aws_secret_access_key' => 'AWS_SECRET_ACCESS_KEY'
+    }
+    ```
+
+>**Note:**
+If you are using AWS IAM profiles, be sure to omit the AWS access key and secret access key/value pairs.
+
+    ```ruby
+    gitlab_rails['pseudonymizer_upload_connection'] = {
+      'provider' => 'AWS',
+      'region' => 'eu-central-1',
+      'use_iam_profile' => true
+    }
+    ```
+
+1. Save the file and [reconfigure GitLab][] for the changes to take effect.
+
+---
+
+**In installations from source:**
+
+1. Edit `/home/git/gitlab/config/gitlab.yml` and add or amend the following
+   lines:
+
+    ```yaml
+    pseudonymizer:
+      enabled: true
+      manifest: lib/pseudonymizer/manifest.yml
+      upload:
+        remote_directory: 'gitlab-elt' # The bucket name
+        connection:
+          provider: AWS # Only AWS is supported at the moment
+          aws_access_key_id: AWS_ACCESS_KEY_ID
+          aws_secret_access_key: AWS_SECRET_ACCESS_KEY
+          region: eu-central-1
+    ```
+
+1. Save the file and [restart GitLab][] for the changes to take effect.
--- a/doc/raketasks/pseudonymizer.md
+++ b/doc/raketasks/pseudonymizer.md
+# Pseudonymizer
+
+> [Introduced](https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/5532) in [GitLab Enterprise Edition][ee] 11.1
+
+## Export GitLab's data for safe analytics
+
+Because GitLab's database hosts sensitive information, using it unfiltered for analytics implies high security requirements. To help alleviate this constraint, the Pseudonymizer service exports GitLab's data in a pseudonymized way.
+
+### Pseudonymization
+
+> **Note:**
+> This process is not impervious: if the source data is available, a user can correlate it with the pseudonymized version.
+
+The Pseudonymizer currently uses `HMAC(SHA256)` to mutate fields that should not be exported as plain text. This should ensure that:
+
+  - End users of the data source cannot infer or revert the pseudonymized fields
+  - Referential integrity is maintained
+
+### Manifest
+
+The manifest is a file that describes which fields should be included or pseudonymized.
+
+You may find this manifest at `lib/pseudonymizer/manifest.yml`. 
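
As a rough sketch of how such a manifest could drive an export: the layout used below (a `tables` map with a `whitelist` list and a `pseudo` list per table) and the `export_row` helper are assumptions for illustration only. Check `lib/pseudonymizer/manifest.yml` for the real structure.

```ruby
require 'openssl'
require 'yaml'

# Assumed manifest layout, for illustration only.
MANIFEST = YAML.safe_load(<<~DOC)
  tables:
    users:
      whitelist: [id, state, email]
      pseudo: [id, email]
DOC

# Hypothetical helper: keeps only the whitelisted columns of a row and replaces
# the columns marked as pseudonymized with an HMAC-SHA256 hex digest.
def export_row(table, row, secret, manifest = MANIFEST)
  rules = manifest.fetch('tables').fetch(table)
  rules.fetch('whitelist').each_with_object({}) do |column, out|
    value = row[column]
    if rules.fetch('pseudo').include?(column)
      value = OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new('SHA256'), secret, value.to_s)
    end
    out[column] = value
  end
end

row = {
  'id' => 1,
  'state' => 'active',
  'email' => 'alice@example.com',
  'encrypted_password' => 'never exported'
}
p export_row('users', row, 'not-a-real-secret')
```

In this sketch, a column missing from the whitelist is dropped entirely, while a whitelisted column that is not listed under `pseudo` is exported unchanged.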
+
+### Usage
+
+> **Note:**
+> You can configure the pseudonymizer using the following environment variables:
+>
+>   - `PSEUDONYMIZER_OUTPUT_DIR`: where to store the output CSV files (default: `/tmp`)
+>   - `PSEUDONYMIZER_BATCH`: the batch size when querying the DB (default: `100 000`)
+
+> **Note:**
+> An object store is required for the pseudonymizer to work properly.
+
+```bash
+bundle exec rake gitlab:db:pseudonymizer
+```
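
The two environment variables and their documented defaults could be read along these lines. The `pseudonymizer_settings` helper is hypothetical and is not the rake task's own code; it also assumes the `100 000` default is the integer `100000`.

```ruby
# Hypothetical helper mirroring the documented defaults.
def pseudonymizer_settings(env = ENV)
  {
    # Directory that receives the output CSV files.
    output_dir: env.fetch('PSEUDONYMIZER_OUTPUT_DIR', '/tmp'),
    # Rows fetched per database query; Integer() raises on non-numeric input.
    batch_size: Integer(env.fetch('PSEUDONYMIZER_BATCH', '100000'))
  }
end

p pseudonymizer_settings({})
p pseudonymizer_settings({ 'PSEUDONYMIZER_OUTPUT_DIR' => '/var/opt/pseudo',
                           'PSEUDONYMIZER_BATCH' => '5000' })
```

On the command line, the variables would be set in front of the task, for example `PSEUDONYMIZER_OUTPUT_DIR=/var/opt/pseudo bundle exec rake gitlab:db:pseudonymizer`, where the directory is only an example path.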
+
+### Output
+
+> **Note:**
+> The output CSV files might be very large. Make sure the `PSEUDONYMIZER_OUTPUT_DIR` has sufficient space. As a rule of thumb, at least 10% of the database size is recommended.
+
+After the pseudonymizer has run, the output CSV files should be uploaded to the configured object store.
+
+### Configuration
+
+See [administration].
+
+[ee]: https://about.gitlab.com/products/
+[administration]: ../administration/pseudonymizer.md