Commit ecaa80d2 authored by Ophélie Gagnard

WIP: Rewrite the README. (The old one is kept for now.)

parent bdb34f61
# metadata-collect-agent

## Compile from source

### Without dracut

Get the executables:

```
make no-dracut
```

At this stage, the useful files are in bin/ and lib/; flb.conf is also useful.

Install (remember to set the $DESTDIR and $PREFIX variables if you want to; defaults: DESTDIR="", PREFIX="/usr/local"):

```
make install-no-dracut
```
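For example, to stage the installation under a temporary root (assuming the Makefile honors standard command-line variable overrides; the paths here are only illustrative):

```
make install-no-dracut DESTDIR=/tmp/stage PREFIX=/usr
```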
### With dracut

#### Without signing the image
# metadata-collect-agent
In the context of the project [GNU/Linux System files on-boot Tamper Detection System](https://www.erp5.com/group_section/forum/GNU-Linux-System-files-on-boot-Tamper-Detection-System-94DGdYfmx1), we need to create an agent that runs inside an initramfs and reports as much metadata as is useful while keeping system boot times acceptable. It must then report that metadata to a remote service for later analysis.
## Current performance properties
- Reads file system metadata from the main thread (stat, xattrs (SELinux, ...), POSIX ACLs)
- Reads files and hashes them with md5, sha1, sha256 and sha512 across multiple processes (as many as there are cores) using the `multiprocessing` Python module (see the sketch below)
- Successfully maximizes disk I/O utilization: the disk, not the Python code's performance, is the bottleneck (a good sign)
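As a rough illustration of that multi-process hashing scheme, here is a minimal sketch; it is not the agent's actual code, and the directory scanned, the chunk size, and the `hash_file` helper are all hypothetical:

```python
import hashlib
import multiprocessing
import os

ALGORITHMS = ("md5", "sha1", "sha256", "sha512")

def hash_file(path):
    """Read one file in chunks and digest it with all four algorithms."""
    hashers = {name: hashlib.new(name) for name in ALGORITHMS}
    try:
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
                for h in hashers.values():
                    h.update(chunk)
    except OSError:
        return path, None  # unreadable file: report no digests
    return path, {name: h.hexdigest() for name, h in hashers.items()}

if __name__ == "__main__":
    paths = [e.path for e in os.scandir("/usr/bin")
             if e.is_file(follow_symlinks=False)]
    # One worker per CPU core, as described above.
    with multiprocessing.Pool(processes=os.cpu_count()) as pool:
        for path, digests in pool.imap_unordered(hash_file, paths):
            print(path, digests and digests["sha256"])
```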
Tested on a laptop with:
- 3.2 GB/s read NVMe SSD
- Intel(R) Core(TM) i7-1065G7 CPU (4 cores, 8 threads) @ 1.30 GHz min / 3.90 GHz max
- ~2 GHz per thread on average under full multithreaded load, due to heat and suboptimal laptop thermals (Dell XPS 13 2020)
- For ~1 million files on EXT4 over LUKS+LVM and ~140 GB of occupied disk space:
```
real 6m11.532s
user 31m7.676s
sys 3m27.251s
```
That is 6 minutes and 12 seconds of real-world time.
This will hardly get any better, because the disk is the bottleneck: CPU usage is not saturated, but disk I/O utilization is, peaking at 500 MB/s of reads under these test conditions. (The SSD's 3.2 GB/s figure is for sequential reads, i.e. optimal conditions.)
It can, and probably will, be faster on performant servers with fewer files, less occupied disk space, more CPU cores, and a similar disk.
## Desired performance properties
- Reduce memory usage
  - Avoid storing all the collected data in memory at the same time
    - Encode and output JSON as the program runs (incompatible with a tree-like data structure such as the current one)
    - Discard data after output so that memory usage can be deterministic
- Beware of stack overflows
  - The file system traversal function is currently recursive; Python does not perform tail-call optimization, so the traversal could in principle overflow the stack. But because file system paths are limited in length (is that always true? is it file-system specific?), it is unlikely ever to do so. A non-recursive alternative is sketched after this list.
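A minimal sketch of what these two changes could look like together, assuming an iterative os.scandir()-based walk with an explicit stack and newline-delimited JSON on stdout; the record fields and the `walk_and_emit` helper are hypothetical, not the agent's actual format:

```python
import json
import os
import sys

def walk_and_emit(root):
    """Iteratively traverse the tree (explicit stack, no recursion) and
    stream one JSON object per entry, discarding it right after output."""
    stack = [root]
    while stack:
        directory = stack.pop()
        try:
            entries = os.scandir(directory)
        except OSError:
            continue  # unreadable directory: skip it
        with entries:
            for entry in entries:
                try:
                    st = entry.stat(follow_symlinks=False)
                except OSError:
                    continue
                record = {  # hypothetical record layout
                    "path": entry.path,
                    "size": st.st_size,
                    "mode": st.st_mode,
                    "mtime": st.st_mtime,
                }
                # Encode and write as we go: memory stays bounded by one record.
                sys.stdout.write(json.dumps(record) + "\n")
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)

if __name__ == "__main__":
    walk_and_emit(sys.argv[1] if len(sys.argv) > 1 else "/")
```

Because the pending directories live in a Python list rather than on the call stack, traversal depth is bounded by available memory, not by the interpreter's recursion limit.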