updated smaller collections for manual
11
collections/developers/internals/zos/internals/boot.md
Normal file
@@ -0,0 +1,11 @@
# Services Boot Sequence

Here is the dependency graph of all the services started by 0-OS:



## Pseudo boot steps

Both `node-ready` and `boot` are not actual services; instead, they define a `boot stage`. For example, once the `node-ready` service is (ready), it means all crucial system services defined by 0-initramfs are running.

The `boot` service is similar, but it guarantees that some 0-OS services (for example `storaged`) are running before other services that depend on them, like `flistd` which requires `storaged`, are started.
89
collections/developers/internals/zos/internals/capacity.md
Normal file
@@ -0,0 +1,89 @@
|
||||
<h1>Capacity</h1>
|
||||
|
||||
<h2> Table of Contents </h2>
|
||||
|
||||
- [Introduction](#introduction)
|
||||
- [System reserved capacity](#system-reserved-capacity)
|
||||
- [Reserved Memory](#reserved-memory)
|
||||
- [Reserved Storage](#reserved-storage)
|
||||
- [User Capacity](#user-capacity)
|
||||
- [Memory](#memory)
|
||||
- [Storage](#storage)
|
||||
|
||||
***
|
||||
|
||||
## Introduction

This document describes how ZOS handles the following tasks:

- Reserving system resources
  - Memory
  - Storage
- Calculating the free usable capacity for user workloads
## System reserved capacity

ZOS always reserves some amount of the available physical resources for its own operation. The system tries to be as protective
as possible of its critical services to make sure that the node is always reachable and usable even if it is under heavy load.

ZOS makes sure it reserves Memory and Storage (but not CPU) as follows:

### Reserved Memory

ZOS reserves 10% of the available system memory for basic services AND operation overhead. The operation overhead can happen as a side effect of running user workloads. For example, a user network, while in theory it does not consume any memory, does in fact consume some memory (kernel buffers, etc.). The same goes for a VM: a user VM can be assigned, say, 5G, but the process that runs the VM can and will take a few extra megabytes to operate.

This is why we decided to play it safe and reserve 10% of total system memory for system overhead, with a **minimum** reserved memory of 2GB:

```python
# Reserve 10% of total memory, but never less than 2 GB.
reserved_gb = max(total_gb * 0.1, 2)
```
### Reserved Storage

While ZOS does not require installation, it needs to download and store many things to operate correctly. This includes the following:

- Node identity. Information about the node id and keys.
- The system binaries. These include everything ZOS needs to join the grid and operate as expected.
- Workload flists. These are the flists of the user workloads. They are downloaded on demand, so they don't always exist.
- State information. Tracking information maintained by ZOS to track the state of workloads, ownership, and more.

This is why, on first start, the system allocates and reserves a part of the available SSD storage, called `zos-cache`. It is initially `5G` (it was 100G in older versions), but because of the `dynamic` nature of the cache it can't stay fixed at `5G`.

The space the system needs to reserve can change dramatically based on the amount of workloads running on the system. For example, if many users are running many different VMs, the system will need to download (and cache) different VM images, hence requiring more cache.

This is why the system periodically checks the reserved storage and then dynamically expands or shrinks it to a more suitable value, in increments of 5G. Expansion happens around the 20% mark of the current cache size, and shrinking happens if usage goes below 20%.
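To make the resizing behaviour concrete, here is a minimal sketch of that grow/shrink decision. It is illustrative only: the function name, the exact thresholds, and the way usage is measured are assumptions based on the description above, not the actual ZOS implementation.

```go
// nextCacheSize is an illustrative sketch of the grow/shrink decision described
// above. The name and exact thresholds are assumptions, not ZOS code.
const gib = uint64(1024 * 1024 * 1024)
const step = 5 * gib // the cache is resized in 5G increments

func nextCacheSize(current, used uint64) uint64 {
	free := current - used
	switch {
	case free < current/5: // less than ~20% of the cache is free: grow
		return current + step
	case used < current/5 && current > step: // usage dropped below ~20%: shrink
		return current - step
	default:
		return current
	}
}
```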
## User Capacity

All workloads require some resources to run, and that is what the user actually pays for. A workload can consume resources of one or more of the following types:

- CU (compute unit in vCPU)
- MU (memory unit in bytes)
- NU (network unit in bytes)
- SU (ssd storage in bytes)
- HU (hdd storage in bytes)

A workload, based on its type, can consume one or more of those resource types. Some workloads have a well known "size" on creation, others might be dynamic and won't be known until later.

For example, the SU consumption of a disk workload is known ahead of time, unlike the NU used by a network, which is only known after usage over a certain period of time.

A single deployment can have multiple workloads, each requiring a certain amount of one or more capacity types (listed above). For each workload type, ZOS computes the amount of resources needed per workload and then checks whether it can provide this amount of capacity.

> This means that a deployment that defines 2 VMs can partially succeed, deploying one of the VMs but not the other, if the amount of resources requested is higher than what the node can provide.
### Memory

The system decides whether there is enough memory to run a workload that demands MU resources as follows (see the sketch after this list):

- Compute the "theoretically used" memory by all user workloads excluding `self`. This is basically the sum of the consumed MU units of all active workloads (as defined by their corresponding deployments, not as actually used on the system).
- The theoretically used memory is topped up with the system reserved memory.
- Then the system checks the actually used memory on the system. This is done simply by computing `actual_used = memory.total - memory.available`.
- The system can now `assume` an accurate used memory value by taking `used = max(actual_used, theoretically_used)`.
- Then `available = total - used`.
- Finally, it checks that the `available` memory is enough to hold the requested workload memory.
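The following is a minimal sketch of that admission check, written here only to make the steps concrete; the function and variable names are illustrative and not taken from the ZOS codebase.

```go
// canFitMemory sketches the memory check described in the list above.
// All names are hypothetical; only the formula follows the listed steps.
func canFitMemory(total, available, reserved, workloadsMU, requestedMU uint64) bool {
	theoreticallyUsed := workloadsMU + reserved // MU of active workloads + system reserved memory
	actualUsed := total - available             // as reported by the kernel
	used := actualUsed
	if theoreticallyUsed > used { // be pessimistic: take the larger estimate
		used = theoreticallyUsed
	}
	return total-used >= requestedMU
}
```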
### Storage

Storage is much simpler to allocate than memory. It is completely left to the storage subsystem to find out whether it can fit the requested storage on the available physical disks; if it cannot, the workload is marked as errored.

Storage tries to find the requested space based on the type (SU or HU), then finds the optimal way to fit it on the available disks, or spins up a new disk if needed.
@@ -0,0 +1,14 @@
# Compatibility list

This document tracks all the hardware that has been tested, the issues encountered, and possible workarounds.

**Legend**
✅ : fully supported
⚠️ : supported with some tweaking
🛑 : not supported

| Vendor | Hardware | Support | Issues | Workaround |
| --- | --- | --- | --- | --- |
| Supermicro | SYS-5038ML-H8TRF | ✅ | | |
| Gigabyte Technology Co | AB350N-Gaming WIFI | ✅ | | |
@@ -0,0 +1,106 @@
|
||||
<h1>Container Module</h1>
|
||||
|
||||
<h2> Table of Contents </h2>
|
||||
|
||||
- [ZBus](#zbus)
|
||||
- [Home Directory](#home-directory)
|
||||
- [Introduction](#introduction)
|
||||
- [zinit unit](#zinit-unit)
|
||||
- [Interface](#interface)
|
||||
|
||||
***
|
||||
|
||||
## ZBus

The container module is available on zbus over the following channel:

| module | object | version |
|--------|--------|---------|
| container|[container](#interface)| 0.0.1|

## Home Directory

contd keeps some data in the following locations:

| directory | path|
|----|---|
| root| `/var/cache/modules/containerd`|

## Introduction

The container module is a proxy to [containerd](https://github.com/containerd/containerd). The proxy provides integration with zbus.

The implementation at the moment is straightforward: it includes preparing the OCI spec for the container and the tenant containerd namespace,
setting up the proper capabilities, and finally creating the container instance on `containerd`.

The module is fully stateless; all container information is queried at runtime from `containerd`.
### zinit unit

`contd` must run after containerd is running and the node boot process is complete. Since it doesn't keep state, no dependency on `storaged` is needed.
```yaml
|
||||
exec: contd -broker unix:///var/run/redis.sock -root /var/cache/modules/containerd
|
||||
after:
|
||||
- containerd
|
||||
- boot
|
||||
```
|
||||
|
||||
## Interface
|
||||
|
||||
```go
|
||||
package pkg
|
||||
|
||||
// ContainerID type
|
||||
type ContainerID string
|
||||
|
||||
// NetworkInfo defines a network configuration for a container
|
||||
type NetworkInfo struct {
|
||||
// Currently a container can only join one (and only one)
|
||||
// network namespace that has to be pre defined on the node
|
||||
// for the container tenant
|
||||
|
||||
// Containers don't need to know about anything about bridges,
|
||||
// IPs, wireguards since this is all is only known by the network
|
||||
// resource which is out of the scope of this module
|
||||
Namespace string
|
||||
}
|
||||
|
||||
// MountInfo defines a mount point
|
||||
type MountInfo struct {
|
||||
Source string // source of the mount point on the host
|
||||
Target string // target of mount inside the container
|
||||
Type string // mount type
|
||||
Options []string // mount options
|
||||
}
|
||||
|
||||
//Container creation info
|
||||
type Container struct {
|
||||
// Name of container
|
||||
Name string
|
||||
// path to the rootfs of the container
|
||||
RootFS string
|
||||
// Env env variables to container in format {'KEY=VALUE', 'KEY2=VALUE2'}
|
||||
Env []string
|
||||
// Network network info for container
|
||||
Network NetworkInfo
|
||||
// Mounts extra mounts for container
|
||||
Mounts []MountInfo
|
||||
// Entrypoint the process to start inside the container
|
||||
Entrypoint string
|
||||
// Interactivity enable Core X as PID 1 on the container
|
||||
Interactive bool
|
||||
}
|
||||
|
||||
// ContainerModule defines rpc interface to containerd
|
||||
type ContainerModule interface {
|
||||
// Run creates and starts a container on the node. It also auto
|
||||
// starts command defined by `entrypoint` inside the container
|
||||
// ns: tenant namespace
|
||||
// data: Container info
|
||||
Run(ns string, data Container) (ContainerID, error)
|
||||
|
||||
// Inspect, return information about the container, given its container id
|
||||
Inspect(ns string, id ContainerID) (Container, error)
|
||||
Delete(ns string, id ContainerID) error
|
||||
}
|
||||
```
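For illustration, here is a minimal sketch of how a caller could use this interface. How the `client` value is obtained (normally through a generated zbus stub) is assumed and not shown here; the paths, names, and namespace values are examples.

```go
// Illustrative only: `client` is assumed to be an implementation of
// pkg.ContainerModule obtained through a generated zbus stub.
func startDemoContainer(client pkg.ContainerModule) (pkg.ContainerID, error) {
	container := pkg.Container{
		Name:       "demo",
		RootFS:     "/var/cache/containers/demo-rootfs", // hypothetical path to a mounted rootfs
		Env:        []string{"PATH=/usr/bin:/bin"},
		Network:    pkg.NetworkInfo{Namespace: "net-demo"}, // pre-created network namespace
		Entrypoint: "/bin/sleep 3600",
	}
	// "demo-tenant" is the tenant containerd namespace for this user.
	return client.Run("demo-tenant", container)
}
```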
|
@@ -0,0 +1,74 @@
|
||||
<h1>Flist Module</h1>
|
||||
|
||||
<h2> Table of Contents </h2>
|
||||
|
||||
- [Zbus](#zbus)
|
||||
- [Home Directory](#home-directory)
|
||||
- [Introduction](#introduction)
|
||||
- [Public interface ](#public-interface-)
|
||||
- [zinit unit](#zinit-unit)
|
||||
|
||||
***
|
||||
|
||||
## Zbus

The flist module is available on zbus over the following channel:

| module | object | version |
|--------|--------|---------|
|flist |[flist](#public-interface)| 0.0.1|

## Home Directory

flist keeps some data in the following locations:

| directory | path|
|----|---|
| root| `/var/cache/modules/containerd`|

## Introduction

This module is responsible for "mounting an flist" into the filesystem of the node. The mounted directory contains all the files required by containers or (in the future) VMs.

The flist module interface is very simple. It does not expose any way to choose where to mount the flist or have any reference to containers or VMs. The only functionality is to mount a given flist and return the location where it is mounted. It is up to the layer above to do something useful with this information.

The flist module itself doesn't contain the logic to understand the flist format or to run the fuse filesystem. It is just a wrapper that manages [0-fs](https://github.com/threefoldtech/0-fs) processes.

Its only job is to download the flist, prepare the isolation of all the data, and then start 0-fs with the proper arguments.
## Public interface [](https://godoc.org/github.com/threefoldtech/zos/pkg/flist)
|
||||
|
||||
```go
|
||||
|
||||
//Flister is the interface for the flist module
|
||||
type Flister interface {
|
||||
// Mount mounts an flist located at url using the 0-db located at storage
|
||||
// in a RO mode. note that there is no way u can unmount a ro flist because
|
||||
// it can be shared by many users, it's then up to system to decide if the
|
||||
// mount is not needed anymore and clean it up
|
||||
Mount(name, url string, opt MountOptions) (path string, err error)
|
||||
|
||||
// UpdateMountSize change the mount size
|
||||
UpdateMountSize(name string, limit gridtypes.Unit) (path string, err error)
|
||||
|
||||
// Umount a RW mount. this only unmounts the RW layer and remove the assigned
|
||||
// volume.
|
||||
Unmount(name string) error
|
||||
|
||||
// HashFromRootPath returns flist hash from a running g8ufs mounted with NamedMount
|
||||
HashFromRootPath(name string) (string, error)
|
||||
|
||||
// FlistHash returns md5 of flist if available (requesting the hub)
|
||||
FlistHash(url string) (string, error)
|
||||
|
||||
Exists(name string) (bool, error)
|
||||
}
|
||||
|
||||
```
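As an illustration, here is a minimal sketch of how a consumer of this interface might mount an flist and use the returned path. The flist URL is only an example, and `MountOptions{}` is passed as a zero value since its fields are not documented here.

```go
// Illustrative only: `flister` is assumed to be an implementation of the
// Flister interface above, obtained through a generated zbus stub.
func mountDemoFlist(flister Flister) (string, error) {
	// The name identifies this mount so it can later be passed to Unmount.
	return flister.Mount("demo-rootfs", "https://hub.grid.tf/tf-official-apps/busybox-latest.flist", MountOptions{})
}
```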
|
||||
|
||||
## zinit unit

The zinit unit file of the module specifies the command line, test command, and the order in which the services need to be booted.

The flist module depends on the storage and network modules.
This is because it needs connectivity to download flists and data, and it needs storage to be able to cache the data once downloaded.

Flist doesn't do anything special on the system except creating a bunch of directories it will use during its lifetime.
121
collections/developers/internals/zos/internals/gateway/readme.md
Normal file
@@ -0,0 +1,121 @@
|
||||
# Gateway Module
|
||||
|
||||
## ZBus
|
||||
|
||||
Gateway module is available on zbus over the following channel
|
||||
|
||||
| module | object | version |
|
||||
| ------- | --------------------- | ------- |
|
||||
| gateway | [gateway](#interface) | 0.0.1 |
|
||||
|
||||
## Home Directory
|
||||
|
||||
gateway keeps some data in the following locations
|
||||
| directory | path |
|
||||
| --------- | ---------------------------- |
|
||||
| root | `/var/cache/modules/gateway` |
|
||||
|
||||
The directory `/var/cache/modules/gateway/proxy` contains the route information used by traefik to forward traffic.
|
||||
## Introduction

The gateway module is used to register traefik routes and services to act as a reverse proxy. It's the backend supporting two kinds of workloads: `gateway-fqdn-proxy` and `gateway-name-proxy`.

For the FQDN type, it receives the domain and a list of backends in the form `http://ip:port` or `https://ip:port` and registers a route for this domain, forwarding traffic to these backends. It's a requirement that the domain resolves to the gateway public IP. The `tls_passthrough` parameter determines whether the TLS termination happens on the gateway or in the backends. When it's true, the backends must be in the form `https://ip:port` and must be https-enabled servers.

The name type is the same as the FQDN type except that the `name` parameter is added as a prefix to the gateway domain to determine the fqdn. It's forbidden to use a FQDN type workload to reserve a domain managed by the gateway.

The fqdn type is enabled only if there's a public config on the node. The name type works only if a domain exists in the public config. To make a full-fledged gateway node, these DNS records are required:
```
gatewaydomain.com                  A      ip.of.the.gateway
*.gatewaydomain.com                CNAME  gatewaydomain.com
_acme-challenge.gatewaydomain.com  NS     gatewaydomain.com
```
|
||||
|
||||
### zinit unit
|
||||
|
||||
```yaml
|
||||
exec: gateway --broker unix:///var/run/redis.sock --root /var/cache/modules/gateway
|
||||
after:
|
||||
- boot
|
||||
```
|
||||
## Implementation details

Traefik is used as the reverse proxy forwarding traffic to upstream servers. Every workload deployed on the node is associated with a domain that resolves to the node IP. In the name workload case, it's a subdomain of the gateway main domain. In the FQDN case, the user must create a DNS A record pointing it to the node IP. By default, the node redirects all http traffic to https.

When an https request reaches the node, it looks at the domain and determines the correct service that should handle the request. The service definitions are in `/var/cache/modules/gateway/proxy/` and are hot-reloaded by traefik every time a service is added to or removed from it. Zos currently supports enabling `tls_passthrough`, in which case the https request is passed as-is to the backend (at the TCP level). The default is `tls_passthrough: false`, which means the node terminates the TLS traffic and then forwards the request as http to the backend.
Example of a FQDN service definition with tls_passthrough enabled:
```yaml
|
||||
tcp:
|
||||
routers:
|
||||
37-2039-testname-route:
|
||||
rule: HostSNI(`remote.omar.grid.tf`)
|
||||
service: 37-2039-testname
|
||||
tls:
|
||||
passthrough: "true"
|
||||
services:
|
||||
37-2039-testname:
|
||||
loadbalancer:
|
||||
servers:
|
||||
- address: 137.184.106.152:443
|
||||
```
|
||||
Example of a "name" service definition with tls_passthrough disabled:
|
||||
```yaml
|
||||
http:
|
||||
routers:
|
||||
37-1976-workloadname-route:
|
||||
rule: Host(`workloadname.gent01.dev.grid.tf`)
|
||||
service: 40-1976-workloadname
|
||||
tls:
|
||||
certResolver: dnsresolver
|
||||
domains:
|
||||
- sans:
|
||||
- '*.gent01.dev.grid.tf'
|
||||
services:
|
||||
40-1976-workloadname:
|
||||
loadbalancer:
|
||||
servers:
|
||||
- url: http://[backendip]:9000
|
||||
```
|
||||
|
||||
The `certResolver` option has two valid values, `resolver` and `dnsresolver`. The `resolver` is an http resolver and is used in FQDN services with `tls_passthrough` disabled. It uses the http challenge to generate a single-domain certificate. The `dnsresolver` is used for name services with `tls_passthrough` disabled. The `dnsresolver` is responsible for generating a wildcard certificate to be used for all subdomains of the gateway domain. Its flow is described below.

The CNAME record is used to make all subdomains (reserved or not) resolve to the IP of the gateway. Generating a wildcard certificate requires adding a TXT record at `_acme-challenge.gatewaydomain.com`. The NS record is used to delegate this specific subdomain to the node. So if someone does `dig TXT _acme-challenge.gatewaydomain.com`, the query is served by the node, not by the DNS provider used for the gateway domain.

Traefik has, as a config parameter, multiple dns [providers](https://doc.traefik.io/traefik/https/acme/#providers) to communicate with when it wants to add the required TXT record. For non-supported providers, a bash script can be provided to do the record generation and cleanup (i.e. the External program option). The bash [script](https://github.com/threefoldtech/zos/blob/main/pkg/gateway/static/cert.sh) starts dnsmasq managing a dns zone for the `_acme-challenge` subdomain with the given TXT record. It then kills the dnsmasq process and removes the config file during cleanup.
## Interface
|
||||
|
||||
```go
|
||||
type Backend string
|
||||
|
||||
// GatewayFQDNProxy definition. this will proxy name.<zos.domain> to backends
|
||||
type GatewayFQDNProxy struct {
|
||||
// FQDN the fully qualified domain name to use (cannot be present with Name)
|
||||
FQDN string `json:"fqdn"`
|
||||
|
||||
// Passthroug whether to pass tls traffic or not
|
||||
TLSPassthrough bool `json:"tls_passthrough"`
|
||||
|
||||
// Backends are list of backend ips
|
||||
Backends []Backend `json:"backends"`
|
||||
}
|
||||
|
||||
|
||||
// GatewayNameProxy definition. this will proxy name.<zos.domain> to backends
|
||||
type GatewayNameProxy struct {
|
||||
// Name the fully qualified domain name to use (cannot be present with Name)
|
||||
Name string `json:"name"`
|
||||
|
||||
// Passthroug whether to pass tls traffic or not
|
||||
TLSPassthrough bool `json:"tls_passthrough"`
|
||||
|
||||
// Backends are list of backend ips
|
||||
Backends []Backend `json:"backends"`
|
||||
}
|
||||
|
||||
type Gateway interface {
|
||||
SetNamedProxy(wlID string, prefix string, backends []string, TLSPassthrough bool) (string, error)
|
||||
SetFQDNProxy(wlID string, fqdn string, backends []string, TLSPassthrough bool) error
|
||||
DeleteNamedProxy(wlID string) error
|
||||
Metrics() (GatewayMetrics, error)
|
||||
}
|
||||
```
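To illustrate how this interface is meant to be consumed, here is a minimal sketch of registering a name proxy. The `gw` value is assumed to come from a generated zbus stub; the workload ID, prefix, and backend address are example values.

```go
// Illustrative only: gw is assumed to implement the Gateway interface above.
func registerExampleProxy(gw Gateway) error {
	backends := []string{"http://10.20.30.40:8080"} // example backend
	// The workload ID format mirrors the service names shown above; "testname" is the name prefix.
	fqdn, err := gw.SetNamedProxy("37-2039-testname", "testname", backends, false)
	if err != nil {
		return err
	}
	_ = fqdn // the full domain the gateway now serves, e.g. testname.<gateway domain>
	return nil
}
```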
|
@@ -0,0 +1,99 @@
|
||||
# 0-OS, a bit of history and introduction to Version 2
|
||||
|
||||
## Once upon a time
|
||||
----
|
||||
A few years ago, we were trying to come up with some solutions to the problem of self-healing IT.
|
||||
We boldly stated that the current model of cloud computing in huge data-centers is not going to be able to scale to fit the demand in IT capacity.

The approach we took to solve this problem was to enable localized compute and storage units at the edge of the network, close to where it is needed.
That basically meant that if we were to deploy physical hardware to the edges, nearby the users, we would have to allow information providers to deploy their solutions on that edge network and hardware. That also means sharing hardware resources between users, where we would have to make damn sure no one can peek around in things that are not theirs.
|
||||
|
||||
When we talk about sharing capacity in a secure environment, virtualization comes to mind. It's not a new technology and it has been around for quite some time. This solution comes with a cost though. Virtual machines, emulating a full hardware platform on real hardware is costly in terms of used resources, and eat away at the already scarce resources we want to provide for our users.
|
||||
|
||||
Containerizing technologies were starting to get some hype at the time. Containers provide for basically the same level of isolation as Full Virtualisation, but are a lot less expensive in terms of resource utilization.
|
||||
|
||||
With that in mind, we started designing the first version of 0-OS. The required features were:
|
||||
|
||||
- be able to be fully in control of the hardware
|
||||
- give the possibility to different users to share the same hardware
|
||||
- deploy this capacity at the edge, close to where it is needed
|
||||
- the system needs to self-heal. Because of their location and sheer scale, manual maintenance was not an option. Self-healing is a broad topic, and will require a lot of experience and fine-tuning, but it was meant to culminate at some point in most of the actions that sysadmins execute being automated.
- have as small an attack surface as possible, both against remote types of attack and to protect users from each other

That thought process resulted in 0-OS v1: a linux kernel with the minimal components on top needed to provide these features.

In the first incarnation of 0-OS, the core framework was a single big binary that got started as the first process of the system (PID 1). All the management features were exposed through an API that was only accessible locally.
|
||||
|
||||
The idea was to have an orchestration system running on top that was going to be responsible for deploying Virtual Machines and Containers on the system using that API.
|
||||
|
||||
This API exposes 3 main primitives:
|
||||
|
||||
- networking: zerotier, vlan, macvlan, bridge, openvswitch...
|
||||
- storage: plain disk, 0-db, ...
|
||||
- compute: VM, containers
|
||||
|
||||
That was all great and it allowed us to learn a lot. But some limitations started to appear. Here is a non-exhaustive list of the limitations we had to face after a couple of years of utilization:
|
||||
|
||||
- Difficulty to push new versions and fixes on the nodes. The fact that 0-OS was a single process running as PID 1, forced us to completely reboot the node every time we wanted to push an update.
|
||||
- The API, while powerful, still required some logic on top to actually deploy usable solutions.
- We noticed that some features we implemented were never or extremely rarely used. This was just increasing the possible attack surface for no real benefit.
- The main networking solution we chose at the time, zerotier, was not scaling as well as we had hoped.
- We wrote a lot of code ourselves instead of relying on existing open source libraries that would have made the task a lot easier. Those libraries are also a lot more mature and have had a lot more exposure for ironing out possible bugs and vulnerabilities than we could achieve ourselves with the little resources we have at hand.
|
||||
|
||||
## Now what ?
|
||||
With the knowledge and lessons gathered during these first years of usage, we
|
||||
concluded that trying to fix the already existing codebase would be cumbersome
|
||||
and we also wanted to avoid any technical debt that could haunt us for years
|
||||
after. So we decided for a complete rewrite of that stack, taking a new and
|
||||
fully modular approach, where every component could be easily replaced and
|
||||
upgraded without the need for a reboot.
|
||||
|
||||
Hence Version 2 saw the light of day.
|
||||
|
||||
Instead of trial and error, and muddling along trying to fit new features in
|
||||
that big monolithic codebase, we wanted to be sure that the components were
|
||||
reduced to a more manageable size, having a clearly cut Domain Separation.
|
||||
|
||||
Instead of creating solutions waiting for a problem, we started looking at things the other way around. Which is logical, as by now, we learned what the real puzzles to solve were, albeit sometimes by painful experience.
|
||||
|
||||
## Tadaa!
|
||||
----
|
||||
The [first commit](https://github.com/threefoldtech/zosv2/commit/7b783c888673d1e9bc400e4abbb17272e995f5a4) of the v2 repository took place on the 11th of February 2019.
We are now 6 months in, and about to bake the first release of 0-OS v2.
Clocking in at almost 27KLoc, it was a very busy half-year. (Admittedly, the specs and docs are also in that count ;-) )
|
||||
|
||||
Let's go over the main design decisions that were made and explain briefly each component.
|
||||
|
||||
While this is just an introduction, we'll add more articles digging deeper in the technicalities and approaches of each component.
|
||||
|
||||
## Solutions to puzzles (there are no problems)
|
||||
----
|
||||
**UPDATES**
|
||||
|
||||
One of the first puzzles we wanted to solve was the difficulty to push upgrades.
|
||||
In order to solve that, we designed the 0-OS components as completely stand-alone modules. Each subsystem, be it storage, networking, or containers/VMs, is managed by its own component (mostly a daemon), and the components communicate with each other through a local bus. And as we said, each component can then be upgraded separately, together with the necessary data migrations that could be required.
|
||||
|
||||
**WHAT API?**
|
||||
|
||||
The second big change is our approach to the API, or better, lack thereof.
|
||||
In V2 we dropped the idea to expose the primitives of the Node over an API.
|
||||
Instead, all the required knowledge to deploy workloads is directly embedded in 0-OS.
|
||||
So in order to have the node deploy a workload, we have created a blueprint like system where the user describes what his requirements in terms of compute power, storage and networking are, and the node applies that blueprint to make it reality.
|
||||
That approach has a few advantages:
|
||||
- It greatly reduces the attack surface of the node because there is no more direct interaction between a user and a node.
|
||||
- And it also allows us to have a greater control over how things are organized in the node itself. The node being its own boss, can decide to re-organize itself whenever needed to optimize the capacity it can provide.
|
||||
- Having a blueprint with requirements gives the grid the possibility to verify that blueprint on multiple levels before applying it. That is: on the top level as well as on the node level, a blueprint can be verified for validity and signatures before any other action is executed.
|
||||
|
||||
**PING**
|
||||
|
||||
The last major change is how we want to handle networking.
|
||||
The solution used during the lifetime of V1 exposed its limitations when we started scaling our networks to hundreds of nodes.
|
||||
So here again we started from scratch and created our own overlay network solution.
|
||||
That solution is based on the 'new kid on the block' in terms of VPN: [Wireguard](https://wireguard.io), and its approach and usage will be fully explained in the next 0-OS article.
|
||||
For the eager ones of you, there are some specifications and also some documentation [here](https://github.com/threefoldtech/zosv2/tree/master/docs/network) and [there](https://github.com/threefoldtech/zosv2/tree/master/specs/network).
|
||||
|
||||
## That's All, Folks (for now)
|
||||
So much for this little article as an intro to the brave new world of 0-OS.
The Zero-OS team commits to keeping you regularly updated on its progress and the new features that will surely be added, and, for the so inclined, to adding a lot more content for techies on how to actually use that novel beast.
|
||||
|
||||
[Till next time](https://youtu.be/b9434BoGkNQ)
|
@@ -0,0 +1,143 @@
|
||||
<h1> Node ID Generation</h1>
|
||||
|
||||
<h2> Table of Contents </h2>
|
||||
|
||||
- [Introduction](#introduction)
|
||||
- [ZBus](#zbus)
|
||||
- [Home Directory](#home-directory)
|
||||
- [Introduction](#introduction-1)
|
||||
- [On Node Booting](#on-node-booting)
|
||||
- [ID generation](#id-generation)
|
||||
- [Cryptography](#cryptography)
|
||||
- [zinit unit](#zinit-unit)
|
||||
- [Interface](#interface)
|
||||
|
||||
***
|
||||
|
||||
## Introduction
|
||||
|
||||
We explain the node ID generation process.
|
||||
|
||||
## ZBus
|
||||
|
||||
Identity module is available on zbus over the following channel
|
||||
|
||||
| module | object | version |
|
||||
|--------|--------|---------|
|
||||
| identity|[manager](#interface)| 0.0.1|
|
||||
|
||||
## Home Directory
|
||||
|
||||
identity keeps some data in the following locations
|
||||
|
||||
| directory | path|
|
||||
|----|---|
|
||||
| root| `/var/cache/modules/identity`|
|
||||
|
||||
## Introduction
|
||||
|
||||
Identity manager is responsible for maintaining the node identity (public key). The manager makes sure the node has one valid ID during the entire lifetime of the node. It also provides services to sign, encrypt, and decrypt data using the node identity.

On first boot, the identity manager will generate an ID and then persist this ID for life.

Since the identity daemon is the only one that can access the node private key, it provides an interface to sign, verify, and encrypt data. These methods are available for other modules on the local node to use.
|
||||
|
||||
## On Node Booting
|
||||
|
||||
- Check if node already has a seed generated
|
||||
- If yes, load the node identity
|
||||
- If not, generate a new ID
|
||||
- Start the zbus daemon.
|
||||
|
||||
## ID generation

At this time of development, the ID generated by identityd is the base58-encoded public key of an ed25519 key pair.

The key pair itself is generated from a random seed of 32 bytes. It is this seed that is actually saved on the node, and during boot the key pair is re-generated from this seed if it exists.
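As an illustration of the scheme described above, here is a minimal sketch of deriving such an ID from a 32-byte seed. It uses the standard library `crypto/ed25519` and the `github.com/mr-tron/base58` package for encoding; the actual identityd implementation may differ in details.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"fmt"

	"github.com/mr-tron/base58"
)

func main() {
	// A random 32-byte seed; this is what identityd persists on the node.
	seed := make([]byte, ed25519.SeedSize)
	if _, err := rand.Read(seed); err != nil {
		panic(err)
	}

	// The key pair is deterministically re-derived from the seed on every boot.
	priv := ed25519.NewKeyFromSeed(seed)
	pub := priv.Public().(ed25519.PublicKey)

	// The node ID is the base58 encoding of the public key.
	fmt.Println("node id:", base58.Encode(pub))
}
```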
## Cryptography
|
||||
|
||||
The signing and encryption capabilities of the identity module rely on this ed25519 key pair.

For signing, it uses the key pair directly.
For public key encryption, the ed25519 key pair is converted to its curve25519 equivalent, which is then used to encrypt the data.
|
||||
|
||||
### zinit unit

The zinit unit file of the module specifies the command line, test command, and the order in which the services need to be booted.

`identityd` requires `storaged` to make sure the seed is persisted over reboots, so that the node keeps the same ID during its full lifetime.
The identityd daemon is only considered running if the seed file exists.
|
||||
|
||||
```yaml
|
||||
exec: /bin/identityd
|
||||
test: test -e /var/cache/modules/identity/seed.txt
|
||||
after:
|
||||
- storaged
|
||||
```
|
||||
|
||||
## Interface
|
||||
|
||||
For an up-to-date interface, please check the code [here](https://github.com/threefoldtech/zos/blob/main/pkg/identity.go).
|
||||
```go
|
||||
package pkg
|
||||
|
||||
// Identifier is the interface that defines
|
||||
// how an object can be used as an identity
|
||||
type Identifier interface {
|
||||
Identity() string
|
||||
}
|
||||
|
||||
// StrIdentifier is a helper type that implement the Identifier interface
|
||||
// on top of simple string
|
||||
type StrIdentifier string
|
||||
|
||||
// Identity implements the Identifier interface
|
||||
func (s StrIdentifier) Identity() string {
|
||||
return string(s)
|
||||
}
|
||||
|
||||
// IdentityManager interface.
|
||||
type IdentityManager interface {
|
||||
// NodeID returns the node id (public key)
|
||||
NodeID() StrIdentifier
|
||||
|
||||
// NodeIDNumeric returns the node registered ID.
|
||||
NodeIDNumeric() (uint32, error)
|
||||
|
||||
// FarmID return the farm id this node is part of. this is usually a configuration
|
||||
// that the node is booted with. An error is returned if the farmer id is not configured
|
||||
FarmID() (FarmID, error)
|
||||
|
||||
// Farm returns name of the farm. Or error
|
||||
Farm() (string, error)
|
||||
|
||||
//FarmSecret get the farm secret as defined in the boot params
|
||||
FarmSecret() (string, error)
|
||||
|
||||
// Sign signs the message with privateKey and returns a signature.
|
||||
Sign(message []byte) ([]byte, error)
|
||||
|
||||
// Verify reports whether sig is a valid signature of message by publicKey.
|
||||
Verify(message, sig []byte) error
|
||||
|
||||
// Encrypt encrypts message with the public key of the node
|
||||
Encrypt(message []byte) ([]byte, error)
|
||||
|
||||
// Decrypt decrypts message with the private of the node
|
||||
Decrypt(message []byte) ([]byte, error)
|
||||
|
||||
// EncryptECDH aes encrypt msg using a shared key derived from private key of the node and public key of the other party using Elliptic curve Diffie Helman algorithm
|
||||
// the nonce if prepended to the encrypted message
|
||||
EncryptECDH(msg []byte, publicKey []byte) ([]byte, error)
|
||||
|
||||
// DecryptECDH decrypt aes encrypted msg using a shared key derived from private key of the node and public key of the other party using Elliptic curve Diffie Helman algorithm
|
||||
DecryptECDH(msg []byte, publicKey []byte) ([]byte, error)
|
||||
|
||||
// PrivateKey sends the keypair
|
||||
PrivateKey() []byte
|
||||
}
|
||||
|
||||
// FarmID is the identification of a farm
|
||||
type FarmID uint32
|
||||
```
|
@@ -0,0 +1,8 @@
|
||||
<h1> Identity Module </h1>
|
||||
|
||||
Identity daemon is responsible for two major operations that are crucial for the node operation.
|
||||
|
||||
<h2> Table of Contents </h2>
|
||||
|
||||
- [Node ID Generation](identity.md)
|
||||
- [Node Live Software Update](upgrade.md)
|
@@ -0,0 +1,98 @@
|
||||
<h1> Node Upgrade</h1>
|
||||
|
||||
<h2> Table of Contents </h2>
|
||||
|
||||
- [Introduction](#introduction)
|
||||
- [Philosophy](#philosophy)
|
||||
- [Booting a new node](#booting-a-new-node)
|
||||
- [Runtime upgrade of a node](#runtime-upgrade-of-a-node)
|
||||
- [Technical](#technical)
|
||||
- [Flist layout](#flist-layout)
|
||||
|
||||
***
|
||||
|
||||
## Introduction
|
||||
|
||||
We provide information concerning node upgrade with ZOS. We also explain the philosophy behind ZOS.
|
||||
|
||||
## Philosophy

0-OS is meant to be a black box no one can access. While this provides some nice security features, it also makes it harder to manage, especially when it comes to updates/upgrades.

Hence, zos only trusts a few sources for upgrade packages. When the node boots up it checks those sources for the latest release and makes sure all the local binaries are up-to-date before continuing the boot. The flist source must be rock-solid secured; that's another topic for a different document.

The run mode defines which flist the node is going to use to boot. The run mode can be specified by passing `runmode=<mode>` to the kernel boot params. Currently we have these run modes:

- dev: ephemeral network only set up to develop and test new features. Can be created and reset at any time.
- test: Mostly stable features that need to be tested at scale, allowing preview and testing of new features. Always the latest and greatest. This network can be reset sometimes, but should be relatively stable.
- prod: Releases of the stable version. Used to run the real grid with real money. Cannot ever be reset. Only stable and battle-tested features reach this level.
## Booting a new node
|
||||
|
||||
The base image for zos contains a very small subset of tools, plus the boot program. Standing alone, the image is not really useful. On boot and
after the initial start of the system, the boot program kicks in and does the following:

- Detect the boot flist that the node must use to fully start. The default is hard-coded into zos, but this can be overridden by the `flist=` kernel param. The `flist=` kernel param can get deprecated without a warning, since it's a development flag.
- The bootstrap will then mount this flist using 0-fs. This of course requires a working connection to the internet, hence bootstrap is configured to wait for the `internet` service.
- The flist information (name, and version) is saved under `/tmp/flist.name` and `/tmp/flist.info`.
- The bootstrap makes sure to copy all files in the flist to the proper locations under the system rootfs; this includes `zinit` config files.
- Then zinit is asked to monitor the newly installed services; zinit takes care of those services and makes sure they are properly working at all times.
- Bootstrap unmounts the flist and cleans up before it exits.
- The boot process continues.
## Runtime upgrade of a node
|
||||
|
||||
Once the node is up and running, identityd takes over and does the following:

- It loads the boot info files `/tmp/flist.name` and `/tmp/flist.info`.
- If the `flist.name` file does **not** exist, `identityd` will assume the node was booted by other means than an flist (for example an overlay). In that case, identityd will log this and disable live upgrades of the node.
- If the `flist.name` file exists, the flist is monitored on `https://hub.grid.tf` for changes. Any change in the version will initiate a live upgrade routine.
- Once a flist change is detected, identityd will mount the flist and make sure identityd itself is running the latest version. If not, identityd will update itself first before continuing.
- Services that need to be updated are gracefully stopped.
- `identityd` will then make sure to update all services and config files from the flist, and restart the services properly.
- Services are started again after all binaries have been copied.
## Technical
|
||||
|
||||
0-OS is designed to provide maximum uptime for its workloads; rebooting a node should never be required to upgrade any of its components (except when we push a kernel upgrade).
|
||||
|
||||

|
||||
|
||||
### Flist layout
|
||||
|
||||
The files in the upgrade flist need to be located in the filesystem tree at the same destination they would need to be in 0-OS. This allows the upgrade code to stay simple: it only copies files from the flist to the root filesystem of the node.

Booting a new node and updating a node use the same flist. Hence, a boot flist must contain all the services required for node operation.
|
||||
|
||||
Example:
|
||||
|
||||
0-OS filesystem:
|
||||
|
||||
```
|
||||
/etc/zinit/identityd.yaml
|
||||
/etc/zinit/networkd.yaml
|
||||
/etc/zinit/contd.yaml
|
||||
/etc/zinit/init/node-ready.sh
|
||||
/etc/zinit/init
|
||||
/etc/zinit/redis.yaml
|
||||
/etc/zinit/storaged.yaml
|
||||
/etc/zinit/flistd.yaml
|
||||
/etc/zinit/readme.md
|
||||
/etc/zinit/internet.yaml
|
||||
/etc/zinit/containerd.yaml
|
||||
/etc/zinit/boot.yaml
|
||||
/etc/zinit/provisiond.yaml
|
||||
/etc/zinit/node-ready.yaml
|
||||
/etc/zinit
|
||||
/etc
|
||||
/bin/zlf
|
||||
/bin/provisiond
|
||||
/bin/flistd
|
||||
/bin/identityd
|
||||
/bin/contd
|
||||
/bin/capacityd
|
||||
/bin/storaged
|
||||
/bin/networkd
|
||||
/bin/internet
|
||||
/bin
|
||||
```
|
88
collections/developers/internals/zos/internals/internals.md
Normal file
@@ -0,0 +1,88 @@
|
||||
<h1> Internal Modules</h1>
|
||||
|
||||
<h2> Table of Contents </h2>
|
||||
|
||||
- [Introduction](#introduction)
|
||||
- [Booting](#booting)
|
||||
- [Bootstrap](#bootstrap)
|
||||
- [Zinit](#zinit)
|
||||
- [Architecture](#architecture)
|
||||
- [IPC](#ipc)
|
||||
- [ZOS Processes (modules)](#zos-processes-modules)
|
||||
- [Capacity](#capacity)
|
||||
|
||||
***
|
||||
|
||||
## Introduction
|
||||
|
||||
This document explains in a nutshell the internals of ZOS. This includes the boot process, architecture, the internal modules (and their responsibilities), and the inter-process communication.
|
||||
|
||||
## Booting
|
||||
|
||||
ZOS is a linux based operating system in the sense that we use the main-stream linux kernel with no modifications (but heavily customized). The base image of ZOS includes linux, busybox, [zinit](https://github.com/threefoldtech/zinit) and other required tools that are needed during the boot process. The base image is also shipped with a bootstrap utility that is self-updating on boot which kick starts everything.
|
||||
|
||||
For more details about the ZOS base image please check [0-initramfs](https://github.com/threefoldtech/0-initramfs).
|
||||
|
||||
`ZOS` uses zinit as its `init` or `PID 1` process. `zinit` acts as a process manager and takes care of starting all required services in the right order, using simple configuration files available under `/etc/zinit`.
|
||||
|
||||
The base `ZOS` image has a zinit config to start the basic services that are required for booting. These include (mainly) but are not limited to:
|
||||
|
||||
- internet: A very basic service that tries to connect zos to the internet as fast (and as simple) as possible (over ethernet) using dhcp. This is needed so the system can continue the boot process. Once this one succeeds, it exits and leaves node network management to the more sophisticated ZOS module `networkd` which is yet to be downloaded and started by bootstrap.
|
||||
- redis: This is required by all zos modules for its IPC (inter process communication).
|
||||
- bootstrap: The bootstrap process which takes care of downloading all required zos binaries and modules. This one requires the `internet` service to actually succeed.
|
||||
|
||||
## Bootstrap
|
||||
|
||||
`bootstrap` is a utility that resides on the base image. It takes care of downloading and configuring all zos main services by doing the following:
|
||||
|
||||
- It checks if there is a more recent version of itself available. If it exists, the process first updates itself before proceeding.
|
||||
- It checks zos boot parameters (for example, which network you are booting into) as set by <https://bootstrap.grid.tf/>.
|
||||
- Once the network is known, let's call it `${network}`. This can either be `production`, `testing`, or `development`. The proper release is downloaded as follows:
|
||||
- All flists are downloaded from one of the [hub](https://hub.grid.tf/) `tf-zos-v3-bins.dev`, `tf-zos-v3-bins.test`, or `tf-zos-v3-bins` repos. Based on the network, only one of those repos is used to download all the support tools and binaries. Those are not included in the base image because they can be updated, added, or removed.
|
||||
- The flist `https://hub.grid.tf/tf-zos/zos:${network}-3:latest.flist.md` is downloaded (note that ${network} is replaced with the actual value). This flist includes all zos services from this repository. More information about the zos modules are explained later.
|
||||
- Once all binaries are downloaded, `bootstrap` finishes by asking zinit to start monitoring the newly installed services. The bootstrap exits and will never be started again as long as zos is running.
|
||||
- If zos is restarted the entire bootstrap process happens again including downloading the binaries because ZOS is completely stateless (except for some cached runtime data that is preserved across reboots on a cache disk).
|
||||
|
||||
## Zinit
|
||||
|
||||
As mentioned earlier, `zinit` is the process manager of zos. Bootstrap makes sure it registers all zos services for zinit to monitor. This means that zinit will take care that those services are always running, and restart them if they have crashed for any reason.
|
||||
|
||||
## Architecture
|
||||
|
||||
For `ZOS` to be able to run workloads of different types, it splits its functionality into smaller modules, where each module is responsible for providing a single piece of functionality. For example, `storaged` manages machine storage, hence it can provide low level storage capacity to other services that need it.

As an example, imagine that you want to start a `virtual machine`. For a `virtual machine` to be able to run, it requires a `rootfs` image, or the image of the VM itself, which is normally provided via an `flist` (managed by `flistd`). Then you need actual persistent storage (managed by `storaged`), a virtual nic (managed by `networkd`), and another service that can put everything together in the form of a VM (`vmd`). Finally, a service orchestrates all of this and translates the user request into an actual workload (`provisiond`); you get the picture.
### IPC
|
||||
|
||||
All modules running in zos need to be able to interact with each other, as shown in the previous example: the `provision` daemon needs to be able to ask the `storage` daemon to prepare a virtual disk. A new `inter-process communication` protocol and library was developed to enable this, with these extra features:

- Modules do not need to know where other modules live; there are no ports and/or urls that have to be known by all services.
- A single module can run multiple versions of an API.
- Ease of development.
- Auto generated clients.
For more details about the message bus please check [zbus](https://github.com/threefoldtech/zbus)
|
||||
|
||||
`zbus` uses redis as a message bus, hence redis is started in the early stages of zos booting.
|
||||
|
||||
`zbus` allows auto generation of `stubs`, which are generated clients against a certain module interface. Hence a module X can interact with a module Y by importing the generated clients and then making function calls.
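As a purely hypothetical illustration of that pattern (the real generated stubs live in the zos codebase and their names and signatures are not reproduced here), a caller could look roughly like this:

```go
package example

import "context"

// VDiskStub stands in for a zbus generated client of the storage module; the
// interface, method name, and signature here are purely illustrative.
type VDiskStub interface {
	DiskCreate(ctx context.Context, name string, size uint64) (string, error)
}

// A provisiond-like caller only depends on the generated client; it never needs
// to know on which address or in which process the storage module actually runs.
func prepareDisk(ctx context.Context, storage VDiskStub) (string, error) {
	return storage.DiskCreate(ctx, "vm-101-disk", 10*1024*1024*1024) // 10 GiB
}
```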
|
||||
|
||||
## ZOS Processes (modules)
|
||||
|
||||
Modules of zos are completely internal. There is no way for an external user to talk to them directly. The idea is the node exposes a public API over rmb, while internally this API can talk to internal modules over `zbus`.
|
||||
|
||||
Here is a list of the major ZOS modules.
|
||||
|
||||
- [Identity](identity/index.md)
|
||||
- [Node](node/index.md)
|
||||
- [Storage](storage/index.md)
|
||||
- [Network](network/index.md)
|
||||
- [Flist](flist/index.md)
|
||||
- [Container](container/index.md)
|
||||
- [VM](vmd/index.md)
|
||||
- [Provision](provision/index.md)
|
||||
|
||||
## Capacity
|
||||
|
||||
In [this document](./capacity.md), you can find a detailed description of how ZOS does capacity planning.
|
@@ -0,0 +1,57 @@
|
||||
> Note: This is unmaintained. Try it at your own risk.
|
||||
|
||||
# MacOS Developer
|
||||
|
||||
0-OS (v2) uses a Linux kernel and is really built with a linux environment in mind.
As a developer working from a MacOS environment you will have trouble running the 0-OS code.
|
||||
|
||||
Using [Docker][docker] you can work from a Linux development environment, hosted from your MacOS Host machine.
|
||||
In this README we'll do exactly that using the standard Ubuntu [Docker][docker] container as our base.
|
||||
|
||||
## Setup
|
||||
|
||||
0. Make sure to have Docker installed, and configured (also make sure you have your code folder path shared in your Docker preferences).
|
||||
1. Start an _Ubuntu_ Docker container with your shared code directory mounted as a volume:
|
||||
```bash
|
||||
docker run -ti -v "$HOME/oss":/oss ubuntu /bin/bash
|
||||
```
|
||||
2. Make sure your environment is updated and upgraded using `apt-get`.
|
||||
3. Install Go (`1.13`) from src using the following link or the one you found on [the downloads page](https://golang.org/dl/):
|
||||
```bash
|
||||
wget https://dl.google.com/go/go1.13.3.linux-amd64.tar.gz
|
||||
sudo tar -xvf go1.13.3.linux-amd64.tar.gz
|
||||
sudo mv go /usr/local
|
||||
```
|
||||
4. Add the following to your `$HOME/.bashrc` and `source` it:
|
||||
```vim
|
||||
export GOROOT=/usr/local/go
|
||||
export GOPATH=$HOME/go
|
||||
export PATH=$GOPATH/bin:$GOROOT/bin:$PATH
|
||||
```
|
||||
5. Confirm you have Go installed correctly:
|
||||
```
|
||||
go version && go env
|
||||
```
|
||||
6. Go to your `zos` code `pkg` directory hosted from your MacOS development machine within your docker `/bin/bash`:
|
||||
```bash
|
||||
cd /oss/github.com/threefoldtech/zos/pkg
|
||||
```
|
||||
7. Install the dependencies for testing:
|
||||
```bash
|
||||
make getdeps
|
||||
```
|
||||
8. Run tests and verify all works as expected:
|
||||
```bash
|
||||
make test
|
||||
```
|
||||
9. Build `zos`:
|
||||
```bash
|
||||
make build
|
||||
```
|
||||
|
||||
If you can successfully complete step (8) and step (9), you
can now contribute to `zos` as a MacOS developer.
Testing and compiling are done from within your container's shell,
while coding can be done from your beloved IDE on your MacOS development environment.
|
||||
|
||||
[docker]: https://www.docker.com
|
@@ -0,0 +1,74 @@
|
||||
# 0-OS v2 and its network setup
|
||||
|
||||
## Introduction
|
||||
|
||||
0-OS nodes participating in the Threefold grid need connectivity, of course. They need to be able to communicate over
the Internet with each other in order to do various things:
|
||||
|
||||
- download its OS modules
- perform OS module upgrades
- register itself to the grid, and send regular updates about its status
|
||||
- query the grid for tasks to execute
|
||||
- build and run the Overlay Network
|
||||
- download flists and the effective files to cache
|
||||
|
||||
The nodes themselves can have connectivity in a few different ways:
|
||||
|
||||
- Only have RFC1918 private addresses, connected to the Internet through NAT, NO IPv6
|
||||
Mostly, these are single-NIC (Network card) machines that can host some workloads through the Overlay Network, but
|
||||
can't expose services directly. These are HIDDEN nodes, and are mostly booted with a USB stick from
|
||||
bootstrap.grid.tf .
|
||||
- Dual-stacked: having RFC1918 private IPv4 and public IPv6 , where the IPv6 addresses are received from a home router,
|
||||
but firewalled for outgoing traffic only. These nodes are effectively also HIDDEN
|
||||
- Nodes with 2 NICs, one that has effectively a NIC connected to a segment that has real public
|
||||
addresses (IPv4 and/or IPv6) and one NIC that is used for booting and local
|
||||
management. (OOB) (like in the drawing for farmer setup)
|
||||
|
||||
For Farmers, we need to have Nodes to be reachable over IPv6, so that the nodes can:
|
||||
|
||||
- expose services to be proxied into containers/vms
|
||||
- act as aggregating nodes for Overlay Networks for HIDDEN Nodes
|
||||
|
||||
Some Nodes in Farms should also have a publicly reachable IPv4, to make sure that clients that only have IPv4 can
|
||||
effectively reach exposed services.
|
||||
|
||||
But we need to stress the importance of IPv6 availability when you're running a multi-node farm in a datacentre: as the
|
||||
grid is boldly claiming to be a new Internet, we should make sure we adhere to the new protocols that are future-proof.
|
||||
Hence: IPv6 is the base, and IPv4 is just there to accommodate the transition.
|
||||
|
||||
Nowadays, RIPE can't even hand out consecutive /22 IPv4 blocks any more for new LIRs, so you'll be bound to market to
|
||||
get IPv4, mostly at rates of 10-15 Euro per IP. Things tend to get costly that way.
|
||||
|
||||
So anyway, IPv6 is not an afterthought in 0-OS, we're starting with it.
|
||||
|
||||
## Network setup for farmers
|
||||
|
||||
This is a quick manual to what is needed for connecting a node with zero-OS V2.0
|
||||
|
||||
### Step 1. Testing for IPv6 availability in your location
As described above, the network in which the node is installed has to be IPv6 enabled. This is not an afterthought: as we are building a new internet, it has to be based on the new and forward looking IP addressing scheme. This is something you have to investigate and negotiate with your connectivity provider. Many (but not all) home connectivity products and certainly most datacenters can provide you with IPv6. There are many sources of information on how to test and check whether your connection is IPv6 enabled, [here is a starting point](http://www.ipv6enabled.org/ipv6_enabled/ipv6_enable.php).
|
||||
|
||||
### Step 2. Choosing your setup for connecting your nodes

Once you have established that you have IPv6 enabled on the network you are about to deploy on, you have to make sure that there is an IPv6 DHCP facility available. Zero-OS does not work with static IPv6 addresses (at this point in time). So you have to choose and create one of the following setups:
|
||||
|
||||
#### 2.1 Home setup
|
||||
|
||||
Use your (home) ISP router IPv6 DHCP capabilities to provide (private) IPv6 addresses. The principle works the same as for IPv4 home connections; everything is enabled by Network Address Translation (just like anything else that uses internet connectivity). This should be relatively straightforward if you have established that your connection has IPv6 enabled.
|
||||
|
||||
#### 2.2 Datacenter / Expert setup
|
||||
|
||||
In this situation there are many options for how to set up your node. This requires you, as the expert, to make a few decisions about how to connect it and what the best setup is that you can support for the operational lifetime of your farm. The same basic principles apply:
- You have to have a block of (public) IPv6 routed to your router, or you have to have your router set up to provide Network Address Translation (NAT).
- You have to have a DHCP server in your network that manages and controls IPv6 address leases. Depending on your specific setup, you either have this DHCP server manage a public IPv6 range, which makes all nodes directly connected to the public internet, or you have this DHCP server manage a private block of IPv6 addresses, which makes all your nodes connect to the internet through NAT.
|
||||
|
||||
As a farmer you are in charge of selecting and creating the appropriate network setup for your farm.
|
||||
|
||||
## General notes
|
||||
|
||||
The above setup will allow your node(s) to appear in the explorer on the TF Grid and will allow you to earn farming tokens. As stated in the introduction, ThreeFold is creating next generation internet capacity and therefore has IPv6 as its base building block. Connecting to the current (dominant) IPv4 network happens for IT workloads through so-called web gateways. As the word says, these are gateways that provide connectivity between the currently leading IPv4 addressing scheme and IPv6.
|
||||
|
||||
We have started a forum where people share their experiences and configurations. This will be work in progress and forever growing.
|
||||
|
||||
**IMPORTANT**: You as a farmer do not need access to IPv4 to be able to rent capacity for IT workloads that need to be visible on IPv4; this is something that can happen elsewhere on the TF Grid.
|
||||
|
After Width: | Height: | Size: 61 KiB |
After Width: | Height: | Size: 39 KiB |
@@ -0,0 +1,315 @@
|
||||
# 0-OS v2 and its network
|
||||
|
||||
## Introduction
|
||||
|
||||
0-OS nodes participating in the Threefold grid need connectivity, of course. They need to be able to communicate over
|
||||
the Internet with each other in order to do various things:
|
||||
|
||||
- download its OS modules
|
||||
- perform OS module upgrades
|
||||
- register itself to the grid, and send regular updates about its status
|
||||
- query the grid for tasks to execute
|
||||
- build and run the Overlay Network
|
||||
- download flists and the effective files to cache
|
||||
|
||||
The nodes themselves can have connectivity in a few different ways:
|
||||
|
||||
- Only have RFC1918 private addresses, connected to the Internet through NAT, NO IPv6
|
||||
Mostly, these are single-NIC (Network card) machines that can host some workloads through the Overlay Network, but
|
||||
can't expose services directly. These are HIDDEN nodes, and are mostly booted with a USB stick from
|
||||
bootstrap.grid.tf .
|
||||
- Dual-stacked: having RFC1918 private IPv4 and public IPv6 , where the IPv6 addresses are received from a home router,
|
||||
but firewalled for outgoing traffic only. These nodes are effectively also HIDDEN
|
||||
- Nodes with 2 NICs, one effectively connected to a segment that has real public
|
||||
addresses (IPv4 and/or IPv6) and one NIC that is used for booting and local
|
||||
management (OOB) (like in the drawing for the farmer setup)
|
||||
|
||||
For Farmers, we need Nodes to be reachable over IPv6, so that the nodes can:
|
||||
|
||||
- expose services to be proxied into containers/vms
|
||||
- act as aggregating nodes for Overlay Networks for HIDDEN Nodes
|
||||
|
||||
Some Nodes in Farms should also have a publicly reachable IPv4, to make sure that clients that only have IPv4 can
|
||||
effectively reach exposed services.
|
||||
|
||||
But we need to stress the importance of IPv6 availability when you're running a multi-node farm in a datacentre: as the
|
||||
grid is boldly claiming to be a new Internet, we should make sure we adhere to the new protocols that are future-proof.
|
||||
Hence: IPv6 is the base, and IPv4 is just there to accommodate the transition.
|
||||
|
||||
Nowadays, RIPE can't even hand out consecutive /22 IPv4 blocks any more for new LIRs, so you'll be bound to the market to
|
||||
get IPv4, mostly at rates of 10-15 Euro per IP. Things tend to get costly that way.
|
||||
|
||||
So anyway, IPv6 is not an afterthought in 0-OS, we're starting with it.
|
||||
|
||||
## Physical setup for farmers
|
||||
|
||||
```text
|
||||
XXXXX XXX
|
||||
XX XXX XXXXX XXX
|
||||
X X XXX
|
||||
X X
|
||||
X INTERNET X
|
||||
XXX X X
|
||||
XXXXX XX XX XXXX
|
||||
+X XXXX XX XXXXX
|
||||
|
|
||||
|
|
||||
|
|
||||
|
|
||||
|
|
||||
+------+--------+
|
||||
| FIREWALL/ |
|
||||
| ROUTER |
|
||||
+--+----------+-+
|
||||
| |
|
||||
+-----------+----+ +-+--------------+
|
||||
| switch/ | | switch/ |
|
||||
| vlan segment | | vlan segment |
|
||||
+-+---------+----+ +---+------------+
|
||||
| | |
|
||||
+-------+-------+ |OOB | PUBLIC
|
||||
| PXE / dhcp | | |
|
||||
| Ser^er | | |
|
||||
+---------------+ | |
|
||||
| |
|
||||
+-----+------------+----------+
|
||||
| |
|
||||
| +--+
|
||||
| | |
|
||||
| NODES | +--+
|
||||
+--+--------------------------+ | |
|
||||
| | |
|
||||
+--+--------------------------+ |
|
||||
| |
|
||||
+-----------------------------+
|
||||
```
|
||||
|
||||
The PXE/dhcp can also be done by the firewall, your mileage may vary.
|
||||
|
||||
## Switch and firewall configs
|
||||
|
||||
Single switch, multiple switch, it all boils down to the same:
|
||||
|
||||
- one port is an access port on an OOB vlan/segment
|
||||
- one port is connected to a public vlan/segment
|
||||
|
||||
The farmer makes sure that every node properly receives an IPv4 address in the OOB segment by means of DHCP, so
|
||||
that with a PXE config or USB, a node can effectively start its boot process:
|
||||
|
||||
- Download kernel and initrd
|
||||
- Download and mount the system flists so that the 0-OS daemons can start
|
||||
- Register itself on the grid
|
||||
- Query the grid for tasks to execute
|
||||
|
||||
For the PUBLIC side of the Nodes, there are a few things to consider:
|
||||
|
||||
- It's the farmer's job to inform the grid what node gets an IP address, be it IPv4 or IPv6.
|
||||
- Nodes that don't receive an IPv4 address will connect to the IPv4 net through the NATed OOB network
|
||||
- A farmer is responsible for providing an IPv6 prefix on at least one segment, and for having a Router Advertisement daemon
|
||||
running to provide for SLAAC addressing on that segment (a minimal example follows this list).
|
||||
- That IPv6 Prefix on the public segment should not be firewalled, as it's impossible to know in your firewall what
|
||||
ports will get exposed for the proxies.
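
As a rough sketch of what such a Router Advertisement setup could look like on a Linux-based router (radvd is just one option; the interface name and prefix below are placeholders to replace with your own):

```bash
# Minimal radvd configuration advertising a /64 for SLAAC on the public segment.
cat > /etc/radvd.conf << 'EOF'
interface eth1 {                      # interface facing the public node segment
    AdvSendAdvert on;
    prefix 2a02:1802:5e::/64 {        # replace with your own routable prefix
        AdvOnLink on;
        AdvAutonomous on;             # stateless (SLAAC) address configuration
    };
};
EOF
radvd                                 # start the router advertisement daemon
```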
|
||||
|
||||
The Nodes themselves have nothing listening that points into the host OS itself, and are by themselves also firewalled.
|
||||
In dev mode, there is an ssh server with a key-only login, accessible by a select few ;-)
|
||||
|
||||
## DHCP/Radvd/RA/DHCP6
|
||||
|
||||
For home networks, there is not much to do: a Node will get a private (RFC1918) IPv4 address, and most probably an
|
||||
IPv6 address in a /64 prefix, but is not reachable over IPv6, unless the firewall is disabled for IPv6. As we can't
|
||||
rely on the fact that that is possible, we assume these nodes to be HIDDEN.
|
||||
|
||||
A normal self-respecting firewall or IP-capable switch can hand out IP[46] addresses; some can
|
||||
even bootp/tftp to get nodes booted over the network.
|
||||
We are (full of hope) assuming that you would have such a beast to configure and splice your network
|
||||
in multiple segments.
|
||||
A segment is a physical network separation. That can be port-based vlans, or even separate switches, whatever rocks your
|
||||
boat; the keyword here is **separate**.
|
||||
|
||||
On both segments you will need a way to hand out IPv4 addresses based on MAC addresses of the nodes. Yes, there is some
|
||||
administration to do, but it's a one-off, and really necessary, because you really need to know which physical machine
|
||||
has which IP. For lights-out management and location of machines that is a must.
|
||||
|
||||
So you'll need a list of MAC addresses to add to your DHCP server for IPv4, to make sure you know which machine has
|
||||
received which IPv4 address; an example of such a list is given after the two points below.
|
||||
That is necessary for 2 things:
|
||||
|
||||
- locate the node if something is amiss, like be able to pinpoint a node's disk in case it broke (which it will)
|
||||
- have the node be reachable all the time, without the need to update the grid and network configs every time the node
|
||||
boots.
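
For illustration, with a dnsmasq-based DHCP server such a MAC-to-IP list could look like the snippet below (the MAC addresses and IPs are examples, borrowed from the farm script further down in this collection):

```bash
# Pin each node's IPv4 address to its MAC on the OOB segment (dnsmasq syntax).
cat >> /etc/dnsmasq.d/zos-nodes.conf << 'EOF'
dhcp-host=0c:c4:7a:51:e3:6a,10.5.0.11,zosv2tst-1
dhcp-host=0c:c4:7a:51:e9:e6,10.5.0.12,zosv2tst-2
EOF
```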
|
||||
|
||||
## What happens under the hood (farmer)
|
||||
|
||||
While we did our utmost best to keep IPv4 address needs to a strict minimum, at least one Node will need an IPv4 address for handling everything that is Overlay Networks.
|
||||
For Containers to reach the Internet, any type of connectivity will do, be it NAT or through an internal DMZ that has a
|
||||
routable IPv4 address.
|
||||
|
||||
Internally, a lot of things are being set up to have a node properly participate in the grid, as well as to be prepared to partake in the User's Overlay Networks.
|
||||
|
||||
A node connects itself to 'the Internet' depending on a few states.
|
||||
|
||||
1. It lives in a fully private network (like it would be connected directly to a port on a home router)
|
||||
|
||||
```
|
||||
XX XXX
|
||||
XXX XXXXXX
|
||||
X Internet X
|
||||
XXXXXXX XXXXX
|
||||
XX XXX
|
||||
XX X
|
||||
X+X
|
||||
|
|
||||
|
|
||||
+--------+-----------+
|
||||
| HOME / |
|
||||
| SOHO router |
|
||||
| |
|
||||
+--------+-----------+
|
||||
|
|
||||
| Private space IPv4
|
||||
| (192.168.1.0/24)
|
||||
|
|
||||
+---------+------------+
|
||||
| |
|
||||
| NODE |
|
||||
| |
|
||||
| |
|
||||
| |
|
||||
| |
|
||||
| |
|
||||
+----------------------+
|
||||
```
|
||||
|
||||
1. It lives in a fully public network (like it is connected directly to an uplink and has a public ipv4 address)
|
||||
|
||||
```
|
||||
XX XXX
|
||||
XXX XXXXXX
|
||||
X Internet X
|
||||
XXXXXXX XXXXX
|
||||
XX XXX
|
||||
XX X
|
||||
X+X
|
||||
|
|
||||
| fully public space ipv4/6
|
||||
| 185.69.166.0/24
|
||||
| 2a02:1802:5e:0:1000::abcd/64
|
||||
|
|
||||
+---------+------------+
|
||||
| |
|
||||
| NODE |
|
||||
| |
|
||||
+----------------------+
|
||||
|
||||
```
|
||||
The node is fully reachable
|
||||
|
||||
1. It lives in a datacentre, where a farmer manages the Network.
|
||||
|
||||
A little Drawing :
|
||||
|
||||
```text
|
||||
+----------------------------------------------------+
|
||||
| switch |
|
||||
| |
|
||||
| |
|
||||
+----------+-------------------------------------+---+
|
||||
| |
|
||||
access | |
|
||||
mgmt | +---------------+
|
||||
vlan | | access
|
||||
| | public
|
||||
| | vlan
|
||||
| |
|
||||
+-------+---------------------+------+
|
||||
| |
|
||||
| nic1 nic2 |
|
||||
| |
|
||||
| |
|
||||
| |
|
||||
| NODE |
|
||||
| |
|
||||
| |
|
||||
| |
|
||||
+------------------------------------+
|
||||
|
||||
```
|
||||
|
||||
Or the more elaborate drawing on top that should be sufficient for a sysadmin to comprehend.
|
||||
|
||||
Although:
|
||||
|
||||
- we don't (yet) support nic bonding (next release)
|
||||
- we don't (yet) support vlans, so your ports on switch/router need to be access ports to vlans to your router/firewall
|
||||
|
||||
|
||||
## yeayea, but really ... what now ?
|
||||
|
||||
Ok, what are the constraints?
|
||||
|
||||
A little foreword:
|
||||
ZosV2 uses IPv6 as its base for networking, where the oldie IPv4 is merely an afterthought. So for it to work properly in its current incarnation (we are working to get it to do IPv4-only too), for now, we need the node to live in a space that provides IPv6 __too__.
|
||||
IPv4 and IPv6 are very different beasts, so any machine connected to the Internet will do both on the same network. So basically your computer talks 2 different languages when it comes to communicating. That is the same for ZOS, where right now, its mother tongue is IPv6.
|
||||
|
||||
So your ZOS v2 node can start in different settings:
|
||||
1) you are a farmer, your ISP can provide you with IPv6
|
||||
Ok, you're all set, aside from a public IPv4 DHCP, you need to run a Stateless-Only SLAAC Router Advertiser (ZOS does NOT do DHCP6).
|
||||
|
||||
1) you are a farmer, your ISP asks you what the hell IPv6 is
|
||||
That is problematic right now, wait for the next release of ZosV2
|
||||
|
||||
1) you are a farmer, with only one node , at home, and on your PC https://ipv6.net tells you you have IPv6 on your PC.
|
||||
That means your home router received an IPv6 allocation from the ISP.
|
||||
You're all set, your node will boot, and register to the grid. If you know what you're doing, you can configure your router to allow all IPv6 traffic in forwarding mode to the specific MAC address of your node. (we'll explain later)
|
||||
1) you are a farmer, with a few nodes somewhere that are registered on the grid in V1, but you have no clue if IPv6 is supported where these nodes live
|
||||
1) you have a ThreefoldToken node at home, and still do not have a clue
|
||||
|
||||
Basically it also boils down to a few other cases:
|
||||
|
||||
1) the physical network where a node lives has: IPv6 and Private space IPv4
|
||||
1) the physical network where a node lives has: IPv6 and Public IPv4
|
||||
1) the physical network where a node lives has: only IPv4
|
||||
|
||||
But it boils down to: call your ISP, ask for IPv6. It's the future; for your ISP, it's time. There is no way to circumvent it. No way.
|
||||
|
||||
|
||||
OK, then, now what.
|
||||
|
||||
1) you're a farmer with a bunch of nodes somewhere in a DC
|
||||
|
||||
- your nodes are connected once (with one NIC) to a switch/router
|
||||
Then your router will have :
|
||||
- a segment that carries IPv4 __and__ IPv6:
|
||||
|
||||
- for IPv4, there are 2 possibilities:
|
||||
- it's RFC1918 (Private space) -> you NAT that subnet (e.g. 192.168.1.0/24) towards the Public Internet
|
||||
|
||||
- you __will__ have difficulty designating an IPv4 public entrypoint into your farm
|
||||
- your workloads will be only reachable through the overlay
|
||||
- your storage will not be reachable
|
||||
|
||||
- you received a (small, because of the scarcity of IPv4 addresses, your ISP will give you only limited and pricey IPv4 addresses) IPv4 range you can utilise
|
||||
|
||||
- things are better, the nodes can live in public ipv4 space, where they can be used as entrypoint
|
||||
- standard configuration that works
|
||||
|
||||
- for IPv6, your router is a Routing advertiser that provides SLAAC (Stateless, unmanaged) for that segment, working with a /64 prefix
|
||||
|
||||
- the nodes will be reachable over IPv6
|
||||
- storage backend will be available for the full grid
|
||||
- everything will just work
|
||||
|
||||
Best solution for single NIC:
|
||||
- an IPv6 prefix
|
||||
- an IPv4 subnet (however small)
|
||||
|
||||
- your nodes have 2 connections, and you want to separate management from user traffic
|
||||
|
||||
- same applies as above, where the best outcome will be obtained with a real IPv6 prefix allocation and a small public subnet that is routable.
|
||||
- the second NIC (typically 10GBit) will then carry everything public, and the first NIC will just be there for management, living in Private space for IPv4, mostly without IPv6
|
||||
- your switch needs to be configured to provide port-based vlans, so the segments are properly separated, and your router needs to reflect that vlan config so that separation is handled by the firewall in the router (iptables, pf, acl, ...)
|
||||
|
||||
|
||||
|
||||
|
||||
|
@@ -0,0 +1,66 @@
|
||||
## Farmers providing transit for Tenant Networks (TN or Network)
|
||||
|
||||
For networks of a user to be reachable, these networks need penultimate Network resources that act as exit nodes for the WireGuard mesh.
|
||||
|
||||
For that, Users need to solicit a routable network with farmers that provide such a service.
|
||||
|
||||
### Global registry for network resources. (`GRNR`?)
|
||||
|
||||
Threefold, through BCDB, should keep a store where Farmers can also register a network service for Tenant Network (TN) reachability.
|
||||
|
||||
In a network transaction the first thing asked should be where a user wants to purchase its transit. That can be with a nearby (latency or geolocation) Exit Provider (can e.g. be a Farmer), or with an Exit Provider outside of the geolocation for easier routing towards the primary entrypoint. (VPN-like services coming to mind)
|
||||
|
||||
With this, we could envision in a later stage to have the Network Resources be IPv6 multihomed with policy-based routing. That adds the possibility to have multiple exit nodes for the same Network, with different IPv6 routes to them.
|
||||
|
||||
### Datastructure
|
||||
|
||||
A registered Farmer can also register his (dc-located?) network to be sold as transit space. For that he registers:
|
||||
- the IPv4 addresses that can be allocated to exit nodes.
|
||||
- the IPv6 prefix he obtained to be used in the Grid
|
||||
- the nodes that will serve as exit nodes.
|
||||
These nodes need to have IPv[46] access to routable address space through:
|
||||
- Physical access in an interface of the node
|
||||
- Access on a public `vlan` or via `vxlan / mpls / gre`
|
||||
|
||||
Together with the registered nodes that will be part of that Public segment, the TNoDB (BCDB) can verify a Network Object containing an ExitPoint for a Network and add it to the queue for ExitNodes to fetch and apply.
|
||||
|
||||
Physically, Nodes can be connected in several ways:
|
||||
- living directly on the Internet (with a routable IPv4 and/or IPv6 Address) without Provider-enforced firewalling (outgoing traffic only)
|
||||
- having an IPv4 allocation --and-- an IPv6 allocation
|
||||
- having a single IPv4 address --and-- a single IPv6 allocation (/64) or even (Oh God Why) a single IPv6 addr.
|
||||
- living in a Farm that has Nodes only reachable through NAT for IPv4 and no IPv6
|
||||
- living in a Farm that has NAT IPv4 and routable IPv6 with an allocation
|
||||
- living in a single-segment having IPv4 RFC1918 and only one IPv6 /64 prefix (home Nodes mostly)
|
||||
|
||||
#### A Network resource allocation.
|
||||
We define Network Resource (NR) as a routable IPv6 `/64` Prefix, so for every time a new TNo is generated and validated, containing a new serial number and an added/removed NR, there has been a request to obtain a valid IPv6 Prefix (/64) to be added to the TNo.
|
||||
|
||||
Basically it's just a list of allocations in that prefix, that are in use. Any free Prefix will do, as we do routing in the exit nodes with a `/64` granularity.
|
||||
|
||||
The TNoDB (BCDB) then validates/updates the Tenant Network object with that new Network Resource and places it on a queue to be fetched by the interested Nodes.
|
||||
|
||||
#### The Nodes responsible for ExitPoints
|
||||
|
||||
A Node responsible for ExitPoints as well as a Public endpoint will know so because of how it's registered in the TNoDB (BCDB). That is:
|
||||
- it is defined as an exit node
|
||||
- the TNoDB hands out an Object that describes its public connectivity, i.e.:
|
||||
- the public IPv4 address(es) it can use
|
||||
- the IPv6 Prefix in the network segment that contains the penultimate default route
|
||||
- an eventual Private BGP AS number for announcing the `/64` Prefixes of a Tenant Network, and the BGP peer(s).
|
||||
|
||||
With that information, a Node can then build the Network Namespace from which it builds the Wireguard Interfaces prior to sending them in the ExitPoint Namespace.
|
||||
|
||||
So the TNoDB (BCDB) hands out
|
||||
- Tenant Network Objects
|
||||
- Public Interface Objects
|
||||
|
||||
They are related :
|
||||
- A Node can have Network Resources
|
||||
- A Network Resource can have (1) Public Interface
|
||||
- Both are part of a Tenant Network
|
||||
|
||||
A TNo defines a Network where ONLY the ExitPoint is flagged as being one. No more.
|
||||
When the Node (networkd) needs to set up a Public node, it will need to act differently.
|
||||
- Verify if the Node is **really** public, if so use standard WG interface setup
|
||||
- If not, verify if there is already a Public Exit Namespace defined, create WG interface there.
|
||||
- If there is no Public Exit Namespace, request one, and set it up first.
|
@@ -0,0 +1,264 @@
|
||||
# Network
|
||||
|
||||
- [How does a farmer configure a node as exit node](#How-does-a-farmer-configure-a-node-as-exit-node)
|
||||
- [How to create a user private network](#How-to-create-a-user-private-network)
|
||||
|
||||
## How does a farmer configure a node as exit node
|
||||
|
||||
For the network of the grid to work properly, some of the nodes in the grid need to be configured as "exit nodes". An "exit node" is a node that has a publicly accessible IP address and that is responsible for routing IPv6 traffic, or proxying IPv4 traffic.
|
||||
|
||||
A farmer that wants to configure one of his nodes as "exit node", needs to register it in the TNODB. The node will then automatically detect it has been configured to be an exit node and do the necessary network configuration to start acting as one.
|
||||
|
||||
At the current state of the development, we have a [TNODB mock](../../tools/tnodb_mock) server and a [tffarmer CLI](../../tools/tffarm) tool that can be used to do this configuration.
|
||||
|
||||
Here is an example of how a farmer could register one of his nodes as "exit node":
|
||||
|
||||
1. Farmer needs to create its farm identity
|
||||
|
||||
```bash
|
||||
tffarmer register --seed myfarm.seed "mytestfarm"
|
||||
Farm registered successfully
|
||||
Name: mytestfarm
|
||||
Identity: ZF6jtCblLhTgAqp2jvxKkOxBgSSIlrRh1mRGiZaRr7E=
|
||||
```
|
||||
|
||||
2. Boot your nodes with your farm identity specified in the kernel parameters.
|
||||
|
||||
Take the farm identity created at step 1 and boot your node with the kernel parameter `farmer_id=<identity>`
|
||||
|
||||
for your test farm that would be `farmer_id=ZF6jtCblLhTgAqp2jvxKkOxBgSSIlrRh1mRGiZaRr7E=`
|
||||
|
||||
Once the node is booted, it will automatically register itself as being part of your farm into the [TNODB](../../tools/tnodb_mock) server.
|
||||
|
||||
You can verify that your node registered itself properly by listing all the nodes from the TNODB with a GET request on the `/nodes` endpoint:
|
||||
|
||||
```bash
|
||||
curl http://tnodb_addr/nodes
|
||||
[{"node_id":"kV3u7GJKWA7Js32LmNA5+G3A0WWnUG9h+5gnL6kr6lA=","farm_id":"ZF6jtCblLhTgAqp2jvxKkOxBgSSIlrRh1mRGiZaRr7E=","Ifaces":[]}]
|
||||
```
|
||||
|
||||
3. Farmer needs to specify its public allocation range to the TNODB
|
||||
|
||||
```bash
|
||||
tffarmer give-alloc 2a02:2788:0000::/32 --seed myfarm.seed
|
||||
prefix registered successfully
|
||||
```
|
||||
|
||||
4. Configure the public interface of the exit node if needed
|
||||
|
||||
In this step the farmer will tell his node how it needs to connect to the public internet. This configuration depends on the farm network setup, which is why it is up to the farmer to provide the details on how the node needs to configure itself.
|
||||
|
||||
In a first phase, we create the internet access in 2 ways:
|
||||
|
||||
- the node is fully public: you don't need to configure a public interface, you can skip this step
|
||||
- the node has a management interface and a nic for public
|
||||
then `configure-public` is required, and the farmer has the public interface connected to a specific public segment with a router to the internet in front.
|
||||
|
||||
```bash
|
||||
tffarmer configure-public --ip 172.20.0.2/24 --gw 172.20.0.1 --iface eth1 kV3u7GJKWA7Js32LmNA5+G3A0WWnUG9h+5gnL6kr6lA=
|
||||
#public interface configured on node kV3u7GJKWA7Js32LmNA5+G3A0WWnUG9h+5gnL6kr6lA=
|
||||
```
|
||||
|
||||
We still need to figure out a way to get the routes properly installed; we'll use static routes on the top-level router for now to do a demo.
|
||||
|
||||
The node is now configured to be used as an exit node.
|
||||
|
||||
5. Mark a node as being an exit node
|
||||
|
||||
The farmer then needs to select which node he agrees to use as an exit node for the grid
|
||||
|
||||
```bash
|
||||
tffarmer select-exit kV3u7GJKWA7Js32LmNA5+G3A0WWnUG9h+5gnL6kr6lA=
|
||||
#Node kV3u7GJKWA7Js32LmNA5+G3A0WWnUG9h+5gnL6kr6lA= marked as exit node
|
||||
```
|
||||
|
||||
## How to create a user private network
|
||||
|
||||
1. Choose an exit node
|
||||
2. Request a new allocation from the farm of the exit node
|
||||
   - a GET request on the tnodb_mock at `/allocations/{farm_id}` will give you a new allocation (see the example right after this list)
|
||||
3. Create the network schema
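
Step 2 boils down to a single request against the mock server, something like this (the farm id below is the example one used further in this document):

```bash
# Ask the tnodb_mock for a new allocation out of the exit node's farm.
curl http://tnodb_addr/allocations/7koUE4nRbdsqEbtUVBhx3qvRqF58gfeHGMRGJxjqwfZi
```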
|
||||
|
||||
Steps 1 and 2 are easy enough to be done even manually but step 3 requires a deep knowledge of how networking works
|
||||
as well as the specific requirements of the 0-OS network system.
|
||||
This is why we provide a tool that simplifies this process for you, [tfuser](../../tools/tfuser).
|
||||
|
||||
Using tfuser, creating a network becomes trivial:
|
||||
|
||||
```bash
|
||||
# creates a new network with node DLFF6CAshvyhCrpyTHq1dMd6QP6kFyhrVGegTgudk6xk as exit node
|
||||
# and output the result into network.json
|
||||
tfuser generate --schema network.json network create --node DLFF6CAshvyhCrpyTHq1dMd6QP6kFyhrVGegTgudk6xk
|
||||
```
|
||||
|
||||
network.json will now contain something like:
|
||||
|
||||
```json
|
||||
{
|
||||
"id": "",
|
||||
"tenant": "",
|
||||
"reply-to": "",
|
||||
"type": "network",
|
||||
"data": {
|
||||
"network_id": "J1UHHAizuCU6s9jPax1i1TUhUEQzWkKiPhBA452RagEp",
|
||||
"resources": [
|
||||
{
|
||||
"node_id": {
|
||||
"id": "DLFF6CAshvyhCrpyTHq1dMd6QP6kFyhrVGegTgudk6xk",
|
||||
"farmer_id": "7koUE4nRbdsqEbtUVBhx3qvRqF58gfeHGMRGJxjqwfZi",
|
||||
"reachability_v4": "public",
|
||||
"reachability_v6": "public"
|
||||
},
|
||||
"prefix": "2001:b:a:8ac6::/64",
|
||||
"link_local": "fe80::8ac6/64",
|
||||
"peers": [
|
||||
{
|
||||
"type": "wireguard",
|
||||
"prefix": "2001:b:a:8ac6::/64",
|
||||
"Connection": {
|
||||
"ip": "2a02:1802:5e::223",
|
||||
"port": 1600,
|
||||
"key": "PK1L7n+5Fo1znwD/Dt9lAupL19i7a6zzDopaEY7uOUE=",
|
||||
"private_key": "9220e4e29f0acbf3bd7ef500645b78ae64b688399eb0e9e4e7e803afc4dd72418a1c5196208cb147308d7faf1212758042f19f06f64bad6ffe1f5ed707142dc8cc0a67130b9124db521e3a65e4aee18a0abf00b6f57dd59829f59662"
|
||||
}
|
||||
}
|
||||
],
|
||||
"exit_point": true
|
||||
}
|
||||
],
|
||||
"prefix_zero": "2001:b:a::/64",
|
||||
"exit_point": {
|
||||
"ipv4_conf": null,
|
||||
"ipv4_dnat": null,
|
||||
"ipv6_conf": {
|
||||
"addr": "fe80::8ac6/64",
|
||||
"gateway": "fe80::1",
|
||||
"metric": 0,
|
||||
"iface": "public"
|
||||
},
|
||||
"ipv6_allow": []
|
||||
},
|
||||
"allocation_nr": 0,
|
||||
"version": 0
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
This is a valid network schema. It only contains a single exit node though, so it is not really useful yet.
|
||||
Let's add another node to the network:
|
||||
|
||||
```bash
|
||||
tfuser generate --schema network.json network add-node --node 4hpUjrbYS4YeFbvLoeSR8LGJKVkB97JyS83UEhFUU3S4
|
||||
```
|
||||
|
||||
result looks like:
|
||||
|
||||
```json
|
||||
{
|
||||
"id": "",
|
||||
"tenant": "",
|
||||
"reply-to": "",
|
||||
"type": "network",
|
||||
"data": {
|
||||
"network_id": "J1UHHAizuCU6s9jPax1i1TUhUEQzWkKiPhBA452RagEp",
|
||||
"resources": [
|
||||
{
|
||||
"node_id": {
|
||||
"id": "DLFF6CAshvyhCrpyTHq1dMd6QP6kFyhrVGegTgudk6xk",
|
||||
"farmer_id": "7koUE4nRbdsqEbtUVBhx3qvRqF58gfeHGMRGJxjqwfZi",
|
||||
"reachability_v4": "public",
|
||||
"reachability_v6": "public"
|
||||
},
|
||||
"prefix": "2001:b:a:8ac6::/64",
|
||||
"link_local": "fe80::8ac6/64",
|
||||
"peers": [
|
||||
{
|
||||
"type": "wireguard",
|
||||
"prefix": "2001:b:a:8ac6::/64",
|
||||
"Connection": {
|
||||
"ip": "2a02:1802:5e::223",
|
||||
"port": 1600,
|
||||
"key": "PK1L7n+5Fo1znwD/Dt9lAupL19i7a6zzDopaEY7uOUE=",
|
||||
"private_key": "9220e4e29f0acbf3bd7ef500645b78ae64b688399eb0e9e4e7e803afc4dd72418a1c5196208cb147308d7faf1212758042f19f06f64bad6ffe1f5ed707142dc8cc0a67130b9124db521e3a65e4aee18a0abf00b6f57dd59829f59662"
|
||||
}
|
||||
},
|
||||
{
|
||||
"type": "wireguard",
|
||||
"prefix": "2001:b:a:b744::/64",
|
||||
"Connection": {
|
||||
"ip": "<nil>",
|
||||
"port": 0,
|
||||
"key": "3auHJw3XHFBiaI34C9pB/rmbomW3yQlItLD4YSzRvwc=",
|
||||
"private_key": "96dc64ff11d05e8860272b91bf09d52d306b8ad71e5c010c0ccbcc8d8d8f602c57a30e786d0299731b86908382e4ea5a82f15b41ebe6ce09a61cfb8373d2024c55786be3ecad21fe0ee100339b5fa904961fbbbd25699198c1da86c5"
|
||||
}
|
||||
}
|
||||
],
|
||||
"exit_point": true
|
||||
},
|
||||
{
|
||||
"node_id": {
|
||||
"id": "4hpUjrbYS4YeFbvLoeSR8LGJKVkB97JyS83UEhFUU3S4",
|
||||
"farmer_id": "7koUE4nRbdsqEbtUVBhx3qvRqF58gfeHGMRGJxjqwfZi",
|
||||
"reachability_v4": "hidden",
|
||||
"reachability_v6": "hidden"
|
||||
},
|
||||
"prefix": "2001:b:a:b744::/64",
|
||||
"link_local": "fe80::b744/64",
|
||||
"peers": [
|
||||
{
|
||||
"type": "wireguard",
|
||||
"prefix": "2001:b:a:8ac6::/64",
|
||||
"Connection": {
|
||||
"ip": "2a02:1802:5e::223",
|
||||
"port": 1600,
|
||||
"key": "PK1L7n+5Fo1znwD/Dt9lAupL19i7a6zzDopaEY7uOUE=",
|
||||
"private_key": "9220e4e29f0acbf3bd7ef500645b78ae64b688399eb0e9e4e7e803afc4dd72418a1c5196208cb147308d7faf1212758042f19f06f64bad6ffe1f5ed707142dc8cc0a67130b9124db521e3a65e4aee18a0abf00b6f57dd59829f59662"
|
||||
}
|
||||
},
|
||||
{
|
||||
"type": "wireguard",
|
||||
"prefix": "2001:b:a:b744::/64",
|
||||
"Connection": {
|
||||
"ip": "<nil>",
|
||||
"port": 0,
|
||||
"key": "3auHJw3XHFBiaI34C9pB/rmbomW3yQlItLD4YSzRvwc=",
|
||||
"private_key": "96dc64ff11d05e8860272b91bf09d52d306b8ad71e5c010c0ccbcc8d8d8f602c57a30e786d0299731b86908382e4ea5a82f15b41ebe6ce09a61cfb8373d2024c55786be3ecad21fe0ee100339b5fa904961fbbbd25699198c1da86c5"
|
||||
}
|
||||
}
|
||||
],
|
||||
"exit_point": false
|
||||
}
|
||||
],
|
||||
"prefix_zero": "2001:b:a::/64",
|
||||
"exit_point": {
|
||||
"ipv4_conf": null,
|
||||
"ipv4_dnat": null,
|
||||
"ipv6_conf": {
|
||||
"addr": "fe80::8ac6/64",
|
||||
"gateway": "fe80::1",
|
||||
"metric": 0,
|
||||
"iface": "public"
|
||||
},
|
||||
"ipv6_allow": []
|
||||
},
|
||||
"allocation_nr": 0,
|
||||
"version": 1
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Our network schema is now ready, but before we can provision it onto a node, we need to sign it and send it to the bcdb.
|
||||
To be able to sign it we need to have a key pair. You can use the `tfuser id` command to create an identity:
|
||||
|
||||
```bash
|
||||
tfuser id --output user.seed
|
||||
```
|
||||
|
||||
We can now provision the network on both nodes:
|
||||
|
||||
```bash
|
||||
tfuser provision --schema network.json \
|
||||
--node DLFF6CAshvyhCrpyTHq1dMd6QP6kFyhrVGegTgudk6xk \
|
||||
--node 4hpUjrbYS4YeFbvLoeSR8LGJKVkB97JyS83UEhFUU3S4 \
|
||||
--seed user.seed
|
||||
```
|
@@ -0,0 +1,54 @@
|
||||
#!/usr/bin/bash
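# Generates OpenWrt-style static DHCP "config host" stanzas for the nodes'
# management and IPMI NICs (one fixed IP per MAC), plus symlink commands for
# the per-MAC boot configuration at the end.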
|
||||
|
||||
mgmtnic=(
|
||||
0c:c4:7a:51:e3:6a
|
||||
0c:c4:7a:51:e9:e6
|
||||
0c:c4:7a:51:ea:18
|
||||
0c:c4:7a:51:e3:78
|
||||
0c:c4:7a:51:e7:f8
|
||||
0c:c4:7a:51:e8:ba
|
||||
0c:c4:7a:51:e8:0c
|
||||
0c:c4:7a:51:e7:fa
|
||||
)
|
||||
|
||||
ipminic=(
|
||||
0c:c4:7a:4c:f3:b6
|
||||
0c:c4:7a:4d:02:8c
|
||||
0c:c4:7a:4d:02:91
|
||||
0c:c4:7a:4d:02:62
|
||||
0c:c4:7a:4c:f3:7e
|
||||
0c:c4:7a:4d:02:98
|
||||
0c:c4:7a:4d:02:19
|
||||
0c:c4:7a:4c:f2:e0
|
||||
)
|
||||
cnt=1
|
||||
for i in ${mgmtnic[*]} ; do
|
||||
cat << EOF
|
||||
config host
|
||||
option name 'zosv2tst-${cnt}'
|
||||
option dns '1'
|
||||
option mac '${i}'
|
||||
option ip '10.5.0.$((${cnt} + 10))'
|
||||
|
||||
EOF
|
||||
let cnt++
|
||||
done
|
||||
|
||||
|
||||
|
||||
cnt=1
|
||||
for i in ${ipminic[*]} ; do
|
||||
cat << EOF
|
||||
config host
|
||||
option name 'ipmiv2tst-${cnt}'
|
||||
option dns '1'
|
||||
option mac '${i}'
|
||||
option ip '10.5.0.$((${cnt} + 100))'
|
||||
|
||||
EOF
|
||||
let cnt++
|
||||
done
|
||||
|
||||
for i in ${mgmtnic[*]} ; do
|
||||
echo ln -s zoststconf 01-$(echo $i | sed s/:/-/g)
|
||||
done
|
@@ -0,0 +1,35 @@
|
||||
<h1> Definitions</h1>
|
||||
|
||||
<h2> Table of Contents </h2>
|
||||
|
||||
- [Introduction](#introduction)
|
||||
- [Node](#node)
|
||||
- [TNo : Tenant Network Object](#tno--tenant-network-object)
|
||||
- [NR: Network Resource](#nr-network-resource)
|
||||
|
||||
***
|
||||
|
||||
## Introduction
|
||||
|
||||
We present definitions of words used throughout the documentation.
|
||||
|
||||
## Node
|
||||
|
||||
TL;DR: Computer.
|
||||
A Node is a computer with CPU, Memory, Disks (or SSD's, NVMe) connected to _A_ network that has Internet access. (i.e. it can reach www.google.com, just like you on your phone, at home)
|
||||
That Node will, once it has received an IP address (IPv4 or IPv6), register itself when it's new, or confirm its identity and its online-ness (for lack of a better word).
|
||||
|
||||
## TNo : Tenant Network Object
|
||||
|
||||
TL;DR: The Network Description.
|
||||
We named it so, because it is a data structure that describes the __whole__ network a user can request (or setup).
|
||||
That network is a virtualized overlay network.
|
||||
Basically that means that transfer of data in that network *always* is encrypted, protected from prying eyes, and __resources in that network can only communicate with each other__ **unless** there is a special rule that allows access. Be it by allowing access through firewall rules, *and/or* through a proxy (a service that forwards requests on behalf of, and ships replies back to the client).
|
||||
|
||||
## NR: Network Resource
|
||||
|
||||
TL;DR: the Node-local part of a TNo.
|
||||
The main building block of a TNo; i.e. each service of a user in a Node lives in an NR.
|
||||
Each Node hosts User services, whatever type of service that is. Every service in that specific node will always be solely part of the Tenant's Network. (read that twice).
|
||||
So: A Network Resource is the thing that interconnects all other network resources of the TN (Tenant Network), and provides routing/firewalling for these interconnects, including the default route to the BBI (Big Bad Internet), aka ExitPoint.
|
||||
All User services that run in a Node are in some way or another connected to the Network Resource (NR), which will provide ip packet forwarding and firewalling to all other network resources (including the Exitpoint) of the TN (Tenant Network) of the user. (read that three times, and the last time, read it slowly and out loud)
|
@@ -0,0 +1,87 @@
|
||||
<h1> Introduction to Networkd</h1>
|
||||
|
||||
<h2> Table of Contents </h2>
|
||||
|
||||
- [Introduction](#introduction)
|
||||
- [Boot and initial setup](#boot-and-initial-setup)
|
||||
- [Networkd functionality](#networkd-functionality)
|
||||
- [Techie talk](#techie-talk)
|
||||
- [Wireguard explanations](#wireguard-explanations)
|
||||
- [Caveats](#caveats)
|
||||
|
||||
***
|
||||
|
||||
## Introduction
|
||||
|
||||
We provide an introduction to Networkd, the network manager of 0-OS.
|
||||
|
||||
## Boot and initial setup
|
||||
|
||||
At boot, be it from a USB stick or PXE, ZOS starts up the kernel with a few necessary parameters like the farm ID and/or possible network parameters, but basically once the kernel has started, [zinit](https://github.com/threefoldtech/zinit), among other things, starts the network initializer.
|
||||
|
||||
In short, that process loops over the available network interfaces and tries to obtain an IP address that also provides for a default gateway. That means: it tries to get Internet connectivity. Without it, ZOS stops there: unable to register itself or start other processes, there wouldn't be any use for it to be started anyway.
|
||||
|
||||
Once it has obtained Internet connectivity, ZOS can then proceed to make itself known to the Grid, and acknowledge its existence. It will then regularly poll the Grid for tasks.
|
||||
|
||||
Once initialized, with the network daemon running (a process that will handle all things related to networking), ZOS will set up some basic services so that workloads can themselves use that network.
|
||||
|
||||
## Networkd functionality
|
||||
|
||||
The network daemon is in itself responsible for a few tasks, and working together with the [provision daemon](../provision) it mainly sets up the local infrastructure to get the user network resources, together with the wireguard configurations for the user's mesh network.
|
||||
|
||||
The Wireguard mesh is an overlay network. That means that traffic of that network is encrypted and encapsulated in a new traffic frame that then gets transferred over the underlay network, here in essence the network that has been set up during boot of the node.
|
||||
|
||||
For users or workloads that run on top of the mesh, the mesh network looks and behaves like any other directly connected workload, and as such that workload can reach other workloads or services in that mesh with the added advantage that that traffic is encrypted, protecting services and communications over that mesh from too curious eyes.
|
||||
|
||||
That also means that workloads between nodes in a local network of a farmer are even protected from the farmer himself, in essence protecting the user from the farmer in case that farmer could become too curious.
|
||||
|
||||
As the nodes do not have any way to be accessed, be it over the underlaying network or even the local console of the node, a user can be sure that his workload cannot be snooped upon.
|
||||
|
||||
## Techie talk
|
||||
|
||||
- **boot and initial setup**
|
||||
For ZOS to work at all (the network is the computer), it needs an internet connection. That is: it needs to be able to communicate with the BCDB over the internet.
|
||||
So ZOS starts with that: with the `internet` process, which tries to get the node to receive an IP address. That process will have set up a bridge (`zos`), connected to an interface that is on an Internet-capable network. That bridge will have an IP address that has Internet access.
|
||||
Also, that bridge is there for future public interfaces into workloads.
|
||||
Once ZOS can reach the Internet, the rest of the system can be started, where ultimately, the `networkd` daemon is started.
|
||||
|
||||
- **networkd initial setup**
|
||||
`networkd` starts by taking inventory of the available network interfaces and registers them to the BCDB (grid database), so that farmers can specify non-standard configs, as for multi-NIC machines. Once that is done, `networkd` registers itself to the zbus, so it can receive tasks to execute from the provisioning daemon (`provisiond`).
|
||||
These tasks are mostly setting up network resources for users, where a network resource is a subnet in the user's wireguard mesh.
|
||||
|
||||
- **multi-nic setups**
|
||||
|
||||
When someone is a farmer, operating nodes somewhere in a datacentre, where the nodes have multiple NICs, it is advisable (though not necessary) to keep OOB traffic (like the initial boot setup) and user traffic (both the overlay network and the outgoing IPv4 NAT for nodes) on different NICs. With these parameters, one will have to make sure the switches are properly configured; more in the docs later.
|
||||
|
||||
- **registering and configurations**
|
||||
|
||||
Once a node has booted and properly initialized, registering and configuring the node to be able to accept workloads and their associated network configs, is a two-step process.
|
||||
First, the node registers its live network setup to the BCDB. That is: all NICs with their associated IP addresses and routes are registered, so a farm admin can in a second phase configure eventual separate NICs to handle different kinds of workloads.
|
||||
In that secondary phase, a farm admin can then set up the NICs and their associated IPs manually, so that workloads can start using them.
|
||||
|
||||
## Wireguard explanations
|
||||
|
||||
- **wireguard as pointopoint links and what that means**
|
||||
Wireguard is a special type of VPN, where every instance is both a server for multiple peers and a client towards multiple peers. That way you can create fanning-out connections as well as receive connections from multiple peers, effectively creating a mesh of connections, like this: 
|
||||
|
||||
- **wireguard port management**
|
||||
Every wireguard point (a network resource point) needs a destination/port combo when it's publicly reachable. The destination is a public ip, but the port is the differentiator. So we need to make sure every network wireguard listening port is unique in the node where it runs, and can be reapplied in case of a node's reboot.
|
||||
ZOS registers the ports **already in use** to the BCDB, so a user can then pick a port that is not yet used (a small sketch covering ports and keepalives follows this list).
|
||||
|
||||
- **wireguard and hidden nodes**
|
||||
Hidden nodes are nodes that are in essence hidden behind a firewall on an internal network and unreachable from the Internet, be it as an IPv4 NATed host or an IPv6 host that is firewalled in any way, where it's impossible to have connection initiations from the Internet to the node.
|
||||
As such, these nodes can only partake in a network as client-only towards publicly reachable peers, and can only initiate the connections themselves. (ref previous drawing).
|
||||
To make sure connectivity stays up, the clients (all) have a keepalive towards all their peers so that communications towards network resources in hidden nodes can be established.
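
As a minimal sketch of these two points (a unique listen port per network resource, and keepalives for peers behind NAT or a firewall), assuming plain `wg`/`ip` tooling; the interface name and `peer.pub` file are placeholders, while the endpoint, port and prefix reuse the example values from the network schema above:

```bash
# One wireguard interface per network resource, each with its own unique listen port.
wg genkey | tee nr.key | wg pubkey > nr.pub
ip link add wg-nr-cust1 type wireguard
wg set wg-nr-cust1 listen-port 1600 private-key nr.key   # the port must be free on this node

# A hidden peer keeps the NAT/firewall state alive towards its publicly reachable peers.
wg set wg-nr-cust1 peer "$(cat peer.pub)" \
    endpoint "[2a02:1802:5e::223]:1600" \
    allowed-ips 2001:b:a:8ac6::/64 \
    persistent-keepalive 25
```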
|
||||
|
||||
## Caveats
|
||||
|
||||
- **hidden nodes**
|
||||
Hidden nodes live (mostly) behind firewalls that keep state about connections, and these states have a lifetime. We try our best to keep these communications going, but depending on the firewall your mileage may vary (YMMV ;-))
|
||||
|
||||
- **local underlay network reachability**
|
||||
When multiple nodes live in the same hidden network, at the moment we don't try to have the nodes establish connectivity between themselves, so all nodes in that hidden network can only reach each other through the intermediary of a node that is publicly reachable. So to get some performance, a farmer will have to have real routable nodes available in the vicinity.
|
||||
So for now, a farmer is better off to have his nodes really reachable over a public network.
|
||||
|
||||
- **IPv6 and IPv4 considerations**
|
||||
While the mesh can work over IPv4 __and__ IPv6 at the same time, the peers can only be reached through one protocol at a time. That is, a peer is IPv4 __or__ IPv6, not both. Hence if a peer is reachable over IPv4, the client towards that peer needs to reach it over IPv4 too and thus needs an IPv4 address.
|
||||
We advise strongly to have all nodes properly set-up on a routable unfirewalled IPv6 network, so that these problems have no reason to exist.
|
134
collections/developers/internals/zos/internals/network/mesh.md
Normal file
@@ -0,0 +1,134 @@
|
||||
<h1> Zero-Mesh</h1>
|
||||
|
||||
<h2> Table of Contents </h2>
|
||||
|
||||
- [What It Is](#what-it-is)
|
||||
- [Overlay Network](#overlay-network)
|
||||
- [ZOS networkd](#zos-networkd)
|
||||
- [Internet reachability per Network Resource](#internet-reachability-per-network-resource)
|
||||
- [Interworkings](#interworkings)
|
||||
- [Network Resource Internals](#network-resource-internals)
|
||||
|
||||
***
|
||||
|
||||
## What It Is
|
||||
|
||||
When a user wants to deploy a workload, whatever that may be, that workload needs connectivity.
|
||||
If there is just one service to be run, things can be simple, but in general there is more than one service that needs to interact to provide a full stack. Sometimes these services can live on one node, but mostly these services will be deployed over multiple nodes, in different containers.
|
||||
The Mesh is created for that, where containers can communicate over an encrypted path, and that network can be specified in terms of IP addresses by the user.
|
||||
|
||||
## Overlay Network
|
||||
|
||||
Zero-Mesh is an overlay network. That requires nodes to have a properly working network with existing access to the Internet in the first place, be it full-blown public access, or behind a firewall/home router that provides private IP NAT to the internet.
|
||||
|
||||
Right now Zero-Mesh has support for both, where nodes behind a firewall are HIDDEN nodes, and nodes that are directly connected, be it over IPv6 or IPv4 as 'normal' nodes.
|
||||
Hidden nodes can thus only be participating as client nodes for a specific user Mesh, and all publicly reachable nodes can act as aggregators for hidden clients in that user Mesh.
|
||||
|
||||
Also, a Mesh is static: once it is configured, and thus during the lifetime of the network, there is one node containing the aggregator for Mesh clients that live on hidden nodes. So if an aggregator node has died or is not reachable any more, the mesh needs to be reapplied, with __some__ publicly reachable node as aggregator node.
|
||||
|
||||
So it goes a bit like 
|
||||
The Exit labeled NR in that graph is the point where Network Resources in Hidden Nodes connect to. These Exit NRs are then the transfer nodes between Hidden NRs.
|
||||
|
||||
## ZOS networkd
|
||||
|
||||
The networkd daemon receives tasks from the provisioning daemon, so that it can create the necessary resources for a Mesh participator in the User Network (A network Resource - NR).
|
||||
|
||||
A network is defined as a whole by the User, using the tools in the 3bot to generate a proper configuration that can be used by the network daemon.
|
||||
|
||||
What networkd takes care of, is the establishment of the mesh itself, in accordance with the configuration a farmer has given to his nodes. What is configured on top of the Mesh is user defined, and applied as such by the networkd.
|
||||
|
||||
## Internet reachability per Network Resource
|
||||
|
||||
Every node that participates in a User mesh, will also provide for Internet access for every network resource.
|
||||
That means that every NR has the same Internet access as the node itself. It also means, in terms of security, that a firewall in the node takes care of blocking all types of entry to the NR, effectively being an Internet access diode: outgoing and related traffic only.
|
||||
In a later phase a user will be able to define some network resource as __sole__ outgoing Internet Access point, but for now that is not yet defined.
|
||||
|
||||
## Interworkings
|
||||
|
||||
So How is that set up ?
|
||||
|
||||
Every node participating in a User Network, sets up a Network Resource.
|
||||
Basically, it's a Linux Network Namespace (sort of a network virtual machine), that contains a wireguard interface that has a list of other Network resources it needs to route encrypted packets toward.
|
||||
|
||||
A User Network has a range, typically a `/16` (like `10.1.0.0/16`), that is user defined. The User then picks a subnet from that range (e.g. `10.1.1.0/24`) to assign to every new NR he wants to participate in that Network.
|
||||
|
||||
Workloads that are then provisioned are started in a newly created Container, and that container gets a User assigned IP __in__ that subnet of the Network Resource.
|
||||
|
||||
The Network resource itself then handles the routing and firewalling for the containers that are connected to it. Also, the Network Resource takes care of internet connectivity, so that the container can reach out to other services on the Internet.
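
A very rough sketch of what such a Network Resource amounts to on a node, using plain `ip` commands and placeholder names (the real setup is done by networkd and is more involved; the addresses reuse the example range above):

```bash
# A network resource is roughly: a namespace holding a wireguard interface
# that carries the user-chosen subnet of the user network.
ip netns add nr-cust1                               # the NR namespace
ip link add wg-cust1 type wireguard                 # wireguard interface created in the host...
ip link set wg-cust1 netns nr-cust1                 # ...and moved into the NR namespace
ip -n nr-cust1 addr add 10.1.1.1/24 dev wg-cust1    # subnet picked from the 10.1.0.0/16 range
ip -n nr-cust1 link set wg-cust1 up
```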
|
||||
|
||||

|
||||
|
||||
Also in a later phase, a User will be able to add IPv6 prefixes to his Network Resources, so that containers are reachable over IPv6.
|
||||
|
||||
Fully-routed IPv6 will then be available, where an Exit NR will be the entrypoint towards that network.
|
||||
|
||||
## Network Resource Internals
|
||||
|
||||
Each NR is basically a router for the User Network, but to allow NRs to access the Internet through the Node's local connection, there are some other internal routers to be added.
|
||||
|
||||
Internally it looks like this :
|
||||
|
||||
```text
|
||||
+------------------------------------------------------------------------------+
|
||||
| |wg mesh |
|
||||
| +-------------+ +-----+-------+ |
|
||||
| | | | NR cust1 | 100.64.0.123/16 |
|
||||
| | container +----------+ 10.3.1.0/24 +----------------------+ |
|
||||
| | cust1 | veth| | public | |
|
||||
| +-------------+ +-------------+ | |
|
||||
| | |
|
||||
| +-------------+ +-------------+ | |
|
||||
| | | | NR cust200 | 100.64.4.200/16 | |
|
||||
| | container +----------+ 10.3.1.0/24 +----------------------+ |
|
||||
| | cust200 | veth| | public | |
|
||||
| +-------------+ +------+------+ | |
|
||||
| |wg mesh | |
|
||||
| 10.101.123.34/16 | |
|
||||
| +------------+ |tonrs |
|
||||
| | | +------------------+ |
|
||||
| | zos +------+ | 100.64.0.1/16 | |
|
||||
| | | | 10.101.12.231/16| ndmz | |
|
||||
| +---+--------+ NIC +-----------------------------+ | |
|
||||
| | | public +------------------+ |
|
||||
| +--------+------+ |
|
||||
| | |
|
||||
| | |
|
||||
+------------------------------------------------------------------------------+
|
||||
|
|
||||
|
|
||||
|
|
||||
| 10.101.0.0/16 10.101.0.1
|
||||
+------------------+------------------------------------------------------------
|
||||
|
||||
NAT
|
||||
--------
|
||||
rules NR custA
|
||||
nft add rule inet nat postrouting oifname public masquerade
|
||||
nft add rule inet filter input iifname public ct state { established, related } accept
|
||||
nft add rule inet filter input iifname public drop
|
||||
|
||||
rules NR custB
|
||||
nft add rule inet nat postrouting oifname public masquerade
|
||||
nft add rule inet filter input iifname public ct state { established, related } accept
|
||||
nft add rule inet filter input iifname public drop
|
||||
|
||||
rules ndmz
|
||||
nft add rule inet nat postrouting oifname public masquerade
|
||||
nft add rule inet filter input iifname public ct state { established, related } accept
|
||||
nft add rule inet filter input iifname public drop
|
||||
|
||||
|
||||
Routing
|
||||
|
||||
if NR only needs to get out:
|
||||
ip route add default via 100.64.0.1 dev public
|
||||
|
||||
if an NR wants to use another NR as exitpoint
|
||||
ip route add default via destnr
|
||||
with for AllowedIPs 0.0.0.0/0 on that wg peer
|
||||
|
||||
```
|
||||
|
||||
During startup of the Node, the ndmz is put in place following the configuration: either the single internet connection is used, or, with a dual-NIC setup, a separate NIC is used for internet access.
|
||||
|
||||
The ndmz network has the carrier-grade NAT allocation (`100.64.0.0/10`) assigned, so we don't interfere with RFC1918 private IPv4 address space; users can use any of those (but not `100.64.0.0/10`, of course).
|
@@ -0,0 +1,8 @@
|
||||
<h1> Zero-OS Networking </h1>
|
||||
|
||||
<h2> Table of Contents </h2>
|
||||
|
||||
- [Introduction to networkd](./introduction.md)
|
||||
- [Vocabulary Definitions](./definitions.md)
|
||||
- [Wireguard Mesh Details](./mesh.md)
|
||||
- [Farm Network Setup](./setup_farm_network.md)
|
@@ -0,0 +1,123 @@
|
||||
<h1>Setup</h1>
|
||||
|
||||
<h2> Table of Contents </h2>
|
||||
|
||||
- [Introduction](#introduction)
|
||||
- [Running ZOS (v2) at home](#running-zos-v2-at-home)
|
||||
- [Running ZOS (v2) in a multi-node farm in a DC](#running-zos-v2-in-a-multi-node-farm-in-a-dc)
|
||||
- [Necessities](#necessities)
|
||||
- [IPv6](#ipv6)
|
||||
- [Routing/firewalling](#routingfirewalling)
|
||||
- [Multi-NIC Nodes](#multi-nic-nodes)
|
||||
- [Farmers and the grid](#farmers-and-the-grid)
|
||||
|
||||
***
|
||||
|
||||
## Introduction
|
||||
|
||||
We present ZOSv2 network considerations.
|
||||
|
||||
Running ZOS on a node is just a matter of booting it with a USB stick, or with a dhcp/bootp/tftp server with the right configuration so that the node can start the OS.
|
||||
Once it starts booting, the OS detects the NICs, and starts the network configuration. A Node can only continue its boot process till the end when it effectively has received an IP address and a route to the Internet. Without that, the Node will retry indefinitely to obtain Internet access and not finish its startup.
|
||||
|
||||
So a Node needs to be connected to a __wired__ network, providing a dhcp server and a default gateway to the Internet, be it NATed or plainly on the public network, where any route to the Internet, be it IPv4 or IPv6 or both is sufficient.
|
||||
|
||||
For a node to have that ability to host user networks, we **strongly** advise to have a working IPv6 setup, as that is the primary IP stack we're using for the User Network's Mesh to function.
|
||||
|
||||
## Running ZOS (v2) at home
|
||||
|
||||
Running a ZOS Node at home is plain simple. Connect it to your router, plug it in the network, insert the preconfigured USB stick containing the bootloader and the `farmer_id`, power it on.
|
||||
You will then see it appear in the Cockpit (`https://cockpit.testnet.grid.tf/capacity`), under your farm.
|
||||
|
||||
## Running ZOS (v2) in a multi-node farm in a DC
|
||||
|
||||
Multi-Node Farms, where a farmer wants to host the nodes in a data centre, have basically the same simplicity, but the nodes can boot from a boot server that provides for DHCP, and also delivers the iPXE image to load, without the need for a USB stick in every Node.
|
||||
|
||||
A boot server is not really necessary, but it helps ;-). That server has a list of the MAC addresses of the nodes, and delivers the bootloader over PXE. The farmer is responsible to set-up the network, and configure the boot server.
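
As a sketch of such a boot server, dnsmasq can handle both the static DHCP leases and the PXE/TFTP part (the range, MAC address, hostname, boot file name and paths below are placeholders to adapt to your own setup):

```bash
# dnsmasq as a small boot server on the OOB segment: fixed lease per MAC, plus PXE boot file.
cat >> /etc/dnsmasq.conf << 'EOF'
dhcp-range=10.5.0.100,10.5.0.200,12h
dhcp-host=0c:c4:7a:51:e3:6a,10.5.0.11,zos-node-1
enable-tftp
tftp-root=/srv/tftp
dhcp-boot=ipxe.efi
EOF
```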
|
||||
|
||||
### Necessities
|
||||
|
||||
The Farmer needs to:
|
||||
|
||||
- Obtain an IPv6 prefix allocation from the provider. A `/64` will do, that is publicly reachable, but a `/48` is advisable if the farmer wants to provide IPv6 transit for User Networks
|
||||
- If IPv6 is not an option, obtain an IPv4 subnet from the provider. At least one IPv4 address per node is needed, where all IP addresses are publicly reachable.
|
||||
- Have the Nodes connected on that public network with a switch so that all Nodes are publicly reachable.
|
||||
- In case of multiple NICS, also make sure his farm is properly registered in BCDB, so that the Node's public IP Addresses are registered.
|
||||
- Properly list the MAC addresses of the Nodes, and configure the DHCP server to provide for an IP address, and in case of multiple NICs also provide for private IP addresses over DHCP per Node.
|
||||
- Make sure that after first boot, the Nodes are reachable.
|
||||
|
||||
### IPv6
|
||||
|
||||
IPv6, although a real protocol since '98, has seen reluctant adoption over the time it has existed. That is mostly because ISPs and carriers were reluctant to deploy it, not seeing the need since the advent of NAT and private IP space, which gave a false impression of security.
|
||||
But this month (10/2019), RIPE sent a mail to all its LIRs that the last consecutive /22 in IPv4 has been allocated. Needless to say, that makes the transition to IPv6 in 2019 of utmost importance and necessity.
|
||||
Hence, ZOS starts with IPv6, and IPv4 is merely an afterthought ;-)
|
||||
So in a nutshell: we greatly encourage Farmers to have IPv6 on the Node's network.
|
||||
|
||||
### Routing/firewalling
|
||||
|
||||
Basically, the Nodes are self-protecting, in the sense that they provide no means at all to be accessed through listening processes at all. No service is active on the node itself, and User Networks function solely on an overlay.
|
||||
That also means that there is no need for a Farm admin to protect the Nodes from exterior access, albeit some DDoS protection might be a good idea.
|
||||
In the first phase we will still allow the Host OS (ZOS) to reply on ICMP ping requests, but that 'feature' might as well be blocked in the future, as once a Node is able to register itself, there is no real need to ever want to try to reach it.
|
||||
|
||||
### Multi-NIC Nodes
|
||||
|
||||
Nodes that Farmers deploy are typically multi-NIC Nodes, where one (typically a 1GBit NIC) can be used for getting a proper DHCP server running from where the Nodes can boot, and another NIC (1GBit or even 10GBit) that is then used for transfers of User Data, so that there is a clean separation and injection of bogus data is not possible.
|
||||
|
||||
That means that there would be two networks, either by different physical switches, or by port-based VLANs in the switch (if there is only one).
|
||||
|
||||
- Management NICs
|
||||
The Management NIC will be used by ZOS to boot, and register itself to the GRID. Also, all communications from the Node to the Grid happens from there.
|
||||
- Public NICs
|
||||
|
||||
### Farmers and the grid
|
||||
|
||||
A Node, being part of the Grid, has no concept of 'Farmer'. The only relationship for a Node with a Farmer is the fact that it is registered 'somewhere (TM)', and that as such workloads on a Node will be remunerated with Tokens. For the rest, a Node is a wholly stand-alone thing that participates in the Grid.
|
||||
|
```text
172.16.1.0/24
2a02:1807:1100:10::/64
+--------------------------------------+
| +--------------+ | +-----------------------+
| |Node ZOS | +-------+ | |
| | +-------------+1GBit +--------------------+ 1GBit switch |
| | | br-zos +-------+ | |
| | | | | |
| | | | | |
| | | | +------------------+----+
| +--------------+ | | +-----------+
| | OOB Network | | |
| | +----------+ ROUTER |
| | | |
| | | |
| | | |
| +------------+ | +----------+ |
| | Public | | | | |
| | container | | | +-----+-----+
| | | | | |
| | | | | |
| +---+--------+ | +-------------------+--------+ |
| | | | 10GBit Switch | |
| br-pub| +-------+ | | |
| +-----+10GBit +-------------------+ | +---------->
| +-------+ | | Internet
| | | |
| | +----------------------------+
+--------------------------------------+
185.69.167.128/26 Public network
2a02:1807:1100:0::/64
```
The underlay part of the wireguard interfaces gets instantiated in the Public container (namespace), and once created these wireguard interfaces get sent into the User Network (Network Resource), where a user can then configure the interface as he sees fit.

The router of the farmer fulfills 2 roles:

- NAT everything in the OOB network to the outside, so that nodes can start and register themselves, as well as get tasks to execute from the BCDB.
- Route the assigned IPv4 subnet and IPv6 public prefix on the public segment, to which the public container is connected.

As such, in case the farmer wants to provide IPv4 public access for grid proxies, the node will need at least one (1) IPv4 address. The farmer is free to assign IPv4 addresses to only a part of the Nodes.
On the other hand, it is quite important to have a proper IPv6 setup, because things will work out better that way.

It's the Farmer's task to set up the Router and the switches.

In a simpler setup (a small number of nodes, for instance), the farmer could set up a single switch with 2 port-based VLANs to separate OOB and Public, or, with single-NIC nodes, just put them directly on the public segment, but then he will have to provide a DHCP server on the Public network.
@@ -0,0 +1,68 @@
# On boot

> This is set up by the `internet` daemon, which is part of the bootstrap process.

The first basic network setup is done here; the point of this setup is to connect the node to the internet, so the rest of the boot process can continue.

- Go over all **PLUGGED, and PHYSICAL** interfaces
- Each matching interface is tested to see if it can get both IPv4 and IPv6
- If multiple interfaces receive an IPv4 from DHCP, we pick the one with the `smallest` IP among those with a private gateway IP; if no private gateway IP is found, we simply pick the one with the smallest IP (see the sketch after this section)
- Once the interface is found we do the following: (we will call this interface **eth**)
  - Create a bridge named `zos`
  - Disable IPv6 on this bridge, and IPv6 forwarding
  - Run `udhcpc` on the `zos` bridge



Once this setup is complete, the node has access to the internet, which allows it to download and run `networkd`, which takes over the network stack and continues the process as follows.
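A minimal sketch of that selection rule in Go, assuming one `candidate` record per interface that obtained a DHCP lease (the type and function names are illustrative, not the actual zos code):

```go
// Illustrative sketch of the boot interface selection rule (not the zos code).
package boot

import (
	"bytes"
	"net"
)

// candidate is an interface that obtained an IPv4 lease while probing.
type candidate struct {
	name    string
	ip      net.IP // IPv4 received from DHCP
	gateway net.IP // gateway received with the lease, may be nil
}

// pickBootInterface prefers candidates whose gateway is a private IP and,
// within the preferred group, picks the numerically smallest IPv4 address.
func pickBootInterface(cands []candidate) *candidate {
	var best *candidate
	bestPrivate := false
	for i := range cands {
		c := &cands[i]
		private := c.gateway != nil && c.gateway.IsPrivate()
		switch {
		case best == nil:
			best, bestPrivate = c, private
		case private && !bestPrivate:
			best, bestPrivate = c, private
		case private == bestPrivate && bytes.Compare(c.ip.To4(), best.ip.To4()) < 0:
			best, bestPrivate = c, private
		}
	}
	return best
}
```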
# Network Daemon

- Validate the zos setup created by the `internet` on-boot daemon
- Send information about all local nics to the explorer (?)

## Setting up `ndmz`

First we need to find the master interface for ndmz. We have the following cases (the fallback chain is sketched below):

- master of `public_config` if set. Public Config is an external configuration that is set by the farmer on the node object; that information is retrieved by the node from the public explorer.
- otherwise (if public_config is not set) check if the public namespace is set (I think that's a dead branch, because if this exists (or can exist) it means the master is always set, which means it will always get used).
- otherwise find the first interface with IPv6
- otherwise check if `zos` has a global unicast IPv6
- otherwise hidden node (still uses `zos` but in the hidden node setup)
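The same fallback chain as a hedged sketch; the `node` fields and the helper are assumptions used only to make the order explicit, not the actual networkd types:

```go
// Illustrative sketch of the ndmz master selection fallback (not the zos code).
package ndmz

// inputs networkd is assumed to have gathered before this decision
type node struct {
	publicConfigMaster string // set when the farmer configured public_config
	publicNamespaceSet bool   // believed to be a dead branch, see above
	firstIfaceWithIPv6 string // empty when no interface has IPv6
	zosHasGlobalIPv6   bool
}

// findMaster walks the cases in order; hidden=true means the hidden-node
// ndmz setup is used, still on top of the zos bridge.
func findMaster(n node) (master string, hidden bool) {
	switch {
	case n.publicConfigMaster != "":
		return n.publicConfigMaster, false
	case n.publicNamespaceSet:
		return n.publicConfigMaster, false // dead branch: master would already be set
	case n.firstIfaceWithIPv6 != "":
		return n.firstIfaceWithIPv6, false
	case n.zosHasGlobalIPv6:
		return "zos", false
	default:
		return "zos", true
	}
}
```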
### Hidden node ndmz

![hidden node](png/hidden-node.png)

### Dualstack ndmz

![dualstack](png/dualstack.png)

## Setting up Public Config

This is an external configuration step that is configured by the farmer on the node object; the node then retrieves this setup from the explorer.

![public namespace](png/public-namespace.png)
## Setting up Yggdrasil

- Get a list of all public peers with status `up`
- If hidden node:
  - Find peers with IPv4 addresses
- If dual stack node:
  - Filter out all peers with the same prefix as the node, to avoid connecting locally only (sketched below)
- Write down the yggdrasil config, and start the yggdrasil daemon via zinit
- yggdrasil runs inside the ndmz namespace
- Add an IPv6 address to npub in the same prefix as yggdrasil. This way, when npub6 is used as a gateway for this prefix, traffic will be routed through yggdrasil.
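A sketch of that filtering step; the peer list format and the helper are assumptions (yggdrasil peer URIs typically look like `tcp://host:port` or `tls://host:port`):

```go
// Illustrative sketch of the yggdrasil peer filtering (not the zos code).
package ygg

import (
	"net"
	"strings"
)

// filterPeers keeps the peers a node should connect to: hidden nodes keep
// IPv4 peers only, dual stack nodes drop peers sharing their own IPv6 prefix
// so they don't end up connecting locally only.
func filterPeers(peers []string, hidden bool, ownPrefix *net.IPNet) []string {
	var out []string
	for _, p := range peers {
		i := strings.Index(p, "://")
		if i < 0 {
			continue
		}
		host, _, err := net.SplitHostPort(p[i+3:])
		if err != nil {
			continue
		}
		ip := net.ParseIP(host)
		if ip == nil {
			continue
		}
		if hidden {
			if ip.To4() != nil {
				out = append(out, p)
			}
			continue
		}
		if ownPrefix != nil && ownPrefix.Contains(ip) {
			continue // same prefix as the node, skip
		}
		out = append(out, p)
	}
	return out
}
```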
# Creating a network resource

A network resource (`NR` for short) is a user private network that lives on the node and can span multiple nodes over wireguard. When a network is deployed, the node builds a user namespace as follows:

- A unique network id is generated by md5sum(user_id + network_name), then taking only the first 13 bytes. We will call this `net-id` (see the sketch below).

![network resource](png/nr.png)
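A sketch of that derivation; whether the 13 bytes are taken from the hex string or from the raw digest is not specified here, so this sketch truncates the hex encoding, and the identifiers are made up:

```go
// Sketch of the net-id derivation described above (illustrative, not the
// exact zos implementation).
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// netID hashes user id + network name and keeps the first 13 characters.
func netID(userID, networkName string) string {
	sum := md5.Sum([]byte(userID + networkName))
	return hex.EncodeToString(sum[:])[:13]
}

func main() {
	// hypothetical identifiers, for illustration only
	fmt.Println(netID("user-42", "mynet"))
}
```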
## Create the wireguard interface

If the node has a `public_config`, the `public` namespace exists; the wireguard device is then first created inside the `public` namespace and moved to the network-resource namespace.

Otherwise, the device is created in the host namespace and then moved to the network-resource namespace. The final result is

![network resource with wireguard](png/nr-wg.png)

Finally, the wireguard peer list is applied and configured; routing rules are also configured to route traffic to the wireguard interface.
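The same sequence, sketched with iproute2 commands driven from Go; the real implementation talks netlink directly, and the namespace and device names here are placeholders:

```go
// Illustrative shell-level equivalent of the create-then-move sequence above.
package wg

import (
	"fmt"
	"os/exec"
)

func run(args ...string) error {
	out, err := exec.Command(args[0], args[1:]...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("%v: %s", err, out)
	}
	return nil
}

// createWG creates the wireguard device in srcNS ("public" when public_config
// is set, empty for the host) and moves it into the network resource namespace.
func createWG(srcNS, nrNS, dev string) error {
	create := []string{"ip", "link", "add", dev, "type", "wireguard"}
	move := []string{"ip", "link", "set", dev, "netns", nrNS}
	if srcNS != "" {
		// run both commands inside the public namespace
		create = append([]string{"ip", "netns", "exec", srcNS}, create...)
		move = append([]string{"ip", "netns", "exec", srcNS}, move...)
	}
	if err := run(create...); err != nil {
		return err
	}
	return run(move...)
}
```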
# Member joining a user network (network resource)

![member](png/member.png)
@@ -0,0 +1,57 @@
@startuml
[zos\nbridge] as zos
[br-pub\nbridge] as brpub
[br-ndmz\nbridge] as brndmz
note top of brndmz
disable ipv6
- net.ipv6.conf.br-ndmz.disable_ipv6 = 1
end note
' brpub -left- zos : veth pair\n(tozos)
brpub -down- master
note right of master
master is found as described
in the readme (this can be zos bridge)
in case of a single node machine
end note

package "ndmz namespace" {
[tonrs\nmacvlan] as tonrs
note bottom of tonrs
- net.ipv4.conf.tonrs.proxy_arp = 0
- net.ipv6.conf.tonrs.disable_ipv6 = 0

Addresses:
100.127.0.1/16
fe80::1/64
fd00::1
end note
tonrs - brndmz: macvlan

[npub6\nmacvlan] as npub6
npub6 -down- brpub: macvlan

[npub4\nmacvlan] as npub4
npub4 -down- zos: macvlan

note as MAC
gets static mac address generated
from node id. to make sure it receives
same ip address.
end note

MAC .. npub4
MAC .. npub6

note as setup
- net.ipv6.conf.all.forwarding = 1
end note

[ygg0]
note bottom of ygg0
this will be added by yggdrasil setup
in the next step
end note
}

footer (hidden node) no master with global unicast ipv6 found
@enduml
@@ -0,0 +1,55 @@
@startuml
[zos\nbridge] as zos
note left of zos
currently selected master
for hidden ndmz setup
end note
[br-pub\nbridge] as brpub
[br-ndmz\nbridge] as brndmz
note top of brndmz
disable ipv6
- net.ipv6.conf.br-ndmz.disable_ipv6 = 1
end note
brpub -left- zos : veth pair\n(tozos)

package "ndmz namespace" {
[tonrs\nmacvlan] as tonrs
note bottom of tonrs
- net.ipv4.conf.tonrs.proxy_arp = 0
- net.ipv6.conf.tonrs.disable_ipv6 = 0

Addresses:
100.127.0.1/16
fe80::1/64
fd00::1
end note
tonrs - brndmz: macvlan

[npub6\nmacvlan] as npub6
npub6 -right- brpub: macvlan

[npub4\nmacvlan] as npub4
npub4 -down- zos: macvlan

note as MAC
gets static mac address generated
from node id. to make sure it receives
same ip address.
end note

MAC .. npub4
MAC .. npub6

note as setup
- net.ipv6.conf.all.forwarding = 1
end note

[ygg0]
note bottom of ygg0
this will be added by yggdrasil setup
in the next step
end note
}

footer (hidden node) no master with global unicast ipv6 found
@enduml
@@ -0,0 +1,23 @@
@startuml

component "br-pub" as public
component "b-<netid>\nbridge" as bridge
package "<reservation-id> namespace" {
component eth0 as eth
note right of eth
set ip as configured in the reservation
it must be in the subnet assigned to n-<netid>
in the user resource above.
- set default route through n-<netid>
end note
eth .. bridge: veth

component [pub\nmacvlan] as pub
pub .. public

note right of pub
only if public ipv6 is requested
also gets a consistent MAC address
end note
}
@enduml
@@ -0,0 +1,31 @@
@startuml
component [b-<netid>] as bridge
note left of bridge
- net.ipv6.conf.b-<netid>.disable_ipv6 = 1
end note

package "n-<netid> namespace" {
component [n-<netid>\nmacvlan] as nic
bridge .. nic: macvlan

note bottom of nic
- nic gets the first ip ".1" in the assigned
user subnet.
- an ipv6 derived from the assigned ipv4
- fe80::1/64
end note
component [public\nmacvlan] as public
note bottom of public
- gets an ipv4 in 100.127.0.9/16 range
- get an ipv6 in the fd00::/64 prefix
- route over 100.127.0.1
- route over fe80::1/64
end note
note as G
- net.ipv6.conf.all.forwarding = 1
end note
}

component [br-ndmz] as brndmz
brndmz .. public: macvlan
@enduml
@@ -0,0 +1,33 @@
@startuml
component [b-<netid>] as bridge
note left of bridge
- net.ipv6.conf.b-<netid>.disable_ipv6 = 1
end note

package "n-<netid> namespace" {
component [n-<netid>\nmacvlan] as nic
bridge .. nic: macvlan

note bottom of nic
- nic gets the first ip ".1" in the assigned
user subnet.
- an ipv6 derived from the assigned ipv4
- fe80::1/64
end note
component [public\nmacvlan] as public
note bottom of public
- gets an ipv4 in 100.127.0.9/16 range
- get an ipv6 in the fd00::/64 prefix
- route over 100.127.0.1
- route over fe80::1/64
end note
note as G
- net.ipv6.conf.all.forwarding = 1
end note
component [w-<netid>\nwireguard]
}

component [br-ndmz] as brndmz
brndmz .. public: macvlan
@enduml
@@ -0,0 +1,29 @@
@startuml

() "br-pub (Public Bridge)" as brpub

note bottom of brpub
This bridge is always created on boot, and is either
connected to the zos bridge (in single nic setup),
or to the second nic with public IPv6 (in dual nic setup)
end note

package "public namespace" {

[public\nmacvlan] as public
public -down- brpub: macvlan
note right of public
- have a static mac generated from node id
- set the ips as configured
- set the default gateways as configured
end note

note as global
inside namespace
- net.ipv6.conf.all.accept_ra = 2
- net.ipv6.conf.all.accept_ra_defrtr = 1
end note
}

@enduml
@@ -0,0 +1,16 @@
@startuml
() eth
[zos]
eth -up- zos
note left of zos
bridge takes same mac address as eth
(ipv6 is enabled on the bridge)
- net.ipv6.conf.zos.disable_ipv6 = 0
end note
note left of eth
disable ipv6 on interface:
(ipv6 is disabled on the nic)
- net.ipv6.conf.<eth>.disable_ipv6 = 1
- net.ipv6.conf.all.forwarding = 0
end note
@enduml
@@ -0,0 +1,25 @@
# Yggdrasil integration in 0-OS

Since day one, 0-OS v2 networking has been designed around IPv6. The goal was to avoid having to deal with exhausted IPv4 address space and to be ready for the future.

While this decision makes sense in the long term, it poses problems in the short term for farmers that only have access to IPv4 and are unable to ask their ISP for an upgrade.

In order to allow these IPv4-only nodes to join the grid, another overlay network has to be created between all the nodes. To achieve this, Yggdrasil has been selected.

## Yggdrasil

The [Yggdrasil network project](https://yggdrasil-network.github.io/) has been selected to be integrated into 0-OS. All 0-OS nodes run a yggdrasil daemon, which means all 0-OS nodes can now communicate over the yggdrasil network. The yggdrasil integration is an experiment planned in multiple phases:

- Phase 1: Allow 0-DB containers to be exposed over the yggdrasil network. Implemented in v0.3.5
- Phase 2: Allow containers to request an interface with a yggdrasil IP address.

## networkd bootstrap

When booting, networkd will wait for 2 minutes to receive an IPv6 address through router advertisement for its `npub6` interface in the ndmz network namespace.
If after 2 minutes no IPv6 is received, networkd will consider the node to be an IPv4-only node, switch to this mode and continue booting.
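A sketch of that bootstrap check, assuming a helper that looks for a global unicast IPv6 on `npub6`; in reality this runs inside the ndmz network namespace, and namespace switching is left out here for brevity:

```go
// Sketch of the 2-minute router-advertisement wait described above.
package bootstrap

import (
	"net"
	"time"
)

// hasGlobalIPv6 reports whether iface currently has a global unicast IPv6
// (ULA fd00::/8 addresses are excluded).
func hasGlobalIPv6(iface string) bool {
	ifi, err := net.InterfaceByName(iface)
	if err != nil {
		return false
	}
	addrs, err := ifi.Addrs()
	if err != nil {
		return false
	}
	for _, a := range addrs {
		ipnet, ok := a.(*net.IPNet)
		if !ok || ipnet.IP.To4() != nil {
			continue
		}
		if ipnet.IP.IsGlobalUnicast() && !ipnet.IP.IsPrivate() {
			return true
		}
	}
	return false
}

// ipv4Only polls npub6 for up to 2 minutes; if no IPv6 shows up the node is
// treated as IPv4-only and booting continues in that mode.
func ipv4Only() bool {
	deadline := time.Now().Add(2 * time.Minute)
	for time.Now().Before(deadline) {
		if hasGlobalIPv6("npub6") {
			return false
		}
		time.Sleep(5 * time.Second)
	}
	return true
}
```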
### 0-DB containers

For IPv4-only nodes, the 0-DB container will be exposed on top of a yggdrasil IPv6 address. Since all 0-OS nodes also run yggdrasil, these 0-DB containers will always be reachable from any container in the grid.

For dual stack nodes, the 0-DB container will also get a yggdrasil IP in addition to the already present public IPv6.
@@ -0,0 +1,46 @@
# Network module

## ZBus

Network module is available on zbus over the following channel

| module | object | version |
|--------|--------|---------|
| network|[network](#interface)| 0.0.1|

## Home Directory

network keeps some data in the following locations

| directory | path|
|----|---|
| root| `/var/cache/modules/network`|

## Interface

```go
// Networker is the interface for the network module
type Networker interface {
	// Create a new network resource
	CreateNR(Network) (string, error)
	// Delete a network resource
	DeleteNR(Network) error

	// Join a network (with network id) will create a new isolated namespace
	// that is hooked to the network bridge with a veth pair, and assign it a
	// new IP from the network resource range. The method returns the new
	// namespace name.
	// The member name specifies the name of the member, and must be unique.
	// The NetID is the network id to join.
	Join(networkdID NetID, containerID string, addrs []string) (join Member, err error)

	// ZDBPrepare creates a network namespace with a macvlan interface into it
	// to allow the 0-db container to be publicly accessible.
	// It returns the name of the network namespace created.
	ZDBPrepare() (string, error)

	// Addrs returns the IP addresses of an interface;
	// if the interface is in a network namespace, netns needs to be non-empty.
	Addrs(iface string, netns string) ([]net.IP, error)
}
```
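Assuming you already hold a `Networker` client (for example a zbus stub), a call to `Join` could look like the fragment below; the network id, member name and address are made up for the example:

```go
// Fragment reusing the Networker, NetID and Member types from the block
// above; the zbus client wiring is omitted and all values are illustrative.
func example(n Networker) error {
	member, err := n.Join(NetID("3fe9ab5c7d012"), "my-container", []string{"10.1.3.5"})
	if err != nil {
		return err
	}
	// member describes the created namespace (its name among other details)
	_ = member
	return nil
}
```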
@@ -0,0 +1,50 @@
<h1> Node Module</h1>

<h2> Table of Contents </h2>

- [Introduction](#introduction)
- [Zbus](#zbus)
- [Example](#example)

***

## Introduction

This module is responsible for registering the node on the grid, and for handling grid events. The node daemon broadcasts the intended events on zbus for other modules that are interested in those events.

The node also provides zbus interfaces to query some of the node information.

## Zbus

Node module is available on [zbus](https://github.com/threefoldtech/zbus) over the following channel

| module | object | version |
|--------|--------|---------|
| host   | host   | 0.0.1   |
| system | system | 0.0.1   |
| events | events | 0.0.1   |

## Example

```go
// SystemMonitor interface (provided by noded)
type SystemMonitor interface {
	NodeID() uint32
	Memory(ctx context.Context) <-chan VirtualMemoryStat
	CPU(ctx context.Context) <-chan TimesStat
	Disks(ctx context.Context) <-chan DisksIOCountersStat
	Nics(ctx context.Context) <-chan NicsIOCounterStat
}

// HostMonitor interface (provided by noded)
type HostMonitor interface {
	Uptime(ctx context.Context) <-chan time.Duration
}

// Events interface
type Events interface {
	PublicConfigEvent(ctx context.Context) <-chan PublicConfigEvent
	ContractCancelledEvent(ctx context.Context) <-chan ContractCancelledEvent
}
```
@@ -0,0 +1,35 @@
<h1>Provision Module</h1>

<h2> Table of Contents </h2>

- [ZBus](#zbus)
- [Introduction](#introduction)
- [Supported workload](#supported-workload)

***

## ZBus

This module is an autonomous module and is not reachable over `zbus`.

## Introduction

This module is responsible for provisioning/decommissioning workloads on the node.

It accepts new deployments over `rmb` and tries to bring them to reality by running a series of provisioning workflows based on the workload `type`.

`provisiond` knows about all available daemons and it contacts them over `zbus` to ask for the needed services. It then pulls everything together and updates the deployment with the workload state.

If the node was restarted, `provisiond` tries to bring all active workloads back to their original state.
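One way to picture the per-type dispatch (a sketch only; the `Engine`, `Workload` and `Provisioner` names are illustrative, not the actual provisiond types). The type strings follow the workload list in the next section.

```go
// Illustrative sketch of dispatching a workload to its provisioning workflow.
package provision

import (
	"context"
	"fmt"
)

// Workload is a simplified view of one entry in a deployment received over rmb.
type Workload struct {
	Type string // "network", "zmachine", "zmount", "public-ip", "zdb", "qsfs", "zlogs", "gateway"
	Data []byte // type-specific payload
}

// Provisioner brings one workload type to reality, typically by calling the
// owning daemon over zbus (networkd, vmd, storaged, ...).
type Provisioner func(ctx context.Context, wl Workload) error

// Engine routes each workload to the provisioner registered for its type.
type Engine struct {
	provisioners map[string]Provisioner
}

func (e *Engine) Provision(ctx context.Context, wl Workload) error {
	p, ok := e.provisioners[wl.Type]
	if !ok {
		return fmt.Errorf("unsupported workload type %q", wl.Type)
	}
	return p(ctx, wl)
}
```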
## Supported workload

0-OS currently supports 8 types of workloads:

- network
- `zmachine` (virtual machine)
- `zmount` (disk): usable only by a `zmachine`
- `public-ip` (v4 and/or v6): usable only by a `zmachine`
- [`zdb`](https://github.com/threefoldtech/0-DB) `namespace`
- [`qsfs`](https://github.com/threefoldtech/quantum-storage)
- `zlogs`
- `gateway`
153
collections/developers/internals/zos/internals/storage/readme.md
Normal file
@@ -0,0 +1,153 @@
<h1> Storage Module</h1>

<h2> Table of Contents </h2>

- [Introduction](#introduction)
- [ZBus](#zbus)
- [Overview](#overview)
- [List of sub-modules](#list-of-sub-modules)
- [On Node Booting](#on-node-booting)
- [zinit unit](#zinit-unit)
- [Interface](#interface)

***

## Introduction

This module is responsible for managing everything related to storage.

## ZBus

Storage module is available on zbus over the following channel

| module | object | version |
|--------|--------|---------|
| storage|[storage](#interface)| 0.0.1|

## Overview

On start, storaged takes ownership of all node disks and separates them into 2 different sets:

- SSD Storage: For each SSD disk available, a storage pool of type SSD is created
- HDD Storage: For each HDD disk available, a storage pool of type HDD is created

Then `storaged` can provide the following storage primitives:

- `subvolume`: (with quota). The btrfs subvolume can be used by `flistd` to support read-write operations on flists. Hence it can be used as the rootfs for containers and VMs. This storage primitive is only supported on `ssd` pools.
  - On boot, storaged will always create a permanent subvolume with id `zos-cache` (of 100G) which will be used by the system to persist state and to hold a cache of downloaded files.
- `vdisk`: Virtual disk that can be attached to virtual machines. This is only possible on `ssd` pools.
- `device`: a full disk that gets allocated and used by a single `0-db` service. Note that a single 0-db instance can serve multiple zdb namespaces for multiple users. This is only possible on `hdd` pools.

You can already tell that ZOS can work fine with no HDD (it will not be able to serve zdb workloads though), but not without SSD. Hence a zos node with no SSD will never register on the grid.
## List of sub-modules

- disks
- 0-db
- booting

## On Node Booting

When the module boots:

- Make sure to mount all available pools
- Scan available disks that are not used by any pool and create new pools on those disks. (All pools are now created with the `RaidSingle` policy)
- Try to find and mount a cache sub-volume under /var/cache.
- If no cache sub-volume is available, a new one is created and then mounted.

### zinit unit

The zinit unit file of the module specifies the command line, the test command, and the order in which the services need to be booted.

Storage module is a dependency for almost all other system modules, hence it has high boot precedence (calculated on boot) by zinit based on the configuration.

The storage module is only considered running if (and only if) /var/cache is ready.

```yaml
exec: storaged
test: mountpoint /var/cache
```
### Interface

```go
// StorageModule is the storage subsystem interface
// this should allow you to work with the following types of storage medium
// - full disks (device) (these are used by zdb)
// - subvolumes, these are used as read-write layers for 0-fs mounts
// - vdisks are used by zmachines
// this works as following:
// a storage module maintains a list of ALL disks on the system
// separated in 2 sets of pools (SSDs, and HDDs)
// ssd pools can only be used for
// - subvolumes
// - vdisks
// hdd pools are only used by zdb as one disk
type StorageModule interface {
	// Cache method returns information about the zos cache volume
	Cache() (Volume, error)

	// Total gives the total amount of storage available for a device type
	Total(kind DeviceType) (uint64, error)
	// BrokenPools lists the broken storage pools that have been detected
	BrokenPools() []BrokenPool
	// BrokenDevices lists the broken devices that have been detected
	BrokenDevices() []BrokenDevice
	// Monitor returns a stats stream about pools
	Monitor(ctx context.Context) <-chan PoolsStats

	// Volume management

	// VolumeCreate creates a new volume
	VolumeCreate(name string, size gridtypes.Unit) (Volume, error)

	// VolumeUpdate updates the size of an existing volume
	VolumeUpdate(name string, size gridtypes.Unit) error

	// VolumeLookup returns volume information for given name
	VolumeLookup(name string) (Volume, error)

	// VolumeDelete deletes a volume by name
	VolumeDelete(name string) error

	// VolumeList lists all volumes
	VolumeList() ([]Volume, error)

	// Virtual disk management

	// DiskCreate creates a virtual disk given name and size
	DiskCreate(name string, size gridtypes.Unit) (VDisk, error)

	// DiskResize resizes the disk to given size
	DiskResize(name string, size gridtypes.Unit) (VDisk, error)

	// DiskWrite writes the given raw image to disk
	DiskWrite(name string, image string) error

	// DiskFormat makes sure the disk has a filesystem; if it is already formatted nothing happens
	DiskFormat(name string) error

	// DiskLookup looks up a vdisk by name
	DiskLookup(name string) (VDisk, error)

	// DiskExists checks if a disk exists
	DiskExists(name string) bool

	// DiskDelete deletes a disk
	DiskDelete(name string) error

	// DiskList lists all vdisks
	DiskList() ([]VDisk, error)

	// Device management

	// Devices lists all "allocated" devices
	Devices() ([]Device, error)

	// DeviceAllocate allocates a new device (formats and gives it a new ID)
	DeviceAllocate(min gridtypes.Unit) (Device, error)

	// DeviceLookup inspects a previously allocated device
	DeviceLookup(name string) (Device, error)
}
```
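A hypothetical use of this interface (fragment only: the zbus client wiring is omitted, the names are made up, and it assumes `gridtypes` exposes a `Gigabyte` unit):

```go
// Fragment reusing the StorageModule interface above; values are illustrative.
func exampleStorage(s StorageModule) error {
	// allocate a 10 GiB subvolume, e.g. as a read-write layer for a flist
	if _, err := s.VolumeCreate("demo-rootfs", 10*gridtypes.Gigabyte); err != nil {
		return err
	}
	// create and format a 20 GiB virtual disk for a zmachine
	if _, err := s.DiskCreate("demo-disk", 20*gridtypes.Gigabyte); err != nil {
		return err
	}
	return s.DiskFormat("demo-disk")
}
```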
66
collections/developers/internals/zos/internals/vmd/readme.md
Normal file
@@ -0,0 +1,66 @@
<h1>VMD Module</h1>

<h2> Table of Contents </h2>

- [ZBus](#zbus)
- [Home Directory](#home-directory)
- [Introduction](#introduction)
- [zinit unit](#zinit-unit)
- [Interface](#interface)

***

## ZBus

The vmd module is available on zbus over the following channel

| module | object | version |
|--------|--------|---------|
| vmd|[vmd](#interface)| 0.0.1|

## Home Directory

vmd keeps some data in the following locations

| directory | path|
|----|---|
| root| `/var/cache/modules/containerd`|

## Introduction

The vmd module manages all virtual machine processes. It provides the interface to create, inspect, and delete virtual machines. It also monitors the VMs to make sure they are re-spawned if they crash. Internally it uses `cloud-hypervisor` to start the VM processes.

It also provides the interface to configure VM log streamers.

### zinit unit

`vmd` must run after containerd is running, and the node boot process is complete. Since it doesn't keep state, no dependency on `storaged` is needed.

```yaml
exec: vmd --broker unix:///var/run/redis.sock
after:
- boot
- networkd
```
## Interface

```go
// VMModule defines the virtual machine module interface
type VMModule interface {
	Run(vm VM) error
	Inspect(name string) (VMInfo, error)
	Delete(name string) error
	Exists(name string) bool
	Logs(name string) (string, error)
	List() ([]string, error)
	Metrics() (MachineMetrics, error)

	// VM Log streams

	// StreamCreate creates a stream for vm `name`
	StreamCreate(name string, stream Stream) error
	// StreamDelete deletes a stream by stream id.
	StreamDelete(id string) error
}
```
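A hypothetical call flow against this interface (fragment only; the `VM` fields are not shown in this document, so the `Name` field used below is an assumption):

```go
// Fragment reusing the VMModule interface above; identifiers are illustrative.
func exampleVM(m VMModule, vm VM) error {
	if err := m.Run(vm); err != nil {
		return err
	}
	info, err := m.Inspect(vm.Name) // assumes the VM type exposes a Name field
	if err != nil {
		return err
	}
	_ = info // VMInfo with the machine details
	return nil
}
```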