GitLab CI: Cache and Artifacts explained by example

This content originally appeared on DEV Community and was authored by Anton Yakutovich

Hi, DEV Community! I've been working in the software testing field for more than eight years. Apart from web services testing, I maintain CI/CD Pipelines in our team's GitLab.

Let's discuss the difference between GitLab cache and artifacts. I'll show how to configure the Pipeline for the Node.js app in a pragmatic way to achieve good performance and resource utilization.

There are three things you can watch forever: fire burning, water falling, and a build passing after your next commit. Nobody wants to wait too long for CI to complete, so it's better to set up all the tweaks that shorten the wait between the commit and the build status. Cache and artifacts to the rescue! They help drastically reduce the time it takes to run a Pipeline.

People often get confused when they have to choose between cache and artifacts. GitLab has good documentation, but its Node.js app with cache example and its Pipeline template for Node.js contradict each other.

Let's see what a Pipeline means in GitLab terms. A Pipeline is a set of stages, and each stage can have one or more jobs. Jobs run on a distributed farm of runners. When we start a Pipeline, any runner with free resources can pick up the next job. GitLab Runner is the agent that runs jobs. For simplicity, let's consider Docker as the executor for all runners.
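To make the terms concrete, here is a minimal sketch of a Pipeline with two stages and three jobs. The job names and the echo commands are placeholders I made up for illustration; they are not part of the example discussed later.

```yaml
# Hypothetical pipeline: two stages, three jobs
stages:
  - build
  - test

# One job in the build stage
compile:
  stage: build
  script:
    - echo "Compiling the app"

# Two jobs in the test stage; they can run in parallel on different runners
unit_tests:
  stage: test
  script:
    - echo "Running unit tests"

lint:
  stage: test
  script:
    - echo "Running the linter"
```

Jobs in the test stage start only after every job in the build stage has succeeded.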

Each job starts with a clean slate and doesn't know the results of the previous one. If you don't use cache and artifacts, the runner will have to go to the internet or local registry and download the necessary packages when installing project dependencies.

What is cache?

It's a set of files that a job can download before running and upload after execution. By default, the cache is stored in the same place where GitLab Runner is installed. If the distributed cache is configured, S3 works as storage.

[Figure: GitLab cache]

Let's suppose you run a Pipeline for the first time with a local cache. The first job will not find the cache but will upload one to runner01 after execution. The second job will execute on runner02; it won't find the cache there either and will work without it. Its result will be saved to runner02. Lint, the third job, will find the cache on runner01 and use it (pull). After execution, it will upload the cache back (push).

What are artifacts?

Artifacts are files stored on the GitLab server after a job is executed. Subsequent jobs will download the artifact before script execution.

[Figure: GitLab artifacts]

The Build job creates a DEF artifact and saves it on the server. The second job, Test, downloads the artifact from the server before running its commands. The third job, Lint, downloads the artifact from the server in the same way.
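The flow above can be sketched as a small config. The dist/ directory and the app.txt file are hypothetical names chosen for the illustration; the only point is that a file created in one job becomes available in a job of a later stage.

```yaml
stages:
  - build
  - test

build:
  stage: build
  script:
    # Produce a file to pass on (hypothetical path)
    - mkdir -p dist
    - echo "compiled output" > dist/app.txt
  artifacts:
    paths:
      - dist/

test:
  stage: test
  script:
    # The artifact from the build job is downloaded before this script runs
    - cat dist/app.txt
```

By default, a job downloads the artifacts of all jobs from the previous stages, so the test job needs no extra configuration here.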

To compare: an artifact is created in the first job and used in the following ones, whereas the cache is created within each job.

Consider the CI template example for Node.js recommended by GitLab:

image: node:latest # (1)

# This folder is cached between builds
cache:
  paths:
    - node_modules/ # (2)

test_async:
  script:
    - npm install # (3)
    - node ./specs/start.js ./specs/async.spec.js

test_db:
  script:
    - npm install # (4)
    - node ./specs/start.js ./specs/db-postgres.spec.js

Line #1 specifies the Docker image that will be used in all jobs. The first problem is the latest tag. This tag ruins the reproducibility of the builds because it always points to the latest release of Node.js. If the GitLab runner caches Docker images, the first run will download the image, and all subsequent runs will use the locally available copy. So, even if Node.js is upgraded from version XX to YY, our Pipeline will know nothing about it. Therefore, I suggest specifying the version of the image, and not just the release branch (node:16), but the full version tag (node:16.3.0).

Line #2 is related to lines #3 and #4. The node_modules directory is specified for caching, and the packages are installed (npm install) in every job. The installation should be faster because the packages are already available inside node_modules. Since no key is specified for the cache, the word default is used as the key. It means that the cache is permanent and shared between all git branches.
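If sharing one cache across all branches is not what you want, one option is to set the key explicitly. The sketch below uses the predefined variable CI_COMMIT_REF_SLUG so that each branch gets its own cache. This is only an illustration of the key setting, not the approach I end up with later in this article.

```yaml
# A separate cache per branch instead of the shared "default" key
cache:
  key: "$CI_COMMIT_REF_SLUG"
  paths:
    - node_modules/
```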

Let me remind you, the main goal is to keep the pipeline reproducible. The Pipeline launched today should work the same way in a year.

npm records dependencies in two files: package.json and package-lock.json. If you rely on package.json alone, the build is not reproducible: when you run npm install, the package manager installs the latest matching release (for example, a newer minor version) for dependencies that are not strictly pinned. To fix the dependency tree, we use the package-lock.json file, where all package versions are strictly specified.

But there is another problem: npm install can rewrite package-lock.json, and this is not what we expect. Therefore, we use the dedicated command npm ci, which:

  • removes the node_modules directory;
  • installs packages from package-lock.json.

What should we do if node_modules is deleted every time? We can cache npm's own download cache instead and set its location with the npm_config_cache environment variable.

And the last thing: the config does not explicitly specify the stage where the jobs are executed. By default, a job runs in the test stage, so both jobs will run in parallel. Perfect! Let's add stages to the jobs and fix all the issues we found.

What we got after the first iteration:

image: node:16.3.0 # (1)

stages:
  - test

variables:
  npm_config_cache: "$CI_PROJECT_DIR/.npm" # (5)

# This folder is cached between builds
cache:
  key:
    files:
      - package-lock.json # (6)
  paths:
    - .npm # (2)

test_async:
  stage: test
  script:
    - npm ci # (3)
    - node ./specs/start.js ./specs/async.spec.js

test_db:
  stage: test
  script:
    - npm ci # (4)
    - node ./specs/start.js ./specs/db-postgres.spec.js

We improved the Pipeline and made it reproducible. There are two drawbacks left. First, the cache is shared: every job pulls the cache and pushes a new version after it finishes. It's good practice to update the cache only once per Pipeline. Second, every job installs the package dependencies and wastes time.

To fix the first problem, we describe the cache management explicitly. Let's add a "hidden" job and enable only the pull policy (download the cache without updating it):

# Define a hidden job to be used with extends
# Better than default to avoid activating cache for all jobs
.dependencies_cache:
  cache:
    key:
      files:
        - package-lock.json
    paths:
      - .npm
    policy: pull

To connect the cache to a job, inherit from the hidden job via the extends keyword.

...
extends: .dependencies_cache
...
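For example, a test job that still installs its own dependencies could reuse the cache like this. It is a sketch that assumes the hidden .dependencies_cache job shown above is defined in the same file.

```yaml
test_async:
  stage: test
  # Inherit the cache settings (key, paths, pull-only policy)
  extends: .dependencies_cache
  script:
    - npm ci
    - node ./specs/start.js ./specs/async.spec.js
```

With the pull policy, this job downloads the cache before the script runs but does not upload it afterwards.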

To fix the second issue, we use artifacts. Let's create a job that installs the package dependencies and passes node_modules on as an artifact. Subsequent jobs can then run the tests right away.

setup:
  stage: setup
  script:
    - npm ci
  extends: .dependencies_cache
  cache:
    policy: pull-push
  artifacts:
    expire_in: 1h
    paths:
      - node_modules

We install the npm dependencies and use the cache described in the hidden .dependencies_cache job. Then we override the policy with pull-push so that this job, and only this job, updates the cache. A short lifetime (1 hour) helps to save space for artifacts: there is no need to keep the node_modules artifact on the GitLab server for a long time.
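It may help to see what the setup job looks like once extends is resolved. GitLab merges the hidden job into the job that extends it key by key, so the job's own cache:policy replaces the inherited value while key and paths are kept. The result should be roughly equivalent to writing:

```yaml
# Effective configuration of the setup job after the merge
setup:
  stage: setup
  script:
    - npm ci
  cache:
    key:
      files:
        - package-lock.json
    paths:
      - .npm
    policy: pull-push   # overridden in the job itself
  artifacts:
    expire_in: 1h
    paths:
      - node_modules
```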

The full config after the changes:

image: node:16.3.0 # (1)

stages:
  - setup
  - test

variables:
  npm_config_cache: "$CI_PROJECT_DIR/.npm" # (5)

# Define a hidden job to be used with extends
# Better than default to avoid activating cache for all jobs
.dependencies_cache:
  cache:
    key:
      files:
        - package-lock.json
    paths:
      - .npm
    policy: pull

setup:
  stage: setup
  script:
    - npm ci
  extends: .dependencies_cache
  cache:
    policy: pull-push
  artifacts:
    expire_in: 1h
    paths:
      - node_modules

test_async:
  stage: test
  script:
    - node ./specs/start.js ./specs/async.spec.js

test_db:
  stage: test
  script:
    - node ./specs/start.js ./specs/db-postgres.spec.js
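
The test jobs above get node_modules because, by default, a job downloads the artifacts of all jobs from earlier stages. If you prefer to make this explicit, or the Pipeline later grows other jobs with artifacts, the dependencies keyword can name the jobs to fetch from. A sketch for one of the test jobs:

```yaml
test_async:
  stage: test
  # Download artifacts only from the setup job
  dependencies:
    - setup
  script:
    - node ./specs/start.js ./specs/async.spec.js
```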

We learned the difference between cache and artifacts. We built a reproducible Pipeline that works predictably and uses resources efficiently. This article showed some common mistakes made when setting up CI in GitLab and how to avoid them.
I wish you green builds and fast pipelines. Would appreciate your feedback in the comments!

