
Start elasticsearch and kibana (kibana is optional, but having it up lets you inspect the data in ES, which makes debugging much easier).
First, write the vector config in TOML. Recent versions of vector support the Vector Remap Language (VRL), which lets you add and remove fields and ships with built-in functions, variables, and so on; it is quite handy.
[sources.airflow_log]
type = "file"
ignore_older_secs = 86400
include = [ "/home/greetlist/airflow/logs/**/*.log" ]
read_from = "beginning"
data_dir = "."
[transforms.transform_get_unique_id]
type = "remap"
inputs = [ "airflow_log" ]
source = """
. |= parse_regex!(.file, r'/home/greetlist/airflow/logs/(?P<dag_id>.*)/(?P<task_id>.*)/(?P<run_id>.*)/(?P<try_number>.*)\.log$')
"""
[transforms.transform_remove_file_field]
type = "remap"
inputs = [ "transform_get_unique_id" ]
source = """
del(.file)
del(.host)
"""
[transforms.transform_add_log_id_field]
type = "remap"
inputs = [ "transform_remove_file_field" ]
source = """
.log_id = join!([.dag_id, .task_id, .run_id, .try_number], "-")
.offset = 1
"""
[sinks.airflow_log_sink]
type = "console"
inputs = [ "transform_add_log_id_field" ]
target = "stdout"
encoding.codec = "json"
[sinks.to_elasticsearch]
type = "elasticsearch"
inputs = [ "transform_add_log_id_field" ]
endpoint = "http://127.0.0.1:9200"
mode = "data_stream"
# note: the `index` option only applies in bulk mode; in data_stream mode
# events land in the logs-generic-default data stream by default
#pipeline = "pipeline-name"
compression = "none"
A few points to note about the config above: Airflow's log_id_template is {dag_id}-{task_id}-{run_id}-{try_number}, so we need to add a matching log_id field to every JSON event, which is what transform_add_log_id_field does.
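To see what the two remap transforms produce, here is a small Python sketch that mimics the VRL parse_regex! + join! steps on a sample log path. The directory order dag_id/task_id/run_id/try_number and the sample path are assumptions chosen to match the log_id_template above, not something Vector or Airflow guarantees:

```python
import re

# Mimics the VRL pipeline: parse_regex! on .file, then join! into .log_id.
# The dag_id/task_id/run_id/try_number directory order is an assumption
# matching the log_id_template {dag_id}-{task_id}-{run_id}-{try_number}.
LOG_PATH_RE = re.compile(
    r"/home/greetlist/airflow/logs/"
    r"(?P<dag_id>.*)/(?P<task_id>.*)/(?P<run_id>.*)/(?P<try_number>.*)\.log$"
)

def build_log_id(file_path: str) -> str:
    """Extract the four fields from the path and join them like VRL's join!."""
    m = LOG_PATH_RE.match(file_path)
    if m is None:
        raise ValueError(f"unexpected log path: {file_path}")
    g = m.groupdict()
    return "-".join([g["dag_id"], g["task_id"], g["run_id"], g["try_number"]])

print(build_log_id(
    "/home/greetlist/airflow/logs/my_dag/my_task/manual__2023-01-01/1.log"
))
# → my_dag-my_task-manual__2023-01-01-1
```

This is exactly the value Airflow will later use to look the task's log lines up in ES.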
Then start vector with this config:
vector -c vector.toml
One nasty detail: vector treats \n or \r\n as the line terminator when reading a file, but the Python code on the Airflow side does not write a trailing newline when a task log finishes. vector then stays blocked waiting for the line to complete and never ships the end_of_log marker to ES, so the log page in the Airflow web UI keeps showing a loading spinner that never goes away.
Next, turn on remote logging in airflow.cfg:
[logging]
remote_logging = True
[elasticsearch]
# Elasticsearch host
host = http://localhost:9200
# Format of the log_id, which is used to query for a given tasks logs
log_id_template = {dag_id}-{task_id}-{run_id}-{try_number}
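Since the spinner problem above comes from the missing trailing newline, one workaround is to make sure a finished task log always ends with \n before vector tails it. ensure_trailing_newline below is a hypothetical helper, not part of Airflow's API; in practice you would hook the equivalent logic into the file task handler when a task finishes:

```python
import os

def ensure_trailing_newline(path: str) -> bool:
    """Append a newline if the file's last byte is not '\\n'.

    Returns True if the file was patched. vector only emits a line once it
    sees the terminator, so this lets the end_of_log marker reach ES.
    """
    with open(path, "rb+") as f:
        f.seek(0, os.SEEK_END)
        if f.tell() == 0:
            return False  # empty file, nothing to terminate
        f.seek(-1, os.SEEK_END)
        if f.read(1) == b"\n":
            return False  # already properly terminated
        f.write(b"\n")
        return True
```

Running it twice on the same file is safe: the second call sees the newline and does nothing.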
Note: once the code change is in place, restart the scheduler and the webserver.
Done.