Add initial logger support #24

Merged
merged 13 commits into from
Jul 4, 2023
21 changes: 14 additions & 7 deletions README.md
@@ -20,26 +20,33 @@ pip install -e .

## Usage
```bash
% ./trainer.py --help
usage: trainer.py [-h] --config CONFIG [--temporary-directory TEMPORARY_DIR] [--state STATE_FILE] [--do-not-resume] [--sync] [trainer-command [arguments]]
% opustrainer-train --help
usage: opustrainer-train [-h] --config CONFIG [--state STATE] [--sync] [--temporary-directory TEMPORARY_DIRECTORY] [--do-not-resume] [--no-shuffle] [--log-level LOG_LEVEL] [--log-file LOG_FILE] ...

Feeds marian tsv data for training.

positional arguments:
trainer Trainer program that gets fed the input. If empty it is read from config.

options:
-h, --help show this help message and exit
--config CONFIG, -c CONFIG
YML configuration input.
--temporary-directory TEMPORARY_DIR, -t TEMPORARY_DIR
--state STATE, -s STATE
YML state file, defaults to ${CONFIG}.state.
--sync Do not shuffle async
--temporary-directory TEMPORARY_DIRECTORY, -T TEMPORARY_DIRECTORY
Temporary dir, used for shuffling and tracking state
--state STATE_FILE Path to trainer state file which stores how much of
each dataset has been read. Defaults to ${CONFIG}.state
--sync Do not shuffle in the background
--do-not-resume, -d Do not resume from the previous training state
--no-shuffle, -n Do not shuffle, for debugging
--log-level LOG_LEVEL
Set log level. Available levels: DEBUG, INFO, WARNING, ERROR, CRITICAL. Default is INFO
--log-file LOG_FILE, -l LOG_FILE
Target location for logging. Always logs to stderr and optionally to a file.
```
Once you fix the paths in the configuration file `train_config.yml`, you can run a test case by doing:
```bash
./trainer.py -c train_config.yml /path/to/marian -c marian_config --any --other --flags
opustrainer-train -c train_config.yml /path/to/marian -c marian_config --any --other --flags
```
You can check the resulting mixed file in `/tmp/test`. If your neural network trainer doesn't support training from `stdin`, you can use this tool to generate a training dataset and then disable data reordering or shuffling in your trainer implementation, as your training input should already be balanced.

12 changes: 0 additions & 12 deletions contrib/test-data/test_enzh_config.expected.out
@@ -98,15 +98,3 @@ On 1 March, Australia Reported The First Death From COVID-19: A 78-year-old Pert
Food Poisoning and Food Hygiene. 微生物检验与食品安全控制.
On 20 September 201, The Australian Olympic Committee Announced The First Set Of Sailors Selected For Tokyo 2020, Namely Rio 2016 Silver Medalists And Deending World 470 Champions Mathew Belcher And William Ryan And World's Current Top-ranked Laser Sailor Matthew Wearn. 2019年 9月 20日, 澳大利亚奥林匹克委员会公布了第一批入选奥运阵容的帆船选手名单, 名单中包括 2016年里约奥运银牌得主 Mathew Belcher 和 William Ryan 。 2020年 2月 27日, 第二批入选奥运阵容的帆船选手名单正式公布。 2020年 3月 19日, Mara Stransky 确认获得代表澳大利亚参加女子辐射型的资格, 成为第三批入选的帆船选手。
429 __source__ TRANSPORT __target__ 运输 __done__ SQUADRON (429 BIISON SQUADRON) - FYLING THE CC-17 429 运输中队 (429 野牛), 使用 CC - 177
[Trainer] Starting stage start
[Trainer] Reading clean for epoch 0
[Trainer] Reading clean for epoch 1
[Trainer] Reading clean for epoch 2
[Trainer] Reading clean for epoch 3
[Trainer] Reading clean for epoch 4
[Trainer] Reading clean for epoch 5
[Trainer] Reading clean for epoch 6
[Trainer] Reading clean for epoch 7
[Trainer] Reading clean for epoch 8
[Trainer] Reading clean for epoch 9
[Trainer] waiting for trainer to exit. Press ctrl-c to be more aggressive
12 changes: 12 additions & 0 deletions contrib/test-data/test_enzh_config_plain_expected.log
@@ -0,0 +1,12 @@
[2023-07-04 02:48:11] [Trainer] [INFO] Starting stage start
[2023-07-04 02:48:11] [Trainer] [INFO] Reading clean for epoch 0
[2023-07-04 02:48:11] [Trainer] [INFO] Reading clean for epoch 1
[2023-07-04 02:48:11] [Trainer] [INFO] Reading clean for epoch 2
[2023-07-04 02:48:11] [Trainer] [INFO] Reading clean for epoch 3
[2023-07-04 02:48:12] [Trainer] [INFO] Reading clean for epoch 4
[2023-07-04 02:48:12] [Trainer] [INFO] Reading clean for epoch 5
[2023-07-04 02:48:12] [Trainer] [INFO] Reading clean for epoch 6
[2023-07-04 02:48:12] [Trainer] [INFO] Reading clean for epoch 7
[2023-07-04 02:48:12] [Trainer] [INFO] Reading clean for epoch 8
[2023-07-04 02:48:12] [Trainer] [INFO] Reading clean for epoch 9
[2023-07-04 02:48:12] [Trainer] [INFO] waiting for trainer to exit. Press ctrl-c to be more aggressive
23 changes: 0 additions & 23 deletions contrib/test-data/test_enzh_tags_advanced_config.expected.out
@@ -198,26 +198,3 @@ FOOD POISONING AND FOOD HYGIENE. 微生物检验与食品安全控制.
TOGETHER __TARGET__ 共同 __DONE__ AT HOME (ALSO KNOWN AS ONE WORLD: TOGETHER AT HOME) WAS A VIRTUAL __TARGET__ 虚拟 __DONE__ CONCERT __TARGET__ 演唱会 __DONE__ SERIES ORGANISED BY GLOBAL CITIZEN __TARGET__ 公民 __DONE__ AND CURATED BY SINGER LADY GAGA, IN SUPPORT __TARGET__ 支持 __DONE__ OF THE __TARGET__ , __DONE__ WORLD __TARGET__ 世界 __DONE__ HEALTH __TARGET__ 卫生 __DONE__ ORGANIZATION. 同一个世界: 共同在家 (英语: ONE WORLD: TOGETHER AT HOME), 是于 2020年 4月 18日举行的虚拟系列演唱会, 推广在 2019 冠状病毒病疫情期间保持社交距离等防疫理念, 由全球公民和歌手嘎嘎小姐共同组织发起, 以支持世界卫生组织。
On 20 September 2019, __target__ 2019年 __done__ the __target__ 了 __done__ Australian __target__ 澳大利亚 __done__ Olympic Committee __target__ 委员会 __done__ announced __target__ 公布 __done__ the first set of sailors selected __target__ 入选 __done__ for Tokyo 2020, namely Rio __target__ 里约 __done__ 2016 silver medalists and __target__ 和 __done__ defending world 470 champions Mathew Belcher and William Ryan and world's current top-ranked Laser sailor Matthew Wearn. 2019年 9月 20日, 澳大利亚奥林匹克委员会公布了第一批入选奥运阵容的帆船选手名单, 名单中包括 2016年里约奥运银牌得主 Mathew Belcher 和 William Ryan 。 2020年 2月 27日, 第二批入选奥运阵容的帆船选手名单正式公布。 2020年 3月 19日, Mara Stransky 确认获得代表澳大利亚参加女子辐射型的资格, 成为第三批入选的帆船选手。
TRANSFUSION-RELATED ACUTE LUNG __TARGET__ 肺 __DONE__ INJURY (TRALI) IS A SERIOUS BLOOD TRANSFUSION COMPLICATION CHARACTERIZED BY THE ACUTE __TARGET__ 急性 __DONE__ ONSET __TARGET__ 引发 __DONE__ OF NON-CARDIOGENIC PULMONARY EDEMA FOLLOWING TRANSFUSION __TARGET__ 输血併 __DONE__ OF BLOOD PRODUCTS. __TARGET__ 。 __DONE__ 输血相关急性肺损伤 (TRANSFUSION RELATED ACUTE LUNG INJURY; TRALI) 是一种会引发急性肺水肿的严重输血併发症。
[Trainer] Starting stage start
[Trainer] Reading clean for epoch 0
[Trainer] Reading clean for epoch 1
[Trainer] Reading clean for epoch 2
[Trainer] Reading clean for epoch 3
[Trainer] Reading clean for epoch 4
[Trainer] Reading clean for epoch 5
[Trainer] Reading clean for epoch 6
[Trainer] Reading clean for epoch 7
[Trainer] Reading clean for epoch 8
[Trainer] Reading clean for epoch 9
[Trainer] Starting stage end
[Trainer] Reading clean for epoch 10
[Trainer] Reading clean for epoch 11
[Trainer] Reading clean for epoch 12
[Trainer] Reading clean for epoch 13
[Trainer] Reading clean for epoch 14
[Trainer] Reading clean for epoch 15
[Trainer] Reading clean for epoch 16
[Trainer] Reading clean for epoch 17
[Trainer] Reading clean for epoch 18
[Trainer] Reading clean for epoch 19
[Trainer] waiting for trainer to exit. Press ctrl-c to be more aggressive
23 changes: 0 additions & 23 deletions contrib/test-data/test_enzh_tags_stage_config.expected.out
@@ -198,26 +198,3 @@ Food Poisoning and Food Hygiene. 微生物检验与食品安全控制.
Together at Home __target__ 在家 __done__ (also __target__ ( __done__ known as One World: Together at Home) was a virtual concert __target__ 演唱会 __done__ series organised __target__ 组织 __done__ by Global __target__ 全球 __done__ Citizen __target__ 公民 __done__ and curated by singer Lady Gaga, in __target__ 以 __done__ support of the __target__ , __done__ World __target__ 世界 __done__ Health Organization. 同一个世界: 共同在家 (英语: One World: Together at Home), 是于 2020年 4月 18日举行的虚拟系列演唱会, 推广在 2019 冠状病毒病疫情期间保持社交距离等防疫理念, 由全球公民和歌手嘎嘎小姐共同组织发起, 以支持世界卫生组织。
On 20 September __target__ 9月 __done__ 2019, the __target__ 了 __done__ Australian __target__ 澳大利亚 __done__ Olympic Committee __target__ 委员会 __done__ announced the first set of sailors selected __target__ 入选 __done__ for Tokyo 2020, namely Rio 2016 silver __target__ 银牌 __done__ medalists __target__ 阵容 __done__ and __target__ 和 __done__ defending world 470 champions Mathew Belcher __target__ Mara __done__ and William Ryan and world's __target__ 成为 __done__ current top-ranked Laser sailor Matthew Wearn. 2019年 9月 20日, 澳大利亚奥林匹克委员会公布了第一批入选奥运阵容的帆船选手名单, 名单中包括 2016年里约奥运银牌得主 Mathew Belcher 和 William Ryan 。 2020年 2月 27日, 第二批入选奥运阵容的帆船选手名单正式公布。 2020年 3月 19日, Mara Stransky 确认获得代表澳大利亚参加女子辐射型的资格, 成为第三批入选的帆船选手。
Transfusion-related __target__ 相关 __done__ acute __target__ 急性 __done__ lung injury (TRALI) is a serious __target__ 严重 __done__ blood transfusion complication __target__ TRALI) __done__ characterized by the acute __target__ 急性 __done__ onset __target__ 引发 __done__ of non-cardiogenic pulmonary edema following transfusion of blood products. 输血相关急性肺损伤 (Transfusion related acute lung injury; TRALI) 是一种会引发急性肺水肿的严重输血併发症。
[Trainer] Starting stage start
[Trainer] Reading clean for epoch 0
[Trainer] Reading clean for epoch 1
[Trainer] Reading clean for epoch 2
[Trainer] Reading clean for epoch 3
[Trainer] Reading clean for epoch 4
[Trainer] Reading clean for epoch 5
[Trainer] Reading clean for epoch 6
[Trainer] Reading clean for epoch 7
[Trainer] Reading clean for epoch 8
[Trainer] Reading clean for epoch 9
[Trainer] Starting stage end
[Trainer] Reading clean for epoch 10
[Trainer] Reading clean for epoch 11
[Trainer] Reading clean for epoch 12
[Trainer] Reading clean for epoch 13
[Trainer] Reading clean for epoch 14
[Trainer] Reading clean for epoch 15
[Trainer] Reading clean for epoch 16
[Trainer] Reading clean for epoch 17
[Trainer] Reading clean for epoch 18
[Trainer] Reading clean for epoch 19
[Trainer] waiting for trainer to exit. Press ctrl-c to be more aggressive
12 changes: 0 additions & 12 deletions contrib/test-data/test_zhen_config.expected.out
@@ -98,15 +98,3 @@ SIR S 标准成立于 1992年, 是美国胸科医师学会 / 重症监护医学
2020年 1月 20日, 华盛顿州确诊首例 COVID - 19 患者。 1月 29日, 成立白宫冠状病毒工作组。 1月 31日, 特朗普政府 __source__ 宣布 __target__ declared __done__ 进入公共卫生紧急状态。 On 30 January, the WHO declared a Public Health Emergency of International Concern and on January 31, the Trump administration declared a public health emergency, and placed travel restrictions on entry for travellers from China.
同日, 一个曾为钻石公主号邮轮乘客的 78 岁男性老人宣布死亡, 为澳大利亚首例因感染 COVID - 19 死亡的 __source__ 病例 __target__ reported __done__ 。他曾在西澳大利亚州 __source__ Sir __target__ the __done__ Charles Gairdner Hospital 治疗。 On 1 March, Australia reported the first death from COVID-19: a 78-year-old Perth man, who was one of the passengers from the Diamond Princess, and who had been evacuated and was being treated in Western Australia.
429 运输中队 (429 野牛), 使用 CC - 177 429 Transport Squadron (429 Bison Squadron) - Flying the CC-177
[Trainer] Starting stage start
[Trainer] Reading clean for epoch 0
[Trainer] Reading clean for epoch 1
[Trainer] Reading clean for epoch 2
[Trainer] Reading clean for epoch 3
[Trainer] Reading clean for epoch 4
[Trainer] Reading clean for epoch 5
[Trainer] Reading clean for epoch 6
[Trainer] Reading clean for epoch 7
[Trainer] Reading clean for epoch 8
[Trainer] Reading clean for epoch 9
[Trainer] waiting for trainer to exit. Press ctrl-c to be more aggressive
12 changes: 0 additions & 12 deletions contrib/test-data/test_zhen_config_prefix.expected.out
@@ -98,15 +98,3 @@ Together At Home (also Known As One World: Together At Home) Was A Virtual Conce
Together at Home (also known as One World: Together at Home) was a virtual concert series organised by Global Citizen and curated by singer Lady Gaga, in support of the World Health Organization. 同一个 世界 : 共同 在家 ( 英语 : One World : Together at Home) , 是 于 2020年 4月 18日 举行 的 虚拟 系列 演唱会 , 推广 在 2019 冠状 病毒 病 疫情 期间 保持 社交 距离 等 防疫 理念 , 由 全球 公民 和 歌手 嘎嘎 小姐 共同 组织 发起 , 以 支持 世界 卫生 组织 。 0-3 2-4 3-5 4-31 6-8 7-7 7-10 8-11 9-12 11-15 11-16 12-23 13-22 14-24 15-23 16-49 17-41 18-42 19-43 20-44 21-41 22-41 23-45 24-46 25-46 25-47 26-52 27-53 29-51 30-54 31-55 32-56 32-57
__start__ 虚拟 系列 __end__ Together at Home (also known as One World: Together at Home) was a virtual concert series organised by Global Citizen and curated by singer Lady Gaga, in support of the World Health Organization. 同一个 世界 : 共同 在家 ( 英语 : One World : Together at Home) , 是 于 2020年 4月 18日 举行 的 虚拟 系列 演唱会 , 推广 在 2019 冠状 病毒 病 疫情 期间 保持 社交 距离 等 防疫 理念 , 由 全球 公民 和 歌手 嘎嘎 小姐 共同 组织 发起 , 以 支持 世界 卫生 组织 。 0-3 2-4 3-5 4-31 6-8 7-7 7-10 8-11 9-12 11-15 11-16 12-23 13-22 14-24 15-23 16-49 17-41 18-42 19-43 20-44 21-41 22-41 23-45 24-46 25-46 25-47 26-52 27-53 29-51 30-54 31-55 32-56 32-57
On 20 September 2019, the Australian Olympic Committee announced the first set of sailors selected for Tokyo 2020, namely Rio 2016 silver medalists and defending world 470 champions Mathew Belcher and William Ryan and world's current top-ranked Laser sailor Matthew Wearn. 2019年 9月 20日 , 澳大利亚 奥林匹克 委员会 公布 了 第一 批 入选 奥运 阵容 的 帆船 选手 名单 , 名单 中 包括 2016年 里约 奥运 银牌 得主 Mathew Belcher 和 William Ryan 。 2020年 2月 27日 , 第二 批 入选 奥运 阵容 的 帆船 选手 名单 正式 公布 。 2020年 3月 19日 , Mara Stransky 确认 获得 代表 澳大利亚 参加 女子 辐射型 的 资格 , 成为 第三 批 入选 的 帆船 选手 。 0-2 0-3 1-2 2-1 3-0 4-8 5-4 6-5 7-6 8-7 10-9 10-10 11-10 13-16 14-39 17-33 19-23 20-22 21-25 22-13 23-29 28-27 29-53 31-30 32-31 34-65 36-61 36-62 37-69 40-70 40-71
[Trainer] Starting stage start
[Trainer] Reading clean for epoch 0
[Trainer] Reading clean for epoch 1
[Trainer] Reading clean for epoch 2
[Trainer] Reading clean for epoch 3
[Trainer] Reading clean for epoch 4
[Trainer] Reading clean for epoch 5
[Trainer] Reading clean for epoch 6
[Trainer] Reading clean for epoch 7
[Trainer] Reading clean for epoch 8
[Trainer] Reading clean for epoch 9
[Trainer] waiting for trainer to exit. Press ctrl-c to be more aggressive
40 changes: 40 additions & 0 deletions src/opustrainer/logger.py
@@ -0,0 +1,40 @@
from io import TextIOWrapper
import logging
from sys import stderr
from typing import List, TextIO
from functools import lru_cache

def getLogLevel(name: str) -> int:
    """Incredibly, i can't find a function that will do this conversion, other
    than setLevel, but setLevel doesn't work for calling the different log level logs."""
    if name.upper() in logging.getLevelNamesMapping():
        return logging.getLevelNamesMapping()[name.upper()]
    else:
        logging.log(logging.WARNING, "unknown log level level used: " + name + " assuming warning...")
        return logging.WARNING

def log(msg: str, loglevel: str = "INFO") -> None:
    level = getLogLevel(loglevel)
    logging.log(level, msg)


@lru_cache(None)
def log_once(msg: str, loglevel: str = "INFO") -> None:
    """A wrapper to log, to make sure that we only print things once"""
    log(msg, loglevel)


def setup_logger(outputfilename: str | None = None, loglevel: str = "INFO", disable_stderr: bool=False) -> None:
    """Sets up the logger with the necessary settings. Outputs to both file and stderr"""
    loggingformat = '[%(asctime)s] [Trainer] [%(levelname)s] %(message)s'
    handlers: List[logging.StreamHandler[TextIO] | logging.StreamHandler[TextIOWrapper]] = []
    # disable_stderr is to be used only when testing the logger
Contributor Author

grumpy noises

Contributor

Why do you want strings for loglevel? Why not just export the log levels we support and reference logger.WARNING instead of "WARNING" in code? Additional upside is that you can't reference log levels that don't exist because the symbols to do so are not there.
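
A minimal sketch of what this suggestion might look like, assuming the module simply re-exports the standard `logging` constants. The `log` signature and the `ListHandler` class are hypothetical and exist only to make the example self-contained; they are not the PR's code.

```python
import logging

# Re-export only the levels the trainer supports (an assumption of this sketch).
DEBUG = logging.DEBUG
INFO = logging.INFO
WARNING = logging.WARNING
ERROR = logging.ERROR
CRITICAL = logging.CRITICAL


def log(msg: str, level: int = INFO) -> None:
    """Log msg at a numeric level; a misspelled constant fails at lookup time."""
    logging.log(level, msg)


class ListHandler(logging.Handler):
    """Hypothetical handler that keeps records in memory for inspection."""

    def __init__(self) -> None:
        super().__init__()
        self.records = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


handler = ListHandler()
root = logging.getLogger()
root.addHandler(handler)
root.setLevel(DEBUG)

log("dataset 'clean' is smaller than expected", WARNING)
```

With this shape, a caller writing `logger.WARNIGN` would get an `AttributeError` immediately instead of a silent fallback at runtime.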

Contributor

Why StreamHandler[TextIO]? I can't find any reference of StreamHandler inheriting Generic.

By the way, if you're okay with it, I want to go through your changes and make them Python 3.8 compatible (which basically means not using generics where they're not supported, and using `Union` instead of `|`).

Contributor Author

My type checker flagged it as an incompatible type assignment and this was the inferred type. I guess `Union` should be used instead.
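
One way the Python 3.8-compatible typing discussed here could look, as a sketch rather than the change that was actually made: annotate the list with the common base class `logging.Handler` and use `Optional` in place of `| None`. The helper name `build_handlers` is hypothetical.

```python
import logging
from sys import stderr
from typing import List, Optional


def build_handlers(outputfilename: Optional[str] = None,
                   disable_stderr: bool = False) -> List[logging.Handler]:
    """Hypothetical helper: collect handlers under a base-class annotation."""
    handlers: List[logging.Handler] = []
    if not disable_stderr:
        handlers.append(logging.StreamHandler(stream=stderr))
    if outputfilename is not None:
        # Setting the encoding on the handler avoids basicConfig(encoding=...),
        # which was only added in Python 3.9.
        handlers.append(logging.FileHandler(filename=outputfilename, encoding="utf-8"))
    return handlers
```

Because both `StreamHandler` and `FileHandler` derive from `logging.Handler`, the annotation needs neither a union nor subscripted handler types.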

Contributor Author

For the WARNING, I thought it'd be easier to use a text-based log level from modifiers, in case users want to add their own logging, rather than getting into Python internals.

That being said, `logging.getLevelNamesMapping()` is only available in Python 3.11, so I shouldn't be using it.

Contributor Author

Also, since we define the level at runtime, I had to have a str -> loglevel conversion anyway.
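
A possible way to keep the string-to-level conversion without `logging.getLevelNamesMapping()`, sketched under the assumption that only the five levels listed in the README help text need to be accepted. The `_LEVELS` table and the name `get_log_level` are illustrative, not part of the PR.

```python
import logging

# Levels named in the README help text; the table itself is this sketch's assumption.
_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def get_log_level(name: str) -> int:
    """Map a level name (case-insensitive) to its numeric value, defaulting to WARNING."""
    level = _LEVELS.get(name.upper())
    if level is None:
        logging.warning("unknown log level %s, assuming WARNING", name)
        return logging.WARNING
    return level
```

An explicit table uses nothing newer than the basic `logging` constants, so it should also run on Python 3.8.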

    # When testing the logger directly, we don't want to write to stderr, because in order to read
    # our stderr output, we have to use redirect_stderr, which however makes all other tests spit
    # as it interferes with unittest' own redirect_stderr. How nice.
    if not disable_stderr:
        handlers.append(logging.StreamHandler(stream=stderr))
    if outputfilename is not None:
        handlers.append(logging.FileHandler(filename=outputfilename))
    logging.basicConfig(handlers=handlers, encoding='utf-8', level=getLogLevel(loglevel), format=loggingformat, datefmt='%Y-%m-%d %H:%M:%S')
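
For reference, the `log_once` wrapper in this file depends on `functools.lru_cache` memoizing on the call arguments. A minimal sketch of that behavior, with a plain list (`calls`, a stand-in invented for this example) in place of real log output:

```python
from functools import lru_cache

calls = []  # stand-in for real log output so the effect is visible


@lru_cache(maxsize=None)
def log_once(msg: str, loglevel: str = "INFO") -> None:
    """Record the message; repeated identical calls are answered from the cache."""
    calls.append((loglevel, msg))


log_once("falling back to defaults")
log_once("falling back to defaults")             # cached: not recorded again
log_once("falling back to defaults", "WARNING")  # different arguments: recorded
```

Note that `lru_cache` keys on the arguments as they are passed, so `log_once(msg)` and `log_once(msg, "INFO")` may be treated as different calls and each be logged once.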
