从一次Linux计划任务不执行的问题排查过程说起

=Start=

在VPS上有一个扫描的定时任务没有正常运行，所以需要定位一下问题到底出在哪了：cron没有正常运行？还是因为脚本运行时长太长了导致执行不成功？亦或是脚本中某个命令太长导致执行失败？

查看文档（RTFM）

/etc/crontab

/var/spool/cron/*

/etc/cron.{hourly,daily,weekly,monthly}/

/etc/anacrontab

检查日志（RTFL）

/var/log/cron *

/var/spool/mail/*

1.直接执行

在命令行上直接执行（Tips：需要 在多个不同目录中进行测试 ，防止因为路径问题导致的错误）；

通过另外一个脚本调用执行（Tips：有些程序在非终端情况下无法正常执行）

2.查看cron中的环境变量

在自己写的脚本中手动设定环境变量PATH （也可以通过source命令载入其它脚本用来设置环境变量）；

先分析一下现有cron中的环境变量「`* * * * * env > /tmp/env.output`」；

3.查看邮件

正常情况下，crontab执行失败，在没有屏蔽错误的情况下会产生一封系统邮件，位于 /var/spool/mail/root ，可以通过查看邮件内容来分析错误原因。

原因总结：

crontab服务未正常运行（ 查看 /var/log/cron 或 /var/log/messages 日志 ）；

脚本权限问题（`ls -l`查看脚本的属主以及可执行位是否已设置）；

环境变量问题（ 手动设置PATH ）；

路径问题（ 所有需要用到路径的地方都使用绝对路径 ）；

另，在最初以为是cron本身不支持「长时间任务」或「命令过长」导致定时任务无法正常执行，但是后来通过阅读crontab源码发现并没有这些限制，所以以后如果碰到cron任务有问题了，老老实实按照上面的几个原因进行分析，肯定可以解决问题的。

参考链接：

分享一次Linux任务计划crontab不执行的问题排查过程

http://serverfault.com/questions/449651/why-is-my-crontab-not-working-and-how-can-i-troubleshoot-it

http://askubuntu.com/questions/23009/reasons-why-crontab-does-not-work

https://www.digitalocean.com/community/tutorials/how-to-schedule-routine-tasks-with-cron-and-anacron-on-a-vps

http://www.cyberciti.biz/faq/how-do-i-add-jobs-to-cron-under-linux-or-unix-oses/

=END=

7月 15, 2016

admin

crontab , debug , Linux

最近在Linux系统上用Cron定时执行一些任务时总是失败，查看日志发现原来是crond不知什么时候挂了，重启多次也没用。。。查看「/var/log/cron」日志发现内容为空，重启crond也不起作用，无奈，现在只能用另一个daemon进程(Supervisord)去监控crond的状态了
http://supervisord.org/installing.html
http://ju.outofmemory.cn/entry/201847
http://www.ziqiangxuetang.com/django/django-nginx-deploy.html
http://blog.wiseturtles.com/posts/centos-supervisor.html
http://wiki.hustlzp.com/post/linux/centos-7-python
sudo pip install supervisor
sudo echo_supervisord_conf > /etc/supervisord.conf
sudo mkdir -p /etc/supervisord.d/
# [include]
# files = /etc/supervisord.d/*.conf
sudo supervisord -c /etc/supervisord.conf
# vim /etc/supervisord.d/crond.conf
[program:start_crond]
command=/etc/init.d/crond start
startsecs=0
stopwaitsecs=0
autorestart=true
user=root
supervisorctl status
supervisorctl help
supervisorctl start start_crond
https://github.com/dshearer/jobber
http://stackoverflow.com/questions/288349/alternative-to-cron
https://unix.stackexchange.com/questions/164424/fault-tolerant-crond-replacement
https://unix.stackexchange.com/questions/278564/cron-vs-systemd-timers

生产环境中使用Cron
https://wanqu.co/a/2098/2015-10-18-cron-in-production-that-is-a-double-edged-sword.html
https://medium.com/airbnb-engineering/chronos-a-replacement-for-cron-f05d7d986a9d
https://apscheduler.readthedocs.io/en/latest/

【应急响应】定时任务知识梳理
http://vinc.top/2017/07/19/%E3%80%90%E5%BA%94%E6%80%A5%E5%93%8D%E5%BA%94%E3%80%91%E5%AE%9A%E6%97%B6%E4%BB%BB%E5%8A%A1%E7%9F%A5%E8%AF%86%E6%A2%B3%E7%90%86/
应急响应中关于定时任务应该排查的文件有：
/etc/crontab,/etc/cron.d,/var/spool/cron/{user},/etc/cron.hourly, /etc/cron.daily, /etc/cron.weekly, /etc/cron.monthly，/etc/anacrontab

其中比较容易忽视的是 /etc/anacrontab

如何查看和CentOS下定时任务cron有关的文件列表？
# rpm -qa | grep “cron”
# rpm -ql crontabs
# rpm -ql cronie
# rpm -ql cronie-anacron

利用定时任务（Cronjobs）进行Linux提权
https://xz.aliyun.com/t/2401
http://www.hackingarticles.in/linux-privilege-escalation-by-exploiting-cron-jobs/
Cron jobs
Crontab文件覆写
创建一个定时任务
Crontab Tar Wildcard注入
创建定时任务
https://mp.weixin.qq.com/s/KdgIYX4zb_xoYNVIugBLvg
由于在部分机器上，crontab对于执行程序的输出有大小限制，输出超出一定的字节之后就会自动停止程序。

而我的程序每发送1000条数据即会输出一条log，所以每次正好输出49000这条log之后，就超出了大小限制，因此每次都会自动停止在48999条了。

解决方案：可以重定向输出至 >/dev/null，如：
1 8 * * * cd /path/to/your/bin/ && ./your_bin –flagfile=../conf/your.flags >/dev/null 2>&1 &

希望遇到类似问题的同学，可以尽快的搜索到我这篇文章。
https://mp.weixin.qq.com/s/B8Lgiw69Z_Fi46019eCTJg
Cron 是 *nix 系统中常见的有一个 daemon，用于定时执行任务。cron 的实现非常简单，以最常用的 vixie cron 为例，大概分为三步：
1. 每分钟读取 crontab 配置
2. 计算需要执行的任务
3. 执行任务，主进程执行或者开启一个 worker 进程执行

Cron 的实现每次都是重新加载 crontab，哪怕计算出来下次可执行时间是 30 分钟之后，也不会说 sleep(30)，这样做是为了能够在每次 crontab 变更的时候及时更新。

cron 是没有运行记录的，并且每次都会重新加载 crontab，所以总体来说 cron 是一个无状态的服务。
在大多数情况下，这种简单的机制是非常高效且稳健的，但是考虑到一些复杂的场景也会有一些问题，包括本文标题中的问题：
1. 如果某个任务在下次触发的时候，上次运行还没有结束怎么办？
2. 当系统关机的时候有任务需要触发，开机后 cron 还会补充执行么？
3. 如果错过了好多次执行，那么补充执行的时候需要执行多少次呢？

Unix 上传统的 cron daemon 没有考虑以上三个问题，也就是说错过就错过了，不会再执行。为了解决这个问题，又一个辅助工具被开发出来了——anacron, ana 是 anachronistic（时间错误）的缩写。anacron 通过文件的时间戳来追踪任务的上次运行时间。具体的细节就不展开了，可以参考文章后面的参考文献。

总之，如果只有 cron，那么不会执行错过的任务，但是配合上 anacron，还是有机会执行错过的任务的。

定时执行任务是一个普遍存在的需求，除了在系统层面以外，多种不同的软件中都实现了，我们可以认为他们是广义的 cron。这些广义的 cron 大多考虑了这些问题，下面以 apscheduler 和 kubernetes 为例说明一下。
https://segmentfault.com/a/1190000011084828
APScheduler是一个python的第三方库，用来提供python的后台程序。包含四个组件，分别是：
triggers：任务触发器组件，提供任务触发方式
job stores：任务商店组件，提供任务持久化方式
executors：任务调度组件，提供任务调度方式
schedulers：任务调度组件，提供任务工作方式

How can I run task every 10 minutes on the 5s, using BlockingScheduler?
https://stackoverflow.com/questions/66662408/how-can-i-run-task-every-10-minutes-on-the-5s-using-blockingscheduler

APScheduler User guide
https://apscheduler.readthedocs.io/en/3.x/userguide.html

python apscheduler – skipped: maximum number of running instances reached
https://stackoverflow.com/questions/34020161/python-apscheduler-skipped-maximum-number-of-running-instances-reached

Configuring the scheduler¶
https://apscheduler.readthedocs.io/en/latest/userguide.html#configuring-the-scheduler
scheduler = apscheduler.schedulers.background.BackgroundScheduler(job_defaults={‘max_instances’: 2})
apscheduler提示maximum错误
https://blog.csdn.net/wyongqing/article/details/50320601

apscheduler调度器异常错误：skipped: maximum number of running instances reached (1)
https://blog.csdn.net/weixin_43343144/article/details/104256280