Author: iLeLe, 2018-3-22 18:27

Database server blue-screen hang under high load, and the gbased process stuck in the system kernel (process in D state) ...

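Recently a project ran into another strange problem: CPU usage suddenly jumped to 100% sys, and the system hung.
The dmesg log contains "blocked for more than 120 seconds" reports, which may be related to the VFS layer. The dmesg output is as follows: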
INFO: task gbased:21667 blocked for more than 120 seconds.
      Not tainted 2.6.32-431.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
gbased        D 0000000000000005     0 21667   6061 0x00000080
 ffff88029f2199d8 0000000000000086 ffff88029f219a28 ffff881086842e40
 ffff88029f2199b8 ffff8802ab417898 ffff880b60155f18 ffffffffa00fd760
 ffff8802b6369058 ffff88029f219fd8 000000000000fbc8 ffff8802b6369058
Call Trace:
 [<ffffffff8109b5ce>] ? prepare_to_wait+0x4e/0x80
 [<ffffffffa009f08a>] start_this_handle+0x25a/0x480 [jbd2]
 [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa009f495>] jbd2_journal_start+0xb5/0x100 [jbd2]
 [<ffffffffa00deeb6>] ext4_journal_start_sb+0x56/0xe0 [ext4]
 [<ffffffffa00ccd81>] ext4_da_write_begin+0x91/0x200 [ext4]
 [<ffffffffa00cc855>] ? ext4_da_write_end+0x105/0x2d0 [ext4]
 [<ffffffff811202d3>] generic_file_buffered_write+0x123/0x2e0
 [<ffffffffa00c7f4f>] ? ext4_dirty_inode+0x4f/0x60 [ext4]
 [<ffffffff81121d30>] __generic_file_aio_write+0x260/0x490
 [<ffffffff811aaa20>] ? mntput_no_expire+0x30/0x110
 [<ffffffff81121fe8>] generic_file_aio_write+0x88/0x100
 [<ffffffffa00c1fd8>] ext4_file_write+0x58/0x190 [ext4]
 [<ffffffff811862e4>] ? nameidata_to_filp+0x54/0x70
 [<ffffffff81188c7a>] do_sync_write+0xfa/0x140
 [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff812263c6>] ? security_file_permission+0x16/0x20
 [<ffffffff81188f78>] vfs_write+0xb8/0x1a0
 [<ffffffff81189871>] sys_write+0x51/0x90
 [<ffffffff810e1e5e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
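When several incidents have to be compared, it can help to pull the task name, PID and call chain out of reports like the one above. The following is a minimal sketch, not a tool used in the original analysis: the function name `parse_hung_tasks`, the regular expressions and the `SAMPLE` constant are all illustrative assumptions. It only relies on the line shapes visible in this trace. Frames marked with `?` are ones the kernel's stack dumper could not verify, so the sketch flags them as unreliable.

```python
import re

# Matches e.g. "INFO: task gbased:21667 blocked for more than 120 seconds."
# (also when a syslog timestamp prefix precedes "INFO:").
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)

# Matches e.g. " [<ffffffffa009f08a>] start_this_handle+0x25a/0x480 [jbd2]".
# A leading "? " marks a frame the stack dumper could not verify.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unsure>\? )?(?P<func>[\w.]+)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(?P<module>\w+)\])?"
)

# Hypothetical sample: a few lines copied from the trace above.
SAMPLE = """\
INFO: task gbased:21667 blocked for more than 120 seconds.
Call Trace:
 [<ffffffff8109b5ce>] ? prepare_to_wait+0x4e/0x80
 [<ffffffffa009f08a>] start_this_handle+0x25a/0x480 [jbd2]
 [<ffffffffa009f495>] jbd2_journal_start+0xb5/0x100 [jbd2]
"""


def parse_hung_tasks(text):
    """Return one dict per hung-task report found in dmesg/messages text."""
    reports = []
    current = None
    for line in text.splitlines():
        header = HUNG_RE.search(line)
        if header:
            current = {
                "comm": header.group("comm"),
                "pid": int(header.group("pid")),
                "seconds": int(header.group("secs")),
                "frames": [],
            }
            reports.append(current)
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None:
            current["frames"].append({
                "func": frame.group("func"),
                "module": frame.group("module"),  # None for core kernel symbols
                "reliable": frame.group("unsure") is None,
            })
    return reports


if __name__ == "__main__":
    for report in parse_hung_tasks(SAMPLE):
        chain = [f["func"] for f in report["frames"] if f["reliable"]]
        print(report["comm"], report["pid"], " <- ".join(chain))
```

Filtering on the reliable frames keeps the chain focused on the path that actually blocked, which in this trace runs from `sys_write` down into `start_this_handle` in jbd2.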

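A few months ago the same site hit a problem where the system blue-screened under high load and gbased got stuck in vfs_fsync handling, which looks somewhat similar to the current problem.
--------- Conclusions from the analysis of the problem a few months ago:
-- Blue screen problem
The system messages log shows a task blocked inside the Linux kernel, as follows: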
INFO: task jbd2/sdb1-8:2137 blocked for more than 120 seconds.
Aug 11 13:42:45 localhost kernel:      Not tainted 2.6.32-431.el6.x86_64 #1
The blocked task is jbd2/sdb1-8 (PID 2137), the journaling thread of the ext4 file system on sdb1.
Aug 11 13:42:45 localhost kernel: Call Trace:
Aug 11 13:42:45 localhost kernel: [<ffffffff811bebb7>] ? __set_page_dirty+0x87/0xf0
Aug 11 13:42:45 localhost kernel: [<ffffffff8109b5ce>] ? prepare_to_wait+0x4e/0x80
Aug 11 13:42:45 localhost kernel: [<ffffffffa00a080f>] jbd2_journal_commit_transaction+0x19f/0x1500 [jbd2]
Aug 11 13:42:45 localhost kernel: [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
Aug 11 13:42:45 localhost kernel: [<ffffffff8108412c>] ? lock_timer_base+0x3c/0x70
Aug 11 13:42:45 localhost kernel: [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
Based on the log, this may be a kernel bug.
See https://bugzilla.kernel.org/show_bug.cgi?id=44731

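-- gbased process in D state: the process is stuck inside the Linux kernel, the machine is unresponsive, and ssh login fails.
The messages log contains entries of the following kind: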
Aug 31 20:31:44 localhost kernel: INFO: task gbased:28989 blocked for more than 120 seconds.
Aug 31 20:31:44 localhost kernel: Call Trace:
Aug 31 20:31:44 localhost kernel: [<ffffffffa00c2231>] ext4_sync_file+0x121/0x1d0 [ext4]
Aug 31 20:31:44 localhost kernel: [<ffffffff811baa61>] vfs_fsync_range+0xa1/0x100
Aug 31 20:31:44 localhost kernel: [<ffffffff811bab2d>] vfs_fsync+0x1d/0x20
Aug 31 20:31:44 localhost kernel: [<ffffffff811bab6e>] do_fsync+0x3e/0x60
Aug 31 20:31:44 localhost kernel: [<ffffffff811baba3>] sys_fdatasync+0x13/0x20
Aug 31 20:31:44 localhost kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Analysis: when gbased issues the fdatasync system call, it gets stuck in the file system's vfs_fsync handling.
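While the machine still responds, processes in D state (uninterruptible sleep) can be listed by reading `/proc/<pid>/stat`, whose third field is the process state. The sketch below is illustrative and was not part of the original investigation: the helper names `parse_stat_line` and `find_uninterruptible` are assumptions. It assumes a Linux `/proc` file system and returns an empty list elsewhere.

```python
import os


def parse_stat_line(line):
    """Parse the leading fields of /proc/<pid>/stat: 'pid (comm) state ...'.

    comm may contain spaces or parentheses, so the name is taken to span
    from the first '(' to the last ')'.
    """
    lpar = line.index("(")
    rpar = line.rindex(")")
    pid = int(line[:lpar].strip())
    comm = line[lpar + 1:rpar]
    state = line[rpar + 1:].split()[0]
    return pid, comm, state


def find_uninterruptible(proc_root="/proc"):
    """Return (pid, comm) for every process currently in state 'D'."""
    blocked = []
    if not os.path.isdir(proc_root):
        return blocked  # not a Linux system, or /proc is not mounted
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'meminfo'
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path, errors="replace") as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were scanning
        if state == "D":
            blocked.append((pid, comm))
    return blocked


if __name__ == "__main__":
    for pid, comm in find_uninterruptible():
        print(pid, comm)
```

A process that appears in this list on repeated runs, as gbased did here, is a candidate for the kind of stall reported in the messages log. A single sighting is normal, since short D-state waits occur during ordinary disk I/O.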
