2008-04-27 Sun
Author:拖雷 posted on Taobao.com
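How likely is it for an IDC data center to lose power? And how likely is it to lose power once a year, two years running? That is exactly what happened at our Netcom (网通) data center, and I am the person who got caught in this power-outage disaster both years. I would really like to know how many DBAs are this "lucky".
On June 28 last year, a UPS failure at the Netcom IDC data center cut power to the whole facility and affected all of taobao's business for 120 minutes. On the afternoon of April 25 this year the disaster repeated itself. The only difference was that this time the cause was not the UPS but a power line fault, and the impact lasted even longer: 210 minutes.
The overall environment at this data center is not very good, and we have been gradually moving our servers out of it. In fact, when the problem hit, I had only one core database still running there, and in one more week that database would have left as well. But that is exactly when accidents tend to happen.
I still remember this outage: 13:08 on the afternoon of April 25, yet another peak business period, and everything looked the same as usual. First I got a call from the operations manager telling me that Netcom would lose power in 5 minutes, which left me stunned. After confirming it was not an April Fools' joke, I quickly passed the news to the people concerned. Less than a minute after I took the call, roughly as I finished explaining the situation, someone was already shouting, "Netcom is unreachable."
I nearly collapsed. Surely this is not how a data center announces a power cut, with one minute of buffer time? Only later did I learn that after the line failed the UPS took over, but the data center staff did not notice during that precious initial period on UPS power, and by the time they discovered the situation it was too late.
This database actually has a standby environment in another data center, but because it runs Data Guard in asynchronous mode, switching over is possible but is certain to lose some data (it can be backfilled from the application logs, which is somewhat troublesome). We decided first to estimate when power would return: if within half an hour, we would wait; otherwise, we would switch. According to the earliest information, power should have been restored in roughly half an hour, but half an hour passed and it still was not back, so we decided to bring up the standby environment in the other data center. I had just taken the database from standby to the open state when the latest news arrived: the fault at the Netcom IDC data center had been cleared and power was restored. I fainted again. Between using the standby database and backfilling some data, or bringing the powered-off equipment back up with no data to backfill, I chose the latter. The price was that my 800-plus GB standby, having been opened, had to be rebuilt.
An IBM P590 takes a really long time to restart after losing power: powering on and booting the OS took a full hour. Add starting the applications, and by the time the applications were back to normal the precious 210 minutes were gone. This was a Friday afternoon, 210 minutes of a peak business period. Fortunately, the outage did not hit our DBA hardware very hard, apart from a few failed disks in a CX700. On the SA side, many servers were damaged by the sudden loss and return of power.
There is nothing more to say; it is just bad luck. There has been a lot going on lately already, and with this on top, 江枫 has now pulled five all-nighters in a row as of today. Everyone has worked hard, and we will get through this.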
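In the just-concluded 2008 World Poker Tour championship final, which had a $25,000 buy-in, David Chiu, a US-based player from Guangxi in mainland China, won the title and $3.4 million in prize money. Chiu ("老邱") comes from Guangxi, China, has long been established in the poker world, and holds five WSOP gold bracelets. I played at the same table as him twice, at the 2004 WSOP championship event and the 2005 WSOP $1500 limit hold'em event, and his skill left a deep impression on me.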


2008-04-26 Sat
Let us talk a bit more about disks. You might have read my previous post and Matt’s reply, and it looks like there are a few more things to clarify and explain.
Before I get to the main topic of the article, let me comment on the IO vs disk question. If you look at disk-based databases, all data accesses are treated as IOs - an access can be “logical” if the data is cached or “physical” if it requires actual IO, but in the general sense all data accesses are IOs. If you use this terminology, then most problems come down to IO - making queries touch fewer rows (or row portions) or having these “touches” resolved as logical IO rather than physical. There is still locking, networking etc. to deal with, but that is a minor story.
This is not, however, how most people understand IO, and not how I typically use the term. For me IO means an IO-bound workload - disks are moving and the CPU sits idle. With this terminology far fewer cases are about IO, because we would call cases where too much logical IO is happening CPU bound. The beauty of this terminology (and the reason I use it) is that it is very easy to see whether a system is IO bound or CPU bound, while understanding whether MySQL goes through more rows than it needs to requires looking at the queries and schema.
OK, let us now get back to the main point of the article.
In the original article I mentioned that having multiple hard drives does not help if you have a single query (or stream of queries) to deal with. This is indeed not exactly the case - my point is that you should not expect as much gain as you might from having, say, 8 hard drives instead of one.
Let us first look at how a single query is executed, for the Innodb storage engine to be more specific. Let's look at update queries (from the replication thread, for example). When an update is performed, the first problem is actually reading the data. If you’re updating a row you need to fetch the page containing the old row version (and possibly the index pages you’re about to modify) - even if you’re doing an INSERT you will need to fetch at least the clustered index page. These reads are issued by Innodb one by one - the next read request can only be issued after the previous request has completed.
Innodb tries to optimize these reads a bit - there is sequential read-ahead and random read-ahead, which are designed to spot data access patterns and prefetch the data before it is needed. They can execute in parallel with the normal read operations issued by the thread and so can result in multiple outstanding requests to the disk. This does not help dramatically for many “random” update queries, though.
After pages are fetched and modified they need to be written to disk. This does not happen at once, however - the thread executing the request does not have to wait for the dirty pages to be flushed - flushing happens in the background on its own schedule. Such flush activity is another thing that happens in parallel even if you have a single running query. It is specific to write requests, though - if you have reporting queries to deal with you will not benefit from this parallelization.
Of course, another thing an update query needs to do is flush the transactional log. With innodb_flush_log_at_trx_commit=1 (the default) this is a synchronous operation and the thread needs to wait for it to complete before continuing. On decent systems you have a battery-backed cache, so this wait is not long. If innodb_flush_log_at_trx_commit is set to 0 or 2, the physical IO happens in the background, giving yet another request which can be executed in parallel.
So the first thing to consider: a single-client workload does not necessarily mean a single outstanding disk IO request at all times. So naturally it can benefit from multiple hard drives this way.
But what about the case when you have only one IO request to deal with? In this case you can also benefit from multiple drives, for a different reason. If you’re using RAID10 - the most commonly used RAID level for write-intensive database systems - you have 2 hard drives to pick from when reading any block. The RAID controller (and even a software RAID driver) typically tracks disk head position, so if both hard drives are idle it will pick the drive with the shorter seek time to read from.
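To get a rough feel for how much picking the nearer head can help, here is a small Monte Carlo sketch. It is an illustration under loud assumptions, not a model of any real controller: the target block and both head positions are drawn uniformly and independently across the full stroke, and seek cost is taken as proportional to distance. Real mirrors often have correlated head positions because every write goes to both drives.

```python
import random

def mean_seek_distances(samples: int = 200_000, seed: int = 7) -> tuple[float, float]:
    """Return (single-disk, two-mirror) mean seek distance as a fraction of full stroke.

    Assumption: target and head positions are independent and uniform in [0, 1).
    """
    rng = random.Random(seed)
    single_total = 0.0
    mirrored_total = 0.0
    for _ in range(samples):
        target = rng.random()   # track position of the requested block
        head_a = rng.random()   # current head position on mirror A
        head_b = rng.random()   # current head position on mirror B (assumed independent)
        dist_a = abs(target - head_a)
        dist_b = abs(target - head_b)
        single_total += dist_a                  # one disk: no choice of head
        mirrored_total += min(dist_a, dist_b)   # RAID1: read from the nearer head
    return single_total / samples, mirrored_total / samples

single, mirrored = mean_seek_distances()
print(f"single disk: {single:.3f} of full stroke")
print(f"two mirrors: {mirrored:.3f} of full stroke")
```

Only the seek distance shrinks here; rotational latency is untouched by this choice.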
The other seek time optimizations come from striping - consider a 100GB database stored on a single disk. Assuming it is not fragmented, it will take N tracks on this disk. Now if we stripe the same data across a couple of hard drives it will take N/2 tracks on each, so the average seek time will be shorter. A similar effect comes from using larger hard drives - if you put 100GB on a 100GB hard drive it is spread across all tracks, while if you store 100GB as a non-fragmented block on a 1TB hard drive, your seeks are confined to about 1/10 of the full seek distance.
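The effect of confining data to a narrower band of tracks can be put into numbers with a tiny sketch. It assumes head and target positions are uniform and independent inside the band, in which case the mean distance between two such points is one third of the band width. It also assumes that a fraction of capacity maps to the same fraction of tracks, which real drives with zoned recording only approximate, and it says nothing about seek time, which is not linear in distance.

```python
def mean_seek_fraction(band_fraction: float) -> float:
    """Mean seek distance, as a fraction of the full stroke, when all accesses
    fall uniformly inside a band covering `band_fraction` of the tracks.

    For two independent uniform points on an interval of width w,
    the expected distance between them is w / 3.
    """
    return band_fraction / 3.0

# Hypothetical layouts matching the examples in the text.
layouts = [
    ("100GB filling a 100GB disk", 1.0),
    ("100GB striped over 2 such disks", 0.5),
    ("100GB block on a 1TB disk", 0.1),
]
for label, band in layouts:
    print(f"{label}: mean seek = {mean_seek_fraction(band):.3f} of full stroke")
```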
I must say, however, that some people expect too much improvement from optimizing seeks. You can have shorter seeks, but as soon as you have any seeks at all you will be far from sequential disk access performance. First consider what has to happen inside the hard drive to read or write the data you have requested. The drive has to seek to the proper track, and then you need the platter to rotate to the place where your data is stored before the IO operation can be performed. When we speak about seeks we imply this rotation as well, though people often forget to account for it. A 15K RPM hard drive does 250 rotations per second (4ms per rotation), which gives us 2ms of average rotational latency - exactly what you can see in the specs. Looking at the same specs (for reads) we get 0.2ms track-to-track seek time (our best case) and 3.5ms average seek time, which is a significant difference. Now if we add the 2ms of average latency which we have to deal with in both cases, we have 2.2ms vs 5.5ms, which is a 2.5 times difference - this is a lot, but remember this was the best case. Typically you will not be able to get over 30-50% from the second hard drive (with diminishing returns if you keep more copies). So seek time is worth optimizing, but do not expect magic, because latency is not going away.
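The arithmetic above is easy to reproduce. The sketch below uses only the figures quoted in the text (15K RPM, 0.2ms track-to-track seek, 3.5ms average seek). The function names are made up for illustration.

```python
def rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency: on average the platter turns half a rotation."""
    rotation_ms = 60_000.0 / rpm   # duration of one full rotation in milliseconds
    return rotation_ms / 2.0

def access_time_ms(seek_ms: float, rpm: float) -> float:
    """Seek time plus average rotational latency (transfer time ignored)."""
    return seek_ms + rotational_latency_ms(rpm)

RPM = 15_000
best = access_time_ms(0.2, RPM)      # track-to-track seek, the best case
average = access_time_ms(3.5, RPM)   # average seek from the spec sheet

print(f"rotational latency: {rotational_latency_ms(RPM):.1f} ms")
print(f"best case access:   {best:.1f} ms")
print(f"average access:     {average:.1f} ms")
print(f"ratio:              {average / best:.2f}x")
```

Even with a seek 17.5 times shorter, the access time only improves 2.5 times, because the 2ms rotational term is paid in both cases.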
Let us get back to the case of multiple outstanding IOs - Matt points out the issue of disk contention, which people constantly forget about. If you ask most people which would be faster for random reads (writes are obvious) - RAID0 or RAID1 - most will think they are about the same, since in both cases you have 2 disks to deal with. There is a serious difference, however - with RAID1 you can use either disk to perform ANY read request, while with RAID0 only the disk which holds the data can serve it (even setting aside partial “border” IO requests which require reading both drives). If you have 2 random outstanding IO requests, there is a 50% chance they will need blocks from the same drive and will have to be serialized if you’re using RAID0 with 2 drives.
This number improves as you get more drives, because there is less chance that 2 requests will hit the same drive, and it also improves with increased concurrency. If you have 256 concurrent IOs, for example, this effect will almost disappear. This is why I think people often do not see this difference - IO subsystem capacity is often tested with a single thread and with some high number of threads, neither of which shows this effect well.
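The contention numbers can be sketched with two small formulas. They assume every request targets a uniformly random drive, independently of the others, and that each request touches exactly one drive (no “border” requests). The function names are illustrative only.

```python
def same_drive_probability(drives: int) -> float:
    """Chance that two independent random requests need the same RAID0 drive."""
    return 1.0 / drives

def expected_busy_drives(drives: int, outstanding: int) -> float:
    """Expected number of distinct RAID0 drives hit by `outstanding` random requests.

    Each drive is missed by all requests with probability (1 - 1/drives) ** outstanding.
    """
    return drives * (1.0 - (1.0 - 1.0 / drives) ** outstanding)

for drives in (2, 4, 8):
    print(f"{drives} drives: collision chance for 2 requests = "
          f"{same_drive_probability(drives):.1%}")

for outstanding in (1, 2, 8, 256):
    busy = expected_busy_drives(2, outstanding)
    print(f"2-drive RAID0, {outstanding:3d} outstanding IOs: "
          f"{busy:.2f} drives busy on average")
```

With 2 outstanding requests a 2-drive RAID0 keeps only 1.5 drives busy on average, whereas RAID1 can always use both for reads. At high concurrency both drives are almost always busy, which is why the gap fades.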
So as you can see, tuning the IO subsystem can indeed be fun - there is a lot to deal with, not even mentioning various serialization, stripe size, cache policy, filesystem and OS issues.
Entry posted by peter | 2 comments
2008-04-25 Fri
AnySQL.net
DBA notes
Oracle & Starcraft
eagle's home
Give you some color to see see!
AnySQL.net English
Oracle Scratchpad
Oracle Life
OracleDBA Blog---请享受无法回避的痛苦!
Uploads from dbanotes
Chanel [K]
xzh2000的博客
Oracle Security Blog
ERN空间
Eddie Awad's Blog
MySQL Performance Blog
The Tom Kyte Blog
del.icio.us/fenng/oracle
AIXpert
O'Reilly Databases
Red Hat Magazine
DBASupport
DB2 Magazine 中文版
developerWorks : AIX 专区的文章,教程
Pythian Group Blog » Log Buffer
车东[Blog^2]
blue_prince
玉面飞龙的BLOG
此生 今世
人生就是如此
Orange Tiger 木匠 的 移民生活
生活帮-LifeBang
Hey!! Sky!
dba on unix
Oracle Notes Wiki
Brotherxiao's Home
柔嘉维则@life.oracle.eng
Fenng's shared items in Google Reader
jametong's shared items in Google Reader
缥缈游侠-logzgh
Tanel Poder's blog: Core IT for geeks and pros
DBA Tools
ilonng
yangtingkun
NinGoo@Net
Oracle & Unix
Inside the Oracle Optimizer - Removing the black magic
Ricky's Test Blog
DBA@Taobao
存储部落
Think in 88
Alibaba DBA Team
Oracle Team @SNC
淘宝数据仓库团队
OracleBlog.cn




