2008-08-01 Fri
JOINs are expensive, and typically the fewer tables (within the same database) you join, the better performance you get. As with any rule, however, there are exceptions.
The one I'm speaking about comes from the MySQL optimizer not using further index key parts once there is a range clause on an earlier key part. So if you have INDEX(A,B) and a WHERE clause A BETWEEN 5 AND 10 AND B=6, only the first part (A) of the index will be used, which can seriously affect performance. Of course in this example you can use INDEX(B,A), but there are many similar cases where that is not possible.
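To make the cost difference concrete, here is a toy model in Python. It is not how MySQL is implemented; it only treats a sorted list of (A, B) tuples as a stand-in for INDEX(A,B) and counts how many index entries each access strategy has to touch. The function names and the sample data are made up for illustration.

```python
from bisect import bisect_left, bisect_right

def range_then_filter(index, a_lo, a_hi, b):
    """Range on A only (A is assumed to be an integer); B is checked per entry."""
    lo = bisect_left(index, (a_lo,))
    hi = bisect_left(index, (a_hi + 1,))
    examined = index[lo:hi]
    matches = [key for key in examined if key[1] == b]
    return len(examined), matches

def point_lookups(index, a_values, b):
    """One lookup per (A, B) pair, so both key parts narrow the search."""
    examined, matches = 0, []
    for a in a_values:
        lo = bisect_left(index, (a, b))
        hi = bisect_right(index, (a, b))
        examined += hi - lo
        matches.extend(index[lo:hi])
    return examined, matches

# Hypothetical data: 20 values of A, each with 100 values of B.
index = sorted((a, b) for a in range(1, 21) for b in range(1, 101))

scanned_range, found_range = range_then_filter(index, 5, 10, 6)
scanned_points, found_points = point_lookups(index, range(5, 11), 6)
print(scanned_range, scanned_points)
```

Both strategies return the same six entries, but the range scan touches every entry with A between 5 and 10, while the point lookups touch only the matching ones.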
I have described a couple of solutions to this problem: using an IN list instead of the range, or UNION. Both, however, require rather serious application changes and can also result in huge IN lists and suboptimal execution for large ranges.
Let's take a look at a very typical reporting query, which queries data for a date range for multiple groups (these can be devices, pages, users, etc.):
CREATE TABLE `info` (
  `id` int(10) UNSIGNED NOT NULL AUTO_INCREMENT,
  `d` date NOT NULL,
  `group_id` int(10) UNSIGNED NOT NULL,
  `events` int(10) UNSIGNED NOT NULL,
  PRIMARY KEY (`id`),
  KEY `d` (`d`,`group_id`)
) ENGINE=MyISAM AUTO_INCREMENT=18007591 DEFAULT CHARSET=latin1

mysql> SELECT sum(events) FROM info WHERE d BETWEEN '2007-01-01' AND '2007-01-31' AND group_id IN (10,20,30,40,50,60,70,80,90,100);
+-------------+
| sum(events) |
+-------------+
|     3289092 |
+-------------+
1 row in set (1.04 sec)

mysql> EXPLAIN SELECT sum(events) FROM info WHERE d BETWEEN '2007-01-01' AND '2007-01-31' AND group_id IN (10,20,30,40,50,60,70,80,90,100) \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: info
         type: range
possible_keys: d
          key: d
      key_len: 7
          ref: NULL
         rows: 355213
        Extra: Using where
1 row in set (0.00 sec)
As you can see from the EXPLAIN, this query is expected to examine over 300,000 rows. That is relatively fast for this (in-memory) table but will become unacceptable as soon as random disk I/O is involved.
Note this is also an interesting case of EXPLAIN being wrong: it shows key_len=7, which corresponds to the full key, while only the first key part is used.
Let us now replace the range with an IN list in this query:
mysql> EXPLAIN SELECT sum(events) FROM info WHERE d IN('2007-01-01','2007-01-02','2007-01-03','2007-01-04','2007-01-05','2007-01-06','2007-01-07','2007-01-08','2007-01-09','2007-01-10','2007-01-11','2007-01-12','2007-01-13','2007-01-14','2007-01-15','2007-01-16','2007-01-17','2007-01-18','2007-01-19','2007-01-20','2007-01-21','2007-01-22','2007-01-23','2007-01-24','2007-01-25','2007-01-26','2007-01-27','2007-01-28','2007-01-29','2007-01-30','2007-01-31') AND group_id IN (10,20,30,40,50,60,70,80,90,100) \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: info
         type: range
possible_keys: d
          key: d
      key_len: 7
          ref: NULL
         rows: 3681
        Extra: Using where
1 row in set (0.01 sec)

mysql> SELECT sum(events) FROM info WHERE d IN('2007-01-01','2007-01-02','2007-01-03','2007-01-04','2007-01-05','2007-01-06','2007-01-07','2007-01-08','2007-01-09','2007-01-10','2007-01-11','2007-01-12','2007-01-13','2007-01-14','2007-01-15','2007-01-16','2007-01-17','2007-01-18','2007-01-19','2007-01-20','2007-01-21','2007-01-22','2007-01-23','2007-01-24','2007-01-25','2007-01-26','2007-01-27','2007-01-28','2007-01-29','2007-01-30','2007-01-31') AND group_id IN (10,20,30,40,50,60,70,80,90,100);
+-------------+
| sum(events) |
+-------------+
|     3289092 |
+-------------+
1 row in set (0.02 sec)
So we get the same result, but approximately 50 times faster. In this report we had just one month's worth of data. What if you had a year? Five years? What if you get, say, thousands of groups at the same time? To perform such a query MySQL has to build (and do lookups for) all combinations, which is 31*10=310 in this case. If that number gets into the hundreds of thousands, this method starts to break down (and newer MySQL versions will stop using this optimization if there are too many combinations to check).
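On the application side, the IN list for a date range does not need to be typed by hand. The following is a minimal Python sketch, assuming the table and column names from the example above; the helper names date_in_list, build_query and combinations are made up for illustration. It also computes the number of (day, group) combinations, so an application could fall back to another approach when that number gets too large.

```python
from datetime import date, timedelta

def date_in_list(start: date, end: date) -> str:
    """Return one quoted date literal per day in [start, end], comma separated."""
    days = (end - start).days + 1
    return ", ".join(f"'{start + timedelta(days=i)}'" for i in range(days))

def build_query(start: date, end: date, group_ids) -> str:
    """Build the report query with the date range expanded into an IN list."""
    groups = ", ".join(str(int(g)) for g in group_ids)  # int() guards against injection
    return (
        "SELECT sum(events) FROM info "
        f"WHERE d IN ({date_in_list(start, end)}) AND group_id IN ({groups})"
    )

def combinations(start: date, end: date, group_ids) -> int:
    """Number of (day, group) pairs the server would have to look up."""
    return ((end - start).days + 1) * len(list(group_ids))

groups = range(10, 101, 10)
print(combinations(date(2007, 1, 1), date(2007, 1, 31), groups))
print(build_query(date(2007, 1, 1), date(2007, 1, 3), groups))
```

For January 2007 and the ten groups above this gives the 310 combinations mentioned in the text.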
Instead you could use a JOIN: get the list of days matching the range from some pre-generated table and use the join to retrieve the rows from the original table:
mysql> SHOW CREATE TABLE dl \G
*************************** 1. row ***************************
       Table: dl
Create Table: CREATE TABLE `dl` (
  `myday` date NOT NULL,
  PRIMARY KEY (`myday`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

mysql> SELECT * FROM dl LIMIT 5;
+------------+
| myday      |
+------------+
| 2001-01-01 |
| 2001-01-02 |
| 2001-01-03 |
| 2001-01-04 |
| 2001-01-05 |
+------------+
5 rows in set (0.00 sec)

mysql> EXPLAIN SELECT sum(events) FROM info,dl WHERE myday BETWEEN '2007-01-01' AND '2007-01-31' AND myday=d AND group_id IN (10,20,30,40,50,60,70,80,90,100) \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: dl
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 3
          ref: NULL
         rows: 30
        Extra: Using where; Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: info
         type: range
possible_keys: d
          key: d
      key_len: 7
          ref: NULL
         rows: 355213
        Extra: Using where
2 rows in set (0.00 sec)
As you can see, it does not work, even though I know I have used exactly this trick to optimize some nasty queries.
It looks like equality propagation is at work here (note that the estimated number of rows for the second table in the join is the same as in the original query), so we get the range clause on the "info" table instead of a nested loops join, which is exactly what we tried to avoid.
It is easy to block equality propagation by using some trivial function:
mysql> EXPLAIN SELECT sum(events) FROM info,dl WHERE myday BETWEEN '2007-01-01' AND '2007-01-31' AND d=date(myday) AND group_id IN (10,20,30,40,50,60,70,80,90,100) \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: dl
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 3
          ref: NULL
         rows: 30
        Extra: Using where; Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: info
         type: ref
possible_keys: d
          key: d
      key_len: 3
          ref: func
         rows: 17990
        Extra: Using where
2 rows in set (0.00 sec)
So we stopped equality propagation but now have another problem: for some reason MySQL decides to do a "ref" lookup on the date only, instead of using a range on the day and the list of groups for each join iteration.
This does not make sense, but this is how it is.
I also tried to increase cardinality by giving every row a different group_id, and it still does not work.
The trick does work, however, if you have just one group_id (and in this case you do not even need to work around equality propagation):
mysql> EXPLAIN SELECT sum(events) FROM info,dl WHERE myday BETWEEN '2007-01-01' AND '2007-01-31' AND d=myday AND group_id IN (10) \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: dl
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 3
          ref: NULL
         rows: 30
        Extra: Using where; Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: info
         type: ref
possible_keys: d
          key: d
      key_len: 7
          ref: test.dl.myday,const
         rows: 18
        Extra:
2 rows in set (0.00 sec)
With the original query form and a single group_id, the query took 0.95 sec. The query with the BETWEEN range replaced by an IN list was instant (0.00 sec), the same as the query using a join with the day list table.
So we finally managed to get better performance by joining the data to yet another table, though why it does not work for multiple groups remains a question to check with the MySQL Optimizer team.
UPDATE: I just heard back from Igor Babaev, who says it was designed this way (because the first component can run through very many values). The second component is simply not considered for range access unless it is an equality. You always have something to learn about MySQL Optimizer gotchas.
At the same time I figured out how to make the MySQL Optimizer do what we want: just add yet another table to the join, so the info table only gets a bunch of ref lookups:
mysql> SELECT * FROM g;
+-----+
| gr  |
+-----+
|  10 |
|  20 |
|  30 |
|  40 |
|  50 |
|  60 |
|  70 |
|  80 |
|  90 |
| 100 |
+-----+
10 rows in set (0.00 sec)

mysql> EXPLAIN SELECT sum(events) FROM g,info,dl WHERE myday BETWEEN '2007-01-01' AND '2007-01-31' AND myday=d AND group_id=g.gr \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: dl
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 3
          ref: NULL
         rows: 30
        Extra: Using where; Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: g
         type: index
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: NULL
         rows: 10
        Extra: Using index
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: info
         type: ref
possible_keys: d
          key: d
      key_len: 7
          ref: test.dl.myday,test.g.gr
         rows: 18
        Extra:
3 rows in set (0.00 sec)
This query looks very scary but in fact performs much better than the original one. In real queries you can use an existing table of ids with a WHERE clause, just as we used the table of days, instead of a precreated table.
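The day table dl used in the joins has to be filled in advance. The post does not show how that was done, so the following Python sketch is only one possible way. It uses an in-memory SQLite database as a stand-in so that it runs anywhere; with MySQL you would use a MySQL driver, its placeholder style (usually %s instead of ?) and INSERT IGNORE instead of INSERT OR IGNORE. The helper names and the chosen date span are assumptions.

```python
import sqlite3
from datetime import date, timedelta

def day_rows(start: date, end: date):
    """One single-column row per calendar day in [start, end], as ISO strings."""
    days = (end - start).days + 1
    return [((start + timedelta(days=i)).isoformat(),) for i in range(days)]

def fill_day_table(conn, start: date, end: date) -> None:
    """Create the day table if needed and insert every day in the span."""
    conn.execute("CREATE TABLE IF NOT EXISTS dl (myday DATE NOT NULL PRIMARY KEY)")
    conn.executemany("INSERT OR IGNORE INTO dl (myday) VALUES (?)", day_rows(start, end))
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for the real database
fill_day_table(conn, date(2001, 1, 1), date(2010, 12, 31))

total_days = conn.execute("SELECT count(*) FROM dl").fetchone()[0]
first_five = [row[0] for row in conn.execute("SELECT myday FROM dl ORDER BY myday LIMIT 5")]
print(total_days, first_five)
```

Because the insert ignores existing keys, the function can be re-run to extend the table without duplicating days.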
Entry posted by peter | No comment
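Consider two people, A and B. A likes Saint Seiya, Dragon Ball, Slam Dunk, and Robotech; B also likes Saint Seiya, Dragon Ball, Slam Dunk, and Robotech. Are A and B similar? No: they were simply both born in the 1980s, and beyond the childhood entertainment shared by that generation they have little in common.
Now consider two other people, C and D. C likes Saint Seiya, Dragon Ball, Slam Dunk, and Robotech; D likes Hana no Ko Lunlun, The Smurfs, Doraemon, and Chibi Maruko-chan. Are C and D similar? Yes: one may look like a hot-blooded boy and the other like a gentle girl, but both happen to work at Taobao UED, and both work in, and are good at, visual identity design.
So how do we resolve this contradiction? Let's look at a classic example: (more…)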
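A business team had a requirement: run multi-pattern matching against certain text columns of a table and find the records that contain any keyword from a keyword list.
To implement it, I put together the following SQL: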
select id, result
from
(
  select /*+parallel(t 16)*/ id,
         decode(sign(instr(x,'keyword1')),1,'keyword1,',null) ||
         decode(sign(instr(x,'keyword2')),1,'keyword2,',null) ||
         decode(sign(instr(x,'keyword3')),1,'keyword3,',null)
         as result
  from
  (
    select id as id, col1||','||col2 as x
    from table1
  ) t
)
where result is not null
/
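Some notes:
1. Output:
All records that contain any of the keywords; for each one, the record's id and the keywords matched in it. Other columns can be output as needed.
2. Keyword list: all keywords to scan for are spliced into the SQL itself rather than stored in a table and joined.
The reason is that with a table join, an instr-style predicate can only be executed as a nested loop, and table1 has a very large number of rows (millions or more).
The number of keywords is not small either, typically one to two thousand; I wrote only three here for readability.
In that situation, however the nested loop is run, logical reads are enormous.
3. On parallelism: the request was urgent, so to get it done quickly I used parallel execution.
This is also a good chance to describe how I understand parallelism.
Parallelism essentially trades idle database server resources for shorter task execution time.
If you simply full-scan a large table in the 10 GB range with the degree of parallelism set to 32, storage throughput hits its limit at once.
For the instr SQL above, with the degree of parallelism set to 16, storage stays idle while CPU user% jumps to 96-97 and the server load also climbs very high.
Once I accidentally ran two copies of the SQL above at the same time and had the privilege of seeing a three-digit load.
Work like this must not be done on a production database; find an idle, non-critical environment for it.
For that price the result was fairly satisfying: task execution time was cut in half on average.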
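With one to two thousand keywords, the statement above would normally be generated rather than written by hand. The post does not say how it was produced, so the following Python sketch is only one way to do it. The function names are made up; the table, column names and degree of parallelism default to the ones in the example.

```python
def sql_quote(text: str) -> str:
    """Quote a string as a SQL literal, doubling any embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"

def build_keyword_sql(keywords, table="table1", columns=("col1", "col2"), parallel=16) -> str:
    """Splice one decode(sign(instr(...))) term per keyword into the query."""
    if not keywords:
        raise ValueError("at least one keyword is required")
    terms = " ||\n         ".join(
        f"decode(sign(instr(x,{sql_quote(k)})),1,{sql_quote(k + ',')},null)"
        for k in keywords
    )
    haystack = "||','||".join(columns)  # concatenate the searched columns
    return (
        "select id, result\n"
        "from\n"
        "(\n"
        f"  select /*+parallel(t {int(parallel)})*/ id,\n"
        f"         {terms}\n"
        "         as result\n"
        "  from\n"
        "  (\n"
        f"    select id as id, {haystack} as x\n"
        f"    from {table}\n"
        "  ) t\n"
        ")\n"
        "where result is not null"
    )

sql = build_keyword_sql(["keyword1", "keyword2", "keyword3"])
print(sql)
```

Only the keywords are escaped here; the table and column names are inserted as-is and are assumed to come from trusted configuration.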
July 31st 2006 was my last day working for MySQL, and on August 1st I started what was later incorporated as Percona, with Vadim joining me on September 1st as co-founder.
Two years is a significant anniversary for any startup: surviving (and being profitable) for two years can be seen as validation of our business model and strategy, and we're quite happy about this.
So what is our strategy? I left MySQL with the idea of building a company that would be fair in rewarding its employees for their contribution, in particular the engineers who do a lot of the heavy lifting in technology companies. I really liked many of the ideas Monty implemented during the early years of MySQL (you can see many of these same ideas described in the Hacking Companies article). We're not exactly like that, but we're very close in spirit, which you could describe as letting smart engineers gather and do cool stuff together.
The second part of our strategy is being fair to customers and providing them with great service at fair prices. We decided from the start that we would make money as a consulting company, being paid for the work it takes to deliver a service rather than focusing on maximizing leverage by selling software or subscriptions.
We develop software to be able to provide better services at lower cost to the client. This makes sense because we can help more people, and it builds efficiency as our competitive advantage.
The third part, which is important for us as founders (and we try to hire people who share our values), is giving back to the community. It works as a great marketing vehicle for us, but it also just feels right. We feel open source software is a great way for a technology company to give back to the community. We've sponsored MMM and Maatkit, released the Innodb Recovery Tools (we probably would have made a lot of money keeping them in-house, but it just does not feel right to leave people in need without a tool to get their data back if they can't pay), and sponsored some Sphinx development. We have also published a variety of patches for MySQL. Our giving back to the community does not stop there, though. On the technical side we try to provide a lot of information via the blog, forums, and presentations. We also contribute to other worthy causes, like raising money for Ivan's surgery.
Where do we plan to go? We're helping customers build and maintain high quality applications. Currently our focus is on MySQL and the surrounding technologies, but that is because it is the "pick of the web". We're constantly looking at emerging technologies to see what can be used for building large scale web applications, which is the core of our interest. We see what other challenges our customers have, and consultants with different backgrounds are joining us, which allows us to provide additional services such as capacity planning, migrations, web layer optimizations, MySQL customizations/optimizations, etc. We want people with great ideas of their own to join us and develop them in an entrepreneur-friendly atmosphere.
In these two years we've grown from a two-person company to a company employing over 20 full-time employees in Europe and the US. We're still a virtual company, with no office where people would work.
MySQL was a great school in showing how this is possible.
We have stayed profitable the whole time, taking no external money such as venture funding or loans. This allows us to develop the company at our own pace, with no obligation to deliver huge returns to anyone. We believe that as a consulting company we do not need these to maintain a comfortable pace of growth without putting undue pressure on our employees, while retaining our team values.
For Vadim and me the change was a serious one. When we started, delivering high quality services was our main challenge, and as engineers this was something we knew pretty well how to do. As the company grew, our roles changed to include a lot of challenges in organizing the administrative and sales processes, ensuring we're paid and paying our consultants, managing people, and leading the company. We're learning a lot as we go, and we're listening to the advice of the mentors we can find. We're also growing the team by looking not only for great engineers but also for people with great management and administrative skills.
Yesterday Monty visited us for dinner, and I told him it was the two-year anniversary of my leaving MySQL. He asked us if we're happy with the choice or have regrets. We have none, and we're looking forward to the next two years. Getting your own company up and running is a lot of hard work, but it is a lot of fun too.
Entry posted by peter | 6 comments
Customer entitlements and subscriptions (for Red Hat® Enterprise Linux® or other applications) can be transferred from one Red Hat Network (RHN) account to another by our Customer Service team. However, system profiles and associated Red Hat Network web login accounts cannot be transferred.
In order to transfer a system profile, the following must happen:
- the entitlement for the system must first be transferred by Customer Service
- the user must delete the system profile from the old account
- the user must re-register the system with the desired account for updates
Please contact Customer Service with your desired RHN account information and entitlement transfer request. The request must originate from the email address registered on the original account and must cc the email address listed on the account where you wish to transfer your entitlements.
Welcome to the 108th edition of Log Buffer, the weekly review of database blogs.
With almost no ado at all, let’s begin with the bad news–from StatisticsIO and Jason Massie: The Death of the DBA. And who is the perpetrator of this crime? The Cloud! It sounds like something from a John Carpenter movie, doesn’t it?
Let’s see what Jason is thinking. “I’d like to retire a SQL Server DBA with 40 years experience but I don’t think that will happen. The cloud is coming and it is bad news administrators, database or otherwise. . . . Let’s make some assumptions. The features get there. The availability gets there. The platform basically matures . . . Now put yourself in the IT decision maker’s shoes. No upfront capital expenses, no managing backups, and no patch management. . . . If they can remove their focus from managing and deploying IT, they sell and service more widgets.”
Scary stuff, right? Well, the commenters don’t entirely agree. I think it will be at least a factor, but I wonder how many managers will look at “The Cloud” and feel uncomfortable about privacy, data retention, and the like. (For myself, I couldn’t even endorse the idea of putting this blog’s comments into “The Cloud”.) What do you think?
Elsewhere on StatisticsIO, Jason has a note about MSDN’s SQL Heroes contest, whose aim is to, “. . . create a community project in CodePlex based on SQL Server 2008.” Jason also links to a list of CodePlex’s active SQL Server projects.
Turning to matters technical, Jeff’s SQL Server Blog offers a lesson on converting input explicitly at your client: don’t rely on the database to “figure it out”. Jeff takes the example of formatting dates, and shows both the right and the wrong way, writing, “I’ve said it over and over and I’ll say it again: The concept of formatting dates should never be something that your database code should ever worry about.”
On the Less Than Dot blog, SQLDenis observes that converting columns to date from datetime does not result in a scan in SQL Server 2008. What you get instead is a seek, as he demonstrates.
Indexing Foreign Keys - should SQL Server do that automatically? So asks Greg Low on The Bit Bucket. “By adding indexes on the foreign keys on three tables,” he writes, “we saw a reduction of 87% in total I/O load. . . . it really struck me that having SQL Server do this by default would avoid a lot of apparent performance problems. . . . Should SQL Server simply do this by default when you declare a foreign key reference?”
Kent Tegels of Enjoy Another Sandwich — riddle me this, riddle me that! “When is a bug not a bug?” I give up, Kent. When is a bug not a bug? (more…)
It's not very clear whether "blogs you are following" is a new feature or a synonym for blogroll, since Google Reader links to a non-existent page that is supposed to reveal more information. A thread in the Google Reader Group shows that the new feature was accidentally added and then removed.
"Google Reader automatically added a "Blogs I'm Following" folder on my Reader. I've already got my Reader set up the way I want it and this folder is superfluous and annoying," says Vanessa. "It would be nice if they gave us the option of using it before they just took it over that way! There is no mention of it in any of their help files either, this is just ridiculous," mentions Jackie.
The following screenshot, courtesy of "The Other Drummer", shows the new folder automatically added by Google Reader:
http://www.google.com/reader/view/user/-/state/com.blogger/following.

In other Google Reader news, the iPhone version has started to reformat linked web pages for mobile browsers, but this can be changed in the settings. "For users with Nokia and other AppleWebKit-enabled phones, soon your phones won't automatically choose the iPhone version of Google Reader," says a Google employee.
{ Thanks, hlpPy. }
dbanotes posted a photo:
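Picking up some cultural knowledge in my spare time: http://q.blog.sina.com.cn/hotmedia/blogfile/49c2c43f0100acpi&dpc=1
Recently a new shock word, "state luohan" (国家罗汉), has been circulating online. Its meaning is extraordinarily potent; next to it, "very yellow, very violent", "just buying soy sauce", and "doing push-ups" are child's play.
First, its origin is unusual. On the afternoon of June 23, 2008, over a construction dispute, a contractor surnamed Yang led a group that included Lu Tao, an official of the Linchuan District Court in Fuzhou, Jiangxi, to a construction site, where they surrounded and beat another contractor, surnamed Zhan. While beating him, Lu Tao shouted: "I'm from the court, I represent the state luohan, and I'll spend a million to finish you off, you peasant!" And so "state luohan" finally leapt from some people's hearts to their lips.
Second, it is a creative application of local custom. In the local dialect, "luohan" means a thug or hoodlum: someone with no honest occupation who nevertheless lives well by bullying ordinary people. In the minds of Lu Tao and his like, the state exists so that they can live well. So he creatively coined the term "state luohan". To us it may seem that inventing such a vivid term would require deep reflection on certain people and certain phenomena, but to Lu Tao and his like, being a "state luohan" is simply a matter of course, which is why he could blurt it out.
Third, it overturns traditional custom and culture. In folk tradition, luohan (arhats) are heroes free of worldly desire who uphold justice; the "Eighteen Arhats" are the guardian deities of the common people. Likewise, the state is everyone's "state", and its existence is the most tangible protection ordinary people have. With the state, we no longer need the arhats of myth. But when "state" and "luohan" are stacked together, reality and myth merge, and that is where the harm arises. The emergence of the "state luohan" is a product of failing to separate public from private. Even if the term had never been coined, the role and figure of the "state luohan" would inevitably exist. To prevent it, "state" should be separated from "luohan", and no "luohan" must ever be allowed to wear the cloak of the "state"; one might also say, "render unto God what is God's, and unto Caesar what is Caesar's".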
Shared by Fenng
Has precision marketing really come to this?
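Author: robbin Link: http://robbin.javaeye.com/blog/222947 Published: August 1, 2008
Notice: this is an original blog article published on the JavaEye website. Without the author's written permission no website may repost it; violations will be pursued legally!
This afternoon our company received a phone call from Guangzhou. A man who sounded middle-aged opened by saying "the JavaEye website is defaming Li Gang" and demanded that we immediately delete the article http://www.javaeye.com/topic/213155. He cited the article's featured-post status as evidence of our fault (clearly unaware that featured posts on JavaEye are chosen by a democratic vote of members), and he threatened that if we did not delete it, he would report the JavaEye website to the Public Security Bureau.
Articles on JavaEye belong to the members who wrote them. Unless an article violates national regulations or the site's rules, or seriously departs from the facts, we do not delete it. We therefore asked the caller for reasonable evidence that the article defames Li Gang. At that point the man lost his temper on the phone and started hurling abuse, so we had no choice but to hang up.
We then queried our back-end database and found that the several IDs mentioned in that article,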
http://shangyem.javaeye.com/
http://digenity.javaeye.com/
http://free-dem.javaeye.com/
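all come from the same IP: 59.41.221.21
So we published an announcement: http://www.javaeye.com/news/3083
But since noon today, our company phone, personal mobile phones, and home phones have been receiving a harassing call every 3 minutes. The number making the calls is 133XXXXXXXX. From our conversations with the caller, he disclosed the following:
1. He harasses people by phone for a living; his job is to call the target without pause.
2. Someone is paying him 50 yuan a day to make nonstop harassing calls to JavaEye's company phone, employees' personal phones, and their home phones.
3. The purpose is to pressure the JavaEye website into deleting evidence unfavorable to that person.
That someone would do something this outrageous to erase unfavorable evidence about himself from the Internet is all the more reason for us to make the truth public, so that the public knows the facts and this person's despicable, shameless conduct.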
dbanotes posted a photo:
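When the data volume is small, it hardly matters what you do. Otherwise it takes careful weighing of trade-offs. One point worth stressing, though: blindly chasing RPO has little value; what matters more in disaster recovery is process management.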
Author: Fenng, published on dbanotes.net.
| Reposting articles harms the Internet
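PayPal has always been quite closed about its technology; information about its back-end information architecture is rarely disclosed.
PayPal's current data warehouse runs on NCR Teradata: 32 nodes, 50 TB of data, built over three years. The company's spending on BI accounts for 60% of its total IT spending.
Before that, PayPal used an Oracle data warehouse solution; the old Oracle data warehouse environment was essentially a mirror of the production schema data. Moving from Oracle to Teradata was not a simple migration: the data model was completely rebuilt, and the data was cleansed again to improve its quality.
Because consumers in Europe and North America rely on credit cards, the credit card fraud PayPal faces is quite serious, with losses at one point as high as 0.25% (as I recall, for a while fraud from Russia and Eastern Europe was especially heavy). This may be one reason PayPal has invested so heavily in data warehousing/BI (it also acquired Fraud Sciences to reduce this risk).
Besides providing profit-and-loss reports efficiently, PayPal's data warehouse must also deliver, promptly and reliably, a metric called "Funny Mix", which represents the balance between credit card funds transaction accounting and ACH (Automated Clearing House) accounting.
By comparison, eBay's data warehouse environment takes in 40 TB of new data every day (on a par with Yahoo!'s DW). Processing that volume is somewhat challenging; reportedly the technical staff used to spend 90% of their time on data cleansing, and eBay has now also started using a large centralized Teradata data warehouse.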
--EOF--
Permalink: http://www.dbanotes.net/database/paypal_dw.html