2012-01-25 Wed
My list of the 8 most interesting companies for the future of Hadoop didn’t try to include every vendor whose product has the word Hadoop in it. The list from InformationWeek does. To save you 15 clicks, here’s their list:
- Amazon Elastic MapReduce
- Cloudera
- Datameer
- EMC (with EMC Greenplum Unified Analytics Platform and EMC Data Computing Appliance)
- Hadapt
- Hortonworks
- IBM (InfoSphere BigInsights)
- Informatica (for HParser)
- Karmasphere
- MapR
- Microsoft
- Oracle
Original title and link: 12 Hadoop Vendors to Watch in 2012 (©myNoSQL)
2012-01-24 Tue
Jonathan Hsieh provides a summary of the new features in HBase 0.92.0 by splitting them into user features:
- HFile v2, a new more efficient storage format
- Faster recovery via distributed log splitting
- Lower latency region-server operations via new multi-threaded and asynchronous implementations.
operator features:
- An enhanced web UI that exposes more internal state
- Improved logging for identifying slow queries
- Improved corruption detection and repair tools
and developer features:
- Coprocessors
- Build support for Hadoop 0.20.20x, 0.22, 0.23.
- Experimental: offheap slab cache and online table schema change
Earlier today, when covering the HBase 0.92.0 release, I wrote that coprocessors are the highlight of this release. I’ll take that back. There are way too many interesting features in HBase 0.92.0 to highlight just one of them.
Original title and link: More Details About Apache HBase 0.92.0 (©myNoSQL)
Google is actively researching ways to improve TCP:
Our research shows that the key to reducing latency is saving round trips. We’re experimenting with several improvements to TCP. Here’s a summary of some of our recommendations to make TCP faster:
- Increase TCP initial congestion window to 10 (IW10). The amount of data sent at the beginning of a TCP connection is currently 3 packets, implying 3 round trips (RTT) to deliver a tiny 15KB-sized content.
- Reduce the initial timeout from 3 seconds to 1 second.
- Use TCP Fast Open (TFO).
- Use Proportional Rate Reduction for TCP (PRR).
The database world has attacked network latency with connection pools and pipelining, and has cut network round trips with JOINs or denormalized data. But all software architectures will benefit from a faster TCP.
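To see why the initial congestion window matters, here is a minimal Python sketch that counts the round trips needed to deliver a payload under an idealized slow start: the window doubles every round trip, nothing is lost, delayed ACKs are ignored, and the handshake is not counted. The function name `round_trips` and the 1460-byte segment size are assumptions made for illustration; they do not come from Google's post.

```python
import math


def round_trips(size_bytes, initial_window, mss=1460):
    """Count round trips needed to send size_bytes under idealized slow start.

    Assumptions (hypothetical model, not a real TCP stack): the congestion
    window starts at initial_window segments and doubles every round trip,
    there is no packet loss, and the connection handshake is ignored.
    """
    if initial_window <= 0 or mss <= 0:
        raise ValueError("initial_window and mss must be positive")

    segments = math.ceil(size_bytes / mss)  # full-size segments required
    window = initial_window
    sent = 0
    rtts = 0
    while sent < segments:
        sent += window   # one window's worth of segments per round trip
        window *= 2      # slow start: window doubles after each round trip
        rtts += 1
    return rtts


if __name__ == "__main__":
    for iw in (3, 10):
        print(f"initcwnd={iw}: {round_trips(15_000, iw)} round trips for 15,000 bytes")
```

Under these assumptions a payload of about 15,000 bytes takes three round trips with an initial window of 3 and two with an initial window of 10, while anything that fits in ten segments goes out in a single round trip.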
Original title and link: Google Research: Let’s Make TCP Faster (©myNoSQL)
It looks like the three pictures about Hadoop versions—first two by Cloudera and the third by Konstantin I. Boudnik & Cos—are actually worth 1066 Gartner words.
On the other hand, to address the question in the title—would custom distributions clarify Hadoop versions—I think that while custom distributions might be helpful for experimenting or getting started with Hadoop, long term they’ll actually lead to more segmentation in the market and bigger maintenance and upgrade costs for end users.
There are just a few companies with a track record of maintaining and distributing open source projects. In the Hadoop space these are Cloudera and Hortonworks (N.B. Hortonworks is supporting the Apache Hadoop distribution). So if a vendor tries to sell you a Hadoop package, ask them about their history of managing open source distributions.
Original title and link: Apache Hadoop 1.0 Doesn’t Clear Up Trunks and Branches Questions. Do Distributions? (©myNoSQL)
Tarsnap is a service offering secure online backups. Colin Percival details the costs Tarsnap would have for using Amazon DynamoDB:
For each TB of data stored, this gives me 30,000,000 blocks requiring 60,000,000 key-value pairs; these occupy 2.31 GB, but for DynamoDB pricing purposes, they count as 8.31 GB, or $8.31 per month. That’s about 2.7% of Tarsnap’s gross revenues (30 cents per GB per month); significant, but manageable. However, each of those 30,000,000 blocks need to go through log cleaning every 14 days, a process which requires a read (to check that the block hasn’t been marked as deleted) and a write (to update the map to point at the new location in S3). That’s an average rate of 25 reads and 25 writes per second, so I’d need to reserve 50 reads and 50 writes per second of DynamoDB capacity. The reads cost $0.01 per hour while the writes cost $0.05 per hour, for a total cost of $0.06 per hour — or $44 per month. That’s 14.6% of Tarsnap’s gross revenues; together with the storage cost, DynamoDB would eat up 17.3% of Tarsnap’s revenue — slightly over $0.05 from every $0.30/GB I take in.
To put it differently, keeping 82.7% of revenue after DynamoDB costs sounds like a good deal, but without knowing the costs of the other components (S3, EC2, data transfer) it’s difficult to conclude whether this solution would remain profitable at a good margin. Anyway, an interesting aspect of this solution is that the costs of some major components of the platform (S3, DynamoDB) would scale linearly with the revenue.
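The arithmetic in the quote can be rechecked with a short Python sketch. All prices and sizes are the figures quoted above; the function name `dynamodb_cost_share`, the 1 TB = 1000 GB conversion, and the 730-hour month are assumptions made for illustration.

```python
# Figures quoted in the post (per TB of stored Tarsnap data).
BLOCKS_PER_TB = 30_000_000
CLEANING_PERIOD_DAYS = 14
BILLED_STORAGE_GB = 8.31           # what DynamoDB bills for, per the quote
STORAGE_PRICE_PER_GB_MONTH = 1.00  # implied by "8.31 GB, or $8.31 per month"
READ_COST_PER_HOUR = 0.01          # 50 reads/s of reserved capacity
WRITE_COST_PER_HOUR = 0.05         # 50 writes/s of reserved capacity
REVENUE_PER_GB_MONTH = 0.30

# Assumptions not stated in the post.
GB_PER_TB = 1000
HOURS_PER_MONTH = 730              # 365 * 24 / 12


def dynamodb_cost_share():
    """Return monthly DynamoDB costs per TB and their share of revenue."""
    revenue = GB_PER_TB * REVENUE_PER_GB_MONTH

    # Every block is cleaned once per period: one read and one write each.
    ops_per_second = BLOCKS_PER_TB / (CLEANING_PERIOD_DAYS * 24 * 3600)

    storage_cost = BILLED_STORAGE_GB * STORAGE_PRICE_PER_GB_MONTH
    throughput_cost = (READ_COST_PER_HOUR + WRITE_COST_PER_HOUR) * HOURS_PER_MONTH
    total_cost = storage_cost + throughput_cost

    return {
        "ops_per_second": ops_per_second,
        "storage_cost": storage_cost,
        "throughput_cost": throughput_cost,
        "storage_share": storage_cost / revenue,
        "throughput_share": throughput_cost / revenue,
        "total_share": total_cost / revenue,
    }


if __name__ == "__main__":
    for name, value in dynamodb_cost_share().items():
        print(f"{name}: {value:.4f}")
```

With these inputs the cleaning rate comes out just under 25 operations per second and the throughput cost at roughly $44 per month, in line with the quoted numbers. The combined share lands between 17% and 18% of revenue; the exact last digit depends on where the intermediate figures are rounded.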
Original title and link: A Cost Analysis of DynamoDB for Tarsnap (©myNoSQL)
Just a quick roundup of the latest releases and announcements.
Hortonworks Data Platform (HDP) version 2
HDP v2 will include:
- NextGen MapReduce architecture
- HDFS NameNode HA
- HDFS Federation
- up-to-date HCatalog, HBase, Hive, Pig
According to the announcement:
In order to avoid confusion, let me explain the two versions of HDP:
- HDP v1 is based upon Apache Hadoop 1.0 (which comes from the 0.20.205 branch). It is the most stable, production-ready version of Hadoop that is currently found in many large enterprise deployments. HDP v1 is currently available as a private technology preview. A public technology preview will be made available later this quarter.
- HDP v2 is based upon Apache Hadoop 0.23, which includes the next generation advancements mentioned above. It’s an important step forward in terms of scalability, performance, high availability and data integrity. A technology preview will also be made publicly available later in Q1.
SolrCloud Completes Phase 2
Mark Miller on the completion of Phase 2:
The second phase of SolrCloud has been in full swing for a couple of months now and it looks like we are going to be able to commit this work to trunk very soon! In Phase1 we built on top of Solr’s distributed search capabilities and added cluster state, central config, and built-in read side fault tolerance. Phase 2 is even more ambitious and focuses on the write side. We are talking full-blown fault tolerance for reads and writes, near real-time support, real-time GET, true single node durability, optimistic locking, cluster elasticity, improvements to the Phase 1 features, and more.
Not there yet, but it’s coming.
DataStax Community Server 1.0.7
A new release of DataStax’s distribution of Cassandra, incorporating Cassandra 1.0.7.
HBase 0.92
Don’t let the version number trick you. This is an important release for HBase featuring:
- coprocessors
- security
- new (self-migrating) file format
- AWS improvements: EBS support, building a HA cluster
The list of new features, improvements, and bug fixes in HBase 0.92 is impressive. But the highlight of this release is, in my opinion, HBase coprocessors (Jira entry HBASE-200).
I’m leaving you with Andrew Purtell’s slides about HBase Coprocessors:
Original title and link: Latest NoSQL Releases: HBase 0.92, DataStax Community Server, Hortonworks Data Platform, SolrCloud (©myNoSQL)
An official slidedeck to introduce Amazon DynamoDB to your team. My notes about DynamoDB could be a nice addition.
Original title and link: Introducing Amazon DynamoDB Slidesdeck (©myNoSQL)
Etsy went from using HTTP to BitTorrent for replicating Solr indexes:
By integrating BitTorrent protocol into Solr we could replace HTTP replication. BitTorrent supports updating and continuation of downloads, which works well for incremental index updates. When we use BitTorrent for replication, all of the slave servers seed index files allowing us to bring up new slaves (or update stale slaves) very quickly.
[…]
Our Ops team started experimenting with a BitTorrent package herd, which sits on top of BitTornado. Using herd they transferred our largest search index in 15 minutes. They spent 8 hours tweaking all the variables and making the transfer faster and faster. Using pigz for compression and herd for transfer, they cut the replication time for the biggest index from 60 minutes to just 6 minutes!
Make sure you don’t miss the part where they were experimenting with multicast UDP rsync.
Original title and link: Solr Index Replication at Etsy: From HTTP to BitTorrent (©myNoSQL)
Jelastic, a company offering a cloud platform for Java server hosting, has published some stats about the databases used by their over 7000 users:

While it would be wrong to generalize these results to absolute database market share, it is nonetheless interesting to see that MongoDB has already overtaken PostgreSQL as the second most used database, and that CouchDB, which was added only one month ago, is already used by 5% of Jelastic’s users. MySQL holds first place with over 40% of users, or put differently, double the share of second-place MongoDB.
These numbers would be even more interesting if they accounted for real usage stats like database sizes or query volumes.
Original title and link: Jelastic Database Marketshare: MySQL, MongoDB, MariaDB (©myNoSQL)
2012-01-23 Mon