
2012-01-19 Thu

19:39 RainStor Big Data Analytics on Hadoop Promises Impressive Data Compression Rates (4241 Bytes) » myNoSQL

RainStor has announced the Big Data Analytics on Hadoop:

  • The highest data compression in the industry with up to 40x reduction, compared to raw data typically stored in HDFS, with no re-inflation required for access
  • The ability to run faster query and analysis using both SQL query and MapReduce with 10-100x faster results
  • The ability to perform analytics directly in Hadoop, reducing the need to create copies and transfer data out
  • Reduced nodes in a Hadoop cluster with ~85 percent lower operating costs.

A couple of comments:

  • RainStor is not the only solution that can perform analytics directly in Hadoop
  • Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL
  • RainStor MapReduce support is via Pig
  • according to this, there’s an interesting aspect of RainStor support of SQL and MapReduce:

    Users can choose SQL for rapid response ad-hoc queries or run batch jobs using MapReduce against RainStor data.  Additionally you can interoperate SQL and MapReduce and join results from a query against RainStor and against native CSV files on HDFS.

    As a side note, Toad for Cloud from Quest is a tool that tries to provide a table based perspective of data in relational and NoSQL databases

Anyway, the most interesting part of the announcement is RainStor’s claimed data compression level (up to 40x) and the fact that accessing data doesn’t require re-inflation. According to an infographic, the currently available compression solutions top out at 8x:

  • Hadoop LZO: 3x
  • Compressed relational: 6x
  • Flatfile Gzip: 7x
  • Columnar: 8x

If such compression levels can be achieved frequently and the impact on other server resources (CPU, memory) is minimal, RainStor Big Data Analytics on Hadoop will definitely be an interesting part of the Hadoop market.
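
To put these ratios in perspective, here is a minimal sketch that converts them into on-disk footprint. The 100 TB raw volume is a hypothetical figure chosen for illustration, the 40x entry is RainStor's "up to" claim rather than a measured result, and replication and index overhead are ignored.

```python
# Compression ratios as quoted in the post; 40x is a vendor claim ("up to").
RATIOS = {
    "Hadoop LZO": 3,
    "Compressed relational": 6,
    "Flatfile Gzip": 7,
    "Columnar": 8,
    "RainStor (claimed, up to)": 40,
}

RAW_TB = 100.0  # hypothetical raw data volume, not from the announcement


def compressed_size_tb(raw_tb: float, ratio: float) -> float:
    """Size after compression, ignoring replication and index overhead."""
    return raw_tb / ratio


for name, ratio in RATIOS.items():
    size = compressed_size_tb(RAW_TB, ratio)
    print(f"{name:28s} {size:6.1f} TB")
```

Even under these simplifications, the gap between 8x and 40x is the difference between storing 12.5 TB and 2.5 TB for the same raw input, which is why the CPU and memory cost of reaching that ratio matters so much.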

Before leaving you with the infographic, here is a nice quote from RainStor CEO, John Bantleman:

We see Hadoop as a platform like Linux, which needs solutions on top to deliver value.

Hadoop Data Compression

Original title and link: RainStor Big Data Analytics on Hadoop Promises Impressive Data Compression Rates (NoSQL database©myNoSQL)

17:00 Using MongoDB Replica Sets With Node.js on Microsoft Azure: NoSQL Tutorials (1993 Bytes) » myNoSQL
Using MongoDB Replica Sets With Node.js on Microsoft Azure: NoSQL Tutorials:

Mariano Vazquez explains how to configure MongoDB replica sets on Microsoft Azure and how that works:

  • MongoDB will run the native binaries on a worker role and will store the data in Windows Azure storage using Windows Azure Drive (basically a hard disk mounted on Azure Page blobs)
  • The good thing about using Azure Storage is that the data is georeplicated. It will also make backup easier because of the snapshot feature of blob storage (which is not a copy but a diff).
  • It will use the local hard disk in the VM (local resources in the Azure jargon) to store the log files and a local cache.
  • You can scale out to multiple Mongo Replica Sets by increasing the instance count of the MongoDB role

Original title and link: Using MongoDB Replica Sets With Node.js on Microsoft Azure: NoSQL Tutorials (NoSQL database©myNoSQL)

16:53 Pros and Cons of Using MapReduce With Distributed Key-Value Stores: HBase, Cassandra, Riak (2038 Bytes) » myNoSQL

Old Quora question with very good answers.

  • (pro) can (potentially) query live data
  • (pro) can (conceptually) be highly efficient at joining data sets that are identically sharded on the join key (the joins can be pushed down into the key-value store itself)
  • (con) full scans (the most common pattern for map-reduce) are most likely to be much faster with raw file system access
  • (con) because of the better decoupling of computation and storage in the GFS+Map-Reduce model, tolerating hot spots (resulting from MR jobs) is much easier
  • (con) key-value stores are rarely arranged to have schemas optimized for analytics

Naoki Yanai
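
The second "pro" above (joins on identically sharded data sets) can be illustrated with a toy sketch. Everything here is hypothetical: `shard_of` stands in for whatever partitioner a real store uses, and the in-memory dictionaries stand in for shards. The point is only that when both data sets are sharded by the same function on the join key, each shard can be joined locally without moving records between shards.

```python
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical shard count


def shard_of(key: str) -> int:
    """Deterministic toy partitioner (a stand-in for the store's own)."""
    return sum(key.encode("utf-8")) % NUM_SHARDS


def shard(records):
    """Group (key, value) pairs by shard, as a sharded store would place them."""
    shards = defaultdict(dict)
    for key, value in records:
        shards[shard_of(key)][key] = value
    return shards


def co_partitioned_join(left, right):
    """Join two data sets shard by shard; no record crosses a shard boundary."""
    left_shards, right_shards = shard(left), shard(right)
    joined = {}
    for shard_id, left_part in left_shards.items():
        right_part = right_shards.get(shard_id, {})
        for key, left_value in left_part.items():
            if key in right_part:
                joined[key] = (left_value, right_part[key])
    return joined


users = [("u1", "alice"), ("u2", "bob"), ("u3", "carol")]
orders = [("u1", 3), ("u3", 7), ("u4", 1)]
print(co_partitioned_join(users, orders))
```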

Original title and link: Pros and Cons of Using MapReduce With Distributed Key-Value Stores: HBase, Cassandra, Riak (NoSQL database©myNoSQL)

16:51 Quiz Night (1 Bytes) » Oracle Scratchpad
A
16:15 Basho: Congratulations, Amazon! (1426 Bytes) » myNoSQL
Basho: Congratulations, Amazon!:

A dynamo-as-a-service offered by Amazon on their ecosystem will appeal to some. For others, the benefits of a Dynamo-inspired product that can be deployed on other public clouds, behind-the-firewall, or not on the cloud at all, will be critical.

Objective. Clear. To the point.

Original title and link: Basho: Congratulations, Amazon! (NoSQL database©myNoSQL)

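09:18 Database Recovery After SCSI Read/Write Errors Left the File System Read-Only (6186 Bytes) » Oracle Life

Author: eygle, published on eygle.com

With the holiday just around the corner, a customer's database ran into trouble.

Two instances terminated abnormally and the file system became read-only: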
PCLERPDB2:[10g]:/DBMS/PCMK/admin/PCMK> sqlplus "/ as sysdba"

SQL*Plus: Release 10.2.0.3.0 - Production on Thu Jan 19 09:08:05 2012

Copyright (c) 1982, 2006, Oracle.  All Rights Reserved.

ERROR:
ORA-09925: Unable to create audit trail file
Linux Error: 30: Read-only file system
Additional information: 9925
ORA-01075: you are currently logged on
Checking the system log showed that SCSI I/O errors had occurred that morning:
Jan 19 07:56:00 PCLERPDB2 kernel: SCSI error : <0 0 0 1> return code = 0x10000
Jan 19 07:56:00 PCLERPDB2 kernel: end_request: I/O error, dev sda, sector 26480696
Jan 19 07:56:00 PCLERPDB2 kernel: Buffer I/O error on device sda1, logical block 3310083
Jan 19 07:56:00 PCLERPDB2 kernel: lost page write due to I/O error on sda1
Jan 19 07:56:00 PCLERPDB2 kernel: SCSI error : <0 0 0 9> return code = 0x10000
Jan 19 07:56:00 PCLERPDB2 kernel: end_request: I/O error, dev sdh, sector 60052680
Jan 19 07:56:00 PCLERPDB2 kernel: SCSI error : <0 0 0 4> return code = 0x10000
Jan 19 07:56:00 PCLERPDB2 kernel: end_request: I/O error, dev sdc, sector 20042688
Jan 19 07:56:00 PCLERPDB2 kernel: SCSI error : <0 0 0 9> return code = 0x10000
Jan 19 07:56:00 PCLERPDB2 kernel: end_request: I/O error, dev sdh, sector 26747408
Jan 19 07:56:00 PCLERPDB2 kernel: Buffer I/O error on device sdh2, logical block 843074
Jan 19 07:56:00 PCLERPDB2 kernel: lost page write due to I/O error on sdh2
Jan 19 07:56:00 PCLERPDB2 kernel: SCSI error : <0 0 0 1> return code = 0x10000
Jan 19 07:56:00 PCLERPDB2 kernel: end_request: I/O error, dev sda, sector 32606944
Jan 19 07:56:00 PCLERPDB2 kernel: Buffer I/O error on device sda1, logical block 4075864
The database then crashed.
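
When a log like this runs to many lines, it can help to tally the I/O errors per device to see how widely the fault is spread. The following is a minimal sketch, not part of the original diagnosis: it assumes the `end_request: I/O error, dev <name>, sector <n>` line format shown above and uses a short excerpt of that log as sample input.

```python
import re
from collections import Counter

# Excerpt of the kernel messages quoted above, used as sample input.
LOG = """\
Jan 19 07:56:00 PCLERPDB2 kernel: SCSI error : <0 0 0 1> return code = 0x10000
Jan 19 07:56:00 PCLERPDB2 kernel: end_request: I/O error, dev sda, sector 26480696
Jan 19 07:56:00 PCLERPDB2 kernel: Buffer I/O error on device sda1, logical block 3310083
Jan 19 07:56:00 PCLERPDB2 kernel: end_request: I/O error, dev sdh, sector 60052680
Jan 19 07:56:00 PCLERPDB2 kernel: end_request: I/O error, dev sdc, sector 20042688
Jan 19 07:56:00 PCLERPDB2 kernel: end_request: I/O error, dev sdh, sector 26747408
"""

# Matches only the block-layer "end_request" lines, capturing device and sector.
IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def io_errors_by_device(log_text: str) -> Counter:
    """Count end_request I/O errors per block device."""
    counts = Counter()
    for line in log_text.splitlines():
        match = IO_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


print(io_errors_by_device(LOG))
```

Errors spread across several devices at the same timestamp, as in this log, suggest a shared component (controller, cabling, enclosure) rather than a single failing disk, though that is an inference and not something the log proves.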

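We had the customer reboot the database host to check whether this was a transient hardware fault.
Fortunately, the database started normally after the reboot: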
Thu Jan 19 09:55:09 2012
Completed redo application
Thu Jan 19 09:55:09 2012
Completed crash recovery at
 Thread 1: logseq 18735, block 5214, scn 5965501404211
 59 data blocks read, 59 data blocks written, 609 redo blocks read
Thu Jan 19 09:55:09 2012
LGWR: STARTING ARCH PROCESSES
ARC0 started with pid=23, OS id=14599
Thu Jan 19 09:55:09 2012
ARC0: Archival started
ARC1: Archival started
LGWR: STARTING ARCH PROCESSES COMPLETE
ARC1 started with pid=24, OS id=14601
Thu Jan 19 09:55:09 2012
Thread 1 advanced to log sequence 18736
Thread 1 opened at log sequence 18736
  Current log# 3 seq# 18736 mem# 0: /DBMS/DCERP/dcerpdata/log03a.dbf
  Current log# 3 seq# 18736 mem# 1: /DBMS/DCERP/dcerpdata/log03b.dbf
Successful open of redo thread 1
Thu Jan 19 09:55:09 2012
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Thu Jan 19 09:55:09 2012
ARC0: Becoming the 'no FAL' ARCH
ARC0: Becoming the 'no SRL' ARCH
Thu Jan 19 09:55:09 2012
ARC1: Becoming the heartbeat ARCH
Thu Jan 19 09:55:09 2012
SMON: enabling cache recovery
Thu Jan 19 09:55:11 2012
Successfully onlined Undo Tablespace 368.
Thu Jan 19 09:55:11 2012
SMON: enabling tx recovery
Thu Jan 19 09:55:11 2012
Database Characterset is UTF8
Thu Jan 19 09:55:11 2012
Incremental checkpoint up to RBA [0x4930.3.0], current log tail at RBA [0x4930.43.0]
Thu Jan 19 09:55:11 2012
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
QMNC started with pid=25, OS id=14626
Thu Jan 19 09:55:25 2012
Completed: ALTER DATABASE OPEN
Thu Jan 19 10:15:13 2012
Incremental checkpoint up to RBA [0x4930.100d.0], current log tail at RBA [0x4930.107d.0]
Thu Jan 19 10:35:16 2012
Incremental checkpoint up to RBA [0x4930.150f.0], current log tail at RBA [0x4930.155b.0]
Thu Jan 19 10:55:17 2012
Incremental checkpoint up to RBA [0x4930.1724.0], current log tail at RBA [0x4930.175a.0]
Thu Jan 19 11:15:18 2012
Incremental checkpoint up to RBA [0x4930.1edf.0], current log tail at RBA [0x4930.1f13.0]

The hardware has probably reached the end of its service life and needs to be replaced.




06:46 Amazon DynamoDB: NoSQL in the Cloud (2114 Bytes) » myNoSQL
Amazon DynamoDB: NoSQL in the Cloud:

James Hamilton:

In a past blog entry, One Size Does Not Fit All, I offered a taxonomy of 4 different types of structured storage system, argued that Relational Database Management Systems are not sufficient, and walked through some of the reasons why NoSQL databases have emerged and continue to grow market share quickly. The four database categories I introduced were: 1) features-first, 2) scale-first, 3) simple structure storage, and 4) purpose-optimized stores. RDBMS own the first category.

DynamoDB targets workloads fitting into the Scale-First and Simple Structured storage categories where NoSQL database systems have been so popular over the last few years

A great post focusing on the challenges faced to implement the features that make DynamoDB, the Amazon cloud-based NoSQL database, unique.

Original title and link: Amazon DynamoDB: NoSQL in the Cloud (NoSQL database©myNoSQL)

05:16 Amazon’s DynamoDB Shows Hardware as Means to an End... Actually It's All About Predictability (3502 Bytes) » myNoSQL
Amazon’s DynamoDB Shows Hardware as Means to an End... Actually It's All About Predictability:

Derrick Harris:

In that sense, DynamoDB is something of a curveball. It lets AWS users leverage the performance of SSDs, only as the underpinning of a new service rather than as a new IaaS feature alone.

[…]

Web developers use NoSQL databases more frequently than enterprise developers, and NoSQL requires solid-state performance.

I think Derrick got this mostly wrong this time. Developers do not care about SSDs per se. What good developers care about is performance. And great developers care about predictability of performance.

There are a couple of NoSQL databases that know this very well. To give you just a couple of examples, take a look at this benchmark of Riak and see what it is focusing on. Or check Riak’s Bitcask backend (here’s also a great explanation of the Bitcask paper), which guarantees a single disk seek per read. I assume you guessed the keyword behind both of these: predictability.

Amazon DynamoDB is using SSDs because:

  • it wants to offer predictable low latency
  • it wants to offer predictable throughput
  • it wants to offer single-digit millisecond average service-side responses
  • and it wants to do all these at any scale of dataset sizes and request rates

Hardware is a means to an end. And SSD or not, the points above are all that matter[1].


  1. There are other dimensions of systems that are as critical as the ones covered (e.g. availability, fault-tolerance, etc.), but these are less related to the SSD vs spinning-disks discussion.  
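
To make the predictability argument concrete, here is a small sketch comparing two hypothetical services with the same average latency. The sample values are invented for illustration and are not DynamoDB or Riak measurements; the percentile uses the simple nearest-rank method.

```python
import statistics


def percentile(samples, pct: int):
    """Nearest-rank percentile (pct is an integer between 1 and 100)."""
    ordered = sorted(samples)
    rank = max(1, -(-pct * len(ordered) // 100))  # ceiling division
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds, 100 requests each.
steady = [5] * 100              # every request takes 5 ms
spiky = [3] * 98 + [103] * 2    # mostly fast, with two slow outliers

for name, samples in (("steady", steady), ("spiky", spiky)):
    print(name, "mean:", statistics.mean(samples), "p99:", percentile(samples, 99))
```

Both services report the same mean, but only the tail percentile reveals that one of them occasionally takes 20 times longer, which is the kind of difference a predictability-focused design tries to eliminate.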

Original title and link: Amazon’s DynamoDB Shows Hardware as Means to an End… Actually It’s All About Predictability (NoSQL database©myNoSQL)

03:40 Introducing Amazon DynamoDB Video (1248 Bytes) » myNoSQL

The live broadcast of today’s Amazon DynamoDB announcement went down, so here’s the complete video featuring Werner Vogels (CTO Amazon), Swami Sivasubramanian (GM of DynamoDB), and Don MacAskill (CEO SmugMug).

Original title and link: Introducing Amazon DynamoDB Video (NoSQL database©myNoSQL)

00:34 Cassandra and Amazon DynamoDB Comparison (1777 Bytes) » myNoSQL
Cassandra and Amazon DynamoDB Comparison:

Perhaps a few overly strong words, but definitely a great comparison of Cassandra and Amazon DynamoDB by Jonathan Ellis (Cassandra chair and founder of DataStax):

As an engineer, it’s nice to see so many of Cassandra’s design decisions imitated by Amazon’s next-gen NoSQL product. I feel like a proud uncle! But in many important ways, Cassandra retains a firm lead in power and flexibility.

Cassandra vs Amazon DynamoDB

Update: this is the updated version of the comparison.

Original title and link: Cassandra and Amazon DynamoDB Comparison (NoSQL database©myNoSQL)

00:03 Notes About Amazon DynamoDB (7935 Bytes) » myNoSQL

It’s been only a couple of hours since the news about Amazon DynamoDB got out and since then I went through as much documentation as I could. Here are my notes so far. If you found interesting bits please leave a comment and I’ll add them to the list (with attribution):

  • it is not the first managed/hosted NoSQL
  • it is the first managed NoSQL database that auto-shards
  • it is the first managed auto-sharding NoSQL database that automatically reshards based on SLA (request capacity can be specified by the user)
  • DynamoDB says that average service-side latencies are typically single-digit milliseconds
  • DynamoDB stores data on Solid State Drives (SSDs)
  • DynamoDB replicates data synchronously across multiple AWS Availability Zones in an AWS Region to provide built-in high availability and data durability
  • DynamoDB caps the throughput at both table level and account level
    • Jeff Barr says that this limit can be changed and DynamoDB can definitely deliver more
    • Werner Vogels clarified that, as with all AWS services, default limits (tables, throughput, etc.) can be lifted by filling out a request form.
  • DynamoDB departs (a bit) from the original Dynamo model by allowing a type of non-opaque keys (which supports querying). There’s also a scan operation that allows filtering of results based on attributes’ values
  • DynamoDB limits the size of an item (record) to 64KB. An item's size is the sum of the lengths of its attribute names and values (binary and UTF-8 lengths).
  • DynamoDB supports two types of primary keys:
    • Hash Type Primary Key — In this case the primary key is made of one attribute, a hash attribute. Amazon DynamoDB builds an unordered hash index on this primary key attribute.
    • Hash and Range Type Primary Key — In this case, the primary key is made of two attributes. The first attribute is the hash attribute and the second one is the range attribute. Amazon DynamoDB builds an unordered hash index on the hash primary key attribute and a sorted range index on the range primary key attribute.
  • There are two types of data types:

    • scalar: number and string
    • multi-value: string set and number set

    Note that the multi-value data types are sets (elements are unique) and not lists

  • The behavior of a write is confusing:

    When Amazon DynamoDB returns an operation successful response to your write request, Amazon DynamoDB ensures the write is durable on multiple servers. However, it takes time for the update to propagate to all copies. That is, the data is eventually consistent, meaning that your read request immediately after a write might not show the change.

  • DynamoDB supports both eventually consistent and consistent reads

    • the price of a consistent read is double the price of an eventually consistent read
  • Conditional writes are supported: a write is performed if and only if a pre-condition is met
  • DynamoDB supports atomic counters
  • pricing is based on actual write/read operations and not API calls (e.g. a query returning 100 results accounts for 100 ops and not 1 op)
  • when defining tables (or updating), you also specify the capacity to be reserved in terms of reads and writes
    • Units of Capacity required = Number of item ops per second x item size (rounded up to the nearest KB)
    • DynamoDB divides a table’s items into multiple partitions, and distributes the data primarily based on the hash key element. The provisioned throughput associated with a table is also divided evenly among the partitions, with no sharing of provisioned throughput across partitions.
      • Total provisioned throughput/partitions = throughput per partition.
  • supported operations:
    • table level: create, describe, list, update
    • data level: put (create or update), get, batch get, update, delete, query, scan
      • A query operation searches only primary key attribute values and supports a subset of comparison operators on key attribute values to refine the search process
      • The BatchGetItem operation returns the attributes for multiple items from multiple tables using their primary keys. The maximum number of item attributes that can be retrieved for a single operation is 100. Also, the number of items retrieved is constrained by a 1 MB size limit
      • BatchGetItem is eventually consistent only
      • A Scan operation scans the entire table. You can specify filters to apply to the results to refine the values returned to you, after the complete scan. Amazon DynamoDB puts a 1MB limit on the scan (the limit applies before the results are filtered).
  • JSON is used for sending data and for responses, but it is not used as the native storage schema
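
The sizing rules in the notes above can be put into a short worked sketch. This is an illustration under stated assumptions, not the DynamoDB API: the function names are hypothetical, non-binary values are measured by the UTF-8 length of their string form (how DynamoDB actually sizes numbers is not covered in these notes), and the formulas are taken directly from the bullets on item size, capacity units, and partitions.

```python
import math

MAX_ITEM_BYTES = 64 * 1024  # 64KB item size limit from the notes


def _value_length(value) -> int:
    """Binary length for bytes, UTF-8 length of the string form otherwise (assumption)."""
    if isinstance(value, bytes):
        return len(value)
    return len(str(value).encode("utf-8"))


def item_size_bytes(item: dict) -> int:
    """Sum of the lengths of attribute names and values; sets count each element."""
    total = 0
    for name, value in item.items():
        total += len(name.encode("utf-8"))
        values = value if isinstance(value, (set, frozenset)) else [value]
        total += sum(_value_length(v) for v in values)
    return total


def capacity_units(ops_per_second: int, item_bytes: int) -> int:
    """Units = item ops per second x item size rounded up to the nearest KB."""
    size_kb = max(1, math.ceil(item_bytes / 1024))
    return ops_per_second * size_kb


def throughput_per_partition(total_units: int, partitions: int) -> float:
    """Provisioned throughput is divided evenly, with no sharing across partitions."""
    return total_units / partitions


item = {"user_id": "u-42", "tags": {"a", "b"}, "score": 7}  # hypothetical item
size = item_size_bytes(item)
print("item size:", size, "bytes; within limit:", size <= MAX_ITEM_BYTES)
print("units for 500 ops/s of 1.5 KB items:", capacity_units(500, 1536))
print("per partition (4 partitions):", throughput_per_partition(1000, 4))
```

Because throughput is not shared across partitions, a hot hash key can exhaust its own partition's share even when the table as a whole is well under its provisioned capacity.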

Update:

  • for backup/restore, one could use the EMR integration to back up a table into S3 and restore from that to a new table
  • there’s no mention of an SLA. Also, bearing in mind the Amazon RDS scheduled maintenance windows, it would be good to clarify whether DynamoDB will require anything similar (I doubt that, but it should be clarified). Update: Werner Vogels confirms in the comments that indeed there are no maintenance windows (always-on)
  • Some interesting data shared by a DynamoDB beta tester
    • loaded multiple terabytes
    • 250k writes/s
    • this throughput was maintained continuously for more than 3 days
    • average read latency close to 2ms and 99th percentile 6-8ms
    • no impact on other customers
  • CloudWatch alarms can be used to notify that a specific threshold for throughput has been reached for a table and when it is time to add additional read or write capacity units

Any other interesting bits to be emphasized?

Original title and link: Notes About Amazon DynamoDB (NoSQL database©myNoSQL)

2012-01-18 Wed