2012-02-06 Mon
Over the weekend I read two papers presenting products or research that improve or add new capabilities to the MapReduce data processing approach. The first comes from a team at Microsoft and describes TiMR, a time-oriented data processing system in MapReduce. The second, from a team at Google, presents Tenzing, a SQL implementation on the MapReduce framework. It’s great to learn that while the Hadoop community is eliminating some of the initial limitations and hardening the technical details of the platform, there are already ideas and systems out there that augment the capabilities of the MapReduce data processing model.
Original title and link: Research in the MapReduce Space (©myNoSQL)
This recent paper from a team at Google presents details about Tenzing, a system that is currently in use at Google:
Tenzing is a query engine built on top of MapReduce for ad hoc analysis of Google data. Tenzing supports a mostly complete SQL implementation (with several extensions) combined with several key characteristics such as heterogeneity, high performance, scalability, reliability, metadata awareness, low latency, support for columnar storage and structured data, and easy extensibility.
A couple of things I’ve highlighted when reading it:
- Tenzing is in production, but doesn’t yet serve a huge number of queries
- the backend storage can be a mix of various data stores, such as ColumnIO, Bigtable, GFS files, MySQL databases
- when compared with other similar solutions (Sawzall, FlumeJava, Pig, Hive, HadoopDB), Tenzing’s advantage is low latency
- the paper acknowledges AsterData, GreenPlum, Paraccel, Vertica for using a MapReduce execution model in their engines
- to perform query optimizations, Tenzing enhances queries with information from a metadata server
- there is no information about what kind of metadata is needed in Tenzing. I assume it might refer to details about the data sources and data source metadata (indexes, access patterns, etc)
- to reduce query latency, processes are kept running
- Tenzing supports almost all of the SQL92 standard and some extensions from SQL99:
  - projection and filtering (for some of these and depending on the data source Tenzing can do some optimizations)
  - set operations (implemented in the reduce phase)
  - nested queries and subqueries
  - aggregation and statistical functions
  - analytic functions (syntax similar to PostgreSQL/Oracle)
  - OLAP extensions
  - JOINs:
Tenzing supports efficient joins across data sources, such as ColumnIO to Bigtable; inner, left, right, cross, and full outer joins; and equi, semi-equi, non-equi and function based joins. Cross joins are only supported for tables small enough to fit in memory, and right outer joins are supported only with sort/merge joins. Non-equi correlated subqueries are currently not supported. We include distributed implementations for nested loop, sort/merge and hash joins.
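The note above that set operations are implemented in the reduce phase can be illustrated with a toy, in-memory sketch. This is not Tenzing's code: the function names (`map_phase`, `shuffle`, `reduce_intersect`, `intersect`) are hypothetical, and the "cluster" is a single Python process. It only shows the general idea: the map step uses the whole row as the key, so identical rows from both inputs meet in the same reduce group, where an INTERSECT can be decided.

```python
from collections import defaultdict


def map_phase(table_name, rows):
    """Emit (row, table_name) pairs so that identical rows share a key."""
    for row in rows:
        yield tuple(row), table_name


def shuffle(pairs):
    """Group emitted values by key, standing in for the framework's shuffle."""
    groups = defaultdict(set)
    for key, value in pairs:
        groups[key].add(value)
    return groups


def reduce_intersect(groups, table_names):
    """Keep a row only if every input table emitted it (INTERSECT semantics)."""
    wanted = set(table_names)
    return sorted(key for key, seen in groups.items() if seen == wanted)


def intersect(left, right):
    """Toy single-process driver: map both inputs, shuffle, then reduce."""
    pairs = list(map_phase("left", left)) + list(map_phase("right", right))
    return reduce_intersect(shuffle(pairs), ["left", "right"])
```

Calling `intersect(rows_a, rows_b)` returns the distinct rows present in both inputs. A UNION or EXCEPT would, under the same assumptions, change only the condition tested in the reduce step.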
Read and download the “Tenzing A SQL Implementation on the MapReduce framework” after the break.
Original title and link: Paper: Tenzing A SQL Implementation on the MapReduce Framework (©myNoSQL)
From the “Temporal Analytics on Big Data for Web Advertising” paper:
TiMR is a framework that transparently combines a map-reduce (M-R) system with a temporal DSMS [1]. Users express time-oriented analytics using a temporal (DSMS) query language such as StreamSQL or LINQ. Streaming queries are declarative and easy to write/debug, real-time-ready, and often several orders of magnitude smaller than equivalent custom code for time-oriented applications. TiMR allows the temporal queries to transparently scale on offline temporal data in a cluster by leveraging existing M-R infrastructure.
Broadly speaking, TiMR’s architecture of compiling higher level queries into M-R stages is similar to that of Pig/SCOPE. However, TiMR specializes in time-oriented queries and data, with several new features such as: (1) the use of an unmodified DSMS as part of compilation, parallelization, and execution; and (2) the exploitation of new temporal parallelization opportunities unique to our setting. In addition, we leverage the temporal algebra underlying the DSMS in order to guarantee repeatability across runs in TiMR within M-R (when handling failures), as well as over live data.
According to the paper, DSMSs work well for real-time data, but are not massively scalable. Map-Reduce, on the other hand, is extremely scalable, but its computation is performed on offline data. TiMR proposes a solution that gets closer to real-time map-reduce.
Read or download the paper after the break.
[1] Data Stream Management System
Original title and link: Paper: TiMR is a Time-oriented data processing system in MapReduce (©myNoSQL)
Ron Bodkin, interviewed by Michael Floyd for InfoQ, describes the growing Hadoop addiction:
People are using Hadoop for a variety of analytics. Many of the first uses of Hadoop are complementing traditional data warehouses I just mentioned, where the goal is to take some of the pressure off the data warehouse, start to be able to process less structured data more effectively and to be able to do transformations and build summaries and aggregates, but not have to have all that data loaded to the data warehouse. But then the next thing that happens is once people have started doing that level of processing they realize there is a power of being able to ask questions they never thought of before the data, they can store all the data in small samples and they can go back and have a powerful query engine, a cluster of commodity machines that lets them dig into that raw data and analyze it in new ways ultimately leading to data science being able to do machine learning and being able to discover patterns in data and keep them improving and refining the data.
The interview is only 16 minutes long and you have the full transcript.
Original title and link: Hadoop and NoSQL in a Big Data Environment with Ron Bodkin (©myNoSQL)
2012-02-05 Sun
To alternate a bit after yesterday’s educational CQL: SQL for Cassandra in the Cassandra NYC 2011 video series from DataStax, today’s video is Drew Robb covering Cassandra usage at SocialFlow for capturing real-time data from Twitter and Bit.ly.
To watch more videos from this event, follow the Cassandra NYC 2011 tag.
Original title and link: Cassandra at SocialFlow with Drew Robb - Powered by NoSQL (©myNoSQL)
Jonathan Corbet summarizes a presentation by Dave Chinner about the present and future of XFS:
XFS is often seen as the filesystem for people with massive amounts of data. It serves that role well, Dave said, and it has traditionally performed well for a lot of workloads. Where things have tended to fall down is in the writing of metadata; support for workloads that generate a lot of metadata writes has been a longstanding weak point for the filesystem. In short, metadata writes were slow, and did not really scale past even a single CPU.
After the break, the video of Dave Chinner’s presentation, “XFS: Recent and Future Adventures in Filesystem Scalability”.
Even if it’s very long, make sure you check the comment thread.
Original title and link: XFS: the filesystem of the future? (©myNoSQL)
2012-02-04 Sat
The fine folks from DataStax have made available the presentations from their Cassandra NYC 2011 event.
The first video to post here is Eric Evans’s presentation on Cassandra Query Language.
To watch more videos from this event, follow the Cassandra NYC 2011 tag.
Original title and link: CQL: SQL for Cassandra with Eric Evans - NoSQL videos (©myNoSQL)
2012-02-03 Fri
One of the comments on that post was by David Litchfield, who wrote:
Hey Tom, Funnily enough I just published a paper about doing the same thing with NUMBER concatenations. This was an addendum to a paper I wrote in 2008 on exploiting DATE concatenations - the same problem you discuss here. You can get the recent paper here: http://www.accuvant.com/capability/accuvant-labs/security-research/lateral-sql-injection-revisited-exploiting-numbers and the first paper here: http://www.databasesecurity.com/dbsec/lateral-sql-injection.pdf
But - you can. Just not as flexibly. But the end result can be as disastrous.
One of the follow on comments to this posting by David was:
the problem David mentions in http://www.accuvant.com/capability/accuvant-labs/security-research/lateral-sql-injection-revisited-exploiting-numbers only arises since NUM_PROC is owned by SYS, as far as I can see, correct?
So, it's not really a problem since nobody ever does something as SYS, correct.
In his example, David used SYS to demonstrate with - which could lead people to believe "ah, it needs SYS to exploit this flaw". But - it doesn't. All it requires is an account with these privileges:
- Create session
- Create procedure
- Create public synonym <<<=== these guys are evil! Should be avoided
Updated a little later: Let me also say this:
If you use static sql in plsql - your code in plsql cannot be sql injected, period. It is not possible. The only way to get sql injected in plsql is to use dynamic sql - that is the only time. So, if you want maximum protection from SQL Injection - if you just want to avoid it, you will:
a) write your SQL code in PL/SQL
b) call this PL/SQL from your java/c/c#/whatever code USING BINDS to pass all inputs and outputs to/from the database
If you do that - no SQL Injection attacks are possible.
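To make the point about binds concrete, here is a minimal sketch of the difference between concatenating an input into the statement text and passing it as a bind. It is an illustration under stated assumptions, not the setup described above: it uses Python's built-in `sqlite3` module as a stand-in for an Oracle database, there is no PL/SQL layer, and the `accounts` table and the function names are hypothetical.

```python
import sqlite3


def make_db():
    """Create a throwaway in-memory database with a hypothetical table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)",
        [("alice", 100), ("bob", 50)],
    )
    return conn


def lookup_concatenated(conn, name):
    """UNSAFE: the input becomes part of the SQL text and can change it."""
    sql = (
        "SELECT name, balance FROM accounts WHERE name = '"
        + name
        + "' ORDER BY name"
    )
    return conn.execute(sql).fetchall()


def lookup_bound(conn, name):
    """Safe: the input is passed as a bind and is only ever treated as data."""
    sql = "SELECT name, balance FROM accounts WHERE name = ? ORDER BY name"
    return conn.execute(sql, (name,)).fetchall()
```

With an input such as `x' OR '1'='1`, the concatenated version returns every row in the table, while the bound version looks for an account literally named `x' OR '1'='1` and finds none.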