خواندن ۱۵ دقیقه·۳ سال پیش

16 تکنیک SQL که هر مبتدی باید بداند

در مقیاس 1 تا 10 مهارت های ذخیره سازی داده شما چقدر خوب است؟

می خواهید بالاتر از 7/10 بروید؟ پس این مقاله برای شماست.

SQL شما چقدر خوب است؟ آیا می خواهید در اسرع وقت برای یک مصاحبه شغلی آماده شوید؟

این پست وبلاگ پیچیده ترین تکنیک های SQL انبار داده را با جزئیات توضیح می دهد. من از گویش استاندارد BigQuery SQL برای نوشتن چند فکر در مورد این موضوع استفاده خواهم کرد.

1. جداول افزایشی و ادغام

به روز رسانی جدول مهم است. واقعاً مهم است. وضعیت ایده آل زمانی است که تراکنش هایی داشته باشید که کلید اصلی، اعداد صحیح منحصر به فرد و افزایش خودکار هستند. به روز رسانی جدول در این مورد ساده است:

insert target_table (transaction_id)select transaction_id from source_table where transaction_id > (select max(transaction_id) from target_table);

این همیشه هنگام کار با مجموعه داده‌های طرحواره ستاره‌ای غیرعادی‌شده در انبارهای داده مدرن صدق نمی‌کند. ممکن است شما وظیفه داشته باشید که جلساتی را با SQL ایجاد کنید و/یا مجموعه داده‌ها را به‌صورت تدریجی تنها با بخشی از داده‌ها به‌روزرسانی کنید. Transaction_id ممکن است وجود نداشته باشد، اما در عوض شما باید با مدل داده سروکار داشته باشید که در آن کلید یکتا به آخرین Transaction_id (یا مهر زمانی) شناخته شده بستگی دارد. برای مثال، user_id در مجموعه داده last_online به آخرین مهر زمان اتصال شناخته شده بستگی دارد. در این صورت شما می خواهید کاربران موجود را به روز کنید و کاربران جدید را وارد کنید.

ادغام و به روز رسانی افزایشی

می توانید از MERGE استفاده کنید یا می توانید عملیات را به دو عمل تقسیم کنید. یکی برای به‌روزرسانی رکوردهای موجود با رکوردهای جدید و دیگری برای درج رکوردهای کاملاً جدید که خارج نمی‌شوند (وضعیت LEFT JOIN).

MERGE عبارتی است که عموماً در پایگاه داده های رابطه ای استفاده می شود. Google BigQuery MERGE Command یکی از دستورات زبان دستکاری داده ها (DML) است. اغلب برای انجام سه عملکرد اصلی به صورت اتمی در یک عبارت واحد استفاده می شود. این توابع عبارتند از UPDATE، INSERT و DELETE.

زمانی که دو یا چند داده با هم مطابقت داشته باشند می توان از عبارت UPDATE یا DELETE استفاده کرد.
بند INSERT زمانی قابل استفاده است که دو یا چند داده متفاوت باشند و مطابقت نداشته باشند.
بند UPDATE یا DELETE همچنین می تواند زمانی استفاده شود که داده های داده شده با منبع مطابقت نداشته باشند.

این بدان معناست که فرمان ادغام Google BigQuery به شما امکان می دهد داده های Google BigQuery را با به روز رسانی، درج و حذف داده ها از جداول Google BigQuery خود ادغام کنید.

این SQL را در نظر بگیرید:

create temp table last_online as (
    select 1 as user_id
    , timestamp('2000-10-01 00:00:01') as last_online
)
;
create temp table connection_data  (
  user_id int64
  ,timestamp timestamp
)
PARTITION BY DATE(_PARTITIONTIME)
;
insert connection_data (user_id, timestamp)
    select 2 as user_id
    , timestamp_sub(current_timestamp(),interval 28 hour) as timestamp
union all
    select 1 as user_id
        , timestamp_sub(current_timestamp(),interval 28 hour) as timestamp
union all
    select 1 as user_id
        , timestamp_sub(current_timestamp(),interval 20 hour) as timestamp
union all
    select 1 as user_id
    , timestamp_sub(current_timestamp(),interval 1 hour) as timestamp
;

merge last_online t
using (
  select
      user_id
    , last_online
  from
    (
        select
            user_id
        ,   max(timestamp) as last_online

        from 
            connection_data
        where
            date(_partitiontime) >= date_sub(current_date(), interval 1 day)
        group by
            user_id

    ) y

) s
on t.user_id = s.user_id
when matched then
  update set last_online = s.last_online, user_id = s.user_id
when not matched then
  insert (last_online, user_id) values (last_online, user_id)
;
select * from last_online
;

2. شمارش کلمات

انجام UNNEST() و بررسی اینکه آیا کلمه مورد نیاز شما در لیستی است که نیاز دارید ممکن است در بسیاری از موقعیت ها مفید باشد، یعنی تجزیه و تحلیل احساسات انبار داده:

with titles as (
    select 'Title with word foo' as title union all
    select 'Title with word bar'
)
, data as (
select 
    title, 
    split(title, ' ') as words 
from 
    titles
)
select * from data, unnest(words) words
where
    words in ('bar')
;

3. استفاده از دستور IF() خارج از دستور SELECT

این به ما فرصتی می دهد تا برخی از خطوط کد را ذخیره کنیم و از نظر کد شیواتر باشیم. معمولاً می‌خواهید این را در یک جستار فرعی قرار دهید و یک فیلتر در عبارت Where اضافه کنید، اما در عوض می‌توانید این کار را انجام دهید:

with daily_revenue as (
select
      current_date() as dt
    , 100          as revenue
    union all
select
      date_sub(current_date(), interval 1 day) as dt
    , 100          as revenue
)
select
*
from daily_revenue
where
    if(revenue >101,1,0) = 1
;

مثال دیگری که چگونه از آن با جداول پارتیشن بندی شده استفاده نکنیم. این کار را نکن این مثال بدی است زیرا از آنجایی که پسوندهای جدول تطبیق احتمالاً به صورت پویا تعیین می شوند (بر اساس چیزی در جدول شما) برای اسکن کامل جدول هزینه دریافت خواهید کرد.

SELECT *
FROM `firebase.events`
WHERE IF(condition,
         _TABLE_SUFFIX BETWEEN '20170101' AND '20170117',
         _TABLE_SUFFIX BETWEEN '20160101' AND '20160117')
;

همچنین می توانید از آن در توابع HAVING و AGGREGATE استفاده کنید.

4. استفاده از GROUP BY ROLLUP

تابع ROLLUP برای انجام تجمیع در سطوح چندگانه استفاده می شود. این زمانی مفید است که باید با نمودارهای بعد کار کنید.

کوئری زیر کل اعتبار هزینه شده در روز را بر اساس نوع تراکنش (is_gift) مشخص شده در بند Where برمی گرداند و همچنین کل هزینه شده برای هر روز و کل هزینه شده را در تمام تاریخ های موجود نشان می دهد.

with data as (
    select
    current_timestamp() as ts           
    ,'stage'            as context_type 
    ,1                  as user_id      
    ,100                as credit_value 
    , true              as is_gift
union all
    select
    timestamp_sub(current_timestamp(), interval 24 hour) as ts           
    ,'user'             as context_type 
    ,1                  as user_id      
    ,200                as credit_value 
    ,false              as is_gift
union all
    select
    timestamp_sub(current_timestamp(), interval 24*2 hour) as ts           
    ,'user'             as context_type 
    ,3                  as user_id      
    ,300                as credit_value 
    ,true               as is_gift

)
, results as (
select 
     date(ts) as date 
    ,context_type
    ,sum(credit_value)/100 as daily_credits_spend
from data    

group by rollup(1, context_type)
order by 1
)

select
  date
  ,if(context_type is null, 'total', context_type) as context_type
  ,daily_credits_spend
from results
order by date
;

5. تبدیل جدول به JSON

تصور کنید باید جدول خود را به شی JSON تبدیل کنید که در آن هر رکورد عنصری از آرایه تودرتو است. این جایی است که تابع to_json_string() مفید می شود:

with mytable as (
 select 1 as x, 'foo' as y, true as z union all
 select 2, 'bar', false
)
select 
    concat("{", "\"MyTable\":", "[", string_agg(to_json_string(t), ","), "]", "}")
from mytable as t
;

سپس می توانید از آن در هر جایی استفاده کنید: تاریخ ها، قیف های بازاریابی، شاخص ها، نمودارهای هیستوگرام و غیره.

6. استفاده از PARTITION BY

با توجه به ستون های user_id، date و total_cost. برای هر تاریخ، چگونه ارزش کل درآمد را برای هر مشتری نشان می‌دهید در حالی که همه ردیف‌ها را حفظ می‌کنید؟ شما می توانید به این شکل دست پیدا کنید:

select
     date
    ,user_id
    ,total_cost
    ,sum(total_cost) over (partition by date,user_id) as revenue_per_day
from production.payment_transaction
;

7. میانگین متحرک

اغلب اوقات توسعه دهندگان BI وظیفه دارند میانگین متحرک را به گزارش ها و داشبوردهای فوق العاده خود اضافه کنند. این ممکن است نمودار خطی MA 7، 14، 30 روز در ماه یا حتی سال باشد. حالا چطور آن کار را انجام بدهیم؟

with dates as (
select
    dt
from 
    unnest(generate_date_array(date_sub(current_date(), interval 90 day), current_date(), interval 1 day)) as dt
)

, data as (
    select dt
        , CEIL(RAND()*1000) as revenue -- just some random data.
    from
        dates
)
select
  dt
, revenue
, AVG(revenue) OVER(ORDER BY unix_date(dt) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) as seven_day_moving_average
from data
;

8. آرایه های تاریخ

زمانی که با حفظ کاربر کار می‌کنید یا می‌خواهید برخی از مجموعه داده‌ها را برای مقادیر گمشده، یعنی تاریخ‌ها بررسی کنید، واقعاً مفید می‌شود. BigQuery تابعی به نام GENERATE_DATE_ARRAY دارد:

select
 dt
from 
    unnest(generate_date_array('2019–12–04', '2020–09–17', interval 1 day)) as dt
;

9. Row_number()

این برای دریافت جدیدترین اطلاعات از داده‌های خود، یعنی آخرین رکورد به‌روزرسانی‌شده و غیره یا حتی حذف موارد تکراری مفید است:

with reputation_data as (
select
      1     as user_id
    , 100   as reputation
    , 1     as reputation_level
    , timestamp_sub(current_timestamp(), interval 3 hour) as ts
union all
select
      1     as user_id
    , 101   as reputation
    , 1     as reputation_level
    , timestamp_sub(current_timestamp(), interval 2 hour)
union all
select
      1     as user_id
    , 200   as reputation
    , 2     as reputation_level
    , timestamp_sub(current_timestamp(), interval 1 hour)
)
select *
from reputation_data a
qualify row_number() over (partition by a.user_id order by a.ts desc) = 1
;

10. NTILE()

یک تابع شماره گذاری دیگر. اگر یک برنامه تلفن همراه دارید، برای نظارت بر مواردی مانند مدت زمان ورود در چند ثانیه بسیار مفید است. به عنوان مثال، من برنامه خود را به Firebase متصل کرده ام و وقتی کاربران وارد سیستم می شوند، می توانم ببینم چقدر طول کشیده است.

این تابع، ردیف‌ها را بر اساس ترتیب ردیف‌ها به سطل‌های ثابت_صحیح_expression تقسیم می‌کند و شماره سطل 1 را که به هر سطر اختصاص داده می‌شود، برمی‌گرداند. تعداد ردیف ها در سطل ها می تواند حداکثر تا 1 متفاوت باشد. مقادیر باقیمانده (باقیمانده تعداد ردیف‌ها تقسیم بر سطل) یک عدد برای هر سطل توزیع می‌شود، که با سطل 1 شروع می‌شود. اگر iftext_integer_expression NULL، 0 یا منفی ارزیابی شود، یک خطا ارائه می‌شود.

select (case when tile = 50 then 'median' when tile = 95 then '95%' else '5%' end) as tile
    , dt
    , max(cast( round(duration/1000) as numeric)/1000 ) max_duration_s
    , min(cast( round(duration/1000) as numeric)/1000 ) min_duration_s

from (
    select 
         trace_info.duration_us duration
        , ntile(100) over (partition by (date(event_timestamp)) order by trace_info.duration_us) tile
        , date(event_timestamp) dt

    from firebase_performance.my_mobile_app 
    where 
        date(_partitiontime) >= parse_date('%y%m%d', @ds_start_date) and date(_partitiontime) <= parse_date('%y%m%d', @ds_end_date)
        and 
        date(event_timestamp) >= parse_date('%y%m%d', @ds_start_date)
        and 
        date(event_timestamp) <= parse_date('%y%m%d', @ds_end_date)
    and lower(event_type) = "duration_trace"
    and lower(event_name) = 'logon'
) x
WHERE tile in (5, 50, 95)
group by dt, tile
order by dt
;

11. رتبه / تراکم_رتبه

به آنها توابع شماره گذاری نیز می گویند. من تمایل دارم از DENSE_RANK به عنوان تابع رتبه‌بندی پیش‌فرض استفاده کنم، زیرا از رتبه‌بندی بعدی در دسترس نمی‌گذرد در حالی که RANK این کار را می‌کند. مقادیر رتبه های متوالی را برمی گرداند. می توانید از آن با پارتیشنی استفاده کنید که نتایج را به سطل های مجزا تقسیم می کند. سطرها در هر پارتیشن اگر مقادیر یکسانی داشته باشند رتبه های یکسانی را دریافت می کنند. مثال:

with top_spenders as (
    select 1 as user_id, 100 as total_spend, 11   as reputation_level union all
    select 2 as user_id, 250 as total_spend, 11   as reputation_level union all
    select 3 as user_id, 250 as total_spend, 11   as reputation_level union all
    select 4 as user_id, 300 as total_spend, 11   as reputation_level union all
    select 11 as user_id, 1000 as total_spend, 22   as reputation_level union all
    select 22 as user_id, 1500 as total_spend, 22   as reputation_level union all
    select 33 as user_id, 1500 as total_spend, 22   as reputation_level union all
    select 44 as user_id, 2500 as total_spend, 22   as reputation_level 

)

select 
    user_id
    , rank() over(partition by reputation_level order by total_spend desc) as rank
    , dense_rank() over(partition by reputation_level order by total_spend desc) as dense_rank
from
    top_spenders
;

نمونه ای دیگر با قیمت محصولات:

with products as (
    
    select
        2                    as product_id      
        , 'premium_account'  as product_type    
        , 100                as total_cost   
    union all
    select
        1                    as product_id      
        , 'premium_group'    as product_type
        , 200                as total_cost
    union all
    select
        111                  as product_id      
        , 'bots'             as product_type    
        , 300                as total_cost      
    union all
    select
        112                  as product_id      
        , 'bots'             as product_type    
        , 400                as total_cost      
    union all
    select
        113                  as product_id      
        , 'bots'             as product_type    
        , 500                as total_cost      
    union all
    select
        213                  as product_id      
        , 'bots'             as product_type    
        , 300                as total_cost      
  
)
select * from (
	select
		  product_id
		, product_type
		, total_cost as product_price
		, dense_rank () over ( 
			partition by product_type
			order by total_cost desc
		) price_rank 
	from
		products
) t
where price_rank < 3
;

12. Pivot / unpivot

Pivot سطرها را به ستون تغییر می دهد. این همه کاری است که انجام می دهد. Unpivot برعکس عمل می کند.

select * from
(
  -- #1 from_item
  select 
     extract(month from dt) as mo         
    ,product_type    
    ,revenue   
  from (
    select
        date(current_date()) as dt              
        , 'premium_account'  as product_type    
        , 100                as revenue   
    union all
    select
        date_sub(current_date(), interval 1 month) as dt
        , 'premium_group'    as product_type
        , 200                as revenue
    union all
    select
        date_sub(current_date(), interval 2 month) as dt
        , 'bots'             as product_type
        , 300                as revenue
  )
)
pivot
(
  -- #2 aggregate
  avg(revenue) as avg_revenue_
  -- #3 pivot_column
  for product_type in ('premium_account', 'premium_group')
)
;

13. First_value / last_value

این تابع مفید دیگری است که به دریافت یک دلتا برای هر ردیف در برابر مقدار اول / آخر در آن پارتیشن خاص کمک می کند.

with top_spenders as (
    select 1 as user_id, 100 as total_spend, 11   as reputation_level union all
    select 2 as user_id, 150 as total_spend, 11   as reputation_level union all
    select 3 as user_id, 250 as total_spend, 11   as reputation_level union all
    select 11 as user_id, 1000 as total_spend, 22   as reputation_level union all
    select 22 as user_id, 1500 as total_spend, 22   as reputation_level union all
    select 33 as user_id, 2500 as total_spend, 22   as reputation_level 

)
, data as (
    select
        user_id
        ,total_spend
        ,reputation_level
        ,first_value(total_spend)
    over (partition by reputation_level order by total_spend desc
    rows between unbounded preceding and unbounded following) as top_spend
  from top_spenders
)

select
    user_id
    ,reputation_level
    ,total_spend
    ,top_spend          as top_spend_by_rep_level
    ,total_spend - top_spend as delta_in_usd
from data
;

14. یک جدول را به Array of structs تبدیل کنید و آنها را به UDF منتقل کنید

این زمانی مفید است که شما نیاز به اعمال یک تابع تعریف شده توسط کاربر (UDF) با منطق پیچیده برای هر ردیف یا جدول دارید. همیشه می توانید جدول خود را به عنوان آرایه ای از اشیاء TYPE STRUCT در نظر بگیرید و سپس هر یک از آنها را به UDF ارسال کنید.بستگی به منطق شما داره به عنوان مثال، من از آن برای محاسبه زمان انقضای خرید استفاده می کنم:

select 
     target_id
    ,product_id
    ,product_type_id
    ,production.purchase_summary_udf()(
        ARRAY_AGG(
            STRUCT(
                target_id
                , user_id
                , product_type_id
                , product_id
                , item_count
                , days
                , expire_time_after_purchase
                , transaction_id 
                , purchase_created_at 
                , updated_at
            ) 
            order by purchase_created_at
        )
    ) AS processed

from new_batch
;

به روشی مشابه می توانید جداول را بدون نیاز به استفاده از UNION ALL ایجاد کنید. به عنوان مثال، من از آن برای تمسخر برخی از داده های آزمایشی برای تست های واحد استفاده می کنم. به این ترتیب فقط با استفاده از Alt+Shift+Down در ویرایشگر خود می توانید آن را بسیار سریع انجام دهید.

select * from unnest([
        struct
        (     
            1                                 as user_id
        ,   111                               as reputation
        ,   timestamp('2021-12-16 13:00:01')  as update_time
        
        ),

        (
            2                                 --as user_id
        ,   111                               --as reputation
        ,   timestamp('2011-12-16 13:00:01')  --as update_time
        ),

        (
            3                                 --as user_id
        ,   111                               --as reputation
        ,    timestamp(format_timestamp("%Y-%m-%d 12:59:01 UTC" ,timestamp(date_sub(current_date(), interval 0 day))))   --as update_time
        )
        ]
    ) as t

15. ایجاد قیف رویداد با استفاده از FOLLOWING و UNBUNDED FOLLOWING

مثال خوب ممکن است قیف های بازاریابی باشد. مجموعه داده شما ممکن است حاوی رویدادهایی از یک نوع باشد که به طور مداوم تکرار می شوند، اما در حالت ایده آل شما می خواهید هر رویداد را با رویداد بعدی از نوع متفاوت زنجیر کنید. این ممکن است زمانی مفید باشد که برای ایجاد مجموعه داده قیف نیاز به دریافت لیستی از چیزی دارید، مانند رویدادها، خریدها و غیره. کار با PARTITION BY به شما این فرصت را می دهد که همه رویدادهای بعدی را بدون توجه به تعداد آنها در هر پارتیشن گروه بندی کنید.

with d as (
select * from unnest([
  struct('0003f' as user_pseudo_id, 12322175 as user_id, timestamp '2020-10-10 16:46:59.878 UTC' as event_timestamp, 'join_group' as event_name),
  ('0003',12,timestamp '2022-10-10 16:50:03.394 UTC','set_avatar'),
  ('0003',12,timestamp '2022-10-10 17:02:38.632 UTC','set_avatar'),
  ('0003',12,timestamp '2022-10-10 17:09:38.645 UTC','set_avatar'),
  ('0003',12,timestamp '2022-10-10 17:10:38.645 UTC','join_group'),
  ('0003',12,timestamp '2022-10-10 17:15:38.645 UTC','create_group'),
  ('0003',12,timestamp '2022-10-10 17:17:38.645 UTC','create_group'),
  ('0003',12,timestamp '2022-10-10 17:18:38.645 UTC','in_app_purchase'),
  ('0003',12,timestamp '2022-10-10 17:19:38.645 UTC','spend_virtual_currency'),
  ('0003',12,timestamp '2022-10-10 17:19:45.645 UTC','create_group'),
  ('0003',12,timestamp '2022-10-10 17:20:38.645 UTC','set_avatar')
  ]
  ) as t)

  , event_data as (
SELECT 
    user_pseudo_id
  , user_id
  , event_timestamp
  , event_name
  , ARRAY_AGG(
        STRUCT(
              event_name AS event_name
            , event_timestamp AS event_timestamp
        )
    ) 
    OVER(PARTITION BY user_pseudo_id ORDER BY event_timestamp ROWS BETWEEN 1 FOLLOWING AND  UNBOUNDED FOLLOWING ) as next_events

FROM d
WHERE
DATE(event_timestamp) = "2022-10-10" 

)
select
    user_pseudo_id
  , user_id
  , event_timestamp
  , event_name
  , (SELECT 
        event_name FROM UNNEST(next_events) next_event
    WHERE t.event_name != event_name
    ORDER BY event_timestamp  LIMIT 1
    -- change to ORDER BY event_timestamp desc if prev event needed
  ) next_event
  , (SELECT 
        event_timestamp FROM UNNEST(next_events) next_event
    WHERE t.event_name != event_name
    ORDER BY event_timestamp  LIMIT 1
    -- change to ORDER BY event_timestamp desc if prev event needed
  ) next_event_ts

from event_data t
;

16. Regexp

در صورت نیاز به استخراج چیزی از داده های بدون ساختار، مانند نرخ های fx، گروه بندی های سفارشی و غیره، از آن استفاده می کنید. کار با نرخ تبدیل ارز با استفاده از regexp این مثال را با داده های نرخ ارز در نظر بگیرید:

-- One or more digits (\d+), optional period (\.?), zero or more digits (\d*).
with object as
(select  '{"aed":3.6732,"afn":78.45934,"all":110.586428}' as rates)

, data as (
select "usd" as base_currency,
  regexp_extract_all(rates, r'"[^"]+":\d+\.?\d*') as pair
from object
)
, splits as (
select base_currency, pair, split(pair, ':') positions 
from data cross join unnest (pair) as pair
)
select base_currency, pair,  positions[offset(0)] as rate_currency,  positions[offset(1)] as rate
from splits  
;

کار با نسخه های برنامه با استفاده از regexp گاهی اوقات ممکن است بخواهید از regexp برای دریافت نسخه‌های اصلی، نسخه‌های منتشر شده یا مود برای برنامه خود و ایجاد یک گزارش سفارشی استفاده کنید:

with events as (
  select  'open_chat' as event_name, '10.1.0' as app_display_version union all
  select  'open_chat' as event_name, '10.1.9' as app_display_version union all
  select  'open_chat' as event_name, '9.1.4' as app_display_version union all
  select  'open_chat' as event_name, '9.0.0' as app_display_version
)
select
     app_display_version
    ,REGEXP_EXTRACT(app_display_version, '^[^.^]*') main_version
    ,safe_cast(REGEXP_EXTRACT(app_display_version, '[0-9]+.[0-9]+') as float64) release_version
    ,safe_cast(REGEXP_EXTRACT(app_display_version, r"^[a-zA-Z0-9_.+-]+.[a-zA-Z0-9-]+\.([a-zA-Z0-9-.]+$)") as int64) as mod_version
from events
;

نتیجه

SQL ابزار قدرتمندی است که به دستکاری داده ها کمک می کند. امیدواریم این موارد استفاده از SQL از بازاریابی دیجیتال برای شما مفید باشد. این در واقع یک مهارت مفید است و می تواند در بسیاری از پروژه ها به شما کمک کند. این قطعه‌های SQL زندگی من را بسیار آسان‌تر کرد و تقریباً هر روز در محل کار از آن استفاده می‌کنم. بیشتر، SQL و انبارهای داده مدرن ابزارهای ضروری برای علم داده هستند. ویژگی‌های گویش قوی آن امکان مدل‌سازی و تجسم داده‌ها را به راحتی فراهم می‌کند.از آنجا که SQL زبانی است که انبارهای داده و متخصصان هوش تجاری از آن استفاده می کنند، اگر می خواهید داده ها را با آنها به اشتراک بگذارید، انتخاب بسیار خوبی است. این رایج ترین راه برای برقراری ارتباط با تقریباً هر راه حل انبار داده / دریاچه در بازار است.

sql

Na.Ka

شاید از این پست‌ها خوشتان بیاید

Na.Ka

خواندن ۱۵ دقیقه·۳ سال پیش

16 تکنیک SQL که هر مبتدی باید بداند

در مقیاس 1 تا 10 مهارت های ذخیره سازی داده شما چقدر خوب است؟

می خواهید بالاتر از 7/10 بروید؟ پس این مقاله برای شماست.

SQL شما چقدر خوب است؟ آیا می خواهید در اسرع وقت برای یک مصاحبه شغلی آماده شوید؟

1. جداول افزایشی و ادغام

insert target_table (transaction_id)select transaction_id from source_table where transaction_id > (select max(transaction_id) from target_table);

ادغام و به روز رسانی افزایشی

زمانی که دو یا چند داده با هم مطابقت داشته باشند می توان از عبارت UPDATE یا DELETE استفاده کرد.
بند INSERT زمانی قابل استفاده است که دو یا چند داده متفاوت باشند و مطابقت نداشته باشند.
بند UPDATE یا DELETE همچنین می تواند زمانی استفاده شود که داده های داده شده با منبع مطابقت نداشته باشند.

این SQL را در نظر بگیرید:

create temp table last_online as (
    select 1 as user_id
    , timestamp('2000-10-01 00:00:01') as last_online
)
;
create temp table connection_data  (
  user_id int64
  ,timestamp timestamp
)
PARTITION BY DATE(_PARTITIONTIME)
;
insert connection_data (user_id, timestamp)
    select 2 as user_id
    , timestamp_sub(current_timestamp(),interval 28 hour) as timestamp
union all
    select 1 as user_id
        , timestamp_sub(current_timestamp(),interval 28 hour) as timestamp
union all
    select 1 as user_id
        , timestamp_sub(current_timestamp(),interval 20 hour) as timestamp
union all
    select 1 as user_id
    , timestamp_sub(current_timestamp(),interval 1 hour) as timestamp
;

merge last_online t
using (
  select
      user_id
    , last_online
  from
    (
        select
            user_id
        ,   max(timestamp) as last_online

        from 
            connection_data
        where
            date(_partitiontime) >= date_sub(current_date(), interval 1 day)
        group by
            user_id

    ) y

) s
on t.user_id = s.user_id
when matched then
  update set last_online = s.last_online, user_id = s.user_id
when not matched then
  insert (last_online, user_id) values (last_online, user_id)
;
select * from last_online
;

2. شمارش کلمات

with titles as (
    select 'Title with word foo' as title union all
    select 'Title with word bar'
)
, data as (
select 
    title, 
    split(title, ' ') as words 
from 
    titles
)
select * from data, unnest(words) words
where
    words in ('bar')
;

3. استفاده از دستور IF() خارج از دستور SELECT

with daily_revenue as (
select
      current_date() as dt
    , 100          as revenue
    union all
select
      date_sub(current_date(), interval 1 day) as dt
    , 100          as revenue
)
select
*
from daily_revenue
where
    if(revenue >101,1,0) = 1
;

SELECT *
FROM `firebase.events`
WHERE IF(condition,
         _TABLE_SUFFIX BETWEEN '20170101' AND '20170117',
         _TABLE_SUFFIX BETWEEN '20160101' AND '20160117')
;

همچنین می توانید از آن در توابع HAVING و AGGREGATE استفاده کنید.

4. استفاده از GROUP BY ROLLUP

تابع ROLLUP برای انجام تجمیع در سطوح چندگانه استفاده می شود. این زمانی مفید است که باید با نمودارهای بعد کار کنید.

with data as (
    select
    current_timestamp() as ts           
    ,'stage'            as context_type 
    ,1                  as user_id      
    ,100                as credit_value 
    , true              as is_gift
union all
    select
    timestamp_sub(current_timestamp(), interval 24 hour) as ts           
    ,'user'             as context_type 
    ,1                  as user_id      
    ,200                as credit_value 
    ,false              as is_gift
union all
    select
    timestamp_sub(current_timestamp(), interval 24*2 hour) as ts           
    ,'user'             as context_type 
    ,3                  as user_id      
    ,300                as credit_value 
    ,true               as is_gift

)
, results as (
select 
     date(ts) as date 
    ,context_type
    ,sum(credit_value)/100 as daily_credits_spend
from data    

group by rollup(1, context_type)
order by 1
)

select
  date
  ,if(context_type is null, 'total', context_type) as context_type
  ,daily_credits_spend
from results
order by date
;

5. تبدیل جدول به JSON

with mytable as (
 select 1 as x, 'foo' as y, true as z union all
 select 2, 'bar', false
)
select 
    concat("{", "\"MyTable\":", "[", string_agg(to_json_string(t), ","), "]", "}")
from mytable as t
;

سپس می توانید از آن در هر جایی استفاده کنید: تاریخ ها، قیف های بازاریابی، شاخص ها، نمودارهای هیستوگرام و غیره.

6. استفاده از PARTITION BY

select
     date
    ,user_id
    ,total_cost
    ,sum(total_cost) over (partition by date,user_id) as revenue_per_day
from production.payment_transaction
;

7. میانگین متحرک

with dates as (
select
    dt
from 
    unnest(generate_date_array(date_sub(current_date(), interval 90 day), current_date(), interval 1 day)) as dt
)

, data as (
    select dt
        , CEIL(RAND()*1000) as revenue -- just some random data.
    from
        dates
)
select
  dt
, revenue
, AVG(revenue) OVER(ORDER BY unix_date(dt) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) as seven_day_moving_average
from data
;

8. آرایه های تاریخ

select
 dt
from 
    unnest(generate_date_array('2019–12–04', '2020–09–17', interval 1 day)) as dt
;

9. Row_number()

with reputation_data as (
select
      1     as user_id
    , 100   as reputation
    , 1     as reputation_level
    , timestamp_sub(current_timestamp(), interval 3 hour) as ts
union all
select
      1     as user_id
    , 101   as reputation
    , 1     as reputation_level
    , timestamp_sub(current_timestamp(), interval 2 hour)
union all
select
      1     as user_id
    , 200   as reputation
    , 2     as reputation_level
    , timestamp_sub(current_timestamp(), interval 1 hour)
)
select *
from reputation_data a
qualify row_number() over (partition by a.user_id order by a.ts desc) = 1
;

10. NTILE()

select (case when tile = 50 then 'median' when tile = 95 then '95%' else '5%' end) as tile
    , dt
    , max(cast( round(duration/1000) as numeric)/1000 ) max_duration_s
    , min(cast( round(duration/1000) as numeric)/1000 ) min_duration_s

from (
    select 
         trace_info.duration_us duration
        , ntile(100) over (partition by (date(event_timestamp)) order by trace_info.duration_us) tile
        , date(event_timestamp) dt

    from firebase_performance.my_mobile_app 
    where 
        date(_partitiontime) >= parse_date('%y%m%d', @ds_start_date) and date(_partitiontime) <= parse_date('%y%m%d', @ds_end_date)
        and 
        date(event_timestamp) >= parse_date('%y%m%d', @ds_start_date)
        and 
        date(event_timestamp) <= parse_date('%y%m%d', @ds_end_date)
    and lower(event_type) = "duration_trace"
    and lower(event_name) = 'logon'
) x
WHERE tile in (5, 50, 95)
group by dt, tile
order by dt
;

11. رتبه / تراکم_رتبه

with top_spenders as (
    select 1 as user_id, 100 as total_spend, 11   as reputation_level union all
    select 2 as user_id, 250 as total_spend, 11   as reputation_level union all
    select 3 as user_id, 250 as total_spend, 11   as reputation_level union all
    select 4 as user_id, 300 as total_spend, 11   as reputation_level union all
    select 11 as user_id, 1000 as total_spend, 22   as reputation_level union all
    select 22 as user_id, 1500 as total_spend, 22   as reputation_level union all
    select 33 as user_id, 1500 as total_spend, 22   as reputation_level union all
    select 44 as user_id, 2500 as total_spend, 22   as reputation_level 

)

select 
    user_id
    , rank() over(partition by reputation_level order by total_spend desc) as rank
    , dense_rank() over(partition by reputation_level order by total_spend desc) as dense_rank
from
    top_spenders
;

نمونه ای دیگر با قیمت محصولات:

with products as (
    
    select
        2                    as product_id      
        , 'premium_account'  as product_type    
        , 100                as total_cost   
    union all
    select
        1                    as product_id      
        , 'premium_group'    as product_type
        , 200                as total_cost
    union all
    select
        111                  as product_id      
        , 'bots'             as product_type    
        , 300                as total_cost      
    union all
    select
        112                  as product_id      
        , 'bots'             as product_type    
        , 400                as total_cost      
    union all
    select
        113                  as product_id      
        , 'bots'             as product_type    
        , 500                as total_cost      
    union all
    select
        213                  as product_id      
        , 'bots'             as product_type    
        , 300                as total_cost      
  
)
select * from (
	select
		  product_id
		, product_type
		, total_cost as product_price
		, dense_rank () over ( 
			partition by product_type
			order by total_cost desc
		) price_rank 
	from
		products
) t
where price_rank < 3
;

12. Pivot / unpivot

Pivot سطرها را به ستون تغییر می دهد. این همه کاری است که انجام می دهد. Unpivot برعکس عمل می کند.

select * from
(
  -- #1 from_item
  select 
     extract(month from dt) as mo         
    ,product_type    
    ,revenue   
  from (
    select
        date(current_date()) as dt              
        , 'premium_account'  as product_type    
        , 100                as revenue   
    union all
    select
        date_sub(current_date(), interval 1 month) as dt
        , 'premium_group'    as product_type
        , 200                as revenue
    union all
    select
        date_sub(current_date(), interval 2 month) as dt
        , 'bots'             as product_type
        , 300                as revenue
  )
)
pivot
(
  -- #2 aggregate
  avg(revenue) as avg_revenue_
  -- #3 pivot_column
  for product_type in ('premium_account', 'premium_group')
)
;

13. First_value / last_value

این تابع مفید دیگری است که به دریافت یک دلتا برای هر ردیف در برابر مقدار اول / آخر در آن پارتیشن خاص کمک می کند.

with top_spenders as (
    select 1 as user_id, 100 as total_spend, 11   as reputation_level union all
    select 2 as user_id, 150 as total_spend, 11   as reputation_level union all
    select 3 as user_id, 250 as total_spend, 11   as reputation_level union all
    select 11 as user_id, 1000 as total_spend, 22   as reputation_level union all
    select 22 as user_id, 1500 as total_spend, 22   as reputation_level union all
    select 33 as user_id, 2500 as total_spend, 22   as reputation_level 

)
, data as (
    select
        user_id
        ,total_spend
        ,reputation_level
        ,first_value(total_spend)
    over (partition by reputation_level order by total_spend desc
    rows between unbounded preceding and unbounded following) as top_spend
  from top_spenders
)

select
    user_id
    ,reputation_level
    ,total_spend
    ,top_spend          as top_spend_by_rep_level
    ,total_spend - top_spend as delta_in_usd
from data
;

14. یک جدول را به Array of structs تبدیل کنید و آنها را به UDF منتقل کنید

select 
     target_id
    ,product_id
    ,product_type_id
    ,production.purchase_summary_udf()(
        ARRAY_AGG(
            STRUCT(
                target_id
                , user_id
                , product_type_id
                , product_id
                , item_count
                , days
                , expire_time_after_purchase
                , transaction_id 
                , purchase_created_at 
                , updated_at
            ) 
            order by purchase_created_at
        )
    ) AS processed

from new_batch
;

select * from unnest([
        struct
        (     
            1                                 as user_id
        ,   111                               as reputation
        ,   timestamp('2021-12-16 13:00:01')  as update_time
        
        ),

        (
            2                                 --as user_id
        ,   111                               --as reputation
        ,   timestamp('2011-12-16 13:00:01')  --as update_time
        ),

        (
            3                                 --as user_id
        ,   111                               --as reputation
        ,    timestamp(format_timestamp("%Y-%m-%d 12:59:01 UTC" ,timestamp(date_sub(current_date(), interval 0 day))))   --as update_time
        )
        ]
    ) as t

15. ایجاد قیف رویداد با استفاده از FOLLOWING و UNBUNDED FOLLOWING

with d as (
select * from unnest([
  struct('0003f' as user_pseudo_id, 12322175 as user_id, timestamp '2020-10-10 16:46:59.878 UTC' as event_timestamp, 'join_group' as event_name),
  ('0003',12,timestamp '2022-10-10 16:50:03.394 UTC','set_avatar'),
  ('0003',12,timestamp '2022-10-10 17:02:38.632 UTC','set_avatar'),
  ('0003',12,timestamp '2022-10-10 17:09:38.645 UTC','set_avatar'),
  ('0003',12,timestamp '2022-10-10 17:10:38.645 UTC','join_group'),
  ('0003',12,timestamp '2022-10-10 17:15:38.645 UTC','create_group'),
  ('0003',12,timestamp '2022-10-10 17:17:38.645 UTC','create_group'),
  ('0003',12,timestamp '2022-10-10 17:18:38.645 UTC','in_app_purchase'),
  ('0003',12,timestamp '2022-10-10 17:19:38.645 UTC','spend_virtual_currency'),
  ('0003',12,timestamp '2022-10-10 17:19:45.645 UTC','create_group'),
  ('0003',12,timestamp '2022-10-10 17:20:38.645 UTC','set_avatar')
  ]
  ) as t)

  , event_data as (
SELECT 
    user_pseudo_id
  , user_id
  , event_timestamp
  , event_name
  , ARRAY_AGG(
        STRUCT(
              event_name AS event_name
            , event_timestamp AS event_timestamp
        )
    ) 
    OVER(PARTITION BY user_pseudo_id ORDER BY event_timestamp ROWS BETWEEN 1 FOLLOWING AND  UNBOUNDED FOLLOWING ) as next_events

FROM d
WHERE
DATE(event_timestamp) = "2022-10-10" 

)
select
    user_pseudo_id
  , user_id
  , event_timestamp
  , event_name
  , (SELECT 
        event_name FROM UNNEST(next_events) next_event
    WHERE t.event_name != event_name
    ORDER BY event_timestamp  LIMIT 1
    -- change to ORDER BY event_timestamp desc if prev event needed
  ) next_event
  , (SELECT 
        event_timestamp FROM UNNEST(next_events) next_event
    WHERE t.event_name != event_name
    ORDER BY event_timestamp  LIMIT 1
    -- change to ORDER BY event_timestamp desc if prev event needed
  ) next_event_ts

from event_data t
;

16. Regexp

-- One or more digits (\d+), optional period (\.?), zero or more digits (\d*).
with object as
(select  '{"aed":3.6732,"afn":78.45934,"all":110.586428}' as rates)

, data as (
select "usd" as base_currency,
  regexp_extract_all(rates, r'"[^"]+":\d+\.?\d*') as pair
from object
)
, splits as (
select base_currency, pair, split(pair, ':') positions 
from data cross join unnest (pair) as pair
)
select base_currency, pair,  positions[offset(0)] as rate_currency,  positions[offset(1)] as rate
from splits  
;

with events as (
  select  'open_chat' as event_name, '10.1.0' as app_display_version union all
  select  'open_chat' as event_name, '10.1.9' as app_display_version union all
  select  'open_chat' as event_name, '9.1.4' as app_display_version union all
  select  'open_chat' as event_name, '9.0.0' as app_display_version
)
select
     app_display_version
    ,REGEXP_EXTRACT(app_display_version, '^[^.^]*') main_version
    ,safe_cast(REGEXP_EXTRACT(app_display_version, '[0-9]+.[0-9]+') as float64) release_version
    ,safe_cast(REGEXP_EXTRACT(app_display_version, r"^[a-zA-Z0-9_.+-]+.[a-zA-Z0-9-]+\.([a-zA-Z0-9-.]+$)") as int64) as mod_version
from events
;

نتیجه

sql

Na.Ka

شاید از این پست‌ها خوشتان بیاید