Chapter 7 Most Used Tables
7.1 Traditional Data
Within the databases that we covered in one of the previous sections is a litany of tables that we access for various purposes. Below is a list of the popular tables we reference with a short description of what they are for. If you have any questions, talk to your supervisor.
- FHVHV_TripRecord
- Trip record table for High Volume (UBER, Lyft, VIA prior to Sep 2021) trips after 2019-01, each row represents a trip.
- Database: Azure_Trip_Data
- Note: Dont ever pull the whole table, it will crash your PC. Use ‘datetimeid’ for dates - it is indexed.
- Sample pull:
100 *
SELECT top
FROM [TPEP_AZURE].[TPEPDW].[dbo].[FHVHV_TripRecord]>= 2021080100 and datetimeid < 2021110100 where datetimeid
- FHV_Prd_TripRecord
- Trip record table for all fhvs (high volume and non high volume before 2019-02) and just traditional fhvs (after 2019-01). Each row represents a trip.
- Database: Azure_Trip_Data
- Note: Dont ever pull the whole table, it will crash your PC. Use ‘datetimeid’ for dates - it is indexed.
- Sample pull:
100 *
SELECT top
FROM [TPEP_AZURE].[TPEPDW].[dbo].[FHV_Prd_TripRecord]>= 2021080100 and datetimeid < 2021110100 where datetimeid
- vw_FHVALL_Triprecord
- Combined view of the 2 tables above combined. Each row represents a trip.
- Database: Azure_Trip_Data
- Note: Not every column that is present in the FHVHV_TripRecord present in the FHV_Prd_TripRecord. For example, anything that has to do with financial information.
- Sample pull:
100 *
SELECT top
FROM [TPEP_AZURE].[TPEPDW].[dbo].[vw_FHVALL_Triprecord]>= 2021080100 and datetimeid < 2021110100 where datetimeid
- Tpep2_triprecord
- Trip record table for medallion (yellow) trips after 2010, each row represents a yellow cab trip .
- Database: Azure_Trip_Data
- Note: Dont ever pull the whole table, it will crash your PC. Use ‘datetimeid’ for dates - it is indexed.
- Sample pull:
100 *
SELECT top
FROM [TPEP_AZURE].[TPEPDW].[dbo].[Tpep2_triprecord]>= 2021080100 and datetimeid < 2021110100 where datetimeid
- Lpep2_triprecord
- Trip record table for SHL (green) trips after 2010, each row represents a yellow cab trip .
- Database: Azure_Trip_Data
- Note: This table’s tructure is very similar to Tpep2_triprecord, but it has much fewer records. Use ‘datetimeid’ for dates - it is indexed.
- Sample pull:
100 *
SELECT top
FROM [TPEP_AZURE].[TPEPDW].[dbo].[Lpep2_triprecord]>= 2021080100 and datetimeid < 2021110100 where datetimeid
- DimLocation
- Super important table if you are doing spacial analysis.
- Database: Azure_Trip_Data
- Note: Useful when you are joining it to the trip records by the locationid column
- Sample pull:
*
SELECT FROM [TPEP_AZURE].[TPEPDW].[dbo].[DimLocation]
- Fhv_base_list
- The list of bases. Bases are companies that dispatch trips. The table might not be super helpful on its own, but it is super useful when you join it to trip tables to figure out an industry or a company name of a base that dispatched a trip
- Database: Azure_Trip_Data
- Note: Not a big table. In the example below, look at the last 5 columns.
- Sample pull:
100 *
select top
FROM
[TPEP_AZURE].[TPEPDW].[dbo].[FHV_Prd_TripRecord]
AS TRIPS
INNER JOIN [TPEP_AZURE].[TPEPDW].[dbo].[fhv_base_list] bases on = bases.[LIC_NO] TRIPS.[Dispatching_base_num]
- Tlc_camis_entity
- A snapshot of all entities (all licensees including but not limited to drivers, vehicles, bases) and pertinent information like the license application date, addresses, etc.
- Database: DataWarehouse
- Note: Very important table. We have a very extensive printed documentation for it. Ask your supervisor for it.
- Sample pull: All active drivers
- A snapshot of all entities (all licensees including but not limited to drivers, vehicles, bases) and pertinent information like the license application date, addresses, etc.
rtrim(ltrim(fed_id)) as fed_id, rtrim(ltrim(lic_no)) as lic_no, lic_code, lic_exp_date
SELECT entity_nam,
from tlc_camis_entityin ('HDR','CDR') and STAT_ENTITY_LIC IN ('002','009','010','RNA','ANL') where lic_code
- Tlc_plate
- A table holding all current and historical plate information for vehicles.
- Database: DataWarehouse
- Note: .
- Sample pull: pull top 100
100 *
SELECT top from tlc_plate
7.2 TLC Datawarehouse
The Data Team, collaborating with IT, built out a data warehouse that automatically aggregates the most often requested data points on a set schedule, drastically increasing the speed with which data can be pulled and analyzed. Most tables in the warehouse update automatically and run on a set schedule. The standard workflow of creating a new table is as follows:
Create a new table with some initial data straight from the SSMS, Python, or R.
Create a Stored Procedure script. The goal of that script is to update the table that you created.
Create a Job in the SSMS job scheduler. That job will run the Stored Procedure you created in step 2 on a specified schedule.
Datawarehouse SSMS view:
Here is a connection example using R and Python. Note: you must have the ODBC connection set up, as shown in section 4.2 of this manual.
- R:
library(RODBC)
= odbcConnect("TLC_Policy_Programs_Dev", uid = "...")
tp2
<- sqlQuery(tp2,
test "SELECT *
FROM [TLC_Policy_Programs_Dev].[dbo].[high_volume_indicators_weekly_financials]")
- Python:
import pyodbc
= urllib.parse.quote_plus("DRIVER={SQL Server};SERVER=msdwvd-tlctxy01.csc.nycnet;DATABASE=TLC_Policy_Programs_Dev;Trusted_Connection=yes")
params= create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
engine
= '''
sql SELECT *
FROM [TLC_Policy_Programs_Dev].[dbo].[high_volume_indicators_weekly_financials]
'''
= pd.read_sql_query(sql, engine) test
The data dictionaries for the majority of tables in the Datawarehouse are in here:
:\COF\COF\_M3trics2\automation\data_dictionaries I
There is also a standardization guide for creating tables and views in the warehouse. You can access it here:
:\COF\COF\_DA&E_\Nikita\Supporting_docs I
7.3 Tables
There are many valuable tables in the Datawarehouse. I recommend going through the documentation folder to get accustomed to some of them. I want to list the top 5 most used tables in this document, though:
- industry_indicators_daily_trips
- This table goes back to 2014 (inclusive) for yellow and green and 2015 for fhvs. It contains trip counts aggregated by day, split by every industry.
- Database: TLC_Policy_Programs_Dev
- Note: This table will save you a ton of time.
- Sample pull:
TOP (1000) [period_start]
SELECT
,[period_end]
,[metric_day]
,[industry]
,[count_trips] FROM [TLC_Policy_Programs_Dev].[dbo].[industry_indicators_daily_trips]
- data_reports_monthly_indicators_all
- These are a set of published metrics updated every month and reviewed with the commissioner before updating. They cover a myriad of relevant metrics for specific industries we regulate.
- Database: TLC_Policy_Programs_Dev
- Note: Serves as a base for the monthly indicators that we publish to our website.
- Sample pull:
TOP (1000) [Month_Year]
SELECT
,[License_Class]
,[Trips_Per_Day]
,[Farebox_Per_Day]
,[Unique_Drivers]
,[Unique_Vehicles]
,[Vehicles_Per_Day]
,[Avg_Days_Vehicles_on_Road]
,[Avg_Hours_Per_Day_Per_Vehicle]
,[Avg_Days_Drivers_on_Road]
,[Avg_Hours_Per_Day_Per_Driver]
,[Avg_Minutes_Per_Trip]
,[Percent_of_Trips_Paid_with_Credit_Card]
,[Trips_Per_Day_Shared] FROM [TLC_Policy_Programs_Dev].[dbo].[data_reports_monthly_indicators_all]
- high_volume_indicators_weekly_financials
- These are a set of metrics we created to track driver income on a Monday to Sunday weekly schedule.
- Database: TLC_Policy_Programs_Dev
- Note: Created in a python script they piggyback off utilization to come up with our best estimate on high volume driver income.
- Sample pull:
TOP (1000) [date]
SELECT
,[metric_week]
,[aggregate_pay]
,[aggregate_hours]
,[aggregate_hourly_pay]
,[median_total_pay]
,[median_logon_hours]
,[median_hourly_pay]
,[driver_count]
,[pay_per_driver]
FROM [TLC_Policy_Programs_Dev].[dbo].[high_volume_indicators_weekly_financials] order by metric_week desc
- industry_zone_indicators_monthly_pickups
- Count of pickups split by month, industry, and taxi zone (265).
- Database: TLC_Policy_Programs_Dev
- Note: This table will save you a ton of time.
- Sample pull:
TOP (1000) [period_start]
SELECT
,[period_end]
,[metric_month]
,[industry]
,[zone]
,[count_pickups]
,[count_pickups_shared]
,[count_pickups_ehail]
FROM [TLC_Policy_Programs_Dev].[dbo].[industry_zone_indicators_monthly_pickups] order by [metric_month] desc
- company_indicators_weekly_utilization_even
- Driver utilization is calculated and loaded into our policy dev server. It is currently run unweighted, meaning that app logon time which is the denominator in this calculation is evenly split for apps a driver is logged into simultaneously.
- Database: TLC_Policy_Programs_Dev
- Note: Note that every nth time a year we re-evaluate utilization publicly as per the law – legal can provide more assistance on the timeline as Ryan wrote the rules.
- Sample pull:
TOP (1000) [period_start]
SELECT
,[period_end]
,[metric_week]
,[company]
,[sum_cruising_seconds]
,[sum_passenger_seconds]
,[pct_utilization] FROM [TLC_Policy_Programs_Dev].[dbo].[company_indicators_weekly_utilization_even]
This section will be developed more in the future.
This work is licensed under a Creative Commons Attribution 4.0 International License.