Compilation of my live tweets from SNIA’s SDC 2012 (Storage Developer Conference)

Here is a compilation of my live tweets from SNIA’s SDC 2012 (Storage Developer Conference).
You can also read them directly from Twitter at http://twitter.com/josebarreto (in reverse order).

Notes and disclaimers

  • These tweets were typed during the talks and they include typos and my own misinterpretations.
  • The text under each talk consists of quotes from the speaker or text from the speaker's slides, not my personal opinion.
  • If you feel that I misquoted you or badly represented the content of a talk, please add a comment to the post.
  • I spent only limited time fixing typos or correcting the text after the event. Just so many hours in a day...
  • I have not attended all sessions (since there are 4 or 5 at a time, that would actually not be possible :-)…
  • SNIA usually posts the actual PDF decks a few weeks after the event. Attendees have access immediately.

Linux CIFS/SMB2 Kernel Clients - A Year In Review by Steven French, IBM

  • SMB3 will be important for Linux, not just Windows #sdc2012
  • Linux kernel supports SMB. Kernel 3.7 (Q4-2012) includes 71 changes related to SMB (including SMB 2.1), 3.6 has 61, 3.5 has 42
  • SMB 2.1 kernel code in Linux enabled as experimental in 3.7. SMB 2.1 will replace CIFS as the default client when stable.
  • SMB3 client (CONFIG_EXPERIMENTAL) expected by Linux kernel 3.8.
  • While implementing Linux client for SMB3, focusing on strengths: clustering, RDMA. Take advantage of great protocol docs

Multiuser CIFS Mounts, Jeff Layton, Red Hat

  • I attended this session, but tweeted just the session title.

How Many IOPS is Enough by Thomas Coughlin, Coughlin Associates

  • 79% of surveyed people said they need between 1K and 1M IOPs. Capacity: from 1GB to 50TB with sweet spot on 500GB.
  • 78% of surveyed people said hardware delivers between 1K and 1M IOPs, with a sweet spot at 100K IOPs. Matches requirements  
  • Minimum latency of system hardware (before other bottlenecks) ranges from >1 sec down to <10 ns. 35% at 10ms latency.
  • $/GB for SSD and HDD both declining in parallel paths. $/GB roughly follows IOPs.
  • Survey results will be available in October...

SMB 3.0 (Because 3 > 2) - David Kruse, Microsoft

  • Fully packed room to hear David's SMB3 talk. Plus a few standing in the back... pic.twitter.com/TT5mRXiT
  • Time to ponder: When should we recommend disabling SMB1/CIFS by default?

Understanding Hyper-V over SMB 3.0 Through Specific Test Cases with Jose Barreto

  • No tweets during this session. Hard to talk and tweet at the same time :-)

Continuously Available SMB – Observations and Lessons Learned - David Kruse and Mathew George.

  • I attended this session, but tweeted just the session title.

Status of SMB2/SMB3 Development in Samba, Michael Adam, Samba Team

  • SMB 2.0 officially supported in Samba 3.6 (about a year ago, August 2011)
  • SMB 2.1 work done in Samba for Large MTU, multi-credit, dynamic re-authentication
  • Samba 4.0 will be the release to incorporate SMB 3.0 (encryption and secure negotiate already done)

The Solid State Storage (R-)Evolution, Michael Krause, Hewlett-Packard

  • Storage (especially SSD) performance constrained by SAS interconnects
  • Looking at serviceability from DIMM to PCIe to SATA to SAS. Ease of replacement vs. performance.
  • No need to re-invent SCSI. All OS, hypervisors, file systems, PCIe storage support SCSI.
  • Talking SCSI Express. Potential to take advantage of PCIe capabilities.
  • PCIe has benefits but some challenges: Non optimal DMA "caching", non optimal MMIO performance
  • Everything in the world of storage is about to radically change in a few years: SATA, SAS, PCIe, Memory
  • Downstream Port Containment. OS informed of async communications lost.
  • OCuLink: new PCIe cable technology
  • Hardware revolution: stacked media, MCM / On-die, DIMM. Main memory in 1 to 10 TB. Everything in memory?
  • Express Bay (SFF 8639 connector), PCIe CEM (MMIO based semantics), yet to be developed modules
  • Media is going to change. $/bit, power, durability, performance vs. persistence. NAND future bleak.
  • Will every memory become persistent memory? Not sci-fi, this could happen in a few years...
  • Revolutionary changes coming in media. New protocols, new hardware, new software. This is only the beginning

Block Storage and Fabric Management Using System Center 2012 Virtual Machine Manager and SMI-S, Madhu Jujare, Microsoft

  • Windows Server 2012 Storage Management APIs are used by VMM 2012. An abstraction of SMI-S APIs.
  • SMAPI Operations: Discovery, Provisioning, Replication, Monitoring, Pass-thru layer
  • Demo of storage discovery and mapping with Virtual Machine Manager 2012 SP1. Using Microsoft iSCSI Target!

Linux Filesystems: Details on Recent Developments in Linux Filesystems and Storage by Chris Mason, Fusion-io

  • Many journaled file systems introduced in Linux 2.4.x in the early 2000s.
  • Linux 2.6.x. Source control at last. Kernel development moved more rapidly. Especially after Git.
  • Backporting to Enterprise. Enterprise kernels are 2-3 years behind mainline. Some distros more than others.
  • Why are there so many filesystems? Why not pick one? Because it's easy and people need specific things.
  • Where Linux is now. Ext4, XFS (great for large files). Btrfs (snapshots, online maintenance). Device Mapper.
  • Where Linux is now. CF (Compact Flash). Block. SCSI (4K, unmap, trim, t10 pi, multipath, Cgroups).
  • NFS. Still THE filesystem for Linux. Revisions introduce new features and complexity. Interoperable.
  • Futures. Atomic writes. Copy offload (block range cloning or new token based standard). Shingled drives (hybrid)
  • Futures. Hinting (tiers, connect blocks, IO priorities). Flash (seems appropriate to end here :-)

Non-volatile Memory in the Storage Hierarchy: Opportunities and Challenges by Dhruva Chakrabarti, HP

  • Will cover a few technologies coming in the near future. From disks to flash and beyond...
  • Flash is a huge leap, but NVRAM presents even bigger opportunities.
  • Comparing density/retention/endurance/latency/cost for HDD/SSD (NAND flash)/DRAM/NVRAM
  • Talking SCM (Storage Class Memory). Access choices: block interface or byte-addressable model.
  • Architectural model for NVRAM. Coexist with DRAM. Buffers/caches still there. Updates may linger...
  • Failure models. Fail-stop. Byzantine. Arbitrary state corruption. Memory protection.
  • Store to memory must be failure-atomic.
  • NVRAM challenges. Keep persistent data consistent. Programming complexity. Models require flexibility.
  • Visibility ordering requirements. Crash can lead to pointers to uninitialized memory, wild pointers.
  • Potential inconsistencies like persistent memory leaks. There are analogs in multi-threading.
  • Insert a cache line flush to ensure visibility in NVRAM. Reminiscent of a disk cache flush.
  • Many flavors of cache flushes. Intended semantics must be honored. CPU instruction or API?
  • Fence-based programming has not been well accepted. Higher level abstractions? Wrap in transactions?
  • Conclusion. What is the right API for persistent memory? How much effort? What's the implementation cost?

Building Next Generation Cloud Networks for Big Data Applications by Jayshree Ullal, Arista Networks

  • Agenda: Big Data Trends, Data Analytics, Hadoop.
  • 64-bit CPUs trends, Data storage trends. Moore's law is alive and well.
  • Memory hierarchy is not changing. Hard drives not keeping up, but Flash...
  • Moore's law for Big Data, Digital data doubling every 2 years. DAS/NAS/SAN not keeping up.
  • Variety of data. Raw, unstructured. Not enough minds around to deal with all the issues here.
  • Hadoop means the return of DAS. Racks of servers, DAS, flash cache, non-blocking fabric.
  • Hadoop. 3 copies of the data, one in another rack. Protect your main node, single point of failure. (A rough placement sketch follows this list.)
  • Hadoop. Minimum 10Gb. Shift from north-south communications to east-west. Servers talking to each other.
  • From mainframe, to client/server, to Hadoop clusters.
  • Hadoop pitfalls. Not a layer 2 thing. Highly redundant, many paths, routing. Rack locality. Data integrity
  • Hadoop. File transfers in chunks and blocks. Pipelines replication east-west. Map and Reduce.
  • Showing sample 2-rack solution. East-west interconnect is very important. Non-blocking. Buffering.
  • Sample conf. 4000 nodes. 48 servers per cabinet. High speed network backbone. Fault tolerant main node
  • Automating cluster provisioning. Script using DHCP for zero touch provisioning.
  • Buffer challenges. Dynamic allocations, survive micro bursts.
  • Advanced diagnostics and management. Visibility to the queue depth and buffering. Graph historical latency.
  • my power is running out. I gotta speak fast. :-)
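To make the replica-placement comment above concrete, here is a rough Python sketch of the rack-aware policy HDFS is known for (first copy on the writer's node, second in a different rack, third on another node in that remote rack). The topology structure and helper names are mine, purely for illustration.

```python
import random

def place_replicas(writer_node, topology, replication=3):
    """Rough sketch of HDFS-style rack-aware placement.

    topology: dict mapping rack name -> list of node names.
    Returns the nodes chosen for a block: first on the writer's node,
    second on a node in a different rack, third on another node in
    that same remote rack.
    """
    # Find the writer's rack.
    local_rack = next(r for r, nodes in topology.items() if writer_node in nodes)
    replicas = [writer_node]

    # Second replica: any node in a different rack (protects against rack loss).
    remote_rack = random.choice([r for r in topology if r != local_rack])
    replicas.append(random.choice(topology[remote_rack]))

    # Third replica: another node in the same remote rack, which keeps the
    # east-west replication pipeline within one rack hop.
    candidates = [n for n in topology[remote_rack] if n not in replicas]
    if candidates and replication >= 3:
        replicas.append(random.choice(candidates))

    return replicas[:replication]

topology = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5", "n6"]}
print(place_replicas("n1", topology))   # e.g. ['n1', 'n5', 'n4']
```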

Windows File and Storage Directions by Surendra Verma, Microsoft

  • Landscape: pooled resources, self-service, elasticity, virtualization, usage-based, highly available
  • Industry-standard parts to build very high scale, performing systems. Greater number of less reliable parts.
  • Services influencing hardware. New technologies to address specific needs. Example: Hadoop.
  • OS storage built to address specific needs. Changing that requires significant effort.
  • You have to assume that disks and other parts will fail. Need to address that in software.
  • If you have 1000 disks in a system, some are always failing, you're always reconstructing.
  • ReFS: new file system in Windows 8, assumes that everything is unreliable underneath.
  • Other relevant features in Windows Server 2012: Storage Spaces, Clustered Shared Volumes, SMB Direct.
  • Storage Spaces provides resiliency to media failures. Mirror (2 or 3 way), parity, hot spares.
  • Shared Storage Spaces. Resiliency to node and path failures using shared SAS disks.
  • Storage Spaces is aware of enclosures, can tolerate failure of an entire enclosure.
  • ReFS provides resiliency to media failures. Never write metadata in place. Integrity streams checksum.
  • Integrity Streams. User data checksum, validated on every read. Uses Storage Spaces to find a good copy. (A toy sketch of this read path follows this list.)
  • Your own application can use an API to talk to Storage Spaces, find all copies of the data, correct things.
  • Resiliency to latent media errors. Proactively detect and correct, keeping redundancy levels intact.
  • ReFS can detect/correct corrupted data even for data not frequently read. Do it on a regular basis.
  • What if all copies are lost? ReFS will keep the volume online, you can still read what's not corrupted.
  • Example configuration with 4 Windows Server 2012 nodes connected to multiple JBODs.
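A toy sketch of the integrity-stream idea referenced in the bullets above: a checksum validated on every read, with a fall-back to (and repair from) another mirror copy when validation fails. This is my own illustration, not the actual ReFS/Storage Spaces implementation.

```python
import zlib

class MirroredBlock:
    """Toy model of a checksummed block stored on a 2-way mirror."""

    def __init__(self, data: bytes):
        self.checksum = zlib.crc32(data)
        self.copies = [bytearray(data), bytearray(data)]  # two mirror copies

    def read(self) -> bytes:
        # Validate the checksum on every read; if a copy is corrupt,
        # return a good copy and repair the bad one instead of failing.
        for i, copy in enumerate(self.copies):
            if zlib.crc32(bytes(copy)) == self.checksum:
                for j, other in enumerate(self.copies):
                    if j != i and zlib.crc32(bytes(other)) != self.checksum:
                        self.copies[j] = bytearray(copy)   # repair from good copy
                return bytes(copy)
        raise IOError("all copies failed checksum validation")

blk = MirroredBlock(b"hello metadata")
blk.copies[0][0] ^= 0xFF                  # corrupt the first copy
assert blk.read() == b"hello metadata"    # read still succeeds and repairs
```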

Hyper-V Storage Performance and Scaling with Joe Dai & Liang Yang, Microsoft

Joe Dai:

  • New option in Windows Server 2012: Virtual Fibre Channel. FC to the guest. Uses NPIV. Live migration just works.
  • New in WS2012: SMB 3.0 support in Hyper-V. Enables Shared Nothing Live Migration, Cross-cluster Live Migration.
  • New in WS 2012: Storage Spaces. Pools, Spaces. Thin provisioning. Resiliency.
  • Clustered PCI RAID. Host hardware RAID in a cluster setup.
  • Improved VHD format used by Hyper-V. VHDX. Format specification at http://www.microsoft.com/en-us/download/details.aspx?id=29681 Currently v0.95. 1.0 soon
  • VHDX: Up to 64TB. Internal log for resiliency. MB aligned. Larger blocks for better perf. Custom metadata support.
  • Comparing performance. Pass thru, fixed, dynamic, differencing. VHDX dynamic ~= VHD fixed ~= physical disk.
  • Offloaded Data Transfers (ODX). Reduces times to merge, mirror and create VHD/VHDX. Also works for IO inside the VM.
  • Hyper-V support for UNMAP. Supported on VHDX, Pass-thru. Supported on VHDX Virtual SCSI, Virtual FC, Virtual IDE.
  • UNMAP in Windows Server 2012 can flow from virtual IDE in VM to VHDX to SMB share to block storage behind share.

Liang Yang:

  • My job is to find storage bottlenecks in Hyper-V storage and hand over to Joe to fix them. :-)
  • Finding scale limits in Hyper-V synthetic SCSI IO path in WS2008R2. 1 VSP thread, 1 VMBus channel per VM, 256 queue depth per
  • WS2012: From 4 VPs per VM to 64 VP per VM. Multi-threaded IO model. 1 channel per 16 VPs. Breaks 1 million IOPs.
  • Huge performance jump in WS2012 Hyper-V. Really close to physical even with high performance storage.
  • Hyper-V Multichannel (not to be confused with SMB Multichannel) enables the jump on performance.
  • Built 1 million IOPs setup for about $10K (excluding server) using SSDs. Demo using IOmeter. Over 1.22M IOPs...

The Virtual Desktop Infrastructure Storage Behaviors and Requirements with Spencer Shepler, Microsoft

  • Storage for Hyper-V in Windows Server 2012: VHDX, NTFS, CSV, SMB 3.0.
  • Review of SMB 3.0 advantages for Hyper-V: active recovery, Multichannel, RDMA.
  • Showing results for SMB Multichannel with four traditional 10GbE. Line rate with 64KB IOs. CPU bound with 8KB.
  • Files used by Hyper-V. XML, BIN, CSV, VHD, VHDX, AVHDX. Gold, diff and snapshot disk relationships.
  • improvements in VHDX. Up to 64TB size. 4KB logical sector size, 1MB alignment for allocations. UNMAP. TRIM.
  • VDI: Personal desktops vs. Pooled desktops. Pros and cons.
  • Test environment. WS2012 servers. Win7 desktops. Login VSI http://www.loginvsi.com - 48 10K rpm HDD.
  • Workload. Copy, word, print pdf, find/replace, zip, outlook e-mail, ppt, browsing, freemind. Realistic!
  • Login VSI fairly complex to setup. Login frequency 30 seconds. Workload started "randomly" after login.
  • Example output from Login VSI. Showing VSI Max.
  • Reading of BIN file during VM restore is sequential. IO size varies.
  • Gold VHDX activity. 77GB over 1 hour. Only reads, 512 bytes to 1MB size IOs. 25KB average. 88% are <=32KB
  • Distribution for all IO. Reads are 90% 64KB or less. Writes mostly 20KB or less.
  • AVHD activity 1/10 read to write ratio. Flush/write is 1/10. Range 512 bytes to 1MB. 90% are 64KB or less.
  • At the end of test run for 1 hour with 85 desktops. 2000 IOPs from all 85 VMs, 2:1 read/write ratio.

SQL Server: Understanding the Data Workload by Gunter Zink, Microsoft (original title did not fit a tweet)

  • Looking at OLTP and data warehousing workloads. What's new in SQL Server 2012.
  • Understanding SQL Server. Store and retrieve structured data, Relational, ACID, using schema.
  • Data organized in tables. Tables have columns. Tables stored in 8KB pages. Page size fixed, not configurable.
  • SQL Server Datafile. Header, GAM page (bitmap for 4GB of pages), 4GB of pages, GAM page, 4GB of pages, etc...
  • SQL Server file space allocated in extents. An extent is 8 pages or 64KB. Parameter for larger extent size. (See the quick arithmetic after this list.)
  • SQL Server log file: Header, log records (512 bytes to 60KB). Checkpoint markers. Truncated after backup.
  • If your storage reports 4KB sector size, minimum log write for SQL Server is 4KB. Records are padded.
  • 2/3 of SQL Servers run OLTP workloads. Many active users, lightweight transactions.
  • Going over what happens when you run OLTP. Read cache or read disk, write log to disk and mark page as dirty
  • Log buffer. Circular buffer, no fixed size. One buffer written to disk, another being filled with changes.
  • If storage is not fast enough, writing log takes longer and buffer changes grows larger.
  • Lazy writer. Writes dirty pages to disk (memory pressure). Checkpoint: Writes pages, marks log file (time limit)
  • Checkpoint modes: Automatic, Indirect, Manual. Write rate reduced if latency reaches 20ms (can be configured)
  • Automatic SQL Checkpoint. Write intensity controlled by recovery interval. Default is 0 = every two minutes.
  • New in SQL Server 2012. Target_Recovery_Time. Makes checkpoint less spiky by constantly writing dirty pages.
  • SQL Server log file. Change records in sequence. Mostly just writes. Except in recovery or transaction rollback.
  • Data file IO. 8KB random reads, buffered (based on number of user queries). Can be done in 64KB at SQL start up.
  • Log file IO: unbuffered small sequential writes (depends on how many inserts/updates/deletes).
  • About 80% of SQL Server performance problems are storage performance problems. Not enough spindles or memory.
  • SQL Server problems. 20ms threshold too high for SSDs. Use -k parameter to limit (specified in MB/sec)
  • Issues. Checkpoint floods array cache (20ms). Cache de-staging hurts log drive write performance.
  • Log writes must go to disk, no buffering. Data writes can be buffered, since it can recover from the log.
  • SQL Server and Tiered Storage. We probably won't read what we've just written.
  • Data warehouse. Read large amounts of data, mostly no index, table scans. Hourly or daily updates (from OLTP).
  • Understanding a data warehouse query. Lots of large reads. Table scans and range scans. Reads: 64KB up to 512KB.
  • DW. Uses TempDB to handle intermediate results, sort. Mostly 64KB writes, 8KB reads. SSDs are good for this.
  • DW common problems: Not enough IO bandwidth. 2P server can ingest 10Gbytes/sec. Careful with TP, pooled LUNs.
  • DW common problems. Arrays don't read from multiple mirror copies.
  • SMB file server and SQL Server. Limited support in SQL Server 2008 R2. Fully supported with SQL Server 2012.
  • I got my fastest data warehouse performance using SMB 3.0 with RDMA. Also simpler to manage.
  • Comparing steps to update SQL Server with Fibre Channel and SMB 3.0 (many more steps using FC).
  • SQL Server - FC vs. SMB 3.0 connectivity cost comparison. Comparing $/MB/sec with 1GbE, 10GbE, QDR IB, 8G FC.
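A quick back-of-the-envelope check of the on-disk numbers quoted above (8KB pages, 8-page extents, a GAM bitmap covering roughly 4GB, and log records padded to the reported sector size). The padding helper is only an illustration of the idea, not SQL Server code.

```python
PAGE = 8 * 1024                 # fixed 8KB page
EXTENT = 8 * PAGE               # an extent is 8 pages = 64KB
print(EXTENT)                   # 65536

# A GAM page is roughly a bitmap with one bit per extent; 8KB of bits covers
# 8 * 8192 = 65536 extents, i.e. about 4GB of data file per GAM interval.
gam_interval = 8 * PAGE * EXTENT
print(gam_interval / 2**30)     # ~4.0 (GB)

def padded_log_write(record_bytes: int, sector_size: int = 4096) -> int:
    """If storage reports 4KB sectors, a small log record is padded to 4KB."""
    sectors = -(-record_bytes // sector_size)   # ceiling division
    return sectors * sector_size

print(padded_log_write(700))    # 4096: a 700-byte record still costs a full 4KB write
```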

The Future of Protocol and SMB2/3 Analysis with Paul Long, Microsoft

  • We'll talk about Message Analyzer. David is helping.
  • Protocol Engineering Framework
  • Like Network Monitor. Modern message analysis tool built on the Protocol Engineering Framework
  • Source for Message Analyzer can be network packets, ETW events, text logs, other sources. Can validate messages.
  • Browse for message sources, Select a subset of messages, View using a viewer like a grid..
  • New way of viewing starting from the top down, instead of the bottom up in NetMon.
  • Unlike NetMon, you can group by any field or message property. Also payload rendering (like JPG)
  • Switching to demo mode...
  • Guidance shipped online. Starting with the "Capture/Trace" option.
  • Trace scenarios: NDIS, Firewall, Web Proxy, LAN, WLAN, Wi-Fi. Trace filter as well.
  • Doing a link layer capture (just like old NetMon). Start capture. Generate some web traffic.
  • Stop the trace. Group by module. Look at all protocols. Like HTTP. Drill in to see operations.
  • Looking at operations. HTTP GET. Look at the details. High level stack view.
  • Now grouping on both protocol and content type. Easily spots pictures over HTTP. Image preview.
  • Easier to see time elapsed per operation when you group messages. You dig to individual messages
  • Now looking at SMB trace. Trace of a file copy. Group on the file name (search for the property)
  • Now grouped on SMB.Filename. You can see all SMB operations to copy a specific file.
  • Now looking at a trace of SMB file copy to an encrypted file share.
  • Built-in traces to capture from the client side or server side. Can do full PDU or header only.
  • This can also be used to capture SMB Direct data, using the SMB client trace.
  • Showing the trace now with both network traffic and SMB client trace data (unencrypted).
  • Want to associate the wire capture with the SMB client ETW trace? Use the message ID
  • Showing mix of firewall trace and SMB client ETW trace. You see it both encrypted and not.
  • SMB team at Microsoft is the first to add native protocol unit tracing. Very useful...
  • Most providers have ETW debug logging but not the actual messages.
  • You can also get the trace with just NetSh or LogMan and load the trace in the tool later.
  • We also can deploy the tool and use PowerShell to start/stop capture.
  • If the event provider offers them, you can specify level and keywords during the capture.
  • Add some files (log file and wireshark trace). Narrow down the time. Add selection filter.
  • Mixing wireshark trace with a Samba text log file (pattern matching text log).
  • Audience: As a Samba hacker, Message Analyzer is one of the most interesting tools I have seen!
  • Jaws are dropping as Paul demos analyzing a trace from WireShark + Samba taken on Linux.
  • Next demo: visualizations. Two separate file copies. Showing summary view for SMB reads/writes
  • Looking at a graph of bytes/second for SMB reads and writes. Zooming into a specific time.
  • From any viewer you should be able to do any kind of selection and then launch another viewer.
  • If you're a developer, you can create a very sophisticated viewer.
  • Next demo: showing the protocol dashboard viewer. Charts with protocol bars. Drills into HTTP.

Storage Systems for Shingled Disks, with Garth Gibson, Panasas

  • Talking about disk technology. Reaction of HDD to what's going on with SSDs.
  • Kryder's law for magnetic disks. Expectation is that disks will cost next to nothing.
  • High capacity disk. As bits get smaller, the bit might not hold its orientation 10 years later.
  • Heat assisted to make it possible to write, then keep it longer when cold. Need to aim that laser precisely...
  • New technology. Shingled writing. Write head is wider than read head. Density defined by read head, not write head.
  • As you write, you overwrite a portion of what you wrote before, but you can still read it.
  • Shingled can be done with today's heads with minor changes, no need to wait for heat assisted technology.
  • Shingled disks. Large sequential writes. Disks become tape!!
  • Hard to see just the one bit. Safe plan is to see the bit from slightly different angles and use signal processing.
  • If aiming at 3x the density: cross talk. Signal processing using two dimensions (TDMR). 3-5 revs to read a track.
  • Shingled disks. Initial multiplier will be a factor of 2. Seek 10nm instead of 30 nm. Wider band with sharp edges.
  • Write head edge needs to be sharp on one side, where the tracks will overlap. Looking at different widths.
  • Areal density favors large bands that overlap. Looking at some math that proves this.
  • You could have a special place in the disk with no shingles for good random write performance, mixed with shingled.
  • Lots of question on shingled disks. How to handle performance, errors, etc.
  • Shingled disks. Same problem for Flash. Shingled disks - same algorithms as Flash.
  • Modify software to avoid or minimize read, modify, write. Log structured file systems are 20 years old.
  • Key idea is that disk attribute says "sequential writing". T13 and T10 standards.
  • Shingled disks. Hadoop as initial target. Project with mix of shingled and unshingled disks. Could also be SSD+HDD.
  • Prototype banded disk API. Write forward or move back to 0. Showing test results with new file system. (A toy band model follows this list.)
  • Future work. Move beyond Hadoop to general workloads; hurts with lots of small files. Large files OK.
  • Future work. Pack metadata. All of the metadata into tables, backed on disk by large blob of changes.
  • Summary of status. Appropriate for Big Data. One file = one band. Hadoop is write once. Next steps: pack metadata.
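The "write forward or move back to 0" band behavior described above maps naturally onto an append-only abstraction. The toy model below is my own sketch, not the prototype banded-disk API from the talk.

```python
class ShingledBand:
    """Toy model of one shingled band: random reads, append-only writes.

    Overlapping (shingled) tracks mean a write in the middle would clobber
    the following tracks, so the only legal write positions are the current
    write pointer ("write forward") or offset 0 after a reset.
    """

    def __init__(self, size: int):
        self.size = size
        self.data = bytearray(size)
        self.write_pointer = 0

    def read(self, offset: int, length: int) -> bytes:
        return bytes(self.data[offset:offset + length])   # reads are unrestricted

    def append(self, payload: bytes) -> int:
        if self.write_pointer + len(payload) > self.size:
            raise IOError("band full; reset it or pick another band")
        offset = self.write_pointer
        self.data[offset:offset + len(payload)] = payload
        self.write_pointer += len(payload)
        return offset

    def reset(self):
        """Move the write pointer back to 0, logically discarding the band."""
        self.write_pointer = 0

band = ShingledBand(1 << 20)
off = band.append(b"large sequential write")
print(band.read(off, 5))   # b'large'
```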

The Big Deal of Big Data to Big Storage with Benjamin Woo, Neuralytix

  • Can't project to both screens because laptop does not have VGA. Ah, technology... Will use just right screen.
  • Even Batman is into big data. ?!
  • What's the big picture for big data. Eye chart with lots of companies, grouped into areas...
  • We have a problem with storage/data processing today. Way too many hops. (comparing to airline routes ?!)
  • Sample path: Oracle to Informatica to MicroStrategy and Hadoop. Bring them together. Single copy of "the truth".
  • Eliminate the process of ETL. Eliminate the need for exports. Help customers to find stuff in the single copy.
  • You are developers. You need to find a solution for this problem. Do you buy into this?
  • Multiple copies OK for redundancy or performance, but shouldn't it all be same source of truth?
  • Single copy of the truth better for discovery. Don't sample, don't summarize. You will find more than you expect.
  • We're always thinking about the infrastructure. Remove yourself from the hardware and think about the data!
  • The challenge is how to think about the data. Storage developers can map that to the hardware.
  • Send complaints to /dev/null. Tweet at @BenWooNY
  • Should we drop RDBMS altogether? Should we add more metadata to them? Maybe.
  • Our abstractions are already far removed from the hardware. Think virtual disks in VM to file system to SAN array.
  • Software Defined Storage is something we've been doing for years in silicon.
  • Remember what we're here for. It's about the data. Otherwise there is no point in doing storage.
  • Is there more complexity in having a single copy of the truth? Yes, but that is part of what we do! We thrive there!
  • Think about Hadoop. They take on all the complexity and use dumb hardware. That's how they create value!

Unified Storage for the Private Cloud with Dennis Chapman, NetApp

  • 10th anniversary of SMI-S. Also 10th anniversary of pirate day. Arghhh...
  • application silos to virtualization to private clouds (plus public and hybrid clouds)
  • Focusing on the network. Fundamentally clients talking to storage in some way...
  • storage choices for physical servers. Local (DAS) and remote (FC, iSCSI, SMB). Local for OS, remote for data.
  • Linux pretty much the same as Windows. Difference is NFS instead of SMB. Talking storage affinities.
  • Windows OS. Limited booting from iSCSI and FC. Mostly local.
  • Windows. Data mostly on FC and iSCSI, SMB still limited (NFS more well established on Linux).
  • Shifting to virtualized workloads on Windows. Options for local and remote. More choices, storage to the guest.
  • Virtualized workloads are the #1 configuration we provide storage for.
  • Looking at options for Windows and Linux guests, hosted on both VMware and Hyper-V hosts. Table shows options
  • FC to the guest. Primary on Linux, secondary on Windows. Jose: FC to the guest new in WS2012.
  • File storage (NFS) primary on Linux, but secondary on Windows (SMB). Jose: again, SMB support new in WS2012.
  • iSCSI secondary for Linux guest, but primary for Windows guests.
  • SMB still limited right now, expect it to grow. Interested on how it will play, maybe as high as NFS on Linux
  • Distributed workload state. Workload domain, hypervisors domain, storage domain.
  • Guest point in time consistency. Crash consistency or application consistency. OS easier, applications harder
  • Hibernation consistency. Put the guest to sleep and snapshot. Works well for Windows VMs. Costs time.
  • Application consistency. Specific APIs. VSS for Windows. I love this! Including remote VSS for SMB shares.
  • Application consistency for Linux. Missing VSS. We have to do specific things to make it work. Not easy.
  • hypervisors PIT consistency. VMware, cluster file system VMFS. Can store files on NFS as well.
  • Hypervisors PIT for Hyper-V. Similar choices with VHD on CSV. Also now option for SMB in WS2012.
  • Affinities and consistency. Workload domain, Hypervisors domain and Storage domain backups. Choices.
  • VSS is the major difference between Windows and Linux in terms of backup and consistency.
  • Moving to the Storage domain. Data ONTAP 8 Clustering. Showing 6-node filer cluster diagram.
  • NetApp Vservers own a set of FlexVols, which contain the storage objects (either LUN or file).
  • Sample workflow with NetApp with remote SMB storage. Using remote VSS to create a backup using clones.
  • Sample workflow. App consistent backup from a guest using an iSCSI LUN.
  • Showing eye charts with integration with VMware and Microsoft.
  • Talking up the use of PowerShell, SMB when integrating with Microsoft.
  • Talk multiple protocols, rich services, deep management integration, highly available and reliable.

SNIA SSSI PCIe SSD Round Table. Moderator + four members.

  • Introductions, overview of SSSI PCIe task force and committee.
  • 62 companies in the last conference. Presentations available for download. http://www.snia.org/forums/sssi/pcie
  • Covering standards, sites and tools available from the group. See link posted
  • At first glance PCIe SSDs look just like other drives, but there are differences. Bandwidth is one of them.
  • Looking at random 4KB write IOPs and response time for different types of disks: HDD, MLC, SLC, PCIe.
  • Different SSD tech offer similar response rates. Some high latencies due to garbage collection.
  • comparing now DRAM, PCIe, SAS and SATA. Lower latencies in first two.
  • Comparing CPU utilization. From less than 10% to over 50%. What CPU utilization to achieve IOPs...
  • Other system factors. Looking at CPU affinity effect on random 4KB writes... Wide variation.
  • Performance measurement. Response time is key when testing PCIe SSDs. Power mgmt? Heat mgmt? Protocol effect on perf?
  • Extending the SCSI platform for performance. SCSI is everywhere in storage.
  • Looking at server attached SSDs and how much is SATA, SAS, PCIe, boot drive. Power envelope is a consideration.
  • SCSI is everywhere. SCSI Express protocol for standard path to PCIe. SoP (SCSI over PCIe). Hardware and software.
  • SCSI Express: Controllers, Drive/Device, Drivers. Express bay connector. 25 watts of power.
  • Future: 12Gbps SAS in volume at the end of 2013. Extended copy feature. 25W devices. Atomic writes. Hinting. SCSI Express.
  • SAS controllers > 1 million IOPs and increased power for SAS. Reduces PCIe SSD differentiation. New form factors?
  • Flash drives: block storage or memory.
  • Block versus Memory access. Storage SSDs, PCIe SSDs, memory class SCM compared in a block diagram. Looking at app performance
  • optimization required for apps to realize the memory class benefits. Looking at ways to address this.
  • Open industry directions. Make all storage look like SCSI or offer apps other access models for storage?
  • Mapping NVMExpress capability to SCSI commands. User-level abstractions. Enabling SCM by making it easy.
  • Panel done with introductions. Moving to questions.
  • How is Linux support for this? NVMExpress driver is all that exists now.
  • How much of the latency is owned by the host and the PCIe device? Difficult to answer. Hardware, transport, driver.
  • Comparing to DRAM was excellent. That was very helpful.
  • How are form factors moving forward? 2.5" HDD format will be around for a long time. Serviceability.
  • Memory like access semantics - advantages over SSDs. Lower overhead, lots in the hardware.
  • Difference between NVMe and SOP/PQI? Capabilities voted down due to complexity.
  • What are the abstractions like? Something like a file? NVMe has a namespace. Atomic write is a good example. How to overlay?
  • It's easy to just use malloc, cut out the block layer and run with memory. However, how do you transition?

NAS Management using System Center 2012 Virtual Machine Manager and SMI-S with Alex Naparu and Madhu Jujare

  • VMM for Management of Virtualized Infrastructure: VMM 2012 SP1 covers block storage and SMB3 shares
  • Lots of SMB 3.0 sessions here at SDC...
  • VMM offers to manage your infrastructure. We'll be focusing on storage. Lots enabled by Windows Server 2012.
  • There's an entire layer in Windows Server 2012 dedicated to manage storage. Includes translation of WMI to SMI-S
  • All of this can be leveraged using PowerShell.
  • VMM NAS Management: Discovery (Servers, Systems, Shares), Creation/Removal (Systems, Shares), Share Permissions
  • How did we get there? With a lot of help from our partners. Kick-off with EMC and NetApp. More soon. Plugfests.
  • Pre-release providers. If you have any questions on the availability of providers, please ask EMC and NetApp.
  • Moving now into demo mode. Select provider type. Specify discovery scope. Provide credentials. Discovering...
  • Discovered some block storage and file storage. Some providers expose one of them, some expose both.
  • Looking at all the pools and all the shares. Shallow discovery at first. After selection, we do deep discovery.
  • Each pool is given a tag, called classification. Tagged some as Gold, some as Platinum. Finishing discovery.
  • Deep discovery completed. Looking at the Storage tree in VMM, with arrays, pools, LUNs, file shares.
  • Now using VMM to create a file share. Provide a name, description, file server, storage pool and size.
  • Creates a logical disk in the pool, format with a file system, then create a file share. All automated.
  • Now going to a Hyper-V host, add a file share to the host using VMM. Sets appropriate permissions for the share.
  • VMM also checks the file access is good from that host.
  • Now let's see how that works for Windows. Add a provider, abstracted. Using WMI, not SMI-S. Need credentials.
  • Again, shows all shares, select for deep discovery. Full management available after that.
  • Now we can assign Windows file share to the host, ACLs are set. Create a share. All very much the same as NAS.
  • VMM also verifies the right permissions are set. VMM can also repair permission to the share if necessary.
  • Now using VMM to create a new VM on the Windows SMB 3.0 file share. Same as NAS device with SMB 3.0.
  • SMI-S support. Basic operations supported on SMI-S 1.4 and later. ACL management requires SMI-S 1.6.
  • SMI-S 1.4 profiles: File server, file share, file system discovery, file share creation, file share removal.
  • Listing profiles that are required for SMI-S support with VMM. Partial list: NAS Head, File System, File Export
  • SMI-S defines a number of namespaces. "Interop" namespace required. Associations are critical.
  • Details on Discovery. namespaces, protocol support. Filter to get only SMB 3.0 shares.
  • Discovery of File Systems. Reside on logical disks. That's the tie from file storage to block storage.
  • Different vendors have different ways to handle File Systems. Creating a new one is not trivial. Another profile.
  • VMM creates the file system and file share in one step. Root of FS is the share. Keeping things simple.
  • Permissions management. Integrated with Active Directory. Shares "registered" with Hyper-V host. VMM adds ACLs.
  • Demo of VMM specific PowerShell walking the hierarchy from the array to the share and back.
  • For VMM, NAS device and SMI-S must be integrated with Active Directory. Simple Identity Management Subprofile.
  • CIM Passthrough API. WMI provider can be leveraged via code or PowerShell.

SMB 3, Hyper-V and ONTAP, Garrett Mueller, NetApp

  • Senior Engineer at NetApp focused on CIFS/SMB.
  • What we've done with over 30 developers: features, content for Windows Server 2012. SMB3, Witness, others.
  • Data ONTAP cluster-mode architecture. HA pairs with high speed interconnect. disk "blade" in each node.
  • Single SMB server spread across multiple nodes in the cluster. Each an SMB server with same configuration
  • Each instance of the SMB server in a node has access to the volumes.
  • Non-disruptive operations. Volume move (SMB1+). Logical Interface move (SMB2+). Move node/aggregate (SMB3).
  • We did not have a method to preserve the locks between nodes. That was disruptive before SMB3.
  • SMB 3 and Persistent Handles. Showing two nodes and how you can move a persistent SMB 3 handle.
  • Witness can be used in lots of different ways. Completely separate protocol. NetApp scoped it to an HA pair.
  • Diagram explaining how NetApp uses Witness protocol with SMB3 to discover, monitor, report failure.
  • Remote VSS. VSS is Microsoft's solution for app consistent snapshot. You need to back up your shares!
  • NetApp implemented a provider for Remote VSS for SMB shares using the documented protocol. Showing workflow.
  • All VMs within a share are SIS cloned. SnapManager does backup. After done, temp SIS clones are removed.
  • Can a fault occur during a backup? If there is a failure, the backup will fail. Not protected in that way.
  • Offloaded Data Transfer (ODX). Intra-volume: SIS clones. Inter-volume/inter-node: back-end copy engine.
  • ODX: The real benefit is in the fact that it's used by default in Windows Server 2012. It just works!
  • ODX implications for Hyper-V over SMB: Rapid provisioning, rapid storage migrations, even disk within a VM.
  • Hyper-V over SMB. Putting it all together. Non-disruptive operations, Witness, Remote VSS, ODX.
  • No NetApp support for SMB Multichannel or SMB Direct (RDMA) with SMB 3.

Design and Implementation of SMB Locking in a Clustered File System with Aravind Velamur Srinivasan, EMC - Isilon

  • Part of SMB team at EMC/Isilon. Talk agenda covers OneFS and its distributed locking mechanism.
  • Overview of OneFS. NAS file server, scalable, 8x mirror, +4 parity. 3 to 144 nodes, using commodity hardware.
  • Locking: avoid multiple writers to the same file. Potentially in different file server nodes.
  • DLM challenges: Performance, multiple protocols and requirements. Expose appropriate APIs.
  • Diagram explaining the goals and mechanism of the Distributed Locking Manager (DLM) in Isilon's OneFS.
  • Going over requirements of the DLM. Long list...

Scaling Storage to the Cloud and Beyond with Ceph with Sage Weil, Inktank

  • Trying to catch up with ongoing talk on ceph. Sage Weil talks really fast and uses dense slides...
  • Covering RADOS block device being used by virtualization, shared storage. http://ceph.com/category/rados/
  • Covering ceph-fs. Metadata and data paths. Metadata server components. Combined with the object store for data.
  • Legacy metadata storage: bad. Ceph-fs metadata does not use block lists or inode tables. Inode in directory.
  • Dynamic subtree partitioning very scalable. Hundreds of metadata servers. Adaptive. Preserves locality.
  • Challenge dealing with metadata IO. Use metadata server as cache, prefetch dir/inode. Large journal or log.
  • What is journaled? Lots of state. Sessions, metadata changes. Lazy flush.
  • Client protocol highly stateful. Metadata servers, direct access to OSDs.
  • Explaining the ceph-fs workflow using ceph-mon, ceph-mds, ceph-osd.
  • Snapshots. Volume and subvolume unusable at petabyte scale. Snapshot arbitrary directory
  • client implementations. Linux kernel client. Use Samba to reexport as CIFS. Also NFS and Hadoop.
  • Current status of the project: most components: status=awesome. Ceph-fs nearly awesome :-)
  • Why do it? Limited options for scalable open source storage. Proprietary solutions expensive.
  • What to do with hard links? They are rare. Using auxiliary table, a little more expensive, but works.
  • How do you deal with running out of space? You don't. Make sure utilization on nodes balanced. Add nodes.

Introduction to the last day

  • Big Data is like crude oil, it needs a lot of refining and filtering...
  • Growing from 2.75 Zettabytes in 2012 to 8 ZB in 2015. Nice infographic showing projected growth...

The Evolving Apache Hadoop Eco System - What It Means for Big Data and Storage Developers, Sanjay Radia, Hortonworks

  • One of the surprising things about Hadoop is that it does not use RAID on the disks. It does surprise people.
  • Data is growing. Lots of companies developing custom solutions since nothing commercial could handle the volume.
  • web logs with terabytes of data. Video data is huge, sensors. Big Data = transactions + interactions + observations.
  • Hadoop is commodity servers, jbod disk, horizontal scaling. Scale from small to clusters of thousands of servers..
  • Large table with use cases for Hadoop. Retail, intelligence, finance, ...
  • Going over classic processes with ETL, BI, Analytics. A single system cannot process huge amounts of data.
  • Big change is introducing a "big data refinery". But you need a platform that scales. That's why we need Hadoop.
  • Hadoop can use a SQL engine, or you can do key-value store, NoSQL. Big diagram with Enterprise data architecture.
  • Hadoop offers a lot of tools. Flexible metadata services across tools. Helps with the integration, format changes.
  • Moving to Hadoop and Storage. Looking at diagram showing racks, servers, 6k nodes, 120PB. Fault tolerant, disk or node
  • manageability. One operator managing 3000 nodes! Same boxes do both storage and computation.
  • Hadoop uses very high bandwidth. Ethernet or InfiniBand. Commonly uses 40GbE.
  • Namespace layer and Block storage layer. Block pool is a set of blocks, like a LUN. Dir/file abstraction on namespace.
  • Data is normally accessed locally, but can pull from any other servers. Deals with failures automatically.
  • Looking at HDFS. Goes back to 1978 paper on separating data from function in a DFS. Lustre, Google, pNFS.
  • I attribute the use of commodity hardware and replication to the GoogleFS. Circa 2003. Non-posix semantics.
  • Computation close to data is an old model. Map Reduce.
  • Significance of not using disk RAID. Replication factor of Hadoop is 3. Node can be fixed when convenient.
  • HDFS recovers at a rate of 12GB in minutes, done in parallel. Even faster for larger clusters. Recovers automatically.
  • Clearly there is an overhead. It's 3x instead of much less for RAID. Used only for some of the data. (See the capacity comparison after this list.)
  • Generic storage service opportunities for innovation. Federation, partitioned namespace, independent block pools.
  • Archival data. Where should it sit? Hadoop encourages keeping old data for future analysis. Hot/ cold? Tiers? Tape?
  • Two versions of Hadoop. Hadoop 1 (GA) and Hadoop 2 (alpha). One is stable. Full stack HA work in progress.
  • Hadoop full stack HA architecture diagram. Slave nodes layer + HA Cluster layer. Improving performance, DR, upgrades.
  • upcoming features include snapshots, heterogeneous storage (flash drives), block grouping, other protocols (NFS).
  • Which Apache Hadoop distro should you use? Little marketing of Hortonworks. Most stable version of components.
  • It's a new product. At Yahoo we needed to make sure we did not lose any data. It needs to be stable.
  • Hadoop changes the game. Cost, storage and compute. Scales to very very large. Open, growing ecosystem, no lock in.
  • Question from the audience. What is Big Data? What is Hadoop? You don't need to know what it is, just buy it :-)
  • Sizing? The CPU performance and disk performance/capacity varies a lot. 90% of disk performance for sequential IO.
  • Question: Security? Uses Kerberos authentication, you can connect to Active Directory. There is a paper on this.
  • 1 name node to thousands of nodes, 200M files. Hadoop moving to more name nodes to match the capacity of working set.
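To put the 3x overhead comment in numbers, here is a rough usable-capacity comparison between HDFS-style triple replication and a parity layout. The figures are made up for illustration and are not from the talk.

```python
def usable_capacity(raw_tb: float, scheme: str) -> float:
    """Very rough usable-capacity estimate for a storage cluster."""
    if scheme == "hdfs-3x":          # three full copies of every block
        return raw_tb / 3
    if scheme == "raid6-8+2":        # 8 data + 2 parity disks per stripe
        return raw_tb * 8 / 10
    raise ValueError(scheme)

raw = 1200.0  # TB of raw disk across the cluster (made-up number)
print(usable_capacity(raw, "hdfs-3x"))    # 400.0 TB usable
print(usable_capacity(raw, "raid6-8+2"))  # 960.0 TB usable
```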

Primary Data Deduplication in Windows Server 2012 with Sudipta Sengupta, Jim Benton

Sudipta Sengupta:

  • Growing file storage market. Dedup is the #1 feature customers asking for. Lots of acquisitions in dedup space.
  • What is deduplication, how to do it. Content-based chunking using a sliding window, computing hashes. Rabin method. (A toy chunking sketch follows this list.)
  • Dedup for data at rest, data on the wire. Savings in your primary storage more valuable, more expensive disks...
  • Dimensions of the problem: Primary storage, locality, service data to components, commodity hardware.
  • Extending the envelope from backup scenarios only to primary deduplication.
  • Key design decisions: post-processing, granularity and chunking, scale slowly to data size, crash consistent
  • Large scale study of primary datasets. Table with different workloads, chunking.
  • Looking at whole-file vs. sub-file. Decided early on to do chunking. Looking at chunk size. Compress the chunks!
  • Compression is more efficient on larger chunk sizes. Decided to use larger chunk size, pays off in metadata size.
  • You don't want to compress unless there's a bang for the buck. 50% of chunks = 80% of compression savings.
  • Basic version of the Rabin fingerprinting based chunking. Large chunks, but more uniform chunk size distribution
  • In Windows average chunk size is 64KB. Jose: Really noticing this guy is in research :-) Math, diagrams, statistics
  • Chunk indexing problem. Metadata too big to fit in RAM. Solution via unique chunk index architecture. Locality.
  • Index very frugal on both memory usage and IOPs. 6 bytes of RAM per chunk. Data partitioning and reconciliation.
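A minimal content-defined chunking sketch in the spirit of the Rabin-style approach described above. It uses a toy rolling sum instead of real Rabin fingerprints and much smaller chunk sizes than the ~64KB Windows average, purely to show how boundaries depend on content rather than offsets.

```python
import os

def chunk(data: bytes, window: int = 16, mask: int = 0x3FF,
          min_size: int = 256, max_size: int = 4096):
    """Split data at content-defined boundaries.

    A boundary is declared when a rolling hash over the last `window` bytes
    matches a mask (a toy sum-based hash here, not a real Rabin fingerprint),
    so identical content tends to produce identical chunks even after inserts
    shift the data around.
    """
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]          # slide the window forward
        size = i - start + 1
        if ((rolling & mask) == 0 and size >= min_size) or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])              # final partial chunk
    return chunks

blob = os.urandom(200_000)
pieces = chunk(blob)
assert b"".join(pieces) == blob                  # lossless split
print(len(pieces), sum(len(p) for p in pieces) // len(pieces))  # count, avg size
```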

Jim Benton:

  • Windows approach to data consistency and integrity. Mandatory block diagram with deduplication components.
  • Looking at deduplication on-disk structures. Identify duplicate data (chunks), optimize target files (stream map)
  • Chunk store file layout. Data container files: chunks and stream maps. Chunk ID has enough data to locate chunk
  • Look at the rehydration process. How to get the file back from the stream map and chunks.
  • Deduplicated file write path: partial recall. Recall bitmap allows serving IO from file stream or chunk store. (A toy recall-bitmap model follows this list.)
  • Crash consistency state diagram. One example with partial recall. Generated a lot of these diagrams for confidence.
  • Used state diagrams to allow test team to induce failures and verify deduplication is indeed crash consistent.
  • Data scrubbing. Induce redundancy back in, but strategically. Popular chunks get more copies. Checksum verified.
  • Data scrubbing approach: Detection, containment, resiliency, scrubbing, repair, reporting. Lots of defensive code!
  • Deduplication uses Storage Spaces redundancy. Can use that level to recover the data from another copy if possible.
  • Performance for deduplication. Looking at a table with impact of dedup. Looking at options using less/more memory.
  • Looking at resource utilization for dedup. Focus on converging them.
  • Dedup performance varies depending on data access pattern. Time to open office file, almost no difference.
  • Dedup. Time to copy large VHD file. Lots of common chunks. Actually reduces copy time for those VHD files. Caching.
  • Dedup write performance. Crash consistency hurts performance, so there is a hit. In a scenario, around 30% slower.
  • Deduplication among the top features in Windows Server 2012. Mentions at The Register, Ars Technica, Windows IT Pro
  • Lots of great questions being asked. Could not capture it all.
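A toy model of the recall bitmap mentioned above: fixed-size regions, where a write recalls the affected regions into the file stream and later reads are served either from the file stream or from the chunk store. The structure and names are illustrative, not the Windows implementation.

```python
class DedupedFile:
    """Toy partial-recall model: fixed-size regions, plus a bitmap that says
    which regions live in the plain file stream vs. the chunk store."""

    REGION = 4096

    def __init__(self, optimized_data: bytes):
        self.optimized = optimized_data               # stands in for stream map + chunks
        nregions = -(-len(optimized_data) // self.REGION)
        self.recalled = [False] * nregions            # the recall bitmap
        self.stream = bytearray(len(optimized_data))  # the file stream

    def write(self, offset: int, payload: bytes):
        first = offset // self.REGION
        last = (offset + len(payload) - 1) // self.REGION
        for r in range(first, last + 1):
            if not self.recalled[r]:
                lo, hi = r * self.REGION, (r + 1) * self.REGION
                self.stream[lo:hi] = self.optimized[lo:hi]   # partial recall
                self.recalled[r] = True
        self.stream[offset:offset + len(payload)] = payload

    def read_region(self, r: int) -> bytes:
        lo, hi = r * self.REGION, (r + 1) * self.REGION
        if self.recalled[r]:
            return bytes(self.stream[lo:hi])          # serve from the file stream
        return self.optimized[lo:hi]                  # rehydrate from the chunk store

f = DedupedFile(b"A" * 16384)
f.write(4096, b"B" * 4096)
print(f.read_region(0)[:4], f.read_region(1)[:4])    # b'AAAA' b'BBBB'
```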

High Performance File Serving with SMB3 and RDMA via the SMBDirect Protocol with Tom Talpey and Greg Kramer

Tom Talpey:

  • Where we are with SMB Direct, where we are going, some pretty cool performance results.
  • Last year here at SDC we had our coming out party for SMB Direct. Review of what's SMB Direct.
  • Nice palindromic port for SMB Direct: 5455. Protocol documented at MS-SMBD. http://msdn.microsoft.com/en-us/library/hh536346(v=PROT.13).aspx
  • Covering the basics of SMB Direct. Only 3 message types. 2-way full duplex. Discovered via SMB Multichannel.
  • Relationship with the NDKPI in Windows. Provider interface implemented by adapter vendors.
  • Send/receive model. Possibly sent as train. Implements crediting. Direct placement (read/write). Scatter/gather list
  • Going over the details on SMB Direct send transfers. Reads and writes, how they map to SMB3. Looking at read transfer
  • Looking at exactly how the RDMA reads and writes work. Actual offloaded transfers via RDMA. Also covering credits.
  • Just noticed we have a relatively packed room for such a technical talk... And it's the larger room here...
  • Interesting corner cases for crediting. Last credit case. Async, cancels and errors. No reply, many/large replies. (A minimal crediting sketch follows this list.)
  • SMB Direct efficiency. Two pipes, one per direction, independent. Truly bidirectional. Server pull model. Options.
  • SMB Direct options for RDMA efficiency. FRMR, silent completions, coalescing, etc.
  • Server pull model allows for added efficiency, in addition to improved security. Server controls all RDMA operations.
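The crediting described above boils down to "every send consumes a credit previously granted by the peer, and each message can request more". Below is a minimal accounting sketch under that assumption; it is not the MS-SMBD wire format.

```python
class SmbDirectPeer:
    """Toy credit accounting for one direction of an SMB Direct connection."""

    def __init__(self, initial_receive_buffers: int):
        self.send_credits = 0                        # grants received from the peer
        self.posted_receives = initial_receive_buffers

    def grant(self) -> int:
        """Grant the peer as many credits as we have receive buffers posted."""
        granted, self.posted_receives = self.posted_receives, 0
        return granted

    def on_credits_granted(self, n: int):
        self.send_credits += n

    def send(self, payload: bytes, credits_requested: int = 1) -> dict:
        if self.send_credits == 0:
            raise RuntimeError("no send credits; must wait for a grant")
        self.send_credits -= 1                       # every send consumes one credit
        return {"data": payload, "credits_requested": credits_requested}

a, b = SmbDirectPeer(10), SmbDirectPeer(10)
a.on_credits_granted(b.grant())     # b grants 10 credits to a
msg = a.send(b"negotiate request")  # consumes one of a's credits
print(a.send_credits)               # 9
```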

Greg Kramer:

  • On to the main event. That's why you're here, right? Performance...
  • SDC 2011 results: 160K IOPS, 3.2 GBytes/sec.
  • New SDC 2012 results. Dual CX3 InfiniBand, Storage Spaces, two SAS HBAs, SSDs. SQLIO tool.
  • Examining the results. 7.3 GBytes/sec with 512KB IOs at 8.6% CPU. 453K 8KB IOs at 60% CPU. (See the quick math after this list.)
  • Taking it to 11. Three InfiniBand links. Six SAS HBAs. 48 SSDs. 16.253 GBytes/sec!!! Still low CPU utilization...
  • NUMA effects on performance. Looking at NUMA disabled versus enabled. A 16% difference in CPU utilization.
  • That's great! Now what? Looking at potential techniques to reduce the cost of IOs, increase IOPs further.
  • Looking at improving how invalidation consumes CPU cycles, RNIC bus cycles. But you do need to invalidate aggressively
  • Make invalidate cheaper. Using "send with invalidate". Invalidate done as early as possible, fewer round trips.
  • Send with invalidate: supported in InfiniBand, iWARP and RoCE. No changes to SMB Direct protocol. Not a committed plan
  • Shout out to http://smb3.info  Thanks, Greg!
  • Question: RDMA and encryption? Yes, you can combine them. SMB Direct will use RDMA send/receives in that case.
  • Question: How do you monitor at packet level? Use Message Analyzer. But careful drinking from the fire hose :-)
  • Question: Performance monitor? There are counters for RDMA, look out for stalls, hints on how to optimize.
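A quick sanity check on how those results relate bandwidth, IOPS and IO size (the ~15K large-IO figure is back-calculated from the 7.3 GBytes/sec number, not taken from the slides):

```python
def gbytes_per_sec(iops: float, io_size_bytes: int) -> float:
    """Bandwidth is simply IOPS times IO size."""
    return iops * io_size_bytes / 2**30

# Large sequential IOs: relatively few IOs deliver huge bandwidth.
print(gbytes_per_sec(14_950, 512 * 1024))   # ~7.3 GBytes/sec

# Small random IOs: high IOPS, more modest bandwidth, far more CPU per byte.
print(gbytes_per_sec(453_000, 8 * 1024))    # ~3.5 GBytes/sec from 453K 8KB IOs
```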

SMB 3.0 Application End-to-End Performance with Dan Lovinger

  • Product is released now, unlike last year. We're now showing final results...
  • Scenarios with OLTP database, cluster motion, Multichannel. How we found issues during development.
  • Summary statistics. You can drown in a river with an average depth of six inches.
  • Starting point: Metric versus time. Averages are not enough, completely miss what's going on.
  • You should think about distribution. Looking at histogram. The classic Bell Curve. 34% to each side.
  • Standard deviation and median. Midpoint of all data points. What makes sense for latency, bandwidth?
  • Looking at percentiles. Cumulative distributions. Remember that from college? (A small example follows this list.)
  • OLTP workload. Transaction rate, cumulative distribution. How we found and solved an issue that makes SMB ~= DAS
  • OLTP. Log file is small to midsize sequential IO, database file is small random IO.
  • Found an 18-year-old performance bug that affects only SMB and only in an OLTP workload. Leftover from FAT implementation.
  • Found this "write bubble" performance bug by looking at average queue length. Once fixed, SMB =~ DAS.
  • back to OLTP hardware configuration. IOPs limited workload does not need fast interconnect.
  • Comparing SMB v. DAS transaction rate at ingest. 1GbE over SMB compared to 4GbFC. Obviously limited by bandwidth.
  • As soon as the ingest phase is done, then 1GbE is nearly identical to 4GbFC. IOPs limited on disks. SMB=~DAS.
  • This is just a sample of why workload matters, why we need these performance analysis to find what we can improve.
  • IOmeter and SQLIO are not enough. You need to look at a real workload to find these performance issues.
  • Fix for this issue in Windows Server 2012 and also back ported to Windows Server 2008 R2.
  • Another case: Looking at what happens when you move a cluster resource group from one node to another.
  • 3 file server cluster groups, 40 disks on each. How resource control manager handles the move. Needed visualization.
  • Looking at a neat visualization of how cluster disks are moved from one node to another. Long pole operations.
  • Found that every time we offline a disk, there was a long running operation that was not needed. We fixed that.
  • We also found a situation that took multiple TCP timeouts, leading to long delay in the overall move. Fixed!
  • Final result, dramatic reduction of cluster move time. Entire move time from 55 seconds to under 10 seconds.
  • Now we can do large cluster resource group moves with 120 disks in under 10 seconds. Not bad...
  • Last case study. SMB Multichannel performance. Looking at test hardware configuration. 24 SSDs, 2 SAS HBAs, IOmeter
  • Looking at local throughput at different IO sizes, as a baseline.
  • SMB Multichannel. We can achieve line rate saturation at about 16KB with four 40GBE interfaces.
  • Curve for small IOs matches between DAS and SMB at line rate.
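Since the talk leans on distributions rather than averages, here is a small made-up example of how the mean, median and high percentiles of a latency sample can tell very different stories:

```python
import statistics

# Made-up latency samples (ms): mostly fast, with a couple of slow outliers.
latencies = [2, 2, 3, 3, 3, 4, 4, 5, 5, 250]

def percentile(samples, p):
    """Nearest-rank percentile over a sorted copy of the samples."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

print(statistics.mean(latencies))      # 28.1  -- the average is badly skewed
print(statistics.median(latencies))    # 3.5   -- the typical request is fast
print(percentile(latencies, 90))       # 5     -- 90% of requests finish in <= 5 ms
print(percentile(latencies, 99))       # 250   -- the tail is what hurts
```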

Closing tweet

  • #SDConference is finished. Thanks for a great event! Meet you at SNW Fall 2012 in a month, right back here. On my way back to Redmond now...

Talks at SNW Fall 2012 in October will cover SMB Direct, Hyper-V over SMB and SMB 3.0

I have three presentations lined up for the ComputerWorld/SNIA SNW Fall 2012 Conference, scheduled for October 16-19, 2012 in Santa Clara, California. Here are the details for each one, taken from the official event web site.

Industry Perspective: High Throughput File Servers with SMB Direct, Using the Three Flavors of RDMA network adapters
Wednesday, 10/17/2012, 11:40 AM -12:25 PM

In Windows Server 2012, we introduce the “SMB Direct” protocol, which allows file servers to use high throughput/low latency RDMA network interfaces. However, there are three distinct flavors of RDMA, each with its own specific requirements and advantages, its own pros and cons. In this session, we'll look into iWARP, InfiniBand and RoCE, and outline the differences between them. We'll also list the specific vendors that offer each technology and provide step-by-step instructions for anyone planning to deploy them. The talk will also include an update on RDMA performance and a customer case study.

Industry Perspective: Hyper-V over SMB: Remote File Storage Support in Windows Server 2012 Hyper-V
Friday, 10/19/2012, 10:20 AM - 11:05 AM

In this session, we cover the Windows Server 2012 Hyper-V support for remote file storage using SMB 3.0. This introduces a new first-class storage option for Hyper-V that is a flexible, easy to use and cost-effective alternative to block storage. We detail the basic requirements for Hyper-V over SMB and outline the specific enhancements to SMB 3.0 to support server application storage, including SMB Transparent Failover, SMB Scale-Out, SMB Multichannel, SMB Direct (SMB over RDMA), SMB Encryption, SMB PowerShell, SMB performance counters and VSS for Remote File Shares. We conclude with a few suggested configurations for Hyper-V over SMB, including both standalone and clustered options. SMB 3.0 is an open protocol family, which is being implemented by several major vendors of enterprise NAS, and by the Samba open-source CIFS/SMB package in Linux and other operating systems.

SNIA Tutorial: SMB Remote File Protocol (including SMB 3.0)
Friday, 10/19/2012, 11:15 AM - 12:00 PM

The SMB protocol has evolved over time from CIFS to SMB1 to SMB2, with implementations by dozens of vendors including most major Operating Systems and NAS solutions. The SMB 3.0 protocol, announced at the SNIA SDC Conference in September 2011, is expected to have its first commercial implementations by Microsoft, NetApp and EMC by the end of 2012 (and potentially more later). This SNIA Tutorial describes the basic architecture of the SMB protocol and basic operations, including connecting to a share, negotiating a dialect, executing operations and disconnecting from a share. The second part of the talk will cover improvements in version 2.0 of the protocol, including a reduced command set, support for asynchronous operations, compounding of operations, durable and resilient file handles, file leasing and large MTU support. The final part of the talk covers the latest changes in the SMB 3.0 version, including persistent handles (SMB Transparent Failover), active/active clusters (SMB Scale-Out), multiple connections per session (SMB Multichannel), support for RDMA protocols (SMB Direct), snapshot-based backups (VSS for Remote File Shares), opportunistic locking of folders (SMB Directory Leasing), and SMB encryption.

If you’re not registered yet, there’s still time. Visit the official web site at http://www.snwusa.com and click on the Register link. I look forward to seeing you there…

SNIA’s Storage Developer Conference 2013 is just a few weeks away

The Storage Networking Industry Association (SNIA) is hosting the 10th Storage Developer Conference (SDC) in the Hyatt Regency in beautiful Santa Clara, CA (Silicon Valley) on the week of September 16th. As usual, Microsoft is one of the underwriters of the SNIA SMB2/SMB3 PlugFest, which is co-located with the SDC event.

For developers working with storage-related technologies, this event gathers a unique crowd and includes a rich agenda that you can find at http://www.storagedeveloper.org. Many of the key industry players are represented, and this year’s agenda lists presentations from EMC, Fujitsu, Google, Hortonworks, HP, Go Daddy, Huawei, IBM, Intel, Microsoft, NEC, NetApp, Netflix, Oracle, Red Hat, Samba Team, Samsung, Spectra Logic, SwiftTest, Tata and many others.

It’s always worth reminding you that SDC presentations are delivered to developers by the actual product development teams, and frequently the developer of the technology is either giving the presentation or in the room to take questions. That kind of deep insight is not common at other conferences.

Presentations by Microsoft this year include:

  • Advancements in Windows File Systems (Neal Christiansen, Principal Development Lead, Microsoft)
  • LRC Erasure Coding in Windows Storage Spaces (Cheng Huang, Researcher, Microsoft Research)
  • SMB3 Update (David Kruse, Development Lead, Microsoft)
  • Cluster Shared Volumes (Vladimir Petter, Principal Software Design Engineer, Microsoft)
  • Tunneling SCSI over SMB: Shared VHDX files for Guest Clustering in Windows Server 2012 R2 (Jose Barreto, Principal Program Manager, Microsoft; Matt Kurjanowicz, Software Development Engineer, Microsoft)
  • Windows Azure Storage - Speed and Scale in the Cloud (Joe Giardino, Senior Development Lead, Microsoft)
  • SMB Direct update (Greg Kramer, Sr. Software Engineer, Microsoft)
  • Scaled RDMA Performance & Storage Design with Windows Server SMB 3.0 (Dan Lovinger, Principal Software Design Engineer, Microsoft)
  • SPEC SFS Benchmark - The Next Generation (Spencer Shepler, Architect, Microsoft)
  • Data Deduplication as a Platform for Virtualization and High Scale Storage (Adi Oltean, Principal Software Design Engineer, Microsoft; Sudipta Sengupta, Sr. Researcher, Microsoft)

For a taste of what SDC presentations look like, make sure to visit the site for last year’s event, where you can find downloadable PDF files for most sessions and video recordings for some. You can find them at http://www.snia.org/events/storage-developer2012/presentations12.

Registration for SDC 2013 is open at http://www.storagedeveloper.org and you should definitely plan to attend. If you are registered, leave a comment and let’s plan to meet when we get there!

Raw notes from the Storage Developers Conference (SDC 2013)


This blog post is a compilation of my raw notes from SNIA’s SDC 2013 (Storage Developers Conference).

Notes and disclaimers:

  • These notes were typed during the talks and they may include typos and my own misinterpretations.
  • Text in the bullets under each talk are quotes from the speaker or text from the speaker slides, not my personal opinion.
  • If you feel that I misquoted you or badly represented the content of a talk, please add a comment to the post.
  • I spent limited time fixing typos or correcting the text after the event. Just so many hours in a day...
  • I have not attended all sessions (since there are 4 or 5 at a time, that would actually not be possible :-)…
  • SNIA usually posts the actual PDF decks a few weeks after the event. Attendees have access immediately.
  • You can find the event agenda at http://www.snia.org/events/storage-developer2013/agenda2013

SMB3 Meets Linux: The Linux Kernel Client
Steven French, Senior Engineer SMB3 Architecture, IBM

  • Title as shown (with the earlier names struck through): CIFS SMB2 SMB2.1 SMB3 SMB3.02 and Linux, a Status Update.
  • How do you use it? What works? What is coming?
  • Who is Steven French: maintains the Linux kernel client, SMB3 Architect for IBM Storage
  • Excited about SMB3
  • Why SMB3 is important: cluster friendly, large IO sizes, more scalable.
  • Goals: local/remote transparency, near POSIX semantics to Samba, fast/efficient/full function/secure method, as reliable as possible over bad networks
  • Focused on SMB 2.1, 3, 3.02 (SMB 2.02 works, but lower priority)
  • SMB3 faster than CIFS. SMB3 remote file access near local file access speed (with RDMA)
  • Last year SMB 2.1, this year SMB 3.0 and minimal SMB 3.02 support
  • 308 kernel changes this year, a very active year. More than 20 developers contributed
  • A year ago 3.6-rc5 – now at 3.11 going to 3.12
  • Working on today: copy offload, full Linux xattr support, SMB3 UNIX extension prototyping, recovering pending locks, starting work on Multichannel
  • Outline of changes in the latest releases (from kernel version 3.4 to 3.12), version by version
  • Planned for kernel 3.13: copy chunk, quota support, per-share encryption, multichannel, considering RDMA (since Samba is doing RDMA)
  • Improvements for performance: large IO sizes, credit-based flow control, improved caching model. Still need to add compounding.
  • Status: can negotiate multiple dialects (SMB 2.1, 3, 3.02)
  • Working well: basic file/dir operations, passes most functional tests, can follow symlinks, can leverage durable and persistent handles, file leases
  • Need to work on: cluster enablement, persistent handles, witness, directory leases, per-share encryption, multichannel, RDMA
  • Plans: SMB 2.1 no longer experimental in 3.12, SMB 2.1 and 3 passing similar set of functional tests to CIFS
  • Configuration hints: adjusting rsize, wsize, max_pending, cache, smb3 signing, UNIX extension, nosharelock (see the mount sketch after this list)
  • UNIX extensions: POSIX pathnames, case sensitive path name, POSIX delete/rename/create/mkdir, minor extensions to stat/statfs, brl, xattr, symlinks, POSIX ACLs
  • Optional POSIX SMB3 features outlined: list of flags used for each capability
  • Question: Encryption: Considering support for multiple algorithms, since AES support just went in the last kernel.
  • Development is active! Would like to think more seriously about NAS appliances. This can be extended…
  • This is a nice, elegant protocol. SMB3 fits well with Linux workloads like HPC, databases. Unbelievable performance with RDMA.
  • Question: Cluster enablement? Durable handle support is in. Pieces missing for persistent handle and witness are small. Discussing option to implement and test witness.
  • Need to look into the failover timing for workloads other than Hyper-V.
  • Do we need something like p-NFS? Probably not, with these very fast RDMA interfaces…
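
As an illustration of those configuration hints, here is a hedged sketch of mounting an SMB 3.0 share from a Linux client with a few of the tuning knobs mentioned. The server, share, credentials and option values are placeholders; check the mount.cifs man page for the exact option names your kernel supports.

    # Sketch: mount an SMB 3.0 share with a few tuning options (placeholders throughout; run as root).
    import subprocess

    options = ",".join([
        "vers=3.0",        # force the SMB 3.0 dialect
        "username=user1",  # credentials (a credentials file is better in practice)
        "rsize=1048576",   # large read size
        "wsize=1048576",   # large write size
        "cache=strict",    # caching model
    ])
    subprocess.run(
        ["mount", "-t", "cifs", "//fileserver/share1", "/mnt/share1", "-o", options],
        check=True,
    )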

Mapping SMB onto Distributed Storage
Christopher R. Hertel, Senior Principal Software Engineer, Red Hat
José Rivera, Software Engineer, Red Hat

  • Trying to get SMB running on top of a distributed file system, Gluster
  • Chris and Jose: Both work for Red Hat, both part of the Samba team, authors, etc…
  • Metadata: data about data, pathnames, inode numbers, timestamps, permissions, access controls, file size, allocation, quota.
  • Metadata applies to volumes, devices, file systems, directories, shares, files, pipes, etc…
  • Semantics are interpreted in different contexts
  • Behavior: predictable outcomes. Make them the same throughout the environments, even if they are not exactly the same
  • Windows vs. POSIX: different metadata + different semantics = different behavior
  • That’s why we have a plugfest downstairs
  • Long list of things to consider: ADS, BRL, deleteonclose, directory change notify, NTFS attributes, offline ops, quota, etc…
  • Samba is a Semantic Translator. Clients expect Windows semantics from the server, Samba expects POSIX semantics from the underlying file system
  • UNIX extensions for SMB allows POSIX clients to bypass some of this translation
  • If Samba does not properly handle the SMB protocol, we call it a bug. If it cannot handle the POSIX translation, it’s also a bug.
  • General Samba approach: Emulate the Windows behavior, translate the semantics to POSIX (ensure other local processes play by similar rules)
  • The Samba VFS layers: SMB Protocol → Initial Request Handling → VFS Layer → Default VFS Layer → actual file system
  • Gluster: Distributed File System, not a cluster file system. Brick → Directory in the underlying file system. Bricks bound together as a volume. Access via SMB, NFS, REST.
  • Gluster can be FUSE mounted. Just another access method. FUSE hides the fact that it’s Gluster underneath.
  • Explaining translations: Samba/Gluster/FUSE. Gluster is adaptable. Translator stack like Samba VFS modules…
  • Can add support for: Windows ACLs, oplocks, leases, Windows timestamps.
  • Vfs_glusterfs: Relatively new code, similar to other Samba VFS modules. Took less than a week to write.
  • Can bypass the lower VFS layers by using libgfapi. All VFS calls must be implemented to avoid errors.
  • CTDB offers three basic services: distributed metadata database (for SMB state), node failure detection/recovery, IP address service failover.
  • CTDB forms a Samba cluster. Separate from the underlying Gluster cluster. May duplicate some activity. Flexible configuration.
  • SMB testing, compared to other access methods: has different usage patterns, has tougher requirements, pushes corner cases.
  • Red Hat using stable versions, kernel 2.x or something. So using SMB1 still…
  • Fixed: Byte range locking. Fixed a bug in F_GETLK to get POSIX byte range locking to work.
  • Fixed: SMB has strict locking and data consistency requirements. Stock Gluster config failed the ping_pong test. Fixed cache bugs → ping_pong passes (a simplified lock-coherence sketch follows this list).
  • Fixed: Slow directory lookups. Samba must do extra work to detect and avoid name collisions. Windows is case-INsensitive, POSIX is case-sensitive. Fixed by using vfs_glusterfs.
  • Still working on: CTDB node banning. Under heavy load (FSCT), CTDB permanently bans a running node. Goal: reach peak capacity without node banning. New CTDB versions improved capacity.
  • Still working on: CTDB recovery lock file loss. Gluster is a distributed FS, not a Cluster FS. In replicated mode, there are two copies of each file. If Recovery Lock File is partitioned, CTDB cannot recover.
  • Conclusion: If implementing SMB in a cluster or distributed environment, you should know enough about SMB to know where to look for trouble… Make sure metadata is correct and consistent.
  • Question: Gluster and Ceph have VFS. Is Samba suitable for that? Yes. Richard wrote a guide on how to write a VFS. Discussing a few issues around passing user context.
  • Question: How to change SMB3 to be more distributed? Client could talk to multiple nodes. Gluster working on RDMA between nodes. Protocol itself could offer more about how the cluster is setup.
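
As a rough illustration of the byte-range locking that the ping_pong test exercises, here is a much-simplified Python sketch that repeatedly takes and releases POSIX byte-range locks on a shared file; running one copy per node against the same Gluster-backed path is a crude way to observe lock behavior. The path is a placeholder and this is nowhere near the real ping_pong tool.

    # Crude POSIX byte-range lock exerciser (far simpler than the real ping_pong test).
    import fcntl, os

    path = "/mnt/gluster/lockfile"   # placeholder shared path
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    for i in range(1000):
        offset = i % 2                                # alternate between two one-byte regions
        fcntl.lockf(fd, fcntl.LOCK_EX, 1, offset)     # take an exclusive lock on 1 byte
        # a real test would read/increment a shared counter here to catch cache bugs
        fcntl.lockf(fd, fcntl.LOCK_UN, 1, offset)     # release the byte-range lock
    os.close(fd)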

Pike - Making SMB Testing Less Torturous
Brian Koropoff, Consulting Software Engineer, EMC Isilon

  • Pike – written in Python – starting with a demo
  • Support for a modest subset of SMB2/3. Currently more depth than breadth.
  • Emphasis on fiddly cases like failover, complex creates
  • Mature solutions largely in C (not convenient for prototyping)
  • Why python: ubiquitous, expressive, flexible, huge ecosystem.
  • Flexibility and ease of use over performance. Convenient abstractions. Extensible, re-usable.
  • Layers: core primitives (abstract data model), SMB2/3 packet definitions, SMB2/3 client model (connection, state, request, response), test harness
  • Core primitives: Cursor (buffer+offset indicating read/write location), frame (packet model), enums, anti-boilerplate magic. Examples.
  • SMB2/SMB3 protocol (pike.smb2): header, request/response, create {request/response} context, concrete frame. Examples. (A hand-packed negotiate sketch follows this list.)
  • SMB2/SMB3 model: SMB3 object model + glue. Future, client, connection (submit, transceive, error handling), session, channel (treeconnect, create, read), tree, open, lease, oplocks.
  • Examples: Connect, tree connect, create, write, close. Oplocks. Leases.
  • Advanced uses. Manually construct and submit exotic requests. Override _encode. Example of a manual request.
  • Test harness (pike.test): quickly establish connection, session and tree connect to server. Host, credentials, share parameters taken from environment.
  • Odds and ends: NT time class, signing, key derivation helpers.
  • Future work: increase breadth of SMB2/3 support. Security descriptors, improvement to mode, NTLM story, API documentation, more tests!
  • http://github.com/emc-isilon/pike - open source, patches are welcome. Has to figure out how to accept contributions with lawyers…
  • Question: Microsoft has a test suite. It’s in C#, doesn’t work in our environment. Could bring it to the plugfest.
  • Question: I would like to work on implementing it for SMB1. What do you think? Not a priority for me. Open to it, but should use a different model to avoid confusion.
  • Example: Multichannel. Create a session, bind another channel to the same session, pretend failover occurred. Write fencing of stable write.
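
For a taste of the low-level frame packing that Pike hides behind its cursor and frame primitives, here is a stand-alone Python sketch that builds an SMB2 NEGOTIATE request by hand. The field layout follows my reading of MS-SMB2 and is simplified, so treat the offsets and constants as illustrative.

    # Hand-packed SMB2 NEGOTIATE request (simplified; layout per my reading of MS-SMB2).
    import struct, uuid

    def smb2_header(command, message_id):
        # 64-byte SMB2 header: ProtocolId, StructureSize, CreditCharge, Status, Command,
        # CreditRequest, Flags, NextCommand, MessageId, Reserved, TreeId, SessionId, Signature.
        return struct.pack('<4sHHIHHIIQIIQ16s',
                           b'\xfeSMB', 64, 0, 0, command, 1, 0, 0,
                           message_id, 0, 0, 0, b'\x00' * 16)

    def negotiate_request():
        dialects = [0x0202, 0x0210, 0x0300, 0x0302]    # SMB 2.002, 2.1, 3.0, 3.02
        body = struct.pack('<HHHHI16sQ',
                           36,                  # StructureSize
                           len(dialects),       # DialectCount
                           0x0001,              # SecurityMode: signing enabled
                           0, 0,                # Reserved, Capabilities
                           uuid.uuid4().bytes,  # ClientGuid
                           0)                   # ClientStartTime
        body += struct.pack('<%dH' % len(dialects), *dialects)
        return smb2_header(0x0000, 0) + body    # command 0x0000 = NEGOTIATE

    pdu = negotiate_request()
    print(len(pdu), pdu[:8])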

 Exploiting the High Availability features in SMB 3.0 to support Speed and Scale
James Cain, Principal Software Architect, Quantel Ltd

  • Working with TV/Video production. We only care about speed.
  • RESTful recap. RESTful filesystems talk from SDC 2010. Allows for massive scale by storing application state in the URLs instead of in the servers.
  • Demo (skipped due to technical issues): RESTful SMB3.
  • Filling pipes: Speed (throughput) vs. Bandwidth vs. Latency. Keeping packets back to back on the wire.
  • TCP window size used to limit it. Mitigate by using multiple wires, multiple connections. (See the bandwidth-delay sketch after this list.)
  • Filling the pipes: SMB1 – XP era. Filling the pipes required application participation. 1 session could do about 60MBps. Getting Final Cut Pro 7 to lay over SMB1 was hard. No choice to reduce latency.
  • Filling the pipes: SMB 2.0 – Vista era. Added credits, SMB2 server can control overlapped requests using credits. Client application could make normal requests and fill the pipe.
  • Filling the pipes: SMB 2.1 – 7 era. Large MTU helps.
  • Filling the pipes: SMB 3 – 8 era. Multi-path support. Enables: RSS, Multiple NICs, Multiple machines, RDMA.
  • SMB3 added lots of other features for high availability and fault tolerance. SignKey derivation.
  • Filesystem has DirectX GUI :-) - We use GPUs to render, so our SMB3 server has Cuda compute built in too. Realtime visualization tool for optimization.
  • SMB3 Multi-machine with assumed shared state. Single SMB3 client talking to two SMB3 servers. Distributed non-homogeneous storage behind the SMB servers.
  • Second NIC (channel) initiation has no additional CREATE. No distinction on the protocol between single server or multiple server. Assume homogeneous storage.
  • Asking Microsoft to consider “NUMA for disks”. Currently, shared nothing is not possible. Session, trees, handles are shared state.
  • “SMB2++” is getting massive traction. Simple use cases are well supported by the protocol. SMB3 has a high cost of entry, but lower than writing an IFS in kernel mode.
  • There are limits to how far SMB3 can scale due to its model.
  • I know this is not what the protocol is designed to do. But want to see how far I can go.
  • It could be helped by changing the protocol to have duplicate handle semantics associated with the additional channels.
  • The protocol is really, really flexible. But I’m having a hard time doing what I was trying to do.
  • Question: You’re basically trying to do Multichannel to multiple machines. Do you have a use case? I’m experimenting with it. Trying to discover new things.
  • Question: You could use CTDB to solve the problem. How much would it slow down? It could be a solution, not an awful lot of state.             
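
To make the "filling the pipe" point concrete, a quick bandwidth-delay product calculation shows how much data must be in flight, and why a single small window (or one outstanding SMB1 request) caps throughput. The link speed and RTT below are arbitrary examples.

    # Bandwidth-delay product: bytes in flight needed to keep a link full (example numbers only).
    link_gbps = 10      # 10 Gbps link
    rtt_ms = 5          # 5 ms round-trip time

    bdp_bytes = (link_gbps * 1e9 / 8) * (rtt_ms / 1e3)
    print("Need ~%.1f MB in flight to fill the pipe" % (bdp_bytes / 1e6))

    window_bytes = 64 * 1024   # a 64 KB window, or one outstanding small request
    print("A 64 KB window caps throughput at ~%.0f Mbps" % (window_bytes * 8 / (rtt_ms / 1e3) / 1e6))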

SMB3 Update
David Kruse, Development Lead, Microsoft

  • SMB 3.02 - Don’t panic! If you’re on the road to SMB3, there are no radical changes.
  • Considered not revving the dialect and doing just capability bits, but thought it would be better to rev the dialect.
  • Dialects vs. Capabilities: Asymmetric Shares, FILE_ATTRIBUTE_INTEGRITY_STREAMS.
  • SMB 2.0 client attempting MC or CA? Consistency/documentation question.
  • A server that receives a request from a client with a flag/option/capability that is not valid for the dialect should ignore it.
  • Showing code on how to mask the capabilities that don’t make sense for a specific dialect (an illustrative masking sketch follows this list)
  • Read/Write changes: request specific flag for unbuffered IO. RDMA flag for invalidation.
  • Comparing “Traditional” File Server Cluster vs. “Scale-Out” File Server cluster
  • Outlining the asymmetric scale-out file server cluster. Server-side redirection. Can we get the client to the optimal node?
  • Asymmetric shares. New capability in the TREE_CONNECT response. Witness used to notify client to move.
  • Different connections for different shares in the same scale-out file server cluster. Share scope is the unit of resource location.
  • Client processes share-level “move” in the same fashion as a server-level “move” (disconnect, reconnects to IP, rebinds handle).
  • If the cost accessing the data is the same for all nodes, there is no need to move the client to another node.
  • Move-SmbWitnessClient will not work with asymmetric shares.
  • In Windows, asymmetric shares are typically associated with Mirrored Storage Spaces, not iSCSI/FC uniform deployment. Registry key to override.
  • Witness changes: Additional fields: Sharename, Flags, KeepAliveTimeOutInSeconds.
  • Witness changes: Multichannel notification request. Insight into arrival/loss of network interfaces.
  • Witness changes: Keepalive. Timeout for async IO are very coarse. Guarantees client and server discover lost peer in minutes instead of hours.
  • Demos in Jose’s blog. Thanks for the plug!
  • Diagnosability events. New always-on events. Example: failed to reconnect a persistent handle includes previous reconnect error and reason. New events on server and client.
  • If Asymmetric is not important to you, you don’t need to implement it.
  • SMB for IPC (Inter-process communications) – What happened to named pipes?
  • Named pipes over SMB has been declined in popularity. Performance concerns with serialized IO. But this is a property of named pipes, not SMB.
  • SMB provides: discovery, negotiation, authentication, authorization, message semantics, multichannel, RDMA, etc…
  • If you can abstract your application as a file system interface, you could extend it to be remoted via SMB.
  • First example: Remote Shared Virtual Disk Protocol
  • Second example: Hyper-V Live Migration over SMB. VID issues writes over SMB to target for memory pages. Leverages SMB Multichannel, SMB Direct.
  • Future thoughts on SMB for IPC. Not a protocol change or Microsoft new feature. Just ideas shared as a thought experiment.
    • MessageFs – User mode-client and user-mode server. Named Pipes vs. MessageFs. Each offset marks a distinct transaction, enables parallel actions.
    • MemFs – Kernel mode component on the server side. Server registers a memory region and clients can access that memory region.
    • MemFs+ - What if we combine the two? Fast exchange for small messages plus high bandwidth, zero copy access for large transfers. Model maps directly to RDMA: send/receive messages, read/write memory access.
  • One last thing… On Windows 8.1, you can actually disable SMB 1.0 completely.

Architecting Block and Object Geo-replication Solutions with Ceph
Sage Weil, Founder & CTO, Inktank

  • Impossible to take notes, speaker goes too fast :-)

1 S(a) 2 M 3 B(a) 4
Michael Adam, SerNet GmbH - Delivered by Volker

  • What is Samba? The open source SMB server (Samba3). The upcoming open source AD controller (Samba4). Two different projects.
  • Who is Samba? List of team members. Some 35 or so people… www.samba.org/samba/team
  • Development focus: Not a single concentrated development effort. Various companies (RedHat, SuSE, IBM, SerNet, …) Different interests, changing interests.
  • Development quality: Established. Autobuild selftest mechanism. New voluntary review system (since October 2012).
  • What about Samba 4.0 after all?
    • First (!?) open source Active Directory domain controller
    • The direct continuation of the Samba 3.6 SMB file server
    • A big success in reuniting two de-facto separated projects!
    • Also a big and important file server release (SMB 2.0 with durable handles, SMB 2.1 (no leases), SMB 3.0 (basic support))
  • History. Long slide with history from 2003-06-07 (Samba 3.0.0 beta 1) to 2012-12-11 (Samba 4.0.0). Samba4 switched to using SMB2 by default.
  • What will 4.1 bring? Current 4.1.0rc3 – final planned for 2013-09-27.
  • Samba 4.1 details: mostly stabilization (AD, file server). SMB2/3 support in smbclient, including SMB3 encryption. Server side copy. Removed SWAT.
  • Included in Samba 4.0: SMB 2.0 (durable handles). SMB 2.1 (multi-credit, large MTU, dynamic reauth), SMB 3.0 (signing, encryption, secure negotiate, durable handles v2)
  • Missing in Samba 4.0: SMB 2.1 (leasing*, resilient file handles), SMB 3.0 (persistent file handles, multichannel*, SMB direct*, witness*, cluster features, storage features*, …) *=designed, started or in progress
  • Leases: Oplocks done right. Remove 1:1 relationship between open and oplock, add lease/oplock key. http://wiki.samba.org/index.php/Samba3/SMB2#Leases
  • Witness: Explored protocol with Samba rpcclient implementation. Working on pre-req async RPC. http://wiki.samba.org/index.php/Samba3/SMB2#Witness_Notification_Protocol
  • SMB Direct:  Currently approaching from the Linux kernel side. See related SDC talk. http://wiki.samba.org/index.php/Samba3/SMB2#SMB_Direct
  • Multichannel and persistent handles: just experimentation and discussion for now. No code yet.

Keynote: The Impact of the NVM Programming Model
Andy Rudoff, Intel

  • Title is Impact of NVM Programming Model (… and Persistent Memory!)
  • What do we need to do to prepare, to leverage persistent memory
  • Why now? Programming model is decades old!
  • What changes? Incremental changes vs. major disruptions
  • What does this mean to developers? This is SDC…
  • Why now?
  • One movement here: Block mode innovation (atomics, access hints, new types of trim, NVM-oriented operations). Incremental.
  • The other movement: Emerging NVM technologies (Performance, performance, perf… okay, Cost)
  • Started talking to companies in the industry → SNIA NVM Programming TWG - http://snia.org/forums/sssi/nvmp
  • NVM TWG: Develop specifications for new software “programming models” as NVM becomes a standard feature of platforms
  • If you don’t build it and show that it works…
  • NVM TWG: Programming Model is not an API. Cannot define those in a committee and push on OSVs. Cannot define one API for multiple OS platforms
  • Next best thing is to agree on an overall model.
  • What changes?
  • Focus on major disruptions.
  • Next generation scalable NVM: Talking about resistive RAM NVM options. 1000x speed up over NAND, closer to DRAM.
  • Phase Change Memory, Magnetic Tunnel Junction (MTJ), Electrochemical Cells (ECM), Binary Oxide Filament Cells, Interfacial Switching
  • Timing. Chart showing NAND SATA3 (ONFI2, ONFI3), NAND PCIe Gen3 x4 ONFI3 and future NVM PCIE Gen3 x4.
  • Cost of software stack is not changing, for the last one (NVM PCIe) read latency, software is 60% of it?!
  • Describing Persistent Memory…
  • Byte-addressable (as far as programming model goes), load/store access (not demand-paged), memory-like performance (would stall a CPU load waiting for PM), probably DMA-able (including RDMA)
  • For modeling, think battery-backed RAM. These are clunky and expensive, but it’s a good model.
  • It is not tablet-like memory for the entire system. It is not NAND Flash (at least not directly, perhaps with caching). It is not block-oriented.
  • PM does not surprise the program with unexpected latencies (no major page faults). Does not kick other things out of memory. Does not use page cache unexpectedly.
  • PM stores are not durable until data is flushed. Looks like a bug, but it’s always been like this. Same behavior that’s been around for decades. It’s how physics works.
  • PM may not always stay in the same address (physically, virtually). Different location each time your program runs. Don’t store pointers and expect them to work. You have to use relative pointers. Welcome to the world of file systems…
  • Types of Persistent Memory: Battery-backed RAM. DRAM saved on power failure. NVM with significant caching. Next generation NVM (still quite a bit unknown/emerging here).
  • Existing use cases: From volatile use cases (typical) to persistent memory use case (emerging). NVDIMM, Copy to Flash, NVM used as memory.
  • Value: Data sets with no DRAM footprint. RDMA directly to persistence (no buffer copy required!). The “warm cache” effect. Byte-addressable. Direct user-mode access.
  • Challenges: New programming models, API. It’s not storage, it’s not memory. Programming challenges. File system engineers and database engineers always did this. Now other apps need to learn.
  • Comparing to the change that happened when we switched to parallel programming. Some things can be parallelized, some cannot.
  • Two persistent memory programming models (there are four models, more on the talk this afternoon).
  • First: NVM PM Volume mode. PM-aware kernel module. A list of physical ranges of NVMs (GET_RANGESET).
  • For example, used by file systems, memory management, storage stack components like RAID, caches.
  • Second: NVM PM File. Uses a persistent-memory-aware file system. Open a file and memory map it. But when you do load and store you go directly to persistent memory.
  • Native file APIs and management. Did a prototype on Linux.
  • Application memory allocation. Ptr=malloc(len). Simple, familiar interface. But it’s persistent and you need to have a way to get back to it, give it a name. Like a file…
  • Who uses NVM.PM.FILE. Applications, must reconnect with blobs of persistence (name, permissions)
  • What does it mean to developers?
  • Mmap() on UNIX, MapViewOfFile() on Windows. Have been around for decades. Present in all modern operating systems. Shared or copy-on-write. (A minimal mmap sketch follows this list.)
  • NVM.PM.FILE – surfaces PM to application. Still somewhat raw at this point. Two ways: 1-Build on it with additional libraries. 2-Eventually turn to language extensions…
  • All these things are coming. Libraries, language extensions. But how does it work?
  • Creating resilient data structures. Resilient to a power failure. It will be in state you left it before the power failure. Full example: resilient malloc.
  • In summary: models are evolving. Many companies in the TWG. Apps can make a big splash by leveraging this… Looking forward to libraries and language extensions.
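
To ground the NVM.PM.FILE idea, here is a minimal sketch using today's ordinary mmap() from Python: open a file, map it, store into the mapping, and flush before trusting durability. On a PM-aware file system the loads and stores would go straight to persistent memory; the path below is a placeholder.

    # NVM.PM.FILE-style access sketch using ordinary mmap (the path is a placeholder).
    import mmap, os

    path = "/mnt/pmfs/blob"
    size = 4096
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    os.ftruncate(fd, size)

    m = mmap.mmap(fd, size)          # map the file; on a PM-aware FS this is the PM itself
    m[0:16] = b"persistent hello"    # a plain store into the mapping
    m.flush()                        # stores are not durable until flushed, as noted above
    m.close()
    os.close(fd)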

Keynote: Windows Azure Storage – Scaling Cloud Storage
Andrew Edwards, Microsoft

  • Turning block devices into very, very large block devices. Overview, architecture, key points.
  • Overview
  • Cloud storage: Blobs, disks, tables and queues. Highly durable, available and massively scalable.
  • 10+ trillion objects. 1M+ requests per second on average. Exposed via easy and open REST APIs
  • Blobs: Simple interface to retrieve files in the cloud. Data sharing, big data, backups.
  • Disks: Built on top of blobs. Mounted disks as VHDs stored on blobs.
  • Tables: Massively scalable key-value pairs. You can do queries, scan. Metadata for your systems.
  • Queues: Reliable messaging system. Deals with failure cases.
  • Azure is spread all over the world.
  • Storage Concepts: Accounts → Containers → Blobs / Tables → Entities / Queues → Messages. URLs to identify.
  • Used by Microsoft (XBOX, SkyDrive, etc…) and many external companies
  • Architecture
  • Design Goals: Highly available with strong consistency. Durability, scalability (to zettabytes). Additional information in the SOSP paper.
  • Storage stamps: Access to a blob via the URL. LB → Front-end → Partition layer → DFS Layer. Inter-stamp partition replication.
  • Architecture layer: Distributed file system layer. JBODs, append-only file system, each extent is replicated 3 times.
  • Architecture layer: Partition layer. Understands our data abstractions (blobs, queues, etc). Massively scalable index. Log Structure Merge Tree. Linked list of extents
  • Architecture layer: Front-end layer. REST front end. Authentication/authorization. Metrics/logging.
  • Key Design Points
  • Availability with consistency for writing. All writes we do are to a log. Append to the last extent of the log.
  • Ordered the same across all 3 replicas. Success only if 3 replicas are committed. Extents get sealed (no more appends) when they get to a certain size.
  • If you lose a node, seal the old two copies, create 3 new instances to append to. Also make a 3rd copy for the old one. (A toy append/seal sketch follows this list.)
  • Availability with consistency for reading. Can read from any replica. Send out parallel read requests if the first read is taking longer than the 95th percentile latency.
  • Partition Layer: spread index/transaction processing across servers. If there is a hot node, split that part of the index off. Dynamically load balance. Just the index, this does not move the data.
  • DFS Layer: load balancing there as well. No disk or node should be hot. Applies to both reads and writes. Lazily move replicas around to load balancing.
  • Append only system. Benefits: simple replication, easier diagnostics, erasure coding, keep snapshots with no extra cost, works well with future drive technology. Tradeoff: GC overhead.
  • Our approach to the CAP theorem. Tradeoff in Availability vs. Consistency. Extra flexibility to achieve C and A at the same time.
  • Lessons learned: Automatic load balancing. Adapt to conditions. Tunable and extensible to tune load balancing rules. Tune based on any dimension (CPU, network, memory, tpc, GC load, etc.)
  • Lessons learned: Achieve consistently low append latencies. Ended up using SSD journaling.
  • Lessons learned: Efficient upgrade support. We update frequently, almost constantly. Handle upgrades almost as failures.
  • Lessons learned: Pressure point testing. Make sure we’re resilient despite errors.
  • Erasure coding. Implemented at the DFS Layer. See last year’s SDC presentation.
  • Azure VM persistent disks: VHDs for persistent disks are directly stored in Windows Azure Storage blobs. You can access your VHDs via REST.
  • Easy to upload/download your own VHD and mount them. REST writes are blocked when mounted to a VM. Snapshots and Geo replication as well.
  • Separating compute from storage. Allows them to be scaled separately. Provide flat network storage. Using a Quantum 10 network architecture.
  • Summary: Durability (3 copies), Consistency (commit across 3 copies). Availability (can read from any of the 3 replicas). Performance/Scale.
  • Windows Azure developer website: http://www.windowsazure.com/en-us/develop/net
  • Windows Azure storage blog: http://blogs.msdn.com/b/windowsazurestorage
  • SOSP paper/talk: http://blogs.msdn.com/b/windowsazure/archive/2011/11/21/windows-azure-storage-a-highly-available-cloud-storage-service-with-strong-consistency.aspx
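
As a toy illustration of the append-only extent design described above (append to the last extent, seal it at a size limit, start a new one), here is a short Python sketch. It models only the bookkeeping, not the 3-way replication or any of the distributed machinery.

    # Toy model of append-only extents: append to the last extent, seal it at a size limit.
    SEAL_SIZE = 64 * 1024 * 1024   # example seal threshold

    class Log:
        def __init__(self):
            self.extents = [{"data": bytearray(), "sealed": False}]

        def append(self, record):
            last = self.extents[-1]
            if last["sealed"] or len(last["data"]) + len(record) > SEAL_SIZE:
                last["sealed"] = True                        # no more appends to this extent
                last = {"data": bytearray(), "sealed": False}
                self.extents.append(last)                    # start a fresh extent
            last["data"] += record                           # append-only write

    log = Log()
    log.append(b"x" * 1024)
    print(len(log.extents), log.extents[0]["sealed"])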

SMB Direct update
Greg Kramer, Microsoft
Tom Talpey, Microsoft

  • Two parts: 1 - Tom shares Ecosystem status and updates, 2 - Greg shares SMB Direct details
  • Protocols and updates: SMB 3.02 is a minor update. Documented in MS-SMB2 and MS-SMBD. See Dave's talk yesterday.
  • SMB Direct specifies the SMB3 RDMA transport, works with both SMB 3.0 and SMB 3.02
  • Windows Server 2012 R2 – GA in October, download from MSDN
  • Applications using SMB3 and SMB Direct: Hyper-V VHD, SQL Server
  • New in R2: Hyper-V Live Migration over SMB, Shared VHDX (remote shared virtual disk, MS-RSVD protocol)
  • RDMA Transports: iWARP (IETF RDMA over TCP), InfiniBand, RoCE. Ethernet: iWARP and RoCE – 10 or 40GbE, InfiniBand: 32Gbps (QDR) or 54Gbps (FDR)
  • RDMA evolution: iWARP (IETF standard, extensions currently active in IETF). RoCE (routable RoCE to improve scale, DCB deployment still problematic). InfiniBand (Roadmap to 100Gbps, keeping up as the bandwidth/latency leader).
  • iWARP: Ethernet, routable, no special fabric required, Up to 40GbE with good latency and full throughput
  • RoCE: Ethernet, not routable, requires PFC/DCB, Up to 40GbE with good latency and full throughput
  • InfiniBand: Specialized interconnect, not routable, dedicated fabric and switching, Up to 56Gbps with excellent latency and throughput
  • SMB3 Services: Connection management, authentication, multichannel, networking resilience/recovery, RDMA, File IO Semantics, control and extension semantics, remote file system access, RPC
  • The ISO 7-layer model: SMB presents new value as a Session layer (RDMA, multichannel, replay/recover). Move the value of SMB up the stack.
  • SMB3 as a session layer: Applications can get network transparency, performance, recovery, protection (signing, encryption, AD integration). Not something you see with other file systems or file protocols.
  • Other: Great use by clustering (inter-node communication), quality of service, cloud deployment
  • In summary. Look to SMB for even broader application (like Hyper-V Live Migration did). Broader use of SMB Direct. Look to see greater application “fidelity” (sophisticated applications transparently served by SMB3)
  • Protocol enhancements and performance results
  • Where can we reduce IO costs? We were extremely happy about performance, there was nothing extremely easy to do next, no low-hanging fruit.
  • Diagram showing the App/SMB client/Client RNIC/Server RNIC. How requests flow in SMB Direct.
  • Interesting: client has to wait for the invalidation completion. Invalidation popped up as an area of improvement. Consumes cycles, bus. Adds IO, latency. But it’s required.
  • Why pend IO until invalidation is completed? This is storage, we need to be strictly correct. Invalidation guarantees: data is in a consistent state after DMA, peers no longer has access.
  • Registration caches cannot provide these guarantees, leading to danger of corruption.
  • Back to the diagram. There is a way to decorate a request with the invalidation → Send and Invalidate. Provides all the guarantees that we need!
  • Reduces RNIC work requests per IO by one third for high IOPs workload. That’s huge! Already supported by iWARP/RoCE/InfiniBand
  • No changes required at the SMB Direct protocol. Minor protocol change in SMB 3.02 to support invalidation. New channel value in the SMB READ and SMB WRITE.
  • Using Send and Invalidate (Server). Only one invalidate per request, have to be associated with the request in question. You can leverage SMB compounding.
  • Only the first memory descriptor in the SMB3 read/write array may be remotely invalidated. Keeping it simple.
  • Using Send and Invalidate (Client). Not a mandate, you can still invalidate “manually” if not using remote invalidate. Must validate that the response matches.
  • Performance Results (drumroll…)
  • Benchmark configuration: Client and Server config: Xeon E5-2660. 2 x ConnectX-3 56Gbps InfiniBand. Shunt filter in the IO path. Comparing WS2012 vs. WS2012 R2 on same hardware.
  • 1KB random IO. Uses RDMA send/receive path. Unbuffered, 64 queue depth.
    • Reads: 881K IOPs. 2012 R2 is +12.5% over 2012. Both client and server CPU/IO reduced (-17.3%, -36.7%)
    • Writes: 808K IOPs. 2012 R2 is +13.5% over 2012. Both client and server CPU/IO reduced (-16%, -32.7%)
  • 8KB random IO. Uses RDMA read/writes. Unbuffered, 64 queue depth.
    • Reads: 835K IOPs. 2012 R2 is +43.3% over 2012. Both client and server CPU/IO reduced (-37.1%, -33.2%)
    • Writes: 712K IOPs. 2012 R2 is +30.2% over 2012. Both client and server CPU/IO reduced (-26%, -14.9%)
  • 512KB sequential IO. Unbuffered, 12 queue depth. Already maxing out before. Remains awesome. Minor CPU utilization decrease.
    • Reads: 11,366 MBytes/sec. 2012 R2 is +6.2% over 2012. Both client and server CPU/IO reduced (-9.3%, -14.3%)
    • Writes: 11,412 MBytes/sec: 2012 R2 is +6% over 2012. Both client and server CPU/IO reduced (-12.2%, -10.3%)
  • Recap: Increased IOPS (up to 43%) and high bandwidth. Decrease CPU per IO (up to 36%).
  • Client has more CPU for applications. Server scales to more clients.
  • This includes other optimizations in both the client and the server. NUMA is very important.
  • No new hardware required. No increase in the number of connections, MRs, etc.
  • Results reflect the untuned, out-of-the-box customer experience.
  • One more thing… You might be skeptical, especially about the use of shunt filter.
  • We never get to see this in our dev environment, we don’t have the high end gear. But...
  • Describing the 3U Violin memory array running Windows Server in a clustered configuration. All flash storage. Let’s see what happens…
  • Performance on real IO going to real, continuously available storage:
    • 100% Reads – 4KiB: >1Million IOPS
    • 100% Reads – 8KiB: >500K IOPS
    • 100% Writes – 4KiB: >600K IOPS
    • 100% Writes – 8KiB: >300K IOPS
  • Questions?

A Status Report on SMB Direct (RDMA) for Samba
Richard Sharpe, Samba Team Member, Panzura

  • I work at Panzura but this has been done on my weekends
  • Looking at options to implement SMB Direct
  • 2011 – Microsoft introduced SMB direct at SDC 2011. I played around with RDMA
  • May 2012 – Tutorial on SMB 3.0 at Samba XP
  • Mellanox supplied some IB cards to Samba team members
  • May 2013 – More presentations with Microsoft at Samba XP
  • June 2013 – Conversations with Mellanox to discuss options
  • August 2013 – Started circulating a design document
  • Another month or two before it’s hooked up with Samba.
  • Relevant protocol details: Client connections via TCP first (port 445). Session setup, connects to a share. Queries network interfaces. Place an RDMA Connection to server on port 5445, brings up SMB Direct Protocol engine
  • Client sends negotiate request, Dialect 0x300, capabilities field. Server Responds.
  • Diagram with SMB2 spec section 4.8 has an example
  • SMB Direct: Small protocol - Negotiate exchange phase, PDU transfer phase.
  • Structure of Samba. Why did it take us two years? Samba uses a fork model. Master smbd forks a child. Easy with TCP. Master does not handle SMB PDUs.
  • Separate process per connection. No easy way to transfer connection between them.
  • Diagram with Samba structure. Problem: who should listen on port 5445? Wanted RDMA connection to go to the child process.
  • 3 options:
  • 1 - Convert Samba to a threaded model, everything in one address space. Would simplify TCP as well. A lot of work… Presents other problems.
  • 2 - Separate process to handle RDMA. Master smbd, RDMA handler, multiple child smbd, shared memory. Layering violation! Context switches per send/receive/read/write. Big perf hit.
  • 3 - Kernel driver to handle RDMA. Smbdirect support / RDMA support (rdmacm, etc) / device drivers. Use IOCTLs. All RDMA work in kernel, including RDMA negotiate on port 5445. Still a layering violation. Will require both kernel and Samba knowledge.
  • I decided I will follow this kernel option.
  • Character mode device. Should be agnostic of the card used. Communicate via IOCTLs (setup, memory params, send/receive, read/write).
  • Mmap for RDMA READ and RDMA WRITE. Can copy memory for other requests. Event/callback driven. Memory registration.
  • The fact that is looks like a device driver is a convenience.
  • IOCTLs: set parameters, set session ID, get mem params, get event (includes receive, send complete), send pdu, rdma read and write, disconnect. (A hypothetical ioctl sketch follows this list.)
  • Considering option 2. Doing the implementation of option 3 will give us experience and might change later.
  • Amortizing the mode switch. Get, send, etc, multiple buffers per IOCTL. Passing an array of objects at a time.
  • Samba changes needed….
  • Goals at this stage: Get something working. Allow others to complete. It will be up on github. Longer term: improve performance with help from others.
  • Some of this work could be used by the SMB client
  • Status: A start has been made. Driver loads and unloads, listens to connections. Working through the details of registering memory. Understand the Samba changes needed.
  • Weekend project! http://github.com/RichardSharpe/smbdirect-driver
  • Acknowledgments: Microsoft. Tom Talpey. Mellanox. Or Gerlitz. Samba team members.
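
Purely to illustrate the IOCTL-based user/kernel split Richard described, here is a hypothetical Python sketch of what talking to such a character device might look like. The device path, ioctl codes and parameter layout are invented for illustration and do not correspond to the actual driver.

    # Hypothetical sketch only: device path, ioctl numbers and structures are invented.
    import fcntl, os, struct

    SMBD_SET_PARAMS = 0x40085301   # placeholder ioctl code
    SMBD_GET_EVENT  = 0x40085302   # placeholder ioctl code

    fd = os.open("/dev/smbdirect", os.O_RDWR)       # placeholder device node
    params = struct.pack("<II", 5445, 1364)         # e.g. listen port and max send size
    fcntl.ioctl(fd, SMBD_SET_PARAMS, params)        # hand connection parameters to the kernel
    event = bytearray(64)
    fcntl.ioctl(fd, SMBD_GET_EVENT, event)          # poll for a receive or send-complete event
    os.close(fd)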

CDMI and Scale Out File System for Hadoop
Philippe Nicolas, Scality

  • Short summary of who Scality is. Founded 2009. HQ in SF. ~60 employees, ~25 engineers in Paris. 24x7 support team. 3 US patents. $35M in 3 rounds.
  • Scality RING. Topology and name of the product. Currently in the 4.2 release. Commodity servers and storage. Support 4 LINUX distributions. Configure Scality layer. Create a large pool of storage.
  • Ring Topology. End-to-end Parallelism. Object Storage. NewSQL DB. Replication. Erasure coding. Geo Redundancy. Tiering. Multiple access methods (HTTP/REST, CDMI, NFS, CIFS, SOFS). GUI/CLI management.
  • Usage: e-mail, file storage, StaaS, Digital Media, Big Data, HPC
  • Access methods: APIs: RS2 (S3 compatible API), Sproxyd, RS2 light, SNIA CDMI. File interface: Scality Scale Out File System (SOFS), NFS, CIFS, AFP, FTP. Hadoop HDFS. OpenStack Cinder (since April 2013).
  • Parallel network file system. Limits are huge – 2^32 volumes (FS), 2^24 namespaces, 2^64 files. Sparse files. Aggregated throughput, auto-scaling with storage or access node addition.
  • CDMI (path and ID based access). Versions 1.0.1, 1.0.2. On github. CDMI client java library (CaDMIum), set of open source filesystem tools. On github. (A small CDMI REST sketch follows this list.)
  • Apache Hadoop. Transform commodity hard in data storage service. Largely supported by industry and end user community. Industry adoption: big names adopting Hadoop.
  • Scality RING for Hadoop. Replace HDFS with the Scality FS. We validate Hortonworks and Cloudera. Example with 12 Hadoop nodes for 12 storage nodes. Hadoop task trackers on RING storage nodes.
  • Data compute and storage platform in ONE cluster. Scality Scale Out File System (SOFS) instead of HDFS. Advanced data protection (data replication up to 6 copies, erasure coding). Integration with Hortonworks HDP 1.0 & Cloudera CDH3/CDH4. Not another Hadoop distribution!
  • Summary: This is Open Cloud Access: access local or remotely via file and block interface. Full CDMI server and client. Hadoop integration (convergence approach). Comprehensive data storage platform.
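
As a small example of the CDMI access method, here is a hedged Python sketch that reads a container listing over REST; the endpoint and container name are placeholders, and authentication is omitted.

    # Sketch: read a CDMI container listing over REST (endpoint is a placeholder, no auth shown).
    import json, urllib.request

    req = urllib.request.Request(
        "https://ring.example.com/cdmi/mycontainer/",
        headers={
            "X-CDMI-Specification-Version": "1.0.2",
            "Accept": "application/cdmi-container",
        },
    )
    with urllib.request.urlopen(req) as resp:
        container = json.load(resp)
    print(container.get("children", []))   # names of the objects inside the container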

Introduction to HP Moonshot
Tracy Shintaku, HP

  • Today’s demands – pervasive computing estimates. Growing internet of things (IoT).
  • Using SoC technologies used in other scenarios for the datacenter.
  • HP Moonshot System. 4U. World’s first low-energy software-defined server. HP Moonshot 1500 Chassis.
  • 45 individually serviceable hot-plug cartridges. 2 network switches, private fabric. Passive base plane.
  • Introducing the HP ProLiant Moonshot Server (passing around the room). 2000 of these servers in a rack. Intel Atom S1260 2GHz, 8GB DDR ECC 1333MHz. 500GB or 1TB HDD or SSD.
  • Single server = 45 servers per chassis. Quad-server = 180 servers per chassis. Compute, storage or combination. Storage cartridges with 2 HDD shared by 8 servers.
  • Rear view of the chassis. Dual 4xQSFP network uplinks (each with 4 x 40Gb), 5 hot-plug fans, power supplies, management module.
  • Ethernet – traffic isolation and stacking for resiliency with dual low-latency switches. 45 servers → dual switches → dual uplink modules.
  • Storage fabric. Different module form factors allow for different options: Local storage. Low cost boot and logging. Distributed storage and RAID. Drive slices reduce cost of a boot drive 87%.
  • Inter-cartridge private 2D Torus Ring – available in future cartridges. High speed communication lanes between servers. Ring fabric, where efficient localized traffic is beneficial.
  • Cartridge roadmap. Today to near future to future. CPU: Atom → Atom, GPU, DSP, x86, ARM. Increasing variety of workloads: static web servers now to hosting and financial servers in the future.
  • Enablement: customer and partner programs. Partner program. Logo wall for technology partners. Solution building program. Lab, services, consulting, financing.
  • Partners include Redhat, Suse, Ubuntu, Hortonworks, MapR, Cloudera, Couchbase, Citrix, Intel, AMD, Calxeda, Applied Micro, TI, Marvell, others. There’s a lot of commonality with OpenStack.
  • Web site: http://h17007.www1.hp.com/us/en/enterprise/servers/products/moonshot/index.aspx 

NFS on Steroids: Building Worldwide Distributed File System
Gregory Touretsky, Intel

  • Intel is a big organization: 6,500 IT employees @ 59 sites, 95,200 employees, 142,000 devices.
  • Every employee doing design has a Windows machine, but also interacts with the NFS backend
  • Remote desktop in interactive pool. Talks to NFS file servers, glued together with name spaces.
  • Large batch pools that do testing. Models stored in NFS.
  • Various application servers, running various systems. Also NIS, Cron servers, event monitors, configuration management.
  • Uses Samba to provide access to NFS file servers using SMB.
  • Many sites, many projects. Diagram with map of the world and multiple projects spanning geographies.
  • Latency between 10s to 100s of ms. Bandwidth: 10s of Mbps to 10s of Gbps.
  • Challenge: how to get to our customers and provide the ability to collaborate across the globe in a secure way
  • Cross-site data access in 2012. Multiple file servers, each with multiple exports. Clients access servers on same site. Rsync++ for replication between sites (time consuming).
  • Global user and group accounts. Users belong to different groups in different sites.
  • Goals: Access every file in the world from anywhere, same path, fault tolerant, WAN friendly, every user account and group on every site, Local IO performance not compromised.
  • Options: OpenAFS (moved out many years ago, decided not to go back). Cloud storage (concern with performance). NFS client-side caching (does not work well, many issues). WAN optimization (have some in production, help with some protocols, but not suitable for NFS). NFS site-level caching (proprietary and open source NFS Ganesha). In house (decided not to go there).
  • Direct NFS mount over WAN optimized tunnel. NFS ops terminated at the remote site, multiple potential routes, cache miss. Not the right solution for us.
  • Select: Site-level NFS caching and Kerberos. Each site has NFS servers and Cache servers. Provides instant visibility and minimizes amount of data transfer across sites.
  • Cache is also writable. Solutions with write-through and write-back caching.
  • Kerberos authentication with NFS v3. There are some problems there.
  • Cache implementations: half a dozen vendors, many not suitable for WAN. Evaluating alternatives.
  • Many are unable to provide a disconnected mode of operation. That eliminated a number of vendors.
  • Consistency vs. performance. Attribute cache timeout. Nobody integrates this at the directory level. Max writeback delay.
  • Optimizations. Read cache vs. Read/Write cache, delegations, proactive attribute validation for hot files, cache pre-population.
  • Where is it problematic? Application is very NFS unfriendly, does not work well with caching. Some cases it cannot do over cache, must use replication.
  • Problems: Read once over high latency link. First read, large file, interactive work. Large % of non-cacheable ops (write-through). Seldom access, beyond cache timeout.
  • Caching is not a business continuity solution. Only a partial copy of the data.
  • Cache management implementation. Doing it at scale is hard. Nobody provides a solution that fits our needs.
  • Goal: self-service management for data caching. Today administrators are involved in the process.
  • Use cases: cache my disk at site X, modify cache parameters, remove cache, migrate source/cache, get cache statistics, shared capacity management, etc.
  • Abstract the differences between the different vendors with this management system.
  • Management system example: report with project, path (mount point), size, usage, cached cells. Create cell in specific site for specific site.
  • Cache capacity planning. Goal: every file to be accessible on-demand everywhere.
  • Track cache usage by org/project. Shared cache capacity, multi-tenant. Initial rule of thumb: 7-10% of the source capacity, seeding capacity at key locations
  • Usage models: Remote validation (write once, read many). Get results back from remote sites (write once, read once). Drop box (generate in one site, get anywhere). Single home directory (avoid home directory in every site for every user, cache remote home directories). Quick remote environment setup, data access from branch location.
  • NFS (RPC) Authentication. Comparing AUTH_SYS and RPCSEC_GSS (KRB5). The second one uses an external KDC and gets past the AUTH_SYS limitation of 16 group IDs. (A quick group-count check follows this list.)
  • Bringing Kerberos? Needs to make sure this works as well as Windows with Active Directory. Need to touch everything (Linux client, NFS file servers, SSH, batch scheduler, remote desktop/interactive servers, name space/automounter (trusted hosts vs. regular hosts), Samba (used as an SMB gateway to NFS), setuid/sudo, cron jobs and service accounts (keytab management system),
  • Supporting the transition from legacy mounts to Kerberos mount. Must support a mixed environment. Introducing second NIS domain.
  • Welcome on board GDA airlines (actual connections between different sites). Good initial feedback from users (works like magic!)
  • Summary: NFS can be accessed over WAN – using NFS caching proxy. NFSv3 environment can be kerberized (major effort is required, transition is challenging, it would be as challenging for NFSv5/KRB)
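
As a quick illustration of the AUTH_SYS limitation mentioned above, this tiny check shows whether the current account would exceed the 16 supplementary group IDs that fit in an AUTH_SYS credential.

    # AUTH_SYS carries at most 16 supplementary group IDs; check the current account (Unix only).
    import os

    groups = os.getgroups()
    print("User is in %d groups" % len(groups))
    if len(groups) > 16:
        print("AUTH_SYS would truncate this list; RPCSEC_GSS (Kerberos) avoids the limit")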

Forget IOPS: A Proper Way to Characterize & Test Storage Performance
Peter Murray, SwiftTest

  • About what we learned in the last few years
  • Evolution: Vendor IOPs claims, test in production and pray, validate with freeware tools (iometer, IOZone), validate with workload models
  • What is storage validation? Characterize the various applications, workloads. Diagram: validation appliance, workload emulations, storage under test.
  • Why should you care? Because customers do care! Product evaluations, vendor bakeoffs, new feature and technology evaluations, etc…
  • IOPS: definition from the SNIA dictionary. Not really well defined. One size does not fit all. Looking at different sizes.
  • Real IO does not use a fixed size. Read/write may be a small portion of it in certain workloads. RDMA read/write may erode the usefulness of isolated read/write.
  • Metadata: data about data. Often in excess of 50%, sometimes more than 90%. GoDaddy mentioned that 94% of workloads are not read/write.
  • Reducing metadata impact: caching with ram, flash, SSD helps but it’s expensive.
  • Workloads: IOPS, metadata and your access pattern. Write/read, random/sequential, IO/metadata, block/chunk size, etc.
  • The importance of workloads: Understand overload and failure conditions. Understand server, cluster, deduplication, compression, network configuration and conditions
  • Creating and understanding workloads. Access patterns (I/O mix: read/write%, metadata%), file system (depth, files/folder, file size distribution), IO parameters (block size, chunk size, direction), load properties (number of users, actions/second, load variability/time). (A tiny workload-model sketch follows this list.)
  • Step 1 - Creating a production model. It’s an art, working to make it a science. Production stats + packet captures + pre-built test suites = accurate, realistic work model.
  • Looking at various workload analysis
  • Workload re-creation challenges: difficult. Many working on these. Big data, VDI, general VM, infinite permutations.
  • Complex workloads emulation is difficult and time consuming. You need smart people, you need to spend the time.
  • Go Daddy shared some of the work on simulation of their workload. Looking at diagram with characteristics of a workload.
  • Looking at table with NFSv3/SMB2 vs. file action distribution.
  • Step 2: Run workload model against the target.
  • Step 3: Analyze the results for better decisions. Analytics leads to Insight. Blocks vs. file. Boot storm handing, limits testing, failure modes, effects of flash/dedup/tiering/scale-out.
  • I think we’ll see dramatic changes with the use of Flash. Things are going to change in the next few years.
  • Results analysis: Performance. You want to understand performance, spikes during the day, what causes them. Response times, throughput.
  • Results analysis: Command mix. Verify that the execution reflects the expected mix. Attempts, successes, errors, aborts.
  • Summary: IOPs alone cannot characterize real app storage performance. Inclusion of metadata is essential; workload modeling and purpose-built load generation appliances are the way to emulate applications. The more complete the emulation, the deeper the understanding.
  • If we can reduce storage cost from 40% to 20% of the solution by better understanding the workload, you can save a lot of money.
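
To make the idea of a workload model concrete, here is a tiny Python sketch of how an operation mix might be described and sampled. The percentages and sizes are invented examples, not numbers from the talk.

    # Tiny workload-model sketch: the operation mix and sizes are invented examples.
    import random

    model = {
        "read":    {"weight": 0.25, "size": 8192},
        "write":   {"weight": 0.10, "size": 8192},
        "getattr": {"weight": 0.40, "size": 0},   # metadata-heavy, as the talk emphasizes
        "lookup":  {"weight": 0.20, "size": 0},
        "create":  {"weight": 0.05, "size": 0},
    }

    ops = list(model)
    weights = [model[op]["weight"] for op in ops]

    def next_op():
        # Sample the next operation according to the mix.
        return random.choices(ops, weights=weights)[0]

    print([next_op() for _ in range(10)])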

pNFS, NFSv4.1, FedFS and Future NFS Developments
Tom Haynes, NetApp

  • Tom covering for Alex McDonald, who is sick. His slides.
  • We want to talk about how the protocol gets defined, how it interacts with different application vendors and customers.
  • Looking at what is happening on the Linux client these days.
  • NFS: Ubiquitous and everywhere. NFSv3 is very successful, we can’t dislodge it. We thought everyone would go for NFSv4 and it’s now 10 years later…
  • NFSv2 in 1983, NFSv3 in 1995, NFSv4 in 2003, NFSv4.1 in 2010. NFSv4.2 to be agreed at the IETF – still kinks in the protocol that need to be ironed out. 2000=DAS, 2010=NAS, 2020=Scale-Out
  • Evolving requirements. Adoption is slow. Lack of clients was a problem with NFSv4. NFSv3 was just “good enough”. (It actually is more than good enough!)
  • Industry is changing, as are requirements. Economic trends (cheap and fast cluster, cheap and fast network, etc…)
  • Performance: NFSv3 single threaded bottlenecks in applications (you can work around it).
  • Business requirements. Reliability (sessions) is a big requirement
  • NFSv4 and beyond.
  • Areas for NFSv4, NFSv4.1 and pNFS: Security, uniform namespaces, statefulness/sessions, compound operations, caching (directory and file delegations), parallelization (layouts and pNFS)
  • Future NFSv4.2 and FedFS (Global namespace; IESG has approved Dec 2012)
  • NFSv4.1 failed to talk to the applications and customers and ask what they needed. We did that for NFSv4.2
  • Selecting the application for NFSv4.1, planning, server and client availability. High level overview
  • Selecting the parts: 1 – NFSv4.1 compliant server (Files, blocks or objects?), 2-compliant client. The rise of the embedded client (Oracle, VMware). 3 – Auxiliary tools (Kerberos, DNS, NTP, LDAP). 4 – If you can, use NFS v4.1 over NFSv4.
  • If you’re implementing something today, skip NFS v4 and go straight to NFS v4.1 (see the mount sketch after this list)
  • First task: select an application: Home directories, HPC applications.
  • Don’t select: Oracle (use dNFS built in), VMware and other virtualization tools (NFSv3). Oddball apps that expect to be able to internally manage NFSv3 “maps”. Any application that required UDP, since v4.1 doesn’t support anything but TCP.
  • NFSv4 stateful clients. Gives client independence (client has state). Allows delegation and caching. No automounter required, simplified locking
  • Why? Compute nodes work best with local data, NFSv4 eliminates the need for local storage, exposes more of the backed storage functionality (hints), removes stale locks (major source of NFSv3 irritation)
  • NFSv4.1 Delegations. Server delegates certain responsibilities to the client (directory and file, caching). Read and write delegation. Allows clients to locally service operations (open, close, lock, etc.)
  • NFSv4.1 Sessions. In v3, server never knows if client got the reply message. In v4.1, sessions introduced.
  • Sessions: Major protocol infrastructure change. Exactly once semantics (EOS), bounded size of reply cache. Unlimited parallelism. Maintains server’s state relative to the connections belonging to a client.
  • Use delegation and caching transparently; client and server provide transparency. Session lock clean up automatically.
  • NFSv4 Compound operations – NFSv3 protocol can be “chatty”, unsuitable for WANs with poor latency. Typical NFSv3: open, read & close a file. Compounds many operations into one to reduce wire time and simple error recovery.
  • GETATTR is the bad boy. We spent 10 years with the Linux client to get rid of many of the GETATTR (26% of SPECsfs2008).
  • NFSv4 Namespace. Uniform and “infinite” namespace. Moving from user/home directories to datacenter and corporate use. Meets demand for “large scale” protocol. Unicode support for UTF-8 codepoints. No automounter required (simplifies administration). Pseudo-file system constructed by the server.
  • Looking at NFSv4 Namespace example. Consider the flexibility of pseudo-filesystems to permit easier migration.
  • NFSv4 I18N Directory and File Names. Uses UTF-8; check filenames for compatibility. Review existing NFSv3 names to ensure they are 7-bit ASCII clean.
  • NFSv4 Security. Strong security framework. ACLs for security and Windows compatibility. Security with Kerberos. NFSv4 can be implemented without Kerberos security, but not advisable.
  • Implementing without Kerberos (no security is a last resort!). NFSv4 represents users/groups as strings (NFSv3 used 32-bit integers, UID/GID). Requires UID/GID to be converted to all numeric strings.
  • Implementing with Kerberos. Find a security expert. Consider using Windows AD Server.
  • NFSv4 Security. Firewalls. NFSv4 has no auxiliary protocols. Uses port 2049 with TCP only. Just open that port.
  • NFSv4 Layouts. Files, objects and block layouts. Flexibility for storage that underpins it. Location transparent. Layouts available from various vendors.
  • pNFS. Can aggregate bandwidth. Modern approach, relieves issues associated with point-to-point connections.
  • pNFS Filesystem implications.
  • pNFS terminology. Important callback mechanism to provide information about the resource.
  • pNFS: Commercial server implementations. NetApp has it. Panasas is in the room as well. Can’t talk about other vendors…
  • Going very fast through a number of slides on pNFS: NFS client mount, client to MDS, MDS Layout to NFS client, pNFS client to DEVICEINFO from MDS, etc.
  • In summary: Go adopt NFS 4.1, it’s the greatest thing since sliced bread, skip NFS 4.0
  • List of papers and references. RFCs: 1813 (NFSv3), 3530 (NFSv4), 5661 (NFSv4.1), 5663 (NFSv4.1 block layout), 5664 (NFSv4.1 object layout)

pNFS Directions / NFSv4 Agility
Adam Emerson, CohortFS, LLC

  • Stimulate discussion about agility as a guiding vision for future protocol evaluation
  • NFSv4: A standard file access/storage protocol that is agile
  • Incremental advances shouldn’t require a new access protocol. Capture more value from the engineering already done. Retain broad applicability, yet adapt quickly to new challenges/opportunities
  • NFSv4 has delivered (over 10+ years of effort) on a set of features designers had long aspired to: atomicity, consistency, integration, referrals, single namespaces
  • NFSv4 has sometimes been faulted for delivering slowly and imperfectly on some key promises: flexible and easy wire security, capable and interoperable ACLs, RDMA acceleration
  • NFSv4 has a set of Interesting optional features not widely implemented: named attributes, write delegations, directory delegations, security state verifier, retention policy
  • Related discussion in the NFSv4 Community (IETF): The minor version/extension debate: de-serializing independent, potentially parallel extension efforts, fixing defects in prior protocol revisions, rationalizing past and future extension mechanisms
  • Related discussion in the NFSv4 Community (IETF): The extensions draft leaves options open, but prescribes: a process to support development of new feature proposals in parallel, capability negotiation, experimentation
  • Embracing agility: Noveck formulation is subtle: rooted in NFS and WG, future depends on participants, can encompass but perhaps does not call out for an agile future.
  • Capability negotiation and experimental codepoint ranges strongly support agility. What we really want is a model that encourages movement of features from private experimentation to shared experimentation to standardization.
  • Efforts promoting agility: user-mode (and open source) NFSv4 servers (Ganesha, others?) and clients (CITI Windows NFSv4.1 client, library client implementations)
  • Some of the people from the original CITI team are now working with us and continuing to work on it
  • library client implementations: Allow novel semantics and features like pre-seeding of files, HPC workloads, etc.
  • NFSv4 Protocol Concepts promoting agility: Not just new RPCs and union types.
  • Compound: Grouping operations with context operations. Context evolves with operations and inflects the operations. It could be pushed further…
  • Named Attributes: Support elaboration of conventions and even features above the protocol, with minimal effort and coordination. Subfiles, proplists. Namespace issues: System/user/other, non-atomicity, not inlined.
  • Layout: Powerful structuring concept carrying simplified transaction pattern. Typed, Operations carry opaque data nearly everywhere, application to data striping compelling.
  • Futures/experimental work – some of them are ridiculous and I apologize in advance
  • pNFS striping flexibility/flexible files (Halevy). Per-file striping and specific parity applications to file layout. OSDv2 layout, presented at IETF 87.
  • pNFS metastripe (Eisler, further WG drafts). Scale-out metadata and parallel operations for NFSv4. Generalizing parallel access concept of NFSv4 for metadata. Built on layout and attribute hints. CohortFS prototyping metastripe on a parallel version of the Ceph file system. NFSv4 missing a per-file redirect, so this has file redirection hints.
  • End-to-end Data Integrity (Lever/IBM). Add end-to-end data integrity primitives (NFSv4.2). Build on new READ_PLUS and WRITE ops. Potentially high value for many applications.
  • pNFS Placement Layouts (CohortFS). Design for algorithmic placement in pNFS layout extension. OSD selection and placement computed by a function returned at GETDEVICEINFO. Client execution of placement codes, complex parity, volumes, etc.
  • Replication Layouts (CohortFS). Client based replication with integrity. Synchronous wide-area replication. Built on Layout.
  • Client Encryption (CohortFS). Relying on named attribute extension only, could use atomicity. Hopefully combined with end-to-end integrity being worked on
  • Cache consistency. POSIX/non-CTO recently proposed (eg, Eshel/IBM). Potentially, more generality. Eg, flexible client cache consistency models in NFSv4. Add value to existing client caching like CacheFS.
  • New participants. You? The future is in the participants…

Closing Keynote: Worlds Colliding: Why Big Data Changes How to Think about Enterprise Storage
Addison Snell, CEO, Intersect360 Research

  • Analyst, researching big data. There are hazards in forecasting big data. Big Data goes beyond Hadoop.
  • What is the real opportunity? Hype vs. Reality. Perception: HPC is not something I want in my Enterprise computing environment
  • Technical computing vs. Enterprise computing. Both defined as mission-critical.
  • Vendors do not know how much money is going on each, we need to do a lot of user research.
  • Technical computing is about the top-line mission of the business. Enterprise computing is about keeping the business running.
  • TC is driven by price/performance, faster adoption of new tech. EC is driven by RAS (reliability, availability, serviceability), slow adoption of new tech.
  • Survey results, 278 respondents, April to August 2013. Built on earlier surveys. 178 technical, 100 enterprise, 165 commercial, 67 academic, 46 government
  • Insight 1: big data is a big opportunity. Money is being spent. But people that do not do big data are not likely to answer the survey…
  • Some to spend 25% of total IT budget in 2013. Use caution when describing the big data market. Not sure about what’s being counted.
  • What are big data applications? Vendors talk “Hadoop” and “graph”. Users talk about “analyze” and “algorithm”. ISV software for big data is thinly scattered.
  • Insight #2: it’s not just Hadoop: Out of 574 apps mentioned in the survey, about half is internally developed and this grew from last year. Only 75 mentioned Hadoop in 2012 and it’s going down.
  • It’s like HPC, top 20 applications cover only 40% of the market. It’s like measuring the tail of a snake…
  • Defining the opportunity looking at IT budget growth. Expect two year change in IT budget. Look at “satisfaction gaps”.
  • Insight #3: Performance counts. Top three satisfaction gaps are IO performance, storage capacity and RAS. For both TC and EC.
  • HPC-like mentality is starting to creep into the enterprise. Big data will be a driver for expanded usage of HPC, IF they can still meet enterprise requirements.
  • Faulty logic: I like sushi, I like ice cream, therefore I like sushi-flavored ice cream… Big data and public cloud do not necessarily go together.
  • Maybe private cloud is better. Lead with private cloud and burst some of it to the public cloud.
  • Technologies in the discussion. Storage beyond spinning disks (flash for max IO, tape for capacity), parallel file systems, high speed fabrics (InfiniBand), MPI, large shared memory spaces, accelerators.

Tunneling SCSI over SMB: Shared VHDX files for Guest Clustering in Windows Server 2012 R2
Jose Barreto, Microsoft
Matt Kurjanowicz, Microsoft

  • Could not take notes while presenting :-)
  • It was really good, though…

An SMB3 Engineer’s View of Windows Server 2012 Hyper-V Workloads
Gerald Carter, EMC

  • FAS paper on Virtual Machine workloads
  • Does the current workload characterization done for NFS work for SMB3?
  • What do the IO and jump distance patterns look like?
  • What SMB3 protocol features does Hyper-V use?
  • Original scope abstract was reduced due to time constraint and hardware availability (no SMB Direct, no SMB Multichannel, no failover)
  • Building of tools that look into this using just pcap files.
  • Hardware: couple of quad-core Intel Q9650 @ 3GHz, 16GB RAM, SATA II disks, networking is 1Gbps, Windows Server 2012 (not R2)
  • OS on the VMs: Ubuntu 12.04 x64, Windows XP SP3 x86, Windows Server 2012 Standard
  • Deployment to SMB file share. Sharing PowerShell scripts.
  • Scenarios. 1: Boot host idle for 10 minutes. Scenario 2: Build Linux Kernel. 3: Random file operations (different size file copies)
  • Cold boot, start packet capture, cold boot hyper-v host, launch vm, cleanly shutdown scenario, stop packet capture.
  • Wanted to understand distribution. Read/write mix, how much of each command was used. Distribution of read size/write size/jumps.
  • Sharktools: Python extension library for data analysis. Example code: import pyshark.
  • Dictionary composition: request/response by exchanges by file handles.
  • Scenario 1 - Single client boot
  • Boot, leave it running for 10 minutes, then shutdown.
  • Huge table with lots of results. You expected lots of reads and writes. Large number of query info – hypervisor queries configuration files over and over. Lots of reads (second place), then writes.
  • Another huge table with command occurrences. More IOs in Windows Server 2012 than on Windows XP and Linux.
  • VHDX handle (one slide for Linux, Windows XP, Windows Server 2012). One handle  has most of the operations.
  • Did not look into the create options.
  • IO Size distribution (one slide for Linux, Windows XP, Windows Server 2012). Theory: read and write sizes relate to the latency of the network. Most frequent; 4KB, 32KB, 128KB.
  • Jump distance. Tables with Time, Jump distance, offset, length. Scatter plot for the three operating systems.
  • Trying to measure the degree of randomness over time. Linux is more sequential, maybe due to the nature of the file system. Not authoritative, just an observation.
  • Comment from audience: different styles of booting Linux, some more sequential read, some more parallel.
  • Boot – single host summary. All read/writes are multiple of 512. 4KB, 32KB, 128KB are favorites. Size and jump distance distribution changes with guest OS (file system).
  • Multiple persistent handles (4 or more) opened per VM instance.
  • Scenario 2 – Linux kernel compile
  • Command distribution. Consistent – 24% query info, 33% read, 33% write.
  • Command occurrence.
  • IO size: Lots of 4KB and 128KB reads. Big spikes in writes (lots of 128KB).
  • Jump distances: Read and write charts. Random IO, no real surprise.
  • Scenario 3 – Random file operations
  • Guest using SCSI virtual disk.
  • Command distribution. 62% read, 32% write, IOCTL: 3.1%  (request for resilient handles, trim), queryinfo 1.3%.
  • Random tree copy. IO Size: Lots of 4KB, some 1MB writes.
  • Jump – random tree copy. Nice and sequential. Looks like a spaceship, or the Eiffel tower sideways. Two threads contending for the same file? Metadata vs. data?
  • Closing thoughts. Heavy query info. Not as many large size IOs as expected. Multiple long lived file handles per guest.
  • Future work: Include SMB Direct, SMB Multichannel, Failover. Include more workloads. More generalized SMB3 workloads.

 That's it for this year. I'm posting this last update from the San Jose airport, on my way back to Redmond...

Windows Server 2012 R2: Which version of the SMB protocol (SMB 1.0, SMB 2.0, SMB 2.1, SMB 3.0 or SMB 3.02) are you using?


Note: This blog post is a Windows Server 2012 R2 update on a previous version focused on Windows Server 2012.

 

1. Introduction

With the release of Windows 8.1 and Windows Server 2012 R2, I am frequently asked about how older versions of Windows will behave when connecting to or from these new versions. Upgrading to a new version of SMB is something that happened a few times over the years and we established a process in the protocol itself by which clients and servers negotiate the highest version that both support.

 

2. Versions

There are several different versions of SMB used by Windows operating systems:

  • CIFS – The ancient version of SMB that was part of Microsoft Windows NT 4.0 in 1996. SMB1 supersedes this version.
  • SMB 1.0 (or SMB1) – The version used in Windows 2000, Windows XP, Windows Server 2003 and Windows Server 2003 R2
  • SMB 2.0 (or SMB2) – The version used in Windows Vista (SP1 or later) and Windows Server 2008
  • SMB 2.1 (or SMB2.1) – The version used in Windows 7 and Windows Server 2008 R2
  • SMB 3.0 (or SMB3) – The version used in Windows 8 and Windows Server 2012
  • SMB 3.02 (or SMB3) – The version used in Windows 8.1 and Windows Server 2012 R2

Windows NT is no longer supported, so CIFS is definitely out. Windows Server 2003 R2 with a current service pack is under Extended Support, so SMB1 is still around for a little while. SMB 2.x in Windows Server 2008 and Windows Server 2008 R2 is under Mainstream Support until 2015. You can find the most current information on the support lifecycle page for Windows Server. The information is subject to the Microsoft Policy Disclaimer and Change Notice. You can use the support pages to also find support policy information for Windows XP, Windows Vista, Windows 7 and Windows 8.

In Windows 8.1 and Windows Server 2012 R2, we introduced the option to completely disable CIFS/SMB1 support, including the actual removal of the related binaries. While this is not the default configuration, we recommend disabling this older version of the protocol in scenarios where it’s not useful, like Hyper-V over SMB. You can find details about this new option in item 7 of this blog post: What’s new in SMB PowerShell in Windows Server 2012 R2.
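As a quick illustration (my example, not from the post linked above), here is how SMB1 can be turned off on a Windows Server 2012 R2 file server, either at the configuration level or by removing the optional component entirely. Test this in a lab first, since any remaining SMB1-only clients or devices will lose access to the server:

# Stop accepting SMB1 connections (the binaries stay on the system)
Set-SmbServerConfiguration -EnableSMB1Protocol $false

# Or remove the SMB1 optional component completely (Windows Server 2012 R2)
Remove-WindowsFeature -Name FS-SMB1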

 

3. Negotiated Versions

Here’s a table to help you understand what version you will end up using, depending on what Windows version is running as the SMB client and what version of Windows is running as the SMB server:

SMB Client \ SMB Server      | Windows 8.1 / WS 2012 R2 | Windows 8 / WS 2012 | Windows 7 / WS 2008 R2 | Windows Vista / WS 2008 | Previous versions
Windows 8.1 / WS 2012 R2     | SMB 3.02                 | SMB 3.0             | SMB 2.1                | SMB 2.0                 | SMB 1.0
Windows 8 / WS 2012          | SMB 3.0                  | SMB 3.0             | SMB 2.1                | SMB 2.0                 | SMB 1.0
Windows 7 / WS 2008 R2       | SMB 2.1                  | SMB 2.1             | SMB 2.1                | SMB 2.0                 | SMB 1.0
Windows Vista / WS 2008      | SMB 2.0                  | SMB 2.0             | SMB 2.0                | SMB 2.0                 | SMB 1.0
Previous versions            | SMB 1.0                  | SMB 1.0             | SMB 1.0                | SMB 1.0                 | SMB 1.0

* WS = Windows Server

  

4. Using PowerShell to check the SMB version

In Windows 8 or Windows Server 2012, there is a new PowerShell cmdlet that can easily tell you what version of SMB the client has negotiated with the File Server. You simply access a remote file server (or create a new mapping to it) and use Get-SmbConnection. Here’s an example:

PS C:\> Get-SmbConnection
 

ServerName   ShareName  UserName            Credential          Dialect   NumOpens
----------   ---------  --------            ----------          -------   --------
FileServer1  IPC$       DomainName\UserN... DomainName.Testi... 3.00      0
FileServer1  FileShare  DomainName\UserN... DomainName.Testi... 3.00      14
FileServ2    FS2        DomainName\UserN... DomainName.Testi... 3.02      3 
VNX3         Share1     DomainName\UserN... DomainName.Testi... 3.00      6
Filer2       Library    DomainName\UserN... DomainName.Testi... 3.00      8
DomainCtrl1  netlogon   DomainName\Compu... DomainName.Testi... 2.10      1

In the example above, a server called “FileServer1” was able to negotiate up to version 3.0. FileServ2 can use version 3.02. That means that both the client and the server support the latest version of the SMB protocol. You can also see that another server called “DomainCtrl1” was only able to negotiate up to version 2.1. You can probably guess that it’s a domain controller running Windows Server 2008 R2. Some of the servers on the list are not running Windows, showing the dialect that these non-Windows SMB implementations negotiated with this specific Windows client.

If you just want to find the version of SMB running on your own computer, you can use a loopback share combined with the Get-SmbConnection cmdlet. Here’s an example:

PS C:\> dir \\localhost\c$
 
Directory: \\localhost\c$

 
Mode                LastWriteTime     Length Name

----                -------------     ------ ----
d----         5/19/2012   1:54 AM            PerfLogs
d-r--          6/1/2012  11:58 PM            Program Files
d-r--          6/1/2012  11:58 PM            Program Files (x86)
d-r--         5/24/2012   3:56 PM            Users
d----          6/5/2012   3:00 PM            Windows
 
PS C:\> Get-SmbConnection -ServerName localhost
 
ServerName  ShareName  UserName            Credential          Dialect  NumOpens
----------  ---------  --------            ----------          -------  --------
localhost   c$         DomainName\UserN... DomainName.Testi... 3.02     0

 

You have about 10 seconds after you issue the “dir” command to run the “Get-SmbConnection” cmdlet. The SMB client will tear down the connections if there is no activity between the client and the server. It might help to know that you can use the alias “gsmbc” instead of the full cmdlet name.
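For example, you can chain the directory listing and the query on a single line so that Get-SmbConnection runs while the connection is still active (FileServer1 and FileShare here are just the sample names used above):

PS C:\> dir \\FileServer1\FileShare | Out-Null; gsmbc -ServerName FileServer1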

 

5. Features and Capabilities

Here’s a very short summary of what changed with each version of SMB:

  • From SMB 1.0 to SMB 2.0 - The first major redesign of SMB
    • Increased file sharing scalability
    • Improved performance
      • Request compounding
      • Asynchronous operations
      • Larger reads/writes
    • More secure and robust
      • Small command set
      • Signing now uses HMAC SHA-256 instead of MD5
      • SMB2 durability
  • From SMB 2.0 to SMB 2.1
    • File leasing improvements
    • Large MTU support
    • BranchCache
  • From SMB 2.1 to SMB 3.0
    • Availability
      • SMB Transparent Failover
      • SMB Witness
      • SMB Multichannel
    • Performance
      • SMB Scale-Out
      • SMB Direct (SMB 3.0 over RDMA)
      • SMB Multichannel
      • Directory Leasing
      • BranchCache V2
    • Backup
      • VSS for Remote File Shares
    • Security
      • SMB Encryption using AES-CCM (Optional)
      • Signing now uses AES-CMAC
    • Management
      • SMB PowerShell
      • Improved Performance Counters
      • Improved Eventing
  • From SMB 3.0 to SMB 3.02
    • Automatic rebalancing of Scale-Out File Server clients
    • Improved performance of SMB Direct (SMB over RDMA)
    • Support for multiple SMB instances on a Scale-Out File Server

You can get additional details on the SMB 2.0 improvements listed above at
http://blogs.technet.com/b/josebda/archive/2008/12/09/smb2-a-complete-redesign-of-the-main-remote-file-protocol-for-windows.aspx

You can get additional details on the SMB 3.0 improvements listed above at
http://blogs.technet.com/b/josebda/archive/2012/05/03/updated-links-on-windows-server-2012-file-server-and-smb-3-0.aspx

You can get additional details on the SMB 3.02 improvements in Windows Server 2012 R2 at
http://technet.microsoft.com/en-us/library/hh831474.aspx

 

6. Recommendation

We strongly encourage you to update to the latest version of SMB, which will give you the most scalability, the best performance, the highest availability and the most secure SMB implementation.

Keep in mind that Windows Server 2012 Hyper-V and Windows Server 2012 R2 Hyper-V only support SMB 3.0 for remote file storage. This is due mainly to the availability features (SMB Transparent Failover, SMB Witness and SMB Multichannel), which did not exist in previous versions of SMB. The additional scalability and performance is also very welcome in this virtualization scenario. The Hyper-V Best Practices Analyzer (BPA) will warn you if an older version is detected.

 

7. Conclusion

We’re excited about SMB3, but we are also always concerned about keeping as much backwards compatibility as possible. Both SMB 3.0 and SMB 3.02 bring several key new capabilities and we encourage you to learn more about them. We hope you will be convinced to start planning your upgrades as early as possible.

 


Note 1: Protocol Documentation

If you consider yourself an SMB geek and you actually want to understand the SMB NEGOTIATE command in greater detail, you can read the [MS-SMB2-Preview] protocol documentation (which covers SMB 2.0, 2.1, 3.0 and 3.02), currently available from http://msdn.microsoft.com/en-us/library/ee941641.aspx. In regards to protocol version negotiation, you should pay attention to the following sections of the document:

  • 1.7: Versioning and Capability Negotiation
  • 2.2.3: SMB2 Negotiate Request
  • 2.2.4: SMB2 Negotiate Response

Section 1.7 includes a nice state diagram describing the inner workings of protocol negotiation.

 

Note 2: Third-party implementations

There are several implementations of the SMB protocol from someone other than Microsoft. If you use one of those implementations of SMB, you should ask whoever is providing the implementation which version of SMB they implement for each version of their product. Here are a few of these implementations of SMB:

Please note that this is not a complete list of implementations and the list is bound to become obsolete the minute I post it. Please refer to the specific implementers for up-to-date information on their specific implementations and which version and optional portions of the protocol they offer.

You also want to review the SNIA Tutorial SMB Remote File Protocol (including SMB 3.0). The SNIA Data Storage Innovation Conference (DSI’14), on April 22-24, 2014, is offering an updated version of this tutorial.

Storage Developer Conference - SDC 2013 slides now publicly available. Here are the links to Microsoft slides...


The Storage Networking Industry Association (SNIA) hosted the 10th Storage Developer Conference (SDC) in the Hyatt Regency in beautiful Santa Clara, CA (Silicon Valley) on the week of September 16th 2013.

This week, the presentation slides were made publicly available. You can find them all at http://snia.org/events/storage-developer2013/presentations13

For those focused on Microsoft technologies, here are some direct links to slides for the talks delivered by Microsoft this year:

  • Advancements in Windows File Systems – Neal Christiansen, Principal Development Lead, Microsoft
  • LRC Erasure Coding in Windows Storage Spaces – Cheng Huang, Researcher, Microsoft Research
  • SMB3 Update – David Kruse, Development Lead, Microsoft
  • Cluster Shared Volumes – Vladimir Petter, Principal Software Design Engineer, Microsoft
  • Tunneling SCSI over SMB: Shared VHDX files for Guest Clustering in Windows Server 2012 R2 – Jose Barreto, Principal Program Manager, Microsoft; Matt Kurjanowicz, Software Development Engineer, Microsoft
  • Windows Azure Storage - Speed and Scale in the Cloud – Joe Giardino, Senior Development Lead, Microsoft
  • SMB Direct update – Greg Kramer, Sr. Software Engineer, Microsoft
  • Scaled RDMA Performance & Storage Design with Windows Server SMB 3.0 – Dan Lovinger, Principal Software Design Engineer, Microsoft
  • Data Deduplication as a Platform for Virtualization and High Scale Storage – Adi Oltean, Principal Software Design Engineer, Microsoft; Sudipta Sengupta, Sr. Researcher, Microsoft

 

Don’t miss the SNIA Storage Developer Conference 2014


It must be September once again… We are only a couple of weeks away from the Storage Developer Conference (SDC) hosted by the Storage Networking Industry Association (SNIA). The event will happen at the Hyatt Regency in beautiful Santa Clara, CA (Silicon Valley) on the week of September 15th. As usual, we’ll also have the SNIA SMB2/SMB3 PlugFest co-located with the SDC event.

For developers working with storage-related technologies, this event gathers a unique crowd and includes a rich agenda that you can find at http://www.storagedeveloper.org. Many of the key industry players are represented and this year’s agenda lists presentations from Dell, EMC, Fujifilm, Fujitsu, GE, Google, HGST, Hitachi, Hortonworks, HP, Huawei, IBM, Inktank, Intel, Mellanox, Microsoft, NetApp, Oracle, Red Hat, Samsung, SanDisk, Seagate, Symantec, Tata, ZTE  and many others.

It’s always worth reminding you that the SDC presentations are usually delivered to developers by the actual product development teams and frequently the actual developer of the technology is either delivering the presentation or is in the room to take questions. That kind of deep insight is not common in every conference out there.

Presentations by Microsoft this year include:

  • StorScore: SSD Qualification for Cloud Applications – Laura Caulfield, Firmware Dev. Engineer 2, Microsoft; Mark Santaniello, Sr. Performance Engineer, Microsoft
  • Introduction to SMB 3.1 – David Kruse, Software Developer, Microsoft; Greg Kramer, Sr. Software Engineer, Microsoft
  • iSCSI Protocol Advancements from IETF Storm WG – Mallikarjun Chadalapaka, Principal Program Manager, Microsoft; Frederick Knight, NetApp
  • Storage Quality of Service for Enterprise Workloads – Tom Talpey, Architect, Microsoft; Eno Thereska, Researcher, Microsoft
  • SPECsfs2014 An Under-the-Hood Review – Spencer Shepler, Architect, Microsoft; Nick Principe, Senior Software Engineer, EMC
  • Private Cloud Storage Management using SMI-S, Windows Server, and System Center – Hector Linares, Principal Program Manager, Microsoft
  • Cloud Scale Testing Infrastructure – Cloud Simulation, Fault Injection and Capacity Planning – Sujit Kuruvilla, Principal Quality Lead, Microsoft; Anitha Adusumilli, Senior Test Lead, Microsoft
  • Evolution of Message Analyzer and Windows Interoperability – Paul Long, Senior Program Manager, Microsoft

For a taste of what SDC presentations look like, make sure to visit the site for last year’s event, where you can find PDF files for most talks. Download them from http://www.snia.org/events/storage-developer2013/presentations13.

Registration for SDC 2014 is still open at http://www.storagedeveloper.org and you should definitely plan to attend. If you are registered, leave a comment and let’s plan to meet when we get there!

The Deprecation of SMB1 – You should be planning to get rid of this old SMB dialect


I regularly get a question about when will SMB1 be completely removed from Windows. This blog post summarizes the current state of this old SMB dialect in Windows client and server.

 

1) SMB1 is deprecated, but not yet removed

We already added SMB1 to the Windows Server 2012 R2 deprecation list in June 2013. That does not mean it’s fully removed, but that the feature is “planned for potential removal in subsequent releases”. You can find the Windows Server 2012 R2 deprecation list at https://technet.microsoft.com/en-us/library/dn303411.aspx.

 

2) Windows Server 2003 is going away

The last supported Windows operating system that can only negotiate SMB1 is Windows Server 2003. All other currently supported Windows operating systems (client and server) are able to negotiate SMB2 or higher. Windows Server 2003 support will end on July 14 of this year, as you probably heard.

 

3) SMB versions in current releases of Windows and Windows Server

Aside from Windows Server 2003, all other versions of Windows (client and server) support newer versions of SMB:

  • Windows Server 2008 or Windows Vista – SMB1 or SMB2
  • Windows Server 2008 R2 or Windows 7 – SMB1 or SMB2
  • Windows Server 2012 and Windows 8 – SMB1, SMB2 or SMB3
  • Windows Server 2012 R2 and Windows 8.1 – SMB1, SMB2 or SMB3

For details on specific dialects and how they are negotiated, see this blog post on SMB dialects and Windows versions.


4) SMB1 removal in Windows Server 2012 R2 and Windows 8.1

In Windows Server 2012 R2 and Windows 8.1, we made SMB1 an optional component that can be completely removed. That optional component is enabled by default, but a system administrator now has the option to completely disable it. For more details, see this blog post on how to completely remove SMB1 in Windows Server 2012 R2.
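If you want to confirm where a given server stands before or after making that change, a couple of read-only checks help (my example for Windows Server 2012 R2, not taken from the post linked above):

# Is the SMB1 optional component still installed?
Get-WindowsFeature -Name FS-SMB1

# Does the SMB server still accept SMB1 in its configuration?
Get-SmbServerConfiguration | Select-Object EnableSMB1Protocol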

 

5) SMB1 removal in Windows 10 Technical Preview and Windows Server Technical Preview

SMB1 will continue to be an optional component enabled by default with Windows 10, which is scheduled to be released in 2015. The next version of Windows Server, which is expected in 2016, will also likely continue to have SMB1 as an optional component enabled by default. In that release we will add an option to audit SMB1 usage, so IT Administrators can assess if they can disable SMB1 on their own.

 

6) What you should be doing about SMB1

If you are a systems administrator and you manage IT infrastructure that relies on SMB1, you should prepare to remove SMB1.  Once Windows Server 2003 is gone, the main concern will be third party software or hardware like printers, scanners, NAS devices and WAN accelerators. You should make sure that any new software and hardware that requires the SMB protocol is able to negotiate newer versions (at least SMB2, preferably SMB3). For existing devices and software that only support SMB1, you should contact the manufacturer for updates to support the newer dialects.
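One practical way to find those dependencies (my suggestion, not part of the original post) is to look at the dialect negotiated by the sessions currently connected to each file server; anything reporting a 1.x dialect is a client or device that still needs attention:

# Run on the file server: group the active SMB sessions by negotiated dialect
Get-SmbSession | Group-Object -Property Dialect | Sort-Object -Property Name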

If you are a software or hardware manufacturer that has a dependency on the SMB1 protocol, you should have a clear plan for removing any such dependencies. Your hardware or software should be ready to operate in an environment where Windows clients and servers only support SMB2 or SMB3. While it’s true that today SMB1 still works in most environments, the fact that the feature is deprecated is a warning that it could go away at any time.

 

7) Complete removal of SMB1

Since SMB1 is a deprecated component, we will assess for its complete removal with every new release.


Drive Performance Report Generator – PowerShell script using DiskSpd by Arnaud Torres


Arnaud Torres is a Senior Premier Field Engineer at Microsoft in France who sent me the PowerShell script below called “Drive Performance Report Generator”.

He created the script to test a wide range of profiles in one run to allow people to build a baseline of their storage using DiskSpd.EXE.

The script is written in PowerShell v1 and was tested on a Windows Server 2008 SP2 (really!), Windows Server 2012 R2 and Windows 10.

It displays results in real time, is highly documented and creates a text report which can be imported as CSV in Excel.

 

Thanks to Arnaud for sharing!

 

———————-

 

# Drive performance Report Generator
# by Arnaud TORRES
# Microsoft provides script, macro, and other code examples for illustration only, without warranty either expressed or implied, including but not
# limited to the implied warranties of merchantability and/or fitness for a particular purpose. This script is provided 'as is' and Microsoft does not
# guarantee that the following script, macro, or code can be used in all situations.
# Script will stress your computer CPU and storage, be sure that no critical workload is running

# Clear screen
Clear

write-host "DRIVE PERFORMANCE REPORT GENERATOR" -foregroundcolor green
write-host "Script will stress your computer CPU and storage layer (including network if applicable !), be sure that no critical workload is running" -foregroundcolor yellow
write-host "Microsoft provides script, macro, and other code examples for illustration only, without warranty either expressed or implied, including but not limited to the implied warranties of merchantability and/or fitness for a particular purpose. This script is provided 'as is' and Microsoft does not guarantee that the following script, macro, or code can be used in all situations." -foregroundcolor darkred
"   "
"Test will use all free space on drive minus 2 GB !"
"If there are less than 4 GB free test will stop"

# Disk to test
$Disk = Read-Host 'Which disk would you like to test ? (example : D:)'
# $Disk = "D:"
if ($disk.length -ne 2){"Wrong drive letter format used, please specify the drive as D:"
                         Exit}
if ($disk.substring(1,1) -ne ":"){"Wrong drive letter format used, please specify the drive as D:"
                         Exit}
$disk = $disk.ToUpper()

# Reset test counter
$counter = 0

# Use 1 thread / core
$Thread = "-t"+(Get-WmiObject win32_processor).NumberofCores

# Set time in seconds for each run
# 10-120s is fine
$Time = "-d1"

# Outstanding IOs
# Should be 2 times the number of disks in the RAID
# Between 8 and 16 is generally fine
$OutstandingIO = "-o16"

# Disk preparation
# Delete testfile.dat if it exists
# The test will use all free space -2GB

$IsDir = test-path -path "$Disk\TestDiskSpd"
$isdir
if ($IsDir -like "False"){new-item -itemtype directory -path "$Disk\TestDiskSpd\"}
# Just a little security, in case we are working on a compressed drive ...
compact /u /s $Disk\TestDiskSpd\

$Cleaning = test-path -path "$Disk\TestDiskSpd\testfile.dat"
if ($Cleaning -eq "True")
{"Removing current testfile.dat from drive"
  remove-item $Disk\TestDiskSpd\testfile.dat}

$Disks = Get-WmiObject win32_logicaldisk
$LogicalDisk = $Disks | where {$_.DeviceID -eq $Disk}
$Freespace = $LogicalDisk.freespace
$FreespaceGB = [int]($Freespace / 1073741824)
$Capacity = $freespaceGB - 2
$CapacityParameter = "-c"+$Capacity+"G"
$CapacityO = $Capacity * 1073741824

if ($FreespaceGB -lt "4")
{
       "Not enough space on the Disk ! More than 4GB needed"
       Exit
}

write-host " "
$Continue = Read-Host "You are about to test $Disk which has $FreespaceGB GB free, do you want to continue ? (Y/N) "
if ($continue -ne "y" -and $continue -ne "Y"){"Test Cancelled !!"
                                        Exit}

"   "
"Initialization can take some time, we are generating a $Capacity GB file..."
"  "

# Initialize output file
$date = get-date

# Add the tested disk and the date in the output file
"Disk $disk, $date" >> ./output.txt

# Add the headers to the output file
"Test N#, Drive, Operation, Access, Blocks, Run N#, IOPS, MB/sec, Latency ms, CPU %" >> ./output.txt

# Number of tests
# Multiply the number of loops to change this value
# By default there are : (4 blocks sizes) X (2 for read 100% and write 100%) X (2 for Sequential and Random) X (4 Runs of each)
$NumberOfTests = 64

"  "
write-host "TEST RESULTS (also logged in .\output.txt)" -foregroundcolor yellow

# Begin Tests loops

# We will run the tests with 4K, 8K, 64K and 512K blocks
(4,8,64,512) | % {
$BlockParameter = ("-b"+$_+"K")
$Blocks = ("Blocks "+$_+"K")

# We will do Read tests and Write tests
  (0,100) | % {
      if ($_ -eq 0){$IO = "Read"}
      if ($_ -eq 100){$IO = "Write"}
      $WriteParameter = "-w"+$_

# We will do random and sequential IO tests
  ("r","si") | % {
      if ($_ -eq "r"){$type = "Random"}
      if ($_ -eq "si"){$type = "Sequential"}
      $AccessParameter = "-"+$_

# Each run will be done 4 times
  (1..4) | % {

      # The test itself (finally !!)
         $result = .\diskspd.exe $CapacityParameter $Time $AccessParameter $WriteParameter $Thread $OutstandingIO $BlockParameter -h -L $Disk\TestDiskSpd\testfile.dat

      # Now we will break the very verbose output of DiskSpd in a single line with the most important values
      foreach ($line in $result) {if ($line -like "total:*") { $total=$line; break } }
      foreach ($line in $result) {if ($line -like "avg.*") { $avg=$line; break } }
      $mbps = $total.Split("|")[2].Trim()
      $iops = $total.Split("|")[3].Trim()
      $latency = $total.Split("|")[4].Trim()
      $cpu = $avg.Split("|")[1].Trim()
      $counter = $counter + 1

      # A progress bar, for the fun
      Write-Progress -Activity ".\diskspd.exe $CapacityParameter $Time $AccessParameter $WriteParameter $Thread $OutstandingIO $BlockParameter -h -L $Disk\TestDiskSpd\testfile.dat" -status "Test in progress" -percentComplete ($counter / $NumberOfTests * 100)

      # Remove comment to check command line ".\diskspd.exe $CapacityParameter $Time $AccessParameter $WriteParameter $Thread $OutstandingIO $BlockParameter -h -L $Disk\TestDiskSpd\testfile.dat"

      # We output the values to the text file
      "Test $Counter,$Disk,$IO,$type,$Blocks,Run $_,$iops,$mbps,$latency,$cpu"  >> ./output.txt

      # We output a verbose format on screen
      "Test $Counter, $Disk, $IO, $type, $Blocks, Run $_, $iops iops, $mbps MB/sec, $latency ms, $cpu CPU"
}
}
}
}
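Two small usage notes from my side: the script expects DiskSpd.EXE in the same folder it runs from (it calls .\diskspd.exe), and because everything after the first line of the report is plain CSV, you can also pull the results back into PowerShell instead of Excel. A minimal sketch, assuming a single run produced .\output.txt in the current folder:

# Skip the "Disk D:, <date>" line, then parse the CSV rows (header included) into objects
$results = Get-Content .\output.txt | Select-Object -Skip 1 | ConvertFrom-Csv
$results | Format-Table -AutoSize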

 

Twenty years as a Microsoft Certified Professional – time flies when you’re having fun


 

I just noticed that last week was the 20th anniversary of my first Microsoft certification. I had to travel nearly 500 miles (from Fortaleza to Recife) to reach the closest official testing center available in Brazil in August 1995.

You’re probably thinking that I started by taking the Windows 95 exam, but it was actually the Windows 3.1 exam (which included a lot of MS-DOS 6.x stuff). The Windows 95 exam was my next one, but that only happened over a year later in December 1996.

I went on to take absolutely all of the Windows NT 4.0 and Windows 2000 exams (many of them in their beta version). At that point we had multiple Microsoft Certified Partners in Fortaleza and I worked for one of them.

I continued to take lots of exams even after I moved to the US in October 2000 and after I joined Microsoft in October 2002. I only slowed down a bit after joining the Windows Server engineering team in October 2007.

In 2009 I achieved my last certification as a Microsoft Certified Master on SQL Server 2008. That took a few weeks of training, a series of written exams and a final, multi-hour lab exam. Exciting stuff! That also later granted me a charter certification as  Microsoft Certified Solutions Master (Data Platform), Microsoft Certified Solutions Expert (Data Platform) and Microsoft Certified Solutions Associate (SQL Server 2012).

My full list is shown below. In case you’re wondering, the Windows 10 exam (Configuring Windows Devices) is already in development and you can find the details at https://www.microsoft.com/learning/en-us/exam-70-697.aspx.

 

[Images: transcript of my Microsoft certifications]

Raw notes from the Storage Developer Conference 2015 (SNIA SDC 2015)


Notes and disclaimers:

  • This blog post contains raw notes for some of the SNIA’s SDC 2015 presentations (SNIA’s Storage Developers Conference 2015)
  • These notes were typed during the talks and they may include typos and my own misinterpretations.
  • Text in the bullets under each talk are quotes from the speaker or text from the speaker slides, not my personal opinion.
  • If you feel that I misquoted you or badly represented the content of a talk, please add a comment to the post.
  • I spent limited time fixing typos or correcting the text after the event. There are only so many hours in a day…
  • I have not attended all sessions (since there are many being delivered at a time, that would actually not be possible :-)…
  • SNIA usually posts the actual PDF decks a few weeks after the event. Attendees have access immediately.
  • You can find the event agenda at http://www.snia.org/events/storage-developer/agenda

 

Understanding the Intel/Micron 3D XPoint Memory
Jim Handy, General Director, Objective Analysis

  • Memory analyst, SSD analyst, blogs: http://thememoryguy.com, http://thessdguy.com
  • Not much information available since the announcement in July: http://newsroom.intel.com/docs/DOC-6713
  • Agenda: What? Why? Who? Is the world ready for it? Should I care? When?
  • What: Picture of the 3D XPoint concept (pronounced 3d-cross-point). Micron’s photograph of “the real thing”.
  • Intel has researched PCM for 45 years. Mentioned in an Intel article at “Electronics” in Sep 28, 1970.
  • The many elements that have been tried shown in the periodic table of elements.
  • NAND laid the path to the increased hierarchy levels. Showed prices of DRAM/NAND from 2001 to 2015. Gap is now 20x.
  • Comparing bandwidth to price per gigabytes for different storage technologies: Tape, HDD, SSD, 3D XPoint, DRAM, L3, L2, L1
  • Intel diagram mentions PCM-based DIMMs (far memory) and DDR DIMMs (near memory).
  • Chart with latency for HDD SAS/SATA, SSD SAS/SATA, SSD NVMe, 3D XPoint NVMe – how much of it is the media, how much is the software stack?
  • 3D Xpoint’s place in the memory/storage hierarchy. IOPS x Access time. DRAM, 3D XPoint (Optane), NVMe SSD, SATA SSD
  • Great gains at low queue depth. 800GB SSD read IOPS using 16GB die. IOPS x queue depth of NAND vs. 3D XPoint.
  • Economic benefits: measuring $/write IOPS for SAS HDD, SATA SSD, PCIe SSD, 3D XPoint
  • Timing is good because: DRAM is running out of speed, NVDIMMs are catching on, some sysadmins understand how to use flash to reduce DRAM needs
  • Timing is bad because: Nobody can make it economically, no software supports SCM (storage class memory), new layers take time to establish
  • Why should I care: better cost/perf ratio, lower power consumption (less DRAM, more perf/server, lower OpEx), in-memory DB starts to make sense
  • When? Micron slide projects 3D XPoint at end of FY17 (two months ahead of CY). Same slide shows NAND production surpassing DRAM production in FY17.
  • Comparing average price per GB compared to the number of GB shipped over time. It takes a lot of shipments to lower price.
  • Looking at the impact in the DRAM industry if this actually happens. DRAM slows down dramatically starting in FY17, as 3D XPoint revenues increase (optimistic).

 

Next Generation Data Centers: Hyperconverged Architectures Impact On Storage
Mark OConnell, Distinguished Engineer, EMC

  • History: Client/Server → shared SANs → Scale-Out systems
  • >> Scale-Out systems: architecture, expansion, balancing
  • >> Evolution of the application platform: physical servers → virtualization → virtualized application farm
  • >> Virtualized application farms and Storage: local storage → Shared Storage (SAN) → Scale-Out Storage → Hyper-converged
  • >> Early hyper-converged systems: HDFS (Hadoop) → JVM/Tasks/HDFS in every node
  • Effects of hyper-converged systems
  • >> Elasticity (compute/storage density varies)
  • >> App management, containers, app frameworks
  • >> Storage provisioning: frameworks (openstack swift/cinder/manila), pure service architectures
  • >> Hybrid cloud enablement. Apps as self-describing bundles. Storage as a dynamically bound service. Enables movement off-prem.

 

Implications of Emerging Storage Technologies on Massive Scale Simulation Based Visual Effects
Yahya H. Mirza, CEO/CTO, Aclectic Systems Inc

  • Steve Jobs quote: “You‘ve got to start with the customer experience and work back toward the technology”.
  • Problem 1: Improve customer experience. Higher resolution, frame rate, throughput, etc.
  • Problem 2: Production cost continues to rise.
  • Problem 3: Time to render single frame remains constant.
  • Problem 4: Render farm power and cooling increasing. Coherent shared memory model.
  • How do you reduce customer CapEx/OpEx? Low efficiency: 30% CPU. Problem is memory access latency and I/O.
  • Production workflow: modeling, animation/simulation/shading, lighting, rendering, compositing. More and more simulation.
  • Concrete production experiment: 2005. Story boards. Attempt to create a short film. Putting himself in the customer’s shoes. Shot decomposition.
  • Real 3-minute short costs $2 million. Animatic to pitch the project.
  • Character modeling and development. Includes flesh and muscle simulation. A lot of it done procedurally.
  • Looking at Disney’s “Big Hero 6”, DreamWorks’ “Puss in Boots” and Weta’s “The Hobbit”, including simulation costs, frame rate, resolution, size of files, etc.
  • Physically based rendering: global illumination effects, reflection, shadows. Comes down to light transport simulation, physically based materials description.
  • Exemplary VFX shot pipeline. VFX Tool (Houdini/Maya), Voxelized Geometry (OpenVDB), Scene description (Alembic), Simulation Engine (PhysBam), Simulation Farm (RenderFarm), Simulation Output (OpenVDB), Rendering Engine (Mantra), Render Farm (RenderFarm), Output format (OpenEXR), Compositor (Flame), Long-term storage.
  • One example: smoke simulation – reference model smoke/fire VFX. Complicated physical model. Hotspot algorithms: monte-carlo integration, ray-intersection test, linear algebra solver (multigrid).
  • Storage implications. Compute storage (scene data, simulation data), Long term storage.
  • Is public cloud computing viable for high-end VFX?
  • Disney’s data center. 55K cores across 4 geos.
  • Vertically integrated systems are going to be more and more important. FPGAs, ARM-based servers.
  • Aclectic Colossus smoke demo. Showing 256x256x256.
  • We don’t want coherency; we don’t want sharing. Excited about Intel OmniPath.
  • http://www.intel.com/content/www/us/en/high-performance-computing-fabrics/omni-path-architecture-fabric-overview.html

 

How Did Human Cells Build a Storage Engine?
Sanjay Joshi, CTO Life Sciences, EMC

  • Human cell, Nuclear DNA, Transcription and Translation, DNA Structure
  • The data structure: [char(3*10^9) human_genome] strand
  • 3 gigabases [(3*10^9)*2]/8 = ~750MB. With overlaps, ~1GB per cell. 15-70 trillion cells.
  • Actual files used to store genome are bigger, between 10GB and 4TB (includes lots of redundancy).
  • Genome sequencing will surpass all other data types by 2040
  • Protein coding portion is just a small portion of it. There’s a lot we don’t understand.
  • Nuclear DNA: Is it a file? Flat file system, distributed, asynchronous. Search header, interpret, compile, execute.
  • Nuclear DNA properties: Large:~20K genes/cell, Dynamic: append/overwrite/truncate, Semantics: strict, Consistent: No, Metadata: fixed, View: one-to-many
  • Mitochondrial DNA: Object? Distributed hash table, a ring with 32 partitions. Constant across generations.
  • Mitochondrial DNA: Small: ~40 genes/cell, Static: constancy, energy functions, Semantics: single origin, Consistent: Yes, Metadata: system based, View: one-to-one
  • File versus object. Comparing Nuclear DNA and Mitochondrial DNA characteristics.
  • The human body: 7,500 named parts, 206 regularly occurring bones (newborns close to 300), ~640 skeletal muscles (320 pairs), 60+ organs, 37 trillion cells. Distributed cluster.
  • Mapping the ISO 7 layers to this system. Picture.
  • Finite state machine: max 10^45 states at 4*10^53 state-changes/sec. 10^24 NOPS (nucleotide ops per second) across biosphere.
  • Consensus in cell biology: Safety: under all conditions: apoptosis. Availability: billions of replicate copies. Not timing dependent: asynchronous. Command completion: 10 base errors in every 10,000 protein translation (10 AA/sec).
  • Object vs. file. Object: Maternal, Static, Haploid. Small, Simple, Energy, Early. File: Maternal and paternal, Diploid. Scalable, Dynamic, Complex. All cells are female first.

 

Move Objects to LTFS Tape Using HTTP Web Service Interface
Matt Starr, Chief Technical Officer, Spectra Logic
Jeff Braunstein, Developer Evangelist, Spectra Logic

  • Worldwide data growth: 2009 = 800 EB, 2015 = 6.5ZB, 2020 = 35ZB
  • Genomics. 6 cows = 1TB of data. They keep it forever.
  • Video data. SD to Full HD to 4K UHD (4.2TB per hour) to 8K UHD. Also kept forever.
  • Intel slide on the Internet minute. 90% of the people of the world never took a picture with anything but a camera phone.
  • IOT – Total digital info created or replicated.
  • $1000 genome scan takes 780MB fully compressed. 2011 HiSeq-2000 scanner generates 20TB per month. Typical camera generates 105GB/day.
  • More and more examples.
  • Tape storage is the lowest cost. But it’s also complex to deploy. Comparing to Public and Private cloud…
  • Pitfalls of public cloud – chart of $/PB/day. OpEx per PB/day reaches very high for public cloud.
  • Risk of public cloud: Amazon has 1 trillion objects. If they lose 1%, that would be 10 billion objects.
  • Risk of public cloud: Nirvanix. VC pulled the plug in September 2013.
  • Cloud: Good: toolkits, naturally WAN friendly, user expectation: put it away.
  • What if: Combine S3/Object with tape. Spectra S3 – Front end is REST, backend is LTFS tape.
  • Cost: $.09/GB. 7.2PB. Potentially a $0.20 two-copy archive.
  • Automated: App or user-built. Semi-Automated: NFI or scripting.
  • Information available at https://developer.spectralogic.com
  • All the tools you need to get started. Including simulator of the front end (BlackPearl) in a VM.
  • S3 commands, plus data to write sequentially in bulk fashion.
  • Configure user for access, buckets.
  • Deep storage browser (source code on GitHub) allows you to browse the simulated storage.
  • SDK available in Java, C#, many others. Includes integration with Visual Studio (demonstrated).
  • Showing sample application. 4 lines of code from the SDK to move a folder to tape storage.
  • Q: Access times when not cached? Hours or minutes. Depends on if the tape is already in the drive. You can ask to pull those to cache, set priorities. By default GET has higher priority than PUT. 28TB or 56TB of cache.
  • Q: Can we use CIFS/NFS? Yes, there is an NFI (Network File Interface) using CIFS/NFS, which talks to the cache machine. Manages time-outs.
  • Q: Any protection against this being used as disk? System monitors health of the tape. Using an object-based interface helps.
  • Q: Can you stage a file for some time, like 24h? There is a large cache. But there are no guarantees on the latency. Keeping it on cache is more like Glacier. What’s the trigger to bring the data?
  • Q: Glacier? Considering support for it. Data policy to move to lower cost, move it back (takes time). Not a lot of product or customers demanding it. S3 has become the standard, not sure if Glacier will be that for archive.
  • Q: Drives are a precious resource. How do you handle overload? By default, reads have precedence over writes. Writes usually can wait.

 

Taxonomy of Differential Compression
Liwei Ren, Scientific Adviser, Trend Micro

  • Mathematical model for describing file differences
  • Lossless data compression categories: data compression (one file), differential compression (two files), data deduplication (multiple files)
  • Purposes: network data transfer acceleration and storage space reduction
  • Areas for DC – mobile phones’ firmware over the air, incremental update of files for security software, file synchronization and transfer over WAN, executable files
  • Math model – Diff procedure: Delta = T – R, Merge procedure: T = R + Delta. Model for reduced network bandwidth, reduced storage cost.
  • Applications: backup, revision control system, patch management, firmware over the air, malware signature update, file sync and transfer, distributed file system, cloud data migration
  • Diff model. Two operations: COPY (source address, size [, destination address] ), ADD (data block, size [, destination address] ) – see the small merge sketch after this list
  • How to create the delta? How to encode the delta into a file? How to create the right sequence of COPY/ADD operations?
  • Top task is an effective algorithm to identify common blocks. Not covering it here, since it would take more than half an hour…
  • Modeling a diff package. Example.
  • How do you measure the efficiency of an algorithm? You need a cost model.
  • Categorizing: Local DC – LDC (xdelta, zdelta, bsdiff), Remote DC – RDC (rsync, RDC protocol, tsync), Iterative – IDC (proposed)
  • Categorizing: Not-in-place merging: general files (xdelta, zdelta, bsdiff), executable files (bsdiff, courgette)
  • Categorizing: In place merging: firmware as general files (FOTA), firmware as executable files (FOTA)
  • Topics in depth: LDC vs RDC vs IDC for general files
  • Topics in depth: LDC for executable files
  • Topics in depth: LDC for in-place merging
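To make the COPY/ADD model described above concrete, here is a minimal PowerShell sketch of the merge procedure (T = R + Delta). This is my own illustration of the idea, not code from the talk; the operation names and fields are assumptions:

# Apply a delta (an ordered list of COPY/ADD operations) to a reference byte
# array to reconstruct the target: T = R + Delta.
function Merge-Delta {
    param(
        [byte[]]   $Reference,  # R
        [object[]] $Delta       # ordered COPY/ADD operations
    )
    $out = New-Object System.Collections.Generic.List[byte]
    foreach ($op in $Delta) {
        switch ($op.Op) {
            'COPY' { $out.AddRange([byte[]]$Reference[$op.Source..($op.Source + $op.Size - 1)]) }
            'ADD'  { $out.AddRange([byte[]]$op.Data) }
        }
    }
    ,$out.ToArray()
}

# Example: the target reuses bytes 0..3 of R and then appends two new bytes
$R     = [byte[]](65,66,67,68,69)
$Delta = @(
    @{ Op = 'COPY'; Source = 0; Size = 4 },
    @{ Op = 'ADD';  Data = [byte[]](88,89) }
)
$T = Merge-Delta -Reference $R -Delta $Delta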

 

New Consistent Hashing Algorithms for Data Storage
Jason Resch, Software Architect, Cleversafe

  • Introducing a new algorithm for hashing.
  • Hashing is useful. Used commonly is distributed storage, distributed caching.
  • Independent users can coordinate (readers know where writers would write without talking to them).
  • Typically, resizing a Hash Table is inefficient. Showing example.
  • That’s why we need “Stable Hashing”. Showing example. Only a small portion of the keys need to be re-mapped.
  • Stable hashing becomes a necessity when the system is stateful and/or transferring state is expensive.
  • Used in Caching/Routing (CARP), DHT/Storage (Gluster, DynamoDB, Cassandra, ceph, openstack)
  • Stable Hashing with Global Namespaces. If you have a file name, you know what node has the data.
  • Eliminates points of contention, no metadata systems. Namespace is fixed, but the system is dynamic.
  • Balances read/write load across nodes, as well as storage utilization across nodes.
  • Perfectly Stable Hashing (Rendezvous Hashing, Consistent Hashing). Precisely weighted (CARP, RUSH, CRUSH).
  • It would be nice to have something that would offer the characteristics of both.
  • Consistent: buckets inserted in random positions. Keys maps to the next node greater than that key. With a new node, only neighbors as disrupted. But neighbor has to send data to new node, might not distribute keys evenly.
  • Rendezvous: Score = Hash (Bucket ID || Key). Bucket with the highest score wins. When adding a new node, some of the keys will move to it. Every node is disrupted evenly.
  • CARP is rendezvous hashing with a twist. It multiplies the scores by a “Load Factor” for each node. Allows for some nodes being more capable than others. Not perfectly stable: if a node’s weighting changes or a node is added, then all load factors must be recomputed.
  • RUSH/CRUSH: Hierarchical tree, with each node assigned a probability to go left/right. CRUSH makes the tree match the fault domains of the system. Efficient to add nodes, but not to remove or re-weight nodes.
  • New algorithm: Weighted Rendezvous Hashing (WRH). Both perfectly stable and precisely weighted.
  • WRH adjusts scores before weighting them. Unlike CARP, scores aren’t relatively scaled.
  • No unnecessary transfer of keys when adding/removing nodes. If adding node or increasing weight on node, other nodes will move keys to it, but nothing else. Transfers are equalized and perfectly efficient.
  • WRH is simple to implement; the whole Python code was shown in one slide (a rough re-creation follows this list).
  • All the magic is in one line: “Score = 1.0 / -math.log(hash_f)” – Proof of correctness provided for the math inclined.
  • How Cleversafe uses WRH. System is grown by set of devices. Devices have a lifecycle: added, possibly expanded, then retired.
  • Detailed explanation of the lifecycle and how keys move as nodes are added, expanded, retired.
  • Storage Resource Map. Includes weight, hash_seed. Hash seed enables a clever trick to retire device sets more efficiently.
  • Q: How to find data when things are being moved? If clients talk to the old node while keys are being moved. Old node will proxy the request to the new node.
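Based only on the one-line formula quoted above, here is a rough PowerShell re-creation of WRH node selection. This is my sketch, not Cleversafe’s implementation: the node list shape, the key format, and the use of MD5 purely to obtain a uniform value in (0,1) are all assumptions:

function Select-WrhNode {
    param(
        [string]   $Key,
        [object[]] $Nodes   # each node: @{ Id = '...'; Weight = <double> }
    )
    $md5  = [System.Security.Cryptography.MD5]::Create()
    $best = $null
    foreach ($node in $Nodes) {
        # Hash (BucketID || Key) and map the first 8 bytes to a value in (0,1)
        $bytes = [System.Text.Encoding]::UTF8.GetBytes("$($node.Id)|$Key")
        $hash  = $md5.ComputeHash($bytes)
        $u64   = [System.BitConverter]::ToUInt64($hash, 0)
        $hashF = ($u64 + 1.0) / ([double][UInt64]::MaxValue + 2.0)
        # The line quoted in the talk, with the node weight applied
        $score = $node.Weight / (-[math]::Log($hashF))
        if ($null -eq $best -or $score -gt $best.Score) {
            $best = @{ Node = $node; Score = $score }
        }
    }
    $md5.Dispose()
    $best.Node
}

# Example: three nodes, the third with twice the capacity of the others
$nodes = @(
    @{ Id = 'node-a'; Weight = 1.0 },
    @{ Id = 'node-b'; Weight = 1.0 },
    @{ Id = 'node-c'; Weight = 2.0 }
)
Select-WrhNode -Key 'object-12345' -Nodes $nodes

The property that matters here is that a node’s chance of winning is proportional to its weight, and adding or re-weighting a node only pulls keys toward that node rather than reshuffling everything.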

 

Storage Class Memory Support in the Windows Operating System
Neal Christiansen, Principal Development Lead, Microsoft

  • Windows support for non-volatile storage medium with RAM-like performance is a big change.
  • Storage Class Memory (SCM): NVDIMM, 3D XPoint, others
  • Microsoft involved with the standardization efforts in this space.
  • New driver model necessary: SCM Bus Driver, SCM Disk Driver.
  • Windows Goals for SCM: Support zero-copy access, run most user-mode apps unmodified, option for 100% backward compatibility (new types of failure modes), sector granular failure modes for app compat.
  • Applications make lots of assumptions on the underlying storage
  • SCM Storage Drivers will support BTT – Block Translation Table. Provides sector-level atomicity for writes.
  • SCM is disruptive. Fastest performance and application compatibility can be conflicting goals.
  • SCM-aware File Systems for Windows. Volume modes: block mode or DAS mode (chosen at format time).
  • Block Mode Volumes – maintain existing semantics, full application compatibility
  • DAS Mode Volumes – introduce new concepts (memory mapped files, maximizes performance). Some existing functionality is lost. Supported by NTFS and ReFS.
  • Memory Mapped IO in DAS mode. Application can create a memory mapped section. Allowed when the volume resides on SCM hardware and has been formatted for DAS mode.
  • Memory Mapped IO: True zero-copy access. BTT is not used. No paging reads or paging writes. (See the sketch after this list.)
  • Cached IO in DAS Mode: Cache manager creates a DAS-enabled cache map. Cache manager will copy directly between user’s buffer and SCM. Coherent with memory-mapped IO. App will see new failure patterns on power loss or system crash. No paging reads or paging writes.
  • Non-cached IO in DAS Mode. Will send IO down the storage stack to the SCM driver. Will use BTT. Maintains existing storage semantics.
  • If you really want the performance, you will need to change your code.
  • DAS mode eliminates traditional hook points used by the file system to implement features.
  • Features not in DAS Mode: NTFS encryption, NTFS compression, NTFS TxF, ReFS integrity streams, ReFS cluster bands, ReFS block cloning, Bitlocker volume encryption, snapshot via VolSnap, mirrored or parity via storage spaces or dynamic disks
  • Sparse files won’t be there initially but will come in the future.
  • Updated at the time the file is memory mapped: file modification time, mark file as modified in the USN journal, directory change notification
  • File System Filters in DAS mode: no notification that a DAS volume is mounted, filter will indicate via a flag if they understand DAS mode semantics.
  • Application compatibility with filters in DAS mode: No opportunity for data transformation filters (encryption, compression). Anti-virus are minimally impacted, but will need to watch for creation of writeable mapped sections (no paging writes anymore).
  • Intel NVML library. Open source library implemented by Intel. Defines set of application APIs for directly manipulating files on SCM hardware.
  • NVML library available for Linux today via GitHub. Microsoft working with Intel on a Windows port.
  • Q: XIP (Execute in place)? It’s important, but the plans have not solidified yet.
  • Q: NUMA? Can be in NUMA nodes. Typically, the file system and cache are agnostic to NUMA.
  • Q: Hyper-V? Not ready to talk about what we are doing in that area.
  • Q: Roll-out plan? We have one, but not ready to talk about it yet.
  • Q: Data forensics? We’ve yet to discuss this with that group. But we will.
  • Q: How far are you to completion? It’s running and working today. But it is not complete.
  • Q: Windows client? To begin, we’re targeting the server. Because it’s available there first.
  • Q: Effect on performance? When we’re ready to announce the schedule, we will announce the performance. The data about SCM is out there. It’s fast!
  • Q: Will you backport? Probably not. We generally move forward only. Not many systems with this kind of hardware will run a down level OS.
  • Q: What languages for the Windows port of NVML? Andy will cover that in his talk tomorrow.
  • Q: How fast will memory mapped be? Potentially as fast as DRAM, but depends on the underlying technology.
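
To make the zero-copy, memory-mapped access model above concrete, here is a small Python sketch using an ordinary memory-mapped file; the file name is hypothetical, and on a real DAS-mode SCM volume the loads and stores would reach persistent media directly instead of going through paging I/O:

    import mmap

    # Hypothetical file on what would be a DAS-mode volume.
    with open("scm-demo.bin", "wb") as f:
        f.truncate(4096)

    f = open("scm-demo.bin", "r+b")
    m = mmap.mmap(f.fileno(), 4096)
    m[0:5] = b"hello"   # plain stores into the mapping; no read/write system calls
    m.flush()           # roughly analogous to flushing CPU caches for durability
    m.close()
    f.close()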

 

The Bw-Tree Key-Value Store and Its Applications to Server/Cloud Data Management in Production
Sudipta Sengupta, Principal Research Scientist, Microsoft Research

  • The B-Tree: key-ordered access to records. Balanced tree via page split and merge mechanisms.
  • Design tenets: Lock free operation (high concurrency), log-structure storage (exploit flash devices with fast random reads and inefficient random writes), delta updates to pages (reduce cache invalidation, garbage creation)
  • Bw-Tree Architecture: 3 layers: B-Tree (expose API, B-tree search/update, in-memory pages), Cache (logical page abstraction, move between memory and flash), Flash (reads/writes from/to storage, storage management).
  • Mapping table: Expose logical pages to access method layer. Isolates updates to single page. Structure for lock-free multi-threaded concurrency control.
  • Highly concurrent page updates with Bw-Tree. Explaining the process using a diagram. (A toy sketch of the delta-update CAS follows this list.)
  • Bw-Tree Page Split: No hard threshold for splitting unlike in classical B-Tree. B-link structure allows “half-split” without locking.
  • Flash SSDs: Log-Structured storage. Use log structure to exploit the benefits of flash and work around its quirks: random reads are fast, random in-place writes are expensive.
  • LLAMA Log-Structured Store: Amortize cost of writes over many page updates. Random reads to fetch a “logical page”.
  • Depart from tradition: logical page formed by linking together records on multiple physical pages on flash. Adapted from SkimpyStash.
  • Detailed diagram comparing traditional page writing with the writing optimized storage organization with Bw-Tree.
  • LLAMA: Optimized Logical Page Reads. Multiple delta records are packed when flushed together. Pages consolidated periodically in memory also get consolidated on flash when flushed.
  • LLAMA: Garbage collection on flash. Two types of record units in the log: Valid or Orphaned. Garbage collection starts from the oldest portion of the log. Earliest written record on a logical page is encountered first.
  • LLAMA: cache layer. Responsible for moving pages back and forth from storage.
  • Bw-Tree Checkpointing: Need to flush to buffer and to storage. LLAMA checkpoint for fast recovery.
  • Bw-Tree Fast Recovery. Restore mapping table from latest checkpoint region. Warm-up using sequential I/O.
  • Bw-Tree: Support for transactions. Part of the Deuteronomy Architecture.
  • End-to-end crash recovery. Data component (DC) and transactional component (TC) recovery. DC happens before TC.
  • Bw-Tree in production: Key-sequential index in SQL Server in-memory database
  • Bw-Tree in production: Indexing engine in Azure DocumentDB. Resource governance is important (CPU, Memory, IOPS, Storage)
  • Bw-Tree in production: Sorted key-value store in Bing ObjectStore.
  • Summary: Classic B-Tree redesigned for modern hardware and cloud. Lock-free, delta updating of pages, log-structure, flexible resource governor, transactional. Shipping in production.
  • Going forward: Layer transactional component (Deuteronomy Architecture, CIDR 2015), open-source the codebase
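
A toy Python model of the mapping-table and delta-update idea described above; the lock here only simulates the hardware compare-and-swap, and real Bw-Tree pages also get consolidated when delta chains grow long:

    import threading

    class Delta:
        """One delta record prepended to a logical page's update chain."""
        def __init__(self, key, value, nxt):
            self.key, self.value, self.next = key, value, nxt

    class MappingTable:
        """Toy model: logical page id -> head of its delta chain."""
        def __init__(self):
            self._table = {}
            self._lock = threading.Lock()  # stands in for a hardware CAS

        def _cas(self, pid, expected, new):
            with self._lock:
                if self._table.get(pid) is expected:
                    self._table[pid] = new
                    return True
                return False

        def update(self, pid, key, value):
            # Prepend a delta; retry if another thread won the race.
            while True:
                head = self._table.get(pid)
                if self._cas(pid, head, Delta(key, value, head)):
                    return

        def read(self, pid, key):
            node = self._table.get(pid)
            while node is not None:
                if node.key == key:
                    return node.value
                node = node.next
            return None

    table = MappingTable()
    table.update(1, "k1", "v1")
    table.update(1, "k1", "v2")   # newer delta shadows the older one
    print(table.read(1, "k1"))    # prints v2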

 

ReFS v2: Cloning, Projecting, and Moving Data
J.R. Tipton, Development Lead, Microsoft

  • Agenda: ReFS v1 primer, ReFS v2 at a glance, motivations for ReFS v2, cloning, translation, transformation
  • ReFS v1 primer: Windows allocate-on-write file system, Merkle trees verify metadata integrity, online data correction from alternate copies, online chkdsk
  • ReFS v2: Available in Windows Server 2016 TP4. Efficient, reliable storage for VMs, efficient parity, write tiering, read caching, block cloning, optimizations
  • Motivations for ReFS v2: cheap storage does not mean slow, VM density, VM provisioning, more hardware flavors (SLC, MLC, TLC flash, SMR)
  • Write performance. Magic does not work in a few environments (super fast hardware, small random writes, durable writes/FUA/sync/write-through)
  • ReFS Block Cloning: Clone any block of one file into any other block in another file. Full file clone, reorder some or all data, project data from one area into another without copy
  • ReFS Block Cloning: Metadata only operation. Copy-on-write used when needed (ReFS knows when). (A toy copy-on-write model follows this list.)
  • Cloning examples: deleting a Hyper-V VM checkpoint, VM provisioning from image.
  • Cloning observations: app directed, avoids data copies, metadata operations, Hyper-V is the first but not the only one using this
  • Cloning is no free lunch: multiple valid copies will copy-on-write upon changes; metadata overhead to track state; a slam dunk in most cases, but not all
  • ReFS cluster bands. Volume internally divvied up into bands that contain regular FS clusters (4KB, 64KB). Mostly invisible outside file system. Bands and clusters track independently (per-band metadata). Bands can come and go.
  • ReFS can move bands around (read/write/update band pointer). Efficient write caching and parity. Writes to bands in fast tier. Tracks heat per band. Moves bands between tiers. More efficient allocation. You can move from 100% triple mirroring to 95% parity.
  • ReFS cluster bands: small writes accumulate where writing is cheap (mirror, flash, log-structured arena), bands are later shuffled to tier where random writes are expensive (band transfers are fully sequential).
  • ReFS cluster bands: transformation. ReFS can do stuff to the data in a band (can happen in the background). Examples: band compaction (put cold bands together, squeeze out free space), band compression (decompress on read).
  • ReFS v2 summary: data cloning, data movement, data transformation. Smart when smart makes sense, switches to dumb when dumb is better. Takes advantages of hardware combinations. And lots of other stuff…
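
A toy Python model of the metadata-only clone plus copy-on-write bookkeeping described above; this is not the ReFS on-disk format or its API, just the reference-counting idea:

    class ToyVolume:
        """Files are lists of block ids; blocks are shared until someone writes."""
        def __init__(self):
            self.blocks = {}     # block id -> data
            self.refcount = {}   # block id -> number of files referencing it
            self.files = {}      # file name -> list of block ids
            self._next_id = 0

        def _alloc(self, data):
            bid = self._next_id
            self._next_id += 1
            self.blocks[bid] = data
            self.refcount[bid] = 1
            return bid

        def write(self, name, index, data):
            blocks = self.files.setdefault(name, [])
            while len(blocks) <= index:
                blocks.append(self._alloc(b""))
            bid = blocks[index]
            if self.refcount[bid] > 1:          # shared block: copy-on-write
                self.refcount[bid] -= 1
                blocks[index] = self._alloc(data)
            else:                               # exclusive block: update in place
                self.blocks[bid] = data

        def clone(self, src, dst):
            # Metadata-only: copy the block list and bump reference counts.
            self.files[dst] = list(self.files[src])
            for bid in self.files[dst]:
                self.refcount[bid] += 1

    v = ToyVolume()
    v.write("base.vhdx", 0, b"image data")
    v.clone("base.vhdx", "vm1.vhdx")            # instant, no data copied
    v.write("vm1.vhdx", 0, b"vm1 change")       # triggers copy-on-write
    print(v.blocks[v.files["base.vhdx"][0]])    # base image is unchanged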

 

Innovator, Disruptor or Laggard, Where Will Your Storage Applications Live? Next Generation Storage
Bev Crair, Vice President and General Manager, Storage Group, Intel

  • The world is changing: information growth,  complexity, cloud, technology.
  • Growth: 44ZB of data in all systems. 15% of the data is stored, since perceived cost is low.
  • Every minute of every day: 2013: 8 hours of video uploaded to YouTube, 47,000 apps downloaded, 200 million e-mails
  • Every minute of every day: 2015: 300 hours of video uploaded to YouTube, 51,000 apps downloaded, 204 million e-mails
  • Data never sleeps: the internet in real time. tiles showing activities all around the internet.
  • Data use pattern changes: sense and generate, collect and communicate, analyze and optimize. Example: the Large Hadron Collider
  • Data use pattern changes: from collection to analyzing data, valuable data now reside outside the organization, analyzing and optimizing unstructured data
  • Cloud impact on storage solutions: business impact, technology impact. Everyone wants an easy button
  • Intelligent storage: Deduplication, real-time compression, intelligent tiering, thin provisioning. All of this is a software problem.
  • Scale-out storage: From single system with internal network to nodes working together with an external network
  • Non-Volatile Memory (NVM) accelerates the enterprise: Examples in Virtualization, Private Cloud, Database, Big Data and HPC
  • Pyramid: CPU, DRAM, Intel DIMM (3D XPoint), Intel SSD (3D XPoint), NAND SSD, HDD,  …
  • Storage Media latency going down dramatically. With NVM, the bottleneck is now mostly in the software stack.
  • Future storage architecture: complex chart with workloads for 2020 and beyond. New protocols, new ways to attach.
  • Intel Storage Technologies. Not only hardware, but a fair amount of software. SPDK, NVMe driver, Acceleration Library, Lustre, others.
  • Why does faster storage matter? Genome testing for cancer takes weeks, and the cancer mutates. Genome is 10TB. If we can speed up the time it takes to test it to one day, it makes a huge difference and you can create a medicine that saves a person’s life. That’s why it matters.

  

The Long-Term Future of Solid State Storage
Jim Handy, General Director, Objective Analysis

  • How we got here? Why are we in the trouble we’re at right now? How do we get ahead of it? Where is it going tomorrow?
  • Establishing a schism: Memory is in bytes (DRAM, Cache, Flash?), Storage is in blocks (Disk, Tape, DVD, SAN, NAS, Cloud, Flash)
  • Is it really about block? Block, NAND page, DRAM pages, CPU cache lines. It’s all in pages anyway…
  • Is there another differentiator? Volatile vs. Persistent. It’s confusing…
  • What is an SSD? SSDs are nothing new. Going back to DEC Bulk Core.
  • Disk interfaces create delays. SSD vs HDD latency chart. Time scale in milliseconds.
  • Zooming in to tens of microseconds. Different components of the SSD delay. Read time, Transfer time, Link transfer, platform and adapter, software
  • Now looking at delays for MLC NAND ONFi2, ONFi3, PCIe x4 Gen3, future NVM on PCIe x4 Gen3
  • Changing the scale to tens of microseconds on future NVM. Link Transfer, Platform & adapter and Software now account for most of the latency.
  • How to move ahead? Get rid of the disk interfaces (PCIe, NVMe, new technologies). Work on the software: SNIA.
  • Why now? DRAM Transfer rates. Chart transfer rates for SDRAM, DDR, DDR2, DDR3, DDR4. Designing the bus takes most of the time.
  • DRAM running out of speed? We probably won’t see a DDR5. HMC or HBM a likely next step. Everything points to fixed memory sizes.
  • NVM to the rescue. DRAM is not the only upgrade path. It became cheaper to use NAND flash than DRAM to upgrade a PC.
  • NVM to be a new memory layer between DRAM & NAND: Intel/Micron 3D XPoint – “Optane”
  • One won’t kill the other. Future systems will have DRAM, NVM, NAND, HDD. None of them will go away…
  • New memories are faster than NAND. Chart with read bandwidth vs write bandwidth. Emerging NVRAM: FeRAM, eMRAM, RRAM, PRAM.
  • Complex chart with emerging research memories. Clock frequency vs. Cell Area (cost).
  • The computer of tomorrow. Memory or storage? In the beginning (core memory), there was no distinction between the two.
  • We’re moving to an era where you can turn off the computer, turn it back on and there’s something in memory. Do you trust it?
  • SCM – Storage Class Memory: high performance with archival properties. There are many other terms for it: Persistent Memory, Non-Volatile Memory.
  • New NVM has disruptively low latency: Log chart with latency budgets for HDD, SATA SSD, NVMe, Persistent. When you go below 10 microseconds (as Persistent does), context switching does not make sense.
  • Non-blocking I/O. NUMA latencies up to 200ns have been tolerated. Latencies below these cause disruption.
  • Memory mapped files eliminate file system latency.
  • The computer of tomorrow. Fixed DRAM size, upgradeable NVM (tomorrow’s DIMM), both flash and disk (flash on PCIe or own bus), much work needed on SCM software
  • Q: Will all these layers survive? I believe so. There are potential improvements in all of them (cited a few on NAND, HDD).
  • Q: Shouldn’t we drop one of the layers? Usually, adding layers (not removing them) is more interesting from a cost perspective.
  • Q: Do we need a new protocol for SCM? NAND did well without much of that. Alternative memories could be put on a memory bus.

 

Concepts on Moving From SAS connected JBOD to an Ethernet Connected JBOD
Jim Pinkerton, Partner Architect Lead, Microsoft

  • What if we took a JBOD, a simple device, and just put it on Ethernet?
  • Re-Thinking the Software-defined Storage conceptual model definition: compute nodes, storage nodes, flakey storage devices
  • Front-end fabric (Ethernet, IB or FC), Back-end fabric (directly attached or shared storage)
  • Yesterday’s Storage Architecture: Still highly profitable. Compute nodes, traditional SAN/NAS box (shipped as an appliance)
  • Today: Software Defined Storage (SDS) – “Converged”. Separate the storage service from the JBOD.
  • Today: Software Defined Storage (SDS) – “Hyper-Converged” (H-C). Everything ships in a single box. Scale-out architecture.
  • H-C appliances are a dream for the customer to install/use, but the $/GB storage is high.
  • Microsoft Cloud Platform System (CPS). Shipped as a packaged deal. Microsoft tested and guaranteed.
  • SDS with DAS – Storage layer divided into storage front-end (FE) and storage back-end (BE). The two communicate over Ethernet.
  • SDS Topologies. Going from Converged and Hyper-Converged to a future EBOD topology. From file/block access to device access.
  • Expose the raw device over Ethernet. The raw device is flaky, but we love it. The storage FE will abstract that, add reliability.
  • I would like to have an EBOD box that could provide the storage BE.
  • EBOD works for a variety of access protocols and topologies. Examples: SMB3 “block”, Lustre object store, Ceph object store, NVMe fabric, T10 objects.
  • Shared SAS Interop. Nightmare experience (disk multi-path interop, expander multi-path interop, HBA distributed failure).  This is why customers prefer appliances.
  • To share or not to share. We want to share, but we do not want shared SAS. Customer deployment is more straightforward, but you have more traffic on Ethernet.
  • Hyper-Scale cloud tension – fault domain rebuild time. Depends on number of disks behind a node and how much network you have.
  • Fault domain for storage is too big. Required network speed offsets cost benefits of greater density. Many large disks behind a single node becomes a problem.
  • Private cloud tension – not enough disks. Entry points at 4 nodes, small number of disks. Again, fault domain is too large.
  • Goals in refactoring SDS – Storage back-end is a “data mover” (EBOD). Storage front-end is “general purpose CPU”.
  • EBOD goals – Can you hit a cost point that’s interesting? Reduce storage costs, reduce size of fault domain, build a more robust ecosystem of DAS. Keep topology simple, so customer can build it themselves.
  • EBOD: High end box, volume box, capacity box.
  • EBOD volume box should be close to what a JBOD costs. Basically like exposing raw disks.
  • Comparing current Hyper-Scale to EBOD. EBOD has an NIC and an SOC, in addition to the traditional expander in a JBOD.
  • EBOD volume box – Small CPU and memory, dual 10GbE, SOC with RDMA NIC/SATA/SAS/PCIe, up to 20 devices, SFF-8639 connector, management (IPMI, DMTF Redfish?)
  • Volume EBOD Proof Point – Intel Avaton, PCIe Gen 2, Chelsio 10GbE, SAS HBA, SAS SSD. Looking at random read IOPS (local, RDMA remote and non-RDMA remote). Max 159K IOPS w/RDMA, 122K IOPS w/o RDMA. Latency chart showing just a few msec.
  • EBOD Performance Concept – Big CPU, Dual attach 40GbE, Possibly all NVME attach or SCM. Will show some of the results this afternoon.
  • EBOD is an interesting approach that’s different from what we’re doing. But it’s nicely aligned with software-defined storage.
  • Price point of EBOD must be carefully managed, but the low price point enables a smaller fault domain.

  

Planning for the Next Decade of NVM Programming
Andy Rudoff, SNIA NVM Programming TWG, Intel

  • Looking at what’s coming up in the next decade, but will start with some history.
  • Comparison of data storage technologies. Emerging NV technologies with read times in the same order of magnitude as DRAM.
  • Moving the focus to software latency when using future NVM.
  • Is it memory or storage? It’s persistent (like storage) and byte-addressable (like memory).
  • Storage vs persistent memory. Block IO vs. byte addressable, sync/async (DMA master)  vs. sync (DMA slave). High capacity vs. growing capacity.
  • pmem: The new Tier. Byte addressable, but persistent. Not NAND. Can do small I/O. Can DMA to it.
  • SNIA TWG (lots of companies). Defining the NVM programming model: NVM.PM.FILE mode and NVM.PM.VOLUME mode.
  • All the OSes created in the last 30 years have a memory mapped file.
  • Is this stuff real? Why are we spending so much time on this? Yes – Intel 3D XPoint technology, the Intel DIMM. Showed a wafer on stage. 1000x faster than NAND. 1000X endurance of NAND, 10X denser than conventional memory. As much as 6TB of this stuff…
  • Timeline: Big gap between NAND flash memory (1989) and 3D XPoint (2015).
  • Diagram of the model with Management, Block, File and Memory access. Link at the end to the diagram.
  • Detecting pmem: Defined in the ACPI 6.0. Linux support upstream (generic DIMM driver, DAX, ext4+DAX, KVM).  Neal talked about Windows support yesterday.
  • Heavy OSV involvement in TWG, we wrote the spec together.
  • We don’t want every application to have to re-architecture itself. That’s why we have block and file there as well.
  • The next decade
  • Transparency  levels: increasing barrier to adoption. increasing leverage. Could do it in layers. For instance, could be file system only, without app modification. For instance, could modify just the JVM to get significant advantages without changing the apps.
  • Comparing to multiple cores in hardware and multi-threaded programming. Took a decade or longer, but it’s commonplace now.
  • One transparent example: pmem Paging. Paging from the OS page cache (diagrams).
  • Attributes of paging : major page faults, memory looks much larger, page in must pick a victim, many enterprise apps opt-out, interesting example: Java GC.
  • What would it look like if you paged to pmem instead of paging to storage. I don’t even care that it’s persistent, just that there’s a lot of it.
  • I could kick a page out synchronously, probably faster than a context switch. But the app could access the data in pmem without swapping it in (that‘s new!). Could have policies for which app lives in which memory. The OS could manage that, with application transparency.
  • Would this really work? It will when pmem costs less, performance is close, capacity is significant and it is reliable. “We’re going to need a bigger byte” to hold error information.
  • Not just for pmem. Other memories technologies are emerging. High bandwidth memory, NUMA localities, different NVM technologies.
  • Extending into user space: NVM Library – pmem.io (64-bit Linux Alpha release). Windows is working on it as well.
  • That is a non-transparent example. It’s hard (like multi-threading). Things can fail in interesting new ways.
  • The library makes it easier and some of it is transactional.
  • No kernel interception point, for things like replication. No chance to hook above or below the file system. You could do it in the library.
  • Non-transparent use cases: volatile caching, in-memory database, storage appliance write cache, large byte-addressable data structures (hash table, dedup), HPC (checkpointing)
  • Sweet spots: middleware, libraries, in-kernel usages.
  • Big challenge: middleware, libraries. Is it worth the complexity.
  • Building a software ecosystem for pmem, cost vs. benefit challenge.
  • Prepare yourself: learn the NVM programming model, map use cases to pmem, contribute to the libraries, help build the software ecosystem

 

FS Design Around SMR: Seagate’s Journey and Reference System with EXT4
Adrian Palmer, Drive Development Engineering, Seagate Technologies

  • SNIA Tutorial. I’m talking about the standard, as opposed as the design of our drive.
  • SMR is being embraced by everyone, since this is a major change, a game changer.
  • The write profile moves from random writes to resembling the write profile of sequential-access tape.
  • 1 new condition: forward-write preferred. ZAC/ZBD spec: T10/T13. Zones, SCSI ZBC standard, ATA ZAC standard.
  • What is a file system? Essential software on a system, structured and unstructured data, stores metadata and data.
  • Basic FS requirements: Write-in-place (superblock, known location on disk), Sequential write (journal), Unrestricted write type (random or sequential)
  • Drive parameters: Sector (atomic unit of read/write access). Typically 512B size. Independently accessed. Read/write, no state.
  • Drive parameters: Zone (atomic performant rewrite unit). Typically 256 MiB in size. Indirectly addressed via sector. Modified with ZAC/ZBD commands. Each zone has state (WritePointer, Condition, Size, Type). (A toy model of the WritePointer rule follows this list.)
  • Write Profiles. Conventional (random access), Tape (sequential access), Flash (sequential access, erase blocks), SMR HA/HM (sequential access, zones). SMR write profile is similar to Tape and Flash.
  • Allocation containers. Drive capacities are increasing, location mapping is expensive. 1.56% with 512B blocks or 0.2% with 4KB blocks.
  • Remap the block device as a… block device. Partitions (w*sector size), Block size (x*sector size), Group size (y*Block size), FS (z*group size, expressed as blocks).
  • Zones are a good fit to be matched with Groups. Absorb and mirror the metadata, don’t keep querying drive for metadata.
  • Solving the sequential write problem. Separate the problem spaces with zones.
  • Dedicate zones to each problem space: user data, file records, indexes, superblock, trees, journal, allocation containers.
  • GPT/Superblocks: First and last zone (convention, not guaranteed). Update infrequently, and at dismount. Looks at known location and WritePointer. Copy-on-update. Organized wipe and update algorithm.
  • Journal/soft updates. Update very frequently, 2 or more zones, set up as a circular buffer. Checkpoint at each zone. Wipe and overwrite oldest zone. Can be used as NV cache for metadata. Requires lots of storage space for efficient use and NV.
  • Group descriptors: Infrequently changed. Changes on zone condition change, resize, free block counts. Write cached, but written at WritePointer. Organized as a B+Tree, not an indexed array. The B+Tree needs to be stored on-disk.
  • File Records: POSIX information (ctime, mtime, atime, msize, fs specific attributes), updated very frequently. Allows records to be modified in memory, written to journal cache, gather from journal, write to new blocks at WritePointer.
  • Mapping (file records to blocks). File ideally written as a single chunk (single pointer), but could become fragmented (multiple pointers). Can outgrow file record space, needs its own B+Tree. List can be in memory, in the journal, written out to disk at WritePointer.
  • Data: Copy-on-write. Allocator chooses blocks at WritePointer. Writes are broken at zone boundary, creating new command and new mapping fragment.
  • Cleanup: Cannot clean up as you go, need a separate step. Each zone will have holes. Garbage collection: Journal GC, Zones GC, Zone Compaction, Defragmentation.
  • Advanced features: indexes, queries, extended attributes, snapshots, checksums/parity, RAID/JBOD.
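
A toy Python model of the zone and WritePointer rules the file system has to design around; the sizes and error handling are illustrative of the ZBC/ZAC behavior, not a drive implementation:

    class Zone:
        """Toy sequential-write-required zone: writes must land at the write pointer."""
        def __init__(self, size_blocks=65536):      # 256 MiB of 4 KiB blocks
            self.size = size_blocks
            self.write_pointer = 0
            self.blocks = {}

        def write(self, lba, data_blocks):
            if lba != self.write_pointer:
                raise IOError("zone requires sequential writes at the WritePointer")
            if self.write_pointer + len(data_blocks) > self.size:
                raise IOError("write crosses the zone boundary; split the command")
            for i, blk in enumerate(data_blocks):
                self.blocks[lba + i] = blk
            self.write_pointer += len(data_blocks)

        def reset_write_pointer(self):
            # Rewind the pointer; the whole zone becomes rewritable (data discarded).
            self.write_pointer = 0
            self.blocks.clear()

    z = Zone()
    z.write(0, [b"superblock copy"])   # ok: at the write pointer
    z.write(1, [b"journal entry"])     # ok: strictly sequential
    # z.write(0, [b"update"])          # would raise: a rewrite needs a reset first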

 

Azure File Service: ‘Net Use’ the Cloud
David Goebel, Software Engineer, Microsoft

  • Agenda: features and API (what), scenarios enabled (why), design of an SMB server not backed by a conventional FS (how)
  • It’s not the Windows SMB server (srv2.sys). Uses Azure Tables and Azure Blobs for the actual files.
  • Easier because we already have a highly available and distributed architecture.
  • SMB 2.1 in preview since last summer. SMB 3.0 (encryption, persistent handles) in progress.
  • Azure containers mapped as shares. Clients work unmodified out-of-the-box. We implemented the spec. (A sample mount command follows this list.)
  • Share namespace is coherently accessible
  • MS-SMB2, not SMB1. Anticipates (but does not require) a traditional file system on the other side.
  • In some ways it’s harder, since what’s there is not a file system. We have multiple tables (for leases, locks, etc). Nice and clean.
  • SMB is a stateful protocol, while REST is all stateless. Some state is immutable (like FileId), some state is transient (like open counts), some is maintained by the client (like CreateGuid), some state is ephemeral (connection).
  • Diagram with the big picture. Includes DNS, load balancer, session setup & traffic, front-end node, azure tables and blobs.
  • Front-end has ephemeral and immutable state. Back-end has solid and fluid durable state.
  • Diagram with two clients accessing the same file and share, using locks, etc. All the state handled by the back-end.
  • Losing a front-end node considered a regular event (happens during updates), the client simply reconnects, transparently.
  • Current state, SMB 2.1 (SMB 3.0 in the works). 5TB per share and 1TB per file. 1,000 8KB IOPS per share, 60MB/sec per share. Some NTFS features not supported, some limitations on characters and path length (due to HTTP/REST restrictions).
  • Demo: I’m actually running my talk using a PPTX file on Azure File. Robocopy to file share. Delete, watch via explorer (notifications working fine). Watching also via wireshark.
  • Current Linux support: lists specific versions of Ubuntu Server, Ubuntu Core, CentOS, openSUSE, SUSE Linux Enterprise Server.
  • Why: They want to move to cloud, but they can’t change their apps. Existing file I/O applications. Most of what was written over the last 30 years “just works”. Minor caveats that will become more minor over time.
  • Discussed specific details about how permissions are currently implemented. ACL support is coming.
  • Example: Encryption enabled scenario over the internet.
  • What about REST? SMB and REST access the same data in the same namespace, so a gradual application transition without disruption is possible. REST for container, directory and file operations.
  • The durability game. Modified state that normally exists only in server memory, which must be durably committed.
  • Examples of state tiering: ephemeral state, immutable state, solid durable state, fluid durable state.
  • Example: Durable Handle Reconnect. Intended for network hiccups, but stretched to also handles front-end reconnects. Limited our ability because of SMB 2.1 protocol compliance.
  • Example: Persistent Handles. Unlike durable handles, SMB 3 is actually intended to support transparent failover when a front-end dies. Seamless transparent failover.
  • Resource Links: Getting started blog (http://blogs.msdn.com/b/windowsazurestorage/archive/2014/05/12/introducing-microsoft-azure-file-service.aspx) , NTFS features currently not supported (https://msdn.microsoft.com/en-us/library/azure/dn744326.aspx), naming restrictions for REST compatibility (https://msdn.microsoft.com/library/azure/dn167011.aspx).
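
For reference, a minimal Python sketch of mounting a share from a Windows client by shelling out to net use; the account name, share name, and key below are placeholders:

    import subprocess

    account, share = "myaccount", "myshare"
    key = "<storage-account-key>"

    subprocess.run(
        ["net", "use", "Z:",
         rf"\\{account}.file.core.windows.net\{share}",
         f"/u:{account}", key],
        check=True)

After the mount, ordinary file I/O against Z: travels over SMB 2.1 to the service, which is the "existing apps just work" scenario from the talk.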

 

Software Defined Storage – What Does it Look Like in 3 Years?
Richard McDougall, Big Data and Storage Chief Scientist, VMware

  • How do you come up with a common, generic storage platform that serves the needs of applications?
  • Bringing a definition of SDS. Major trends in hardware, what the apps are doing, cloud platforms
  • Storage workloads map. Many apps on 4 quadrants on 2 axis: capacity (10’s of Terabytes to 10’s of Petabytes) and IOPS (1K to 1M)
  • What are cloud-native applications? Developer access via API, continuous integration and deployment, built for scale, availability architected in the app, microservices instead of monolithic stacks, decoupled from infrastructure
  • What do Linux containers need from storage? Copy/clone root images, isolated namespace, QoS controls
  • Options to deliver storage to containers: copy whole root tree (primitive), fast clone using shared read-only images, clone via “Another Union File System” (aufs), leverage native copy-on-write file system.
  • Shared data: Containers can share a file system within a host or across hosts (new interest in distributed file systems)
  • Docker storage abstractions for containers: non-persistent boot environment, persistent data (backed by block volumes)
  • Container storage use cases: unshared volumes, shared volumes, persist to external storage (API to cloud storage)
  • Eliminate the silos: converged big data platform. Diagram shows Hadoop, HBase, Impala, Pivotal HawQ, Cassandra, Mongo, many others. HDFS, MAPR, GPFS, POSIX, block storage. Storage system common across all these, with the right access mechanism.
  • Back to the quadrants based on capacity and IOPS. Now with hardware solutions instead of software. Many flash appliances in the upper left (low capacity, high IOPS). Isilon in the lower right (high capacity, low IOPS).
  • Storage media technologies in 2016. Pyramid with latency, capacity per device, capacity per host for each layer: DRAM (1TB/device, 4TB/host, ~100ns latency), NVM (1TB, 4TB, ~500ns), NVMe SSD (4TB, 48TB, ~10us), capacity SSD (16TB, 192TB, ~1ms), magnetic storage (32TB, 384TB, ~10ms), object storage (?, ?, ~1s). 
  • Back to the quadrants based on capacity and IOPS. Now with storage media technologies.
  • Details on the types of NVDIMM (NVDIMM-N – Type 1, NVDIMM-F – Type 2, Type 4). Standards coming up for all of these. Needs work to virtualize those, so they show up properly inside VMs.
  • Intel 3D XPoint Technology.
  • What are the SDS solutions that can sit on top of all this? Back to quadrants with SDS solutions. Mentions Nexenta, ScaleIO, VSAN, ceph, Scality, MAPR, HDFS. Can you make one solution that works well for everything?
  • What’s really behind a storage array? The value from the customer is that it’s all from one vendor and it all works. Nothing magic, but the vendor spent a ton of time on testing.
  • Types of SDS: Fail-over software on commodity servers (lists many vendors), complexity in hardware, interconnects. Issues with hardware compatibility.
  • Types of SDS: Software replication using servers + local disks. Simpler, but not very scalable.
  • Types of SDS: Caching hot core/cold edge. NVMe flash devices up front, something slower behind it (even cloud). Several solutions, mostly startups.
  • Types of SDS: Scale-out SDS. Scalable, fault-tolerant, rolling updates. More management, separate compute and storage silos. Model used by ceph, ScaleiO. Issues with hardware compatibility. You really need to test the hardware.
  • Types of SDS: Hyper-converged SDS. Easy management, scalable, fault-tolerant, rolling upgrades. Fixed compute to storage ratio. Model used by VSAN, Nutanix. Amount of variance in hardware still a problem. Need to invest in HCL verification.
  • Storage interconnects. Lots of discussion on what’s the right direction. Protocols (iSCSI, FC, FCoE, NVMe, NVMe over Fabrics), Hardware transports (FC, Ethernet, IB, SAS), Device connectivity (SATA, SAS, NVMe)
  • Network. iSCSI, iSER, FCoE, RDMA over Ethernet, NVMe Fabrics. Can storage use the network? RDMA debate for years. We’re at a tipping point.
  • Device interconnects: HCA with SATA/SAS. NVMe SSD, NVM over PCIe. Comparing iSCSI, FCoE and NVMe over Ethernet.
  • PCIe rack-level Fabric. Devices become addressable. PCIe rack-scale compute and storage, with host-to-host RDMA.
  • NVMe – The new kid on the block. Support from various vendors. Quickly becoming the all-purpose stack for storage, becoming the universal standard for talking block.
  • Beyond block: SDS Service Platforms. Back to the 4 quadrants, now with service platforms.
  • Too many silos: block, object, database, key-value, big data. Each one is its own silo with its own machines, management stack, HCLs. No sharing of infrastructure.
  • Option 1: Multi-purpose stack. Has everything we talked about, but it’s a compromise.
  • Option 2: Common platform + ecosystem of services. Richest, best-of-breed services, on a single platform, manageable, shared resources.

 

Why the Storage You Have is Not the Storage Your Data Needs
Laz Vekiarides, CTO and Co-founder, ClearSky Data

  • ClearSky Data is a tech company, consumes what we discussed in this conference.
  • The problem we’re trying to solve is the management of the storage silos
  • Enterprise storage today. Chart: Capacity vs. $/TB. Flash, Mid-Range, Scale-Out. Complex, costly silos
  • Describe the lifecycle of the data, the many copies you make over time, the rebuilding and re-buying of infrastructure
  • What enterprises want: buy just enough of the infrastructure, with enough performance, availability, security.
  • Cloud economics – pay only for the stuff that you use, you don’t have to see all the gear behind the storage, someone does the physical management
  • Tiering is a bad answer – Nothing remains static. How fast does hot data cool? How fast does it re-warm? What is the overhead to manage it? It’s a huge overhead. It’s not just a bandwidth problem.
  • It’s the latency, stupid. Data travels at the speed of light. Fast, but finite. Boston to San Francisco: 29.4 milliseconds of round-trip time (best case). Reality (with switches, routers, protocols, virtualization) is more like 70 ms.
  • So, where exactly is the cloud? Amazon East is near Ashburn, VA. Best case is 10ms RTT. Worst case is ~150ms (does not include time to actually access the storage).
  • ClearSky solution: a global storage network. The infrastructure becomes invisible to you, what you see is a service level agreement.
  • Solution: Geo-distributed data caching. Customer SAN, Edge, Metro POP, Cloud. Cache on the edge (all flash), cache on the metro POP.
  • Edge to Metro POP are private lines (sub millisecond latency). Addressable market is the set of customers within a certain distance to the Metro POP.
  • Latency math: Less than 1ms to the Metro POP, cache miss path is between 25ms and 50ms. (A worked example follows this list.)
  • Space Management: Edge (hot, 10%, 1 copy), POP (warm, <30%, 1-2 copies), Cloud (100%, n copies). All data is deduplicated and encrypted.
  • Modeling cache performance: Miss ratio curve (MRC). Performance as f(size), working set knees, inform allocation policy.
  • Reuse distance (unique intervening blocks between use and reuse). LRU is most of what’s out there. Look at stacking algorithms. Chart on cache size vs. miss ratio. There’s a talk on this tomorrow by CloudPhysics.
  • Worked with customers to create a heat map data collector. Sizing tool for VM environments. Collected 3-9 days of workload.
  • ~1,400 virtual disks, ~800 VMs, 18.9TB (68% full), avg read IOPS 5.2K, write IOPS 5.9K. Read IO 36KB, write IO 110KB. Read Latency 9.7ms, write latency 4.5ms.
  • This is average latency; maximum is interesting, some are off the chart. Some were hundreds of ms, even 2 seconds.
  • Computing the cache miss ratio. How much cache would we need to get about 90% hit ratio? Could do it with less than 12% of the total.
  • What is cache hit for writes? What fits in the write-back cache. You don’t want to be synchronous with the cloud. You’ll go bankrupt that way.
  • Importance of the warm tier. Hot data (Edge, on prem, SSD) = 12%, warm data (Metro PoP, SSD and HDD) = 6%, cold data (Cloud) = 82%. Shown as a “donut”.
  • Yes, this works! We’re having a very successful outcome with the customers currently engaged.
  • Data access is very tiered. Small amounts of flash can yield disproportionate performance benefits. Single tier cache in front of high latency storage can’t work. Network latency is as important as bounding media latency.
  • Make sure your caching is simple. Sometimes you are overthinking it.
  • Identifying application patterns is hard. Try to identify the sets of LBA that are accessed. Identify hot spots, which change over time. The shape of the miss ratio remains similar.
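
A quick Python worked example of the latency math above; the 1 ms edge hit and 40 ms miss path are assumptions taken from the figures quoted in the talk (sub-millisecond to the Metro POP, 25-50 ms on a miss), and the 90% hit ratio comes from their sizing study:

    def effective_latency_ms(hit_ratio, hit_ms=1.0, miss_ms=40.0):
        # Blend of edge-cache hits and cache-miss round trips toward the cloud.
        return hit_ratio * hit_ms + (1.0 - hit_ratio) * miss_ms

    print(effective_latency_ms(0.90))   # ~4.9 ms average read latency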

 

Emerging Trends in Software Development
Donnie Berkholz, Research Director, 451 Research

  • How people are building applications. How storage developers are creating and shipping software.
  • Technology adoption is increasingly bottom-up. Open source, cloud. Used to be like building a cathedral, now it’s more like a bazaar.
  • App-dev workloads are quickly moving to the cloud. Chart from all-on-prem at the top to all-cloud at the bottom.
  • All on-prem going from 59% now to 37% in a few years. Moving to different types of clouds (private cloud, Public cloud (IaaS), Public cloud (SaaS).
  • Showing charts for total data at organization, how much in off-premises cloud (TB and %). 64% of people have less than 20% on the cloud.
  • The new stack. There’s a lot of fragmentation. 10 languages in the top 80%. Used to be only 3 languages. Same thing for databases. It’s more composable, right tool for the right job.
  • No single stack. An infinite set of possibilities.
  • Growth in Web APIs charted since 2005 (from ProgrammableWeb). Huge growth.
  • What do enterprises think of storage vendors. Top vendors. People not particularly happy with their storage vendors. Promise index vs. fulfillment index.
  • Development trends that will transform storage.
  • Containers. Docker, docker, docker. Whale logos everywhere. When does it really make sense to use VMs or containers? You need lots of random I/O for these to work well. 10,000 containers in a cluster? Where do the databases go?
  • Developers love Docker. Chart on configuration management GitHub totals (CFEngine, Puppet, Chef, Ansible, Salt, Docker). Shows developer adoption. Docker is off the charts.
  • It’s not just a toy. Survey of 1,000 people on containers. Docker is only 2.5 years old now. 20% no plans, 56% evaluating. Total doing pilot or more add up to 21%. That’s really fast adoption
  • Docker to microservices.
  • Amazon: “Every single data transfer between teams has to happen through an API or you’re fired”. Avoid sending spreadsheets around.
  • Microservices thinking is more business-oriented, as opposed to technology-oriented.
  • Loosely couple teams. Team organization has a great influence in your development.
  • The foundation of microservices. Terraform, MANTL, Apache Mesos, Capgemini Appollo, Amazon EC2 Container Service.
  • It’s a lot about scheduling. Number of schedulers that use available resources. Makes storage even more random.
  • Disruption in data processing. Spark. It’s a competitor to Hadoop, really good at caching in memory, also very fast on disk. 10x faster than map-reduce. People don’t have to be big data experts. Chart: Spark came out of nowhere (mining data from several public forums).
  • The market is coming. Hadoop market as a whole growing 46% (CAGR).
  • Storage-class memory. Picture of 3D XPoint. Do app developer care? Not sure. Not many optimize for cache lines in memory. Thinking about Redis in-memory database for caching. Developers probably will use SCM that way. Caching in the order of TB instead of GB.
  • Network will be incredibly important. Moving bottlenecks around.
  • Concurrency for developers. Chart of years vs. percentage on Ohloh. Getting near to 1%. That’s a lot, since the most popular is around 10%.
  • Development trends
  • DevOps. Taking agile development all the way to production. Agile, truly tip to tail. You want to iterate while involving your customers. Already happening with startups, but how do you scale?
  • DevOps: Culture, Automation (Pets vs. Cattle), Measurement
  • Automation: infrastructure as code. Continuous delivery.
  • Measurement: Nagios, graphite, Graylog2, splunk, Kibana, Sensu, etsy/statsd
  • DevOps is reaching DBAs. #1 stakeholder in recent survey.
  • One of the most popular team structure change. Dispersing the storage team.
  • The changing role of standards
  • The changing role of benchmarks. Torturing databases for fun and profit.
  • I would love for you to join our panel. If you fill our surveys, you get a lot of data for free.

 

Learnings from Nearly a Decade of Building Low-cost Cloud Storage
Gleb Budman, CEO, Backblaze

  • What we learned, specifically the cost equation
  • 150+ PB of customer data. 10B files.
  • In 2007 we wanted to build something that would backup your PC/Mac data to the cloud. $5/month.
  • Originally we wanted to put it all on S3, but we would lose money on every single customer.
  • Next we wanted to buy SANs to put the data on, but that did not make sense either.
  • We tried a whole bunch of things. NAS, USB-connected drives, etc.
  • Cloud storage has a new player, with a shockingly low price: B2. One fourth of the cost of S3.
  • Lower than Glacier, Nearline, S3-Infrequent Access, anything out there. Savings here add up.
  • Datacenter: convert kilowatts-to-kilobits
  • Datacenter Consideration: local cost to power, real estate, taxes, climate, building/system efficiency, proximity to good people, connectivity.
  • Hardware: Connect hard drives to the internet, with as little as possible in between.
  • Backblaze storage box, costs about $3K. As simple as possible, don’t make the hardware itself redundant. Use commodity parts (example: desktop power supply), use consumer hard drives, insource & use math for drive purchases.
  • They told us we could not use consumer hard drives. But in reality the failure rate was actually lower. They last 6 years on average. Even if enterprise HDDs never failed, they still wouldn’t make sense.
  • Insource & use math for drive purchases. Drives are the bulk of the cost. Chart with time vs. price per gigabyte. Talking about the Thailand Hard Drive Crisis.
  • Software: Put all intelligence here.
  • Backblaze Vault: 20 hard drives create 1 tome that shares parts of a file, spread across racks.
  • Avoid choke point. Every single storage pods is a first class citizen. We can parallelize.
  • Algorithmically monitor SMART stats. Know which SMART codes correlate to annual failure rate. All the data is available on the site (all the codes for all the drives). https://www.backblaze.com/SMART (A minimal AFR calculation follows this list.)
  • Plan for silent corruption. Bad drive looks exactly like a good drive.
  • Put replication above the file system.
  • Run out of resources simultaneously. Hardware and software together. Avoid having CPU pegged and your memory unused. Have your resources in balance, tweak over time.
  • Model and monitor storage burn. It’s important not to have too much or too little storage. Leading indicator is not storage, it’s bandwidth.
  • Business processes. Design for failure, but fix failures quickly. Drives will die, it’s what happens at scale.
  • Create repeatable repairs. Avoid the need for specialized people to do repair. Simple procedures: either swap a drive or swap a pod. Requires 5 minutes of training.
  • Standardize on the pod chassis. Simplifies so many things…
  • Use ROI to drive automation. Sometimes doing things twice is cheaper than automation. Know when it makes sense.
  • Workflow for storage buffer. Treat buffer in days, not TB. Model how many days of space available you need. Break into three different buffer types: live and running vs. in stock but not live vs. parts.
  • Culture: question “conventional wisdom”. No hardware worshippers. We love our red storage boxes, but we are a software team.
  • Agile extends to hardware. Storage Pod Scrum, with product backlog, sprints, etc.
  • Relentless focus on cost: Is it required? Is there a comparable lower cost option? Can business processes work around it? Can software work around it?
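
A minimal Python sketch of the annualized failure rate math behind drive-fleet stats like these; the sample counts are made up, and the SMART-code correlation itself is Backblaze’s own analysis:

    def annualized_failure_rate(failures, drive_days):
        """Failures per drive-year of operation."""
        return failures / (drive_days / 365.0)

    # e.g. 1,000 drives observed for 90 days each, with 12 failures:
    print(f"{annualized_failure_rate(12, 1000 * 90):.2%}")   # ~4.87%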

 

f4: Facebook’s Warm BLOB Storage System
Satadru Pan, Software Engineer, Facebook

  • White paper “f4: Facebook’s Warm BLOB Storage System” at http://www-bcf.usc.edu/~wyattllo/papers/f4-osdi14.pdf
  • Looking at how data cools over time. 100x drop in reads in 60 days.
  • Handling failure. Replication: 1.2 * 3 = 3.6. To lose data we need to lose 9 disks or 3 hosts. Hosts in different racks and datacenters.
  • Handling load. Load spread across 3 hosts.
  • Background: Data serving. CDN protects storage, router abstracts storage, web tier adds business logic.
  • Background: Haystack [OSDI2010]. Volume is a series of blobs. In-memory index.
  • Introducing f4: Haystack on cells. Cells = disks spread over a set of racks. Some compute resource in each cell. Tolerant to disk, host, rack or cell failures.
  • Data splitting: Split data into smaller blocks. Reed Solomon encoding, Create stripes with 5 data blocks and 2 parity blocks.
  • Blobs laid out sequentially in a block. Blobs do not cross block boundary. Can also rebuild blob, might not need to read all of the block.
  • Each stripe in a different rack. Each block/blob split into racks. Mirror to another cell. 14 racks involved.
  • Read. Router does Index read, Gets physical location (host, filename, offset). Router does data read. If data read fails, router sends request to compute (decoders).
  • Read under datacenter failure. Replica cell in a different data center. Router proxies read to a mirror cell.
  • Cross datacenter XOR. Third cell has a byte-by-byte XOR of the first two. Now mix this across 3 cells (triplet). Each has 67% data and 33% replica. 1.5 * 1.4 = 2.1X.
  • Looking at reads with datacenter XOR. Router sends two read requests to two local routers. Builds the data from the reads from the two cells. (A toy XOR rebuild follows this list.)
  • Replication factors: Haystack with 3 copies (3.6X), f4 2.8 (2.8X), f4 2.1 (2.1X). Reduced replication factor, increased fault tolerance, increase load split.
  • Evaluation. What and how much data is “warm”?
  • CDN data: 1 day, 0.5% sampling. BLOB storage data: 2 weeks, 0.1% sampling. Random distribution of blobs assumed; the worst case rates reported.
  • Hot data vs. Warm data. 1 week – 350 reads/sec/disk, 1 month – 150r/d/s, 3 months – 70r/d/s, 1 year 20r/d/s. Wants to keep above 80 reads/sec/disk. So chose 3 months as divider between hot and warm.
  • It is warm, not cold. Chart of blob age vs access. Even old data is read.
  • F4 performance: most loaded disk in cluster: 35 reads/second. Well below the 80r/s threshold.
  • F4 performance: latency. Chart of latency vs. read response. F4 is close to Haystack.
  • Conclusions. Facebook blob storage is big and growing. Blobs cool down with age very rapidly. 100x drop in reads in 60 days. Haystack 3.6 replication over provisioning for old, warm data. F4 encodes data to lower replication to 2.1X, without compromising performance significantly.
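
A tiny Python illustration of the cross-datacenter XOR rebuild described above; the block contents are made up, and real f4 blocks are also Reed-Solomon coded within each cell:

    def xor_blocks(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    block_a = b"cell-A block...."               # block held in datacenter A
    block_b = b"cell-B block...."               # corresponding block in datacenter B
    xor_cell = xor_blocks(block_a, block_b)     # stored in the third datacenter

    # If datacenter A is lost, its block is rebuilt from B plus the XOR cell.
    assert xor_blocks(xor_cell, block_b) == block_a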

 

Pelican: A Building Block for Exascale Cold Data Storage
Austin Donnelly, Principal Research Software Development Engineer, Microsoft

  • White paper “Pelican: A building block for exascale cold data storage” at http://research.microsoft.com/pubs/230697/osdi2014-Pelican.pdf
  • This is research, not a product. No product announcement here. This is a science project that we offer to the product teams.
  • Background: Cold data in the cloud. Latency (ms. To hours) vs. frequency of access. SSD, 15K rpm HDD, 7.2K rpm HDD, Tape.
  • Defining hot, warm, archival tiers. There is a gap between warm and archival. That’s where Pelican (Cold) lives.
  • Pelican: Rack-scale co-design. Hardware and software (power, cooling, mechanical, HDD, software). Trade latency for lower cost. Massive density, low per-drive overhead.
  • Pelican rack: 52U, 1152 3.5” HDD. 2 servers, PCIe bus stretched rack wide. 4 x 10Gb links. Only 8% of disks can spin.
  • Looking at pictures of the rack. Very little there. Not many cables.
  • Interconnect details. Port multiplier, SATA controller, Backplane switch (PCIe), server switches, server, datacenter network. Showing bandwidth between each.
  • Research challenges: Not enough cooling, power, bandwidth.
  • Resource use: Traditional systems can have all disks running at once. In Pelican, a disk is part of a domain: power (2 of 16), cooling (1 of 12), vibration (1 of 2), bandwidth (tree).
  • Data placement: blob erasure-encoded on a set of concurrently active disks. Sets can conflict in resource requirement.
  • Data placement: random is pretty bad for Pelican. Intuition: concentrate conflicts over a few set of disks. 48 groups of 24 disk. 4 classes of 12 fully-conflicting groups. Blob storage over 18 disks (15+3 erasure coding).
  • IO scheduling: “spin up is the new seek”. All our IO is sequential, so we only need to optimize for spin up. Four schedulers, with 12 groups per scheduler, only one active at a time.
  • Naïve scheduler: FIFO. Pelican scheduler: request batching – trade between throughput and fairness.
  • Q: Would this much spinning up and down reduce the endurance of the disks? We’re studying it; not conclusive yet, but looking promising so far.
  • Q: What kind of drive? Archive drives, not enterprise drives.
  • Demo. Showing system with 36 HBAs in device manager. Showing Pelican visualization tool. Shows trays, drives, requests. Color-coded for status.
  • Demo. Writing one file: drives spin up, request completes, drives spin down. Reading one file: drives spin up, read completes, drives spin down.
  • Performance. Compare Pelican to a mythical beast. Results based on simulation.
  • Simulator cross-validation. Burst workload.
  • Rack throughput. Fully provisioned vs. Pelican vs. Random placement. Pelican works like fully provisioned up to 4 requests/second.
  • Time to first byte. Pelican adds spin-up time (14.2 seconds).
  • Power consumption. Comparing all disks on standby (1.8kW) vs. all disks active (10.8kW) vs. Pelican (3.7kW).
  • Trace replay: European Center for Medium-range Weather Forecast. Every request for 2.4 years. Run through the simulator. Tiering model. Tiered system with Primary storage, cache and pelican.
  • Trace replay: Plotting highest response time for a 2h period. Response time was not bad, simulator close to the rack.
  • Trace replay: Plotting deepest queues for a 2h period. Again, simulator close to the rack.
  • War stories. Booting a system with 1152 disks (BIOS changes needed). Port multiplier – port 0 (firmware change needed). Data model for system (serial numbers for everything). Things to track: slots, volumes, media.

 

Torturing Databases for Fun and Profit
Mai Zheng, Assistant Professor Computer Science Department – College of Arts and Sciences, New Mexico State University

  • White paper “Torturing Databases for Fun and Profit” at https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-zheng_mai.pdf
  • Databases are used to store important data. Should provide ACID properties: atomicity, consistency, isolation, durability – even under failures.
  • List of databases that passed the tests: <none>. Everything is broken under simulated power faults.
  • Power outages are not that uncommon. Several high profile examples shown.
  • Fault model: clean termination of I/O stream. Model does not introduce corruption/dropping/reorder.
  • How to test: Connect database to iSCSI target, then decouple the database from the iSCSI target.
  • Workload example. Key/value table. 2 threads, 2 transactions per thread.
  • Known initial state, each transaction updates N random work rows and 1 meta row. Fully exercise concurrency control.
  • Simulates power fault during our workload. Is there any ACID violation after recovery? Found atomicity violation.
  • Capture I/O trace without kernel modification. Construct a post-fault disk image. Check the post-fault DB. (A toy model of this step follows this list.)
  • This makes testing different fault points easy. But enhanced it with more context, to figure out what makes some fault points special.
  • With that, five patterns found. Unintended update to the mmap’ed blocks. Pattern-based ranking of where fault injections will lead to pattern.
  • Evaluated 8 databases (open source and commercial). Not a single database could survive.
  • The most common violation was durability. Some violations are difficult to trigger, but the framework helped.
  • Case study: A TokyoCabinet Bug. Looking at the fault and why the database recovery did not work.
  • Pattern-based fault injection greatly reduced test points while achieving similar coverage.
  • Wake up call: Traditional testing methodology may not be enough for today’s complex storage systems.
  • Thorough testing requires purpose-built workloads and intelligent fault injection techniques.
  • Different layers in the OS can help in different ways. For instance, iSCSI is an ideal place for fault injection.
  • We should bridge the gaps in understanding and assumptions. For instance, durability might not be provided by the default DB configuration.
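
A toy Python model of the fault model and post-fault image construction described above; it works at block granularity only, and the real framework additionally records context to rank interesting fault points:

    def post_fault_image(initial_image, write_trace, fault_point):
        """Clean termination of the I/O stream after `fault_point` writes:
        no corruption, dropping, or reordering is introduced."""
        image = dict(initial_image)
        for lba, data in write_trace[:fault_point]:
            image[lba] = data
        return image

    trace = [(10, b"txn-1 row"), (0, b"commit record"), (11, b"txn-2 row")]
    # Inject the fault after the first two writes, then hand the image to recovery
    # and check the ACID properties of whatever the database recovers.
    crashed_image = post_fault_image({}, trace, fault_point=2)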

 

Personal Cloud Self-Protecting Self-Encrypting Storage Devices
Robert Thibadeau, Ph.D., Scientist and Entrepreneur, CMU, Bright Plaza
http://www.snia.org/sites/default/files/DSS-Summit-2015/presentations/RobertThibadeau_Personal%20Cloud.pdf

  • This talk is about personal devices, not enterprise storage.
  • The age of uncontrolled data leaks. Long list of major hacks recently. All phishing initiated.
  • Security ~= Access Control.  Security should SERVE UP privacy.
  • Computer security ~= IPAAAA: Integrity, Privacy, Authentication, Authorization, Audit, Availability. The first three involve encryption, the others don’t.
  • A storage device is a computing device. Primary host interface, firmware, special hardware functions, diagnostic parts, probe points.
  • For years, there was a scripting language inside the drives.
  • TCG Core Spec. Core (Data Structures, Basic Operations) + Scripting (Amazing use cases).
  • Security Provider: Admin, Locking, Clock, Forensic Logging, Crypto services, internal controls, others.
  • What is an SED (Self-Encrypting Device)? Drive Trust Alliance definition: Device uses built-in hardware encryption circuits to read/write data in/out of NV storage.
  • At least one Media Encryption Key (MEK) is protected by at least one Key Encryption Key (KEK, usually a “password”).
  • Self-Encrypting Storage. Personal Storage Landscape. People don’t realize how successful it is.
  • All self-encrypting today: 100% of all SSDs, 100% of all enterprise storage (HDD, SSD, etc), all iOS devices, 100% of WD USB HDDs,
  • Much smaller number of personal HDDs are Opal or SED. But Microsoft Bitlocker supports “eDrive” = Opal 2.0 drives of all kinds.
  • You lose 40% of performance of a phone if you’re doing software encryption. You must do it in hardware.
  • Working on NVM right now.
  • Drive Trust Alliance: sole purpose to facilitate adoption of Personal SED. www.drivetrust.org
  • SP-SED Rule 1 – When we talk about cloud things, every personal device is actually in the cloud so… Look in the clouds for what should be in personal storage devices.
  • TCG SED Range. Essentially partitions in the storage devices that have their own key. Bitlocker eDrive – 4 ranges. US Government uses DTA open source for creating resilient PCs using ranges. BYOD and Ransomware protection containers.
  • Personal Data Storage (PDS). All data you want to protect can be permitted to be queried under your control.
  • Example: You can ask if you are over 21, but not what your birthday is or how old you are, although data is in your PDS.
  • MIT Media Lab, OpenPDS open source offered by Kerberos Consortium at MIT.
  • Homomorphic Encryption. How can you do computing operations on encrypted data without ever decrypting the data. PDS: Ask questions without any possibility of getting at the data.
  • It’s so simple, but really hard to get your mind wrapped around it. The requests come encrypted, results are encrypted and you can never see the plaintext over the line.
  • A general solution was discovered, but it was computationally infeasible (like Bitcoin). Only in the last few years (since 2011) has it improved (a toy example follows these notes).
  • HE Cloud Model and SP-SED Model. Uses OAuth. You can create personal data and you can get access to questions about your personal data. No plain text.
  • Solution for Homomorphic Encryption. Examples – several copies of the data. Multiple encryption schemes. Each operation (Search, Addition, Multiplication) uses a different scheme.
  • There’s a lot of technical work on this now. Your database will grow a lot to accommodate these kinds of operations.
  • SP-SED Rule 2 – Like the internet cloud: if anybody can make money off an SP-SED, then people get really smart really fast… SP-SED should charge $$ for access to the private data they protect.
  • The TCG Core Spec was written with this in mind. PDS and Homomorphic Encryption provide a conceptual path.
  • Challenges to you: The TCG Core was designed to provide a service identical to the Apple App Store, but in Self-Protecting Storage devices. Every personal storage device should let the owner of the device make money off his private data on it.
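
To make the MEK/KEK bullet above concrete, here is a minimal PowerShell sketch (my own illustration, not tied to any particular SED or to the TCG spec) that wraps a randomly generated Media Encryption Key with a Key Encryption Key derived from a password. Real devices do this in dedicated hardware with standardized key wrapping; treat this only as a picture of the relationship between the two keys.

# Generate a random 256-bit MEK (the key that actually encrypts the media)
$rng = [System.Security.Cryptography.RandomNumberGenerator]::Create()
$mek = New-Object byte[] 32
$rng.GetBytes($mek)

# Derive a KEK from the user's password (PBKDF2 with a random salt)
$salt = New-Object byte[] 16
$rng.GetBytes($salt)
$kdf = New-Object System.Security.Cryptography.Rfc2898DeriveBytes('my passphrase', $salt, 100000)
$kek = $kdf.GetBytes(32)

# Wrap (encrypt) the MEK with the KEK; only the wrapped MEK is persisted
$aes = [System.Security.Cryptography.Aes]::Create()
$aes.Key = $kek
$wrappedMek = $aes.CreateEncryptor().TransformFinalBlock($mek, 0, $mek.Length)
# Changing the password means re-wrapping the MEK, not re-encrypting the media.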
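
The homomorphic encryption bullets in these notes are easier to digest with a toy example (mine, not from the talk). Textbook RSA happens to be multiplicatively homomorphic: multiplying two ciphertexts and decrypting the result yields the product of the two plaintexts, so the multiplication is performed without ever seeing the plaintext. The tiny hard-coded primes and the lack of padding make this a classroom illustration only, not real cryptography.

# Textbook RSA with toy parameters: p=61, q=53, n=3233, phi=3120
$n = [bigint]3233
$e = [bigint]17     # public exponent
$d = [bigint]2753   # private exponent (17 * 2753 = 1 mod 3120)

function Encrypt([bigint]$m) { [bigint]::ModPow($m, $e, $n) }
function Decrypt([bigint]$c) { [bigint]::ModPow($c, $d, $n) }

$c1 = Encrypt 6
$c2 = Encrypt 7
$cProduct = ($c1 * $c2) % $n   # multiply the ciphertexts only
Decrypt $cProduct              # prints 42: the product was computed "blind"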

 

Hitachi Data Systems – Security Directions and Trends
Eric Hibbard, Chair SNIA Security Technical Working Group, CTO Security and Privacy HDS

  • Protecting critical infrastructure. No agreement on what is critical.
  • What are the sections of critical infrastructure (CI)? Some commonality, but no agreement. US=16 sectors, CA=10, EU=12, UK=9, JP=10.
  • US Critical Infrastructure. Less than 20% controlled by the government. Significant vulnerabilities. Good news is that cybersecurity is a focus now. Bad news: a lot of interdependencies (lots of things depend on electric power).
  • Threat landscape for CI. Extreme weather, pandemics, terrorism, accidents/technical failures, cyber threats.
  • CI Protection – Catapulted to the forefront. Several incidents, widespread concern, edge of cyber-warfare, state-sponsored actions.
  • President Obama declared a National Emergency on 04/01/2015 due to the rising number of cyberattacks.
  • CI protection initiatives. CI Decision-making organizations, CIP decisions. CIP decision-support system. The goal is to learn from attacks, go back and analyze what we could have done better.
  • Where is the US public sector going? Rethinking strategy, know what to protect, understand value of information, beyond perimeter security, cooperation.
  • Disruptive technologies:  Mobile computing, cloud computing, machine-to-machine, big data analytics, industrial internet, Internet of things, Industry 4.0, software defined “anything”. There are security and privacy issues for each. Complexity compounded if used together.
  • M2M maturity. Machine-to-machine communication between devices that are extremely intelligent, maybe AI.
  • M2M analytics building block. Big Data + M2M. This is the heart and soul of smart cities. This must be secured.
  • IoT. 50 billion connected objects expected by 2020. These will stay around for a long time. What if they are vulnerable and inside a wall?
  • IoT will drive big data adoption. Real time and accurate data sensing. They will know where you are at any point in time.
  • CI and emerging technology. IoT helps reduce cost, but it increases risks.
  • Social Infrastructure (Hitachi View). Looking at all kinds of technologies and their interplay. It requires a collaborative system.
  • Securing smart sustainable cities. Complex systems, lots of IoT and cloud and big data, highly vulnerable. How to secure them?

 

Enterprise Key Management & KMIP: The Real Story  – Q&A with EKM Vendors
Moderator: Tony Cox, Chair SNIA Storage Security Industry Forum, Chair OASIS KMIP Technical Committee
Panelists: Tim Hudson, CTO, Cryptsoft
Nathan Turajski, Senior Product Manager, HP
Bob Lockhart, Chief Solutions Architect, Thales e-Security, Inc
Liz Townsend, Director of Business Development, Townsend Security
Imam Sheikh, Director of Product Management, Vormetric Inc

  • Goal: a Q&A to explore perspectives on EKM and KMIP.
  • What are the most critical concerns and barriers to adoption?
  • Some of the developers who built the solution are no longer there. The key repository is an Excel spreadsheet. Need to explain that there are better key management solutions.
  • Different teams see this differently (security, storage). Need a set of requirements across teams.
  • Concern with using multiple vendors, interoperability.
  • Getting the right folks educated about basic key management, standards, how to evaluate solutions.
  • Understanding the existing solutions already implemented.
  • Would you say that the OASIS key management standard has progressed to a point where it can be implemented with multiple vendors?
  • Yes, we have demonstrated this many times.
  • Trend to use KMIP to pull keys down from repository.
  • Different vendors excel in different areas, and complex systems do use multiple vendors.
  • We have seen migrations from one vendor to another. The interoperability is real.
  • KMIP has become a cost of entry. Vendors that do not implement it are being displaced.
  • It’s not just storage. Mobile and Cloud as well.
  • What’s driving customer purchasing? Is it proactive or reactive? With interoperability, where is the differentiation?
  • It’s a mix of proactive and reactive. Each vendor has different background and different strengths (performance, clustering models). There are also existing vendor relationships.
  • Organizations still buy for specific applications.
  • It’s mixed, but some customers are planning two years down the line. One vendor might not be able to solve all the problems.
  • Compliance is driving a lot of the proactive work, although meeting compliance is a low bar.
  • Storage drives a lot of it, storage encryption drives a lot of it.
  • What benefits are customers looking for when moving to KMIP? Bad guy getting to the key, good guy losing the key, reliably forget the key to erase data?
  • There’s quite a mix of priorities: operational requirements not to disrupt operations, and assurances that a key has been destroyed and is not kept anywhere.
  • Those were all possible before. KMIP is about making those things easier to use and integrate.
  • Motivation is to follow the standard, auditing key transitions across different vendors.
  • When I look at the EU regulation and at cloud computing federating key management: is KMIP going to scale to billions of keys in the future?
  • We have vendors that work today with tens of billions of keys and are moving beyond that. The underlying technology to handle federation is there; the products will mature over time.
  • It might actually be trillions of keys, when you count all the applications like the smart cities, infrastructure.
  • When LDAP is fully secure and everything is encrypted, how do the secure and unsecure worlds merge?
  • Having conversations about different levels of protections for different attributes and objects.
  • What is the difference between local key management and remote or centralized approaches?
  • There are lots of best practices in the high scale solutions (like separation of duties), and not all of them are there for the local solution.
  • I don’t like to use simple and enterprise to classify. It’s better to call them weak and strong.
  • There are scenarios where the key needs to be local for some reason, but you still need to secure the key, maybe with a hybrid solution that has a cloud component.
  • Some enterprises think in terms of individual projects, local key management. If they step back, they will see the many applications and move to centralized.
  • As the number of keys grows, will we need a lot more repositories with more interop?
  • Yes. It is more and more a requirement, like in cloud and mobile.
  • Use KMIP layer to communicate between them.
  • We’re familiar with use cases, but what about abuse cases? How to protect that infrastructure?
  • It goes back to not doing security by obscurity.
  • You use a standard and audit the accesses. The system will be able to audit, analyze and alert you when it sees these abuses.
  • The repository has to be secure, with two-factor authentication, real time monitoring, allow lists for who can access the system. Multiple people to control your key sets.
  • Key management is part of the security strategy, which needs to be multi-layered.
  • Simple systems and a common language are a vector for attack, but we need to do it.
  • Key management and encryption is not the end all and be all. There must be multiple layers. Firewall, access control, audit, logging, etc. It needs to be comprehensive.

 

Lessons Learned from the 2015 Verizon Data Breach Investigations Report
Suzanne Widup, Senior Analyst, Verizon
http://www.snia.org/sites/default/files/DSS-Summit-2015/presentations/SuzanneWidupLearned_Lessons_Verizon.pdf

  • Fact based research, gleaned from case reports. Second year that we used data visualization. Report at http://www.verizonenterprise.com/DBIR/2015/
  • 2015 DBIR: 70 contributed organizations, 79,790 security incidents, 2,122 confirmed data breaches, 61 countries
  • The VERIS framework (actor – who did it, action – how they did it, asset – what was affected, attribute – how it was affected). Given away for free.
  • We can’t share all the data. But some of it is publicly disclosed and it’s in a GitHub repository as JSON files. http://www.vcdb.org.
  • You can be a part of it. Vcdb.org needs volunteers – be a security hero.
  • Looking at incidents vs. breaches. Divided by industry. Some industries have higher vulnerabilities, but a part of it is due to visibility.
  • Which industries exhibit similar threat profiles? There might be other industries that look similar to yours…
  • Zooming into healthcare and other industries with similar threat profiles.
  • Threat actors. Mostly external. Less than 20% internal.
  • Threat actions. Credentials (down), RAM scrapers (up), spyware/keyloggers (down), phishing (up).
  • The detection deficit. Overall trend is still pretty depressing. The bad guys are innovating faster than we are.
  • Discovery time line (from 2015). Mostly discovered in days or less.
  • The impact of breaches. We were not equipped to measure impact before. This year we partnered with insurance partners. We only have 50% of what is going on here.
  • Plotting the impact of breaches. If you look at the number of incidents, it was going down. If you look at the records lost, it is growing.
  • Charting number of records (1 to 100M) vs. expected loss (US$). There is a band from optimist to pessimist.
  • The nefarious nine: misc errors, crimeware, privilege misuse, lost/stolen assets, web applications, denial of service, cyber-espionage, point of sale, payment card skimmers.
  • Looks different if you use just breaches instead of all incidents. Point of sale is higher, for instance.
  • All incidents, charted over time (graphics are fun!)
  • More charts. Actors and the nine patterns. Breaches by industry.
  • Detailed look at point of sale (highest in accommodation, entertainment and retail), crimeware, cyber-espionage (lots of phishing), insider and privilege misuse (financial motivation), lost/stolen devices, denial of service.
  • Threat intelligence. Share early so it’s actionable.
  • Phishing for hire companies (23% of recipients open phishing messages, 11% click on attachments)
  • 10 CVEs account for 97% of exploits. Pay attention to the old vulnerabilities.
  • Mobile malware. Android “wins” over iOS.
  • Two-factor authentication and patching web servers each mitigate 24% of vulnerabilities.

My Top Reasons to Use OneDrive


 

As you might have noticed, I am now in the OneDrive team. Since I’ve been here for a few months, I think I earned the right to start sharing a few blogs about OneDrive. I’ll do that over the next few months, focusing on the user’s view of OneDrive (as opposed to the view we have from the inside).

 

To get things started, this post shares my top reasons to use OneDrive. As you probably already heard, OneDrive is a cloud storage solution by Microsoft. You can upload, download, sync, and share files from your PC, Mac, Phone or Tablet. Here are a few reasons why I like to use OneDrive.

 

1) Your files in the cloud. The most common reason for using OneDrive is to upload or synchronize your local data to the cloud. This will give you one extra copy of your documents, pictures and videos, which you could use if your computer breaks. Remember the 3-2-1 rule: have 3 copies of your important files, on 2 different media, with 1 in another site. For instance, you could have one copy of your files on your PC, one copy on an external drive and one copy in OneDrive.

 


 

2) View and edit Office documents. OneDrive offers a great web interface that you can reach anywhere you have a OneDrive client or by using the http://onedrive.com web site. The site includes viewers for common data types like videos and pictures. For your Office documents, you can use the great new Office apps for Windows, Mac OS X, Windows Phone, iOS and Android. You can also use the web versions of Word, Excel, PowerPoint or OneNote right from the OneDrive.com web site to create, view and edit your documents (even if Office is not installed on the machine).

 


 

3) Share files with others. Once your data is in the cloud, you have the option to share a file or an entire folder with others. You can use this to share pictures with your family or to share a document with a colleague. It’s simple to share, simple to access, and you can stop sharing at any time. OneDrive also has a handy feature that shows files shared with you as part of your drive, which is quite useful.

 


 

4) Upload your photos automatically. If you use a phone or tablet to take pictures and video, you can configure it to automatically upload them to OneDrive. This way your cherished memories will be preserved in the cloud. If you’re on vacation and your phone is lost or stolen, you can replace the phone, knowing that your files were already preserved. We have OneDrive clients for Windows Phone, iOS and Android.

 


 

5) Keep in sync across devices. If you have multiple computers, you know how hard it is to keep data in sync. With OneDrive, you can keep your desktop, your laptop and your tablet in sync, automatically. We have OneDrive sync clients for Windows and Mac OS X. You also have the option to sync only a subset of your folders. This helps you keep all files on a computer with a large drive, but only a few folders on another computer with limited storage.

 


 

6) Search. OneDrive offers a handy search feature that can help you find any of your files. Beyond simply searching for document names or text inside your documents, OneDrive will index the text inside your pictures, the type of picture (using tags like #mountain, #people, #car or #building) and the place where a picture was taken.

 


 

Did I forget something important? Use the comments to share other reasons why you like to use OneDrive…

Perhaps OneDrive


 

Perhaps OneDrive

Perhaps OneDrive’s like a place to save
A shelter from the storm
It exists to keep your files
In their clean and tidy form
And in those times of trouble
When your PC is gone
The memory in OneDrive
will bring you home

Perhaps OneDrive is like a window
Perhaps like one full screen
On a watch or on a Surface Hub
Or anywhere in between
And even if you lose your cell
With pictures you must keep
The memory in OneDrive
will stop your weep.

OneDrive to some is like a cloud
To some as strong as steel
For some a way of sharing
For some a way to view
And some use it on Windows 10
Some Android, some iPhone
Some browse it on a friend’s PC
When away from their own

Perhaps OneDrive is like a workbench
Full of projects, full of plans
Like the draft of a great novel
your first rocket as it lands
If I should live forever
And all my dreams prevail
The memory in OneDrive
will tell my tale

PowerShell for finding the size of your local OneDrive folder


I would just like to share a couple of PowerShell scripts to find the size of your local OneDrive folder. Note that this just looks at folder structures and does not interact with the OneDrive sync client or the OneDrive service.

First, a one-liner to show the total files, bytes and GBs under the local OneDrive folder (typically C:\Users\Username\OneDrive):

$F=0;$B=0;$N=(Type Env:\UserProfile)+"\OneDrive";Dir $N -Recurse -File -Force|%{$F++;$B+=$_.Length};$G=$B/1GB;"$F Files, $B Bytes, $G GB" #PS OneDrive Size

Second, a slightly longer script that shows files, folders, bytes and GBs for every folder under the profile folder whose name starts with “One”. That typically includes both your regular OneDrive folder and any OneDrive for Business folders:

# Look at every folder under the user profile whose name starts with "One"
$OneDrives = (Get-Content Env:\USERPROFILE)+"\One*"
Dir $OneDrives | % {
   $Files=0
   $Bytes=0
   $OneDrive = $_
   # Count every file (including hidden files, because of -Force) and add up the sizes
   Dir $OneDrive -Recurse -File -Force | % {
       $Files++
       $Bytes += $_.Length
   }
   # Count the subfolders separately, then report the totals for this OneDrive folder
   $Folders = (Dir $OneDrive -Recurse -Directory -Force).Count
   $GB = [System.Math]::Round($Bytes/1GB,2)
   Write-Host "Folder ‘$OneDrive’ has $Folders folders, $Files files, $Bytes bytes ($GB GB)"
}

Here is a sample output of the code above:

Folder ‘C:\Users\jose\OneDrive’ has 4239 folders, 33967 files, 37912177448 bytes (35.31 GB)
Folder ‘C:\Users\jose\OneDrive-Microsoft’ has 144 folders, 974 files, 5773863320 bytes (5.38 GB)

The ABC language, thirty years later…


Back in March 1986, I was in my second year of college (Data Processing at the Universidade Federal do Ceara in Brazil). I was also teaching programming night classes at a Brazilian technical school. That year, I created a language called ABC, complete with a little compiler. It compiled the ABC code into pseudo-code and ran it right away.

I actually used this language for a few years to teach an introductory programming class. Both the commands of the ABC language and the messages of the compiler were written in Portuguese. This made it easier for my Brazilian students to start in computer programming without having to know any English. Once they were familiar with the basic principles, they would start using conventional languages like Basic and Pascal.

The students would write some ABC code using a text editor and run the command “ABC filename” to compile and immediately run the code if no errors were found. The tool wrote a binary log entry for every attempt to compile/run a program with the name of the file, the error that stopped the compilation or how many instructions were executed. The teachers had a tool to read this binary log and examine the progress of a student over time.

I remember having a lot of fun with this project. The language was very simple and each command would have up to two parameters, followed by a semicolon. There were dozens of commands including:

  • Inicio (start, no action)
  • Fim (end, no action)
  • * (comment, no action)
  • Mova (move, move register to another register)
  • Troque (swap, swap contents of two registers)
  • Salve (save, write the 100 memory positions to a file)
  • Restaure (restore, read the memory positions back from a file)
  • Entre (enter, receive input from the keyboard)
  • Escreva (write, write to the printer)
  • Escreva> (writeline, write to the printer and jump to the next line)
  • Salte (jump, jump to the next printed page)
  • Mostre (display, display on the screen)
  • Mostre> (displayline, display on the screen and jump to the next line)
  • Apague (erase, erase the screen)
  • Cursor (cursor, position the cursor at the specified screen coordinates)
  • Pausa (pause, pause for the specified seconds)
  • Bip (beep, make a beeping sound)
  • Pare (stop, stop executing the program)
  • Desvie (goto, jump to the specified line number)
  • Se (if, start a conditional block)
  • FimSe (endif, end a conditional block)
  • Enquanto (while, start a loop until a condition is met)
  • FimEnq (endwhile, end of while loop)
  • Chame (call, call a subroutine)
  • Retorne (return, return from a subroutine)
  • Repita (repeat, start a loop that repeats a number of times)
  • FimRep (endrepeat, end of repeat loop)
  • AbraSai (openwrite, open file for writing)
  • AbraEnt (openread, open file for reading)
  • Feche (close, close file)
  • Leia (read, read from file)
  • Grave (write, write to file)
  • Ponha (poke, write to memory address)
  • Pegue (peek, read from memory address)

The language used 26 pre-defined variables named after each letter. There were also 100 memory positions you could read/write into. I was very proud of how you could use complex expressions with multiple operators, parentheses, different numeric bases (binary, octal, decimal, hex) and functions like:

  • Raiz (square root)
  • Inverso (reverse string)
  • Caractere (convert number into ASCII character)
  • Codigo (convert ASCII character into a number)
  • FimArq (end of file)
  • Qualquer (random number generator)
  • Tamanho (length of a string)
  • Primeiro (first character of a string)
  • Restante (all but the first character of a string)

I had a whole lot of samples written in ABC, showcasing each of the commands, but I somehow lost them along the way. I also had a booklet that we used in the programming classes, with a series of concepts followed by examples in ABC. I could not find that either. Oh, well…

At least the source code survived (see below). I used an old version of Microsoft Basic running on a CP/M 2.2 operating system on a TRS-80 clone. Here are a few comments for those not familiar with that 1980’s language:

  • Line numbers were required. Colons were used to separate multiple commands in a single line.
  • Variables ending in $ were of type string. Variables with no suffix were of type integer.
  • Your variable names could be any length, but only the first 4 characters were actually used. Periods were allowed in variable names.
  • DIM was used to create arrays. Array dimensions were predefined and fixed. There wasn’t a lot of memory.
  • READ command was used to read from DATA lines. RESTORE would set the next DATA line to READ.
  • Files could be OPEN for sequential read (“I” mode), sequential write (“O” mode) or random access (“R” mode).

It compiled into a single ABC.COM file (that was the executable extension then). It also used the ABC.OVR file, which contained the error messages and up to 128 compilation log entries. Comments are in Portuguese, but I bet you can understand most of it. The code is a little messy, but keep in mind this was written 30 years ago…

 

2 '************************************************************
3 '*   COMPILADOR/EXECUTOR DE LINGUAGEM ABC - MARCO/1986      *
4 '*               Jose Barreto de Araujo Junior              *
5 '*     com calculo recursivo de expressoes aritmeticas      *
6 '************************************************************
10 ' Versao  2.0 em 20/07/86
11 ' Revisao 2.1 em 31/07/86
12 ' Revisao 2.2 em 05/08/86
13 ' Revisao 2.3 em 15/02/87
14 ' Revisao 2.4 em 07/06/87, em MSDOS
20 '********** DEFINICOES INICIAIS
21 DEFINT A-Z:CLS:LOCATE 1,1,1:ON ERROR GOTO 63000
22 C.CST=1:C.REGIST=2:LT$=STRING$(51,45)
25 DIM ENT$(30),RET$(30),TP(30),P1$(30),P2$(30)
30 DIM CMD(200),PR1$(199),PR2$(199)
35 DIM MEM$(99),REGIST$(26),PRM$(4),MSG$(99)
36 DIM CT(40),REP(10),REPC(10),ENQ(10),ENQ$(10),CHA(10)
40 DEF FNS$(X)=MID$(STR$(X),2)
55 OPER$="!&=#><+-*/^~":MAU$=";.[]()?*"
60 FUNC$="RAIZ     INVERSO  CARACTER CODIGO   FIMARQ   QUALQUER "
62 FUNC$=FUNC$+"TAMANHO  PRIMEIRO RESTANTE ARQUIVO  "
65 ESC$=CHR$(27):BIP$=CHR$(7):TABHEX$="FEDCBA9876543210"
66 OK$=CHR$(5)+CHR$(6)+CHR$(11)
70 M.LN=199:M.CMD=37:MAX=16^4/2-1
75 ESP$=" ":BK$=CHR$(8):RN$="R":IN$="I":OU$="O":NL$=""
80 OPEN RN$,1,"ABC2.OVR",32:FIELD 1,32 AS ER$
85 IF LOF(1)=0 THEN CLOSE:KILL"ABC2.OVR":PRINT "ABC2.OVR NAO ENCONTRADO":END
90 GOSUB 10000 '********** MOSTRA MENSAGEM INICIAL
95 PRINT "Nome do programa: ";:BAS=1:GOSUB 18000:AR$=RI$:GOSUB 10205
99 '********** DEFINICAO DOS COMANDOS
100 DIM CMD$(37),PR$(37):CHQ=0:RESTORE 125
105 FOR X=1 TO M.CMD:READ CMD$(X),PR$(X)
110    CHQ=CHQ+ASC(CMD$(X))+VAL(PR$(X))
115 NEXT : IF CHQ<>3402 THEN END
120 '********** TABELA DOS COMANDOS E PARAMETROS
125 DATA INICIO,10,FIM,10,"*",10
130 DATA MOVA,54,TROQUE,55
135 DATA SALVE,30,RESTAURE,30," ",00
140 DATA ENTRE,52,ESCREVA,42,ESCREVA>,42,MOSTRE,42,MOSTRE>,42
145 DATA SALTE,00,APAGUE,00,CURSOR,22,PAUSA,20,BIP,00
150 DATA PARE,00,DESVIE,40,SE,20," ",00,FIMSE,00
155 DATA ENQUANTO,20," ",00,FIMENQ,00,CHAME,20,RETORNE,00
160 DATA REPITA,20,FIMREP,00
165 DATA ABRASAI,30,ABRAENT,30,FECHE,00,LEIA,50,GRAVE,40
170 DATA PONHA,42,PEGUE,52
190 '********** ABRE ARQUIVO PROGRAMA
200 IF LEN(ARQ$)=0 THEN ERROR 99:GOTO 64000
210 OPEN RN$,2,ARQ$:ULT=LOF(2):CLOSE#2
220 IF ULT=0 THEN KILL ARQ$:ERROR 109:GOTO 64000
390 '********** COMPILACAO
400 N.ERR=0:N.LN=0:IDT=0:CT.SE=0:CT.REP=0:CT.ENQ=0:I.CT=0:LN.ANT=0:CMP=1
405 PRINT:PRINT:PRINT "Compilando ";ARQ$
406 IF DEPUR THEN PRINT "Depuracao"
407 PRINT
410 OPEN IN$,2,ARQ$
415 WHILE NOT EOF(2)
420     LN.ERR=0:LINE INPUT#2,LN$
422     IF INKEY$=ESC$ THEN PRINT "*** Interrompido":GOTO 64000
425     N.LN=N.LN+1:GOSUB 20000 '*ANALISE SINTATICA DA LINHA
430 WEND:CLOSE#2
435 FOR X=IDT TO 1 STEP -1
440     ERROR CT(X)+115
445 NEXT X
450 PRINT:PRINT FNS$(N.LN);" linha(s) compilada(s)"
490 '********** EXECUCAO
500 IF N.ERR THEN PRINT FNS$(N.ERR);" erro(s)":GOTO 64000
510 PRINT "0 erros"
515 PRINT "Executando ";ARQ$:PRINT
520 NL=1:CMP=0:N.CMD=0:CHA=0:ENQ=0:REP=0:SE=0:ESC=0
525 FOR X=1 TO 99:MEM$(X)="":NEXT:FOR X=1 TO 26:REGIST$(X)="":NEXT
530 WHILE NL<=M.LN
535     PNL=NL+1:CMD=CMD(NL):PR1$=PR1$(NL):PR2$=PR2$(NL)
540     IF CMD>3 THEN GOSUB 30000:N.CMD=N.CMD+1 '****** EXECUTA COMANDO
550     NL=PNL:REGIST$(26)=INKEY$
555     IF REGIST$(26)=ESC$ OR ESC=1 THEN NL=M.LN+1:PRINT "*** Interrompido"
560 WEND
570 PRINT:PRINT ARQ$;" executado"
580 PRINT FNS$(N.CMD);" comando(s) executado(s)"
590 PRINT:PRINT "Executar novamente? ";
600 A$=INPUT$(1):IF A$="S" OR A$="s" THEN PRINT "sim":GOTO 515
610 PRINT "nao";:GOTO 64000
9999 '********** ROTINA DE MENSAGEM INICIAL
10000 CLS:PRINT LT$
10020 XA$="| COMPILADOR/EXECUTOR DE LINGUAGEM ABC VERSAO 2.4 |"
10030 PRINT XA$:PRINT LT$:PRINT
10040 CHQ=0:FOR X=1 TO LEN(XA$):CHQ=CHQ+ASC(MID$(XA$,X,1)):NEXT
10050 IF CHQ<>3500 THEN END ELSE RETURN
10199 '********** ROTINA PARA PEGAR NOME DO ARQUIVO
10200 AR$=NL$:K=PEEK(128):FOR X=130 TO 128+K:AR$=AR$+CHR$(PEEK(X)):NEXT
10205 IF AR$="" THEN ERROR 99:GOTO 64000
10210 AR$=AR$+ESP$:PS=INSTR(AR$,ESP$)
10220 ARQ$=LEFT$(AR$,PS-1):RESTO$=MID$(AR$,PS+1)
10221 IF LEFT$(RESTO$,1)="?" THEN DEPUR=1
10230 FOR X=1 TO LEN(MAU$):P$=MID$(MAU$,X,1)
10240   IF INSTR(ARQ$,P$) THEN ERROR 100:GOTO 64000
10250 NEXT
10270 IF LEN(ARQ$)>12 THEN ERROR 100:GOTO 64000
10280 IF INSTR(ARQ$,".")=0 THEN ARQ$=ARQ$+".ABC"
10290 RETURN
17999 '********** ROTINA DE ENTRADA DE DADOS
18000 BAS$=FNS$(BAS):RI$=NL$
18010 A$=INPUT$(1)
18020 WHILE LEN(RI$)<255 AND A$<>CHR$(13) AND A$<>ESC$
18030    RET$=RI$
18040    IF A$=BK$ AND RI$<>NL$ THEN RI$=LEFT$(RI$,LEN(RI$)-1):PRINT ESC$;"[D ";ESC$;"[D";
18050    IF BAS=1 AND A$>=ESP$ THEN RI$=RI$+A$:PRINT A$;
18070    IF BAS>1 AND INSTR(17-BAS,TABHEX$,A$) THEN RI$=RI$+A$:PRINT A$;
18090    A$=INPUT$(1)
18100 WEND
18105 IF A$=ESC$ THEN ESC=1
18110 A$=RI$:GOSUB 42030:RI$=RC$:RETURN
18120 RETURN
18499 '********** CONVERTE PARA BASE ESTRANHA
18500 IF BAS=0 THEN BAS=1
18505 IF BAS=1 OR BAS=10 THEN RETURN
18510 A=VAL(A$):A$=""
18520 WHILE A>0:RS=A MOD BAS:A$=MID$(TABHEX$,16-RS,1)+A$:A=A\BAS:WEND
18525 IF A$="" THEN A$="0"
18530 RETURN
18999 '********** EXECUTA PROCURA DE FIMREP,FIMSE,FIMENQ
19000 IDT=0
19010 WHILE (CMD(PNL)<>FIM OR IDT>0) AND PNL<100
19020    IF CMD(PNL)=INI THEN IDT=IDT+1
19030    IF CMD(PNL)=FIM THEN IDT=IDT-1
19040    PNL=PNL+1
19050 WEND:PNL=PNL+1
19060 RETURN
19500 FOR X=1 TO LEN(UP$)
19510     PP$=MID$(UP$,X,1)
19520     IF PP$>="a" AND PP$<="z" THEN MID$(UP$,X,1)=CHR$(ASC(PP$)-32)
19530 NEXT X:RETURN
19600 N.PRM=N.PRM+1:PRM$(N.PRM)=LEFT$(A$,C-1):A$=MID$(A$,C+1)
19610 C=1:WHILE MID$(A$,C,1)=ESP$:C=C+1:WEND:A$=MID$(A$,C):C=0
19620 IF LEN(PRM$(N.PRM))=1 THEN PRM$(N.PRM)=CHR$(ASC(PRM$(N.PRM))+(PRM$(N.PRM)>"Z")*32)
19630 RETURN
19990 '********** ANALISE SINTATICA DA LINHA
19999 '********** RETIRA BRANCOS FINAIS E INICIAIS
20000 N.PRM=0:A$=LN$:PRM$(1)=NL$:PRM$(2)=NL$
20010 C=1:WHILE MID$(A$,C,1)=ESP$:C=C+1:WEND:A$=MID$(A$,C)
20020 C=LEN(A$)
20040 WHILE MID$(A$,C,1)=ESP$ AND C>0:C=C-1:WEND
20050 A$=LEFT$(A$,C):LN$=A$
20100 '********** ISOLA O NUMERO DA LINHA
20105 C=INSTR(A$,ESP$):NUM$=LEFT$(A$,C):A$=MID$(A$,C+1)
20110 C=1:WHILE MID$(A$,C,1)=ESP$:C=C+1:WEND:A$=MID$(A$,C)
20115 IF NUM$="" AND A$="" THEN RETURN
20120 PRINT NUM$;TAB(5+IDT*3);A$
20130 NL=VAL(NUM$):IF NL<1 OR NL>M.LN THEN ERROR 111:RETURN
20135 IF NL<=LN.ANT THEN ERROR 122:RETURN ELSE LN.ANT=NL
20140 IF MID$(A$,LEN(A$))<>";" THEN PRINT TAB(5+IDT*3);"*** ponto e virgula assumido aqui":A$=A$+";"
20200 '********** ISOLA COMANDO
20210 C=1:P=ASC(MID$(A$,C,1))
20220 WHILE P>59 OR P=42:C=C+1:P=ASC(MID$(A$,C,1)):WEND
20230 CMD$=LEFT$(A$,C-1):A$=MID$(A$,C):A$=LEFT$(A$,LEN(A$)-1)
20240 C=1:WHILE MID$(A$,C,1)=ESP$:C=C+1:WEND:A$=MID$(A$,C)
20300 '********** ISOLA PARAMETROS
20310 IF INSTR(A$,CHR$(34)) THEN GOSUB 27000
20315 PAR=0:C=1
20320 WHILE C<=LEN(A$) AND NPRM<4
20340    P$=MID$(A$,C,1)
20350    IF P$="(" THEN PAR=PAR+1
20360    IF P$=")" THEN PAR=PAR-1
20380    IF P$=ESP$ AND PAR=0 THEN GOSUB 19600
20390    C=C+1
20400 WEND
20410 IF A$<>NL$ THEN N.PRM=N.PRM+1:PRM$(N.PRM)=A$
20420 IF N.PRM>2 THEN ERROR 112:RETURN
20430 PR1$=PRM$(1):PR2$=PRM$(2)
20990 '********** IDENTIFICA COMANDO, 99=ERRO
21000 C.CMD=99:UP$=CMD$:GOSUB 19500:CMD$=UP$
21010 FOR X=1 TO M.CMD
21020   IF CMD$=CMD$(X) THEN C.CMD=X
21030 NEXT X
21040 IF C.CMD=99 THEN ERROR 114:RETURN
21050 CMD(NL)=C.CMD:PR1$(NL)=PR1$:PR2$(NL)=PR2$
21060 '********** ANALISE DE COMANDOS PARENTESIS
21100 C=C.CMD
21110 INI=-(C=21)-2*(C=24)-3*(C=29)
21120 FIM=-(C=23)-2*(C=26)-3*(C=30)
21130 IF INI THEN IDT=IDT+1:CT(IDT)=INI
21140 IF FIM THEN GOSUB 26000:IF LN.ERR THEN RETURN
21990 '********** IDENTIFICA PARAMETROS
22000 PR1=VAL(LEFT$(PR$(C.CMD),1)):PR2=VAL(RIGHT$(PR$(C.CMD),1))
22010 PR$=PR1$:PR=PR1:GOSUB 25000:IF LN.ERR THEN RETURN
22020 TIP.ANT=TIP2:PR$=PR2$:PR=PR2:GOSUB 25000
22025 IF PR1+PR2>7 AND TIP2<>TIP.ANT THEN ERROR 110
22030 RETURN
24990 '********** ANALISE DO PARAMETRO
25000 IF PR=0 AND PR$<>NL$ THEN ERROR 112:RETURN
25010 IF PR=1 OR PR=0 THEN RETURN
25020 ENT$(I)=PR$:GOSUB 41000:IF LN.ERR THEN RETURN
25030 TIP1=TP(I)
25040 I=I+1:ENT$(I)=PR$:GOSUB 40000:IF LN.ERR THEN RETURN
25050 TIP2=TP(I+1)
25060 IF PR=4 THEN RETURN
25070 IF PR=2 AND TIP2=1 THEN RETURN
25080 IF PR=3 AND TIP2=-1 THEN RETURN
25090 IF PR=5 AND TIP1=C.REGIST THEN RETURN
25110 ERROR 115:RETURN
25990 '********** ANALISE DE FIMSE,FIMENQ E FIMREP
26000 IF IDT=0 THEN ERROR 115+FIM:RETURN
26010 IF CT(IDT)<>FIM THEN ERROR 118+CT(IDT):IDT=IDT-1:GOTO 26000
26020 IDT=IDT-1:IF IDT<0 THEN IDT=0
26030 RETURN
26999 '********** TROCA "" POR ()1
27000 ASP=0
27010 WHILE INSTR(A$,CHR$(34))
27020     P=INSTR(A$,CHR$(34))
27030     IF ASP=0 THEN MID$(A$,P,1)="(" ELSE A$=LEFT$(A$,P-1)+")1"+MID$(A$,P+1)
27040     ASP=NOT ASP
27050 WEND
27060 RETURN
29999 '********** EXECUTA COMANDO
30000 IF DEPUR THEN PRINT USING "### & & &;";NL;CMD$(CMD);PR1$;PR2$
30005                ON CMD    GOSUB 30100,30200,30300,30400,30500
30010 IF CMD>5  THEN ON CMD-5  GOSUB 30600,30700,30800,30900,31000
30020 IF CMD>10 THEN ON CMD-10 GOSUB 31100,31200,31300,31400,31500
30030 IF CMD>15 THEN ON CMD-15 GOSUB 31600,31700,31800,31900,32000
30040 IF CMD>20 THEN ON CMD-20 GOSUB 32100,32200,32300,32400,32500
30050 IF CMD>25 THEN ON CMD-25 GOSUB 32600,32700,32800,32900,33000
30060 IF CMD>30 THEN ON CMD-30 GOSUB 33100,33200,33300,33400,33500,33600,33700
30080 RETURN
30099  ' COMANDO INICIO
30100 RETURN
30199  ' COMANDO FIM
30200 RETURN
30299  ' COMANDO *
30300 RETURN
30399  ' COMANDO MOVA
30400 I=I+1:ENT$(I)=PR2$:GOSUB 40000
30410 X1=ASC(PR1$):REGIST$(X1-64)=RET$(I+1):RETURN
30499  ' COMANDO TROQUE
30500 X1=ASC(PR1$)-64:X2=ASC(PR2$)-64:SWAP REGIST$(X1),REGIST$(X2):RETURN
30599  ' COMANDO SALVE
30600 I=I+1:ENT$(I)=PR1$:GOSUB 40000:X$=RET$(I+1)
30602 OPEN OU$,3,X$:FOR X=0 TO 99:WRITE#3,MEM$(X):NEXT:CLOSE#3:RETURN
30699  ' COMANDO RESTAURE
30700 I=I+1:ENT$(I)=PR1$:GOSUB 40000:X$=RET$(I+1)
30702 OPEN IN$,3,X$:FOR X=0 TO 99:LINE INPUT#3,MEM$(X):NEXT:CLOSE#3:RETURN
30799  ' COMANDO INDEFINIDO 3
30800 RETURN
30899  ' COMANDO ENTRE
30900 I=I+1:ENT$(I)=PR2$:GOSUB 40000:BAS=VAL(RET$(I+1))
30905 IF BAS=0 THEN IF PR1$>"M" THEN BAS=1 ELSE BAS=10
30910 GOSUB 18000:X1=ASC(PR1$)-64:REGIST$(X1)=RI$:PRINT:RETURN
30999  ' COMANDO ESCREVA
31000 IF PR1$<>"" THEN I=I+1:ENT$(I)=PR1$:GOSUB 40000:X1$=RET$(I+1) ELSE X1$=""
31010 I=I+1:ENT$(I)=PR1$:GOSUB 40000:BAS=VAL(RET$(I+1))
31015 IF BAS>16 THEN ERROR 107
31020 A$=X1$:GOSUB 18500:LPRINT A$;:RETURN
31099  ' COMANDO ESCREVA>
31100 IF PR1$<>"" THEN I=I+1:ENT$(I)=PR1$:GOSUB 40000:X1$=RET$(I+1) ELSE X1$=""
31110 I=I+1:ENT$(I)=PR2$:GOSUB 40000:BAS=VAL(RET$(I+1))
31115 IF BAS>16 THEN ERROR 107
31120 A$=X1$:GOSUB 18500:LPRINT A$:RETURN
31199  ' COMANDO MOSTRE
31200 I=I+1:ENT$(I)=PR2$:GOSUB 40000:X1=VAL(RET$(I+1))
31201 IF X1>16 THEN BAS=X1:ERROR 107
31205 IF PR1$<>"" THEN I=I+1:ENT$(I)=PR1$:GOSUB 40000:A$=RET$(I+1) ELSE A$=""
31210 BAS=X1:GOSUB 18500:PRINT A$;:RETURN
31299  ' COMANDO MOSTRE>
31300 I=I+1:ENT$(I)=PR2$:GOSUB 40000:X1=VAL(RET$(I+1))
31301 IF X1>16 THEN BAS=X1:ERROR 107
31305 IF PR1$<>"" THEN I=I+1:ENT$(I)=PR1$:GOSUB 40000:A$=RET$(I+1) ELSE PR1$=""
31310 BAS=X1:GOSUB 18500:PRINT A$:RETURN
31399  ' COMANDO SALTE
31400 LPRINT CHR$(12);:RETURN
31499  ' COMANDO APAGUE
31500 CLS:RETURN
31599  ' COMANDO CURSOR
31600 I=I+1:ENT$(I)=PR1$:GOSUB 40000:X1=VAL(RET$(I+1))
31610 I=I+1:ENT$(I)=PR2$:GOSUB 40000:X2=VAL(RET$(I+1))
31620 LOCATE X1,X2:RETURN
31699  ' COMANDO PAUSA
31700 I=I+1:ENT$(I)=PR1$:GOSUB 40000:X1=VAL(RET$(I+1))
31710 FOR X!=1 TO X1*1000:NEXT:RETURN
31799  ' COMANDO BIP
31800 BEEP:RETURN
31899  ' COMANDO PARE
31900 PNL=M.LN+1:RETURN
31999  ' COMANDO DESVIE
32000 I=I+1:ENT$(I)=PR1$:GOSUB 40000:X1=VAL(RET$(I+1))
32010 IF X1<1 OR X1>M.LN THEN ERROR 108
32020 PNL=X1:RETURN
32099  ' COMANDO SE
32100 I=I+1:ENT$(I)=PR1$:GOSUB 40000:X1=VAL(RET$(I+1))
32110 IF X1=0 THEN INI=21:FIM=23:GOSUB 19000:RETURN
32120 RETURN
32199  ' COMANDO INDEFINIDO 4
32200 RETURN
32299  ' COMANDO FIMSE
32300 RETURN
32399  ' COMANDO ENQUANTO
32400 I=I+1:ENT$(I)=PR1$:GOSUB 40000:X1=VAL(RET$(I+1))
32410 IF X1=0 THEN INI=24:FIM=26:GOSUB 19000:RETURN
32420 ENQ=ENQ+1:ENQ$(ENQ)=PR1$:ENQ(ENQ)=PNL:RETURN
32499  ' COMANDO INDEFINIDO 5
32500 RETURN
32599  ' COMANDO FIMENQ
32600 IF ENQ=0 THEN ERROR 120
32605 I=I+1:ENT$(I)=ENQ$(ENQ):GOSUB 40000:X1=VAL(RET$(I+1))
32610 IF X1>0 THEN PNL=ENQ(ENQ):RETURN
32620 ENQ=ENQ-1:RETURN
32699  ' COMANDO CHAME
32700 I=I+1:ENT$(I)=PR1$:GOSUB 40000:X1=VAL(RET$(I+1))
32710 IF X1<1 OR X1>M.LN THEN ERROR 108
32720 CHA=CHA+1:CHA(CHA)=PNL:PNL=X1:RETURN
32799  ' COMANDO RETORNE
32800 IF CHA=0 THEN ERROR 109
32810 PNL=CHA(CHA):CHA=CHA-1:RETURN
32899  ' COMANDO REPITA
32900 I=I+1:ENT$(I)=PR1$:GOSUB 40000:X1=VAL(RET$(I+1))
32905 IF X1=0 THEN INI=29:FIM=30:GOSUB 19000:RETURN
32910 REP=REP+1:REPC(REP)=X1:REP(REP)=PNL:RETURN
32999  ' COMANDO FIMREP
33000 IF REP=0 THEN ERROR 118
33010 REPC(REP)=REPC(REP)-1:IF REPC(REP)>0 THEN PNL=REP(REP):RETURN
33020 REP=REP-1:RETURN
33099  ' COMANDO ABRASAI
33100 I=I+1:ENT$(I)=PR1$:GOSUB 40000:X$=RET$(I+1)
33110 OPEN OU$,3,X$:RETURN
33199  ' COMANDO ABRAENT
33200 I=I+1:ENT$(I)=PR1$:GOSUB 40000:X$=RET$(I+1)
33210 OPEN IN$,3,X$:RETURN
33299  ' COMANDO FECHE
33300 CLOSE#3:RETURN
33399  ' COMANDO LEIA
33400 LINE INPUT #3,X$
33410 X1=ASC(PR1$)-64:REGIST$(X1)=X$:RETURN
33499  ' COMANDO GRAVE
33500 I=I+1:ENT$(I)=PR1$:GOSUB 40000:X$=RET$(I+1)
33510 PRINT#3,X$:RETURN
33599  ' COMANDO PONHA
33600 I=I+1:ENT$(I)=PR1$:GOSUB 40000:XXXX$=RET$(I+1)
33610 I=I+1:ENT$(I)=PR2$:GOSUB 40000:X1=VAL(RET$(I+1))
33615 IF X1>99 THEN ERROR 124
33620 MEM$(X1)=XXXX$:RETURN
33699  ' COMANDO PEGUE
33700 X1=ASC(PR1$)-64
33710 I=I+1:ENT$(I)=PR2$:GOSUB 40000:X2=VAL(RET$(I+1))
33720 IF X2>99 THEN ERROR 124
33730 REGIST$(X1)=MEM$(X2):RETURN
39990 '********** AVALIA EXPRESSAO (RECURSIVA)
40000 GOSUB 41000:'**********AVALIA SINTAXE
40010 IF TP(I)=C.CST THEN GOSUB 42000:RET$(I)=RC$:I=I-1:RETURN
40020 IF TP(I)=C.REGIST THEN GOSUB 43000:RET$(I)=RR$:I=I-1:RETURN
40030 IF TP(I)<199   THEN GOSUB 40100:RETURN
40040 IF TP(I)<255   THEN GOSUB 40200:RETURN
40050 ERROR 101
40090 '********** FUNCAO
40100 I=I+1:ENT$(I)=P1$(I-1):GOSUB 40000
40110 P1$(I)=RET$(I+1):GOSUB 45000:RET$(I)=RP$:I=I-1:RETURN
40190 '********** OPERADOR
40200 I=I+1:ENT$(I)=P1$(I-1):GOSUB 40000:P1$(I)=RET$(I+1):TP(I)=TP(I)*TP(I+1)
40220 I=I+1:ENT$(I)=P2$(I-1):GOSUB 40000:P2$(I)=RET$(I+1)
40230 IF SGN(TP(I))<>TP(I+1) THEN ERROR 110
40240 GOSUB 47000:RET$(I)=RP$:I=I-1:RETURN
40990 '********** AVALIA SINTAXE
41000 A$=ENT$(I)
41010 IF LEN(A$)=1 AND VAL(A$)=0 AND A$<>"0" THEN TP(I)=2:ENT$(I)=CHR$(ASC(ENT$(I))+(ENT$(I)>"Z")*32):RETURN
41025 FOR XX=1 TO 6:B$=MID$(OPER$,XX*2-1,2):PAR=0
41030   FOR X=LEN(A$) TO 1 STEP -1:P$=MID$(A$,X,1)
41050     IF P$="(" THEN PAR=PAR+1
41060     IF P$=")" THEN PAR=PAR-1
41080     IF INSTR(B$,P$) AND PAR=0 THEN 41500
41090   NEXT X:IF PAR<>0 THEN ERROR 105
41105 NEXT XX
41110 P$=MID$(A$,1,1):PAR=0
41120 IF P$<>"(" THEN P1$(I)=A$:P2$(I)="10":TP(I)=1:RETURN
41130 FOR X=1 TO LEN(A$):P$=MID$(A$,X,1)
41140   IF P$="(" THEN PAR=PAR+1  
41160   IF P$=")" THEN PAR=PAR-1  
41170   IF P$=")" AND PAR=0 THEN 41200 
41180 NEXT X:ERROR 105
41200 P1$(I)=MID$(A$,2,X-2):P2$(I)=MID$(A$,X+1)
41220 IF VAL(P2$(I))>0 AND VAL(P2$(I))<17 THEN TP(I)=1:RETURN
41230 IF VAL(P2$(I))>16 THEN ERROR 107
41235 IF P2$(I)=NL$ THEN TP(I)=100:RETURN
41250 UP$=P2$(I):GOSUB 19500:FUN$=UP$:X=INSTR(FUNC$,FUN$):IF X=0 THEN ERROR 108
41260 TP(I)=(X-1)\9+1:IF (X MOD 9<>1)AND X>0 THEN ERROR 108
41270 TP(I)=100+TP(I):RETURN
41500 K=INSTR(OPER$,P$):P1$(I)=LEFT$(A$,X-1):P2$(I)=MID$(A$,X+1)
41530 TP(I)=200+K:RETURN
41990 '********** AVALIA CONSTANTE
42000 A$=P1$(I):BAS=VAL(P2$(I))
42030 VALOR=0:DIG=-1:IF BAS=1 THEN RC$=A$:TP(I)=-1:RETURN
42070 FOR X=LEN(A$) TO 1 STEP -1:DIG=DIG+1:P$=MID$(A$,X,1)
42080   IF INSTR(17-BAS,TABHEX$,P$)=0 THEN ERROR 103:GOTO 42120
42100   Y=16-INSTR(TABHEX$,P$):VALOR=VALOR+Y*BAS^DIG
42101   IF VALOR>MAX THEN ERROR 6:GOTO 42120
42110 NEXT X
42120 RC$=FNS$(VALOR):TP(I)=1:RETURN
42990 '********** AVALIA REGISTISTRADOR
43000 X=ASC(ENT$(I)):IF X<65 OR X>90 THEN ERROR 102:RETURN
43010 IF X-64>12 THEN TP(I)=-1 ELSE TP(I)=1
43020 RR$=REGIST$(X-64):RETURN
44990 '********** CALCULA FUNCAO
45000 P1$=P1$(I):P2$=P2$(I):TP=TP(I+1)
45005 ON TP(I)-99 GOSUB 45100,45110,45120,45130,45140,45150,45160,45170,45180,45190,45200
45010 RETURN
45100 RP$=P1$:TP(I)=TP:RETURN
45110 IF TP=-1 THEN ERROR 110
45115 RP$=FNS$(INT(SQR(VAL(P1$)))):TP(I)=1:RETURN
45120 IF TP=1  THEN ERROR 110
45125 RP$=CHR$(15)+P1$+CHR$(14):TP(I)=-1:RETURN
45130 IF TP=-1 THEN ERROR 110
45135 RP$=CHR$(VAL(P1$)):TP(I)=-1:RETURN
45140 IF TP=1  THEN ERROR 110
45145 RP$=FNS$(INT(ASC(P1$))):TP(I)=1:RETURN
45150 TP(I)=1:IF CMP=1 THEN RETURN
45151 IF EOF(3) THEN RP$="1" ELSE RP$="0"
45155 RETURN
45160 IF TP=-1 THEN ERROR 110
45165 RP$=FNS$(INT(RND(1)*VAL(P1$))+1):TP(I)=1:RETURN
45170 IF TP=1  THEN ERROR 110
45175 RP$=FNS$(LEN(P1$)):TP(I)=1:RETURN
45180 IF TP=1 THEN ERROR 110
45185 RP$=LEFT$(P1$,1):TP(I)=-1:RETURN
45190 IF TP=1 THEN ERROR 110
45195 RP$=MID$(P1$,2):TP(I)=-1:RETURN
45200 TP(I)=1:IF CMP=1 THEN RETURN
45203 OPEN RN$,2,P1$:RP$=FNS$(LOF(2)):CLOSE#2
45206 IF VAL(RP$)=0 THEN KILL P1$
45208 RETURN
46990 '********** CALCULA OPERADOR (OPERA)
47000 P1$=P1$(I):P2$=P2$(I):TP=SGN(TP(I)):TP(I)=ABS(TP(I))
47002 ON TP(I)-200 GOSUB 47210,47220,47230,47240,47250,47260
47005 IF TP(I)>206 THEN ON TP(I)-206 GOSUB 47110,47120,47130,47140,47150,47160
47010 IF TP(I)<212 AND VAL(RP$)>MAX THEN ERROR 6
47020 IF TP(I)<212 AND VAL(RP$)<0 THEN ERROR 123
47030 RETURN
47110 IF TP=-1 THEN ERROR 110
47115 RP$=FNS$(VAL(P1$)+VAL(P2$)):TP(I)=1:RETURN
47120 IF TP=-1 THEN ERROR 110
47125 RP$=FNS$(VAL(P1$)-VAL(P2$)):TP(I)=1:RETURN
47130 IF TP=-1 THEN ERROR 110
47135 RP$=FNS$(VAL(P1$)*VAL(P2$)):TP(I)=1:RETURN
47140 IF TP=-1 THEN ERROR 110
47145 RP$=FNS$(VAL(P1$)/VAL(P2$)):TP(I)=1:RETURN
47150 IF TP=-1 THEN ERROR 110
47155 RP$=FNS$(VAL(P1$)^VAL(P2$)):TP(I)=1:RETURN
47160 IF TP= 1 THEN ERROR 110
47165 RP$=P1$+P2$:TP(I)=-1:RETURN
47210 IF TP=-1 THEN ERROR 110
47215 RP$=FNS$(VAL(P1$) OR VAL(P2$)):TP(I)=1:RETURN
47220 IF TP=-1 THEN ERROR 110
47225 RP$=FNS$(VAL(P1$) AND VAL(P2$)):TP(I)=1:RETURN
47230 RP$=FNS$(P1$=P2$):TP(I)=1:RETURN
47240 RP$=FNS$(P1$<>P2$):TP(I)=1:RETURN
47250 IF TP=-1 THEN RP$=FNS$(P1$>P2$):TP(I)=1:RETURN
47255 RP$=FNS$(VAL(P1$)>VAL(P2$)):TP(I)=1:RETURN
47260 IF TP=-1 THEN RP$=FNS$(P1$<P2$):TP(I)=1:RETURN
47265 RP$=FNS$(VAL(P1$)<VAL(P2$)):TP(I)=1:RETURN
62990 '********** ROTINA DE ERRO
63000 E$=CHR$(ERR):IF INSTR(OK$,E$) AND CMP=1 THEN RESUME NEXT
63009 E=ERR:N.ERR=N.ERR+1:LN.ERR=1:PRINT TAB(5+IDT*3);"*** erro *** ";
63010 IF CMP=0 OR E<100 THEN PRINT "fatal *** ";
63020 GET #1,E:PRINT ER$
63440 IF CMP=1 AND E>99 THEN RESUME NEXT
64000 GET 1,129:ULT=VAL(ER$):IF E>0 THEN GET 1,E ELSE LSET ER$=""
64050 REGIST=(ULT MOD 128)+130
64055 ARQ$=MID$(ARQ$,2):P=INSTR(ARQ$,"."):IF P>0 THEN ARQ$=LEFT$(ARQ$,P-1)
64060 LSET ER$=FNS$(N.ERR)+" "+FNS$(N.CMD)+" "+ARQ$+" "+ER$
64070 PUT 1,REGIST
64080 LSET ER$=STR$(ULT+1):PUT 1,129
64090 PRINT:PRINT "Final de execucao"
64100 END

Splitting logs with PowerShell


I did some work to aggregate some logs from a group of servers for the whole month of February. This took a while, but I ended up with a nice CSV file that I was ready to load into Excel to create some Pivot Tables. See more on Pivot Tables at: Using PowerShell and Excel PivotTables to understand the files on your disk.

However, when I tried to load the CSV file into Excel, I got one of the messages I hate the most: “File not loaded completely”. That means the file I was loading had more than one million rows, which means it cannot be loaded into a single spreadsheet. Bummer… Looking at the partially loaded file in Excel, I figured I had about 80% of everything in the one million rows that did load.

Now I had to split the log file into two files, but I wanted to do it in a way that made sense for my analysis. The first column in the CSV file was actually the date (although the data was not perfectly sorted by date). So it occurred to me that it was simple enough to write a PowerShell script to do the job, instead of trying to reprocess all that data again in two batches.

 

In the end, since it was all February data and the date was in the mm/dd/yyyy format, I could just split each line by “/” and take the second item (the day); the string Split method handles that. I also needed to convert that item to an integer, since a string comparison would not work (using the string type, “22” is less than “3”). I also had to add an encoding option to my Out-File cmdlet. This preserved the log’s original format, avoided doubling the size of the resulting file and kept Excel happy.

Here is what I used to split the log into two files (one with data up to 02/14/15 and the other with the rest of the month):

Type .\server.csv |
? { ([int] $_.Split("/")[1]) -lt 15 } |
Out-File .\server1.csv -Encoding utf8
Type .\server.csv |
? { ([int] $_.Split("/")[1]) -ge 15 } |
Out-File .\server2.csv -Encoding utf8

That worked well, but I lost the first line of the log with the column headers. It would be simple enough to edit the files with Notepad (which is surprisingly capable of handling very large log files), but at this point I was trying to find a way to do the whole thing using just PowerShell. The solution was to introduce a line counter variable to add to the filter:

$l=0; type .\server.csv |
? { ($l++ -eq 0) -or ( ([int] $_.Split("/")[1]) -lt 15 ) } |
Out-File .\server1.csv -Encoding utf8
$l=0; type .\server.csv |
? { ($l++ -eq 0) -or ( ([int] $_.Split("/")[1]) -ge 15 ) } |
Out-File .\server2.csv -Encoding utf8

PowerShell was actually quick to process the large CSV file and the resulting files worked fine with Excel. In case you’re wondering, you could easily adapt the filter to use full dates. You would split by the comma separator (instead of “/”) and you would use the datetime type instead of int. I imagine that the more complex data type would probably take a little longer, but I did not measure it. The filter would look like this:

$l=0; type .\server.csv |
? { ($l++ -eq 0) -or ([datetime] $_.Split(",")[0] -gt [datetime] "02/15/2016")  } |
Out-File .\server1.csv -Encoding utf8
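
As an aside, if the log has a proper header row, Import-Csv and Export-Csv can take care of the header and the parsing for you. Here is a short sketch, assuming the date column is literally named Date (adjust to your header) and keeping in mind that Export-Csv re-quotes fields, so the output is not byte-for-byte identical to the original log:

Import-Csv .\server.csv |
? { [datetime] $_.Date -lt [datetime] "02/15/2016" } |
Export-Csv .\server1.csv -NoTypeInformation -Encoding UTF8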

Now let me get back to my Pivot Tables…

Build 2016 videos related to OneDrive


Here is an unofficial list of Build 2016 videos that are related to OneDrive:

Bonus Azure Storage session:

For a full list of Build sessions, check https://channel9.msdn.com/Events/Build/2016


Visuality Systems and Microsoft expand SMB collaboration to storage systems


Last week, Microsoft and Visuality Systems announced an expanded collaboration on SMB. Visuality is well known for their work supporting the SMB protocol in embedded devices. If you own a printer or scanner that supports the SMB protocol, there’s a good chance that device is running Visuality’s software. Visuality is now expanding into the storage device market.

This new Visuality product offers an SMB implementation that will be appealing to anyone working on a non-Windows device that offers storage, but wants to avoid spending time and effort building their own SMB protocol stack. This could be useful for a wide range of projects, from a small network attached storage device to a large enterprise storage array. Visuality’s SMB implementation includes everything a developer needs to interact with other devices running any version of the SMB protocol, including SMB3.

But why is SMB so important? Well, it’s one of the most widely adopted file protocols and the recent SMB3 version is very fast and reliable. SMB3 is popular on the client side, with clients included in Windows (Windows 8 or later), Mac OS X (version 10.10 Yosemite or later) and Linux. Beyond the traditional file server scenarios, SMB3 is now also used in virtualization (Hyper-V over SMB) and databases (SQL Server over SMB), with server implementations in Windows Server (2012 or later), NetApp (Data ONTAP 8.2 or later), EMC (VNX, Isilon OneFS 7.1.1 or later) and Samba (version 4.1 or later), just to mention a few.

For a detailed description of the SMB protocol, including the SMB3 version, check out the SNIA Tutorial on the subject, available from http://www.snia.org/sites/default/files/TomTalpey_SMB3_Remote_File_Protocol-fast15final-revision.pdf.

Read more about the Microsoft/Visuality partnership at http://news.microsoft.com/2016/04/11/visuality-systems-and-microsoft-expand-server-message-block-collaboration-to-storage-systems/. You can also get details on the Visuality NQ products at http://www.visualitynq.com/.

SNIA’s SDC 2016: Public slides and live streaming for Storage Developer Conference


SNIA’s Storage Developer Conference (SDC 2016) is happening this week in Santa Clara, CA.
This developer-focused conference covers several storage topics like Cloud, File Systems, NVM, Storage Management, and more.
You can see the agenda at http://www.snia.org/events/storage-developer/agenda/2016


 

However, there are a few things happening differently this time around.
First, most of the slides are available immediately. SNIA used to wait a few months before publishing them publicly.
This year you can find the PDF files available right now at http://www.snia.org/events/storage-developer/presentations16

SNIA is also offering the option to watch some of the talks live via YouTube.
This Tuesday (9/20) and Wednesday (9/21), they will be streaming from 9AM to 12PM (Pacific time).
You can watch them at SNIA’s channel at https://www.youtube.com/user/SNIAVideo

One thing hasn’t changed: there are many great talks on the hottest storage topics for developers.
Here is a list of the presentations including Microsoft Engineers as presenters.

PowerShell script organizes pictures in your OneDrive camera roll folder


I just published a new PowerShell script that organizes pictures in your OneDrive camera roll folder. It creates folders named after the year and month, then moves picture files to them. Existing files will be renamed in case of conflict. Empty folders left behind after the files are moved will be removed.

 It defaults to your OneDrive camera roll folder, but you can use a parameter to specify another folder. There are also parameters to skip confirmation, skip existing files in case of conflict and avoid removing empty folders at the end.

 

*** IMPORTANT NOTE ***
This script will reorganize all the files at the given folder and all subfolders.
Files will be moved to a folder named after the year and month the file was last written.
This operation cannot be easily undone. Use with extreme caution.

 

You can download the script from the TechNet Gallery at
https://gallery.technet.microsoft.com/Organize-pictures-in-your-4bafd2c0
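
If you are curious about the core idea, here is a minimal sketch of the move logic (my simplified illustration, not the published script). The default path and the yyyy-MM folder name format are my assumptions; the real script adds confirmation, conflict renaming and empty-folder cleanup.

param([string]$Root = "$env:UserProfile\OneDrive\Pictures\Camera Roll")

# Move every file under $Root into a subfolder named after the year and month
# it was last written, skipping files that are already in place.
$files = Get-ChildItem $Root -Recurse -File
foreach ($file in $files) {
    $target = Join-Path $Root ('{0:yyyy}-{0:MM}' -f $file.LastWriteTime)
    if (-not (Test-Path $target)) {
        New-Item $target -ItemType Directory | Out-Null
    }
    if (-not (Test-Path (Join-Path $target $file.Name))) {
        Move-Item $file.FullName -Destination $target
    }
}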

 

