>> So, our project is to develop a domain-specific language
for the class of applications
comprising ML and data processing
on constrained IoT devices.
So, just to give some background:
I had started an industrial IoT startup,
and this is actually my monitoring device.
It has Wi-Fi and does on-board sensing,
communication, and data processing;
everything is done on the device.
It had a programmable rule engine and data aggregation.
So, this was actually my starting point
for working on the language aspects
of IoT data-processing and ML applications.
So, the problem and the motivation.
Edge-computing applications are becoming
increasingly popular
on constrained IoT devices.
A prime example is the Interactive Cane application
developed by Microsoft Research right here.
You can see the device developed by MSR.
Basically, it allows visually challenged people to
interact with their phone
via gestures with their walking stick,
and the ML is actually implemented on the device,
which only has a few kilobytes of RAM.
So, constrained embedded applications
have limited resources.
There is no resource virtualization, and
energy and timing are important considerations.
Our motivation is the lack of
current programming tools to support development of
modular data pipelines with
compile-time feedback to
the programmer on resource usage.
As Alan Perlis famously said,
"Programmers know the value of
everything and the cost of nothing."
So, we want to make the programmer aware of the cost:
the memory, compute, and energy overheads
of running their application.
So, there are three overarching objectives.
One is to provide compile-time feedback
on power, timing, and resource usage,
obviously through static analysis.
Second, to enable optimizations not supported or
unlikely to be done by standard compilers.
Third, which is important for a programmer:
clarity, concision, and programmer convenience.
Actually, as of now we have ported
the entire Interactive Cane application
written by Microsoft Research.
The original application is written in C++.
So, we have
a fully working ported application in our DSL,
which we will demo at the end, and
the DSL version is about one-fourth the code size.
That is important for
program maintenance, debugging, and so on.
Also, as work in progress,
a short paper on the language has been accepted at
the ACM SIGBED EMSOFT conference.
So, what is the challenge in static analysis of power,
timing, and resource usage in embedded systems?
We need to be able to reason about I/O.
Of course, we can analyze a program and
compute the worst-case execution time of computations,
but we also need to be able to reason about I/O,
and it's impractical to integrate
a full hardware model into the analysis.
Also, there is a need to use
low-level C/C++ libraries for hardware access,
so interoperability with existing libraries
is very important for the language to be adopted.
So, we have come up with this concept
of effect specifications.
The effects of library calls
can be made explicit through
effect specifications, and we
will show some code snippets.
We can annotate different kinds of effects:
resource acquisition,
which is useful for analyzing resource conflicts
(we'll also give an example of how it can
be used to analyze security and privacy);
CPU time consumption; power state changes;
reads/writes on memory objects,
which is useful for memory optimization,
since we then know whether
an object being passed to library code
is getting read or updated by
the library code; and timed events.
So, this is an example of our effect specifications for
a few Arduino library function calls,
because the iCane application actually uses the Arduino library.
The first statement here
assigns the effect.
It means that if we call pinPeripheral n,
which is an Arduino library function,
it has the effect of acquiring the pin resource.
Pin here is actually
the resource capability class, which can be refined;
we can, for example, refine it with specific operations.
So, this means that the particular
module which invokes pinPeripheral
actually acquires the pin capability
with resource address n.
Similarly for Serial.println.
So, there are different kinds of capabilities here.
Another is CPU time.
If we have polled or programmed I/O
that consumes CPU cycles,
that effect can be specified like this.
And this one has a memory effect:
getFIFOBytes actually updates the buf memory object,
along with a CPU-cycle effect and a power state change.
So, these kinds of side effects
of library function calls can
be made explicit through our effect specifications.
So, that is one of our key contributions.
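As a rough illustration of what an analyzer could consume, here is a hedged C++ sketch of effect specifications represented as data. The DSL's actual annotation syntax is not shown here, and the entries and cycle counts below are illustrative placeholders, not measured values.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch: one way an analyzer might represent effect
// specifications for library calls. The DSL's real syntax differs.
enum class EffectKind { AcquireResource, CpuCycles, PowerStateChange, MemWrite, TimedEvent };

struct Effect {
    EffectKind kind;
    std::string detail;    // e.g. resource capability ("pin") or memory object ("buf")
    uint32_t cycles = 0;   // worst-case CPU cycles, when relevant
};

struct EffectSpec {
    std::string callPattern;       // library call to pattern-match, e.g. "pinPeripheral"
    std::vector<Effect> effects;   // side effects the call is declared to have
};

// Entries mirroring the Arduino calls discussed in the talk;
// the cycle counts are placeholders, not measured values.
const std::vector<EffectSpec> kSpecs = {
    {"pinPeripheral", {{EffectKind::AcquireResource, "pin"}}},
    {"Serial.println", {{EffectKind::AcquireResource, "serial"},
                        {EffectKind::CpuCycles, "", 1200}}},
    {"getFIFOBytes", {{EffectKind::MemWrite, "buf"},
                      {EffectKind::CpuCycles, "", 800}}},
};
```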
The second is to do optimizations
that a traditional compiler may not do.
One of the things we've done is
a compiler-managed heap.
If you want to reuse memory in a traditional program,
you have to do garbage collection
or explicit memory management.
But here, the compiler
computes the lifetimes of objects like
matrices and vectors and
reuses their memory.
As Shriram said, this is similar to register allocation,
but the difference is that a register is
an atomic object, whereas
an array is not an atomic object:
a part of the array could be reused
by one object,
and the rest of the array block
could be reused by some other object, and so on.
An interesting issue we
encountered here was that, beyond
the internal and external fragmentation
of traditional memory allocation,
we also encountered lifetime fragmentation.
For example, if I try to reuse
memory across two objects having different lifetimes,
and the lifetimes are far apart,
we're actually losing some amount
of time when the buffer is free,
not used by anybody.
So, we want to reduce
lifetime fragmentation as well as internal fragmentation.
The second optimization is operation-aware allocation.
For example, this is a [inaudible] operation,
concatenation of two vectors.
If A concat B occurs,
then A and B can
be allocated in adjacent memory,
and the concatenation operation can be eliminated.
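A minimal conceptual sketch of that adjacency idea, in plain C++ with illustrative sizes (the DSL compiler would do this placement automatically; this is hand-written only to show the effect):

```cpp
// If the compiler places A and B back to back in its managed heap,
// "A concat B" needs no copy: the result is just a view over both blocks.
float heap[8];          // compiler-managed region (sizes illustrative)
float* A  = &heap[0];   // A: 3 elements
float* B  = &heap[3];   // B: 5 elements, allocated adjacent to A
float* AB = A;          // concat(A, B): 8 contiguous elements, zero copies
```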
Then there is fusion of matrix/vector loops, and in-place update.
For in-place update, for example,
let us say we are casting
an integer array to a float array.
The alignment and element sizes are the same,
so we can do the update in place,
again reducing memory usage; see the sketch below.
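Here is a hedged sketch of such an in-place int-to-float cast in plain C++ (a hypothetical helper, not the compiler's actual generated code). Forward iteration is safe because each slot is read before it is overwritten, and int32_t and float share size and alignment on the targets discussed:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Convert an int32_t array to float in the same buffer, element by element.
void cast_i32_to_f32_inplace(void* buf, std::size_t n) {
    int32_t* src = static_cast<int32_t*>(buf);
    float*   dst = static_cast<float*>(buf);
    for (std::size_t i = 0; i < n; ++i) {
        float v = static_cast<float>(src[i]);  // read the int first
        std::memcpy(&dst[i], &v, sizeof v);    // then overwrite the same slot
    }
}
```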
Then finally, we generate an efficient state machine
that automatically enters CPU sleep mode when it is idle.
So basically, a traditional static allocator
allocates along a linear address line.
Our allocator instead works in
a two-dimensional space where the y-axis is the address line,
and the x-axis is the pipeline lifetime.
We want to fill this entire area.
So, basically we optimize
by filling this entire area,
and here you can see that these two buffers
actually share
the same address region
because they have different lifetimes.
This is actually the memory layout output
from our IDE; it's automatically generated.
>> Question. Is lifetime based on the method in which-
>> Yes. So, basically we do a whole program analysis.
>> Is it like a form of
escape analysis that you are doing?
>> No. So, basically, for the pipeline applications,
we have a cyclic execution:
the same pipeline runs repeatedly, right?
So, if you take one execution of the pipeline,
the lifetime is the duration
in the program execution during which that variable is active,
either defined or used.
Similarly, we have a very
basic worst-case execution time analysis.
The pure-software part of the analysis is
pretty simple, because the CPU is pretty simple:
there is no caching, and those kinds
of advanced features are not there.
But we have a lot of I/O.
So, basically we incorporate I/O into
our timing analysis by using our annotations.
For example, the MPU acquisition does blocking I/O,
and because of our annotation we can
actually capture that in our analysis.
So, we know that mpu_acq
actually takes this many CPU cycles.
This is again the output from our IDE.
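In outline, folding the annotations into the timing analysis is just an accumulation over annotated costs. A self-contained, hedged sketch follows; the names and the idea of a flat per-call lookup are illustrative, not the analyzer's real code:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// WCET of one pipeline stage = bound on pure computation plus the
// annotated worst-case cycles of each blocking library call it makes.
uint64_t stageWcet(uint64_t computeCycles,
                   const std::vector<std::string>& libraryCalls,
                   const std::map<std::string, uint64_t>& annotatedCycles) {
    uint64_t total = computeCycles;
    for (const auto& call : libraryCalls) {
        auto it = annotatedCycles.find(call);
        if (it != annotatedCycles.end()) total += it->second;  // e.g. "mpu_acq"
    }
    return total;
}
```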
>> Can I ask you a question?
>> Yes.
>> Do you do this kind of analysis modularly, or
do you just inline everything and
do this as one method finally?
>> Yeah. We do whole program analysis, non-modular.
>> You inline everything and you just
then analyze it as one method?
>> Yeah. Actually, it's done in a modular fashion.
So, we don't really in-line anything.
>> So, for every method you compute an effect?
>> Yes.
>> Each method has something like
that, and maybe you just chain them up?
>> Yeah, exactly. So, it's done in a modular fashion.
But actually we follow the chain of executions.
So, we compute in a modular fashion
and then compute the aggregate.
Last, but not least,
is clarity and programmer convenience.
We have built-in
matrix/vector operations with dimensions;
dimensions are integrated into the type system.
So, if I try to multiply
two matrices which do not have compatible dimensions,
we can give a compile-time error.
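A hedged C++ analogue of that dimension check, using template parameters so that incompatible multiplications fail at compile time (the DSL bakes this into its type system natively; this sketch only illustrates the idea):

```cpp
#include <array>
#include <cstddef>

template <std::size_t R, std::size_t C>
struct Mat { std::array<float, R * C> a{}; };

// Multiplication is only defined when the inner dimensions agree.
template <std::size_t R, std::size_t K, std::size_t C>
Mat<R, C> mul(const Mat<R, K>& x, const Mat<K, C>& y) {
    Mat<R, C> z;
    for (std::size_t i = 0; i < R; ++i)
        for (std::size_t j = 0; j < C; ++j)
            for (std::size_t k = 0; k < K; ++k)
                z.a[i * C + j] += x.a[i * K + k] * y.a[k * C + j];
    return z;
}

// mul(Mat<2, 3>{}, Mat<4, 5>{});  // compile-time error: 3 != 4
```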
We also have a built-in circular vector.
Circular vectors are usually used in
data acquisition applications because we want to keep
the latest n elements.
It can be used like a linear vector by
the programmer. And we have modular pipeline stages.
So, I'll show you some code snippets
with declarative pipeline specification,
separation of the error-handling policy from the main code,
and a library
of pre-defined pipeline stages;
we'll show you a few examples.
Also interoperability with C/C++ libraries,
and we have actually built a browser-based IDE.
We call the individual modules flow modules.
So, there is a flow routine;
the acquisition routine doesn't take any input.
Once a stage is completed,
it can pass the data
to the next stage by invoking next.
So, the stages are chained asynchronously,
and if a module doesn't want to pass data on,
it doesn't need to call next.
The module can
control whether the data flows to the next stage,
or whether execution should
restart at the beginning of the pipeline.
Then there is the next stage.
The passed data flows to the next stage,
which processes it, and so on.
The iCane application can then be
declared as a pipeline specification
in our language.
You can see the different stages,
and these are pre-built reusable modules.
So only if the data has changed from the past
will the next stage be executed.
We can also limit the rate, as in the sketch below.
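A rough C++ rendering of the flow-module idea (hypothetical; the DSL's actual syntax differs): each module receives a value plus a next() callback and decides whether data flows downstream. distinct() and rateLimit() mirror the pre-built stages just mentioned.

```cpp
#include <cstdint>
#include <functional>

using Next   = std::function<void(int)>;
using Module = std::function<void(int, const Next&)>;

// Pass the value downstream only if it changed since the last call.
Module distinct() {
    return [last = 0, first = true](int v, const Next& next) mutable {
        if (first || v != last) { first = false; last = v; next(v); }
    };
}

// Pass at most one item per periodMs milliseconds; nowMs is a clock
// supplied by the caller (e.g. wrapping Arduino's millis()).
Module rateLimit(uint32_t periodMs, std::function<uint32_t()> nowMs) {
    return [=, lastEmit = uint32_t{0}](int v, const Next& next) mutable {
        if (nowMs() - lastEmit >= periodMs) { lastEmit = nowMs(); next(v); }
    };
}
```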
So we can do
some very simple privacy-style analysis.
For example, this is our
only module which acquires the sensor port;
we know that from the effect specification.
And this is our only module which acquires the data port,
which means it is the only module that transmits.
Here, we have the rate limit in between,
so it acts like an information barrier
from acquisition to output.
Even treating the individual blocks
as black boxes, without really going into any of them,
we know that the sensor data
can flow to the output at a rate
of only one data item per two seconds.
So we can make some very simple
privacy statements from the specification.
Here's a code snippet of our ring buffer.
There's a ring buffer, and we can push new values.
We can multiply a matrix with the ring buffer;
we can just use it as a normal linear vector.
And if we access element zero,
we always get the oldest value.
So it's very convenient for the programmer.
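A minimal C++ sketch of a circular vector with that behavior (a hypothetical rendering; the DSL builds this in): push() overwrites the oldest element once the buffer is full, and indexing presents a linear view where index 0 is always the oldest value.

```cpp
#include <cstddef>
#include <vector>

template <typename T>
class CircularVector {
    std::vector<T> data_;
    std::size_t head_ = 0, count_ = 0;  // head_: slot of the oldest element
public:
    explicit CircularVector(std::size_t capacity) : data_(capacity) {}

    void push(const T& v) {
        if (count_ < data_.size()) {            // not yet full: append
            data_[(head_ + count_) % data_.size()] = v;
            ++count_;
        } else {                                // full: overwrite oldest
            data_[head_] = v;
            head_ = (head_ + 1) % data_.size();
        }
    }

    // Linear view: [0] is the oldest value, [size()-1] the newest.
    const T& operator[](std::size_t i) const {
        return data_[(head_ + i) % data_.size()];
    }

    std::size_t size() const { return count_; }
};
```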
This is our workshop outcome.
Actually, when we came here, we just had a concept;
we had not done any development.
So we actually built the entire thing here,
including developing the grammar.
What we first did was, since we
had an idea of our language,
we ported the iCane application into our syntax first.
That informed our programming-language development,
and then we defined the grammar.
So all the green blocks you see here
have actually been built during the workshop.
Rahul [inaudible] has developed a DSL
which automatically converts floating point to fixed point;
we are planning to integrate with that.
The compiler currently outputs an
optimized C state machine.
Since it's a modular compiler,
we can have different targets;
tomorrow we can add another target.
We also have the fully ported
MSR iCane application working; we'll demo it.
So we've got the entire thing working.
Thanks to [inaudible] and Don for helping
in debugging all the algorithm implementation.
For future work, we are
planning to investigate the effect specifications further.
Hopefully, we'll get good inputs
from the Microsoft folks regarding
this particular line of investigation. Also IDE enhancements,
and we can have other targets
like the .NET Micro Framework,
LLVM IR, and so on.
Actually, right now we have not implemented
the type checker, so that is
another piece of future work. Thank you.
So now, we will also briefly demo the IDE.
Currently, the board is actually running iCane
ported to our DSL.
So this is the IDE.
>> This is your IDE?
>> Just show the serial.
See, this is a very simple
serial monitor we have built into the IDE.
It shows the sensor output
being transported; we have Bluetooth,
we are just printing. Should we connect?
So anyway, the credit for the algorithm is not ours.
But the thing is, we have ported it into
our DSL and it's working.
We actually have a one-fourth code-size reduction
from C++ to our DSL.
>> And we have the memory analysis part here.
>> [inaudible] Metrics in terms of
actual power reduction or others?
>> No, right now we are
able to do the analysis,
but the optimizations we don't have yet.
Actually, another thing is that any
compiler can give errors
like syntax errors and type errors.
Our compiler can additionally, for example,
because of the effect annotations, detect that
the programmer is making a blocking function call with
an unbounded amount of time.
For example, trying to connect to
a network is a blocking function.
So the compiler can actually give
a warning to the programmer saying,
"This function might take forever."
Those kinds of compile-time feedback we
can give, and probably also suggest some automatic fixes.
>> So.
>> Yeah.
>> So, just trying to understand the demo, right?
So iCane is written in C++?
>> Yeah.
>> And you said that you ported it
to your language alongside it.
So did you reuse some of
that C++ when you implemented your-
>> No. Actually, if you can pull up the ProtoNN code.
So basically, the basic flow
is completely written in our language.
>> Okay.
>> Only the I/O libraries we
are using are C++ libraries.
>> Okay.
>> Like for reading from the serial port and so on.
>> So in your DSL, when you call ProtoNN.
>> Yeah.
>> It wasn't calling the C++?
>> No. It's calling our implementation.
>> [inaudible].
>> I see.
>> So actually, all the modules
listed there are written in our language.
The only library functions we
are using are the I/O libraries.
>> Okay.
>> Like reading from the I2C bus,
the serial port, and so on.
>> So I had another question.
>> Yes.
>> So you mentioned that with your technique,
you can sort of do a better memory allocation?
>> Yes.
>> Right?
>> Yeah.
>> Did you see that affect [inaudible]?
>> Yeah. Actually, I'll just show you.
For this particular application,
we didn't see a big gain with that.
But one thing is that you can see
that this axis is the lifetime,
and this is the memory address.
You can see that many of
the large arrays actually span the entire lifetime;
basically, that means they are persistent in memory.
But here, you can see some memory, like
norm access and featureVector, being re-used.
One thing is that, of course,
whatever optimizations we get are useful,
since we only have a few kilobytes.
The second thing is that it frees the programmer
from worrying about, for example, let's say
he has allocated a local array
and it is passed to the next stage,
whether it is stack-allocated, heap-allocated, or static.
It frees the programmer from
worrying about where to allocate.
So basically, whether it is a local array or a global array,
all the arrays
and matrices and vectors in our language
are allocated in the compiler-managed heap.
That frees the programmer from worrying about it;
that one worry is taken care of.
>> So the ProtoNN code in
your DSL, versus the C++ [inaudible] code?
>> ProtoNN.
>> Roughly speaking. I thought that we could do that-
>> Yeah, we can show it.
Because we have declarative syntax
for arrays, we don't need loops.
We don't- So for example-
>> All this is closer to MATLAB.
>> Exactly.
>> [inaudible]
>> ProtoNN is just [inaudible] of code in
our language, but as in C++-
>> If we also consider the libraries.
For example, we need to use libraries for vectors,
circular buffers, and so on.
So if we consider all that, the C++ code would be-
>> But probably the C++ one
can be faster than MATLAB for sure.
>> Yeah.
>> But, I don't know if it is faster than yours.
>> Yeah, exactly right.
>> Yeah, but you know.
>> So, because ultimately
we generate an optimized C state machine,
we definitely run at bare-metal speeds.
We don't add any overhead.
The idea is to have abstractions
which have zero cost.
>> I have a few questions.
>> Yeah.
>> Wasn't this lifetime analysis?
If you look at escape analysis
as applied to object-oriented languages,
say languages like Java,
where instead of heap allocation,
you see whether the lifetime is
bounded by the method in which the object got
created, in which case it can be stack-allocated,
and otherwise you try to extend it.
>> Yeah.
>> So, how much is your technique
going beyond these established ways
of doing lifetime analysis?
I mean, those typically
look only at the method boundary.
>> Yeah.
>> It looks like you're also going
beyond that in terms of iteration
counts and all kinds of- can you clarify a bit?
>> Yes. Yeah, I'll do that.
>> How much are you going beyond that?
>> Yeah. So, I'll just show you the pipeline.
First of all, in our analysis,
if we take our analysis routine,
this is a pipeline stage.
What we do is, starting with the first stage,
we basically do a whole-program analysis:
we look at what the next stage after the first stage is,
and so on.
Whenever we pass some objects,
we actually do alias analysis,
which matters a lot.
>> Even standard methods will do that, right?
>> Right. Right. Right.
>> If you look at escape analysis
for Java, they will all do it inter-procedurally?
>> Right. Right. Right. But I don't think there is
any allocator today which can
reuse part of an object's memory.
>> I know of techniques developed
30 years ago which were doing these kinds of things.
It looks like you're also doing
some further iteration-level analysis,
but it wasn't very clear.
>> Okay. So, I will explain our algorithm to you.
First of all,
if you look at the escape analysis you mentioned,
you are looking at converting to
stack allocation the objects
used within that function, right?
In our case, we are
doing a whole-program analysis.
We are taking the entire program
and seeing how we can reuse the objects.
So, basically what we do is-
>> Those techniques also do whole-program analysis.
Don't think they only look within
a method, because when you're making other method calls,
you have to determine what's the impact of that method
on any kind of aliasing,
any kind of pointers, and so on that have been set up.
Don't think those techniques don't
do whole-program analysis; they do.
>> Okay.
>> So that can't be your novelty.
>> Okay. So, basically
this is our entire pipeline lifetime.
First of all,
we only look at big objects
like matrix and vector objects, not primitive types.
So, basically what we do is,
let us say this is the address line.
Initially, we create regions,
where each region is a contiguous block of memory.
We allocate a separate region for each object,
and then we iterate and try to optimize and merge regions.
Our algorithm works like that
until we reach a fixed point
where no more optimization is possible.
So, let us say we have another object,
and another object.
This is the entire pipeline,
one cycle of the pipeline.
What we do is, so for example,
let's say that- is there an eraser?
You cannot erase it. But in our case, we can reallocate.
Initially, we start with separate addresses,
and then we see here that
this particular object does not overlap with this one-
the lifetime does not overlap with
this object, so we reuse it.
We merge them into a single region.
So, now that region has one block with
this lifetime and another block with this lifetime.
Now, we have another block here
with a non-overlapping lifetime,
so we again merge it.
Now we have one block with this much lifetime,
and we still have a remaining block, so we
continue iterating like this. And of course,
there could be another region
which is partially merged,
and then we can merge
partially-merged regions until we
reach a fixed point. That is basically our-
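For concreteness, here is a greedy simplification of that region-merging idea in C++, under strong assumptions (one fixed-size block per object; lifetimes given as [start, end] step indices over a single pipeline cycle). The real allocator also reuses parts of blocks and iterates merging to a fixed point; this sketch is illustrative, not the actual implementation.

```cpp
#include <algorithm>
#include <vector>

struct Obj { int start, end, size; int offset = -1; };

// Objects whose lifetimes do not overlap may share a region.
int assignOffsets(std::vector<Obj>& objs) {
    struct Region { int end, offset, size; };
    std::vector<Region> regions;
    std::sort(objs.begin(), objs.end(),
              [](const Obj& a, const Obj& b) { return a.start < b.start; });
    int top = 0;  // next free heap address
    for (auto& o : objs) {
        bool placed = false;
        for (auto& r : regions) {
            // Reuse a region whose occupant has died and which is big enough.
            if (r.end < o.start && r.size >= o.size) {
                o.offset = r.offset;
                r.end = o.end;
                placed = true;
                break;
            }
        }
        if (!placed) {  // no reusable region: open a new one at the top
            o.offset = top;
            regions.push_back({o.end, top, o.size});
            top += o.size;
        }
    }
    return top;  // total bytes the compiler-managed heap needs
}
```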
>> The question was, when you had these library calls,
you had these annotations, right?
Which were specifying the resource and so on.
>> Yes, yes.
>> Is your compiler providing any assistance,
or the language providing
any other assistance in terms of helping come
up with those estimates in the first place?
>> So, right now, no. Basically, the idea was that
if I develop the application program,
I should not be burdened with this.
That was the whole idea. If you
look at our effect annotations,
this needs to be done once. There is no annotation in the-
>> Isn't the programmer coming up with it?
>> No, this is not the application programmer; the library developer does it.
>> Yeah, okay, the library?
But still the programmer-
>> Yeah. No, so once it is done for a library,
the library functions can be
reused in the application program.
For example,
I can make a call to pinPeripheral and so on.
The application programmer
doesn't need to worry about anything;
it'll automatically be pulled from
the effect specification of the library.
It'll pattern-match the library call,
pull the effect specification,
and include it in the analysis.
So, it needs to be done once for a library,
and then it's reused across applications.
That was the- yes.
>> Traditionally, compilers have shied away from doing
inter-procedural whole-program analysis
because that blows up the compilation time.
So what is the compilation time
for your programs? Linear, quadratic?
>> I haven't done an analysis of that.
Of course, some of our algorithms
are definitely non-linear right now.
But one reason we are able to afford it- actually,
if you look at NesC, for example,
NesC also did whole-program analysis,
because typically
the program size is small.
Unlike
a traditional application,
which may run to 100,000 or millions of lines of code,
here the application size is small,
so that is one reason why we are
able to afford whole-program analysis.
Another reason is that,
unlike a traditional program,
an embedded program is
closed over all the important parameters.
For example, allocation sizes and
all those things are specified
somewhere as constants in the program.
Whereas if we look at
a traditional application, for example,
the application may allocate a linked list based
on some user input, or
the size may be based on some user input,
or it may be read from some file.
So, it's not really possible to do
this kind of analysis in a traditional context at all.
But here, because the program is closed,
working in a closed environment,
we are able to do this kind of analysis.
It is closed for two reasons. One, of course,
is the constraints:
you don't have enough memory.
The second is safety.
Actually, there is
a C dialect called MISRA C, from the
Motor Industry Software Reliability Association, for safety purposes.
It disallows all the
dynamic aspects of the C language:
pointer manipulation, dynamic memory allocation.
So the other concern,
beyond memory constraints, is safety.
But anyway, coming back to the point:
because the program size is small,
and because the program is closed, we are able to do this.
>> So, I have a more high-level
question about the impact.
I'm trying to think about me
as a programmer of such a device,
and I'm trying to understand what is
the complex part of me writing this program.
I'm probably not going to write the ProtoNN library.
In fact, somebody at Amazon is probably
going to write an optimized ProtoNN library
and give it to me.
So, I don't need a DSL for that part.
I'm trying to understand where the DSL is
helping me as an application developer.
Is it in encoding new kinds of gestures,
new kinds of algorithms?
So, where is the benefit to me as a programmer? Say,
does it reduce the development time of this kind
of application? I'm not sure.
>> Yeah, yeah. It's definitely a very good question.
Of course, we haven't investigated the full impact yet,
but one area is that, for example,
once we have a library of reusable blocks,
we can build pipelines
using a simple declarative specification.
And it's very easy to change.
For example, instead of ProtoNN,
we may want to use
some other library, or instead of
this particular sensor, like the MPU,
we may want to use some other one. So,
it's very easy to maintain
the application over its life cycle, I feel.
So, if we look at the C++ iCane application,
it's actually a C++ state machine
which calls all of these,
so everything is coupled with everything else;
there is close coupling.
Whereas we have removed the coupling,
and if you look at a particular module,
there are absolutely no references to,
no coupling with, any other module here.
Not only that, the module can decide three things:
what data it processes,
what data is sent to the next stage,
and whether it should send data to the next stage at all.
So, these three things can be decided and
programmed in the module without
coupling to any other existing module.
So, I think it reduces coupling and imposes modularity,
and we can
build pipelines simply by mixing and matching blocks.
I feel there could be some- and another thing is that,
because we have this declarative specification
with pre-built modules
which have predefined meanings, like rate limit,
we can also do this kind of privacy analysis and so on.
>> Yeah, maybe. I realize this is preliminary, but it might be
useful to understand the benefits
of your modularity:
decouple what happened in
the ProtoNN part from
what is happening in the application part,
and then quantify separately
the code benefits and energy benefits,
whatever they are.
Have you managed to decouple those two?
>> Okay.
>> Because ProtoNN is probably going to be written once.
>> Right. Right.
>> The applications are
probably going to be written by many,
many people who want to use your DSL.
>> Exactly. So, I feel that, like if you-
>> But right now both are coupled, so we actually don't know
where [inaudible] practical things.
>> That is true.
>> Okay. Thank you.
>> So, before I wind up,
I want to thank the MSR folks for hosting us and
being wonderful hosts for all of us. Thank you so much.